Genealogy Math Problem


By Patrick   Tue, 20 Mar 2012, 2:07pm   1,220 views   7 comments

I've been trying to figure out how all the Kill*leas (Killalea, Killelea, or Killilea) of the world are related, and so far I have gotten 7 Kill*lea men of unknown relation to take the standard Y-chromosome test. The results are here:

http://patrick.net/killeleas/baile.php#dna

Six of them look to be pretty closely related, and are all probably in haplogroup R1b. But how many generations removed are they from their common ancestor? Most time-to-most-recent-common-ancestor (TMRCA) calculators look only at the number of mismatches, and don't take into account how likely or unlikely a mismatch is at that particular "locus". The importance of any particular match or mismatch varies enormously. Fortunately, I have some good statistics on the frequency distribution of locus values, here:

http://patrick.net/killeleas/yfreq.html

So you can see that the fact that the first six men all have a value of 12 for locus 438 means very little, because 94% of men in haplogroup R1b have a value of 12 for locus 438.

Conversely, the fact that 4 of the first six men have a value of 13 for locus GATA-H4 is very important and seems to show descent from a recent common ancestor, because very few men have a value of 13.
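
One possible way to put a number on "means very little" versus "very important" is to score a shared value by how surprising it would be for a randomly chosen R1b man to carry it, i.e. the negative log of its population frequency. This is only a sketch of the idea, not what any existing calculator is known to do. The 0.94 figure is the one quoted above for locus 438; the 0.02 figure is a made-up stand-in for "very few men", and the function name is hypothetical.

```python
import math

def match_weight_bits(allele_frequency: float) -> float:
    """Surprisal, in bits, of a random man carrying a value with this frequency."""
    if not 0.0 < allele_frequency <= 1.0:
        raise ValueError("frequency must be in (0, 1]")
    return -math.log2(allele_frequency)

# Frequency quoted in the post: 94% of R1b men have 12 at locus 438.
common = match_weight_bits(0.94)

# HYPOTHETICAL frequency standing in for a rare value such as GATA-H4 = 13.
rare = match_weight_bits(0.02)

print(f"common match: {common:.3f} bits, rare match: {rare:.3f} bits")
```

A match on a value almost everyone has scores close to zero, while a match on a rare value scores several bits, so summing these weights over loci would let rare shared values dominate the comparison.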

So how do you combine probability distributions with the results to get a meaningful measurement of difference between the men, and ultimately a time to most recent common ancestor?


  1. PockyClipsNow


    2:31pm Tue 20 Mar 2012

    Maybe it's time to get a day job....

  2. Patrick


    3:01pm Tue 20 Mar 2012

    It was time to get a day job two years ago, but I'm still having too much fun. The only thing I lack is money.

  3. leo707


    3:19pm Tue 20 Mar 2012

    I have no idea, but it looks like an awesome project.

  4. NorCalBear


    5:50pm Tue 20 Mar 2012

    for distance estimation with known population level frequencies and individual alleles, you need to look at Bayesian priors:

    http://en.wikipedia.org/wiki/Posterior_probability_distribution

    calibrating the "time" component of the TMRCA will still require converting "step events" (an STR changing from length X to length X+1 or X-1) into a rate at which they take place over time, the so-called molecular clock

    You should be able to estimate the events to MRCA by calculating the most parsimonious ancestor (least number of steps to all extant Kill*lea samples); then, comparing the MRCA with each Kill*lea sample, you can get mean and variance estimates for "steps to MRCA"
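
A minimal sketch of that step, assuming a simple stepwise model in which every mutation changes a repeat count by one and all samples descend independently from the ancestor. Under those assumptions the per-locus median is a value that minimizes the total number of steps to all samples. The haplotypes below are invented for illustration (they are not the real test results), and the function names and the DYS390 column are hypothetical.

```python
from statistics import mean, median_low, variance

def parsimonious_ancestor(haplotypes):
    """Per-locus median: minimizes the summed single-repeat steps to all samples."""
    loci = haplotypes[0].keys()
    return {locus: median_low([h[locus] for h in haplotypes]) for locus in loci}

def steps_to_ancestor(haplotype, ancestor):
    """Number of one-repeat steps separating a sample from the inferred ancestor."""
    return sum(abs(haplotype[locus] - ancestor[locus]) for locus in ancestor)

# HYPOTHETICAL repeat counts, for illustration only.
samples = [
    {"DYS438": 12, "GATA-H4": 13, "DYS390": 24},
    {"DYS438": 12, "GATA-H4": 13, "DYS390": 25},
    {"DYS438": 12, "GATA-H4": 12, "DYS390": 24},
    {"DYS438": 12, "GATA-H4": 13, "DYS390": 24},
]

ancestor = parsimonious_ancestor(samples)
steps = [steps_to_ancestor(h, ancestor) for h in samples]
mean_steps = mean(steps)
var_steps = variance(steps)  # sample variance across the lineages

print(ancestor, steps, mean_steps, var_steps)
```

`median_low` is used so the inferred ancestral value is always a repeat count actually observed in one of the samples.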

    There should be published estimators for the mean and variance of STR step events per unit time.

    the TMRCA estimate is then biased low, since:
    1) small number of samples, might not have found all the Kill*lea variation
    2) most parsimonious ancestor is shortest path

    you can use the variance on the steps to MRCA together with the variance on STR step events per unit time to get upper and lower confidence bounds on the TMRCA estimate
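
A rough sketch of how the steps and a mutation rate might be combined into a TMRCA with bounds, in the Bayesian spirit mentioned above. It assumes the total number of steps is Poisson-distributed with mean mu * loci * lineages * generations, a flat prior on generations, no back-mutations, and the same number of generations on every lineage. The mutation rate, locus count, and step count below are placeholders rather than measured values, and the function names are hypothetical.

```python
import math

def generations_posterior(total_steps, n_loci, n_lineages, mu, max_generations=2000):
    """Posterior over generations g = 1..max_generations (flat prior, Poisson steps)."""
    log_weights = []
    for g in range(1, max_generations + 1):
        lam = mu * n_loci * n_lineages * g  # expected steps summed over all lineages
        # Poisson log-likelihood up to a constant that does not depend on g.
        log_weights.append(total_steps * math.log(lam) - lam)
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

def credible_interval(posterior, mass=0.95):
    """Equal-tailed interval (in generations) holding the requested posterior mass."""
    lower_tail = (1.0 - mass) / 2.0
    upper_tail = 1.0 - lower_tail
    cumulative, low, high = 0.0, None, None
    for g, p in enumerate(posterior, start=1):
        cumulative += p
        if low is None and cumulative >= lower_tail:
            low = g
        if high is None and cumulative >= upper_tail:
            high = g
            break
    return low, high

# PLACEHOLDER inputs: 6 steps in total, 37 loci, 6 men, 0.002 mutations/locus/generation.
posterior = generations_posterior(total_steps=6, n_loci=37, n_lineages=6, mu=0.002)
mode = max(range(len(posterior)), key=posterior.__getitem__) + 1
low, high = credible_interval(posterior)

print(f"most likely generations: {mode}, 95% interval: {low} to {high}")
```

With these assumptions the posterior peaks near total_steps / (mu * n_loci * n_lineages). Because of the two biases listed above, the real figure would tend to be larger than what this gives.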

    I thought that many of these calculations were being done automagically at places supporting the surname project:

    http://www.dnaancestryproject.com/ydna_intro_surname.php

    no?

  5. Patrick


    6:03pm Tue 20 Mar 2012

    NorCalBear says

    I thought that many of these calculations were being done automagically at places supporting the surname project:

    http://www.dnaancestryproject.com/ydna_intro_surname.php

    no?

    Thanks for the lead. I'll poke around dnaancestryproject.com, since my own understanding of Bayesian statistics is poor.

    I know that familytreedna.com does have a calculator that takes into account the probability of each locus value, but they do not let you use it unless you pay for their DNA test first. Not very friendly of them.

  6. Patrick


    6:09pm Tue 20 Mar 2012

    Darn, looks like dnaancestryproject.com is just as money-centered, and does not provide any resources to anyone who hasn't bought a test from them. Let me know if I'm wrong.

  7. NorCalBear


    9:34pm Tue 20 Mar 2012

    P,

    I'm the guy who writes the code himself in C++ (or Python, Perl, PHP, whatever; it's been a while for a few of those, but I'm not opposed to pushing the stack in assembler, given hours of concentration). Perhaps we can take the population genetics discussion offline; you know my contacts.

    dnaancestry is likely very $ centered, the web business model is all trying to figure a way to make $, we all know that problem

    -A(Norcalbear)

Premium member Patrick is moderator of this thread.
