Genealogy Math Problem

By Patrick   2012 Mar 20, 7:07am   1,986 views   7 comments

I've been trying to figure out how all the Kill*leas (Killalea, Killelea, or Killilea) of the world are related, and so far I've gotten 7 Kill*lea men of unknown relation to do the standard Y-chromosome test. The results are here:

Six of them look to be pretty closely related, and are all probably in haplogroup R1b. But how many generations removed are they from their common ancestor? Most time-to-most-recent-common-ancestor (TMRCA) calculators look only at the number of mismatches and don't take into account how likely or unlikely a mismatch is at each locus. The importance of any particular match or mismatch varies enormously. Fortunately, I have some good statistics on the frequency distribution of locus values, here:

So you can see that the fact that the first six men all have a value of 12 for locus 438 means very little, because 94% of men in haplogroup R1b have a value of 12 for locus 438.

Conversely, the fact that 4 of the first six men have a value of 13 for locus GATA-H4 is very important and seems to show descent from a recent common ancestor, because very few men have a value of 13.

So how do you combine probability distributions with the results to get a meaningful measurement of difference between the men, and ultimately a time to most recent common ancestor?
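One simple way to make the "rare match matters more" idea concrete is to score each shared value by a log-likelihood ratio. The sketch below is only an illustration under strong assumptions: relatives are assumed to always share a value (no mutation), unrelated men are assumed to share it by chance with probability equal to its population frequency, and loci are treated as independent. The 94% figure for locus 438 is from the post; the 2% figure for GATA-H4 is a made-up placeholder, and the function names are hypothetical.

```python
import math

def match_weight(freq):
    """Log2 likelihood ratio (in bits) that a shared value reflects common descent.

    Assumes relatives always share the value (no mutation) and unrelated
    men share it by chance with probability `freq`.
    """
    if not 0.0 < freq <= 1.0:
        raise ValueError("freq must be in (0, 1]")
    return -math.log2(freq)

def total_weight(shared_freqs):
    """Sum the per-locus weights, treating loci as independent."""
    return sum(match_weight(f) for f in shared_freqs)

common = match_weight(0.94)  # locus 438 = 12: 94% of R1b men (from the post)
rare = match_weight(0.02)    # GATA-H4 = 13: HYPOTHETICAL frequency, not from the post
print(f"438: {common:.3f} bits   GATA-H4: {rare:.3f} bits")
```

A match on a value that nearly everyone carries contributes almost nothing, while a match on a rare value contributes several bits. This gives a weighted similarity score, not a TMRCA; turning it into generations still needs a mutation-rate model.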

Comments 1-7 of 7     Last »

1   PockyClipsNow   60/60 = 100% civil   2012 Mar 20, 7:31am

Maybe it's time to get a day job....

2   Patrick   1844/1844 = 100% civil   2012 Mar 20, 8:01am

It was time to get a day job two years ago, but I'm still having too much fun. The only thing I lack is money.

3   leo707     2012 Mar 20, 8:19am

I have no idea, but it looks like an awesome project.

4   NorCalBear     2012 Mar 20, 10:50am

For distance estimation with known population-level frequencies and individual alleles, you need to look at Bayesian priors:

Calibrating the "time" component in the TMRCA will still require converting "step events" (an STR changing from length X to length X+1 or X-1) into how often those take place over time, the so-called molecular clock.
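As a rough illustration of how a prior and a molecular clock fit together, here is a minimal sketch of a grid posterior over the number of generations to the common ancestor of two men. It is a simplified model, not the method of any particular calculator: it assumes a flat prior over generations, one shared mutation rate for every locus, and that a locus still matches only if neither lineage has mutated. The marker count, mismatch count, and mutation rate in the usage lines are hypothetical placeholders.

```python
import math

def tmrca_posterior(n_loci, n_mismatch, mu, max_gen=200):
    """Posterior over generations g = 1..max_gen, flat prior on g.

    A locus still matches after g generations on both lineages with
    probability (1 - mu) ** (2 * g); mismatches are binomial given g.
    """
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must be in (0, 1)")
    n_match = n_loci - n_mismatch
    log_like = []
    for g in range(1, max_gen + 1):
        p_match = (1.0 - mu) ** (2 * g)
        ll = n_match * math.log(p_match)
        if n_mismatch:
            ll += n_mismatch * math.log(1.0 - p_match)
        log_like.append(ll)
    peak = max(log_like)  # subtract the peak before exponentiating for stability
    weights = [math.exp(ll - peak) for ll in log_like]
    total = sum(weights)
    return [w / total for w in weights]

# HYPOTHETICAL inputs: 37 markers, 2 mismatches, mu = 0.002 per locus per generation
posterior = tmrca_posterior(n_loci=37, n_mismatch=2, mu=0.002)
mode = posterior.index(max(posterior)) + 1  # grid starts at g = 1
print("most probable number of generations:", mode)
```

Per-locus mutation rates or a prior informed by the locus frequency tables could replace the flat assumptions here; the grid structure stays the same.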

You should be able to estimate the events to MRCA by calculating the most parsimonious ancestor (least number of steps to all extant Kill*lea samples). Then, comparing the MRCA with each Kill*lea sample, you can get mean and variance estimators for "steps to MRCA".
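A minimal sketch of that calculation, assuming a single-step mutation model and treating the samples as independent lines from one ancestor: under those assumptions the per-locus median is a value that minimizes the total number of steps to all samples. The sample names, the third locus, and all marker values below are hypothetical placeholders, not the actual test results.

```python
from statistics import mean, median_low, variance

def parsimonious_ancestor(samples):
    """Per-locus median haplotype: minimizes total absolute steps to all samples."""
    loci = next(iter(samples.values())).keys()
    return {locus: median_low([hap[locus] for hap in samples.values()])
            for locus in loci}

def steps_to_ancestor(samples, ancestor):
    """Count repeat-length steps between the ancestor and each sample."""
    return {name: sum(abs(hap[locus] - ancestor[locus]) for locus in ancestor)
            for name, hap in samples.items()}

# HYPOTHETICAL haplotypes (repeat counts per locus)
samples = {
    "A": {"438": 12, "GATA-H4": 13, "locus_X": 24},
    "B": {"438": 12, "GATA-H4": 13, "locus_X": 25},
    "C": {"438": 12, "GATA-H4": 12, "locus_X": 24},
}
ancestor = parsimonious_ancestor(samples)
steps = steps_to_ancestor(samples, ancestor)
print(ancestor, steps)
print("mean steps:", mean(steps.values()), "variance:", variance(steps.values()))
```

With more samples and a real genealogy, a tree-based parsimony method would be more appropriate than this star-shaped shortcut.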

There should be published estimators for the mean and variance for STS events per unit time.

The TMRCA estimate is then biased (lower) since:
1) the small number of samples might not have found all the Kill*lea variation
2) the most parsimonious ancestor is the shortest path

You can combine the variance on the steps to MRCA with the variance on STS events per unit time to get upper and lower confidence bounds on the TMRCA estimate.
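One way to sketch that combination is first-order error propagation: divide the mean steps to MRCA by the expected steps per generation, then add the relative variances of the two inputs. This is an approximation under stated assumptions (independent errors, roughly normal uncertainty, one shared rate per locus), and every number in the usage lines is a hypothetical placeholder rather than a published estimate.

```python
import math

def tmrca_interval(mean_steps, var_mean_steps, n_loci, mu, mu_sd, z=1.96):
    """Return (lower, estimate, upper) generations to the MRCA.

    estimate = mean_steps / (n_loci * mu); the relative variances of the
    step count and of the mutation rate are added (delta method).
    """
    if mean_steps <= 0 or n_loci <= 0 or mu <= 0:
        raise ValueError("mean_steps, n_loci and mu must be positive")
    estimate = mean_steps / (n_loci * mu)
    rel_var = var_mean_steps / mean_steps ** 2 + (mu_sd / mu) ** 2
    half_width = z * estimate * math.sqrt(rel_var)
    return max(0.0, estimate - half_width), estimate, estimate + half_width

# HYPOTHETICAL inputs: 2.0 mean steps (variance of the mean 0.5) over 37 loci,
# mutation rate 0.002 +/- 0.0005 per locus per generation
lower, estimate, upper = tmrca_interval(2.0, 0.5, 37, 0.002, 0.0005)
print(f"TMRCA ~ {estimate:.1f} generations (approx. 95% bounds {lower:.1f} to {upper:.1f})")
```

Because of the downward bias noted above, the bounds from a calculation like this are better read as a floor on the uncertainty than as a firm interval.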

I thought that many of these calculations were being done automagically at places supporting the surname project:


5   Patrick   1844/1844 = 100% civil   2012 Mar 20, 11:03am

NorCalBear says

I thought that many of these calculations were being done automagically at places supporting the surname project:


Thanks for the lead. I'll poke around, since my own understanding of Bayesian statistics is poor.

I know that does have a calculator that takes into account the probability of each locus value, but they do not let you use it unless you pay for their DNA test first. Not very friendly of them.

6   Patrick   1844/1844 = 100% civil   2012 Mar 20, 11:09am

Darn, looks like is just as money-centered, and does not provide any resources to anyone unless you get a test from them. Let me know if I'm wrong.

7   NorCalBear     2012 Mar 20, 2:34pm


I'm the guy who writes the code himself in C++ (or Python, Perl, PHP, whatever; it's been a while for a few others, but I'm not opposed to pushing stack in assembler, given hours of concentration). Perhaps we can take the population genetics discussion offline; you know my contacts.

dnaancestry is likely very $ centered. The web business model is all about trying to figure out a way to make $; we all know that problem.

