I've been trying to figure out how all the Kill*leas (Killalea, Killelea, or Killilea) of the world are related, and so far I've gotten 7 Kill*lea men of unknown relation to take the standard Y-chromosome test. The results are here:

http://patrick.net/killeleas/baile.php#dna

Six of them look to be pretty closely related, and are all probably in haplogroup R1b. But how many generations removed are they from their common ancestor? Most time-to-most-recent-common-ancestor (TMRCA) calculators look only at the number of mismatches and don't take into account how likely or unlikely a mismatch is at that locus. The importance of any particular match or mismatch varies enormously. Fortunately, I have some good statistics on the frequency distribution of locus values, here:

http://patrick.net/killeleas/yfreq.html

So you can see that the fact that the first six men all have a value of 12 for locus 438 means very little, because 94% of men in haplogroup R1b have a value of 12 for locus 438.

Conversely, the fact that 4 of the first six men have a value of 13 for locus GATA-H4 is very important and seems to show descent from a recent common ancestor, because very few men have a value of 13.
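
One rough way to put a number on "means very little" versus "very important" is the surprisal of each value, -log2(frequency). This is only a sketch: the 94% figure is the one quoted above, while the 2% for GATA-H4 = 13 is a made-up placeholder that would have to be replaced by the real figure from the frequency table, and the names `R1B_FREQ` and `surprisal_bits` are hypothetical.

```python
import math

# Share of R1b men carrying each value at each locus.
# 0.94 is the figure quoted in the post; 0.02 is a PLACEHOLDER for
# "very few men" and must be replaced with the real table value.
R1B_FREQ = {
    ("438", 12): 0.94,
    ("GATA-H4", 13): 0.02,  # hypothetical value, for illustration only
}

def surprisal_bits(locus, value, freq=R1B_FREQ):
    """Bits of surprise at seeing this value in a randomly chosen R1b man.

    Common values carry almost no information; rare ones carry a lot.
    """
    return -math.log2(freq[(locus, value)])

for locus, value in R1B_FREQ:
    print(locus, value, round(surprisal_bits(locus, value), 2))
```

With these inputs, a shared 12 at locus 438 is worth about 0.09 bits, while a shared 13 at GATA-H4 would be worth several bits if the placeholder frequency is anywhere near right.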

So how do you combine probability distributions with the results to get a meaningful measurement of difference between the men, and ultimately a time to most recent common ancestor?

Maybe it's time to get a day job....

It was time to get a day job two years ago, but I'm still having too much fun. The only thing I lack is money.

I have no idea, but it looks like an awesome project.

For distance estimation with known population-level frequencies and individual alleles, you need to look at Bayesian priors:

http://en.wikipedia.org/wiki/Posterior_probability_distribution
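
As a minimal sketch of what a posterior looks like here, the following computes a grid posterior over the number of generations back to the common ancestor of two men. Everything numeric is an assumption for illustration: a flat prior, a single mutation rate `mu` shared by all loci, independent loci, no back mutations, and a made-up 37-locus comparison with 2 mismatches. The function name `tmrca_posterior` is hypothetical.

```python
def tmrca_posterior(n_loci, n_mismatch, mu=0.002, max_gen=200):
    """Posterior over generations g to the common ancestor of two men.

    Assumptions (all simplifications): flat prior on g = 1..max_gen,
    one mutation rate mu per locus per generation (placeholder value),
    independent loci, back mutations ignored.
    Returns a list where index i holds P(g = i + 1 | data).
    """
    weights = []
    for g in range(1, max_gen + 1):
        # Two men g generations below the ancestor are separated by
        # 2 * g father-to-son transmissions.
        p_same = (1.0 - mu) ** (2 * g)
        p_diff = 1.0 - p_same
        # Binomial likelihood; the binomial coefficient does not depend
        # on g, so it cancels when normalizing.
        weights.append(p_diff ** n_mismatch * p_same ** (n_loci - n_mismatch))
    total = sum(weights)
    return [w / total for w in weights]

post = tmrca_posterior(n_loci=37, n_mismatch=2)
most_likely_g = max(range(len(post)), key=post.__getitem__) + 1
print("most likely number of generations:", most_likely_g)
```

Per-locus mutation rates or allele frequencies could replace the single `mu`, at the cost of a product over loci instead of one binomial term.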

Calibrating the "time" component of the TMRCA will still require converting "step events" (an STR changing from length X to length X+1 or X-1) into how often those events take place over time, the so-called molecular clock.

You should be able to estimate the events to MRCA by calculating the most parsimonious ancestor (least number of steps to all extant Kill*lea samples). Then, comparing the MRCA with each Kill*lea sample, you can get mean and variance estimates for "steps to MRCA".
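
A rough sketch of that step count: if each locus is treated independently and the samples as a simple star around the ancestor (a simplification of full tree parsimony), the per-locus median is the value that minimizes the total number of single-repeat steps. The haplotypes below are made up for illustration, and the function names are hypothetical.

```python
from statistics import mean, median_low, variance

# MADE-UP repeat counts for three loci in four men, for illustration only.
samples = {
    "man1": {"DYS19": 14, "DYS390": 24, "GATA-H4": 13},
    "man2": {"DYS19": 14, "DYS390": 24, "GATA-H4": 13},
    "man3": {"DYS19": 14, "DYS390": 25, "GATA-H4": 12},
    "man4": {"DYS19": 15, "DYS390": 24, "GATA-H4": 13},
}

def parsimonious_ancestor(samples):
    """Per-locus median haplotype: minimizes the summed absolute step
    distance to all samples under a star-shaped, stepwise model."""
    loci = next(iter(samples.values())).keys()
    return {
        locus: median_low(h[locus] for h in samples.values())
        for locus in loci
    }

def steps_to_ancestor(samples, ancestor):
    """Number of single-repeat steps separating each man from the ancestor."""
    return {
        name: sum(abs(h[locus] - ancestor[locus]) for locus in ancestor)
        for name, h in samples.items()
    }

ancestor = parsimonious_ancestor(samples)
steps = steps_to_ancestor(samples, ancestor)
print("ancestor:", ancestor)
print("steps:", steps)
print("mean:", mean(steps.values()), "variance:", variance(steps.values()))
```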

There should be published estimators for the mean and variance of STR events per unit time.

The TMRCA estimate is then biased low, since:

1) with a small number of samples, you might not have found all the Kill*lea variation

2) the most parsimonious ancestor is, by construction, the shortest path

You can combine the variance of the steps to MRCA with the variance of STR events per unit time to get upper and lower confidence bounds on the TMRCA estimate.
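
One crude way to combine the two uncertainties is to pair the low end of the step count with the high end of the mutation rate, and vice versa. The sketch below does that; it is conservative rather than a proper joint interval, and every number in the example call (0.75 mean steps, 37 loci, a rate of 0.002 per locus per generation, 30 years per generation) is a placeholder, not a published figure. The name `tmrca_bounds` is hypothetical.

```python
def tmrca_bounds(mean_steps, se_steps, n_loci, mu, se_mu,
                 years_per_gen=30, z=1.96):
    """Return (low, point, high) TMRCA estimates in years.

    Expected steps on one lineage over g generations is n_loci * mu * g,
    so g is estimated as mean_steps / (n_loci * mu).
    The bounds pair low steps with a high rate and high steps with a
    low rate, which is a conservative simplification.
    """
    steps_lo = max(mean_steps - z * se_steps, 0.0)
    steps_hi = mean_steps + z * se_steps
    mu_lo = max(mu - z * se_mu, 1e-9)  # keep the rate strictly positive
    mu_hi = mu + z * se_mu

    def generations(steps, rate):
        return steps / (n_loci * rate)

    low = generations(steps_lo, mu_hi) * years_per_gen
    point = generations(mean_steps, mu) * years_per_gen
    high = generations(steps_hi, mu_lo) * years_per_gen
    return low, point, high

# Placeholder inputs, for illustration only.
low, point, high = tmrca_bounds(mean_steps=0.75, se_steps=0.3,
                                n_loci=37, mu=0.002, se_mu=0.0005)
print(f"TMRCA roughly {point:.0f} years (between {low:.0f} and {high:.0f})")
```

Because of the downward bias noted in the list above, even the upper bound from a sketch like this should be read as optimistic.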

I thought that many of these calculations were being done automagically at places supporting the surname project:

http://www.dnaancestryproject.com/ydna_intro_surname.php

no?

NorCalBear, thanks for the lead. I'll poke around dnaancestryproject.com, since my own understanding of Bayesian statistics is poor.

I know that familytreedna.com does have a calculator that takes into account the probability of each locus value, but they do not let you use it unless you pay for their DNA test first. Not very friendly of them.

Darn, looks like dnaancestry.com is just as money-centered, and does not provide any resources to anyone unless you get a test from them. Let me know if I'm wrong.

P,

I'm the guy who writes the code himself in C++ (or Python, Perl, PHP, whatever; it's been a while for a few of the others, but I'm not opposed to pushing the stack in assembler, given hours of concentration). Perhaps we can take the population genetics discussion offline; you know my contacts.

dnaancestry is likely very $ centered. The whole web business model is about trying to figure out a way to make $; we all know that problem.

-A(Norcalbear)