Saturday, June 28, 2014

Overcoming DNA test errors.

I'm investigating my assumptions here and will follow up soon.

I recently learned that sequencing machines have a 1% error rate.  That means that out of the 700,000 or so SNPs identified in the Ancestry.com DNA tests, about 7,000 are incorrectly identified, unless the testers have taken steps to validate inconsistencies or the published error rates are out of date.  Even a 0.01% error rate would mean 70 errors, or about 1.5 for each of the 46 chromosomes.
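The arithmetic is simple enough to check in a few lines.  This is only a sketch: the function name is made up for illustration, and the SNP count and error rates are the rough figures quoted above, not numbers taken from Ancestry.

```python
def expected_errors(num_snps, error_rate):
    """Expected number of miscalled SNPs for a given per-SNP error rate."""
    return num_snps * error_rate


NUM_SNPS = 700_000   # approximate SNP count quoted in the post
CHROMOSOMES = 46     # 23 pairs

for rate in (0.01, 0.0001):   # 1% and 0.01%
    errors = expected_errors(NUM_SNPS, rate)
    print(f"{rate:.2%} error rate: about {errors:,.0f} errors, "
          f"{errors / CHROMOSOMES:.1f} per chromosome")
```

This assumes errors are spread evenly across the tested sites, which may not hold for a real machine.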

Right now tested people have no way to challenge obvious errors and get them corrected.  This calls into question my supposition about how the moderate, low, and very low confidence levels are determined.  I may have to actually look at my sister's and parents' DNA results.

On some of the matches with hints, meaning both tested individuals are identified in trees and both trees mention the same individual as an ancestor of the tested individual, the confidence levels for my sister and me are higher than for my mother.  That doesn't make sense.  My relationship to my cousins passes through one of my parents, so the confidence level for my mother should never be lower than mine once we're looking at a single chromosome to identify an individual as a cousin.

While there is a non-zero probability that my sister and I share exactly the same mutation, one that undoes a mutation away from the ancestor's sequence, it's much more likely that our mom's DNA test has a sequencing error and the SNP is misidentified in her test.
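One way to hunt for such errors would be to compare the raw files directly.  The sketch below is my own assumption about how to do that, not anything Ancestry provides: at each site a child must carry at least one allele found in the parent, so a site where both children disagree with the same parent points at the parent's call.  The function names, site names, and genotypes are invented for illustration.

```python
def is_consistent(parent, child):
    """True if the child could have inherited one allele from the parent.

    Genotypes are (allele1, allele2) tuples.  A "0" entry is treated as
    uninformative here (an assumption) rather than counted as a conflict.
    """
    if "0" in parent or "0" in child:
        return True
    return any(allele in parent for allele in child)


def suspect_sites(parent_calls, child_calls):
    """Return sorted site IDs where parent and child cannot both be right."""
    shared = parent_calls.keys() & child_calls.keys()
    return sorted(site for site in shared
                  if not is_consistent(parent_calls[site], child_calls[site]))


# Invented genotypes, keyed by placeholder site names.
mother = {"site1": ("A", "A"), "site2": ("C", "T"), "site3": ("G", "G")}
me     = {"site1": ("G", "G"), "site2": ("C", "C"), "site3": ("0", "0")}
sister = {"site1": ("G", "G"), "site2": ("T", "T"), "site3": ("G", "G")}

# Sites where both children conflict with the mother's call.
both = set(suspect_sites(mother, me)) & set(suspect_sites(mother, sister))
print(sorted(both))
```

A conflict at one child only could be an error in either file; a conflict shared by both children makes the parent's call the likelier culprit.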

Thursday, June 26, 2014

On supporting ancestry trees with DNA tests.

The following has some errors.  I will try to explain in a later post.

It has been a while.  I grew tired of programming at work then coming home and programming some more.  I've been working on my genealogy for a while now.  I'd like to bounce some ideas off you.

When I had my DNA tested by Ancestry.com and then extracted the results as a text file, I found it began with some documentation, ending with the column description and the start of the data shown below:
#Genetic data is provided below as five TAB delimited columns.  Each line
#corresponds to a SNP.  Column one provides the SNP identifier (rsID where
#possible).  Columns two and three contain the chromosome and basepair position
#of the SNP using human reference build 37.1 coordinates.  Columns four and five
#contain the two alleles observed at this SNP (genotype).  The genotype is reported
#on the forward (+) strand with respect to the human reference.
rsid    chromosome    position    allele1    allele2
rs4477212    1    82154    T    T
rs3131972    1    752721    A    G
rs12562034    1    768448    G    G
...
rs6517463    21    39752673    0    0
On average each generation has a single mutation across the 23 pairs.  Most of the mutations will happen in the non-coding regions so will have no impact on the SNPs mentioned above.  Most mutations will be transitions, between A and G or C and T, rather than transversions.  Deletions and insertions are very rare.  The allele position is based upon alignment against the model so deletions are indicated by a 0.  I don't know how insertions are handled, maybe a repetition of rsID and position.
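For anyone who wants to poke at their own file, here is a minimal sketch of reading that layout.  It assumes the file really is tab-delimited with the header row shown above; the function name and the short in-memory sample are mine, with the sample rows copied from the excerpt.

```python
import csv
import io


def read_genotypes(lines):
    """Parse raw-data lines into {rsid: (chromosome, position, allele1, allele2)}."""
    # Skip blank lines and the "#" documentation lines before the header row.
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row["rsid"]: (row["chromosome"], int(row["position"]),
                          row["allele1"], row["allele2"])
            for row in reader}


sample = io.StringIO(
    "#Genetic data is provided below as five TAB delimited columns.\n"
    "rsid\tchromosome\tposition\tallele1\tallele2\n"
    "rs4477212\t1\t82154\tT\tT\n"
    "rs3131972\t1\t752721\tA\tG\n"
    "rs6517463\t21\t39752673\t0\t0\n"
)
genotypes = read_genotypes(sample)

# Sites reported with a "0" allele (read above as a deletion).
zero_calls = [rsid for rsid, (_, _, a1, a2) in genotypes.items()
              if "0" in (a1, a2)]
print(len(genotypes), zero_calls)
```

With a real export, `open(path)` would stand in for the `io.StringIO` sample.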

There were a bit over 700,000 lines for 22 chromosomes.  Chromosomes 23, 24, and 25 represent the X-unique, Y-unique, and shared X/Y alleles, though not necessarily in that order.  That's about 30,000 identified bases per chromosome.  Ancestry selected the sites tested because they represent common variations within the human population.  4 to the 30,000 power seems large enough that no two individuals should ever have identical chromosomes, but that doesn't really make sense given the way we come by them.
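To get a feel for how large 4 to the 30,000 power is, the number of decimal digits can be computed from a logarithm without ever building the number.  This is just a sketch using the rounded 30,000 figure above; the function name is made up.

```python
import math


def decimal_digits_of_power(base, exponent):
    """Number of decimal digits in base**exponent, computed via log10."""
    return math.floor(exponent * math.log10(base)) + 1


SITES_PER_CHROMOSOME = 30_000   # rounded figure from the post
digits = decimal_digits_of_power(4, SITES_PER_CHROMOSOME)
print(f"4**{SITES_PER_CHROMOSOME} has {digits:,} decimal digits")
```

The count treats every site as free to take any of the four bases independently, which is exactly the assumption the paragraph above goes on to doubt.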

Except for very rare cases, one shares 23 chromosomes with one's father and 23 with one's mother.  Even if both parents had completely unique chromosomes, I will share 50% with each.  My sister also shares 50% of her chromosomes with each, but not the same ones.  At the very least, I get the Y from my father's father where my sister gets one of my father's mother's X chromosomes.  All sisters will share the same X chromosome from their paternal grandmother.

Most of the time siblings will share about 12 chromosomes with each other.  The number can go up or down following a normal distribution.  Cousins will share, again following a normal distribution, around 5 chromosomes, and second cousins will share around 2 chromosomes.  It is highly unlikely there will be any point mutations within the documented sites within 4 generations.

Without knowing Ancestry's algorithm, I am assuming confidence level 95% means one unmodified chromosome, moderate means one mutation, low means two mutations, and very low means three mutations.

I have 16 unique great-great-grandparents.  I only have 30 unique 3rd great grandparents.  It's likely I don't share any DNA with some of them, at least through them.  Who knows, I may share 2 chromosomes with a few of them.  There's no way I can share DNA through all of my 4th great grandparents because I only have 46 chromosomes to share amongst them.
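The counting argument can be sketched as below.  It leans on the same simplification I've been using, that chromosomes are handed down whole, and the function name and labels are my own.

```python
def ancestor_slots(generations_back):
    """Nominal number of ancestors n generations back (parents = 1)."""
    return 2 ** generations_back


CHROMOSOMES = 46

labels = {
    4: "great-great-grandparents",
    5: "3rd great grandparents",
    6: "4th great grandparents",
}
for generation, label in labels.items():
    slots = ancestor_slots(generation)
    verdict = "more than" if slots > CHROMOSOMES else "no more than"
    print(f"{label}: {slots} slots, {verdict} {CHROMOSOMES} chromosomes")
```

These are nominal slots; shared ancestors reduce the unique count, which is how 32 slots become my 30 unique 3rd great grandparents.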

I happen to have access to my sister's, my two parents', and my own DNA tests.  Just from the DNA hints on ancestry.com I can say that when a parent's confidence level in relationship to an individual is higher than mine, and both are no more than 95%, then a point mutation at a tested site has happened.  When several cousins share the same confidence level for the very same portion of the tree, no mutation has happened and the same chromosome is being shared.  When the confidence drops, the mutation happened in the unshared region of the tree.