Tuesday, 2 February 2016

A second study tracking DNA segments through time and space

I recently had my first confirmed match with a known genealogical cousin at AncestryDNA. This was also my first ever shaky leaf hint at AncestryDNA, which was very exciting, and it was good to see how the system worked in practice. In this case the hint was spot on and correctly predicted that we are related through our mutual great-great-grandparents Charles James Wiggins and Mary Ann Thorn. Charles Wiggins was born in 1828 in Clapham in South London. Mary Ann Thorn was born in Colchester, Essex. They married in 1848 in Lewisham, London, and went on to have 11 children, though their youngest daughter Catherine died in infancy.
My cousin has now transferred her data to Family Tree DNA which has allowed me to do some comparisons with my other family members who have tested there. One of the interesting aspects of autosomal DNA testing is that it allows us to study the process of inheritance, and to see how the segments are passed on from one generation to the next. (Unfortunately this is not possible at AncestryDNA because they do not provide us with a chromosome browser.)

I did a previous exercise in tracking segments through the generations when I got my first confirmed match with a fourth cousin at Family Tree DNA, and I thought it would be interesting to repeat the exercise now that I have a second confirmed match with a cousin.

The first chromosome browser view below is a comparison between my dad and our new genetic cousin. They are second cousins once removed. For the first three comparisons I've screened out the segments under 5 cMs in size. The segment on chromosome is only 6.54 cMs and the one on chromosome 16 is 7.94 cMs. Small segments are often false positives but I'll leave them in here for the purposes of this exercise.


The second chromosome browser view is a comparison between me and my genetic cousin. We are third cousins. As can be seen I've inherited all four of the large segments from my dad and one of the two smaller segments.


The third chromosome browser view is a comparison between my son and the cousin. They are third cousins once removed. Here you can see that in just one generation four of the five segments I inherited from Charles James Wiggins and Mary Ann Thorn have been lost completely.



I thought it would also be interesting to take a look at the smaller segments. The chromosome browser view below is from the perspective of my son, and shows the segments that he shares with his third cousin once removed (blue) and with his maternal grandfather (orange). Any segment that my son has received from Charles James Wiggins and Mary Ann Thorn must have been inherited from his maternal grandfather. However, we can see here that on chromosomes one and two there are floating segments that do not line up with the segments he's inherited from his maternal grandfather. The other segments do at least match in the right place but I suspect they would probably all disappear with phasing.

 My son's maternal grandmother has also been tested at Family Tree DNA so let's have a look now at what happens when I do a comparison between my son and his maternal grandparents. His maternal grandfather is shown in orange and his maternal grandmother in blue in the view below. You can see how the segments from his two grandparents fit together like large teeth. There's a slight oddity on chromosome 13 where a stray small false segment seems to have crept in. However, the basic principle remains that segments are generally passed on in large chunks and are not broken down into a myriad of tiny segments in the noisy pattern that we see in the chromosome browser view above.


This exercise also gives me the opportunity to compare the centiMorgan count at both AncestryDNA and Family Tree DNA. I can only do comparisons between myself and my third cousin as we are the only ones who are in both databases. Ancestry phase the data before calculating the matches and do not report segments under 5 cMs. Family Tree DNA do not phase the data and report segments right down to 1 cM in size.

According to AncestryDNA I share 110 cMs spread across 8 segments. There appears to be something involved in the process of phasing which breaks large segments into smaller units which explains the higher than expected segment count at AncestryDNA. At Family Tree DNA I share 144.49 cMs over 14 segments with my third cousin. If I subtract all the segments under 5 cMs this brings the count down to 124.28 cMs. If I remove the 6.54 segment from the calculations the count comes down to 117.74 which is pretty close to the figure from AncestryDNA. If the AncestryDNA phasing is correct then it seems likely that the 6.54 cM segment is a false positive.
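
The subtraction above can be re-run in a few lines of Python. This is only a sketch that uses the totals quoted in this post; the constant and function names are illustrative, and the individual segment sizes (apart from the 6.54 cM segment) are not listed here, so only the totals are compared.

```python
# Totals quoted in the post (all values in cM).
FTDNA_TOTAL = 144.49       # all 14 segments reported by Family Tree DNA
FTDNA_OVER_5_CM = 124.28   # total once segments under 5 cM are dropped
SUSPECT_SEGMENT = 6.54     # the small segment suspected of being a false positive
ANCESTRY_TOTAL = 110.0     # AncestryDNA total (phased, nothing under 5 cM reported)


def remaining_after(total_cm, *removed_cm):
    """Return what is left of total_cm after removing the given amounts."""
    return round(total_cm - sum(removed_cm), 2)


# cM carried by the FTDNA segments under 5 cM
under_5_cm = remaining_after(FTDNA_TOTAL, FTDNA_OVER_5_CM)
# FTDNA total without the small segments and without the 6.54 cM segment
without_suspect = remaining_after(FTDNA_OVER_5_CM, SUSPECT_SEGMENT)
# difference that still remains between the two companies
gap_to_ancestry = remaining_after(without_suspect, ANCESTRY_TOTAL)

print(under_5_cm, without_suspect, gap_to_ancestry)
```

The remaining gap is small compared with the 144.49 cM starting point, which is the sense in which the two companies' figures are "pretty close".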

Exercises like this help us to understand the inheritance process and I would be very interested to see similar studies from other genetic genealogists.

Update
The blog post Behind the new AncestryDNA feature: amount of shared DNA by Anna Swayne explains how the company's phasing and Timber algorithms work.

© 2016 Debbie Kennett

Thursday, 28 January 2016

Full Genomes Corporation collaborates with Novogene to offer low-cost whole genome ancestry test for US $895

The following information was provided by Full Genomes Corporation. With thanks to Justin Loe.

Full Genomes™ Corporation, the first company to offer a high-resolution and comprehensive Y chromosome test in January 2013, announced today that it is collaborating with Novogene, a leading genomics solution provider with the largest Illumina-based sequencing capacity in China, to offer GenomeGuide, one of the first whole genome tests for ancestry purposes, for under $1,000.

GenomeGuide, now available to consumers at US $895, includes raw data (BAM file), variant summary reports from SnpEff and VEP that are compatible with third party tools, such as Promethease, autosomal and X-chromosomal variant identification (Variant Call Format) files, and mitochondrial and Y chromosome reports (for males). As with other Full Genomes products, GenomeGuide is intended for ancestry/research-use only, and should not be relied upon for medical or diagnostic purposes.

Novogene, the only Illumina Genome Network partner in China, will deliver high-quality WGS data using the Illumina HiSeq X Ten system capable of sequencing up to 18,000 human genomes per year at the lowest cost per genome, and will apply its advanced bioinformatics capabilities and expertise to provide variant analysis.

"The advent of new technology has enabled Full Genomes to offer GenomeGuide, a new whole genome ancestry test which will be the most comprehensive ancestry test on the market today," stated Justin Loe, CEO of Full Genomes. "Full Genomes is committed to providing responsible and detailed genetic reports to the customer," he added, "and we are incorporating the latest technology to enable the consumer to receive comprehensive information on their ancestry. With their advanced Illumina technology, outstanding informatics/analysis, and highly responsive and effective support, we are confident that Novogene will deliver the high-quality WGS results our customers expect."

"We look forward to collaborating with Full Genomes and to helping enable the delivery of this highly cost competitive ancestry research to consumers," stated Dr. Ruiqiang Li, Founder and Chief Executive Officer of Novogene. "As one of the first companies in the world to purchase Illumina's HiSeq X Ten in early 2014, we have extensive experience with the system and are uniquely positioned to provide Full Genomes and its customers with the highest quality WGS data."

About Full Genomes Corporation

Full Genomes was incorporated in 2012 for the purpose of making full genomic sequencing for genealogical use available to the general public at a leading price point. Full Genomes introduced Y Elite, a comprehensive next generation sequencing test of the Y chromosome, to the genetic genealogy market in 2013. Since then, a variety of customers in the U.S. and overseas, as well as a number of institutions, have used FGC's Y chromosome product. Full Genomes' proprietary DNA analysis capabilities for the Y chromosome have been recognized and have been used for a variety of research projects. Full Genomes has partnered with various vendors and organizations for sequencing and chip development with the goal of advancing new products and DNA services addressing additional markets.

About Novogene

Novogene Bioinformatics Technology Company, Ltd., headquartered in Beijing with branches in the US and UK, is a leading genomics solution provider with cutting edge bioinformatics expertise and the largest Illumina-based sequencing capacity in China. Committed to quality service and scientific excellence, Novogene has achieved rapid growth and industry recognition by working in partnership with diverse healthcare, educational and research institutions around the globe to realize the unlimited potential of the rapidly evolving world of genomics. The company has completed numerous major service projects with findings published by top-ranked journals such as Science and Nature. Novogene is the first company in China to purchase Illumina's HiSeq X 10 system and is the only Illumina Genome Network partner in China. Novogene Corporation is Novogene's U.S. subsidiary, based in San Diego, CA. To learn more, visit http://en.novogene.com.

For more information, contact:

Justin Loe
Chief Executive Officer
424-333-8537
justin.loe@fullgenomes.com
Full Genomes Corporation

Joyce Peng, Ph.D.
Global Marketing Director and General Manager
1-626-222-5584
joyce.peng@novogene.com
Novogene Corporation

Autosomal DNA triangulation. Part 2: the phenomenon of triangulated segments

In the first article in this two-part series I covered the basics of autosomal DNA inheritance and how triangulation can be successfully used to assign segments to specific ancestors within the last five generations or so in combination with chromosome mapping. I'm now going to take a look at the interesting phenomenon of "triangulated segments" – segments of DNA that are shared in common with multiple people.

We would naturally expect several of our close relatives (grandparents, aunts and uncles, and first cousins) to share some segments in common but once you get out to the fifth or sixth cousin level the chances of sharing any DNA with a specific cousin are very low. If you do match a fifth cousin or a more distant cousin you would only expect to share a single IBD (identical by descent) segment. Despite the low odds, because we have so many distant cousins, we will inevitably find some rare examples of people who match three or more fifth cousins who are descended from the same ancestral couple. This is much more likely to happen if the couple had a large family and their descendants also went on to have lots of children. However, because we all inherit different combinations of segments of DNA from our ancestors you would expect to match other fifth and sixth cousins on different segments rather than all on the same segment. We all inherit different pieces of our ancestors' genetic jigsaw puzzle rather than all sharing the same piece.

While I'm not aware of any scientific papers that have studied the frequency of triangulated segments, AncestryDNA have done some interesting computer simulations which shed some light on the matter. They found that a group of three first cousins shared a matching segment of over 5 cMs in length over 80% of the time. Three second cousins shared the same segment around 60% of the time, but for third cousins the rate was just 15%. The chances that three fourth cousins would all share the same matching segment were found to be around 1% (see figure below). From this we can infer that it would be extremely unlikely for three or more fifth, sixth or more distant cousins all to match on the same segment through IBD descent from a specific ancestor. This of course assumes that they have inherited enough DNA from their mutual ancestor to show up as a match at all.
Figure reproduced by kind permission of AncestryDNA from the customer help article "Do all members of a DNA Circle share the same matching segment?"
The results of the AncestryDNA simulations have been replicated in the real-life findings from their DNA Circles feature. They say:
Data we've gathered from our DNA Circles also shows that it is unlikely to find matching segments among three or more people. Only 4 percent of DNA Circles (that have between 3 and 30 members) have three or more people that share the same matching segment. In the remaining 96 percent of DNA Circles, no more than two people in the Circle share any one particular segment. In other words, even in DNA Circles with 30 descendants, usually only two or three descendants will all inherit the same segment.1
The results of the AncestryDNA research into triangulated segments have not been published in a peer-reviewed scientific journal. We don't know the assumptions that were made in the models and we have to accept the results in good faith. We also don't have access to the shared segment data for our matches at AncestryDNA so that the findings can be independently verified. However, AncestryDNA do have a good team of scientists working for them. I'm also not aware of any credible research that has disproved their findings.

In contradiction to what we might expect from the AncestryDNA findings, there is a lot of anecdotal evidence that people are seeing lots of matches which all fall on the same segment on the same chromosome. We don't have much data on the extent of the phenomenon in terms of the number of people in each "triangulated group" and the size of the segments that they share. The situation is also complicated because each company has a different matching threshold, which means that we might not see the full extent of the problem.

The hypothesis has been proposed that if multiple people share the same segment and they all match each other then they must all share a recent genealogical ancestor. It is then just a question of comparing family trees and trying to find surnames in common. If you match lots of people on the same segment this process of "triangulation" should in theory be fairly straightforward because it's just a question of looking for recurring surnames and locations in common, and the more family trees you have to compare the easier this should be. However, there are many pitfalls with this approach as we will see.

When I look at my own data I can see that on every chromosome I have at least one segment where I appear to match multiple people. This scenario is best visualised by using Don Worth's Autosomal DNA Segment Analyser, a free utility which is available from the DNAGedcom website. ADSA uses the "in common with" files from Family Tree DNA. This is not "true" triangulation, which requires checking that the people you match also match each other. However, it is in most cases a pretty reasonable approximation. The screenshot below shows the most extreme example from my own data which occurs on chromosome 18. As can be seen, I have a big group of people who all overlap in the same region, though the amount of sharing is quite small and in most cases under 10 cMs. (On the ADSA diagram below my mum's sharing is shown in black and my dad is shown in pink.)
However, I have not been able to identify a common ancestor or even a common surname with any of the people in my own "triangulated groups". It doesn't help that most of the people in these groups are in America with all-American ancestry and no indication of places of origin in the UK. I do have some known relatives who emigrated to America and Canada in the nineteenth and twentieth centuries but I have not come across any other emigrants further back in my family tree. You would have thought that if I was related to lots of people who had emigrated to America in the last 400 years it would have been possible to identify at least some of the connections. This rather suggests to me that if these segments are IBD, they are a signal of very distant shared ancestry that perhaps predates the colonisation of America. I have one triangulated group that looks distinctly Irish in flavour, and I do have some Irish ancestry but Irish research prior to 1800 is difficult at the best of times and especially so when the only surname you've identified is Sullivan!

My own "triangulated segments" are all quite small, but there is anecdotal evidence of people who have matches with large numbers of people on the same segments and where the shared segments are much larger than the ones I'm seeing. Some people have observed "triangulated groups" with segments over 20 cMs in size, which in theory should be more recent in origin and where it should be much easier to identify a common ancestor. But if it really is so difficult for three or more fourth, fifth or more distant cousins to match on the same segment why is it that we are seeing so many examples of this happening? I can offer a few suggestions that might help explain this phenomenon.

Lack of phasing
The first difficulty when considering our matches is that our data at Family Tree DNA and 23andMe is not phased. Phasing is the process of sorting out the DNA letters we receive from our parents and assigning them to the maternal and paternal chromosomes. Our autosomal chromosomes come in pairs. We receive one set of 22 autosomes from our mum and another set from our dad. If you look at your raw data you'll see that for each chromosome you have a long list of As, Cs, Ts and Gs divided into two columns. However, the columns with our data for each chromosome aren't conveniently sorted so that all the DNA letters in one column represent all the letters you got from your dad and all the letters in the other column are the letters you got from your mum. The letters are all jumbled up so both columns are a mishmash of all the As, Cs, Ts and Gs that you get from both your parents. The computer algorithms are looking for consecutive runs of As, Cs, Ts and Gs that all match each other but they're looking in both columns to find the matches. If the algorithms find enough matching letters in a run then we can be reasonably certain that the segment is IBD – a true match inherited through successive unbroken generations from grandparent, to child to grandchild and so on. Strictly speaking what we are seeing are not segments of DNA but sets of alleles that form haplotypes.
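
To make the "both columns" problem concrete, here is a toy sketch in Python. It is not the matching algorithm used by any of the companies: the haplotypes are invented, there are only four SNPs, and there is no centiMorgan threshold or error tolerance. The function names are hypothetical. It simply shows that two people can share at least one letter at every position (an unphased match) without either of them carrying the same run of letters on a single chromosome.

```python
def unphased(haplotypes):
    """Jumble a (maternal, paternal) pair of haplotypes into per-SNP genotypes."""
    maternal, paternal = haplotypes
    return [tuple(sorted(pair)) for pair in zip(maternal, paternal)]


def half_identical(genotypes_a, genotypes_b):
    """Unphased test: at every SNP the two people share at least one letter."""
    return all(set(a) & set(b) for a, b in zip(genotypes_a, genotypes_b))


def shares_haplotype(haplotypes_a, haplotypes_b):
    """Phased test: one whole haplotype of A equals one whole haplotype of B."""
    return any(ha == hb for ha in haplotypes_a for hb in haplotypes_b)


# Invented example data: (maternal haplotype, paternal haplotype)
person_a = ("ACGT", "TGCA")
# The first haplotype of person_b zigzags between person_a's two haplotypes.
person_b = ("AGGA", "CCCC")

print(half_identical(unphased(person_a), unphased(person_b)))  # True
print(shares_haplotype(person_a, person_b))                    # False
```

In this example the unphased comparison reports a match across all four positions, yet no single haplotype is shared, which is what a false positive pseudosegment looks like in miniature.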

Phasing matters most with the smaller segments under 15 cMs where there is a law of diminishing returns. As the segments get smaller the chances that the segments will be false positive pseudosegments (mishmashes of As, Cs, Ts and Gs from both the maternal and paternal chromosome) will tend to increase. Independent research from genetic genealogists suggests that 15 cM is the threshold where segments can be assumed to be IBD with reasonable confidence, whereas only 42% of 7 cM segments are likely to be IBD. Even when phasing is done there is still the possibility of false matches with the smaller segments. A study by Durand et al (2014) found that over 67% of phased 2-4 cM segments were false positives (matches found in the child but not in the parents).2

The most accurate phasing is done with parent/child trios. There are various computer programs and third-party tools that will do this (eg, the GedMatch tools and David Pike's tools). However, this sort of analysis is something that only the most advanced genetic genealogists are likely to undertake, and even if you were to phase your own data none of the companies currently provide the facility for you to use a phased genotype. It is also possible to do algorithm-based phasing from reference sequences, and this can be done with a very high degree of accuracy. AncestryDNA use this type of population-based approach, but they have developed their own sophisticated proprietary program. The error rate for the AncestryDNA phasing engine is only about 1% when compared with parent/child trios. However, phasing is computationally challenging and expensive, and AncestryDNA are currently the only company who are able to do this. In theory the matches we get from AncestryDNA should be much more accurate than the matches we get from 23andMe and Family Tree DNA.

It has been suggested that segments which "triangulate" must be IBD but I see no rationale for this assumption and, to the best of my knowledge, this hypothesis has not been tested. We already know that some small segments don't triangulate with close relatives. I have some examples in my blog post on Tracking DNA segments through time and space. If this can happen with small segments perhaps it can happen with large segments too.

Genotypes versus whole genome sequencing
The second point we have to consider is that the currently available autosomal DNA tests are not sequencing all six billion DNA letters in our genomes. The testing is done on an Illumina chip which looks at around 700,000 different letters scattered across the genome.3 This process is known as genotyping. The Illumina chip that all the companies use was designed for health purposes and not for genealogy. The SNPs that are included are those that are useful for genome-wide association studies (GWAS) where the goal is to look for SNPs shared at the population level not SNPs that are shared at the family level. The density of SNPs on the chip varies and some regions are better covered than others. Segments containing rare alleles are much easier to identify than segments with alleles which have a high frequency in the population. The segments that are used for matching purposes in autosomal DNA tests do not therefore provide a complete sequence of all the letters in the "segment" but merely a run of consecutive SNPs with many missing intervening letters. This introduces the possibility of errors, particularly for shorter segments. In addition, two separate segments could be stitched together to give the appearance of one single segment because the intervening SNPs that might break up the sequence aren't on the chip.

Shared descent through multiple ancestral pathways
We saw in my previous blog post how pedigree collapse and endogamy can affect relationship predictions within recent generations but pedigree collapse and endogamy affect all our family trees sooner or later. We are all endogamous. It is just a matter of degree. The number of ancestors doubles with every generation. You only have to go back 20 generations before you find that you theoretically have 1,048,576 genealogical ancestors. You eventually reach a point where your theoretical number of ancestors exceeds the entire number of people who have ever lived on the planet. We have a world population of over seven billion people but we all trace our ancestors back to a historical population of just one billion in 1850.
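
The doubling is easy to check for yourself. The short Python sketch below uses illustrative function names and counts ancestor "slots" in a pedigree rather than distinct people, which is exactly why the theoretical number runs away from any real population figure. The one billion threshold in the example is the 1850 population figure mentioned above, used only as a convenient yardstick.

```python
def theoretical_ancestors(generations_back):
    """Number of ancestor slots in a pedigree at a given generation."""
    return 2 ** generations_back


def first_generation_exceeding(population):
    """Smallest number of generations back at which slots exceed population."""
    generation = 0
    while theoretical_ancestors(generation) <= population:
        generation += 1
    return generation


print(theoretical_ancestors(20))                  # 1048576
print(first_generation_exceeding(1_000_000_000))  # 30
```

On these assumptions the theoretical count passes one billion only 30 generations back, so the same real ancestors must fill many slots in the tree.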

What this means in practice is that everybody is related to everybody else and we are all related much more recently than we intuitively realise. For many people this endogamy will not be documented in their family trees. The only example I can find in my own family tree of two ancestors who were already related when they married dates back to the late seventeenth century in North Molton, Devon. My ggggggg-grandparents Daniel Locke and Mary Bright were first cousins when they married in 1667. However, even though I cannot trace all the distant relationships it is an inescapable fact that all my ancestors who were marrying in rural villages in Devon, Somerset, Gloucestershire, Essex and Hertfordshire back in the 1700s must have been closely related to each other and were probably third, fourth, fifth and more distant cousins many times over. I also have lots of London ancestors and they would be more distantly related because of the sheer size of the London population and the fact that people migrated to London from all over Britain and elsewhere. Eventually all our ancestral lines will come together in a tangled and complex network of relationships connected on many different pathways. To get an idea of what our collapsed pedigrees might look like have a look at this wonderful 80-generation pedigree chart for a border collie dog showing 90% pedigree collapse.

Two peer-reviewed papers studying present-day populations have confirmed the mathematical predictions of our ubiquitous recent shared ancestry. Henn et al (2012) found tens of thousands of 2nd to 9th degree cousin pairs within a dataset of 5,000 Europeans. They also found that some highly endogamous populations such as Native Americans and the Kalash of Pakistan were effectively the genomic equivalents of second cousins.4 Ralph and Coop (2013) studied genomic data for a population of 2,257 Europeans. They found that "a pair of modern Europeans living in neighboring populations share around 2-12 genetic common ancestors from the last 1,500 years, and upwards of 100 genetic ancestors from the previous 1,000 years".5

These findings have also been replicated by AncestryDNA who found that their customers who are in DNA Circles were getting roughly the expected number of matches four and five generations ago with third and fourth cousins but progressively more matches than would be expected with fifth and sixth cousins six and seven generations ago.
Figure reproduced by kind permission of AncestryDNA from the customer help article "Why do DNA Circles only go back six generations?"
AncestryDNA conclude:
Our research shows that descendants of an ancestor who lived more than six generations ago have more DNA in common with other descendants of that ancestor than they’d be expected to. This discrepancy increases the more generations you go back in time and suggests that descendants are actually related through multiple ancestors.6
If researchers are seeing such relatively high levels of IBD sharing in the present-day population we can assume that 300 years ago, when we trace back to a very much smaller population, everyone must have been much more closely related than we are today. We don't have a time machine so that we can travel back to the 1700s and get autosomal DNA tests done on all 1024 of our gggggggg grandparents. However, if we could, we might expect that a very high percentage of our ancestors would show up as matches to each other with many of the relationships being as close as fourth, third or second cousins. The effect would be compounded by the fact that the ancestral population of Europe went through an extreme genetic bottleneck in the fourteenth century when the Black Death wiped out over half of the population of Europe. The cumulative effect of this population structure is that the genomes of our ancestors 300 years ago would perhaps have the same characteristics as that of a highly endogamous population today. They would share more segments in common than would be expected for the degree of relationship and, if many of those ancestors were second or third cousins, then a number of them might be expected to match on the same segment. If there were lots of segments shared by many people in the historical population then it's easy to understand how these segments could also be found in their descendants today, but these segments would be passed on through a variety of different pathways, making it very difficult, if not impossible, to determine the individual lines of descent.

As an example, if you have ancestors in the 1700s, A, B, C, D and E, who all share the same 8 cM segment there are five possible pathways in which that segment could have been passed on to you. You might match a cousin who has Ancestors A, F, G, H and I in her tree who are similarly all related and share that same 8 cM segment. She too has five possible pathways in which that segment could have been passed down to her. It may be that you can both identify Ancestor A in your genealogical trees, and you assume that you are genetically related because you both share descent through Ancestor A. If each pathway is equally likely, you each have a 1 in 5 chance of inheriting that segment from Ancestor A, but there is only a 1 in 25 chance that you have both inherited the same segment from Ancestor A. The most probable scenario (16 in 25) is that you have both inherited the segment from different ancestors and neither of you has inherited the segment from Ancestor A. You will share a common genetic ancestor but that ancestor will be the progenitor of ancestors A, B, C, D, E, F, G, H and I and not Ancestor A, and might well be beyond the reach of genealogical records.
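
The odds in this example can be worked through explicitly. The Python sketch below rests on two simplifying assumptions that are not claims about real pedigrees: every pathway is equally likely, and your inheritance and your cousin's are independent of each other. The function name is illustrative.

```python
from fractions import Fraction


def pathway_probabilities(paths_you, paths_cousin):
    """Chances that both, exactly one, or neither of you got the segment via Ancestor A.

    Assumes each person has exactly one pathway through Ancestor A and that
    all pathways are equally likely and independent.
    """
    p_you = Fraction(1, paths_you)
    p_cousin = Fraction(1, paths_cousin)
    both = p_you * p_cousin
    neither = (1 - p_you) * (1 - p_cousin)
    exactly_one = 1 - both - neither
    return both, exactly_one, neither


both, exactly_one, neither = pathway_probabilities(5, 5)
print(both, exactly_one, neither)  # 1/25 8/25 16/25
```

With five pathways each, the shared ancestor in both trees is the true source of the segment for both people only 4% of the time, and the chance falls further as the number of candidate pathways grows.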

It therefore seems likely that if lots of people match on the same segment this indicates that the segment is prevalent in the population from which they descend as a result of historical endogamy rather than an indication that they all share that same segment from a recent genealogical ancestor within the last 300 years or so. Indeed many of the tools produced by population geneticists use haplotype frequency as a way of detecting IBD. For example, Browning and Browning (2011) say “Haplotype frequency is critical because a shared common haplotype is unlikely to reflect recent IBD, whereas a shared haplotype that is very rare is likely to be identical by descent”.7

Pile-up regions
In addition to the problem of historical endogamy, which makes it very difficult to infer distant relationships with the currently available tests, it is also known that there are some regions of our genome which are prone to what the population geneticists call "excess IBD sharing" or what are more colloquially known as pile-up regions. These are segments which are widely shared at the population level. For a summary of some of the research into this subject see the section on excess IBD sharing in the ISOGG Wiki page on identity by descent. AncestryDNA use a proprietary algorithm known as Timber to filter out segments which occur at high frequency in the database. In some cases AncestryDNA found segments that were shared by thousands of people which suggested that these people weren't recently related to each other but shared DNA because they were descended from the same gene pool. There was not a direct correlation between the size and frequency of a shared segment and some of the segments that were filtered out using this method were quite large. See the blog post from Julie Granka on Filtering DNA matches at AncestryDNA with Timber which includes a table showing the size of the segments that were removed by this process. There is always going to be a trade off between false positive and false negative matches and no algorithm is perfect. Some genetic genealogists who have tested parents and children at AncestryDNA have reported that up to 35% of their child's matches, including some fourth cousin matches, do not appear in the match list of either parent. This discrepancy has not yet been explained but appears to be related to the use of the Timber algorithm.

Conclusion
Autosomal DNA testing for genetic genealogy is still very much in its infancy, and we clearly have a lot to learn about the interpretation of results, particularly for endogamous communities and for the more distant relationships beyond the fourth or fifth cousin level where family trees start to get very patchy and where relationship predictions become more difficult. The lack of phasing at 23andMe and Family Tree DNA means that our matches, particularly those with the smaller segments, are unreliable and there are both false positives and false negatives. Many of the ambiguities in our results would disappear if we were to move to whole genome sequencing, and preferably with phased genotypes too. That is unlikely to happen in the next few years but will no doubt be routine at some point in the not too distant future.

In view of the known levels of historical endogamy in the human population and the almost impossible mathematical odds of multiple fifth and sixth cousins matching on the same IBD segment through descent from the same ancestral couple, I would suggest that any segment that triangulates with multiple distant cousins is unlikely to be indicative of a recent genealogical relationship. The common ancestor will probably have lived much further back in time and may well be beyond the reach of genealogical records. Our focus should perhaps instead be on all the rare haplotypes in our match list – the segments that we share with just one of our distant matches. I hope that it might be possible to find a way of testing some of these competing hypotheses.

In the meantime, it's very important that we don't jump to conclusions based on patterns seen in the data. In science and in genealogy it is important to look not for evidence that will prove your hypothesis but for evidence that will disprove your hypothesis.

Further reading
- "Puzzled Researcher". Chromosome pile-ups in genetic genealogy: examples from 23andMe and FTDNA. Genealogy and Genomics blog, 31 January 2015.

- Ann Turner. What a difference a phase makes. A guest post on Blaine Bettinger's The Genetic Genealogist blog, 30 March 2015.

- Ann Turner. The trouble with triangulation: preliminary notes. 4 April 2015.

- "Puzzled Researcher". Genealogy and autosomal DNA matches: common errors in proving an ancestor and the allure of easy gateway ancestors. Genealogy and Genomics blog, 19 April 2015.

- Ann Turner. Anatomy of an IBD segment. A guest post on Jim Bartlett's Segmentology blog, 1 October 2015.

Notes and references
1. AncestryDNA. Do all members of a DNA Circle share the same matching segment? An article in the AncestryDNA help menu "Learn more about DNA Circles" which is accessible to AncestryDNA customers only.
2. Durand EY, Eriksson N, McLean CY (2014). Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis. Molecular Biology and Evolution 31(8): 2212-2222.
3. Both AncestryDNA and FTDNA test around 690,000 SNPs. The 23andMe v4 chip has around 577,000 SNPs.
4. Henn BM, Hon L, Macpherson JM, Eriksson N, Saxonov S, et al (2012). Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS ONE 7(4): e34267. See also the blog post from 23andMe How many relatives do you have? summarising the findings of this paper.
5. Ralph P, Coop G (2013). The geography of recent genetic ancestry across Europe. PLoS Biol 11(5).
6. AncestryDNA. Why do DNA circles only go back six generations? From the AncestryDNA help menu "Learn more about DNA Circles" which is accessible to AncestryDNA customers only.
7. Browning BL and Browning SR (2011). A fast, powerful method for detecting identity by descent. American Journal of Human Genetics 88(2):173-82. See also: Browning SR, Browning BL (2012). Identity by descent between distant relatives: detection and applications. Annual Review of Genetics 2012; 46: 617-33. In the latter article the authors state: "The key idea behind IBD segment detection is haplotype frequency. If the frequency of a shared haplotype is very small, the haplotype is unlikely to be observed twice in independently sampled individuals, so one can infer the presence of an IBD segment. This criterion can be applied in several ways. The first is length of sharing, which is a proxy for frequency. If two densely genotyped haplotypes are identical at all or most (allowing for some genotyping error) assayed alleles over a very large segment of a chromosome, then the haplotypes are likely to be identical by descent across the whole segment. The second is direct use of haplotype frequency: Shared haplotypes with estimated frequency below some threshold are determined to be identical by descent. The third makes use of a population genetics model to infer probability of IBD. Given the frequency of the shared haplotype and a probability model for the IBD process along the chromosome, one can estimate the probability that the individuals are identical by descent at any position on the segment."

© 2016 Debbie Kennett

Autosomal DNA triangulation. Part 1: the basics

There have been a lot of discussions in the genetic genealogy community in the last few months in the ISOGG Facebook group and on the ISOGG DNA Newbie list on the subject of triangulation for autosomal DNA. As a contribution to the debate I thought I would take the opportunity to share my own understanding of all the issues involved. This is the first of two articles on the subject. I will start by covering some of the basic principles of autosomal DNA inheritance and triangulation. In the second part I will look at the phenomenon of triangulated segments.

One of the difficulties I've found is that people are using the term triangulation in different ways to mean different things. Triangulation is a term that has been adapted from surveying. It was first used in genetic genealogy in the context of Y-chromosome DNA and mitochondrial testing by Bill Hurst who proposed the following definition on the Rootsweb Genealogy DNA list in 2004:
Triangulation: In genetic genealogy, the determination of the Y-chromosome DNA of a male ancestor by finding an exact match between direct paternal descendants of two sons of the ancestor. Similarly, the determination of the mitochondrial DNA of a female ancestor by finding an exact match between direct maternal descendants of two daughters of the ancestor. 
Autosomal DNA is of course a lot more complicated than Y-DNA and mtDNA. We receive one set of 22 autosomes from our father and one set of autosomes from our mother. However, before the DNA is passed on from the parents to the child it undergoes a process known as recombination, which means that it gets shuffled up before it’s passed on. We receive 50% of our DNA from our mother and 50% from our father, but the DNA that we receive from our parents is a patchwork of the DNA from all four of our grandparents. Sometimes we will inherit an entire chromosome from one of our grandparents but more often than not our 22 autosomes will get split up into one or two large segments on each chromosome. You can see this process in action in the screenshot below. The comparison has been done using the Family Tree DNA chromosome browser and is from the point of view of my son. His DNA is compared with me (pink), his father (green), his maternal grandfather (orange) and his maternal grandmother (blue).  You can see that the segments of DNA he has inherited from his maternal grandparents have been broken up into large chunks, though on chromosome 18 he has inherited the entire chromosome from his maternal grandmother, and he has received his entire chromosome 22 from his maternal grandfather.


While we share large chunks of DNA in common with our grandparents the number of shared segments and the size of those segments gets smaller with each passing generation, and we eventually reach a point where we have genealogical ancestors from whom we have inherited no DNA at all.

Autosomal DNA triangulation works on the same principles as triangulation for Y-DNA and mtDNA. We start with the known and work back to the unknown, and we combine DNA evidence with sound genealogical evidence to draw a conclusion. For autosomal DNA we are looking at specific segments of DNA and trying to determine the ancestor or ancestral couple from whom we inherited that DNA. For this process to work we need relatives who are closely related to us with known genealogies. If you test two known first cousins and they have the expected amount of DNA in common for a first cousin relationship you can assign the shared segments to their mutual grandparents. Similarly if you test two second cousins and they share the appropriate percentage of DNA in common you can infer that the shared segments have been inherited from their mutual great-grandparents.

The technique can also be used with third, fourth and fifth cousins but it is important that both parties have sound genealogies and are able to trace back their ancestors on all their family lines for the appropriate number of generations in order to rule out the possibility of a relationship on a different pathway. The assignment of segments to fourth and fifth cousins is more secure if the match can also be triangulated with other close family members (eg, a parent, an aunt or uncle, a first or a second cousin).

Triangulation can be used in combination with chromosome mapping, a technique which is deployed by some of the more advanced genetic genealogists in our community. Chromosome mapping opens up exciting possibilities, and has the potential to enable us to make a partial reconstruction of the genome of our ancestors. This will eventually allow us in some cases to determine which traits, such as hair colour and eye colour, we can attribute to specific ancestors. Such an exercise has already been done by AncestryDNA who were able to reconstruct about 50% of the genome of David Speegle, a man who lived in Alabama in the early 1800s. David Speegle was chosen for the exercise because he had two wives, Winifred Crawford and Nancy Garren, and also an exceptionally large number of children who in turn went on to have lots of children. This meant that Speegle had many surviving descendants in the AncestryDNA database which made the task of reconstruction a lot easier. I hope that AncestryDNA will eventually publish this research in a scientific journal. In the meantime it's instructive to look at this video which explains the methodology and to study David Speegle's chromosome map (starting at around 1 minute 58 seconds) showing all the reconstructed segments scattered across his 22 autosomes. It is ironic that AncestryDNA uses a chromosome map to demonstrate this concept but that they do not currently provide their customers with the matching segment data or a chromosome browser so that we can replicate the methodology ourselves.

Another interesting study has been done by genetic genealogist Kitty Cooper. She has created a chromosome map showing the segments she has been able to attribute to her great-great-grandparents Jørgen and Anna Wold of Drammen, Norway. Reconstructing the genomes of our ancestors is rather like trying to do a giant genetic jigsaw puzzle. One of your cousins might have the segment containing the alleles for your ancestor's brown eyes, and another cousin might have the segment with the alleles for his brown hair. We don't all inherit the same piece of the jigsaw puzzle but instead we all inherit different pieces which can be joined together to reconstruct the bigger picture.

While it is possible to use known, close autosomal DNA matches for chromosome mapping and assign segments of DNA to our ancestors to about the fifth or sixth generation, it is much more difficult to map segments to more distant ancestors. The first problem is that our family trees become more difficult to research as we go further back in time. Two fifth cousins will share their great-great-great-great-grandparents in common. However, we have 64 great-great-great-great-grandparents. Very few people are able to identify all 64 of them, and only a minority of family historians are able to identify all of their 32 great-great-great-grandparents. It therefore becomes very difficult to conclude that the match is on the specific line of interest and that we are not matching because of shared descent on a different line which we haven't yet researched. In addition, because of the random way in which autosomal DNA is inherited, the relationship predictions become less reliable for the more distant relationships. The companies will therefore give you a range of relationships within which the match is likely to fall rather than a precise relationship. For example 23andMe assigns the more distant relationships as third to distant cousin or fourth to distant cousin. At FTDNA the more distant relationships are split into fourth cousins to remote cousins and fifth cousins to remote cousins. AncestryDNA gives predictions for fourth to sixth cousins or fifth to eighth cousins.
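The ancestor counts quoted above follow from simple doubling: the number of ancestral slots in the tree doubles with every generation. The short Python sketch below is only an illustration of that arithmetic. The function names are my own, and it assumes an idealised tree with no pedigree collapse, counting parents as generation 1.

```python
def ancestors_in_generation(generations_back: int) -> int:
    """Number of ancestral slots a given number of generations back.

    Parents are generation 1, grandparents generation 2, and so on.
    Assumes no pedigree collapse (no ancestor appears twice).
    """
    return 2 ** generations_back


def shared_generation_for_cousins(degree: int) -> int:
    """Generation (counting back from the tester) at which full cousins
    of the given degree share an ancestral couple.

    First cousins share grandparents (generation 2), second cousins share
    great-grandparents (generation 3), and so on.
    """
    return degree + 1


for degree in range(1, 7):
    generation = shared_generation_for_cousins(degree)
    slots = ancestors_in_generation(generation)
    print(f"Cousin degree {degree}: shared couple is in generation "
          f"{generation}, one couple among {slots} ancestors")
```

For fifth cousins the shared couple sits among the 64 great-great-great-great-grandparents, which is why a gap on even one unresearched line leaves room for the match to come from somewhere else.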

The second problem is that as we go further back in time we start to find some ancestors from whom we have inherited no DNA at all. While we will probably have inherited DNA from all of our 32 great-great-great-grandparents there might be one or two of our 64 great-great-great-great-grandparents from whom we have not received any DNA at all. For a good explanation of this process see the blog post by Graham Coop on How many genetic ancestors do we have? See also the useful table by Bob Jenkins in his article How many genetic ancestors do you have? What this means is that once you go back beyond about 10 generations (roughly 300 years) only a small fraction of your ancestors have contributed directly to your DNA.1 If you wish to triangulate a match with a fourth or more distant cousin you must first of all hope that both of you have inherited some DNA from the ancestor of interest. You must also hope that both of you have inherited the same segment of DNA on the same chromosome. As we are only likely to share one segment with a fifth cousin, if we share any DNA at all, you can see that when there are 22 autosomes to choose from the chances that you will both share a segment on the same chromosome are likely to be very slim indeed. If fifth cousins do have any detectable IBD sharing it has been estimated that it will usually be composed of a single segment with a mean length of 8.3 cM (∼8Mb).2

All three testing companies have provided percentages showing the chances of matching a known cousin at the differing degrees of relationship. I've compiled the statistics into the table below.3

| Relationship | 23andMe (unphased) | Family Finder (unphased) | AncestryDNA (phased) |
| --- | --- | --- | --- |
| 2nd cousin | > 99% | > 99% | 100% |
| 3rd cousin | ~ 90% | > 90% | 98% |
| 4th cousin | ~ 45% | > 50% | 71% |
| 5th cousin | ~ 15% | > 10% | 32% |
| 6th cousin or more distant | < 5% | Remote (typically less than 2%) | 11% |

AncestryDNA phases the genotypes before doing the matching process. (Phasing is the process of assigning alleles to the maternal and paternal chromosomes and will be discussed in more detail in the second article in this series.) As can be seen from the table, phasing provides a better chance of matching at the fourth and fifth cousin level, but even with phased data it is clear that the odds of two fifth or sixth cousins sharing enough DNA on a specific line to show up as a match are still very slim.

Although the odds of matching a specific fifth or sixth cousin are actually very low, because we have so many fifth, sixth and more distant cousins these more distant relationships will dominate our match lists. Henn et al (2012) produced a model to estimate the expected number of cousins at different degrees of relationship and the figure is reproduced below courtesy of a Creative Commons Licence.4


A model produced by researchers at AncestryDNA, based on birth and census data from the last 200 years, produced some rather different statistics. They found that a typical British person had "five first cousins, as well as 28 second, 175 third, 1,570 fourth, 17,300 fifth, and 174,000 sixth cousins" making a grand total of 193,000 living cousins. Whatever the numbers might be, it is clear from the maths alone that because we all have such huge numbers of seventh, eighth and more distant cousins, the vast majority of our more distant matches are much more likely to fall in this range than to be fifth or sixth cousins.
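To see why the distant cousins dominate a match list even though each one is individually unlikely to match, the cousin counts quoted above can be multiplied by the AncestryDNA (phased) detection percentages from the table earlier in this article. The Python sketch below is a rough illustration only. It rests on two assumptions of my own: that every one of these cousins has tested (which is far from true, so only the proportions mean anything), and that the 11% figure given for "6th cousin or more distant" can be applied to sixth cousins.

```python
# Living cousin counts for a typical British person, as quoted above
# from the AncestryDNA model.
cousin_counts = {
    "second": 28,
    "third": 175,
    "fourth": 1_570,
    "fifth": 17_300,
    "sixth": 174_000,
}

# AncestryDNA (phased) chances of detecting a cousin, from the table above.
# ASSUMPTION: the "6th cousin or more distant" figure is used for sixth cousins.
detection_probability = {
    "second": 1.00,
    "third": 0.98,
    "fourth": 0.71,
    "fifth": 0.32,
    "sixth": 0.11,
}

# ASSUMPTION (unrealistic): every cousin has tested, so the absolute numbers
# are not meaningful; only the relative proportions are of interest.
expected_matches = {
    degree: cousin_counts[degree] * detection_probability[degree]
    for degree in cousin_counts
}
total_expected = sum(expected_matches.values())
distant_share = (
    expected_matches["fifth"] + expected_matches["sixth"]
) / total_expected

for degree, expected in expected_matches.items():
    print(f"{degree} cousins: {expected / total_expected:.1%} of expected matches")
print(f"Fifth and sixth cousins combined: {distant_share:.1%}")
```

Under these assumptions the fifth and sixth cousins account for well over nine tenths of the expected matches, and that is before the even more numerous seventh, eighth and more distant cousins are taken into account.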

Pedigree collapse and endogamy
Relationship predictions can be confounded by recent pedigree collapse. This is the phenomenon whereby the same ancestral couple appears twice or more in your family tree. For example, if your parents were first cousins you would have six great-grandparents rather than eight and 48 great-great-great-great-grandparents instead of 64. If your parents were second cousins you would have 14 great-great-grandparents instead of 16, and 56 great-great-great-great-grandparents instead of 64. This means that you will inherit more DNA from the ancestors who appear twice on your family tree, and there is a greater chance that their DNA will be preserved.
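The figures in this paragraph can be checked with a little arithmetic. The Python sketch below is a minimal illustration, not a general pedigree tool. The function name is my own, and it assumes that the parents are full cousins through exactly one shared ancestral couple and that there is no other pedigree collapse anywhere in the tree.

```python
def distinct_ancestors(generations_back: int, parents_cousin_degree: int) -> int:
    """Distinct ancestors in one generation when the parents are cousins.

    Parents are generation 1, grandparents generation 2, and so on.
    ASSUMPTION: the parents are full cousins of the given degree through a
    single shared couple, with no other duplication in the tree.
    """
    slots = 2 ** generations_back
    # Parents who are first cousins share their grandparents, who are the
    # tester's great-grandparents (generation 3); second cousins share a
    # couple in generation 4, and so on.
    shared_generation = parents_cousin_degree + 2
    if generations_back < shared_generation:
        return slots
    # The shared couple, and all of their ancestors, fill two slots each
    # in the tree, so each is counted once too often.
    generations_above_couple = generations_back - shared_generation
    duplicated = 2 * 2 ** generations_above_couple
    return slots - duplicated


# Parents are first cousins
print(distinct_ancestors(3, 1))  # great-grandparents
print(distinct_ancestors(6, 1))  # great-great-great-great-grandparents
# Parents are second cousins
print(distinct_ancestors(4, 2))  # great-great-grandparents
print(distinct_ancestors(6, 2))  # great-great-great-great-grandparents
```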

Endogamy is another confounding factor for relationship predictions. Endogamy is the practice of marrying within the same ethnic, cultural, social, religious or tribal group. Sometimes endogamy is enforced as a result of geographical isolation. Within an endogamous group there are multiple marriages between first, second and third cousins and everyone is effectively related to everyone else multiple times over within a very recent timeframe. Ashkenazi Jews are one example of an endogamous population. They can be traced back to a recent bottleneck with an effective population size of about 350 between 25 and 32 generations ago. The bottleneck was followed by rapid exponential expansion.5 With autosomal DNA tests people who are descended from an endogamous population will have significantly more matches than someone from a non-endogamous population. They will have a larger total cM count with their matches and will share more segments in common.6 As an example, I have no recent endogamy in my family tree, and I have 526 Family Finder matches at Family Tree DNA. In contrast, a British Jewish friend of mine now has 6150 matches.

Conclusion
We have seen how autosomal DNA triangulation can be a very useful tool when DNA evidence is combined with sound genealogical research to draw conclusions about close relationships up to about the fourth or fifth cousin level. In the second and final article of this series I will look at the interesting phenomenon of triangulated segments of DNA: segments of DNA which appear to be shared by multiple people descended from a single common ancestor. But do these segments have any genealogical relevance?

See also
Part 2: Autosomal DNA triangulation – the phenomenon of triangulated segments

Useful resources

- The Autosomal DNA Portal in the ISOGG Wiki
- Genetic genealogy and the single segment. A blog post from geneticist Steve Mount with some interesting insights into autosomal DNA matches.
- Expand and support your research with AncestryDNA Circles. An excellent presentation from computational biologist Dr Ross E Curtis which explains the basics of autosomal DNA inheritance and the methodology behind AncestryDNA's Circles feature. If you have tested at Ancestry also check out the articles in the "Getting started with DNA Circles" menu.
- AncestryDNA's DNA Circles White Paper

Footnotes and references
1. See also Speed D and Balding DJ (2015). DNA and pedigree ancestors. Supplement S2 for the paper Relatedness in the post-genomic era. Nature Reviews Genetics 16: 33-44.
2. Browning SR, Browning BL (2012). Identity by descent between distant relatives: detection and applications. Annual Review of Genetics 46: 617-33.
3. The 23andMe data was extracted from the FAQ The probability of detecting different types of cousins. The FTDNA data was taken from the Learning Center article What is the probability that my relative and I share enough DNA to be detected by Family Finder? The AncestryDNA data was extracted from Table 1 in the help article "Should other family members get tested?" This is available to AncestryDNA customers only and can be accessed through the AncestryDNA "Matching Help and Tips" menu.
4. Henn BM, Hon L, Macpherson JM, Eriksson N, Saxonov S et al (2012). Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS ONE 7(4): e34267.
5. Carmi S, Hui KY, Kochav E et al (2014). Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nature Communications 5: 4835.
6. Paull JM, Tannenbaum GS, Briskman J (2014). Why autosomal DNA test results are significantly different for Ashkenazi Jews. Avotaynu XXX (1): 12-18.

© 2016 Debbie Kennett