Coalescent Theory (3) -- the configuration of genealogy and sample sequences

04 Apr 2015

The ancestral recombination graph (ARG) maintains a set of nodes organized to reflect the genealogical history of the simulated sample chromosomes (the current generation). A node may represent one of the sample chromosomes or one of their common ancestors. Each node also corresponds to a DNA sequence that contains DNA segments inherited by the sample chromosomes.
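
A minimal sketch of how such nodes might be stored. It assumes half-open integer intervals for the segments; the names `Node`, `coalesce`, and `merge_intervals` are illustrative and are not taken from any particular simulator.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) along the chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Return the union of the intervals as a sorted, non-overlapping list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


@dataclass
class Node:
    name: str
    time: float                 # 0.0 for sampled chromosomes
    segments: List[Interval]    # material in this node that reaches the sample
    children: List["Node"] = field(default_factory=list)


def coalesce(name: str, time: float, left: Node, right: Node) -> Node:
    """Create the common ancestor of two lineages (hypothetical helper)."""
    combined = merge_intervals(left.segments + right.segments)
    return Node(name, time, combined, [left, right])


# Two lineages that each carry part of the chromosome meet in one ancestor.
x = Node("x", 0.2, [(0, 60)])
y = Node("y", 0.3, [(40, 100)])
xy = coalesce("xy", 0.7, x, y)
print(xy.segments)
```

The ancestor carries the union of the segments of its two children, which is why a node higher in the graph can hold material from several sample chromosomes.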

An ARG for four sequences. (A) Going backwards in time (from bottom to top), the graph shows how lineages that lead to modern-day chromosomes (bottom) either “coalesce” into common ancestral lineages (dark blue circles), or split into the distinct parental chromosomes that were joined (in forward time) by recombination events (light blue circles). Each coalescence and recombination event is associated with a specific time (dashed lines), and each recombination event is also associated with a specific breakpoint along the chromosomes (here, $b_1$ and $b_2$). Each non-recombining interval of the sequences (shown in red, green, and purple) corresponds to a “local tree” embedded in the ARG (shown in matching colors). Recombinations cause these trees to change along the length of the sequences, making the correlation structure of the data set highly complex. The ARG for four sequences is denoted $G^4$ in our notation.

(B) Representation of $G^4$ in terms of a sequence of local trees $T^4 = (T^4_1, T^4_2, T^4_3)$ and recombination events $R^4 = (R^4_1, R^4_2)$. A local tree $T^4_i$ is shown for each non-recombining segment in colors matching those in (A). Each tree, $T^4_i$, can be viewed as being constructed from the previous tree, $T^4_{i-1}$, by placing a recombination event along the branches of $T^4_{i-1}$ (light blue circles), breaking the branch at this location, and then allowing the broken lineage to re-coalesce to the rest of the tree (dashed lines in matching colors; new coalescence points are shown in gray). Together, the local trees and recombinations provide a complete description of the ARG. The Sequentially Markov Coalescent (SMC) approximates the full coalescent-with-recombination by assuming that $T^n_i$ is statistically independent of all previous trees given $T^n_{i-1}$.

(C) An alignment of four sequences, $D^4$, corresponding to the linearized ARG shown in (B). For simplicity, only the derived alleles at polymorphic sites are shown. The sequences are assumed to be generated by a process that samples an ancestral sequence from a suitable background distribution, then allows each non-recombining segment of this sequence to mutate stochastically along the branches of the corresponding local tree. Notice that the correlation structure of the sequences is fully determined by the local trees; that is, $D^n$ is conditionally independent of the recombinations $R^n$ given the local trees $T^n$.
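
The last point in (C) can be illustrated with a small sketch: once the local trees and the mutations on their branches are fixed, the derived-allele alignment follows without any reference to the recombination events. The tree encoding (a child-to-parent map), the node names, and the function names below are assumptions made for illustration, not the representation used in the paper.

```python
from typing import Dict, List, Optional, Tuple

# A local tree as a child -> parent map; the root maps to None.
Tree = Dict[str, Optional[str]]
LocalTree = Tuple[Tuple[int, int], Tree]  # ((start, end), tree), half-open sites


def leaves_below(tree: Tree, node: str, leaves: List[str]) -> List[str]:
    """Return the leaves whose path to the root passes through `node`."""
    below = []
    for leaf in leaves:
        current: Optional[str] = leaf
        while current is not None:
            if current == node:
                below.append(leaf)
                break
            current = tree[current]
    return below


def alignment(local_trees: List[LocalTree], mutations: List[Tuple[int, str]],
              leaves: List[str], length: int) -> Dict[str, str]:
    """Build one row per leaf: '1' for a derived allele, '.' otherwise.

    Each mutation is (site, node) and sits on the branch above `node`
    in the local tree that covers `site`.
    """
    rows = {leaf: ["."] * length for leaf in leaves}
    for site, node in mutations:
        tree = next(t for (start, end), t in local_trees if start <= site < end)
        for leaf in leaves_below(tree, node, leaves):
            rows[leaf][site] = "1"
    return {leaf: "".join(row) for leaf, row in rows.items()}


leaves = ["a", "b", "c", "d"]
# Sites 0-4: ((a,b),(c,d)); sites 5-9: (((a,b),c),d).
tree_1 = {"a": "x", "b": "x", "c": "y", "d": "y", "x": "r", "y": "r", "r": None}
tree_2 = {"a": "x", "b": "x", "x": "z", "c": "z", "z": "r", "d": "r", "r": None}
local_trees = [((0, 5), tree_1), ((5, 10), tree_2)]
mutations = [(2, "x"), (3, "y"), (7, "z"), (8, "d")]
rows = alignment(local_trees, mutations, leaves, 10)
for leaf in leaves:
    print(leaf, rows[leaf])
```

Nothing in `alignment` takes the recombination events as input; they matter only through the local trees they produce.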

The coalescent-with-recombination is conventionally described as a stochastic process in time, but Wiuf and Hein [36] showed that it could be reformulated as a mathematically equivalent process along the genome sequence. Unlike the process in time, this “sequential” process is not Markovian because long-range dependencies are induced by so-called “trapped” sequences (genetic material nonancestral to the sample flanked by ancestral segments). As a result, the full sequential process is complex and computationally expensive to manipulate. Interestingly, however, simulation processes that simply disregard the non-Markovian features of the sequential process produce collections of sequences that are remarkably consistent in most respects with those generated by the full coalescent-with-recombination [37], [38]. In other words, the coalescent-with-recombination is almost Markovian, in the sense that the long-range correlations induced by trapped material are fairly weak and have a minimal impact on the data.

http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004342
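
To make the Markov approximation concrete, here is a rough sketch of an SMC-style simulation for the simplest case of two sequences, where each local tree is fully described by its coalescence time. The function name, the parameter `rho`, and the scaling are assumptions: time is in coalescent units, and each lineage is assumed to recombine at rate `rho / 2` per unit of sequence length.

```python
import random
from typing import List, Tuple


def smc_pairwise(length: float, rho: float,
                 rng: random.Random) -> List[Tuple[float, float, float]]:
    """Return (start, end, tmrca) segments covering [0, length]."""
    segments = []
    position = 0.0
    tmrca = rng.expovariate(1.0)  # first tree: standard coalescent for n = 2
    while position < length:
        # Total branch length is 2 * tmrca, so with rate rho / 2 per lineage
        # the distance to the next breakpoint is exponential with rate rho * tmrca.
        next_break = position + rng.expovariate(rho * tmrca)
        end = min(next_break, length)
        segments.append((position, end, tmrca))
        position = end
        if position < length:
            # The recombination time is uniform on the current tree; the
            # detached lineage re-coalesces with the other one at rate 1
            # above that time. Only the current tmrca is used (Markov step).
            recomb_time = rng.uniform(0.0, tmrca)
            tmrca = recomb_time + rng.expovariate(1.0)
    return segments


for start, end, tmrca in smc_pairwise(20.0, 0.5, random.Random(42))[:5]:
    print(f"[{start:.2f}, {end:.2f})  tmrca = {tmrca:.3f}")
```

The next coalescence time is drawn from the current one alone, never from trees further to the left. That is exactly the simplification that ignores trapped material.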

The ARG traces back only those DNA segments that appear in the sample, that is, material ancestral to at least one sample chromosome.
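
A small sketch of what this means at a recombination event, going backwards in time: the segments carried by a lineage are split at the breakpoint between its two parents, and a parent that receives nothing ancestral to the sample does not need to be followed. The interval convention (half-open) and the function names are assumptions for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def split_at(segments: List[Interval],
             breakpoint: int) -> Tuple[List[Interval], List[Interval]]:
    """Split ancestral segments between the two parents of a recombinant."""
    left: List[Interval] = []
    right: List[Interval] = []
    for start, end in segments:
        if end <= breakpoint:
            left.append((start, end))
        elif start >= breakpoint:
            right.append((start, end))
        else:
            # The breakpoint falls inside this segment: cut it in two.
            left.append((start, breakpoint))
            right.append((breakpoint, end))
    return left, right


def parents_to_trace(segments: List[Interval],
                     breakpoint: int) -> List[List[Interval]]:
    """Keep only the parents that carry material ancestral to the sample."""
    return [part for part in split_at(segments, breakpoint) if part]


# Both parents carry ancestral material, so both lineages are followed.
print(parents_to_trace([(0, 100)], 40))
# The left parent would carry nothing ancestral, so only one lineage remains.
print(parents_to_trace([(50, 100)], 40))
```

A lineage such as `[(0, 30), (70, 100)]` carries a non-ancestral gap between two ancestral segments, which corresponds to the “trapped” material mentioned in the quoted paragraph.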

Skip list: check out the MIT video lecture on skip lists.