IBD, IBS and related methods

29 Apr 2015

IBD vs. IBS

Simply put, people are more interested in "recent" IBD segments, which are more than a combination of IBS alleles: two IBD segments are identical because they were inherited from a relatively recent common ancestor (<= 25 generations).

The term IBS (identical by state) refers to the situation in which two DNA segments are identical at all possible allele positions, no matter how that matching occurred. The two segments are also IBD (identical by descent) if the matching alleles were directly inherited from a common ancestor within a genealogical time frame; they are not if the segment is simply ubiquitous within a specific population. To understand the latter situation, think about right- and left-handedness. Ten percent of the population is left-handed. If you are right-handed and meet another person who is right-handed, does that mean you are related? Of course not. You might further argue that IBS segments are essentially IBD, because modern people (Homo sapiens sapiens) inherited every allele from Lucy’s people (Australopithecus afarensis). True, studies have shown that “unrelated” subjects commonly share long haplotype segments, and most of us are IBD for 1-2% of our genome. It is just that when we talk about IBD, we usually restrict ourselves to a genealogical time frame (about 500 to possibly slightly more than 1000 years, or the last 20 generations).
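To make the definition concrete, here is a minimal Python sketch that finds IBS tracts between two already-phased haplotypes. The function name `ibs_runs`, the `min_len` parameter, and the representation of a haplotype as a string of alleles are assumptions made for illustration; real data would be unphased diploid genotypes with genotyping error, which this sketch ignores.

```python
from typing import List, Sequence, Tuple


def ibs_runs(hap_a: Sequence[str], hap_b: Sequence[str],
             min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) marker ranges where both haplotypes match.

    Hypothetical helper: each haplotype is a sequence of alleles, one per marker.
    Only maximal runs of at least `min_len` consecutive matching markers are kept.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same markers")

    runs: List[Tuple[int, int]] = []
    start = None  # index where the current matching run began, if any
    for i, (x, y) in enumerate(zip(hap_a, hap_b)):
        if x == y:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the last marker.
    if start is not None and len(hap_a) - start >= min_len:
        runs.append((start, len(hap_a)))
    return runs


print(ibs_runs("AACGT", "AATGT"))
```

Note that the sketch reports matching by state only. Nothing in the sequences themselves says whether a run is also IBD; that requires pedigree information or a statistical argument based on segment length.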

There are several ways a sequence can become very common in a population. The sequence could be very old and, because it provided a very slight advantage to those who had it, it remained intact for thousands of years. Because the advantage was small, those who didn’t have the exact same sequence could also thrive and reproduce, but at a slightly lower rate. Another way is when a population goes through an evolutionary bottleneck. That is, the population rapidly shrinks for whatever reason (disease, famine, or isolation, such as when a small group migrates to a new land) and then expands to fill its niche again. When this happens, the assortment of available alleles is sharply reduced. In these cases, many people can have similar alleles that are IBS, but their common ancestor would have to be traced all the way back to the bottleneck. When the bottleneck is far enough back that no (or few) genealogical records persist, those segments are considered IBS.

Here is a tricky example. If your father and your mother have the same DNA sequence in their chromosomes and you and your sister get one copy of that sequence from different parents, then that segment between you and your sister is IBS, not IBD. But there is no way to determine which one is the case. The more similar or inbred the population that your parents come from, the more likely this ambiguity exists.

Linkage vs. association analysis

To identify gene variations that cause human diseases, geneticists use two basic approaches that use logic and information based on IBD and IBS respectively. Linkage is the tendency for genes and other genetic markers to be inherited together because of their location near one another on the same chromosome, so linkage analysis focuses on well-characterized pedigrees to see if the same chromosome is inherited by the descendants of an individual that have the same trait. Association analysis looks to see whether specific gene variations are more common among people who are unrelated but share a certain trait. There is nonetheless a close relationship between the two approaches, because the “unrelated” individuals in a population are unrelated only in a relative and approximate sense (as discussed above). Basically, chromosomes sampled from “unrelated” individuals in a population will be much more distantly related than those sampled from members of traditional pedigrees. As a consequence, association analysis allows for finer localization/mapping of human disease loci (i.e., genes) than linkage analysis.

To explain the idea behind this, imagine that a disease-causing mutation has just occurred in a population. The chromosome on which this mutation occurred carries specific alleles at neighboring polymorphic loci. At first, the mutation will only be observed in conjunction with these alleles, so its association with them will be high. Over time these associations will dissipate because of recombination, but the closest loci will experience the fewest recombinations and hence retain the highest levels of nonrandom association. Thus, by looking for significant correlations between disease state (e.g., affected or not) and genetic markers (causal variants or loci close to them), we can hope to identify the region in which the disease-causing mutation lies. The resolution of linkage analysis is determined by the number of meioses observed in the pedigree, which is not very large (often the pedigree information we can obtain only traces back a few generations). A limited number of meioses also means that very few recombination events will have occurred between nearby loci. Thus, linkage analysis is likely to discover chromosomal regions that cover several Mb of DNA sequence and contain several hundred genes. In a sample of unrelated individuals, many more meioses separate the sampled chromosomes. The more meioses, the more opportunities for recombination to take place, and the shorter the region that remains associated. Thus, association studies can, in principle at least, map a trait of interest to a smaller and more precise segment (measured in kb).
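The decay of association described above is usually written with the standard textbook recursion for linkage disequilibrium, D_t = D_0 (1 - r)^t, where r is the recombination fraction between the two loci per generation. This formula is not stated in the discussion above; the sketch below assumes it, along with random mating and no selection, and the function name `ld_after` is hypothetical.

```python
def ld_after(d0: float, r: float, generations: int) -> float:
    """Expected LD after `generations`, assuming D_t = D_0 * (1 - r)**t.

    d0: initial disequilibrium between the mutation and a marker allele
    r:  recombination fraction between the two loci per generation (0 to 0.5)
    """
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return d0 * (1.0 - r) ** generations


# Fraction of the initial association left after 100 generations for a
# marker about 1 cM away (r ~ 0.01) versus about 0.01 cM away (r ~ 0.0001).
for r in (0.01, 0.0001):
    print(r, round(ld_after(1.0, r, 100), 3))
```

Under these assumptions the nearer marker retains almost all of its association after 100 generations while the farther one has lost most of it, which is why a sample separated by many meioses localizes a variant more finely than a short pedigree does.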

The following figure (borrowed from Kirk’s Yale talk slide deck) summarizes what we have discussed so far. Linkage analysis using multiple families has been a powerful tool for identifying rare and highly penetrant genetic variants. However, most linkage studies require a large number of families with affected individuals to map the disease-causing variant, and even so, the causative variant may only be mapped to a large genomic region. In recent years, genome-wide association studies (GWAS) using unrelated individuals and high-throughput single nucleotide polymorphism (SNP) chips have been successful in identifying common variants with relatively low penetrance. However, the variants identified by GWAS to date explain only 18–24% of the heritability. Many believe that much of the missing heritability is probably explained by common variants of smaller effect size, and that some may be explained by rare variants of larger effect size.
[Figure: IBD vs. IBS]

Length distribution of tracts of identity by state (IBS)

Architecture of long-range IBD segments

Understanding the identity of alleles across individuals by descent from a common ancestor is central to genetics. The transmission of haploid copies of the genome with almost no mutation from parent to multiple offspring and their descendants gives rise to this identity and facilitates linkage in pedigrees and association mapping in less-related individuals. Generally, two contemporary homologous copies of a locus will differ only at sites of mutations along the respective lineages leading to them from the copy of that genomic region at the locus-specific most recent common ancestor (MRCA). For the average pair of copies, these lineages are thousands of generations long, but relatives may have a very recent MRCA for many loci.

The quantification of identity by descent (IBD) has been extensively studied. Standard assumptions in population genetics postulate that the chances of lineages leading into the past to meet at each generation are inversely proportional to the effective population size, $N_e$ (Fisher 1930; Wright 1931), and under the classical Wright–Fisher model the time to the MRCA is geometrically distributed, averaging $2N_e$ (Tajima 1983). Copies of a locus that are transmitted by a parent to a pair of sibling carriers have a chance of ½ to be IBD, and $k$th cousins share an IBD locus across any of the four pairs of their respective copies with probability $(1/2)^{2k} = 2^{-2k}$, a negligibly small number for $k = 20$ and beyond. However, in the unlikely event that such remote relatives do share an autosomal locus IBD, flanking genomic loci are also likely to be shared, spanning a continuous IBD region to the nearest sites of crossover in any of the meioses from the relatives to their MRCA. Under the assumption of independent recombination events, the genetic length of this region would have an exponential distribution with mean $(100\ \mathrm{cM})/(k + 1)$, unless bounded by the end of the chromosome. Across the 22 autosomal chromosomes, which together contain 3,400 cM (Kong et al. 2002), there are on average $22 + (6800/100) \times (k + 1)$ regions with unchanged transmission patterns, each being an opportunity for IBD; $(22 + 68 \times [k + 1]) \times 2^{-2k}$ such regions are in fact expected to be IBD. Based on these considerations, relatives can be conveniently partitioned into three broad categories:
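As a rough numerical check on the formulas in the paragraph above, the following Python sketch evaluates them for a few degrees of cousinship. The constants 22 and 3,400 cM come from the text; the function names are hypothetical and chosen only for this illustration.

```python
AUTOSOMES = 22            # number of autosomal chromosomes (from the text)
AUTOSOME_MAP_CM = 3400.0  # total autosomal genetic length in cM (from the text)


def p_ibd_locus(k: int) -> float:
    """Probability that kth cousins share a given locus IBD: 2**(-2k)."""
    return 0.5 ** (2 * k)


def mean_segment_cm(k: int) -> float:
    """Mean length of an IBD region for kth cousins: 100 cM / (k + 1)."""
    return 100.0 / (k + 1)


def n_regions(k: int) -> float:
    """Average number of regions with unchanged transmission patterns."""
    return AUTOSOMES + (2 * AUTOSOME_MAP_CM / 100.0) * (k + 1)


def expected_ibd_regions(k: int) -> float:
    """Expected number of IBD regions: (22 + 68 * (k + 1)) * 2**(-2k)."""
    return n_regions(k) * p_ibd_locus(k)


for k in (1, 5, 10, 20):
    print(k, mean_segment_cm(k), expected_ibd_regions(k))
```

For first cousins (k = 1) this gives 158 regions and 39.5 expected IBD regions. By k = 10 the expected count is already well below one, even though any region that is shared would still average about 9 cM.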

The distance between adjacent polymorphisms is inversely correlated with local TMRCA (time to most recent common ancestor); an $L$-base-long locus in a pairwise alignment that coalesced $t$ generations ago is expected to contain $2L\mu t$ polymorphisms, $\mu$ being the mutation rate per base per generation.
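A short Python sketch of this relationship follows. The mutation rate of 1.25e-8 per base per generation is an assumed, commonly quoted order-of-magnitude value and is not taken from this post; the function names are hypothetical.

```python
ASSUMED_MU = 1.25e-8  # assumed mutation rate per base per generation


def expected_polymorphisms(length_bp: float, t_generations: float,
                           mu: float = ASSUMED_MU) -> float:
    """Expected pairwise differences in a locus: 2 * L * mu * t.

    The factor 2 counts both lineages leading back to the common ancestor.
    """
    return 2.0 * length_bp * mu * t_generations


def mean_spacing_bp(t_generations: float, mu: float = ASSUMED_MU) -> float:
    """Mean distance between adjacent differences: 1 / (2 * mu * t)."""
    return 1.0 / (2.0 * mu * t_generations)


# A 1 Mb locus compared at a recent and at an old coalescence time.
for t in (20, 1000, 20000):
    print(t, expected_polymorphisms(1e6, t), mean_spacing_bp(t))
```

Under the assumed rate, a 1 Mb locus that coalesced 1,000 generations ago is expected to carry about 25 differences, roughly one every 40 kb. The more recent the common ancestor, the longer the stretches with no difference at all, which is the signal that tract-length methods exploit.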

References