Long shared haplotypes and the discovery of rare variants

Detecting rare variants by IBD Mapping

This post is based on my recent talk on IBD mapping and its application. Although genome-wide association studies (GWAS) have revealed numerous common susceptibility variants for complex diseases, these variants explain only a small fraction of disease heritability. Association studies have therefore increasingly targeted rarer variants and low-frequency SNPs (e.g., frequency ≤ 5%). Rare variants may be a major source of the genetic variation that predisposes individuals to disease (both complex diseases and Mendelian disorders), but they are often difficult to detect. Their low penetrance or effect size, together with genetic heterogeneity, often means that they are unlikely to be detected by classical linkage analysis, and their low allele frequency significantly reduces the power of association studies. Since standard analyses of GWAS data (SNP association mapping) are not powerful enough to detect the effects of rare variants, other approaches have been developed for this purpose. One of them is to re-analyze GWAS data using identity-by-descent (IBD) mapping. IBD mapping uses detected IBD segments (see another post on this topic) to determine which regions of the genome are likely to harbor disease-susceptibility variants. In this sense it is similar to linkage mapping, but IBD mapping is performed on unrelated individuals at the population level (e.g., cases and controls) instead of using pedigree information.

IBD segments are long haplotypes that nominally unrelated individuals share because they inherited them from recent common ancestors 10–100 generations ago. This level of relatedness cannot be captured by traditional pedigree-based linkage analysis, because the pedigrees are unknown or too large for standard linkage calculations. It can, however, be exploited through pedigree-free IBD segment detection, which gives the IBD approach several advantages over the standard GWAS approach to identifying disease-susceptibility variants. First, the patterns of IBD sharing make it possible to disentangle multiple causal variants within an associated region. In other words, IBD mapping tends to have higher power than association testing when multiple rare variants within a gene contribute to disease susceptibility. Second, the multiple-testing adjustment for population-based IBD mapping (such as those proposed by Browning and Thompson, 2012, or Gusev et al., 2011) is less severe than for genome-wide single-marker association testing (such as the Bonferroni correction) but more severe than for family-based linkage analysis. The actual adjustment depends on the resolution of IBD detection. Moreover, the IBD approach delineates more precisely the region of the genome that contains disease-causing loci and should therefore be sequenced further. It also indicates exactly which individuals are carriers, which helps in deciding whom to sequence. IBD can also be used to determine haplotype phase and to detect sequencing errors that might otherwise lead to false-positive results.

Since all variants are expected to have a single origin, and the specific chromosome on which a variant arose is uniquely identifiable by the pattern of marker variants near the site of origin, it follows that rare SNPs should be covered by rare haplotypes of the common SNPs. In the absence of selective pressure, rare SNPs are almost always of recent origin, so the marker haplotype carrying the new variant may still be unaffected by recombination since the variant arose and thus still be recognizable. HapMap Phase 3 found that lower-frequency variants do indeed, on average, display a greater extent of haplotype sharing than more common variants.

If two individuals have a common ancestor more than ten generations ago, they tend to share only very short tracts of genetic material. Sharing a long haplotype by chance is therefore very rare, so a long shared haplotype is very likely to be IBD. If case pairs with long shared haplotypes inherited from distant common ancestors can be detected, then rare variants influencing disease risk can be localized. Thus, IBD mapping helps reduce the massive multiple-testing problem by prioritizing regions.

IBD analysis has limitations. It is only suited to discovering rare variants if all the variants in a gene act in the same direction. For example, the rare variants identified in the BRCA1 and BRCA2 genes all increase the risk of breast cancer, and the four rare variants identified in the IFIH1 gene all protect against type 1 diabetes. If some rare variants increase risk while others in the same gene decrease it, the signal in the region will be attenuated. In addition, IBD analysis is sensitive to genotyping error, which reduces signal strength. Because the signal depends on many markers, or on long haplotypes containing up to hundreds of SNPs, a single error in reading a single marker can significantly reduce it.

The idea of using IBD haplotype sharing to detect signals of disease-causing variants in population samples is not new, but the greatly increased density of SNP markers now makes it possible to detect much smaller segments of IBD. Now we can determine pairwise IBD sharing in a large sample over the whole genome to a resolution of approximately 2 cM (length of IBD segments).
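To give a feel for what a resolution of about 2 cM means in terms of ancestry, here is a rough sketch in Python. It assumes a deliberately simplified model that is not taken from any of the methods discussed here: two haplotypes descending from a common ancestor g generations ago are separated by 2g meioses, crossovers occur without interference, and shared segment lengths are therefore roughly exponential with mean 100/(2g) cM. The function names are made up for illustration.

```python
import math


def expected_ibd_length_cm(generations: int) -> float:
    """Mean IBD segment length in cM under the simplified model.

    Assumption: 2 * generations meioses separate the two haplotypes,
    and segment lengths are exponential with rate 2g per Morgan.
    """
    meioses = 2 * generations
    return 100.0 / meioses


def prob_segment_longer_than(threshold_cm: float, generations: int) -> float:
    """Probability that a shared segment exceeds threshold_cm (same model)."""
    rate_per_cm = 2 * generations / 100.0
    return math.exp(-rate_per_cm * threshold_cm)


# Compare a few ancestor depths against a 2 cM detection threshold.
for g in (10, 25, 50, 100):
    mean_len = expected_ibd_length_cm(g)
    p_detectable = prob_segment_longer_than(2.0, g)
    print(f"g={g:3d}  mean length={mean_len:.2f} cM  P(length > 2 cM)={p_detectable:.3f}")
```

Under these assumptions, segments from ancestors around 25 generations back average about 2 cM, so a 2 cM detection threshold captures a reasonable share of sharing from the recent end of the 10–100 generation range and much less from the distant end.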

Mathematically, IBD means that the probability of observing an allele in one individual is not independent of observing that allele in another individual. It is this property that is exploited in linkage analysis. Pairwise relatedness is measured as the probability that alleles from the two individuals are IBD. Two types of statistics have been proposed for IBD mapping. The first type uses IBD detected between pairs of individuals (Purcell et al. 2007): the rate of IBD in case/case pairs is compared to the rate of IBD in either control/control pairs or non-case/case pairs (control/control and control/case pairs). The second type clusters haplotypes into IBD classes at a locus: all haplotypes within a class are IBD with each other at the locus, and the clusters are tested for association with case-control status. An individual is usually a member of two IBD classes at a locus because he or she has a pair of chromosomes, although the two classes will be the same if the individual is homozygous by descent.
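The first, pairwise type of statistic can be illustrated with a toy calculation. The sketch below is only an illustration of the comparison described above, not the implementation used by any published tool; the function name, the input layout (a dictionary of case labels and a set of IBD pairs at one locus) and the sample IDs are all made up. A real analysis would also need a significance test, which is omitted here.

```python
from itertools import combinations


def pairwise_ibd_rates(is_case, ibd_pairs):
    """Compare IBD rates at one locus between pair types.

    is_case:   dict mapping sample ID -> True (case) / False (control)
    ibd_pairs: set of frozenset({a, b}) for pairs detected as IBD at the locus
    Returns the fraction of IBD pairs among case/case pairs and among
    all other pairs (control/control and control/case).
    """
    counts = {"case/case": [0, 0], "other": [0, 0]}  # [IBD pairs, all pairs]
    for a, b in combinations(sorted(is_case), 2):
        key = "case/case" if is_case[a] and is_case[b] else "other"
        counts[key][1] += 1
        if frozenset((a, b)) in ibd_pairs:
            counts[key][0] += 1
    return {
        key: (ibd / total if total else float("nan"))
        for key, (ibd, total) in counts.items()
    }


# Toy data: three cases (c1..c3) and three controls (k1..k3).
is_case = {"c1": True, "c2": True, "c3": True,
           "k1": False, "k2": False, "k3": False}
ibd_pairs = {frozenset(("c1", "c2")), frozenset(("c2", "c3")),
             frozenset(("k1", "c1"))}

rates = pairwise_ibd_rates(is_case, ibd_pairs)
print(rates)
```

In this toy example two of the three case/case pairs are IBD at the locus, against one of the twelve remaining pairs, which is the kind of excess sharing among cases that the pairwise statistic looks for.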

Most current molecular genotyping techniques cannot tell which of the two parental chromosomes (maternal or paternal) an allele was inherited from, although this distinction is often important for understanding disease causality. Instead, they mix DNA from the two chromosomes and report only unordered diploid genotypes (a mixture of the two haplotypes). For example, if you are AA GG AA CC you will match me when I am AG AG AG CT, but there is no guarantee that my A G A C alleles came from the same chromosome (e.g., it could be A–G on one chromosome and A–C on the other). Such results lead to haplotype ambiguity when two or more markers are heterozygous and their phase is unknown. Some molecular techniques can measure haplotypes directly, but they are expensive (in money, labor, and time), especially for genome-wide studies. Thus, researchers currently rely heavily on imputation, which requires pre-phasing (pre-haplotyping).
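The ambiguity in the example above can be made concrete by enumerating every pair of haplotypes that is consistent with a set of unphased genotypes. This is a small illustrative sketch; the function name and the genotype encoding (one two-letter string per marker) are assumptions made for the example.

```python
from itertools import product


def possible_haplotype_pairs(genotypes):
    """Enumerate unordered haplotype pairs consistent with unphased genotypes.

    genotypes: list of two-character strings, one per marker,
               e.g. ["AG", "AG", "AG", "CT"].
    """
    pairs = set()
    # At each marker one haplotype takes one allele and the other
    # haplotype takes the remaining allele.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


you = ["AA", "GG", "AA", "CC"]  # homozygous everywhere: phase is unambiguous
me = ["AG", "AG", "AG", "CT"]   # four heterozygous markers

print(possible_haplotype_pairs(you))
print(len(possible_haplotype_pairs(me)))
for hap1, hap2 in possible_haplotype_pairs(me):
    print(hap1, hap2)
```

With four heterozygous markers there are 2^(4-1) = 8 possible phasings, and only one of them puts A, G, A and C together on a single chromosome, so a genotype-level match does not by itself establish a shared haplotype.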

Several methods to detect IBD segments have been published, including PLINK, GERMLINE, BEAGLE IBD and BEAGLE fastIBD. The models employed by PLINK and GERMLINE assume that SNPs are in linkage equilibrium, so “pruning” of SNPs is required to avoid false positives caused by underestimating population haplotype frequencies. However, pruning SNPs that are in incomplete linkage disequilibrium (LD) discards potentially useful information and reduces power. BEAGLE IBD and fastIBD implement a variable-length Hidden Markov Model to account for LD and to model haplotype frequencies more accurately. BEAGLE fastIBD runs considerably faster than BEAGLE IBD (on the order of 1,000 times faster with large GWAS datasets). This is mainly because (1) it does not formally model IBD status (‘IBD’/‘not IBD’) between pairs of individuals using a Hidden Markov Model, as BEAGLE IBD does, and (2) it stores haplotype frequencies in a data dictionary (as in GERMLINE), which means that computation time scales with sample size n like n log n instead of n^2.
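The dictionary idea can be sketched in a few lines. The code below is not the GERMLINE or fastIBD algorithm; it is only a toy illustration of how grouping phased haplotypes by their content within a window avoids comparing every haplotype against every other one explicitly. The function name, the fixed non-overlapping windows and the haplotype strings are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def matching_pairs_by_window(haplotypes, window):
    """Find haplotype pairs that are identical within each window.

    haplotypes: dict mapping haplotype name -> allele string (equal lengths)
    window:     number of markers per non-overlapping window
    Returns a dict: window start -> set of frozenset({name_a, name_b}).
    """
    length = len(next(iter(haplotypes.values())))
    matches = {}
    for start in range(0, length - window + 1, window):
        # Dictionary keyed by the allele string in this window: haplotypes
        # that land in the same bucket are identical across the window.
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        pairs = set()
        for names in buckets.values():
            pairs.update(frozenset(p) for p in combinations(names, 2))
        matches[start] = pairs
    return matches


haps = {
    "h1": "ACGTACGT",
    "h2": "ACGTTTTT",
    "h3": "GGGGACGT",
    "h4": "ACGTACGT",
}
matches = matching_pairs_by_window(haps, window=4)

# Pairs matching in every window are candidates for a longer shared segment.
shared_everywhere = set.intersection(*matches.values())
print(shared_everywhere)
```

Only haplotypes that fall into the same bucket are ever paired, so the work is driven by the number of actual matches rather than by all n(n-1)/2 possible pairs.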

Detecting rare variants by imputation

One crucial question is how large the reference panel must be to achieve good imputation quality. The required reference panel size increases as the frequency of the SNPs (or haplotypes, in the discussion above) being imputed decreases. An analysis by Liu et al. considered imputation of both common and rare SNPs that were genotyped in about 2,000 subjects. They found that with this many samples (and withholding genotyped data so that the “dosage R2,” i.e., the R2 between the imputed dosage variable and the true genotype, could be examined directly), a reference panel of ~2,000 samples could impute SNPs with minor allele frequency in the range from 1% to 3% with an R2 of 0.8.

Note that an allele with frequency 1% would be expected to be seen 40 times in a reference panel of 2,000 individuals (4,000 chromosomes). This is roughly the same number of minor alleles that would be expected in a reference panel of 40–100 samples (i.e., approximately HapMap-sized) for SNPs in the frequency range 0.2 to 0.5, and indeed SNPs in this frequency range are generally highly imputable with HapMap data [25]. Naively scaling this to rarer SNPs implies that reference panels of at least 20,000 samples are required to impute SNPs with a minor allele frequency of 0.1% with the same level of imputation accuracy.
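The arithmetic behind this naive scaling is simply that the expected number of copies of the minor allele in a panel of N diploid samples is 2 × N × MAF. A short Python sketch, with made-up function names, reproduces the numbers quoted above:

```python
def expected_minor_allele_count(n_samples: int, maf: float) -> float:
    """Expected copies of the minor allele among 2 * n_samples chromosomes."""
    return 2 * n_samples * maf


def panel_size_for_count(target_count: float, maf: float) -> float:
    """Diploid panel size needed to expect target_count minor-allele copies."""
    return target_count / (2 * maf)


# MAF 1% in a panel of 2,000 samples.
print(round(expected_minor_allele_count(2000, 0.01), 6))

# HapMap-sized panels for common SNPs give a similar count.
print(round(expected_minor_allele_count(40, 0.5), 6))
print(round(expected_minor_allele_count(100, 0.2), 6))

# Panel size needed to expect the same 40 copies at MAF 0.1%.
print(round(panel_size_for_count(40, 0.001), 6))
```

All three panels give an expected count of about 40 minor alleles, and keeping that count fixed at a frequency of 0.1% requires about 20,000 samples. As the text goes on to say, this scaling ignores haplotype uncertainty, so it should be read as a lower bound on what is needed rather than a prediction of imputation accuracy.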

In this naive scaling, however, the issue of haplotype uncertainty as embodied in the R2_h calculation has been neglected. Without recombination there is a limit on the total number of haplotypes (and hence also a limit on the number of rare haplotypes), so if many rare haplotypes are to be defined by common SNPs, then recombination must be present. If a rare SNP falls on one or more rare haplotypes of common SNPs, it may not be well imputed by the common SNPs since imputation accuracy declines with the amount of recombination.

Detecting shared haplotypes that are identical by descent (IBD) could facilitate discovery of these mutations. While common variants are mostly shared across ethnic groups, rare variants are more likely to be recent in history and population-specific. Some are recent founder mutations, shared by a number of individuals whose relationship may not be socially known. Recent founder mutations playing a role in a disease should aggregate more in cases than in controls and the haplotypes in which they reside should have been affected by only a limited number of recombination events, unlike haplotypes in the general population. Making use of this distinction may aid the detection of recent haplotypes among cases and facilitate detection of founder mutations.
