genetic maps and mapping

21 Apr 2015

Overview

Like a physical map, a genetic map tells people where they are and helps them get where they want to go. It contains a set of landmarks that helps scientists navigate around the genome, such as short DNA sequences, regulatory sites that turn genes on and off, and genes themselves. A genome map today is nothing close to a highly precise road map though. It is more like a map of North America made when Europeans were just beginning to explore the continent. Some parts of the genome have been mapped in great detail, while others remain relatively uncharted territory. It may turn out that a few landmarks on current genome maps appear in the wrong place or at the wrong distance from other landmarks. But over time, as scientists continue to explore the genome frontier, maps will become more accurate and more detailed.

Most everyday maps have length and width, latitude and longitude, like the world around us. But a genome map only has one dimension. It is linear, like the DNA molecules that make up the genome itself. A genome map looks like a straight line with landmarks noted at irregular intervals along it, much like the towns along the map of a highway. The landmarks are usually inscrutable combinations of letters and numbers that stand for genes or other features—for example, D14S72, GATA-P7042, and so on. Thus, a genome map is close to a genome sequence in portraying a genome, but they are different. Consider the following sequence:
[Figure: sequence]
A map of that sequence might look like this:
[Figure: map]
A genome map simply identifies a series of landmarks in the genome; for example, GCC and CCCC on the map above. In contrast, the corresponding sequence spells out the order of every DNA base in the genome. In this sense, the sequence is the most detailed possible map. It also follows that we can determine the location of a gene (i.e., “map” the gene) without sequencing it, by using certain mapping techniques (discussed below). Because mapping involves less information to collect and organize than sequencing does, creating a reasonably comprehensive genome map is generally quicker and cheaper than sequencing the entire genome, particularly for humans and other species with large genomes. That said, the ideal genome map is one based on physical distance along the DNA sequence, measured in base pairs (bp) or kilobases (kb). This type of genome map is referred to as the physical map.

Studying the human genome is therefore a two-pronged effort, aiming at both a comprehensive genome map and a complete genome sequence. At first glance this strategy may seem redundant, since a sequence is simply the most detailed map possible. Why not just sequence the genome? Why keep mapping the human genome if it’s already been sequenced? One reason is that a map can actually help you sequence the genome. If you’re sequencing a genome with the clone-by-clone method, you need a map to determine where each clone belongs in the genome. The more detailed and accurate your map, the easier it is to snap those pieces of genomic jigsaw puzzle into place. With whole-genome shotgun sequencing, a map is no longer central to the strategy, but one can still be used to help match the big pieces of assembled sequence to their proper place in the genome. Another reason is that you need a map to understand the genome sequence. As a map may tell you nothing about the sequence of the genome, a sequence may tell you nothing about the map. A sequence is just a long, long string of DNA bases or “letters.” For the most part, scientists can’t look at a sequence and see immediately which parts are genes or other interesting features, and which parts are “junk.” But the landmarks on a genome map provide clues about where the important parts of the genome sequence can be found. Linear arrangement implies not merely order of loci but the additivity of their distances.

How do geneticists indicate the location of a gene?

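One type of map uses the cytogenetic location to describe a gene’s position. The cytogenetic location is based on a distinctive pattern of bands created when chromosomes are stained with certain chemicals. A gene’s cytogenetic location (e.g., 17q12) is a combination of numbers and letters that provides a gene’s “address” on a chromosome. It can also be written as a range (e.g., 17q12-q21), if less is known about the exact location. This address is made up of several parts: the chromosome on which the gene is found (17 in this example), the arm of the chromosome (p for the short arm, q for the long arm), and the position of the gene on that arm, given by numbers that increase with distance from the centromere (12, read as region 1, band 2).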

Sometimes, the abbreviations “cen” or “ter” are also used to describe a gene’s cytogenetic location. “Cen” indicates that the gene is very close to the centromere. For example, 16pcen refers to the short arm of chromosome 16 near the centromere. “Ter” stands for terminus, which indicates that the gene is very close to the end of the p or q arm. For example, 14qter refers to the tip of the long arm of chromosome 14. (“Tel” is also sometimes used to describe a gene’s location. “Tel” stands for telomeres, which are at the ends of each chromosome. The abbreviations “tel” and “ter” refer to the same location.)
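As a rough illustration of how such an address breaks down, the sketch below splits a location string like 17q12 into chromosome, arm, and position. The function name `parse_cytogenetic_location` and the regular expression are assumptions made for this example, not part of any standard tool, and only the simple single-location forms mentioned above are handled (ranges such as 17q12-q21 are not).

```python
import re

# Hypothetical pattern: chromosome (1-22, X, Y), arm (p or q), then a band
# number or one of the abbreviations cen / ter / tel described in the text.
_LOCATION = re.compile(
    r"^(?P<chrom>[0-9]{1,2}|X|Y)(?P<arm>[pq])(?P<pos>cen|ter|tel|[0-9]+(?:\.[0-9]+)?)$"
)

def parse_cytogenetic_location(location):
    """Split a simple cytogenetic address into chromosome, arm, and position."""
    match = _LOCATION.match(location.strip())
    if match is None:
        raise ValueError(f"unrecognized cytogenetic location: {location!r}")
    arm = "short (p)" if match.group("arm") == "p" else "long (q)"
    return {
        "chromosome": match.group("chrom"),
        "arm": arm,
        "position": match.group("pos"),
    }

print(parse_cytogenetic_location("17q12"))
print(parse_cytogenetic_location("14qter"))
```

Keeping the position as a string is deliberate: band labels such as 12 are read digit by digit (region 1, band 2) rather than as the number twelve.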

Another type of map uses the molecular location, a precise description of a gene’s physical position on a chromosome. Maps of this type are therefore often referred to as physical (sequence) maps. The molecular location is based on the sequence of DNA building blocks (base pairs) that make up the chromosome. The Human Genome Project determined the sequence of base pairs for each human chromosome. This sequence information allows researchers to provide a more specific address than the cytogenetic location for many genes. A gene’s molecular address pinpoints the location of that gene in terms of base pairs. It describes the gene’s precise position on a chromosome and indicates the size of the gene. Knowing the molecular location also allows researchers to determine exactly how far a gene is from other genes on the same chromosome. Physical proximity is closely related to genetic linkage, the tendency of alleles that are located close together on a chromosome to be inherited together during meiosis. Genes whose loci are nearer to each other are less likely to be separated onto different chromatids during chromosomal crossover, and are therefore said to be genetically linked. In other words, the nearer two genes are on a chromosome, the lower the chance of a swap occurring between them, and the more likely they are to be inherited together. Simply put, two linked loci do not recombine freely. Linkage information can help us search for disease-causing variants in the human genome that are difficult to find directly. With linkage information, we can locate them through their linked markers.

The LD Structure of the Human Genome and the relationship to genetic mapping

Toward the end of the 1990s it was recognized that the linkage disequilibrium (LD) structure of the human genome had to be characterized before association-mapping studies could be efficiently devised. The understanding of the LD structure has increased remarkably since the somewhat alarming findings of Kruglyak (10), which implied, through coalescent simulation neglecting population bottlenecks, that LD was unlikely to be extensive enough to exploit association in a cost-effective way to predict the location of disease-influencing variants. Subsequently many papers have shown that in reality, even in outbred heterogeneous populations, LD is rather extensive (on average ~50 kb).

The linkage map (take a look at the interesting and inspiring story behind the construction of the first linkage map) uses a unit of recombination frequency to indicate distance along a chromosome. The unit is called the centimorgan (abbreviated cM). One centimorgan is equal to a 1% chance that a marker at one genetic locus on a chromosome will be separated from a marker at a second locus due to crossing over in a single generation. The number of base pairs to which 1 centimorgan corresponds varies widely across the genome, because different regions of a chromosome have different propensities towards crossover. In the human genome, 1 centimorgan on average represents a distance of about 7.5 × 10^5 base pairs. Linkage mapping requires family material so that the coinheritance of markers and disease can be tracked. However, relatively few meiotic breaks occur over the small number of generations available in most pedigrees, and family records are usually available only for a few generations. This limits the extent to which a disease gene can be fine mapped. At best, linkage mapping can be expected to map a gene to a region no smaller than about one megabase.

Unlike the linkage map, the linkage disequilibrium (LD) map determines the distance between pairs of loci based on allelic association calculated from population data. This association is also called linkage disequilibrium (LD), and the distance is measured in linkage disequilibrium units (LDU). One LDU corresponds to the length of a chromatid in which on average one crossover event has taken place in t generations, and so the resolution is t times as great as for the linkage map. This greatly increases the power to localize a gene within a candidate region. LD mapping, or association mapping, depends on the pattern of LD in single haplotypes, interpreted by an LDU map. Both linkage and LD mapping take as their prime objective the identification of a gene by determining a candidate region in the genome in the absence of information about the structure or function of the gene product. Currently single nucleotide polymorphisms (SNPs) are the markers most commonly used, because of their much greater number and ease of typing. We can “tag” the majority of SNPs and haplotypes using a reduced set of SNPs and thus recover the majority of the information encoded in the human genome. We can also pool multiple SNPs together to form duplications, deletions, and insertions. An international effort called HapMap is being made to create an SNP database. Next, let’s look at the construction of these two types of maps in detail.

Linkage map

A linkage map is a genetic map of a species or experimental population that shows the position of its known genes or genetic markers relative to each other in terms of recombination frequency, rather than a specific physical distance along each chromosome (so a genetic map is not a physical map). The greater the frequency of recombination (segregation) between two genetic markers, the further apart they are assumed to be. Conversely, the lower the frequency of recombination between the markers, the smaller the physical distance between them is assumed to be. Linkage mapping is critical for identifying the location of genes that cause genetic diseases. Genetic maps help researchers locate other markers, such as other genes, by testing for genetic linkage with already known markers. The place on a genetic map where a gene is located is the gene locus (plural loci).

By working out the number of recombinants it is possible to obtain a measure of the distance between the genes. This distance is expressed in terms of a genetic map unit (m.u.), or centimorgan (after Thomas Hunt Morgan, the first Drosophila geneticist), and is defined as the distance between genes for which one product of meiosis in 100 is recombinant. A recombinant frequency (RF) of 1% is equivalent to 1 m.u. But this equivalence is only a good approximation for small percentages; the percentage of recombinants cannot exceed 50%, a value that is approached when the two genes are at the extreme opposite ends of the same chromosome (explained later). In other words, a recombination rate of 50% arises from independent assortment for genes that reside on different chromosomes, while genes that are far apart on the same chromosome can produce similar results because crossovers between them are so frequent. A recombinant frequency significantly less than 50% shows that the genes are linked. A recombinant frequency of 50% generally means that the genes are unlinked, either on separate chromosomes or far apart on the same one.
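The arithmetic behind a two-point estimate is small enough to sketch. The function name and the offspring counts below are made up purely for illustration; the only thing taken from the text is that RF is the share of recombinant products and that 1% RF is about 1 m.u. for small values.

```python
def recombination_frequency(parental_count, recombinant_count):
    """Fraction of testcross offspring that carry a recombinant chromosome."""
    total = parental_count + recombinant_count
    if total == 0:
        raise ValueError("no offspring counted")
    return recombinant_count / total

# Hypothetical counts, chosen only to illustrate the arithmetic.
rf = recombination_frequency(parental_count=900, recombinant_count=100)
print(f"RF = {rf:.1%}, roughly {100 * rf:.0f} m.u.")
```

For larger RF values the direct reading in map units becomes unreliable, which is the motivation for the map functions discussed near the end of this post.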

Fun fact: Crossing over does not occur in Drosophila (i.e., fruit fly) males, so even genes on opposite ends of a big chromosome are completely linked in the production of male gametes.

If two, four, or any other even number of crossovers occur between the two loci under analysis, the resulting gametes still retain the parental combination of coupled alleles at those loci, as shown in the figure below. As a consequence, the observed recombination frequency will be less than the actual frequency of crossing over. Only an odd number of crossover events produces a visibly recombinant gamete, and for loci that are far apart odd and even numbers of crossovers become about equally likely, so the observed recombination frequency levels off at 50% (a 50-50 chance between an even and an odd number of crossover events).
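One way to see where the 50% ceiling comes from is to assume, purely for illustration, that the number of crossovers between two loci on a gamete follows a Poisson distribution with mean d (in Morgans); this is the same assumption that underlies Haldane's map function discussed later. Only odd counts yield recombinants, so the sketch sums the odd terms. The function name and the distances tried are choices made for this example.

```python
import math

def observed_recombination_fraction(d_morgans, max_crossovers=60):
    """P(odd number of crossovers) when the count is Poisson with mean d_morgans."""
    prob_odd = 0.0
    for k in range(1, max_crossovers, 2):  # k = 1, 3, 5, ...
        prob_odd += math.exp(-d_morgans) * d_morgans**k / math.factorial(k)
    return prob_odd

for d in (0.01, 0.1, 0.5, 1.0, 3.0):
    rf = observed_recombination_fraction(d)
    print(f"d = {d:4.2f} Morgans -> observed RF = {rf:.4f}")
```

Under this assumption the sum has the closed form (1 - e^(-2d)) / 2, which stays below one half however large d becomes.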

A more accurate map of linked genes can be derived from analyzing the data from three (or more) linked genes (three-point crosses). By adding a third gene, we are able to distinguish double crossovers from no crossover. However, the analysis becomes more complicated: we now have 8 possible classes of meiotic products (2 parental and 6 recombinant), as shown below.
We first make a cross between individuals that are AABBCC and aabbcc. Next the F1 is testcrossed to an individual that is aabbcc. Using the data, the basic problem is to determine the order of the genes (there are 3 possible orders, depending on which gene is in the middle), make a map showing the distances between all pairs of genes, and determine the value of interference.

Interference refers to the tendency of a crossover in one interval to reduce the chance of a second crossover in an adjacent interval. It is calculated by subtracting from one the ratio of the observed double-crossover frequency to the expected double-crossover frequency (this ratio is called the coefficient of coincidence).

We can solve this problem following several rules:

  1. Organize the data (test cross data showing the number of offspring of each of the 8 possible types) into reciprocal pairs. For example, a + c and + b + are a reciprocal pair, and so are a b c and + + +. The number of offspring for each member of a pair will be similar.
  2. Determine which pair is the parental class: it is the LARGEST class.
  3. Determine which pair is the double crossover (2CO) class: the SMALLEST class.
  4. Determine which gene is in the middle. If you compare the parentals with the 2COs, the gene which switched partners is in the middle. If you think two genes have switched sides, it is the other gene that is in the middle.
  5. Determine map distances for all three pairs of genes. Count the number of offspring that have had a crossover between the genes of interest, then divide by the total offspring and multiply by 100.
  6. Figure the expected double crossovers: (map distance for interval I / 100 ) * (map distance for interval II / 100 ) * (total offspring) = exp 2CO.
  7. Figure the interference as: I = 1 - (obs 2CO / exp 2CO). The observed 2CO class comes from the data.
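The seven rules above can be sketched in a few lines of code. Everything named here is an assumption for illustration: the function names, the convention that each offspring class is a three-character string in the gene order a, b, c with "+" for the wild-type allele, and the counts themselves, which are invented so that the middle gene turns out to be c.

```python
from itertools import combinations

def reciprocal(genotype, genes):
    """Return the reciprocal class, e.g. 'a+c' -> '+b+' for genes 'abc'."""
    return "".join(gene if allele == "+" else "+" for allele, gene in zip(genotype, genes))

def analyze_three_point_cross(counts, genes="abc"):
    """Apply rules 1-7 to testcross counts keyed by strings such as 'a+c'."""
    total = sum(counts.values())

    # Rule 1: merge the eight classes into four reciprocal pairs.
    pairs = {}
    for genotype, n in counts.items():
        key = min(genotype, reciprocal(genotype, genes))
        pairs[key] = pairs.get(key, 0) + n

    # Rules 2 and 3: the largest pair is parental, the smallest is the 2CO class.
    parental = max(pairs, key=pairs.get)
    double_co = min(pairs, key=pairs.get)

    # Rule 4: the gene that switched partners is in the middle. If two genes
    # appear to have switched, the remaining one is the middle gene.
    differs = [p != d for p, d in zip(parental, double_co)]
    if differs.count(True) == 1:
        middle = genes[differs.index(True)]
    else:
        middle = genes[differs.index(False)]

    # Rule 5: map distance = offspring recombinant between two genes / total * 100.
    def recombinant_between(genotype, i, j):
        return (genotype[i] == parental[i]) != (genotype[j] == parental[j])

    distances = {}
    for i, j in combinations(range(len(genes)), 2):
        n_rec = sum(n for g, n in counts.items() if recombinant_between(g, i, j))
        distances[(genes[i], genes[j])] = 100.0 * n_rec / total

    # Rule 6: expected double crossovers from the two intervals flanking the middle gene.
    interval_i, interval_ii = [dist for pair, dist in distances.items() if middle in pair]
    expected_2co = (interval_i / 100) * (interval_ii / 100) * total

    # Rule 7: interference from observed versus expected double crossovers.
    observed_2co = pairs[double_co]
    interference = 1 - observed_2co / expected_2co
    return {"middle": middle, "distances": distances,
            "expected_2co": expected_2co, "observed_2co": observed_2co,
            "interference": interference}

# Hypothetical testcross data (not from any real experiment).
example_counts = {"+++": 580, "abc": 592, "a++": 45, "+bc": 40,
                  "a+c": 89, "+b+": 94, "ab+": 3, "++c": 5}
result = analyze_three_point_cross(example_counts)
print(result["middle"], result["distances"], round(result["interference"], 3))
```

Note that the distance between the two outer genes comes out smaller than the sum of the two inner intervals, because double crossovers are not counted as recombinants between the outer pair.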

Linkage disequilibrium unit (LDU) map

LD is the association between alleles at two linked loci that reflects, in part, their proximity and the correspondingly low probability of recombination breaking the haplotype on which they are found. There is considerable interest in describing LD patterns in the human genome to extend the resolution of the linkage map and to aid SNP association mapping. The strength of LD in a population depends on the number of founding individuals (and therefore the number of founding haplotypes) and the time since founding (and therefore the number of generations over which recombination has driven the decay of LD), along with a number of other factors such as mutation, drift, and selection. Many metrics have been used to measure LD between pairs of SNPs, for example covariance (D), association (rho), correlation (r), regression (b), frequency difference (f), the Yule metric (y), and population-attributable risk (delta). The association metric rho, which measures the degree of association between markers from observed or estimated haplotype frequencies, underpins the construction of LD maps. This metric equates to the absolute value of the D prime metric, and it has been shown to have the greatest efficiency for modeling the exponential decline of association with distance in a large sample of haplotypes (Morton et al., 2001).

A linkage disequilibrium map determines the distance between pairs of loci not from recombination but from allelic association, also called linkage disequilibrium (LD), which is measured in linkage disequilibrium units (LDU). One LDU corresponds to the length of a chromatid in which on the average one crossover event has taken place in t generations, and so the resolution is t times as great as for the linkage map. This greatly increases the power to localize a gene within a candidate region at the expense of typing a larger number of markers and using methods that at this early stage are less transparent to many geneticists.

Collins et al. (2001) showed that LD hot and cold spots could be delineated by estimating epsilon in the Malécot equation (1948) from pairwise measures of association. In this manner, areas of low and high LD are defined by large and small values of epsilon that coincide with recombination hot and cold spots and form the basis of an LD map. The Malécot equation is given by rho = (1 - L) M e^(-epsilon d) + L, where rho is the association metric mentioned above, indicating the probability of association. L is the residual association at large distance; M describes the amount of association at zero distance and is a measure of phylogeny, where a value of 1 is consistent with monophyletic origin and values less than 1 indicate otherwise; and epsilon(i) gives the exponential decline of association with physical distance di in kilobases between the ith pair of SNPs. Thus, the Malécot model describes the decay of LD with distance, governed by levels of recombination. Using the Malécot model, Maniatis et al. () constructed the first LD map with additive distances expressed in linkage disequilibrium units (LDUs). The LDU distance between the ith pair of neighboring SNPs is given by epsilon(i)di, where the product epsilon*d is equivalent to theta*t, theta being the frequency of recombination and t the effective number of generations over which recombination has accumulated after one or more population bottlenecks (Zhang et al., 2004). One LDU corresponds to one swept radius, defined as the average extent of LD useful for association mapping (the distance in kilobases at which disequilibrium has declined to 1/e ~ 0.37 of its starting value). Although epsilon*d is primarily a function of recombination and time, it is also influenced by gene conversion, genetic drift, selection, mutation, and the partly cumulative effects of multiple historical bottlenecks that inflate LD because of founder effects. Even so, the contours of population-specific LDU maps are highly concordant.
An LDU map curve (on a plot whose x- and y-axes represent a marker’s physical and LDU map positions respectively) generally comprises a series of plateaus and steps that correspond to LD blocks and regions of breakdown of marker association respectively. A flat region on the curve indicates strong LD in the chromosomal segment, while steeper regions reflect high recombination intensity. However, the magnitude of LDU steps differs between populations according to differences in duration, that is, in the number of generations over which recombination has accumulated. Other differences between population-specific LDU maps may reflect regions of the genome that have been influenced by other processes such as mutation, selection, or drift that vary between populations.
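To make the decay model concrete, here is a minimal sketch of the Malécot form rho = (1 - L) M e^(-epsilon d) + L together with the swept radius 1/epsilon. The function names, the default M and L, and the value of epsilon are assumptions chosen for illustration, not estimates from any dataset.

```python
import math

def malecot_rho(d_kb, epsilon, M=1.0, L=0.0):
    """Expected association at distance d_kb: (1 - L) * M * exp(-epsilon * d_kb) + L."""
    return (1.0 - L) * M * math.exp(-epsilon * d_kb) + L

def swept_radius_kb(epsilon):
    """Distance at which exp(-epsilon * d) has fallen to 1/e, i.e. 1 / epsilon."""
    return 1.0 / epsilon

epsilon = 0.02  # hypothetical rate of decline per kb
radius = swept_radius_kb(epsilon)
print(f"swept radius = {radius:.0f} kb")
for d in (0, 10, radius, 200):
    print(f"d = {d:6.1f} kb -> rho = {malecot_rho(d, epsilon, M=0.9, L=0.1):.3f}")
```

With L = 0 and M = 1 the association at one swept radius is exactly 1/e; a nonzero L lifts the whole curve towards the residual association at large distance.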

The construction of LDU maps requires accurate physical (sequence) maps and high-density SNP genotype data to give correct distances di and allow reliable estimation of epsilon(i). Prior to LDU map construction, the genotypic data are often screened to remove SNPs that deviate strongly from Hardy–Weinberg equilibrium and SNPs with minor allele frequencies less than 5%. Once a quality-controlled dataset with sufficient numbers of individuals and markers has been obtained, LDU maps can be constructed with the LDMAP program. LDMAP uses genotypic SNP data from unrelated individuals in the form of diplotypes or haplotypes. Although both types of data can be used to construct highly concordant LDU maps, previous studies have shown that real haplotypes, as opposed to inferred ones, are on average 50% more informative than diplotypes. However, this gain must be balanced against the extra cost and error involved in determining haplotypes (i.e., phasing). The first step is to create an intermediate file, from either haplotype or diplotype data, containing pairwise association probabilities rho, information K(rho), and the kilobase size of intervals di. The association probability is rho = D / [Q(1 – R)], where D is the absolute value of the difference between a haplotype frequency and its equilibrium value, the product of allele frequencies. For marker-by-marker association in unrelated individuals, rho equates to the absolute value of D prime. Under the null hypothesis that D = 0, the information K(rho) is given by NQ(1 – R) / [R(1 – Q)] for N haplotypes or diplotypes. Under the alternative hypothesis the information from haplotypes is a closed form in D (Collins, 2001), but the information from diplotypes requires inversion of the 3 × 3 information matrix for Q, R, and D.
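Since the text notes that rho equates to the absolute value of D prime for marker-by-marker association, a small sketch of that calculation from four haplotype counts may help. It uses the standard textbook definition of D prime, not LDMAP's own code; the function name and the counts are invented for this example.

```python
def abs_d_prime(n_AB, n_Ab, n_aB, n_ab):
    """|D'| for two biallelic loci from counts of the four haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    # D: haplotype frequency minus its equilibrium value (product of allele frequencies).
    D = p_AB - p_A * p_B
    # D is scaled by the largest magnitude it could take given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    if d_max == 0:
        return 0.0  # at least one locus is monomorphic in this sample
    return abs(D) / d_max

# Hypothetical counts: one haplotype is absent, so association is complete.
print(abs_d_prime(50, 0, 10, 40))
# Hypothetical counts at equilibrium: no association.
print(abs_d_prime(25, 25, 25, 25))
```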
Once an intermediate file has been produced from either haplotype or diplotype data, the Malécot model is used to estimate epsilon(i) for each marker interval; these estimates are then used to construct LDU maps. Large maps with high density, such as those constructed for whole chromosomes using HapMap data, can be constructed in segments of 500–1000 markers with little loss of information, whereas smaller maps can be made in one piece. Values of epsilon are estimated by a multiple pairwise algorithm. For example, Collins et al. (2004) consider a map with five SNPs and four intervals (see the figure below): a value of epsilon for the first interval is calculated using all of the pairwise measures of association that include that interval (i.e., the pairwise association between 1–2, 1–3, 1–4, and 1–5). These measures are combined by weighting them according to their information K(rho), so that pair 1–2 is given the biggest weight and 1–5 the smallest weight. The Malécot model is fitted to these data to determine values of epsilon by (composite) maximum likelihood. After we estimate epsilon(i), the LDU distance corresponding to the first interval is given by epsilon(i)di, where epsilon(i) = 0.05 and di = 10 kb in this example, so it is 0.05 × 10 = 0.5 LDU. This process is repeated for each interval to produce an LDU map. Unlike epsilon, M and L in the Malécot equation are estimated as single values for the entire sample. Since L is the residual association at a large distance, erroneous estimates may occur for small regions and/or high-density data due to the predominance of block structures with high association. It is therefore advisable to use predicted L (Lp), given by the weighted mean deviation for a normal distribution (7), when dealing with these types of data. Convergence of parameter estimates is achieved by maximizing the composite likelihood.
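The last step, turning per-interval estimates into map locations, is a running sum of epsilon(i) times di. The sketch below assumes the epsilon values have already been estimated; only the first interval (epsilon = 0.05, d = 10 kb) comes from the worked example in the text, and the remaining values and the function name are invented for illustration.

```python
from itertools import accumulate

def ldu_locations(epsilons, interval_kb):
    """LDU position of each SNP, starting at 0, from per-interval epsilon and kb width."""
    if len(epsilons) != len(interval_kb):
        raise ValueError("need one epsilon per interval")
    interval_ldu = [e * d for e, d in zip(epsilons, interval_kb)]
    return [0.0] + list(accumulate(interval_ldu))

# Five SNPs, four intervals. Only the first pair of values is from the text.
epsilons = [0.05, 0.0, 0.2, 0.01]
interval_kb = [10, 25, 5, 40]
print(ldu_locations(epsilons, interval_kb))
```

An interval with epsilon close to zero adds almost nothing to the map (a plateau, or LD block), while a short interval with a large epsilon produces a step.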

Previous studies have shown that maps built from subsets of rare markers have shorter LDU lengths. Given the strong dependence of LD measures on allele frequencies and thus mutation age, this is expected because LD is higher between recent (rare) markers than between older (more common) markers. This implies that the length of LDU maps constructed from studies with nonrandomly ascertained SNPs, such as the HapMap Project, which focuses on common markers, will be somewhat different from that of corresponding maps derived from samples with different allele-frequency distributions. LDU maps are largely insensitive to marker density: their profiles determined at SNP densities from 1 per 2 kb to 1 per 23 kb are highly concordant, because the model fitting and simultaneous use of multiple intervals provide a degree of smoothing that seems to remove much of the marker density effect that plagues other fine-scale descriptions of LD. However, LDU maps may contain a small proportion of intervals with indeterminate values of epsilon(i), also known as holes, which are assigned a maximum value of 3 LDU. The factors that determine holes are complex: they are not only dominated by recombination but also include SNP distribution, kilobase width of holes, the criteria to declare a hole, and errors in estimation of epsilon(i). The number of holes seems to fall as marker density increases, so denser data yield more precise LDU maps.

LD maps represent LD patterns in the form of a metric map with additive “linkage disequilibrium unit” (LDU) distances.

Here t is the effective number of generations over which recombination has occurred: the effective bottleneck time.

The decline of LD, modeled as association rho as a function of distance d, in kilobases, is rho = (1 - L) M e^(-epsilon d) + L, in which the L parameter reflects residual association at large distance not from linkage, M is the intercept, the association at zero distance, and epsilon is the exponential decline of LD as the product of recombination theta and number of generations t. The model has the same form as that developed by Malécot (13) to describe genetic isolation by distance but has different parameters. LD map construction estimates epsilon in each map interval between adjacent SNPs. For any pair of SNPs the association probability rho and the information form the data for LD map construction. Pairs that span a given interval contain information about association in that interval, but pairs at large distances are uninformative. The estimation of the epsilon vector requires the iterative substitution of distance d in the Malécot equation with distances in LDUs. These are defined, for the ith interval between adjacent SNPs, as epsilon(i)di, with locations by summation over preceding intervals (9). The LDU locations, when plotted against kb, typically show a pattern of steps where LD is breaking down and plateaus or blocks of high LD.

Chromosome bands provide mutually exclusive and jointly exhaustive projections of genetic or cytogenetic candidate regions on the physical map as the first step in identification of genes that affect a phenotype of interest but are of unknown sequence and function. Their position must be refined on the genetic map before functional and physical studies become feasible.

Location databases that integrate the four aforementioned types of map (i.e., cytogenetic, physical, linkage, and LD maps) are extremely useful tools for gene mapping and for investigating the biological relationships between sequence and patterns of recombination and LD. Different maps have different properties, resolution, and applications. Although cytogenetic locations are coarse, this information is useful for regional assignment of chromosomal rearrangements. Linkage maps have proven invaluable for low-resolution mapping of genes predisposing to particular diseases, with notable early successes such as localization of the Huntington’s disease gene. The resolution and accuracy of linkage maps have much improved in recent years and their application to disease mapping continues. Integration of these maps is essential because, for example, cytogenetic analyses may orientate linkage studies, and association mapping is often directed by low-resolution linkage analyses, which identify large candidate regions (1–10 Mb) within which fine mapping is required to determine a causal locus. Once candidate regions have been identified, physical and genetic maps are required to determine regional gene content and gene function so that candidate genes can be prioritized for further investigation.

Genetic map functions

Motivation: The simulator I have been using (cosi1.2) applies an old ENCODE map in the form of a list of recombination fractions between adjacent loci. Nowadays bioinformatics tools often take advantage of existing genetic maps provided by, for example, the 1000 Genomes Project, so often that some even require the corresponding genetic map as an input (e.g., shapeit2). DASH, a primary competitor of CHAT, is one such tool. Existing genetic maps are generally given as centimorgan positions at single markers rather than as recombination fractions between two markers. To apply DASH to simulated data, I have to create a genetic map for it. Interestingly, there are reasons why map distance (cM) is preferred to recombination fraction. Recombination fractions can be used directly as distances only for loci that lie close together on a chromosome. As the distance increases, multiple crossovers become more likely, and this simple approach is no longer sufficient. Double crossovers, or any even number of crossovers, produce the same progeny as the parental line, so they go unnoticed and are not counted among the recombinants. This underestimates the true amount of crossing over and hence distorts the genetic map. It also causes problems when mapping three or more points in a genome, since recombination fractions are not additive. A map function was thus introduced as a correction in the construction of genetic maps. It is a mathematical relation between the probability of recombination and map distance. However, the existing map functions do have some limitations and need to be further modified or analyzed based on the observable data. Map functions relate the distance between loci and the recombination fraction by the equation r = M(d), where M is the map function, r is the recombination fraction, and d is the distance between a pair of loci on a chromosome.
Haldane’s mapping function (1919) is the simplest of the lot: it assumes that the number of crossovers follows a Poisson distribution and that there is no interference (the effect of one crossover on the occurrence of another crossover nearby).

dM = -1/2 ln (1-2r)

where dM is the distance between the marker loci, expressed in Morgans, and r is the recombination frequency.
The disadvantage is that real recombination data do not conform to the Poisson distribution that underlies Haldane’s mapping function, because crossovers interfere with one another. Map distances and recombination frequencies are also found to follow no predictable relations. Kosambi’s function (1944) takes double crossovers and interference into account. The level of interference it assumes is similar to that found in humans. The function is given by

d = ¼ ln[(1+2r)/(1-2r)]

where d is the distance between markers and r is the recombination fraction. d is the Kosambi estimate in Morgans, which can be converted into centiMorgans by multiplying by 100 for the construction of linkage maps. However, Kosambi’s function cannot be extended to more than three loci when calculating joint recombination probabilities. For small recombination fractions Haldane’s and Kosambi’s mapping functions are nearly equivalent (both reduce to d ≈ r), and both distances grow without bound as r approaches ½. Based on the (extremely) limited information above, I decide to use Kosambi’s function.
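For the conversion described in the motivation, a minimal sketch might look like the following: both map functions as written above, returning centimorgans, plus a helper that accumulates adjacent-interval distances into per-marker positions. The function names and the example recombination fractions are assumptions for illustration; this is not the input format of cosi, DASH, or any other tool.

```python
import math

def haldane_cm(r):
    """Haldane: d = -1/2 ln(1 - 2r) Morgans, returned in centimorgans."""
    return -50.0 * math.log(1.0 - 2.0 * r)

def kosambi_cm(r):
    """Kosambi: d = 1/4 ln[(1 + 2r) / (1 - 2r)] Morgans, returned in centimorgans."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

def cumulative_map_cm(adjacent_r, map_function=kosambi_cm):
    """Positions (cM) of each marker, given recombination fractions between neighbors."""
    positions = [0.0]
    for r in adjacent_r:
        if not 0.0 <= r < 0.5:
            raise ValueError("recombination fraction must be in [0, 0.5)")
        positions.append(positions[-1] + map_function(r))
    return positions

# Hypothetical recombination fractions between four adjacent markers.
adjacent_r = [0.001, 0.02, 0.005]
print(cumulative_map_cm(adjacent_r))
print(cumulative_map_cm(adjacent_r, map_function=haldane_cm))
```

Converting each interval first and then summing is what makes the resulting positions additive, which raw recombination fractions are not.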

Combined genetic maps

According to this post, both recombination rates and hotspots were estimated separately for each HapMap analysis panel (YRI, CEU, CHB+JPT). Recombination rates were averaged across populations, and p-values for hotspots were combined such that a hotspot requires that two of the three populations show some evidence of a hotspot (p<0.05) and at least one population shows stronger evidence for a hotspot (p<0.01). Hotspot centres were estimated at those locations where distinct recombination rate estimate peaks (with at least a factor of two separation between peaks) occurred, within the low p-value intervals. The width of the hotspot represents the region where the estimated rate is within a factor of two of the maximum.
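The combination rule can be written as a short predicate. This sketch reads the rule as "at least two panels with p < 0.05, of which at least one has p < 0.01"; that reading, the function names, and the example values are assumptions, since the description above does not spell out the edge cases.

```python
def is_combined_hotspot(p_values, some_evidence=0.05, strong_evidence=0.01):
    """Apply the two-of-three rule to per-panel hotspot p-values."""
    n_some = sum(p < some_evidence for p in p_values)
    n_strong = sum(p < strong_evidence for p in p_values)
    return n_some >= 2 and n_strong >= 1

def averaged_rate(panel_rates):
    """Recombination rate averaged across panels."""
    return sum(panel_rates) / len(panel_rates)

# Hypothetical p-values for the three panels (YRI, CEU, CHB+JPT).
print(is_combined_hotspot([0.04, 0.005, 0.30]))  # two below 0.05, one below 0.01
print(is_combined_hotspot([0.04, 0.03, 0.30]))   # no panel below 0.01
print(averaged_rate([1.2, 0.8, 1.0]))            # hypothetical rates in cM/Mb
```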

References