Polymorphism, genealogy, simulation, and hypothesis testing

01 Apr 2015

Modern geneticists make sense out of DNA sequences using POLYMORPHISM DATA. These data include the genotypes of many individuals sampled at one or more loci; a locus is considered polymorphic if two or more distinct types of sequences are observed, regardless of their frequencies. The goal of much modern genetic research is to find gene variations that contribute to disease. Finding these genes should allow an understanding of the disease process, so that methods for preventing and treating the disease can be developed.

Why do we care about variations in single sites (SNPs) and multiple sites (haplotypes) on DNA sequences?

A SNP (single nucleotide polymorphism) is a site in the DNA where different chromosomes differ in the base they have. For example, 30 percent of the chromosomes may have an A, and 70 percent may have a G. These two forms, A and G, are called variants or alleles of that SNP. An individual may have a genotype for that SNP that is AA, AG, or GG. When chromosomes from two random people are compared, they differ at about one in 1000 DNA sites. Thus when two random haploid genomes are compared, or all the paired chromosomes of one person are compared, there are about three million differences. When more people are considered, they will differ at additional sites. The number of DNA sites that are variable (SNPs) in humans is unknown, but there are probably between 10 and 30 million SNPs, about one every 100 to 300 bases. Of these SNPs, perhaps four million are common SNPs, with both alleles of each SNP having a frequency above 20 percent.
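The back-of-the-envelope numbers above follow directly from the per-site figures; a quick sketch (the genome size and rates are the approximate values quoted above):

```python
# Rough arithmetic behind the figures quoted above (approximate values).
GENOME_SIZE = 3_000_000_000        # ~3 billion base pairs in a haploid human genome

# Two random haploid genomes differ at about 1 in 1000 sites:
pairwise_diffs = GENOME_SIZE // 1000           # ~3 million differences

# One SNP roughly every 100-300 bases gives the 10-30 million range:
snps_low, snps_high = GENOME_SIZE // 300, GENOME_SIZE // 100
```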

The majority of SNPs are not actual functional variants that contribute to the risk of getting a disease, but they can be useful as markers for finding functional variants. To find the regions with genes that contribute to a disease, the frequencies of many SNP alleles are compared in individuals with and without the disease (case and control). When a particular region has SNP alleles that are more frequent in individuals with the disease than in individuals without the disease, those SNPs and their alleles are said to be associated with the disease. These associations between a SNP and a disease indicate that there may be genes in that region that contribute to the disease.
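As a sketch of how such a case-control comparison works statistically, here is a minimal single-SNP test using a Pearson chi-square on allele counts; the counts below are made up purely for illustration:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical allele counts: the G allele is more frequent among cases.
#                  G    A
case_g, case_a = 130,  70    # 200 chromosomes from affected individuals
ctrl_g, ctrl_a =  90, 110    # 200 chromosomes from controls

stat = chi_square_2x2(case_g, case_a, ctrl_g, ctrl_a)
# Compare with the 1-df chi-square critical value 3.84 (alpha = 0.05)
associated = stat > 3.84
```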

A haplotype is a set of DNA variations, or polymorphisms, that tend to be inherited together. A haplotype can refer to a combination of alleles or to a set of SNPs found on the same chromosome. Some recent studies found that haplotypes occur in a block pattern (ranging in size from about 3 kb to more than 150 kb): the chromosome region of a block has just a few common haplotypes, followed by another block region also with just a few common haplotypes, with the longer-distance haplotypes showing a mixing of the haplotypes in the two blocks. Another description of this pattern is that the SNPs in a block are strongly associated with each other, but much less associated with other SNPs. The majority of SNPs are organized in these blocks.

Consider the example below, of a region where six SNPs have been studied; the DNA bases that are the same in all individuals are not shown. The three common haplotypes are shown, along with their frequencies in the population. The first SNP has alleles A and G; the second SNP has alleles C and T. The four possible haplotypes for these two SNPs are AC, AT, GC, and GT. However, only AC and GT are common; these SNPs are said to be highly associated with each other (i.e., in linkage disequilibrium: AC and GT are more likely to be found on the same chromosome than AT and GC).
[Figure: the three common haplotypes across the six SNPs, with their population frequencies]
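The linkage-disequilibrium claim can be made concrete with the standard D and r² statistics; the haplotype frequencies below are hypothetical, chosen only so that AC and GT are common and AT and GC are rare:

```python
# Hypothetical haplotype frequencies for the first two SNPs: only AC and
# GT are common, AT and GC are rare (as in the example above).
freq = {"AC": 0.55, "GT": 0.40, "AT": 0.03, "GC": 0.02}

p_A = freq["AC"] + freq["AT"]      # frequency of allele A at the first SNP
p_C = freq["AC"] + freq["GC"]      # frequency of allele C at the second SNP

# D: deviation of the AC haplotype frequency from the product expected
# if the two SNPs were independent.
D = freq["AC"] - p_A * p_C

# r^2: squared correlation between the two SNPs; values near 1 mean that
# knowing the allele at one SNP almost determines the allele at the other.
r2 = D ** 2 / (p_A * (1 - p_A) * p_C * (1 - p_C))
```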

It is important to look beyond SNPs into common haplotypes (see the International HapMap Project). First of all, many common diseases, such as heart disease, stroke, diabetes, cancers or psychiatric disorders, are affected by many genes (and environmental factors) rather than single genes (the missing heritability problem). Secondly, it is still too costly to type millions of SNPs across the entire genome to see which SNPs are associated with disease (GWAS, genome-wide association studies). In contrast, as shown in the above example, typing two SNPs is all that is needed to distinguish among the three common haplotypes. The larger the block, the fewer SNPs (within the region) need to be typed to identify the haplotype. Besides the typing cost, haplotype-based analysis can be much more statistically powerful than marker-by-marker analysis for association studies (Zhang et al., 2002).
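The tag-SNP idea (a few well-chosen SNPs suffice to identify the common haplotypes in a block) can be sketched as a small search problem; the six-SNP haplotype strings below are hypothetical:

```python
from itertools import combinations

# Hypothetical common haplotypes over six SNPs in one block.
haplotypes = ["ACGTAC", "GTGTAC", "ACATCG"]

def min_tag_snps(haps):
    """Smallest set of SNP positions whose alleles distinguish every haplotype."""
    n_snps = len(haps[0])
    for size in range(1, n_snps + 1):
        for positions in combinations(range(n_snps), size):
            # project each haplotype onto the candidate tag positions
            projected = {tuple(h[p] for p in positions) for h in haps}
            if len(projected) == len(haps):   # all haplotypes distinguishable
                return positions
    return tuple(range(n_snps))

tags = min_tag_snps(haplotypes)
# Here two SNPs are enough to tell the three common haplotypes apart.
```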

Why do we care about genealogical information?

First of all, we need to make sure the disease is genetic rather than due to environmental factors. Even among genetic causes of variation in haplotype and allele frequencies, there are many possibilities. Haplotype and allele frequencies may be affected by cellular-level processes such as mutation, recombination, and gene conversion, as well as by population-level processes such as natural selection against alleles that contribute to disease. When genes are close together and associated, selection that changes the frequency of an allele at one gene results in similar changes in the frequencies of alleles at other genes on the same haplotype. There are also demographic and social factors: population history factors such as population size, bottlenecks or expansions, founder effects, isolation of a population or admixture between populations, and patterns of mate choice.

Population and evolutionary genetics study the evolutionary forces or ancient events that create and maintain genetic variation. At its heart is the gene genealogy, a tree structure that describes the evolutionary history (e.g., divergence) of a particular set of DNA sequences and their relatedness. The shape of the genealogy depends on genetic events such as inheritance, mutation, and recombination, as well as on population events such as isolation, migration, and population growth. Geneticists use it to infer or predict DNA sequence variation. Because reproduction is inherently random, the central part of genealogical data analysis is a stochastic characterization of the genealogies that relate the sequences.

Geneticists use two basic approaches: linkage analysis, which can also be described as identity by descent, and association analysis, which can be described as identity by state. Linkage analysis looks to see whether the same chromosomal region is inherited by the descendants of an individual who have the same trait. Association analysis looks to see whether specific gene variations are more common among people with a trait. For example, suppose an individual can have an A or G nucleotide at a specific DNA position, and affected individuals are more likely to have a G than an A; the G nucleotide is then said to be associated with being affected.

Why Simulation?

Since history is played only once, we generally have distorted perceptions of social behavior, overemphasizing what actually happened, to the detriment of what could have happened. -- Duncan Watts

The two main reasons for wanting to simulate genetic data are, first, to gain insight into the effects that underlying demographic and mutational parameters may have on the genetic data one sees, and, second, to create test datasets for assessing the power of alternative genetic analysis methods. For example, we may want to generate a data set characterizing a specific population by defining population size, evolutionary time, mutation rate, etc. There is a recent review of this area from a geneticist's point of view. The article also points out several other review articles, which, according to the author, focus more on technical aspects.

Modern genetics studies use data collected from natural populations. This approach has two problems. Firstly, there is no replication of the ‘experiment’: only one run of evolution is available to be studied. Secondly, the starting conditions of the ‘experiment’ are unknown. However, statistical hypothesis testing expects and utilizes the inherent uncertainty/randomness of observations. Imagine that we sequence a 10-kb region in the DNA chromosomes of 30 randomly chosen individuals and, surprisingly, find no polymorphisms. We might interpret this observation as evidence for selective constraint in this region. Alternatively, it might be that the individuals chosen for the comparison are unusually closely related. The interpretation depends on the genealogy of the sampled sequences, which is a tree structure describing the unique, often unknown history of mutation, recombination and COALESCENCE of lineages in the ancestry of the sample.

One solution is to model and simulate the past using a suitable stochastic model. To decide if the above data are unusual, we might make assumptions about the process that gave rise to those data, model the evolutionary process, and then run the simulated process many times to generate random repetitions. If the fraction of the random genealogical and mutational histories that could have given rise to the observed data is small, we can conclude that the assumptions cannot explain the observed pattern. Thus, the key is to find suitable models that allow us to construct random genealogies as a result of certain evolutionary forces.
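A minimal Monte Carlo sketch of this logic, assuming the standard neutral coalescent and a made-up (but roughly plausible) mutation parameter theta for a 10-kb human region:

```python
import math
import random

def total_tree_length(n, rng):
    """Total branch length of a standard coalescent tree (units of 2N generations)."""
    total = 0.0
    for k in range(n, 1, -1):
        # while k lineages remain, the waiting time to the next
        # coalescence is exponential with rate k*(k-1)/2
        t_k = rng.expovariate(k * (k - 1) / 2)
        total += k * t_k
    return total

def fraction_with_no_polymorphism(n, theta, reps, seed=1):
    """Monte Carlo estimate of P(no segregating sites) under neutrality."""
    rng = random.Random(seed)
    zero = 0
    for _ in range(reps):
        length = total_tree_length(n, rng)
        # Given the tree, the number of segregating sites is
        # Poisson(theta * length / 2), so P(S = 0 | tree) = exp(-theta * length / 2).
        if rng.random() < math.exp(-theta * length / 2):
            zero += 1
    return zero / reps

# theta = 4*N*mu for the whole region; 8.0 is a made-up but plausible
# value for a 10-kb human region (~0.0008 per site).
p_value = fraction_with_no_polymorphism(n=30, theta=8.0, reps=20_000)
# A tiny p_value means "no polymorphism in 30 sequences" is very
# surprising under neutrality, pointing to e.g. selective constraint.
```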

Why the coalescent (backward) approach?

All simulation algorithms face three primary challenges: (1) speed — typically one wants to do lots of simulations, so they need to be fast; (2) scalability — with the advent of genome-wide genotyping and large-scale sequencing, there is a need for simulation programs to match; and (3) flexibility — can the program cope with different demographic histories, population structure, recombination, selection, mutation models and disease models? Three approaches (termed ‘backwards’, ‘forwards’ and ‘sideways’) have been developed to deal with these challenges.

The ‘backwards’ (or coalescent) approach (Kingman 1982; Hudson 1983, 1990; Donnelly and Tavaré 1995) is an efficient way to sample sequences from a theoretical population that follows the Wright-Fisher neutral model (Ewens 1979). It starts with the sample of individuals that will form the simulated data set, then works backwards in time to construct the ancestral tree or graph of genealogical relationships that connects them all. Neutral mutations can subsequently be placed on this structure to create the simulated data set. ‘Forwards’ simulations start with the entire population of individuals — typically, many thousands — and then follow how all the genetic data in question are passed on from one generation to the next. The coalescent approach is more efficient than the ‘forwards’ approach because it restricts attention to the genealogical structure relevant to the sample in question, ignoring lineages that go extinct along the way or whose descendants are not in the sample (see the figure below).
[Figure: the coalescent simulates only the genealogy of the sample, not the whole population]
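A toy sketch of the backwards construction, keeping only the tree topology (waiting times are omitted for brevity):

```python
import random

def coalescent_topology(n, rng):
    """Merge random pairs of lineages backwards in time until one ancestor remains.

    Returns the genealogy as a list of (child1, child2, parent) merge events.
    """
    lineages = list(range(n))          # the n sampled sequences
    next_node = n                      # internal (ancestral) nodes get fresh labels
    merges = []
    while len(lineages) > 1:
        a, b = rng.sample(lineages, 2) # pick a random pair to coalesce
        lineages = [x for x in lineages if x not in (a, b)]
        lineages.append(next_node)     # replace the pair with their ancestor
        merges.append((a, b, next_node))
        next_node += 1
    return merges

tree = coalescent_topology(5, random.Random(0))
# n samples always coalesce after exactly n - 1 merge events; the last
# parent label is the root, and mutations could then be dropped on branches.
```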

Also, the coalescent approach usually employs a continuous-time approximation to skip over the intermediate generations between the important tree-generating events, which makes it even more efficient. Moreover, the coalescent approach avoids the difficulty and ambiguity of choosing suitable initial conditions. When using the ‘forwards’ approach, one usually needs to simulate over many thousands of generations in order to reach an equilibrium in which the genetic characteristics of the population are independent of the original starting conditions.
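The continuous-time shortcut can be illustrated by comparing a naive generation-by-generation wait with a single exponential draw; N and k below are arbitrary illustrative values:

```python
import random

# With k lineages in a haploid Wright-Fisher population of size N, a
# coalescence occurs each generation with probability ~ k*(k-1)/(2*N).
N, k = 10_000, 5
rng = random.Random(42)

def wait_generation_by_generation():
    """Test every generation until a coalescence happens (slow)."""
    p = k * (k - 1) / (2 * N)
    g = 1
    while rng.random() >= p:       # loop over the many uneventful generations
        g += 1
    return g

def wait_exponential():
    """Draw the waiting time in one shot (the coalescent's shortcut)."""
    rate = k * (k - 1) / (2 * N)
    return rng.expovariate(rate)

reps = 2_000
mean_slow = sum(wait_generation_by_generation() for _ in range(reps)) / reps
mean_fast = sum(wait_exponential() for _ in range(reps)) / reps
# Both means are close to 2*N / (k*(k-1)) = 1000 generations here, but the
# exponential draw does no per-generation work at all.
```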

What specific hypotheses are we testing via the coalescent approach? One natural null hypothesis is that the observed DNA sequence data come from neutral evolution in a homogeneously mixing population of constant size. As you can read in my next post, this null hypothesis can be violated in many ways. Alternative hypotheses incorporate one or more departures from these assumptions, such as natural selection, changes in population size, or population structure and migration, alone or in combination.

The above alternatives lead to changes in the genealogical tree or graph (if recombination is considered).

As for future research interests, the simulation of copy number variation and/or microsatellite data at larger genomic scales, and of more complex disease models allowing covariates and linked loci, remains an area that deserves further exploration. (Note: the performance of any method trying to tackle the challenges of complex-disease mapping should be evaluated by large-scale simulation studies.)

References