Simulate disease case-control GWAS data using sequences generated by the coalescent model

07 May 2015

Current GWAS are generally well powered to detect common variants with modest effect sizes, but underpowered to detect rare variants, even those with larger effect sizes. Methods such as imputation can add significant power to detect rarer variants, but are limited by the size of the reference panel used. The 1000 Genomes Project will add significant power to detect rare variants, but will still not reliably cover every variant with minor allele frequency (MAF) ≤ 0.01 in the genome. An exciting recent development is the increasing feasibility of performing full sequence analysis, which allows the testing of every variant in the genome. Costs have fallen by several orders of magnitude in the last few years, meaning that a large-scale case-control study using the full sequences of participants will soon be feasible. New technology generally introduces new challenges, and sequence analysis is no exception. The sheer number of variables in such a study creates problems for storage and analysis, and the prior odds of any specific variant being associated are so low that a highly significant p-value threshold is needed before a seemingly significant result can be trusted as a true positive.

Generate the sample of sequences (or haplotypes) using coalescent simulators

Coalescent simulations are commonly used in population genetics to simulate genotypes under different parameters and demographic models.

A sample of 100,000 DNA sequences, each comprising 20M base pairs, was simulated using the COSI package, with a per-site mutation rate of 2.5 × 10⁻⁸ and an effective population size (Ne) of 100 in the final generation of the sequence simulation. The reduction of Ne in the preceding generations was modeled with an Ne of 1,256 at 1,000 years ago, 4,350 at 10,000 years ago, and 43,500 at 100,000 years ago, with linear changes in between. This reflects estimates by Villa-Angulo et al. (2009) for the Holstein population.
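
As a rough sketch of a comparable simulation, the following uses msprime as a stand-in for COSI, with piecewise-constant population-size changes at the stated time points rather than the linear changes described. The recombination rate is an assumption (it is not given in the text), and times are entered directly as generations here, whereas the text specifies years; a real analysis would convert using the generation interval.

```python
import msprime

# Piecewise-constant approximation of the described demographic history
demographic_events = [
    msprime.PopulationParametersChange(time=1_000, initial_size=1_256),
    msprime.PopulationParametersChange(time=10_000, initial_size=4_350),
    msprime.PopulationParametersChange(time=100_000, initial_size=43_500),
]

ts = msprime.simulate(
    sample_size=100_000,        # number of haplotypes (reduce for a quick test run)
    Ne=100,                     # effective size in the present-day generation
    length=20_000_000,          # 20 Mb of sequence
    mutation_rate=2.5e-8,       # per site per generation
    recombination_rate=1e-8,    # assumed; not specified in the text
    demographic_events=demographic_events,
    random_seed=5,
)
print("segregating sites:", ts.num_sites)
```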

The total number of segregating sites across the resulting genome was approximately 1,670,000.

One option open to researchers who would like to increase power in the context of a limited case series is simply to increase the control collection. This strategy might include using cases for one disease as extra controls for another (assuming suitably different disease aetiologies and similar population history). We investigated the utility of such an approach by performing simulations with 1,000 cases and an increasing number of controls: the ability to reject the null hypothesis of no association increases considerably with the size of the control panel.
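
As an illustration of this effect, the sketch below estimates power by simulation for an allelic chi-square test with 1,000 cases and a growing control panel. The control allele frequency, odds ratio, significance threshold, and number of replicates are illustrative assumptions, not values taken from the text.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n_cases, p_ctrl, odds_ratio, alpha, n_sims = 1000, 0.20, 1.5, 5e-8, 500
# case allele frequency implied by the allelic odds ratio
p_case = odds_ratio * p_ctrl / (1 + (odds_ratio - 1) * p_ctrl)

for n_controls in (1000, 2000, 3000, 5000, 10000):
    hits = 0
    for _ in range(n_sims):
        a = rng.binomial(2 * n_cases, p_case)       # minor alleles in cases
        b = rng.binomial(2 * n_controls, p_ctrl)    # minor alleles in controls
        table = [[a, 2 * n_cases - a], [b, 2 * n_controls - b]]
        chi2, p, dof, expected = chi2_contingency(table)
        if p < alpha:
            hits += 1
    print(f"{n_controls} controls: power ≈ {hits / n_sims:.2f}")
```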

There is an open question as to whether rarer causal alleles might have larger effect sizes than common causal alleles. If this were thought plausible, then in assessing overall power for a particular chip, one could focus in Figure 3 on particular ranges of effect sizes for common causative alleles and a different range of effect sizes for rarer causative alleles.

It is becoming clear that many loci harbouring common alleles affecting common diseases will have effect sizes in the range 1.1–1.2, and our simulations demonstrate that there is almost no power to detect these in studies of the size currently underway. As has already been shown empirically [19],[20], these loci can be found by meta-analysis of GWA findings and follow-up in larger samples. Slightly larger relative risks do become detectable in large samples: for example, the power to detect an effect of size 1.3 jumps from almost zero with 1,000 cases and 1,000 controls to over 50% in a study three times the size.

Effect sizes (relative risk per allele): an illustrative setting in this range is a sample size of 2,000 cases and 2,000 controls with a relative risk of 1.3 at the causal SNP.

Add phenotypes to simulated genotype data

In recent years, genome-wide association studies (GWAS) have become widely used to uncover the genetic basis of complex traits by comparing patterns of genetic and phenotypic variation [1-3]. The power of such studies depends on various factors, including the genetic architecture of the trait, the demographic history of the population, and variation in mutation and recombination rates [4]. In addition, the trait under investigation may be adaptive or (in the case of a disease trait) may evolve under purifying selection, both of which would result in a non-neutral pattern of genetic diversity in the genomic neighborhood of the causal mutation.

Next-generation sequencing (NGS) has become a popular technique for identifying novel rare variants associated with complex diseases [6]. Statistical association tests that can account for rare variants have also been developed rapidly [7-10]. These tests aim to identify multiple rare causal variants in a group of variants selected by biological function, such as exons, genes, and pathways. A common approach is to pool all the variants in the group to increase the statistical power for association. To evaluate the statistical power of new tests, a simulation tool that can simulate multiple rare causal variants based on sequence data is necessary. However, simulation programs developed for GWAS may not be appropriate for evaluating statistical properties in NGS studies, because they were designed to simulate common variants based on GWAS panels (e.g., Illumina and Affymetrix) or HapMap project data [11]. Thus, computer software that can simulate sequence data under realistic models, together with phenotypes, becomes important.

Quantitative/qualitative traits

Additive effects between multiple quantitative trait nucleotides (QTNs): Besenbacher et al. (2009).
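
A minimal sketch of such an additive QTN model; the number of QTNs, allele frequency, effect sizes, and heritability below are illustrative assumptions, not values from Besenbacher et al.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ind, n_qtn, h2 = 2000, 10, 0.5

genotypes = rng.binomial(2, 0.3, size=(n_ind, n_qtn))  # 0/1/2 allele counts per QTN
effects = rng.normal(0, 1, size=n_qtn)                 # additive effect per QTN

genetic = genotypes @ effects                          # additive genetic value
env_var = genetic.var() * (1 - h2) / h2                # scale noise to reach target h2
phenotype = genetic + rng.normal(0, np.sqrt(env_var), size=n_ind)
```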

Qualitative traits are analyzed in case/control studies. In a genome-wide disease association study the genomes of a large group of individuals are examined to establish the presence of a significant association between a disease and particular genes. The group of individuals is divided into cases (people with the disease) and controls (people without). Randomly pairing haplotypes to form diplotypes, we prospectively generate case-control data sets. We can simulate diploid individuals with a qualitative phenotype using three different probabilities: P(affected | wildtype homozygous), P(affected | wildtype/mutant heterozygous), and P(affected | mutant homozygous).

Note: the term “wild type” allele is sometimes used to describe an allele that is thought to contribute to the typical phenotypic character as seen in “wild” populations of organisms, such as fruit flies (Drosophila melanogaster). A “wild type” allele was historically regarded as dominant, common, and normal, in contrast to “mutant” alleles regarded as recessive, rare, and frequently deleterious. It was formerly thought that most individuals were homozygous for the “wild type” allele at most gene loci, and that any alternative “mutant” allele was found in homozygous form in a small minority of “affected” individuals, often as genetic diseases, and more frequently in heterozygous form in “carriers” for the mutant allele. It is now appreciated that most or all gene loci are highly polymorphic, with multiple alleles whose frequencies vary from population to population, and that a great deal of genetic variation is hidden in the form of alleles that do not produce obvious phenotypic differences.
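
A minimal sketch of this penetrance scheme, assuming a single causal site; the mutant-allele frequency and the three penetrance values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
penetrance = np.array([0.01, 0.02, 0.08])        # P(affected | 0, 1, 2 mutant alleles)

haplotypes = rng.binomial(1, 0.1, size=10_000)   # 0 = wildtype allele, 1 = mutant allele
rng.shuffle(haplotypes)                          # random pairing into diplotypes
genotypes = haplotypes[0::2] + haplotypes[1::2]  # 5,000 diploid individuals

affected = rng.random(genotypes.size) < penetrance[genotypes]
cases, controls = np.flatnonzero(affected), np.flatnonzero(~affected)
```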

Constraints/Measures/Criteria

Disease frequency can be measured in terms of prevalence and incidence, both of which are probability measures. Prevalence is defined as the probability that an individual has the disease at some specified point in time; in other words, it is the probability that a randomly sampled individual is affected by the disease. Incidence is defined either in terms of the probability of developing the disease over a fixed period, or as a probability rate per unit time. Another important measure is relative risk, defined as some measure of disease risk in exposed subjects divided by the same measure of risk in unexposed subjects. In genetic epidemiology, relative risks may be defined for genotypes, alleles, or haplotypes.

For a diallelic locus with alleles A and a, there are three genotypes – AA, Aa, and aa. We usually take one of these, say aa, as the reference and express genotype relative risks as GRR(AA) = risk for AA genotype / risk for aa genotype, and GRR(Aa) = risk for Aa genotype / risk for aa genotype. For a rare disease, these relative risks can be estimated by odds ratios in case-control data (with aa as the reference genotype): GRR(AA) = (Cases(AA)/Controls(AA)) / (Cases(aa)/Controls(aa)) and GRR(Aa) = (Cases(Aa)/Controls(Aa)) / (Cases(aa)/Controls(aa)).
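
For illustration, a small sketch that estimates these GRRs as odds ratios from case/control genotype counts; the counts below are made up.

```python
# Genotype counts in cases and controls (illustrative numbers)
cases    = {"AA": 120, "Aa": 430, "aa": 450}
controls = {"AA":  80, "Aa": 400, "aa": 520}

def grr(genotype, reference="aa"):
    """Odds-ratio estimate of the genotype relative risk versus the reference genotype."""
    return (cases[genotype] / controls[genotype]) / (cases[reference] / controls[reference])

print("GRR(AA):", grr("AA"))
print("GRR(Aa):", grr("Aa"))
```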

Odds ratio and risk ratio (relative risk)
The odds of an event are defined as P(event)/P(not event). Given the following 2-by-2 contingency table,

             cancer    no cancer
smoker         a           b
non-smoker     c           d

The odds ratio (OR) is the ratio of the odds of cancer in smokers to the odds of cancer in non-smokers:

OR = (a/b)/(c/d) = (ad)/(bc)

The risk ratio (RR), also called the relative risk, is the ratio of the probability of cancer in smokers to the probability of cancer in non-smokers:

RR = (a/(a+b))/(c/(c+d)) = (a(c+d))/(c(a+b))

Given that you know a, b, c, and d, you can compute either of these metrics. Yet the odds ratio is strongly preferred as the “right” metric to report in almost all scenarios. That seems to be because the quantity that it measures is more fundamental to the biology of what you’re studying, and less likely to change depending on how you’re studying it. See this post for some examples to illustrate that. In cases where you ascertain on a behavior (or genotype) and then check for a disease phenotype, the RR is a valid measurement – we’ve measured the real probability of getting a disease given that you have a risk factor, rather than cherry-picking people who have the disease. For this reason, the use of RR is mostly confined to longitudinal studies, where a group of individuals is followed for many years to see whether they develop a disease.
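
A small sketch computing both metrics from such a 2-by-2 table; the counts are made up.

```python
# Exposure/disease counts (illustrative numbers)
a, b = 90, 910     # smokers: cancer / no cancer
c, d = 40, 960     # non-smokers: cancer / no cancer

odds_ratio = (a / b) / (c / d)              # equivalently (a*d) / (b*c)
risk_ratio = (a / (a + b)) / (c / (c + d))
print("OR:", odds_ratio, "RR:", risk_ratio)
```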

In genetics, the “exposure” is usually a bi-allelic genotype, where the alt allele count in one individual can have three values: 0/1/2. So how do we compute OR in genetics? Intuitively, you can imagine two ways of doing it. You could dichotomize the genotype into a binary exposure variable: “has an alt allele”/”doesn’t have an alt allele”. That would correspond to a dominant model. Or you could count the alleles themselves instead of the people carrying the alleles. That would correspond to an additive (allelic) model. The allelic model is more commonly used. The contingency table should look like this:

                      affected    unaffected
minor allele count       a            b
major allele count       c            d
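
For illustration, a sketch that builds this allelic table from 0/1/2 genotype dosages and computes the allelic odds ratio; the genotype vectors are made up, and the formulas below restate the same quantities algebraically.

```python
import numpy as np

case_geno = np.array([0, 1, 2, 1, 0, 1])      # alt-allele counts in cases
ctrl_geno = np.array([0, 0, 1, 0, 1, 0])      # alt-allele counts in controls

a, b = case_geno.sum(), ctrl_geno.sum()                 # minor allele counts
c, d = 2 * case_geno.size - a, 2 * ctrl_geno.size - b   # major allele counts

allelic_or = (a / b) / (c / d)
print("allelic OR:", allelic_or)
```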

OR = (a/b)/(c/d) = (minor allele count in cases / minor allele count in controls) / (major allele count in cases / major allele count in controls)

The ratio of minor allele frequencies is MAF(cases)/MAF(controls) = (a/(a+c))/(b/(b+d)), and RR = (a/(a+b))/(c/(c+d)). For very rare alleles, a ≪ c and b ≪ d, so (a/(a+c))/(b/(b+d)) ≈ (a/c)/(b/d) = (ad)/(bc).

Coalescent simulations are widely used to simulate genotypes under complex demographies, with recent extensions to include recombination hotspots [6] and selection [7], or to simulate whole genomes. Simulations are often used to test population genetic hypotheses by comparing simulated and observed data. However, such simulations produce only genotypes, not phenotypes, which are also required to test methods for detecting significant associations between genetic and phenotypic variation. Although some tools provide an option to map phenotypes onto simulated genotypes, they only allow the simulation of qualitative phenotypes or require time-consuming forward-in-time simulations to create genotypes under complex demographic scenarios.

The software phenosim is a tool written in Python that was designed to add a phenotype to genotypes simulated by coalescent-based simulation tools. Simulated phenotypes may either be qualitative or quantitative traits with different effect sizes and may show epistatic interactions. Hence, the simulation of case/control studies as well as the search for quantitative trait nucleotides (QTNs) of a complex trait with a user-defined architecture is possible.

Once data are generated, it is straightforward to produce a SNP panel similar to a genotype chip by ascertaining common SNPs and to mask a random SNP with the desired disease allele frequency.
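
A sketch of this ascertainment step, assuming per-site allele frequencies are already available for the simulated sites; the frequency array, thresholds, and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
freqs = rng.uniform(0.0, 0.5, size=1_670_000)          # stand-in per-site MAFs

panel = np.flatnonzero(freqs >= 0.05)                   # "chip" of ascertained common SNPs

target_daf = 0.01                                       # desired disease allele frequency
candidates = np.flatnonzero(np.abs(freqs - target_daf) < 0.002)
causal = rng.choice(candidates)                         # random SNP at roughly that frequency
panel = panel[panel != causal]                          # mask the causal SNP from the panel
```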

In case-control association studies, the sample is highly nonrandom. Ascertainment of individuals that exhibit a disease enriches the sample for an underlying disease mutation, increasing its frequency in the sample, relative either to that in the population or to that expected in a random sample.

Increasing evidence suggests that rare and generally deleterious genetic variants might have a strong impact on the risk of not only rare Mendelian diseases, but also many common human diseases and related traits. According to the Common Disease/Many Rare Variants hypothesis, a common disease might be caused by multiple rare variants that are under relatively strong selection pressure. However, identifying the rare variants underlying these diseases remains challenging and is still an active research area. While genome-wide association studies (GWAS) are generally unable to identify rare variants because sample sizes are usually insufficient, new methods keep emerging. Since simulation is widely used to evaluate, refine, and validate new methods, it is necessary to generate simulated data sets that reflect realistic and critical characteristics of rare variants.

In addition to the genetic variations of human populations, other characteristics that are important for simulation include allele frequency and linkage disequilibrium (LD) patterns of genetic markers. The power of a statistical test to detect a risk locus relies heavily on the allelic spectrum (numbers and frequencies of alleles) and on the LD structure around the locus. Specifically, it is desirable for simulated data to possess both local and long range LD patterns and to maintain allele frequencies similar to the real data for which the methods are expected to be applied.

One important part is the integration of selection effects.

Choosing Markers

We often exclude the causal variants from the sample, because it is very unlikely that the causal mutation itself is genotyped in a real genome-wide study. We can also use a Tag SNP selection algorithm such as Tagger in Haploview. A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium that represents a group of SNPs called a haplotype. It is possible to identify genetic variation and association to phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since it eliminates the need to study every individual SNP. Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.

When a group of SNPs are inherited together because of high LD there tends to be redundant information. The selection of a tag SNP as a representative of these groups reduces the amount of redundancy when analyzing parts of the genome associated with traits/diseases. The regions of the genome in high LD that harbor a specific set of SNPs that are inherited together are also known as haplotypes. Therefore tag SNPs are representative of all SNPs within a haplotype.

Almost every trait has both genetic and environmental influences. Heritability is the proportion of phenotypic variance that is attributable to genetic variation. Association studies are used to determine the genetic influence on phenotypic presentation. Genome-wide association studies (GWAS) use single-nucleotide polymorphisms (SNPs) to identify genetic associations with clinical conditions and phenotypic traits. Due to the large number of possible SNP variants (almost 13 million as of June 2008 [10]), it is still very expensive to sequence them all. That is why GWAS use customizable arrays (SNP chips) to genotype only a subset of the variants identified as tag SNPs.

Selection of a maximally informative set of tag SNPs is an NP-hard problem; however, algorithms have been devised to provide approximate solutions (with limited error). A tagging algorithm will attempt to locate tag SNPs in the neighborhood N(t) of a target SNP t based on a metric that assesses the quality of tagging. The metric measures how well a tag SNP, as a representative of the SNPs in the neighborhood N(t), can predict the target SNP t. The goal of the algorithm is to find the minimal subset of tag SNPs with maximum informativeness.

Haploview’s Tagger operates in either pairwise or aggressive mode. In either case it begins by selecting a minimal set of markers such that all alleles to be captured are correlated at an r^2 greater than a user-editable threshold with a marker in that set. Certain markers can be forced into the tag list or explicitly prohibited from being chosen as tags. You can also specify which markers in the data set to be captured. Additionally, you can specify the maximum number of tags to pick, as well as the minimum distance (in base pairs) between picked tags.
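
The sketch below is a greedy approximation in the spirit of pairwise tagging, not Haploview Tagger's exact algorithm; the r² threshold and the 0/1 haplotype matrix are assumptions.

```python
import numpy as np

def greedy_tags(haplotypes, r2_threshold=0.8):
    """haplotypes: (n_haplotypes, n_snps) 0/1 matrix; returns indices of chosen tag SNPs."""
    r2 = np.nan_to_num(np.corrcoef(haplotypes, rowvar=False) ** 2)
    np.fill_diagonal(r2, 1.0)                  # every SNP captures itself
    uncaptured, tags = set(range(haplotypes.shape[1])), []
    while uncaptured:
        # pick the SNP that captures the most still-uncaptured SNPs at r^2 >= threshold
        best = max(uncaptured,
                   key=lambda s: sum(r2[s, t] >= r2_threshold for t in uncaptured))
        tags.append(best)
        uncaptured -= {t for t in uncaptured if r2[best, t] >= r2_threshold}
    return tags
```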

Aggressive tagging introduces two additional steps. The first is to try to capture SNPs which could not be captured in the pairwise step (N.B. these must have been “excluded”, since otherwise they would simply be chosen to capture themselves) using multi-marker tests constructed from the set of markers chosen as pairwise tags. After this, it tries to “peel back” the tag list by replacing certain tags with multi-marker tests. Tagger avoids over-fitting by only constructing multi-marker tests from SNPs which are in strong LD with each other, as measured by a pairwise LOD score. This LOD cutoff can be adjusted to loosen or tighten this requirement (the default value is 3.0).

Error simulation

Next-generation sequencing (NGS) technology has evolved rapidly in the last five years, leading to the generation of hundreds of millions of sequences (reads) in a single run. The number of generated reads varies between 1 million for long reads generated by the Roche/454 sequencer (≈400 base pairs (bps)) and 2.4 billion for short reads generated by Illumina/Solexa and ABI/SOLiD sequencers (≈75 bps). The invention of high-throughput sequencers has led to a significant cost reduction, e.g., a megabase of DNA sequence costs only about $0.10. Nevertheless, there is still a lack of efficient and accurate tools and algorithms to align (map) the sequence reads generated by NGS machines to a reference genome. Nor are there commonly agreed criteria or procedures to evaluate and understand the strengths and weaknesses of existing ones.

Existing tools and algorithms use different mapping techniques and thus provide different trade-offs between speed and quality of the mapping. For instance, to reduce runtime, quality is often compromised by neglecting base quality scores, limiting the number of allowed mismatches, disabling gapped alignment or limiting the gap length, and/or ignoring SNP information. To simulate sequencing errors in a realistic manner, i.e. emulating recent sequencing technologies, you need to model different error distributions, rates, and types (you currently have only substitutions). Recent technologies vary greatly in terms of their model-specific error profiles. This cannot be modeled simply by introducing errors at uniform rates. Precisely, you would have to consider the following points (a small sketch follows the list):

a) Different distributions. Usually quality degrades toward the end of the sequencing read, so most models should be polynomial (or at least exponential).

b) Different rates. Independent of the model, different total error rates need to be considered (e.g., modeling the same sequencing platform but different sequencing protocols).

c) Different error types and the ratio between them. Not only substitution errors occur, but also insertions and deletions. Moreover, platforms differ with regard to the indel/substitution ratio.

d) Platform-specific errors. Some platforms tend to introduce specific errors which are mostly dependent on the underlying sequence (e.g., homopolymers or specific patterns).
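
A minimal sketch covering points (a)–(c): a substitution rate that grows polynomially toward the read end plus uniformly distributed indels. All rates are illustrative; no specific platform's error profile is modeled.

```python
import random

BASES = "ACGT"

def add_errors(read, sub_rate_end=0.02, indel_rate=0.001):
    """Return a copy of the read with simulated substitution and indel errors."""
    out = []
    n = len(read)
    for i, base in enumerate(read):
        # substitution rate grows quadratically toward the read end
        sub_rate = sub_rate_end * (i / max(n - 1, 1)) ** 2
        r = random.random()
        if r < indel_rate / 2:
            continue                                   # deletion of this base
        if r < indel_rate:
            out.append(random.choice(BASES))           # insertion before this base
        if random.random() < sub_rate:
            base = random.choice([b for b in BASES if b != base])  # substitution
        out.append(base)
    return "".join(out)
```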

In each case where a genotyping error was added, a homozygous genotype was converted to a heterozygote, or a heterozygous genotype was converted to a randomly chosen homozygote (i.e., either the major or the minor allele homozygote, each with probability 0.5).
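
A sketch of this conversion rule, assuming the positions to perturb have already been chosen.

```python
import random

def flip_genotype(g):
    """g is an alt-allele count (0, 1, or 2) at a position selected for a genotyping error."""
    if g == 1:
        return random.choice([0, 2])   # heterozygote -> randomly chosen homozygote
    return 1                           # homozygote -> heterozygote
```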

A set of 9,000 segregating sites was randomly selected from the sequence to be used as candidate QTL loci, in two different ways: one set sampled completely at random, and the other sampled at random with the restriction that the minor allele frequency should exceed 0.05.
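
A sketch of this selection, assuming per-site minor allele frequencies are available for the simulated segregating sites; the MAF array here is made up.

```python
import numpy as np

rng = np.random.default_rng(7)
maf = rng.uniform(0.0, 0.5, size=1_670_000)            # stand-in per-site MAFs

qtl_unrestricted = rng.choice(maf.size, size=9_000, replace=False)
common_sites = np.flatnonzero(maf > 0.05)
qtl_common = rng.choice(common_sites, size=9_000, replace=False)
```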

References