The multiple comparison problem in GWAS: Bonferroni correction, FDR control, and permutation testing

19 Oct 2015

If you are unfamiliar with some terms mentioned in this post, Wikipedia is a good place for quick references. I myself find direct mathematical definitions most helpful.

Statistical tests are generally called significant, and the null hypothesis is rejected, if the p-value (the probability of observing a test statistic at least as extreme as the one obtained, given that the null hypothesis is true) falls below a predefined alpha value, which is nearly always set to 0.05. This means that 5% of the time the null hypothesis is rejected when it is in fact true, and we report a false positive (Type I error). Every statistical test comes with such a probability, but this probability applies to one single test. “Multiple comparisons” or “multiple testing” is the name attached to the general problem of making decisions based on the results of more than one test (e.g., considering a set of statistical inferences simultaneously). The nature of the problem is made clear in this post (the XKCD “Green jelly bean” cartoon). In a typical GWAS, hundreds of thousands to millions of tests are conducted simultaneously, one for each marker and each with its own false positive probability. The cumulative likelihood of finding one or more false positives (i.e., false associations) over the entire GWAS analysis is therefore much higher. For example, testing 100,000 loci for association at the 5% level is expected to yield about 5,000 false positives if no locus is truly associated. If we want the overall Type I error (i.e., family-wise error rate) to remain at 5%, we need to lower the significance level at each locus.
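
As a quick back-of-the-envelope illustration (a minimal Python sketch using the numbers from the text; independence among tests is assumed):

```python
# Expected false positives and family-wise error rate (FWER) when running
# k independent tests, each at alpha = 0.05.
alpha = 0.05
for k in (1, 100, 100_000):
    expected_fp = k * alpha              # expected number of false positives
    fwer = 1 - (1 - alpha) ** k          # P(at least one false positive)
    print(f"{k:>7} tests: ~{expected_fp:g} expected false positives, "
          f"FWER = {fwer:.4f}")
```

Already at 100 tests the FWER is close to 1, which is why the per-test alpha must be lowered.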

There are a variety of methods that serve this purpose. The first group includes some classic controlling procedures, represented by Bonferroni correction. Bonferroni correction simply divides the significance level by the number of tests: it adjusts the alpha value from a = 0.05 to a = 0.05/k, where k is the number of statistical tests conducted. If the tests are independent, the Bonferroni bound is slightly conservative; if the tests are correlated, it becomes more conservative still. Due to linkage disequilibrium among GWAS markers, it is generally not safe to assume that the association tests on a GWAS data set are independent. Thus, applying Bonferroni correction often gives us the most conservative p-value threshold (maybe as a lower bound?) – for a typical GWAS using 500,000 SNPs, the significance threshold for a SNP association would be set as low as 1e-7. This is usually far too stringent and results in an enormous loss of power, producing many false negatives or Type II errors (failing to declare a test significant when the null is false).
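
In code, the correction is a one-liner; a minimal sketch (the uniform draws below merely stand in for real GWAS p-values):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 for every test whose p-value falls below alpha / k."""
    pvals = np.asarray(pvals)
    threshold = alpha / len(pvals)       # 0.05 / 500,000 = 1e-7, as in the text
    return pvals <= threshold, threshold

pvals = np.random.uniform(size=500_000)  # placeholder for real GWAS p-values
rejected, cutoff = bonferroni(pvals)
print(cutoff, rejected.sum())
```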

The Type I versus Type II tradeoff applies to most statistical tests, and which error is of more concern to the investigator determines how to proceed. The goal of large-scale exploratory experiments is usually to select from the original data a reduced set of potential candidates for further analysis. To include all possible true alternatives (as opposed to true nulls), researchers are often willing to accept some false positives. In many GWASs, we wish to take a large number of possible hypotheses and extract a subset showing the most support for their alternative hypotheses for further consideration. It is also known that GWASs suffer from low power (i.e., they are susceptible to Type II errors) when the objects of study are rare variants, rare diseases, or variants with small effects. Thus, we are often more concerned with Type II errors, as the penalty of including a test that is from the null may be less than the penalty of excluding a test that is not from the null.

Sequential Bonferroni correction is a power-increasing extension of the standard Bonferroni correction. The general idea is that once we reject a hypothesis, there remains one fewer test, and the multiple comparison correction should take this into account. A sequential Bonferroni correction method that allows for potential dependencies between tests was proposed by Holm (1979). Bonferroni corrections and their extensions are appropriate when we expect only a few of the many null hypotheses to be false: if n is the total number of tests and n0 is the number of true nulls, this means n0 is close to n. An alternative setting is that some substantial fraction of the tests are indeed expected to be false. In such cases, even sequential Bonferroni corrections are likely too stringent, resulting in too many false negatives (i.e., less power). If we expect some reasonable number of the hypotheses to be false, it is more appropriate to control the fraction of false positives among the tests we declare significant (i.e., false discoveries) than to bound the false positive rate across all tests, including those whose null hypothesis we cannot reject. That being said, we also don’t want to be completely swamped with false positives, so the fraction of false positives within the reduced set still needs to be controlled.
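
For concreteness, here is a sketch of Holm’s step-down procedure (my own illustrative implementation, not code from the reference):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm (1979) step-down correction: compare the i-th smallest p-value
    against alpha / (k - i) (i zero-based, so the smallest is tested at
    alpha / k), stopping at the first p-value that fails."""
    pvals = np.asarray(pvals)
    k = len(pvals)
    order = np.argsort(pvals)            # indices from smallest to largest p
    reject = np.zeros(k, dtype=bool)
    for i, idx in enumerate(order):
        if pvals[idx] <= alpha / (k - i):
            reject[idx] = True
        else:
            break                        # all larger p-values fail as well
    return reject
```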

A method resulting from this line of thinking is to control the false discovery rate (FDR) – the proportion of significant results that are false positives (i.e., the proportion of positive tests that are false). The method tries to find a p-value threshold \tau such that the set of tests declared significant using p \leq \tau has the desired false discovery rate q (q = 5% means that on average 5% of the tests we declare significant are actually null). Originally developed by Benjamini and Hochberg (1995), FDR procedures essentially correct for the number of expected false discoveries, providing an estimate of the number of actual true results among those called significant. This technique has been widely applied to GWAS and extended in a variety of ways (van den Oord, 2008). Consider testing N hypotheses, H_{1}, H_{2}, \ldots , H_{N}, based on their respective p-values, p_{1}, p_{2}, \ldots , p_{N}, and tolerate a fraction q of discoveries being false. Sort the p-values in ascending order, p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(N)}, and denote by H_{(i)} the hypothesis corresponding to p_{(i)}. Let k be the largest i for which p_{(i)} \leq \frac{i}{N}\frac{q}{c(N)}. Then reject all H_{(i)}, i=1, 2, \ldots , k. The constant c(N) is not in the original publication; it appeared in Benjamini and Yekutieli (2001) for cases in which independence cannot be ascertained. The possible choices for the constant are c(N)=1 or c(N)=\sum_{i=1}^{N} 1/i. The second is valid in any situation, whereas the first is valid for most situations, particularly where there are no negative correlations among tests. The BH procedure has found many applications across different fields, including neuroimaging, as introduced by Genovese et al. (2002). The problem with the FDR correction is that the FDR-corrected value q_{(i)} = p_{(i)} \frac{N c(N)}{i} is not a monotonic function of p_{(i)}. This means that if someone uses any q_{(i)} to threshold the set of FDR-corrected values, the result is not the same as would be obtained by applying the B&H procedure sequentially at each of these corresponding q levels.
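
A sketch of the step-up procedure just described, with c(N) = 1 for independence or the Benjamini-Yekutieli harmonic-sum constant under dependence (illustrative code, not from the original papers):

```python
import numpy as np

def bh_threshold(pvals, q=0.05, dependent=False):
    """Benjamini-Hochberg step-up procedure: return the p-value cutoff tau.
    With dependent=True, uses c(N) = sum_{i=1}^{N} 1/i (Benjamini &
    Yekutieli, 2001), valid under arbitrary dependence."""
    p = np.sort(np.asarray(pvals))                 # p_(1) <= ... <= p_(N)
    N = len(p)
    c_N = np.sum(1.0 / np.arange(1, N + 1)) if dependent else 1.0
    # Largest k with p_(k) <= (k/N) * q / c(N)
    below = p <= (np.arange(1, N + 1) / N) * (q / c_N)
    if not below.any():
        return 0.0                                 # nothing declared significant
    k = np.max(np.nonzero(below)[0]) + 1
    return p[k - 1]                                # all tests with p <= tau are rejected
```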

To address this concern, Yekutieli and Benjamini (1999) introduced the FDR adjustment, in which monotonicity is enforced and whose definition is compatible with the original FDR definition. Let q^{*}_{(i)} be the FDR-adjusted value for p_{(i)}. Its value is the smallest q_{(k)}, k \geq i, where q_{(k)} is the FDR-corrected value defined above. That’s just it!
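
The adjustment amounts to a running minimum taken from the largest p-value downwards; a sketch (the function name is mine):

```python
import numpy as np

def fdr_adjust(pvals, dependent=False):
    """FDR-adjusted values: q*_(i) = min_{k >= i} q_(k), where
    q_(i) = p_(i) * N * c(N) / i is the FDR-corrected value. Thresholding
    q* at any level q reproduces the B&H procedure run at that level."""
    p = np.asarray(pvals)
    N = len(p)
    order = np.argsort(p)
    c_N = np.sum(1.0 / np.arange(1, N + 1)) if dependent else 1.0
    q_corrected = p[order] * N * c_N / np.arange(1, N + 1)
    # Enforce monotonicity: running minimum from the largest p-value down
    q_adjusted = np.minimum.accumulate(q_corrected[::-1])[::-1]
    out = np.empty(N)
    out[order] = np.clip(q_adjusted, 0, 1)         # back to the original order
    return out
```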

As mentioned, the decision to control false positives versus false discoveries hinges to a large extent on the fraction of tests that are true nulls (n0/n), which is usually unknown. Some earlier works are thus relevant to the topic of this post due to their attempts to estimate n0 (for the purpose of rejecting the global null hypothesis that all tests come from their respective true null hypotheses). We know that p-values follow a uniform distribution over (0,1) if the null is correct. If the collection of tests contains some true alternative hypotheses mixed in with true nulls, the distribution is instead a mixture, with a fraction n0/n of the draws coming from a uniform and (1 - n0/n) from some other distribution. Several approaches were developed to help detect this discrepancy/departure visually. One of them is the Schweder-Spjøtvoll plot, in which the p-values are ordered from smallest p(1) to largest p(n) and, for each p(i), the number of p-values exceeding it is plotted against 1 - p(i). Under a uniform distribution the result is approximately a straight line passing through the origin and the point (1, n). When some alternative hypotheses are true, there is an inflation of p-values near zero and therefore a strong departure from linearity near one (p near zero, so 1 - p near one).
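
A minimal plotting sketch of this diagnostic (assumes numpy and a recent matplotlib; `axline` needs matplotlib ≥ 3.3):

```python
import numpy as np
import matplotlib.pyplot as plt

def schweder_spjotvoll(pvals):
    """For each p-value, plot the count of p-values exceeding it against
    1 - p. All-null (uniform) p-values fall near a straight line from the
    origin with slope n; true signals bend the curve upward near 1 - p = 1."""
    p = np.sort(np.asarray(pvals))           # p_(1) <= ... <= p_(n)
    n = len(p)
    exceed = n - np.arange(1, n + 1)         # number of p-values > p_(i)
    plt.plot(1 - p, exceed, ".", markersize=2)
    plt.axline((0, 0), slope=n, linestyle="--")   # all-null reference line
    plt.xlabel("1 - p")
    plt.ylabel("number of p-values exceeding p")
    plt.show()
```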

Alternatively, we can plot the empirical distribution of p-values (using a histogram) and see if there are any departures from a uniform distribution. A uniform distribution results in a flat histogram, whereas any shift produces a skew towards 0 or 1. The methods just introduced also allow us to infer n0 from the plots they create, as reviewed in Appendix 4 of Walsh and Lynch’s book (see References).
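
For instance, a crude Storey-style estimate of n0 uses the near-uniform right tail of the histogram: p-values above a cutoff lambda should be almost entirely nulls, so #{p > lambda} ≈ n0(1 - lambda). A sketch (the default lambda = 0.5 is a common convention, not a value from the post):

```python
import numpy as np

def estimate_n0(pvals, lam=0.5):
    """Rough estimate of the number of true nulls n0 from the right tail
    of the p-value distribution, which should be almost all nulls."""
    p = np.asarray(pvals)
    return min(len(p), int(np.sum(p > lam) / (1 - lam)))
```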

The multiple comparison problem in GWAS is more complicated than in a general setting for two reasons: the large scale of simultaneous comparisons and the correlations among these comparison tests. Contemporary genetic association studies may test hundreds of thousands to millions of genetic variants for association, often with multiple binary and/or continuous traits or under more than one model of inheritance. Many of these association tests may be correlated with one another because of linkage disequilibrium between nearby markers and correlation between traits and models. And don’t forget that all these tests are conducted on one common data set (i.e., the same set of individuals) that may not have a large number of samples. As mentioned, a multiple-comparison procedure makes statistical inferences based on the joint distribution of p-values from all the tests. If these tests are independent, it is relatively easy to compute the joint distribution of p-values assuming the null hypothesis of each test is true (each p-value is a random draw from a uniform distribution). If the tests are dependent, however, the calculation becomes much more difficult because the joint null distribution of p-values can depart significantly from that of independent uniforms.

So how do the methods we have introduced fare in this context? How are their effectiveness and efficiency affected? When tests are correlated, Bonferroni correction is overly conservative. To give an extreme example, when all the p-values are identical (a case of perfect dependence), the cutoff value should be just alpha, not alpha divided by the number of markers as the Bonferroni procedure prescribes. The original FDR controlling procedure assumes independence among the multiple tests (Benjamini and Hochberg, 1995). Dependence in the noise of the statistics creates problems only for deriving FDR-controlling procedures and can therefore be addressed (Benjamini and Yekutieli, 2001), as long as the dependence structure meets certain constraints (see this post). However, the large-scale dependence in GWAS is mainly caused by genetic rather than statistical reasons and creates problems for the applicability of the FDR quantity itself (Chen and Storey, 2006). The argument was that FDR builds on the number of “true discoveries,” which is usually ambiguous in GWAS – it is difficult to determine the exact number of causal variants associated with a trait. Thus, applying FDR across multiple correlated loci (due to LD) for a single trait is dubious.
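
A small simulation can make the Bonferroni point concrete: under strong equicorrelation among null test statistics, the realized FWER at the alpha/M cutoff falls far below the nominal alpha (illustrative sketch; the equicorrelated-normal model is my assumption, not from the post):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_tests, n_sims, alpha = 1_000, 5_000, 0.05

for rho in (0.0, 0.99):
    # Equicorrelated null z-scores: a shared component plus independent noise
    shared = rng.standard_normal((n_sims, 1))
    own = rng.standard_normal((n_sims, n_tests))
    z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
    pmin = 2 * norm.sf(np.abs(z)).min(axis=1)      # smallest two-sided p-value
    fwer = np.mean(pmin <= alpha / n_tests)        # hits at the Bonferroni cutoff
    print(f"rho = {rho}: realized FWER = {fwer:.4f}")
```

With rho = 0 the realized FWER sits near the nominal 0.05; with rho = 0.99 it drops far below it, i.e., the correction is much more stringent than needed.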

It is therefore of interest to account for the true dependence structure of the p-values (or of the individual test statistics) in order to derive more powerful controlling/correcting procedures. Statistically, this can be achieved by applying resampling methods such as bootstrapping and permutation, granted that they are computationally intensive. Genetically, we can model the dependency among the SNPs (i.e., LD) and incorporate this information into the previous methods/procedures (e.g., Dalmasso et al., 2008; Wei et al., 2009; Broberg, 2005; Need et al., 2009). Let’s start with permutation tests. For p-value adjustment under multiple comparisons, permutation testing is widely considered the gold standard against which other estimators and tests can be compared. It generates the empirical distribution of the test statistics under the null hypothesis (hence it is useful when the asymptotic distribution of a statistic is unknown or difficult to model) while maintaining the original correlation structure. A typical permutation test on GWAS data randomly reassigns the phenotypes to the individuals (i.e., swaps the trait labels among the subjects), effectively breaking the genotype-phenotype relationship while keeping the LD structure of the dataset intact. Each random reassignment represents one possible sampling of individuals under the null hypothesis, and the process is repeated a predefined number of times N, so a permutation procedure with N = 1000 gives an empirical p-value with a resolution of 1/1000. Randomly permuting and reanalyzing the data many times and comparing the permutation-based results with the original results allows estimation of the probability of observing a p-value as extreme as the original result, given the correlation between tests. This solution is attractive because of its simplicity and robustness. However, in the context of GWAS, permutation is likely to require too much computation time, so computationally efficient alternatives are desirable.
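
A sketch of such a label-swapping permutation test for a quantitative trait, including the maxT idea that comes up in the PLINK paragraph below (all names are illustrative, and monomorphic SNPs are assumed to have been removed):

```python
import numpy as np

def permutation_pvalues(genotypes, phenotype, n_perm=1000, rng=None):
    """Single-marker permutation test sketch. The per-SNP statistic is
    |correlation(genotype, phenotype)|; shuffling the phenotype breaks the
    genotype-phenotype association while leaving LD among SNPs untouched.
    Returns pointwise and maxT-adjusted empirical p-values."""
    rng = rng or np.random.default_rng()
    X = (genotypes - genotypes.mean(0)) / genotypes.std(0)   # samples x SNPs
    y = (phenotype - phenotype.mean()) / phenotype.std()
    obs = np.abs(y @ X) / len(y)                  # observed |r| per SNP
    count = np.zeros_like(obs)
    count_max = np.zeros_like(obs)
    for _ in range(n_perm):
        perm = np.abs(rng.permutation(y) @ X) / len(y)
        count += perm >= obs                      # pointwise exceedances
        count_max += perm.max() >= obs            # maxT: corrects across SNPs
    # Add-one smoothing so no empirical p-value is exactly zero
    return (count + 1) / (n_perm + 1), (count_max + 1) / (n_perm + 1)
```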

Several software packages have been developed to perform permutation testing for GWAS, including the popular PLINK software, PRESTO (Browning, 2008), PERMORY, and REM. PLINK can report both raw and permuted p-values; by default it uses an adaptive testing strategy with a sliding window that stops running permutations (say, 1000 per SNP) early if the SNP under consideration does not appear “interesting”. It also has an option for computing maxT; see the online help.

Another group of methods extends the Bonferroni or FDR adjustments to account for the correlation between tests. When multiple tests (say M) are correlated, the true probability of observing a particular p-value is smaller, because there is less variation between test statistics than if the tests were independent, which makes extreme test statistics less likely. In effect, it is as though fewer tests were performed; for this reason, an intuitive solution is to estimate the effective number of independent tests (M_eff), as those are the tests that should be corrected for. For example, some researchers suggest calculating the effective number of independent SNPs in a gene region and using this value in the Bonferroni correction (Cheverud, 2001; Dudbridge & Gusnanto, 2008; Galwey, 2009; Gao et al., 2008; Li and Ji, 2004; Nyholt, 2004). This approach is less conservative than the basic Bonferroni correction and more computationally efficient than permutation testing, but its accuracy proved unsatisfactory when applied to GWAS (Dudbridge & Koeleman, 2004; Salyakina et al., 2005). After all, a single parameter cannot fully capture a correlation structure as complex as genome-wide LD among SNPs. A further method (Conneely & Boehnke, 2007) uses extreme tail theory to explicitly calculate the probability of detecting a test statistic beyond a predefined threshold, and it is typically thousands of times faster than permutation-based p-values at a given level of precision. This method does not assume independence among tests, but it does assume that the asymptotic joint distribution of all association test statistics is multivariate normal with a known covariance matrix. Although many common association test statistics are asymptotically multivariate normal, use of the asymptotic distribution requires reasonably large sample sizes and may not be appropriate in all cases (for example, dominant or recessive models with a rare minor allele). This method was designed to adjust the minimum p-value and other ordered p-values for a large number of 1-df tests, so it is especially relevant if we are looking for a small number of reasonably large genetic effects. If we instead expect a large number of very small effects, a joint analysis of all associations simultaneously (based on multiple-df tests) might be more appropriate.
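
As one concrete example of an M_eff estimate, a Li-and-Ji-style approach derives it from the eigenvalues of the SNP correlation (LD) matrix (a hedged sketch of the general idea; real implementations work region by region and differ in details):

```python
import numpy as np

def effective_tests(genotypes):
    """Effective number of independent tests from the eigenvalues of the
    SNP x SNP correlation matrix: each eigenvalue contributes 1 if it is
    >= 1, plus its fractional part (Li & Ji-style counting)."""
    R = np.corrcoef(genotypes, rowvar=False)      # SNP x SNP LD correlation
    eig = np.abs(np.linalg.eigvalsh(R))
    return np.sum((eig >= 1).astype(float) + (eig - np.floor(eig)))

# Bonferroni with the effective count: alpha_star = 0.05 / effective_tests(G)
```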

So-called gap statistics or sliding-window approaches have proved successful in some cases; you’ll find a good review in (7) and (8). I’ve also heard of methods that make effective use of the haplotype structure or LD, e.g. (9), but I have never used them. They seem, however, more related to estimating the correlation between markers.

Another idea is that this dependency can be removed before the individual tests are performed, so that the resulting tests produce independent p-values.

Controlling for multiple testing to accurately estimate significance thresholds is a very important aspect of studies involving many genetic markers, particularly GWA studies. The Type I error rate, also called the significance level or false-positive rate, is the probability of rejecting the null hypothesis when it is true; it indicates the proportion of false positives that an investigator is willing to tolerate in his or her study. The family-wise error rate (FWER) is the probability of making one or more Type I errors in a set of tests. Lower FWERs restrict the proportion of false positives at the expense of reducing the power to detect association when it truly exists. A suitable FWER should be specified at the design stage of the analysis [1]. It is then important to keep track of the number of statistical comparisons performed and correct the individual SNP-based significance thresholds for multiple testing to maintain the overall FWER.

For association tests applied at each of n SNPs, per-test significance levels α* for a given FWER of α can be simply approximated using Bonferroni (α* = α/n) or Sidak [15,16] (α* = 1 − (1 − α)^{1/n}) adjustments. When tests are independent, the Sidak correction is exact; however, in GWA studies comprising dense sets of markers this is unlikely to be true, and both corrections are then very conservative. A similar but slightly less stringent alternative to the Bonferroni correction is given by Holm [17]. Alternatives to the FWER approach include false discovery rate (FDR) procedures [18,19], which control the expected proportion of false positives among those SNPs declared significant. However, dependence between markers and the small number of expected true positives make FDR procedures problematic for GWA studies.

Alternatively, permutation approaches aim to render the null hypothesis correct by randomization: essentially, the original p-value is compared with the empirical distribution of p-values obtained by repeating the original tests while randomly permuting the case-control labels [20]. Although Bonferroni and Sidak corrections provide a simple way to adjust for multiple testing by assuming independence between markers, permutation testing is considered to be the ‘gold standard’ for accurate correction [20]. Permutation procedures are computationally intensive in the setting of GWA studies and, moreover, apply only to the current genotyped data set; therefore, unless the entire genome is sequenced, they cannot generate truly genome-wide significance thresholds. Bayes factors have also been proposed for the measurement of significance [6].

For GWA studies of dense SNPs and resequencing data, a standard genome-wide significance threshold of 7.2e−8 for the UK Caucasian population has been proposed by Dudbridge and Gusnanto [21]. Other thresholds for contemporary populations, based on sample size and proposed FWER, have been proposed by Hoggart et al. [22]. Most recently, the accepted standard is 5e−8 [23]. Bonferroni correction uses all n single nucleotide polymorphisms (SNPs) across the genome, but this approach is highly conservative and would “overcorrect” for SNPs that are not truly independent: many SNPs fall within regions of strong linkage disequilibrium (LD) (“blocks”) and should not be considered independent. It has been estimated, based on data from ENCODE, that the typical human population under study has about 1 million independent chromosomal segments (e.g., among Europeans; in African populations there are about 2 million independent segments). Thus, we have 0.05/1e6 = 5e-8. Other research has come to about the same numbers and recommendations, and loci identified at this stringent p-value tend to hold up across different experiments. Further, graphical techniques for assessing whether observed p-values are consistent with expected values include log quantile-quantile p-value plots, which highlight loci that deviate from the null hypothesis [24].
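
A minimal sketch of such a log quantile-quantile plot (illustrative code; the expected quantiles use the common (i − 0.5)/n convention):

```python
import numpy as np
import matplotlib.pyplot as plt

def qq_plot(pvals):
    """Observed -log10(p) against the quantiles expected under the global
    null. Points rising above the identity line flag loci that deviate
    from the null hypothesis."""
    obs = -np.log10(np.sort(np.asarray(pvals)))[::-1]   # largest -log10(p) last
    n = len(obs)
    exp = -np.log10((np.arange(n, 0, -1) - 0.5) / n)    # expected uniform quantiles
    plt.plot(exp, obs, ".", markersize=2)
    lim = max(exp.max(), obs.max())
    plt.plot([0, lim], [0, lim], "--")                  # identity line
    plt.xlabel("expected -log10(p)")
    plt.ylabel("observed -log10(p)")
    plt.show()
```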

References