Bayesian inference

11 Nov 2015

FYI, today is what Chinese people would call the Singles' Day. This thing even has a Wikipedia page now...

The Bayesian framework and the posterior p-value

Bayesian interpretation of probability is that it represents uncertainty. Bayesian data analysis is about modeling, and we would like to address and quantify the uncertainty that surrounds the appropriate choice for the model parameters. From a Bayesian perspective, we can use the machinery of probability theory to describe the uncertainty in model parameters, or indeed in the choice of model itself. The basic formula for Bayesian inference is Prior + data = posterior. More specifically, Bayes’ theorem is used to convert a prior probability, which formalizes our pressumption on a model parameter θ, into a posterior probability by incorporating the evidence provided by the observed data. The effect of the observed data y = {y1, . . . , yN} is expressed through the conditional probability p(y|θ). Bayes’ theorem, which takes the form p(θ|y) = p(y|θ)p(θ)/p(y) then allows us to evaluate the uncertainty in w after we have observed y in the form of the posterior probability p(θ|y). There is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set. (Dr. Jeff Tilson, another member of our group and a senior scentist at RENCI, is working with Kirk on predictive genomics. They are building predictive models with the number of parameters at the same magnitude of order as human genome. In constrast, the number of samples they have, or the size of the training set is much fewer. To solve this problem they have been applying a Bayesian method.)
The quantity p(y|θ) on the right-hand side of Bayes’ theorem is evaluated for the observed data set y and can be viewed as a function of the parameter vector θ, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector θ. Note that the likelihood is not a probability distribution over θ, and its integral with respect to θ does not (necessarily) equal one. The denominator p(y) in the right side of the above formula is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one. Indeed, integrating both sides of the formula with respect to θ, we can express the denominator in Bayes’ theorem in terms of the prior distribution and the likelihood function p(y) = ∫p(y|θ)p(θ)d(θ). In both the Bayesian and frequentist paradigms, the likelihood function p(y|θ) plays a central role. However, the manner in which it is used is fundamentally different in the two approaches. In a frequentist setting, θ is considered to be a fixed parameter whose value is to be estimated, and the error bar on this estimate is obtained by considering the distribution of possible data sets y. By contrast, from the Bayesian viewpoint there is only a single data set y (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over θ.
A widely used frequentist estimator is maximum likelihood, in which θ is set to the value that maximizes the likelihood function p(y|θ). This corresponds to choosing the value of θ for which the probability of the observed data set is maximized. In the machine learning literature, the negative log of the likelihood function is called an error function. Because the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the error. One approach to determining frequentist error bars is the bootstrap, in which multiple data sets are created as follows. Suppose our original data set consists of N data points X = {x1, . . . , xN}. We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. This process can be repeated L times to generate L data sets each of size N and each obtained by sampling from the original data set X. The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap data sets.

Now let’s talk about the Bayesian goodness of fit test, posterior p-value. Again, let y be the observed data and θ be the vector of parameters. We define yrep as the replicated data that could have been observed, or, to think predictively, as the data we would see tomorrow if the experiment that produced y today were replicated with the same model and the same value of θ that produced the observed data. We define the distribution of yrep given the current state of knowledge with the posterior predictive distribution p(yrep|y)=∫Θp(yrep|θ)p(θ|y)dθ Now, we can measure the discrepancy between the model and the data by defining test quantities of the data that we would like to check. A test quantity, or discrepancy measure, T(y,θ), is a scalar summary of parameters and data that is used as a standard when comparing data to predictive simulations. The role test quantities play in Bayesian model testing is the same as the role test statistics play in classical model testing. We define the notation T(y) for a test statistic, which is a test quantity that depends only on data; in the Bayesian context, we can generalize test statistics to allow dependence on the model parameters under their posterior distribution. Classically, the p-value for the test statistic T(y) is

where the probability is taken over the distribution of yrep with θ fixed. From a Bayesian perspective, lack of fit of the data with respect to the posterior predictive distribution can be measured by the tail-area probability, or p-value, of the test quantity, and computed using posterior simulations of (θ,yrep). In the Bayesian approach, test quantities can be functions of the unknown parameters as well as data because the test quantity is evaluated over draws from the posterior distribution of the unknown parameters.

Now, we can define the Bayesian p-value (PPPs) as the probability that the replicated data could be more extreme than the observed data, as measured by the test quantity: pB=Pr(T(yrep,θ)≥T(y,θ)|y) where the probability is taken over the posterior distribution of θ and the posterior predictive distribution of yrep (that is, the joint distribution, p(θ,yrep|y)): pB=∬ΘIT(yrep,θ)≥T(y|θ)p(yrep|θ)p(θ|y)dyrepdθ, where I is the indicator function. In practice though we usually compute the posterior predictive distribution using simulations.

If we already have, say, L simulations from the posterior distribution of θ, then we can just draw one yrep from the predictive distribution for each simulated θ; we now have L draws from the joint posterior distribution, p(yrep,θ|y). The posterior predictive check is the comparison between the realized test quantities T(y,θl) and the predictive test quantities T(yrepl,θl). The estimated p-value is just the proportion of these L simulations for which the test quantity equals or exceeds its realized value; that is, for which T(yrepl,θl)≥T(y,θl) for l=1,…,L.

In contrast to the classical approach, Bayesian model checking does not require special methods to handle “nuisance parameters.” By using posterior simulations, we implicitly average over all the parameters in the model.

The exponential distribution family and their conjugate pairs

In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior

A example of Bayesian inference in Genetics: LD-based phasing

Check out the slides of my current group talk.

Gamma distribution as a conjugate prior distribution

the gamma distribution is a two-parameter family of continuous probability distributions. The common exponential distribution and chi-squared distribution are special cases of the gamma distribution. There are three different parametrizations in common use:

With a shape parameter k and a scale parameter θ. With a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter. With a shape parameter k and a mean parameter μ = k/β. In each of these three forms, both parameters are positive real numbers.

The parameterization with α and β is more common in Bayesian statistics, where the gamma distribution is used as a conjugate prior distribution for various types of inverse scale (aka rate) parameters, such as the λ of an exponential distribution or a Poisson distribution[3] – or for that matter, the β of the gamma distribution itself. (The closely related inverse gamma distribution is used as a conjugate prior for scale parameters, such as the variance of a normal distribution.)

A random variable X that is gamma-distributed with shape α and rate β is denoted

The corresponding density function in the shape-rate parametrization is

The cumulative distribution function is the regularized gamma function:

If α is a positive integer (i.e., the distribution is an Erlang distribution), the cumulative distribution function has the following series expansion:

Set up and compute model (Summarizes uncertainties using probability) Inference for quantities of interest (Inference using iterative simulation (Gibbs sampler) Model checking  Do inferences make sense?  Compare replicated to actual data, cross-validation  Dispersed model validation (“beta-testing”)