COSI and CoJava

18 Apr 2015

This post is dedicated to Daniel Thomas Gillis, Kirk’s doctoral student and the author of CoJava who suddently and unexpectedly passed away three months ago.

The coalescent theory provides an efficient simulation tool for generating DNA samples drawn from populations.

The Cosi program (Schaffner et al., 2005) implements a coalescent model similar to the MS program but allows for complex demographic histories and variable recombination rates. Both MS and Cosi implement the standard coalescent approach that simulates genealogical events backward in time. Simulated events typically include the coalescence of two DNA sequences into a single ancestral lineage, recombination within a sequence, or migration between populations. Since all these events are typically rare, coalescent simulators assume that they never occur simultaneously and assume many generations pass between consecutive events. Time between events is explicitly modeled and used to skip over generations with no genealogical events of interest. The algorithm proceeds until all sequences coalesce to their most recent common ancestor (MRCA) and the resulting genealogy is used to place mutation events along the various sequences.

The parameters of cosi are predefined and stored in a separate file named “params”. Cosi(v1.2) supports parameters such as mutation rate, gene conversion rate and sequence length. It also allows users to define the sampled populations, their effective population size (in the present) and demographic history, as well as the number of sampled chromosomes.

Coalescent simulations are very well suited to exploratory data analysis. Samples that are simulated under various models (basic and extension, see my previous post) can be combined with empirical data to test hypotheses. Let us assume that we observe an unusual pattern of polymorphism in a data set and want to know if it is simply a historical accident or whether it requires a special explanation (such as the existence of selection).We then simulate many possible data sets under a null coalescent model that does not include the factor of interest (for example, selection). Finally,we compare the value of a test statistic obtained from the real data with the distribution of values obtained from the simulated data sets. If patterns that are characteristic of the actual data are rarely seen in the simulations, we reject the null hypothesis. Having rejected the null hypothesis, it is usually possible to propose patterns that are compatible with the actual data. It is typical of population-genetics applications that many alternative hypotheses are possible. It is therefore important to choose carefully the alternatives for characterizing deviations from the null.

Based on this idea, Cosi was calibrated (i.e., figure out the values of model parameters by fitting the model to some empirical data) using genome-wide human population data for different populations (deCODE genetic map). The resultant model is referred to as the “bestfit” model and the obtained parameter values (default) are listed in Table 1 in Schaffner et al’s paper.

Major statistics for evaluating the synthetic data and calibrationg the simulator Linkage Disequilibrium (LD) Allele Frequency spectrum

Performance

Cosi is suitable for short genomic segments or gene sequences (<2-3Mb) but become very slow for larger regions (> 100Mb). As sequences get longer, many more coalescent, recombination and migration events occur and the time intervals between them diminish. For longer sequences and large sample sizes, little computational efficiency is gained by skipping over uninteresting generations and substantial computational effort is expended tracking recombination events and their positions, and allocating memory to track the many ancestral fragments of each sequence as they repeatedly recombine and coalesce with each other. As genome-wide studies become a reality, efficient tools for simulating large sequences are essential. Thus, it is necessary to improve cosi’s performance in this aspect.

An improved algorithm – GENOME (While in the standard coalescent algorithm, each coalescence event involves exactly two sequences that coalesce to a common ancestor, in GENOME, multiple sequences can coalesce to a common ancestor simultaneously. The interested readers are also refered to the author’s doctoral dissertation – “Efficient methods for analysis of genome scale data”)

Daniel rewrote Cosi in Java and utilized the fork-join framework for parallel processing. He also started to incorporate chromosome anomalies such as inversion, insertion and deletion. You can find the source codes on Dan’s github site. If you are a layman like me, you may want to check out my commentory on Daniel’s codes (still working on it).This post describes how to use CoJava. [Update: 11/09/2015] We are considering rerun Daniel’s CoJava program to get some new datasets. While doing this, I found some .jar files Dan created at different times out of his program (with modifications) when he tested it. Of course a pro like Dan would not include .java files in these .jar. To extract the source code I used Java Decompiler. It is open source and works like a charm.

Improvement: book p143 5.5.1.3

cosi2 improves on the speed of existing exact simulators, and permits further speedup in approximate mode while retaining support for selection. cosi2 supports a wide range of demographic scenarios, including recombination hot spots, gene conversion, population size changes, population structure and migration. cosi2 implements coalescent machinery efficiently by tracking only a small subset of the Ancestral Recombination Graph, sampling only relevant recombination events, and using augmented skip lists to represent tracked genetic segments.

Performance and required resources

cojavaSer.jar Running on Renci largemem node exclusively (i.e., occupy 40 CPU cores of the entire node) JVM setting: -Xmx800g (must > 300g) -Xss2000M -proc 40 Java fork-join framework threshold = 1 (with regard to the number of segregated regions on a sample chromosome) The out.hap file is around 380g per population

How do population size and sample size affect the performance?

The probability that you can get a rare variant with the same MAF increases with the sample size

Create progress report

Gather information from multiple threads invoked via the fork-join framework: use static volatile variable

Declaring a static variable in Java, means that there will be only one copy, no matter how many objects of the class are created. The variable will be accessible even with no Objects created at all. However, threads may have locally cached values of it.

When a variable is volatile and not static, there will be one variable for each Object. So, on the surface it seems there is no difference from a normal variable but totally different from static. However, even with Object fields, a thread may cache a variable value locally.

This means that if two threads update a variable of the same Object concurrently, and the variable is not declared volatile, there could be a case in which one of the thread has in cache an old value.

Even if you access a static value through multiple threads, each thread can have its local cached copy! To avoid this you can declare the variable as static volatile and this will force the thread to read each time the global value.

However, volatile is not a substitute for proper synchronisation! For instance:

private static volatile int counter = 0;

private void concurrentMethodWrong() {
     counter = counter + 5;
     //do something
     counter = counter - 5;
} Executing concurrentMethodWrong concurrently many times may lead to a final value of counter different from zero! To solve the problem, you have to implement a lock:

private static final Object counterLock = new Object();

private static volatile int counter = 0;

private void concurrentMethodRight() {
     synchronized (counterLock) {
         counter = counter + 5;
     }
     //do something
     synchronized (counterLock) {
         counter = counter - 5;
     }
}

References