Initiate CoJava
The program must be invoked with the following command-line arguments
- "-p" (REQUIRED) specifies the parameter file, which is a text file that describes the demographic model to be simulated.
- "-o" (REQUIRED) specify the base name for output files. Output consists of a pair of files for each sampled population, one containing a list of all variant sites (with position and allele frequencies), the other containing the haplotypes for that population.
- "-l" specify the log file that
- "-s" specify the seg file that
- "-proc" specify the number of processors used in the program (1 processor by default). Note that the fork/join framework of Java 8 targets single node, multicore parallelism, so request for only one node if you are working on a cluster
Set up parameters
The parameter file defines the population structure and other input parameters, using keywords. In this file, comments are indicated by “#” at the beginning of a line. Newlines don’t matter. The file includes the following options, which you can comment out by adding “#” to the beginning of each line.
infinite_sites yes
The default is a finite sites simulation, in that mutations occur at discrete sites; if multiple mutations occur at a single site, only the first one is retained. Setting “infinite_sites” to yes converts the output positions to floating point, with all mutations retained.
number_mutation_sites 2000
Use this option when you prefer a fixed number of mutation sites
random_seed 183122717
This option specifies a particular random number seedUseful for debugging or recreating a previous run. If a seed of zero is supplied, or the keyword is not found, a random seed will be generated from the time and process id of the job.
# length <sequence length in bp (base pair)>
length 200
# mutation_rate <mutation rate per bp per generation>
mutation_rate 1.5e-8
# recomb_file <file_name>
recomb_file model.test
This option specifies the file describing the genetic map to be used (a brief introduction of genetic map). The file has two columns separated by whitespace:
<position (kb)> <recomb prob per bp per generation>
The first column gives a base-pair position; the second column sets the crossover recombination rate, per generation, from that point until either the end of the sequence region or the position specified by the next line. The basepair positions in the first column must be in strictly increasing order. #The first line of the genetic map file also specifies the recombination rate from the beginning of the region to the base-pair position of that line.
Gene conversion is specified by the following parameters:
#gene_conversion_rate <rate of initiation per bp per generation>
gene_conversion_rate 4.5e-9
You can also use gene_conversion_relative_rate
to indicate the rate of gene conversion initiation relative to crossover recombination rate per generation, and gene_conversion_mean_tract_length
to indicate the mean length of gene conversion tract in bp (the default value is 500 bp).
Define rare event rates:
inversion_rate 1.0e-7
insertion_rate 1.0e-7
deletion_rate 1.0e-7
Any population that appears in the simulation, either as a source of samples or in the history of those samples, must be defined in the file; at least one sampled population is required. The syntax for defining a population is:
pop_define <pop id> <label>
pop_size <pop id> <size>
sample_size <pop id> <n sample>
pop id
is an integer ID of the population, used to refer to the population when specifying demographic events. label
is human-readable name for the population (a string with no spaces and not put in quotes). pop_size
indicates the effective present-day population size. sample_size
indicates the number of sampled individuals. For example,
pop_define 1 european
pop_define 3 african-american
pop_define 4 asian
pop_define 5 african
# European
pop_size 1 100000
sample_size 1 10
# African American
pop_size 3 100000
sample_size 3 10
# Asian
pop_size 4 100000
sample_size 4 10
# African
pop_size 5 100000
sample_size 5 10
If you only want to get samples from one or some, but not all of the four populations, modify sample_size accordingly (change the “sample_size” value of unneeded population to 0. Do NOT comment out that population and related population event below (starting with “pop_event”).
Parameters that define the demographic history of the populations are as follows and they can be supplied in any order.
#pop_event migration_rate <label> <source pop id> <target pop id> <T> <probability per chrom per gen>
pop_event migration_rate "afr->eur migration" 5 1 0. .000032
pop_event migration_rate "eur->afr migration" 1 5 0 .000032
pop_event migration_rate "afr->as migration" 5 4 0. .000008
pop_event migration_rate "as->afr migration" 4 5 0 .000008
pop_event migration_rate "afr->eur migration" 5 1 1996 0
pop_event migration_rate "eur->afr migration" 1 5 1995 0
pop_event migration_rate "afr->as migration" 5 4 1994 0
pop_event migration_rate "as->afr migration" 4 5 1993 0
Time T
is measured in generations and increases going into the past (present = 0); It can be fractional so is defined as a double value. Labels are used only to provide human readable output.
You can also usechange_size
to set the population size for all times prior to T
pop_event change_size <label> <pop id> <T> <size for time prior to T>
pop_event change_size "agriculture - african" 5 200 24000
pop_event change_size "agriculture - european" 1 350 7700
pop_event change_size "agriculture - asian" 4 400 7700
pop_event change_size "african pop size" 5 17000 12500
exp_change_size
is used to indicate an exponential change. The following example specifies an exponential population increase in population 1 that started 500 generations ago and ended 50 generations ago, increasing from 1000 to 10000. Prior to 500 generations, the size remains at 1000 (unless changed by another pop_event parameter); more recently than 50 generations ago, the population size is whatever was set by the pop_size command.
# pop_event exp_change_size <label> <pop id> <Tend> <Tstart> <final size> <start size>
pop_event exp_change_size "expansion" 1 50 500 10000 1000
# pop_event bottleneck <label> <pop id> <T> <inbreeding coefficient>
pop_event bottleneck "african bottleneck" 5 1997 .008
pop_event bottleneck "asian bottleneck" 4 1998 .067
pop_event bottleneck "european bottleneck" 1 1999 .02
pop_event bottleneck "OoA bottleneck" 1 3499 .085
# pop_event admix <label> <admixed pop id> <source pop id> <T> <fraction of admixed chroms from source>
pop_event admix "african american admix" 3 1 5. .2
# pop_event split <label> <source pop id> <new pop id> <T>
pop_event split "african to aa" 5 3 7.0
pop_event split "asian and european split" 1 4 2000
pop_event split "out of Africa" 5 1 3500
# pop_event sweep <label> <pop id> <end time> <selection coefficient> <position of
selected allele (as fraction)> <final frequency>
pop_event sweep "selective sweep" 5 10000 .02 .5 .4
Output files
cosi provides two output files for each simulated population, named out.hap-? and out.pos-? (replace the '?' with a specific population index -- 1 european, 3 african-american, 4 asian, 5 african).
The out.hap-? file contains the simulated samples (chromosomes/haplotypes). Each line corresponds to a sample, including ID of the chromosome (sample ID), label of the population this chromosome belongs to, and then the simulated sequence (sites separated by a blank space). The original state and the mutation are represented by "2" and "1" respectively. The file is formatted as below:
Chrom # | Pop # | Simulated sequence | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 |
1 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 |
2 | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 2 | 1 |
3 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 1 | 2 | 2 |
4 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 2 | 1 | 1 | 2 |
5 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 |
The out.pos-? file contains the information for each segregating site (i.e., SNP), includling its ID (starting from 0), ID of the chromosome (i.e., sample ID) the SNP is on, chromosome position of the SNP, and the frequency of each allelic state of the SNP. The file has the following format:
SNP # | Chrom # | Chrom pos | Allele 1 | Freq of A1 | Allele 2 | Freq of A2 |
---|---|---|---|---|---|---|
1 | 1 | 127.3788 | 1 | 0.1154 | 2 | 0.8846 |
2 | 1 | 215.9448 | 1 | 0.0000 | 2 | 1.0000 |
3 | 1 | 229.8352 | 1 | 0.0000 | 2 | 1.0000 |
4 | 1 | 623.0247 | 1 | 0.4231 | 2 | 0.5769 |
5 | 1 | 463.2629 | 1 | 0.1538 | 2 | 0.8462 |