Today I’m starting this post to record my thoughts while looking into Kirk’s self-written software – CHAT. Although I don’t know (a) how often I will update, (b) how useful the content will be, or (c) how long it will take to complete this post (or this series of posts), it is pretty exiting that after almost 5 months I’m finally here! During the past 6+ years Kirk has been working on CHAT alone in his spare time. Now Charles and me are helping him publicize this software. To Kirk, CHAT is a fun product. The original idea came to his mind when he glanced at some hanging charts in his office with light through the paper. He got the simulated annealing strategy after three glasses of wines in a dinner meeting. He cracked the Maths for haplotype smor scores during a holiday break. He figured out how to parallelize the jobs in a brown-bag lunch with Chris (who BTW is another genius scientist in our team). He wrote codes when he needed a break at late night in the middle of writing a grant. And these are just what happened during my short time around here. If you think that things come easily for Kirk, you are wrong. I observed how he worked on haplotype smore. He made mistakes and he got frustrated, like normal people. But he never gave up. Thus, to me, CHAT is an inspiration in many ways.
"To get good at something you've got to love what you do and you've got to put in the work. You've got to think about it every day. You've got to fight to understand every little bit. There is no way around it. -- Ben Haggerty
To start with let me give you a brief introduction of CHAT. CHAT stands for Convergent Haplotype Association Tagging. It is a computer program that we are developing to detect rare moderate penetrant mutations from GWA data sets. GWAS data are mostly obtained from SNP chips, which tend to omit genetic sites where variant alleles arose too recently to spread to a significant fraction of the human population, i.e., rare variants. Although probably none of these omitted sites are particularly informative on its own, together they harbor a vast amount of additional genetic information. This hard-to-detect variation is a clear candidate for “missing heritability” in disease genetics, where known genetic risk factors usually fail to account for the full heritability of complex diseases. DNA is passed from parents to children in continuous blocks between recombination sites; when two sequences share a rare allele, it is likely that the allele was inherited from a recent common ancestor along with a block of surrounding DNA containing some sites that appear on the SNP chip. Thus, it is still possible to infer the position of a rare variant even when it does not appear on a chip being used to to gather data.
The core of CHAT is a graph-theory based algorithm that systematically searches for subsets of individuals that share long range haplotypes. The underlying assumption is that a small (but significant) subset of unrelated individuals have the same disease because they have inherited a moderate penetrant mutation from a distant common ancestor. Since the algorithm is computationally intensive (an NP-complete problem), we have been cooperating with the Renaissance Computational Institute (RENCI) and using their high-performance computing clusters. How powerful these machines are? “Consider the resources are limitless.” said Erik Scott, a Senior Research Software Developer at RENCI.
Previous work showed the success of CHAT in constructed and simulated data sets as well as real data sets. The program has been applied in several independent projects to search for the genetic basis (i.e., causal mutations) of and treatment targets for diseases such as Parkinson’s disease, stroke, alcoholism, multiple sclerosis, lymphoid cancer, and lung cancer and smoking. These studies found that CHAT is more likely to be successful with phenotypes that have lower population prevalence. The next step is to test CHAT using additional real data sets as they become available. Our goal is to help establish the baseline rate for detecting clusters of individuals with long shared haplotypes and to establish collaboration to resequence several genes for a few subjects.
When I first looked into the code of CHAT, my impression is that Kirk put into a lot of efforts to make this software fault tolerant when working with clusters. CHAT tries to break up the jobs to use the number of CPUs that are expected. If a job fails it will be restarted and is often broken up into smaller jobs. CHAT keeps track of finished jobs and tries to resume remaining or failed ones. Most output files have internal line count information, so that a corrupted file can be automatically detected. Files are usually checked once for completeness but also frequently checked for whether they are where they are supposed to be. The above settings are accomplished through a properties file and can be changed in some in the longest step (i.e., phasing). Any step can be reinitialized (organized as multiple beans in the properties file), but then all subsequent steps then need to reinitialized. The intermediate data is stored in a derby database.
Install CHAT 1.0 -- for debuggers
CHAT is developed using eclipse and maven, so eclipse is the best environment for debugging. The executable file is chat.jar. First install maven and eclipse. You need to have a home repository for maven . For example, mine looks like ‘/home/linly/.m2/repository’. Kirk’s source code is stored in renci SVN repository. You can open the SVN repository perspective of Eclipse, log into the above address and checkout CHAT. Do not close Eclipse. Open an SSH session and ‘cd’ to wherever you put the checked out program, i.e., the checked-out chat root directory. For example, mine is ‘/projects/sequence_analysis/vol4/CHAT_simGWAS/workspace/chat-1.0/’. Run
mvn clean install eclipse:eclipse
in the command line. Go back to Eclipse, refresh you project in Eclipse with f5. Then go to Window -> Preferences -> Java -> Build Path -> Classpath. Create a new variable called ‘M2_REPO’ and the value is
Second, create a jar with all the dependencies includes (Note: with Eclipse Mars, this step could be done by building a Maven project). Type
mvn clean install assembly:assembly
at the shell command line (make sure you are in the chat root directory). You will find the .jar file in the /target sub-directory of the root directory, probably named as “chat-1.0-SNAPSHOT-bin.jar”. Copy this .jar file to a place where it can be found by the envrionment variable $Path (i.e, make sure it is “in the PATH”). For example, I keep all execution files/application programs in the /bin folder of my home directory (/home/linly/bin), so I can use something like
cp target/chat-1.0-bin.jar ~/bin/chat.jar
Note that you need to make the .jar actually executable by
chmod +x chat.jar
Run CHAT in batch using shell scripts
I first built chat.jar as mentioned above (note: if any changes have been made to sub-project source codes, for example, workspace/chat-core/src, you need to update corresponding source code files in workspace/chat1.0/chat-core/src and/or workspace/chat1.0/chat-common/src before making the new chat.jar. Othewise no change will be reflected in chat.jar. You can probably use a jar decomplier (I use this one) to make sure the new .jar file has whatever updates you’ve made.) Then I use the following shell script to create a Chat working directory for every simulated Chat input file that I’ve made (the input file must be stored in /ChromosomeSpecificData subdirectory):
#!/bin/bash
# invoke a function defined below to create a CHAT working directory for every CHAT input file and move the file into that directory
CHAT_INPUT_DIR_NAME="ChatInput_tagSNP"
# define make_dir_move_File function
# $1 is the current directory in which the function will create CHAT working directories
function make_dir_move_file()
{
count=0
#The -n test tests whether a string is not null. It requires that the string be quoted within the test brackets.
while [ -n "$1" ]
do
#create the directory including intermediate directories that do not yet exist
mkdir -p ${1}/Rep${count}/ChromosomeSpecificData
#return 2 is used to signal an error and stop the processing. If a cp fails, you don't want to keep on trying with the other files, because it's likely to be a serious error (out of disk space or missing target directory)
cp ${1}/GenotypeChrom_08_Rep${count}.gz ${1}/Rep${count}/ChromosomeSpecificData/GenotypeChrom_08.gz || return 2
let count+=1
if [ ${count} -gt 24 ]
then
break
fi
done
}
# give find an absolute path to start with, so that it will print absolute paths
CURRENTDIR=`pwd`
for D in `find ${CURRENTDIR} -type d -name ${CHAT_INPUT_DIR_NAME}`
do
`make_dir_move_file ${D}`
done
Then I copy needed files to each of these working directory using the shell script below. I store these files in a folder named “CHATResources” beforehand. I also stored the shell scripts that will be used while CHAT is running (i.e., ldMap.sh, runCHATv4.sh) in the bin folder of my home directory (DO NOT forget to “chmod +x” them!!!)
#!/bin/bash
# invoke a function defined below to copy CHAT running recources to all /Rep# directories
CHAT_INPUT_DIR_NAME="ChatInput_tagSNP"
SOURCE_DIR="/projects/sequence_analysis/vol4/CHAT_simGWAS/newSimGWASData/CHATResources/"
# define make_dir_move_File function
# $1 is the current directory in which the function will create CHAT working directories
function copy_file_multiple_dir()
{
count=0
#The -n test tests whether a string is not null. It requires that the string be quoted within the test brackets.
while [ -n "$1" ]
do
#return 2 is used to signal an error and stop the processing. If a cp fails, you don't want to keep on trying with the other files, because it's likely to be a serious error (out of disk space or missing target directory)
cp ${SOURCE_DIR}/Chromo_8.txt ${1}/Rep${count} || return 2
cp ${SOURCE_DIR}/simCodes.txt ${1}/Rep${count} || return 2
#rm ${1}/Rep${count}/ldMap.sh || return 2
#rm ${1}/Rep${count}/runCHATv4.sh || return 2
let count+=1
if [ ${count} -gt 24 ]
then
break
fi
done
}
# give find an absolute path to start with, so that it will print absolute paths
CURRENTDIR=`pwd`
for D in `find ${CURRENTDIR} -type d -name ${CHAT_INPUT_DIR_NAME}`
do
`copy_file_multiple_dir ${D}`
done
(update: I integrated the job submission package Kirk had developed for CHATv1.2 into CHATv1.0, so I no longer need the two shell scripts, i.e., ldMap.sh and runCHATv4.sh) Then I put the following shell script in each of the working directories, so that to run a CHAT process for a working directory I can simply navigate to the directory (by “cd”) and then use “sbatch” to launch the following shell script
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem-per-cpu=10240
cd /projects/sequence_analysis/vol4/CHAT_simGWAS/newSimGWASData/SNP_12170677_1.0E-5_1.0E-4/Case1000_Control5000_GRR10.0/ChatInput_tagSNP/Rep14/
java -Xmx6g -cp /home/linly/bin/chat.jar org.renci.chat.Main /projects/sequence_analysis/vol4/CHAT_simGWAS/newSimGWASData/CHATResources/CHAT_prep.xml
sleep 2
java -Xmx6g -cp /home/linly/bin/chat.jar org.renci.chat.Main /projects/sequence_analysis/vol4/CHAT_simGWAS/newSimGWASData/CHATResources/CHAT.xml
The first java command does the preparation while the second java commond does the core work. Note that they use the same chat.jar file but different instruction files (chat_prep.xml and chat.xml). Another thing worth mentioning is that sometimes the job may fail with an error saying “var/spool/slurmd/job15423229/slurm_script: fork: Resource temporarily unavailable”. The reason seems to be the maximal number of processes in the system has been reached (CHAT does tend to launch many threads itself, for internal jobs). If this is the case, try launching the shell script again later. The job scripts and another file important for running chat (chat.properties) were created using Java (you can find the code doing this at /home/linly/workspace/jobChecker). Note that there is a “workDir” field in chat.properties file that needs to be changed to the corresponding chat working dir. Also, the properties files used to run CHAT on simulated data and real data differ in some parameter values (e.g., genotyping error threshold), so be careful.
Handle additional chromosomes
CHAT is designed to handle mutiple chromosomes in parallel whenever applied and it implicitly assumes users will provide a list containing all chromsomes to be analyzed at the very beginning. However, I encountered a use case (in my simulation studies) in which I needed to add more chromosomes to an existing dataset. Then, how can we allow for new analyses while keeping the results of previous analyses given that CHAT uses specific result and job directories (the names of these directories are predefined)? The following solution (see the paragraphs after for explanation) is based on CHATv1.0 (mainly the makeOutputDirs() function in corresponding Parallele*.java files in the chat-core package). In the future we may need to modify the code to better assist this kind of requests. Remember, the whole point of going through this turmoil is to avoid rerunning LDU mapping anc chat set building jobs which can be extremely time intensive. For small datasets (e.g., one with 500 cases and 500 controls) it is really not worth the pain, but for large datasets such as 2000 cases and 2000 controls or more, you probably need it.
- In the chat.properties file, add "reinitializeStepOutput=false", set "projectChromosomeList" to a list file WITHOUT alrealdy analyzed chromosome IDs. Modify the list inside the "chatTaskExecutor" bean in CHAT.xml file, comment out the "ParalleLshDataAdjustRunnable" step and all those after it. Run CHAT major process.
- Wait the last step to finish and do the following. In the chat.properties file, set "projectChromosomeList" to a list file WITH alrealdy analyzed chromosome IDs. In the aforementioned list in CHAT.xml file, comment out all the steps before "ParalleLshDataAdjustRunnable" and after "CalculatePISMORThresholdsJobsRunnable". Run CHAT major process. Make sure the "reinitialize" field of "ParalleLshDataAdjustRunnable" is false.
- Wait the last step to finish (the last line of the slurm*.out file is "INFO: Running...org.renci.chat.interpreter.CalculatePISMORThresholdsJobsRunnable" but this step usually finishes really quick) and do the following. In the chat.properties file, set "projectChromosomeList" to a list file WITHOUT alrealdy analyzed chromosome IDs. In the aforementioned list in CHAT.xml file, comment out all the steps before "ParalleCHATSetRunnable" (and if you choose to do Step 4 later, also comment out all the steps after "ParalleleLSHCHATSimulationRunnable" for now, then follow the instruction of Step 4). Run CHAT major process. Make sure the "reinitialize" field of "ParalleCHATSetRunnable" is "false" and the "reinitialize" field of "ParalleleLSHCHATSimulationRunnable" is true
- [optional] Because a new set of permutation tests was generated in the previous step, you can rerun "ParalleleLocalAdjustChatSetRunnable" to make previous and current results of different chromosomes comparable. To do so, wait the last step to finish and bring back old chromosome IDs; you need them for the rerun!! Then Comment out every step before "ParalleleLocalAdjustChatSetRunnable" and make sure its "reinitialize" field is "true". Run CHAT major process. Note that you cannot reuse previously generated permutation test files (not by setting the "reinitialize" field of that step to "false" or by commenting out that step. I've tried both and had no luck).
Basically, we can provide a list of additional chromosome IDs ONLY (avoiding all previously analyzed chromosomes) as a new file, replace the old list file indicated in chat.properties (the "projectChromosomeList" field) with this new file, and then run CHAT from scratch. To be more specific, the LDU map construction at the CHAT preparation step should be reinitialized, but since the program will reuse rather than recreate the job and result directories (if they exist), previous results will not be wiped out (each chromosome has its own subdirectory named after its ID). However, the map file (curLDUMap.txt) will be rewritten and not including information on previous chromsomes. Thus, if you would like to keep the LDU map of previous chomosomes, save the old map file in a different name before running any new analysis. As for the main analysis, the first step (i.e., raw LSH identification) needs to be reinitialized (i.e., set the "reinitialize" property in CHAT.xml to be "true"), but the program only wipes out the result and job subdirectories of chromosomes in the list. Therefore, as long as we exclude the IDS of previously analyzed chromosomes from the list, their corresponding subdirectories can sustain. The following steps from LSH merging to LSH filtering will not wipe out job and result directories when they are not reinitialized (the default setting; In the CHAT.xml file that controls the main process of CHAT, the "reinitialize" properties of these steps are all preset to "false"). Or you can ask the program not to erase exising result directories even when "reinitialize" is set to "true" but adding "reinitializeStepOutput=false" to the chat.properties file.
However, there is another problem in the step of LSHDataAdjust. Here the program will check every file in the result directory including those old files and once it finds out that some results have no records in the current job management database (chatDB), it will pop out an error message and terminates. Thus, we will have to include previous chrom IDs back in at this point (I know)! To do so, we need to stop the program right before this LSHDataAdjust step, by commenting out this step and all following steps in CHAT.xml. Next we change the chromosome list to include old chromosome IDs. Then we launch the LSHDataAdjust step with "reinitialize=false", commenting out previous finished steps and stopping before building chat sets. Take out old chromosome IDs from the list and start from the step of chatSet building with its "reinitialize" set to true (make sure there is "reinitializeStepOutput=false" in the chat.properties file). As for the step of building chat sets, under the result directory ("chatSet") there will be four subdirectories each for a model. Inside each model directory you will see the results of all chromosomes, including previous ones and new ones, but in separated files with chromosome ID in the file name. And as above, as long as you do not "reinitialize" this step, the directories remain.
The outputs of the "LSHCHATSimultion" step are permutation test files, which are same for all chromosomes, so we are safe. Actually you can reuse the files generated in previous analyses for new analyses (set "reinitialize" of this step in chat.xml to "false"). The "LSHLocalAdjust" step put all chromosomes' intermediate files in one single subdirectory for each model and distinguishes them by filenames, as the "BuildChatSet" step does, and again, the result directories containing previous results will be reused as long as the "reinitialize" property is "FALSE". However, sometimes you will have to reinitialize this step, given that the last step (LSHCHATsimulation) needs to launch hundreds of jobs and tracking every one of them is difficult so missing often happens. If we redo the last step, this step also needs to redo, as the last step will create a different set of permutation tests every time it is relaunched, due to its stochastic nature. If you end up redoing this "LSHLocalAdjust" step, you probably need to save previous results in a temporary directory beforehand. The last step "LSHGenomeAdjust" is the same as "LSHLocalAdjust" regaring output directory making. However, notably, this step does adjustment across all chromosomes by integrating the local adjustment results (in the output directory of "LSHLocalAdjust") of every chromosome whose ID is on the aforementioned list. In other words, the results obtained from previous analyses will not be considered. Moreover, every time this step is executed, the outputs will be rewritten anyway, as the filenames are the same and the file writer is set to not allow for appending.
If you need to analyze each chromosome separately (i.e., do adjustment across different bins in a chromosome but not different chromosomes), you will have to trick the program. Basically, you would run the whole process without stop. After the run finishes successfully, go to the last job planning directory of genome-wide adjustment (i.e., the final step) named "LSHCHATSimulationGenomePlan_#". In that folder you will find several job scripts, each for a model. Make several copies for one job script; each copy will become the job script for one chromosome so name them accordingly. Divide the values of one specific field in the original job script and put them in corresponding copies. I wrote some code for this purpose, which you can find at "/projects/sequence_analysis/vol4/CHAT_simGWAS/workspace/CHATGenomeAdjStepSeparatedbyChrom/". After that, you can lauch all these new job scripts. Note that before you lauch the jobs, you probably need to add the IDs of previously analyzed chromosomes back into the chrom list file.
CHAT inputs
In the the chat root directory there is a test example folder called ‘TestDebug’ that can be copied to a workspace as a test use case. This folder contains a genotype file that has just the data for a region on chromosome that contains the lrrk2 locus. 48 of the subjects in the pedigree file have a LRRK2 mutation that causes parkinsonsin’s disease. The file lrrk_samples_save list the subjects. Not all of them are included in the analysis as CHAT excludes closely related individuals (I will introduce how this is done later). Before running the test example, make sure you have done the following:
- Import (by right click and choose import from the drop-down menu) an existing project into workspace for each of the sub directories. (i.e., chat-common, chat-prep, chat-core etc). The classes built in these directories are what that are actually executed.
- Open the 'chat.properties' file in the 'TestDebug' folder and change the path for the workdir to the new path. For example, I put the 'TestDebug' folder in '/projects/sequence_analysis/vol4/CHAT_simGWAS/workspace/chat-1.0', so the second line of the 'chat.properties' file is correspondingly "workdir=/projects/sequence_analysis/vol4/CHAT_simGWAS/workspace/chat-1.0/TestDebug".
The “TestDebug” folder contains the minimal input files CHAT needed (specified at the beginning of the “CHAT.properties” file). If you prefer your own test data, please also create and format these files (see below for detailed instructions). There is no need to name your files using same extensions (e.g., .lgen, .map, or .sample) though. You can simply use “.txt” or nothing. In the source code of CHAT (written in Java), DataSet.java inside chat-common contains functions (readChatDataSet, readLgenDataSet) that are used to parse these input files. Now let me introduce these files one by one.
- A raw genotype file (also referred to as the lgen file) with tab or comma delimited columns:
chromID,pos,rs,subjectID,genotype[,CONFIDENCE]
A lgen file corresponds to a single chromosome and each line in the file describes a subject's genotype of a particular marker SNP on that chromosome. The first five columns (required) provide detailed information on a marker: its located chromosome, physical position, rs number, ID of the subject (individual), and genotype. The genotype data should be in single char format (e.g. 0,1,2 for AA, AG, GG) or the IUPAC format (e.g., A,R,G for AA, AG, GG). See the code file below for more information. The last column (optional) gives a float-type confidence value that can be used to filter the data. This file SHOULD be gzipped (end with .gz) to save space (required by the program). - A LDU map file: tab or comma delimited columns:
chromID,pos,rs,ldumap (-999999 for unknown place holder),in use bit
The first three columns have the same meanings as in the lgen file. The fourth column (ldumap) indicates a particular marker's LDU (linkage disequilibrium unit) map position. The last column indicates whether or not this record will be used in later analysis: 0 means no use and 1 means use.
- A sample file (also referred to as the fam file): tab or comma delimited SubjectID, DX (disease affection) {0 as unknown,1 as unaffected/control,2 as affected/case},Sex {2 female,1 male, 0 unknown},in use bit {0,1}
- A code file (also referred to as the truth table or truth matrix) indicates genotype coding strategy used in the data set. The first line gives the codes that represent homozygous geneotypes (e.g., A,C,G,T for AA, CC, GG, TT). The second line gives the codes of all possible genotypes (e.g., A,C,G,K,M,R,S,T,W,Y,X, if the IUPAC format is applied). These are also the horizontal and vertical elements of the truth table/matrix (referred to as truth vectors). Such a table checks compatibility. Regarding the IUPAC format, an "R" genotype means the two haplotypes that form this genotype is "A" and "G". Thus, an "R" is compatible with an "A" and a "G" (denoted as R = A/G). Similarly, Y=C/T, S=G/C, W=A/T, K=G/T, M=A/C. In order to indicate that A is compatible with M, R, and X, "1" is assigned to the matrix cell AM, MA, AR, RA, AX, and XA. In contrast, "0" indicates incompatibility. Thus, a IUPAC code file should look like below (note that the placement of comma is critical for CHAT to correctly parse the truth matrix):
,A,C,G,T ,A,C,G,K,M,R,S,T,W,Y,X A,1,0,0,0,1,1,0,0,1,0,1 C,0,1,0,0,1,0,1,0,0,1,1 G,0,0,1,1,0,1,1,0,0,0,1 K,0,0,1,1,0,1,1,1,1,1,1 M,1,1,0,0,1,1,1,0,1,1,1 R,1,0,1,1,1,1,1,0,1,0,1 S,0,1,1,1,0,1,1,0,0,1,1 T,0,0,0,1,0,0,0,1,1,1,1 W,1,0,0,1,1,1,0,1,1,1,1 Y,0,1,0,1,1,0,1,1,1,1,1 X,1,1,1,1,1,1,1,1,1,1,1
CHAT preparation
The first step/module is to convert input files into CHAT accepted format. It is controlled by the CHAT_prep.xml file in the same folder. The file defines all the spring beans used to accomplished this step. Users can skip a particular bean by commenting out the corresponding ref bean line in the threadList of the XML file using <!– statement –>.
Note: (a) In most of the steps called in CHAT.xml and CHAT_prep.xml there is a Main that can be edited so that the step can be run one at a time. I’ll show you how to do that in details later.
(b) The step of building LDU map in this module is done by a third-party executable file called “ldmapper1”, which you can find in the ‘Notes_Resources’ folder in the root directory. Move it to wherever convenient for you and add its path to the environment variable PATH, so that this file can be located. Also, building LDU map requires enough memory space, so you may want to grab a compute node of the cluster to run this module on. If you are going to run the module using Eclipse debug perspective, you need to grab an interactive node. Take a look at this post on how to do that.
In Eclipse, to run using the debug perspective, go to run>debug_configerations. Right click on “java application” and a new Dialog box will pop up allowing you to create a new configuration for debug. Name the new instance CHAT_prep. CLick the browse button next to projects and select chat-commons. Next to the Main Class text box select Main-org.renci.chat. The change the project to chat-prep. (Main is in commons and chat-prep is dependent on it). Then click arguments tag. In the Program arguemnt type ‘CHAT_prep.xml’; in the vm arguments you can add something like ‘-Xmx6g’. Change the working directory to wherever you have put the ‘TestDebug’ folder (At the bottome of the Arguments tab, there is a place allowing you to modify Working directory. Click on “Other” and specify the path). Click apply. Now you can click the ‘Debug’ button to start running the app or click the ‘close’ button to do this later.
Step 1
The first three beans check the sanity of input files (users can scroll down in the “CHAT_prep.xml” file to see and modify the properties of each bean). “CheckLgenStructureRunnable” conducts sanity check on the lgen file. If there are any problematic genotype records (i.e., lines) in the file, the process will be terminated so that users can review detected problems, which are recorded in a “problemGenotypes.txt” file in the same folder. If no problem is detected, there won’t be such a file (created and then deleted by the program). All “filtered” genotype records (obtained by stripping problematic and redundant lines) are outputted into a file named by adding a prefix “Filtered” to the the original sample lgen name. This file has the same format as the old file. If the user is willing to ignore (instead of trying to modify) all identified problem lines, he or she can set the “projectRawLongDataFileName” field in CHAT.properties (which is the field that indicates the lgen file to be used) to the file that contains all filtered genotype records (i.e., the one whose name starts with “Filtered”). Three other files can also be created during the process, although users can choose not to write any of them by commenting out corresponding property lines or setting the value to “” in the CHAT_prep.xml file. Among the three files, “GenotypeContainsChrom.txt” stores involved chromosomes (ordered by chromID), each per line. (b) “GenotypeContainsSamples.txt” stores involved subjects (ordered by subjectID), each per line with the format:
subjectID,0 (place holder for unknown disease affection state),0 (place holder for unknown sex),1
(c) “GenotypeContainsSNPs.txt” stores involved marker SNPs (ordered by marker position), each per line with the format:
chromID,pos,rsnumber,-77777.0 (place holder for unknown LDU map position),1
The second bean “CheckMapStructureRunnable” conducts sanity check on the LDU map file (required information and their number format and redundancy). If there are any problematic map elements in the file (one element as a line), the process will be terminated so that users can review detected problems (recorded in a “ProblemMapRecords.txt” file in the same folder). This bean mainly checks whether each record (i.e., each line) in the file has a unique rs number (the 3rd column/field), i.e., whether each SNP uniquely maps to a location. Two maps (data structure) are also created connecting each SNP (identified by its rs number) to its located chromosome (the 1st column/field) and physical position on that chromosome (the 2nd column/field) respectively. All “filtered” (i.e., non-problematic) map elements are sorted using their rs numbers and outputted into a file named by adding a prefix “FilteredSorted” to the old map file. This file has the same format as the old file. If the user is willing to ignore (instead of trying to modify) all identified problem lines, he or she can set the “projectMap” field in CHAT.properties (which is the field that indicates the map file to be used) to the file that contains all filtered map records (i.e., the one whose name starts with “FilteredSorted”). Additionally, the bean can create a “MapContains
The third bean “CheckFamStructureRunnable” conducts sanity check on the sample file (required information and their number format and redundancy). Any problematic sample element in the file (as a line) will be assigned -9 to its “include” “dx” and “sex” fields, so that it won’t pass validity test. All problematic samples are outputted to a separate file named “problemSample.txt”; given no problem, this file will be automatically deleted. This bean mainly checks whether each record (i.e., each line) in the sample file has a unique sample ID (the 1st column/field). Duplicated sample lines are stripped and put in a file named by adding a prefix “dupSamples” to the original sample file name. All non-redundant sample elements are outputted into a file named by adding a prefix “FilteredSorted” to the old sample file; the list is sorted in the order of the “include” “dx” “sample ID” and “sex” fields subsequently. This file has the same format as the old file. If the user is willing to ignore (instead of trying to modify) all identified problem lines, he or she can set the “projectSamples” field in CHAT.properties (which is the field that indicates the map file to be used) to the file that contains all filtered map records (i.e., the one whose name starts with “FilteredSorted”). If users prefer, this bean will check to see whether the sample file contains all the samples in the “GenotypesContainsSamples.txt” file outputted by the first bean. Missing samples will be outputted to a “SamplesIdInGenotypeNotInFamilyFile.txt” file that has the same format as the previous file but the “sex” and “dx” fileds of these sample records are changed to unknown and the “include” field to 0 (not included). Then the process will be terminated so that users can review these mismatched samples.
Step 2
Then the filtered input information is further checked, reorganized, and filtered. This is accomplished by two beans. The first bean “MakeChromSpecificChatFromLgen” collects information from the above output files and rearranges them for each chromosome. The records in the filtered lgen file are distributed into different txt files, each of which corresponds to one chromosome (“
chromID,pos,rs number,subjectID,genotype,confidence
. These txt files are stored in a sub-directory of the work directory (i.e., TestDebug) named "TempChromosomeSpecificRawData". If users are only interested in some but not all of the chromosomes in the data set, they can specify the chromosome(s) of interest beforehand by putting them in a separate file (see the "projectChromosomeList" field of the CHAT.properties file). Then only the information of specified chromosomes will be processed. Users can also filter the raw lgen records based on their QC scores (the "confidence" field). This bean also checks whether (a) every SNP has corresponding records in both the filtered lgen file and the filtered map file, and (b) each sample has a unique genotype at every spot (identified by its rs number) on one chromosome. In the latter case, given a single chromosome, there should not be multiple different records regarding a subject's genotype at the same spot. This kind of errors in the data set are referred to as "subject-snp discordant genotype" and will be outputted to a file named "Discord_
subjectID,rs number of the SNP,occurrence of this subject-SNP pair in the data set
When there is a discordant genotype, it will be coded as unknown (e.g., "X" in IUPAC). The major task of this bean is to use write input data files for CHAT. Each CHAT data file is converted from a previous "
chromID,pos,rs number,LDU map position,include or not (i.e., used or not),all subjects' genotypes of this SNP
The last field is a string of genotypes. After writing all CHAT data files to the "chromosomeSpecificData" folder, the program will delete the "TempChromosomeSpecificRawData" directory to save space.
The second bean “FilterChromSpecificChat” applies four filters on CHAT data files (also referred to as chromosome specific files). Users can skip any of the filters by changing its corresponding value to a negative number or by commenting out the property line. Users can also specify the order in which these filters will be applied through the values of order properties (i.e., properties named as “***Order”) in CHAT_prep.xml. Before filtering, this bean also conducts regular sanity check. Any SNP record (starting from the 5th line in a CHAT data file) with more or less than 6 fields (see above) will be considered invalid and read into the memory as the following
””,-9,’‘invalid”,-99999 (a predefined code meaning excluding this SNP from LDU map”),0 (i.e., not to be included)
This bean also identifies and keeps records of SNPs and samples/subjects with more than 50% missing genotypes (IUPAC code “X”) in the data set (referred to as “almost empty” SNPs and samples). The filtering introduced below are conducted on the rest of the data (samples and SNPs that have less than 50% missing genotypes). Now let’s get to the “real business” of this bean, i.e., the four filters implemented by four functions.
- Hardy-Weinberg Equilibrium (not applicable for the X chromosome): For each SNP, the filter function first reads through samples to find out which codes (in the truth vector) represent the TWO homozygotes and one heterozygote regarding that SNP (denoted as a1, a2, and a12 in Java implementation; notably, the program assumes that all SNPs are biallelic. It will give an error message and stop once encountering a multi-allelic SNP). Then the filter function counts the occurrences of a1, a2, and a12 in all control samples (i.e., value of the "dx" field is not "2") and applies an exact test on the HWE of that SNP. The exact test for HWE has been mentioned in many articles, such as Emigh, T. H., 1980. A comparison of tests for Hardy-Weinberg equilibrium. Biometrics 36 627–642, or A note on exact tests of Hardy-Weinberg equilibrium. If the test result cannot exceed a user-defined threshold, the SNP will be excluded from further analysis (set the "include" field to be 0)
- Monomorphic SNPs: this filter function removes SNPs whose minor allele frequency (MAF) is smaller or equal to a user-defined threshold. A genotype code compatible with both a1 and a2 will be counted twice. (Note: the default MAF threshold is 0.01; since our simulated GWAS datasets contain only common variants (MAF >= 0.05) obtained using Haploview.tagger algorithm, all SNPs should be able to pass this filter.)
- Missing SNPs: this filter function remove any SNP record (from the map attribute of a DataSet object) whose number of missing genotypes exceeds a user-defined threshold (by changing the "include" field from 1 to 0)
- Missing samples: this filter function remove any sample record (from the sample attribute of a DataSet object) whose number of missing genotypes exceeds a user-defined threshold (by changing the "include" field from 1 to 0)
After the filtering, old chromosome specific data files (i.e., CHAT data files) are moved to a Backup subdirectory of the ChromosomeSpecificData directory. New CHAT data files will replace old corresponding ones (same name and format but with modified “include” fields; SNPs that have been filtered out will have include == 0) in this directory.
[We created some simulated data sets to test the functionality of CHAT after this point. These data sets mimicked the format of chromosome specific data sets. They were placed in
Step 3
The next step is to build an LDU map (a genetics linkage disequilibrium map measured in LD units (LDU), see one of my earlier post) for every chromosome based on CHAT data files. Recall that each CHAT data file contains SNP information of one single chromosome (1 out of 23), one SNP record per line starting from the 5th line. (Check how to build a LDU map in this article); only female samples are used to make X-chromosome LDU map). You can change names of the directories and files involved, but you have to do it consistently for all three beans introduced below that implement this step. At the time being we rely on an enternal executable program called “ldmapper1” written by Dr. Andrew Collins to construct linkage disequilibrium (LD) maps.
The first bean “makeLDMaps” intends to prepare and manage cluster-running jobs around this enternal software. Kirk used a database object (implemented by Aphache Derby (Java DB) that supports JDBC and SQL as programming APIs) to store and manage all the LDU mapping jobs. He distinguished two types of Derby database objects in his code: one for LDU jobs (here) and the other for CHAT jobs (discussed later). The two databases will have different folders in the working directory, named as “lduDB” and “chatDB” respectively. See chat-common/src/main/java/org/renci/chat/derby/CHATDBApp.java for the sepcific SQL commands used for database initialization and management. The database for LDU jobs is named as “lduDB” and its contents are stored in a “lduDB” subfolder of the working directory. SQL statements are constructed as strings and executed for certain purposes. The program will break the entire chromosome into multiple segments and assign a job for each segment. To do so, an LDUJobsPlan table is built inside the database containing all these jobs; each record in the table has several fields such as job name, startMapIndex and endMapIndex of the chromosomal segment, number of SNPs on that segment (not every site is a SNP), and a field indicating whether the job is “done”. The program allows users to define the values of several parameters:
- The type of samples that will be used to build the LDU map. Possible choices are "case", "control", "unknown", "case_control", "control_unknown", or "all". If users did not choose anything, the default strategy using by the program is "control", i.e., building the map on only control samples (no disease). If users did not use any of the above parameter values, the program will return an error message.
- The length of each chromosome segment and the overlapping area between two adjacent segements (by default 500 and 100 respectively). The size of the overlap should be smaller than the size of the segment
Currently the program allows for at most three job-running “cycles” with the next cycle lauched only if the previous one fails (you can tune the number of cycles and the walltime for each cyle in chat_prep.xml. Each cycle will create its own folder for inputs (e.g., /LDUinput0) and outputs (e.g., /LDUmap0). Thus you will see /LDUmap0 and /LDUmap1 if the first cycle fails and the second one succeeds, or all three in the worst case. Inside the input folder, one subfolder will be made for each chromosome (named as “Chr_#”, e.g., Chr_08). All jobs for a specific chromosome will be placed in this subfolder, each named as “Chr#_job#.txt.gz”, e.g., 08_0001.txt.gz). The output file of a specific job will be placed in the output folder and the hierarchy is similarly organized. For example, the output file of the job /LDUinput2/Chr_03/03_0012.txt.gz is /LDUmap2/Chr_03/03_0012.txt.map). On planning a batch of jobs, the program will read the CHAT data file created in the previous step and the truth model, which will be used to parse the data file. The LDMappingJobs class will be used to create job info; you can find its source code at ../chat/common/LDMappingJobs.java. An important job property is the chromsomal segment to be processed in the job, delineated by the starting and the ending base pairs’ map ID (startMapIndex and endMapIndex respectively). The program uses a “zigzag”-like approach to determine these two values for each job, as conseutive jobs share an overlap region (see the planInitalJobs function in ManageLDMapperRunnable.java). The input files, located at the input folder (e.g., /LDUinput#), are modified from the CHAT data file to include markers and samples with the “include” field = 1; in addition, the samples are selected based on predetermined sample inclusion strategy (by default only controls). The input files contain coded marker positions (as positions in a Kb physical map, for example, the marker at 123 will be coded as 0.123) and sample genotypes (as homozygote1, homozygote2, or heterozygote). The map building process is parallelized, with one thread created for processing one chromosome (corresponding input and output folders as arguments). When the current cycle is done, the program will check the completeness of each job through its output file (.map). The major contents of this file is a table that maps markers within the segment region from their physical positions to LD map positions. The table has 5 columns indicating Line ID (1,2,3…) and then the name (as Chr#_physical position), physical poition (i.e., base-pair position, converted from integer to float by dividing it by 1000), LD map position (a continuous number), and MAF of a marker. The number of lines in the table is supposed to be the same as the number of markers in the segment, although the program allows for a small discrepancy. In other words, there can be a small number of markers missing and we still consider the job has been “completed”. Complete jobs will be removed from the job database, while incompleted or missing ones will be redefined (to prepare for a rerun at the next cycle) and then added to the database in their new forms (use SQL update statements). Note that the program will count the number of updates so far and close the database every 1000 updates to “committ” these updates; otherwise the database will run more and more slowly (according to Kirk’s experience). When a job is redefined, the new cycle ID will be attached to the job name and the chromosomal segment to be processed will be extended at both ends for half of the original segment size. The idea is to make a failed job bigger to cover more of the LDU map. We also assigned more running time to a job at a newer cycle (currently it proceeds in the order of 2 hrs, 4 hrs, and 24 hrs).
The second bean “readLDMapFilesConstructMap” will create a unified LDU map by combining individual LDU maps obtained from the previous bean. The program will produce a map data structure, of which the key is chromosome ID and the value is another map structure with the entry (physical position of a marker, a MapElement object containing all physical & LDU map information of this marker). Before executing this bean, the program will check whether there is an error file (“err.txt”) in the working directory and notify the user if such a file exists. Basically, the user can choose to ignore the errors or stop to handle them. Recall that in the previous bean each job produces a separate LDU map based on a specific segment and the segments in consecutive jobs are overlapped. To make a unified LDU map, the program will find the middle point (called “transition”) of an overlap (see ./chat-prep/…interpreter/ConstructLduMapRunnable.java/findTran() function). Then it uses the former job’s results for the first half of the overlap (till the transition position) and the latter job’s results for the second half (starting with the position right after the transition). Finding the middle point can be complicated, as the markers are not equally distributed along a segment and there can be missing data (so that “include” = 0). It is not an arbitrary decision that we break up overlap areas at their middle point, as middle points often show higher accuracy in calculating LDU map positions. After re-defining the boundaries of individual LDU maps, overlaps may still exit but should be controlled < 10 markers (the program will check this and stop otherwise). Individual LDU maps created by different jobs have their own start and end points (different on physical positions but all originate from 0). On building the unified LDU map, we need to stitch them together, which means the LDU values of every marker on the next map need to be increased. The amount of increment can be calculated by focusing on the aforementioned “transition points”, i.e., the new start points of the original jobs (except for the first job). The LDU increment of each marker on map j (j >= 2) is the LDU value of map j’s start point on the previous map (i.e., map j-1) minus the LDU value of map j’s start point on map j.
After obtaining the unified LDU map, the program will estimate the LDU positions of missing markers (their LDU values temporarily coded as -88888.0), if any. This process is called exptrapolation (see ../chat-common/…/chat/common/CHATUtil.java extrapolate() function) and it distinguishes between p and q arms: extrapolation on the p arm takes an ascending order while on the q arm takes a descending order. (All human chromosomes have 2 arms – a short arm and a long arm – that are separated from each other only by the centromere, the point at which the chromosome is attached to the spindle during cell division. By international convention, the short arm is termed the “p arm” while the long arm of the chromosome is termed the “q arm.”) Ususally a chromosome is marked out from its p arm to its q arm on a genetic map; thus, genes on the p arm have smaller LDU values than genes on the q arm. Next, the program will normalizes the resultant unified map by making it start at zero: all markers will shift a certain amount of LDU distance x, which is the LDU value of the start point. Lastly, the program will check the resultant unified map, making sure the LDU positions of consecutive markers are indeed in an ascending order. The unified map will be output to the working directory in a file named “curLduMap.txt”. Each line is a marker record with 5 comma-deliminated fields: chromID, physical position, rs name, ldu position, and whether included.
The last bean of this step (“replaceLDUMap”) is just to copy the CHAT data file (i.e., chromosome specific data file) and replace the last second column of each marker record with corresponding ldu map position of that marker.
Step 4
This step is achieved by three beans sequentially. The first bean “makeSampledGenotypes” tries to find informative SNP markers for each non-sex chromosome based on the genotype data of all individual samples. It starts with a buildRecodeMap() function, which reads each CHAT input file (i.e., chromosome specific data file) and checks allele frequency for each included marker.
The second bean “planComparePairwiseSamplesUsingSampledGenotypes”
The third bean “compareSum”