CHAT -- main process (part 1)

30 Nov 2015

CHAT main process -- Part 1

This process is controlled by the CHAT.xml file (as opposed to CHAT_prep.xml) and is completed by nine beans; the chat.properties file continues to be used. The beans differ in their inputs, outputs, job sizes, and the database tables they use to store job information. Here, again, we use a Derby database (a folder named "chatDB" in the working directory) to store job information in multiple tables.

Despite their differences, most of these beans have execution classes (indicated in CHAT.xml) that inherit from the same abstract class (/chat-core/src/main/java/org/renci/chat/interpreter/AbstractManageJobsRunnable), which defines common attributes of these beans and outlines a generic job-management process (runMethods()) consisting of the following steps. Some of these steps correspond to abstract functions defined in the parent class (AbstractManageJobsRunnable.java) and overridden in each child class to fit bean-specific needs (as indicated in parentheses). After a step finishes, it returns a boolean value indicating whether everything is fine and whether it is OK to proceed. A code-level schematic of this flow appears after the list.

  1. Get general job management settings from "chat.properties", such as the amount of memory requested for the JVM (i.e., -Xmx, default 6g), the name of the cluster queue requested, the number of CPU cores requested per node (default 3), the maximum number of jobs in the queue (default 200), target machine initial usage (default 512k?), the walltime for initial and resumed jobs (default increasing as 2, 4, 24 hours per job), and the target job size scale factor (default 1.0)
  2. Check whether the err.txt file exists and terminate the process if it does
  3. Get the chromosome-specific directory path, the truth code file, the job database (chatDB) path, and the list of to-be-processed chromosome IDs
  4. Conduct bean-specific initialization steps to handle the transit between consecutive beans (uniqueStartSteps())
  5. Connect to the database
  6. Check whether any jobs have already been planned (checkIfPriorJobsPlanned()). The program queries the "*JobsPlan" table of the CURRENT bean (again, each bean has such a table) to get a list of missing/undone jobs and a list of completed/done jobs (either can be empty). Currently, if either list is non-empty, the program switches to error recovery mode (sets the boolean variable "errorRecoveryOn" to TRUE). (Later the missing-job list will be checked, and if it is empty the program will output a message saying this step has been completed and then return from the current bean.) The program also queries the "*JobsPlanSub" table of the PRIOR bean (each bean has such a table, see below) to get the names of completed/done jobs for each chromosome. Since the name of a job is the prefix of all job-related files, including the final output file, this amounts to getting the names of all input files.
    • If error recovery mode stays ON and the "reinitialize" boolean variable is FALSE, it means that we are in a continuous sequential CHAT working process controlled by CHAT.xml (every relevant move leaves a record in chatDB) and that the current bean was invoked before but not successfully completed. In other words, the current bean has some planned jobs that are not done, at least according to the database records (regardless of whether the involved files/directories exist). In this case, the next step is to check the existence of this bean's output directory and remake it only if it does not exist (makOutputDirs(false)). Then the program checks the database records against the files in the output directory and updates the missing-job and completed-job lists accordingly (checkForMissingJobs()). At this point, the content of each output file is examined too, using the LSH data standard (see below). Finally, incompatible database records are corrected (markCompletedJobsAsDoneInDb()). [Debugging tip: If you only want to debug a single bean, do this via its main() function, which sets the "reinitialize" variable to TRUE. Any bean can be reinitialized in this way. If you are debugging the entire process and would like to rerun a bean, you cannot just modify the CHAT.xml file by commenting out all previous beans, because every relevant move of the process has left a record in chatDB, and any mismatch will raise a flag that stops the program. If you want to restart from a specific bean, then in addition to commenting out the previous beans, you should also change the value of that bean's "reinitialize" property from "false" to "true" in CHAT.xml.]
    • If error recovery mode is OFF (i.e., there are no job records for the current bean in the database) or this step needs to be reinitialized anyway, the program checks the existence of the input data directory (checkInputDirExist()) and the input files (checkInputFilesExist()). For the latter, the program gets a list of job names (as mentioned earlier) and looks in the input data directory for files that start with these names and end with ".out.gz". Then the program makes an empty output directory for this bean (deleting and remaking it if it already exists) (makOutputDirs(true)), makes the first "job plan" directory, i.e., the one ending with 0 (makeEmptyCurPlanDir(0)), plans the initial jobs (deleting this bean's previous job records in the database and making new records, in which jobs are marked as undone) (planInitialJobs()), creates the .properties and .xml files needed for job running (writeIncompleteJobs()), and eventually runs all jobs in the aforementioned job plan directory with the first walltime setting (runJobs(startHours[0])). All job-running .xml files are created from a template -- chat-core/src/main/resources/genericJob.xml.vm (AbstractManageJobsRunnable.java/makeJobXMLFile(); import org.apache.velocity.*). For each job, the runJobs() function creates a CHATJobsScriptRunnable object (/chat-common/src/main/java/org/renci/chat/common/CHATJobsScriptRunnable.java) and passes the required information to it. This object writes a shell script and starts a running process for each specific job. During the process, another shell script named "runCHATv4.sh" is executed to actually submit the job to your computing environment (i.e., your institution's clusters). You can find an exemplary "runCHATv4.sh" script, written for the RENCI clusters, in the test case folder; inside this script you will find a command line like the one below
      "java -jar $EXE $INFILE"
      where $EXE is the chat.jar file and $INFILE is the aforementioned .xml file for a specific job. chat.jar integrates all classes needed to finish the entire CHAT main process, and each of the nine beans calls its corresponding class (as indicated in the .xml files of that bean's jobs). If executing this command line gives you the error message "no main manifest attribute in ../chat.jar", replace the above line with the following and retry.
      "java -cp $EXE org.renci.chat.Main $INFILE"
      The "runCHATv4.sh" script must be present in your working directory and executable (e.g., via "chmod +x runCHATv4.sh"). After a job is done, the corresponding CHATJobsScriptRunnable object fetches the output, error message, and exit code of that job and prints them out with a job-finish message. Job outputs and error messages are collected using the StreamGobbler class (in the same package as CHATJobsScriptRunnable), because the Java library used to run job processes tends to crash when too many outputs/error messages accumulate.
  7. Next, the program handles undone jobs from the previous running cycle. Basically, it divides undone jobs into sets of subjobs, each with fewer subjects (not sure..). Right now there can be at most 3 cycles, each requesting more walltime per job than the previous cycle. The program first calls the current bean's checkForMissingJobs() and markCompletedJobsAsDoneInDb() functions. Then the missing jobs (i.e., previously failed jobs) are divided into smaller pieces (divideJobs()), because smaller jobs are more likely to get into the queue (doesn't it depend on the resource requirement? CHAT seems to make the same requirement... check the .xml file) -- but only if dividing is possible; some beans (e.g., ParalleLshDataCalcFilterRunnable) do not allow divided jobs. Failed jobs are marked in the database (markFailedJobs()). Then the program makes the next "job plan" directory, i.e., the one ending with the current cycle + 1 (makeEmptyCurPlanDir(cycle + 1)), creates the .properties and .xml files needed for job running (writeIncompleteJobs()), and runs all jobs in that job plan directory with the next walltime setting (runJobs(startHours[cycle])). This step iterates until it uses up all cycles or all jobs are done (i.e., no missing jobs). Jobs are named in a way that indicates the connections between an initially planned job and all its counterparts in later cycles. (Note: Kirk has rewritten this part of the code, so the program can figure out these connections through the slurm jobID...) [Debugging tip: To suit the needs of a specific bean, you can adjust the walltime requirement in each cycle, as well as the number of cycles, via the "startHoursPerJobString" property of that bean in CHAT.xml. If there are undone jobs after all cycles (the program will notify you and write an err.txt file into the working directory), delete the err.txt file and relaunch CHAT, making sure "reinitialize" is FALSE (see above); the program will pick up the undone jobs and try to rerun ONLY these jobs through a new iteration of cycles.]
  8. Indicate whether the current bean was executed successfully.
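To make the flow above concrete, here is a schematic of runMethods() reconstructed from the description. The method names are taken from the text, but the signatures, return types, and control flow are simplified guesses, not the actual source:

      // Schematic only -- hook names from the text, everything else assumed.
      public abstract class ManageJobsFlowSketch {
          protected boolean errorRecoveryOn;       // set by checkIfPriorJobsPlanned()
          protected boolean reinitialize;          // from CHAT.xml, or forced TRUE by a bean's main()
          protected int[] startHours = {2, 4, 24}; // walltime per cycle (startHoursPerJobString)

          public void runMethods() {
              // steps 1-5: read chat.properties, stop if err.txt exists, resolve
              // chromosome paths, run bean-specific start steps, connect to chatDB
              uniqueStartSteps();                  // step 4
              checkIfPriorJobsPlanned();           // step 6
              if (errorRecoveryOn && !reinitialize) {
                  makOutputDirs(false);            // [sic] spelling as given in the text
                  checkForMissingJobs();           // reconcile chatDB records with files on disk
                  markCompletedJobsAsDoneInDb();
              } else {
                  checkInputDirExist();
                  checkInputFilesExist();
                  makOutputDirs(true);             // wipe and remake the output directory
                  makeEmptyCurPlanDir(0);
                  planInitialJobs();
                  writeIncompleteJobs();           // .properties and .xml per job
                  runJobs(startHours[0]);
              }
              for (int cycle = 1; cycle < startHours.length; cycle++) {  // step 7
                  checkForMissingJobs();
                  markCompletedJobsAsDoneInDb();
                  if (missingJobCount() == 0) break;  // all done, no retry needed
                  divideJobs();                    // only for beans that allow division
                  markFailedJobs();
                  makeEmptyCurPlanDir(cycle);
                  writeIncompleteJobs();
                  runJobs(startHours[cycle]);
              }
              // step 8: report success, or write err.txt if jobs are still missing
          }

          // abstract hooks overridden by each bean (argument lists guessed)
          protected abstract boolean uniqueStartSteps();
          protected abstract boolean checkIfPriorJobsPlanned();
          protected abstract void makOutputDirs(boolean wipeFirst);
          protected abstract void checkForMissingJobs();
          protected abstract void markCompletedJobsAsDoneInDb();
          protected abstract void checkInputDirExist();
          protected abstract void checkInputFilesExist();
          protected abstract void makeEmptyCurPlanDir(int cycle);
          protected abstract void planInitialJobs();
          protected abstract void writeIncompleteJobs();
          protected abstract void runJobs(int walltimeHours);
          protected abstract void divideJobs();
          protected abstract void markFailedJobs();
          protected abstract int missingJobCount();
      }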
The first bean, "ParalleRawLshRunnable", puts its job plans in folders named "RawLSHDataPlan0/1/2". These jobs take their input data from the "ChromosomeSpecificData" directory and write their results to the "RawLSHData" directory. The jobs created by this bean are executed by another class, "LSHRawJobRunnable", in the same package as "ParalleRawLshRunnable". You will see this separation of a "job creator" and a "job runner" in the next several beans, and it is the job runner class that does the real trick.

To see how a job runner class (e.g., "LSHRawJobRunnable") works step by step, open "Debug configuration" in Eclipse and go to "Java Application" to create a new application (e.g., "LSHRawJobRunnable"). In the "Main" panel, use what has been automatically detected (Kirk created a main function in each of these .java files). In the "Arguments" tab, fill in the name of one of the job .xml files in a "RawLSHDataPlan#" folder. Open the LSHRawJobRunnable.java file and change Line 477 to point to the same job's .properties file. Also insert a breakpoint at the beginning of the main() function. Then we can launch "LSHRawJobRunnable" in debug mode.

Note that "ParalleRawLshRunnable" performs a very important operation unique to its uniqueStartSteps() function: it creates the earlier-mentioned chatDB database and ALL the tables used in the rest of the CHAT main process, by calling the chat.derby.CHATDBApp.buildDB() function.

I will introduce the different job creator and runner classes as they are subsequently invoked in the CHAT main process controlled by CHAT.xml. But let me start with the big picture, which hopefully helps with navigation in case we dive too deep. As shown in the figure below, the main CHAT process consists of the following steps:
    1. Identify potential long shared IBD haplotypes
    2. Order potential haplotypes by piSMOR metric
    3. Phase each individual
    4. Transitive testing determines the size of the clique -- if person A and person B share a haplotype and person B and person C share a haplotype, do person A and person C share a haplotype?
    5. Form Cliques
    6. Correct for local sharing
    7. Permutation testing: gene dropping – to remove false sharing – significance testing – genotyping error rates and the number of generations that have passed since the founder lived

    Step 1: Identify potential long shared IBD haplotypes

    Now let's look into the first job creator, "ParalleRawLshRunnable". It determines job size based on marker density. The maximum number of markers to be processed in each job is bounded by (int) Math.round((targetJobSizeScaleFactor * TARGET_RAW_JOB_SCALE_CONSTANT * effecencyScaleFactor) / ((Num_case + Num_control) * (Num_case + Num_control))), where by default TARGET_RAW_JOB_SCALE_CONSTANT = 2.5E7 and targetJobSizeScaleFactor = effecencyScaleFactor = 1. In addition, the LDU distance between two consecutive markers in each job is also bounded; the default distance is 3 LDU map units. For each job, the program creates a .properties file with the following information: chromID, whether it is an X chromosome, the output file name (for analysis results), the indices of the markers processed in this job ("listIndexPos"), the working directory, the location of the chat data files (the chromosome-specific data directory), the name prefix of the chat data files, the genotype coding file (i.e., truth matrix), the location of the raw LSH data directory, forgiveCtBase, forgiveRateLength, assumedPriorProbIBD, and postProbCutOffToEstimateProbIBD. Some of these items will make more sense when I introduce the corresponding job runner class, "LSHRawJobRunnable".
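    As a quick illustration of that bound, here is a small sketch (the wrapper class and method are hypothetical; only the formula and constants come from the text):

      public final class RawJobSizeSketch {
          static final double TARGET_RAW_JOB_SCALE_CONSTANT = 2.5E7;

          static int maxMarkersPerJob(int numCase, int numControl,
                                      double targetJobSizeScaleFactor,
                                      double effecencyScaleFactor) {  // [sic] spelling from the text
              double pairs = (double) (numCase + numControl) * (numCase + numControl);
              return (int) Math.round(
                      (targetJobSizeScaleFactor * TARGET_RAW_JOB_SCALE_CONSTANT * effecencyScaleFactor)
                              / pairs);
          }

          public static void main(String[] args) {
              // 500 cases + 500 controls with the default scale factors of 1
              System.out.println(maxMarkersPerJob(500, 500, 1.0, 1.0));  // prints 25
          }
      }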

    "LSHRawJobRunnable" detects IBD haplotypes from every pair of samples/subjects (i.e., pairwise comparison; order does not matter) through every marker position assigned to a job ("listIndexPos" in the job property file). The IBD probability of two individuals is positively correlated with the number of consecutive corresponding markers on which the two indiviudals have compatible genotypes (diplotype, not haplotype data). The genotype data are coded using the truth table mentioned earlier, by which their compatibility is determined too. As you can see, what matters here is the length of sharing/compatibility/similarity, so IBD segments are referred to as long shared haplotypes (LSH) in this bean and the rest of CHAT beans. A potential obstacle in estimating IBD, especially concerning recent demography, is that random sequencing and phasing errors will tend to break up LSH. Thus, CHAT allows for certain amount of accumulated genotyping error before it stops checking the compatibility/similarity of the rest markers (on the chromosome segment to be processed in a job). More specifically, accumulated genotype error is bounded by a threshold that considers both how many incompatible sites have been accumulated so far ("forgiveCtBase" in the property file) and how fast they are accumulated ("forgiveRateLength" in the property file). Incompatable genotypes allowed in each direction from seed position is forgiveCtBase + (Used SNPS/forgiveRateLength)-1. For any pair of individuals, the LHS identifying process starts at a given marker and works towards two sides; this marker will be referred to as starting or seed marker later on. On one side (called "p" side), markers are checked in a descending order interatively (markerID--) until it hits the beginning of this chromosome segment or the accumulated number of errors has achieved the predefined threshold. On the other side (so-called "q" side), markers are checked in an ascending order (markerID++) until it hits the end of this chromosome segment or the accumulated number of errors has achieved the predefined threshold (same threshold as the p side). The checking process returns an SampleToSampleGenotypeConparisonDataElement object (also referred to as sample-to-sample CHAT bean) that stores the result of one comparion between two individuals that starts from a particular marker (i.e., information on an LHS identified via the comparsion), including the pair of individuals being compared (their IDs), the starting marker index, the pair's disease affecting status (0 - both controls, 1 - one case one control, 2 - both cases), the pair's gender (indicated by specific values defined in /chat-common/src/main/java/org/renci/chat/common/CHATUtil.java, setPairClass() function), the smallest marker ID that has been checked, the biggest marker ID checked, and LDU map distance between these two markers. The program creates a CHATSimilarityMatrixData object for each marker, storing all SampleToSampleGenotypeConparisonDataElement objects (related to different pairs of individual samples) obtained from comparisons starting at that marker, along with that marker's physical position and the number of case-case, case-control & control-control pairs involved in those comparison.
    Since the number of pairs increases quadratically with the number of individuals, pairwise comparisons on a big data set (many samples or high marker density) may take a long time, and the similarity matrix can be huge. Thus, a more efficient strategy is needed to process the data. Common strategies include breaking down the original chromosome segment and/or the sample set (into subsets). CHAT uses an algorithm that first tries to build a null model -- the distribution of IBS segment lengths around a marker -- using the genotype data of control individuals, and then uses the fitted distribution to identify IBD segments, which are expected to fall into the tail of the distribution. More specifically, to estimate the IBD probability of every detected shared haplotype, the program first fits a gamma distribution to the lengths (measured by LDU map distance) of all shared haplotypes around a marker, for every marker. This wiki page explains why we do so. The empirical data CHAT uses to fit the distribution are retrieved from the aforementioned CHATSimilarityMatrixData objects. Each marker has such an object storing information on the LSHs detected around it. A CHATSimilarityMatrixData object contains a list of SampleToSampleGenotypeConparisonDataElement objects (i.e., CHAT beans), each of which records the LDU distance of the haplotype shared by a specific pair of individuals (by default both should be controls). As long as there are more than 200 data points in total for a marker (each data point is the LDU distance of a shared haplotype obtained from a single pairwise comparison), we can go ahead and fit the length distribution of LSHs around that marker. CHAT implements a Bayesian fitting process with several parameters to be predefined and/or adjusted over iterations (they will make more sense in later discussion):
    • the prior IBD probability ("assumedPriorProbIBD"), that is, the chance that any detected LSH is due to IBD. It is currently a constant with default value 0.005. Alternatively, we could estimate it from the data [future improvement].
    • the threshold on the posterior IBD probability ("postProbCutOffToEstimateProbIBD"). If an observed LSH has a posterior IBD probability exceeding this threshold, CHAT considers it due to IBD (default 0.9, which means 10% false positives).
    • the probability threshold CHAT uses to exclude extreme LSHs from the gamma distribution ("extremeProbExcludeNullDist"). The gamma distribution is a partially bounded distribution that theoretically extends from 0 to plus infinity. While we intend to fit a gamma distribution to IBS segments, they are inseparable from IBD segments in the data. An extremely long segment is more likely to be IBD rather than IBS, and therefore should be excluded to prevent it from distorting the fitted distribution. The value of this parameter is calculated as 1 - 1/(100*numOfIncludedControls), which is something like 0.99..9.
    Bayesian fitting proceeds iteratively. In each iteration, the program filters out "suspicious" IBD segments (those with IBD probability > "extremeProbExcludeNullDist") and uses the remaining data points to fit the gamma distribution for IBS segments, until there are no more "suspicious" IBD segments. Thus, the "extremeProbExcludeNullDist" threshold determines the number of iterations. The gamma distribution fitted at the last iteration is used for later computation. Distribution fitting applies a third-party gamma-distribution fitting function, umontreal.iro.lecuyer.probdist.GammaDist.getInstanceFromMLE(double[] c, int n_c), c being the set of data points used to estimate α and λ, and n_c being the size of this data set. The gamma distribution has several different parameterizations. This function creates a new instance of the gamma distribution with shape parameter α > 0 and scale parameter λ > 0 (these two parameters also become more precise over iterations) and density function f(x) = λ^α x^(α-1) e^(-λx) / Γ(α) for x > 0. You can tell from the name of the function that it uses the maximum likelihood method to fit the gamma distribution (i.e., to figure out the most likely values of the parameters α and λ). As mentioned in the above wiki page, there are several alternative methods, yet MLE is often the most efficient in the sense that its results are as precise as possible for a given amount of data. Using the fitted gamma distribution and the prior IBD probability ("assumedPriorProbIBD"), we can calculate the IBD probability of each input LSH based on its length as follows. Let S be a variable indicating the length of an LSH (i.e., of the run of contiguous markers with identical alleles). We would like to infer the probability that an observed LSH of length c is IBD, i.e., P(IBD|S>c) = ?
    
    By Bayes' theorem, P(IBD|S>c) = P(IBD & S>c)/P(S>c)
    Since P(IBD & S>c) is just P(IBD) (an IBD LSH is assumed to always satisfy S>c, i.e., P(S>c|IBD) = 1; see below), we have
    P(IBD|S>c) = P(IBD)/P(S>c)
    
    P(IBD) is the assumed prior probability that any detected LSH is due to IBD. Denote P(IBD) = Z and P(S>c|notIBD) = Y.
    
    Since P(IBD) + P(notIBD) = 1 and P(S>c|IBD) = 1, we have
    P(S>c) = P(S>c|IBD)*P(IBD) + P(S>c|notIBD)*P(notIBD) = 1*Z + Y*(1-Z) = Z + Y*(1-Z)
    P(IBD|S>c) = P(IBD)/P(S>c) = Z/(Z+Y*(1-Z))
    
    As mentioned, Z = 0.005. Since Y = P(S>c|IBS), 1 - Y is P(S<=c|IBS), which is the value of the cumulative distribution function (CDF) of the fitted gamma distribution, P(S<=X|IBS), at X = c. CHAT evaluates the CDF using the cdf function of the same Java library mentioned earlier. Now, for an input LSH of length X, the current gamma distribution gives its posterior IBD probability P(IBD|S>X), which is then compared with the aforementioned posterior IBD probability threshold ("postProbCutOffToEstimateProbIBD") to determine whether this LSH should be treated as IBD. [future improvement: if we allow the prior IBD probability to self-adjust over iterations, its value at iteration t+1 = (number of estimated IBD LSHs + number of extremely long LSHs)/(total number of detected LSHs) at iteration t. The denominator is equivalent to the total number of pairwise comparisons.] The length threshold used to filter out extremely long (and thus likely IBD) shared haplotypes is conveniently set to Double.MAX_VALUE at the first iteration, so no segments are actually excluded. In each later iteration (2nd, 3rd, ..., t-th), this value is recalculated based on the gamma distribution fitted in the previous iteration (1st, 2nd, ..., (t-1)-th, respectively), so that it can actually do something. The calculation is shown below.
    
    Recall that we have predefined the PROBABILITY threshold for exclusion ("extremeProbExcludeNullDist") to be 0.9...9. This value is in essence P(IBD|S>X). Now we need to figure out the LENGTH threshold X. Denoting P(IBD|S>X) = Q, we have
    Q = Z/(Z+Y*(1-Z))
    Y = (Z/Q-Z)/(1-Z)
    Recall that 1-Y gives P(S<=X|IBS), the cumulative distribution function (CDF) of the fitted gamma distribution. So we can calculate X by feeding 1-Y into the inverse of the CDF, which CHAT obtains through the inverseF function in the same Java library mentioned earlier.
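    Putting both directions of the calculation together, here is a minimal sketch using the SSJ GammaDist class named above (the toy data and the wrapper class are my own illustration, not CHAT's code):

      import umontreal.iro.lecuyer.probdist.GammaDist;

      public final class IbdPosteriorSketch {
          public static void main(String[] args) {
              double z = 0.005;  // assumedPriorProbIBD
              // toy LDU lengths of shared segments around one marker
              double[] lengths = {0.8, 1.1, 0.6, 2.3, 0.9, 1.4, 0.7, 1.0, 5.2, 0.5};
              GammaDist nullDist = GammaDist.getInstanceFromMLE(lengths, lengths.length);

              // forward: P(IBD | S > c) = Z / (Z + Y*(1-Z)), with Y = 1 - CDF(c)
              double c = 5.2;
              double y = 1.0 - nullDist.cdf(c);
              System.out.println("P(IBD | S > c) = " + z / (z + y * (1.0 - z)));

              // backward: given Q = extremeProbExcludeNullDist, recover the length cutoff X
              double q = 0.999;                       // stands in for 0.99..9
              double yCut = (z / q - z) / (1.0 - z);  // Y = (Z/Q - Z)/(1 - Z)
              double xCut = nullDist.inverseF(1.0 - yCut);
              System.out.println("exclude segments longer than " + xCut + " LDU");
          }
      }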
    The fitting process is conducted by the FitCHATSharingDistributionGamma class (chat-common/src/main/java/org/renci/chat/fitChatSharingDistribution/FitCHATSharingDistributionGamma.java). Using the marker-specific gamma distributions fitted at the last iteration, CHAT re-calculates the posterior IBD probability of every detected LSH. If the probability > 0.5 (by default; you can find the default values of most aforementioned parameters either in the chat.properties file or in chat-common/java/org.renci.chat.Constant.java), that LSH is considered IBD. Results of the fitting process are organized by three classes: LSHData, LSHDataElement, and FittingData (all in the package /chat-common/src/main/java/org/renci/chat/common). An LSHData object corresponds to a raw LSH job planned by ParalleRawLshRunnable and contains two lists. One list holds LSHDataElement objects, each storing information on an LSH around a marker (including its posterior IBD probability), for all markers handled in the job. The other list holds FittingData objects, each storing information on a gamma distribution fitted for a marker. Each job also writes an output file in the output directory (../RawLSHData/Chr_#/filename) with the following content.
    
    
    Row 1: chromosomeID	"number of fitted gamma distributions"	"number of LSH"
Row 2-N: parameters of the fitted gamma models, one per row, with the following 9 fields separated by commas
    	0: "Gamma"
    	1: LDU map position of the marker corresponding to the fitted gamma distribution
    	2: sex pair class (0 for non-X chromosome)
    	3: mean
    	4: standard deviation
    	5: assumed prior IBD probability (0.005)
    	6: estimated IBD probability = (number of IBD LSHs + number of extremely long LSHs)/total number of detected LSHs at the last iteration
    	7: fitted value of parameter α
	8: fitted value of parameter λ
Row N+1-M: long shared haplotypes (LSH), one per row, with the following 14 fields separated by tabs
    	0: LSH ID
    	1: indexmapseed (index on the LDU map)
    	2: indexsample1 (ID of the first subject involved in the pairwise comparison)
    	3: indexsample2 (ID of the second subject involved in the pairwise comparison)
    	4: subject pair class (0 control-control, 1 case-control, 2 case-case; even though earlier we use only control-control pairwise comparison results to fit gamma distributions, this output file contains all comparison results including case-case and case-control)
	5: sex pair class (1=Male-Male, 2=Female-Female, 3=Male-Female, 0=unknown-?)
    	6: indexmapp (the end of the LSH on the p arm)
    	7: indexmapq (the end of the LSH on the q arm)
    	8: ldu (LDU map distance of the LSH)
    	9: isvalid
    	10: p value (posterior IBD probability of the LSH)
    	11: fixindexp (at this step its value is -9)
    	12: fixindexq (at this step its value is -9)
    	13: pismore (at this step its value is -9.0)
    

    Step 2: Merge raw LSHs

    This step merges raw LSHs that start at different seed points into a single LSH. The second bean, "ParalleMergeRawLshRunnable", calls "LSHMergeRawForSubJobRunnable" to merge all LSHs identified in Step 1 that contiguously overlap and involve the same individual samples. The output files of the previous step (in the "RawLSHData" directory) become the inputs of this step, whose outputs go to the "LSHDataMerge" directory. The "ParalleMergeRawLshRunnable" class plans the LSH merging jobs, while the "LSHMergeRawForSubJobRunnable" class actually runs them. Job plans are put in folders named "LSHDataMergePlan0/1/2", depending on the cycle. Each merging job has a property file that includes the names of all raw LSH jobs (i.e., Step 1 jobs) and the IDs of the individual subjects to be handled in this merging job. The program starts by examining the input and output directories. If both are fine, it imports the result file of every raw LSH job (using the name of each job) to assemble a list (named "curFitData") of all gamma distributions fitted in Step 1 and a list (named "curLSHData") of all LSHs found in Step 1. If any result files are missing, the program writes the names of the missing jobs to the error file and stops. The retrieved LSHs are reordered first by the ID of the first subject in the pair, then by the ID of the second subject, and lastly by the map index of the seed marker (see the output format above; fields 1, 2, and 3 of each LSH record). After reordering, LSHs obtained from the same pair but different seed markers (recall that Step 1 detects LSHs from different start markers) become contiguous. For example, the first several LSHs, shared by subjects 0 and 1 with different seed markers, may look like below:

    After reordering the raw LSHs, the program revisits them one by one in this order to cull overlapping LSHs. Assume we encounter two LSHs, L1 and L2, in sequence. The map indices of their p-arm end markers (i.e., the leftmost markers) are P1 and P2, while the map indices of their q-arm end markers (i.e., the rightmost markers) are Q1 and Q2 (see the output format above; fields 6 and 7 of each LSH record). CHAT considers L2 to overlap with L1 if any of the following conditions is met:

    Using the culled overlapping LSHs, the program then tries to configure one LSH for this pair of subjects. It picks out three special LSHs: (a) the LSH whose seed marker is closest to the median of all seed marker indices, which should land in the "middle" of the intersection of all overlaps; (b) the first and the last culled LSHs. Since errors accumulate along the chromosome starting from the seed marker towards both sides (if you recall), the first culled LSH should have accumulated the fewest errors on the left side among all LSHs, and the last culled LSH the fewest on the right side. The final LSH for this pair of subjects gets the "middle" LSH's seed marker, the first LSH's left-side (p-arm) end marker, and the last LSH's right-side (q-arm) end marker, as the blue segment in the figure below. Notably, it is possible that after trimming, the right end marker ID is smaller than the left end marker ID. If this happens, the LDU length of the final LSH is set to zero. Otherwise, the length is the distance between the two end markers on the LDU map.

    A few implementation notes. Raw LSHs that share one individual subject are assigned to one job; job size is measured by the number of raw LSHs. A job may exceed the predefined size threshold in order to include all raw LSHs related to one individual, and the last job may be smaller than the predefined size. One job must include all pairwise comparisons conducted between the reference individual (indexsample1) and all other individuals, so job size decreases? (1-2,3,4...; 2-3,4,...; 3-4,...; ...; (n-1)-n?). LSHDataElement extends SampleToSampleGenotypeConparisonDataElement; indexSample1 is always smaller than indexSample2. The chromosome of one of the compared individuals is used as a base to merge all LSHs across markers. Jobs is a map of <jobName, list of individual subjects>; this list is written into the job property file as "listIndexSub". Only overlapping LSHs obtained by comparing the same pair of individuals are merged into one single LSH. The merged LSH has the closest-to-middle position (with regard to the seed markers of all overlapping LSHs) as its index marker. Its p/q (i.e., left/right) end is the p/q end of the overlapping LSH that has the rightmost/leftmost seed marker. In other words, the merged LSH is like the core of an LSH cloud. Its length, in terms of LDU distance, is the LDU map distance between the new ends. Non-overlapping LSHs, even if they belong to the same pair of subjects, are not merged. Assume the length of a merged LSH is c. Recall that every marker has an associated gamma distribution fitted in Step 1 and that each of the overlapping LSHs is considered an IBD LSH given the distribution at its seed marker. Thus we can calculate the IBD probability of the merged LSH using each of these seed-marker gamma distributions as P(IBD|Gi) = Z/(Z + (1-Fi)*(1-Z)), where Fi = P(S <= c | IBS) is the value of the CDF of the gamma distribution at seed marker i evaluated at c, so that 1-Fi is Y = P(S > c | not IBD, i.e., IBS), and Z is the assumed/predefined prior probability of IBD. The largest P(IBD|Gi) becomes the pv of the merged LSH, and the seed marker whose gamma distribution produces this largest value becomes the new seed marker of the merged LSH (a sketch follows below). Each merging job (check: there should be more than one merging job) writes an output file in the same format as the output files of Step 1 (i.e., the input files of the current step), except that the allLSH list, which previously contained all raw LSHs obtained from comparisons between the involved subjects, is replaced by a new LSH list in which overlapping/redundant LSHs are replaced by their merged LSH.
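    The pv update just described, as a small sketch (the helper class is hypothetical; the formula is the one given above):

      import umontreal.iro.lecuyer.probdist.GammaDist;

      public final class MergedPvSketch {
          /** Returns the pv of a merged LSH of length c (in LDU), given the gamma
           *  distributions fitted at the seed markers of its overlapping raw LSHs. */
          static double mergedPv(GammaDist[] seedGammas, double c, double z) {
              double best = 0.0;
              for (GammaDist g : seedGammas) {
                  double fi = g.cdf(c);                        // Fi = P(S <= c | IBS)
                  double p = z / (z + (1.0 - fi) * (1.0 - z)); // P(IBD | Gi)
                  best = Math.max(best, p);  // the maximizing seed becomes the new seed marker
              }
              return best;
          }
      }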

    Step 3: Adjust the ends of identified LSHs

    The third bean, "ParalleLshDataAdjustRunnable", puts job plans in folders named "LSHDataAdjustPlan0/1/2". The input directory is "LSHDataMerge", the output directory of the previous bean, and the output directory is "LSHDataAdjust". This class calls "LSHDataAdjustJobRunnable" to run all the planned jobs, trimming the ends of each LSH. The idea is to trim off a bit of each end, on the assumption that the outermost stretch of apparent sharing is actually due to chance. Each job handles no more than 1000000 LSHs (parameter LSH_DATAFILE_SIZE_LIMIT). The trim amount is controlled by qpFixMultiplier (default 0.25), a multiplier applied to the expected sharing under the null; recall that the mean of the gamma distribution is E[X] = α/λ. For each LSH obtained from the last step (merged or not), this step trims both ends (the p and q arms) to further reduce the number of sites that may be identical by chance. Generally, the program reaches out from each arm to the next marker (say p1 and q1), gets the mean length (say L1 and L2) of IBS LSHs around either marker from their corresponding gamma distributions, and then, starting from either marker, say p1 (or q1), turns back searching for the next marker, say p2 (or q2), whose LDU map distance from p1 (or q1) is no larger than 1/4 L1 (or L2). This second marker at either arm (p2 or q2) becomes the new end of the LSH (a sketch follows below). In this step, the fixIndexP and fixIndexQ fields of LSHDataElement are assigned the values of an LSH's p and q ends after trimming; in contrast, the indexMapP and indexMapQ fields indicate the LSH's p and q ends before trimming. The output files of this step have the same format as those of the last step, except that the LSH records in this step's output files start with 1 (in the last step they start with 0) and the fixIndexP and fixIndexQ fields have specific values instead of the placeholder -9. Usually the sharing information is more accurate in the middle of an LSH and less accurate at the two ends, because the middle is originally covered by more overlapping raw LSHs from more pairwise comparisons (i.e., genotype information from more samples; data density? data sufficiency?).
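    One plausible reading of the trimming rule, as a sketch (hypothetical helper, not the actual LSHDataAdjustJobRunnable code; α/λ is the gamma mean and qpFixMultiplier the 0.25 default named above):

      public final class EndTrimSketch {
          /**
           * Walk inward from an LSH end while the LDU distance from the old end
           * stays within qpFixMultiplier * (alpha/lambda), the scaled mean IBS
           * length at the marker just beyond that end.
           * step = +1 trims the p (left) end inward, -1 trims the q (right) end.
           */
          static int trimEnd(double[] lduMap, int oldEnd, int step,
                             double alpha, double lambda, double qpFixMultiplier) {
              double limit = qpFixMultiplier * (alpha / lambda);
              int m = oldEnd;
              while (m + step >= 0 && m + step < lduMap.length
                      && Math.abs(lduMap[m + step] - lduMap[oldEnd]) <= limit) {
                  m += step;
              }
              return m;  // candidate value for fixIndexP (step=+1) or fixIndexQ (step=-1)
          }
      }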

    Step 4: Calculate LSH metrics, including Pi-SMOR

    The fourth bean, "ParalleLshDataCalcFilterRunnable", puts job plans in folders named "LSHDataFilterPlan0/1/2". The input directory is "LSHDataAdjust", the output directory of the previous bean, and the output directory is "LSHDataFilter". This class calls "LSHDataCalcFilterJobRunnable" to run all the planned jobs, calculating metrics about each LSH, including Pi-SMOR. Recall that DataSet.java handles SampleElement (subject) and MapElement (map position) data, while LSHData handles LshElement data.

    First, a recode map is built for the genotype data at each marker and stored in DataSet. If the genotype is homozygous with two major alleles, no matter how it was coded before (which you can find out by looking at the truth table), the new code is 0. If it is homozygous with two minor alleles, the new code is 1. If it is heterozygous, the new code is 2. An unknown genotype is represented by "?". To this end, every marker m has a recode map with at least one entry <index of m, <old code of unknown, ?>>. If the major allele of this marker is not unknown, there is a second entry <index of m, <old code of homozygote on major allele, '0'>>. Likewise, if the minor allele is not unknown, there is a third entry <index of m, <old code of homozygote on minor allele, '1'>>. If both the major and minor alleles are not unknown, there is a fourth entry <index of m, <old code of heterozygote, '2'>>. The returnMajorMinorAlleleAndCts() function in chat-common/DataSet.java goes through the coded genotype data (one string; not two strings of haplotype data) of all subjects and returns a char array containing, in order, the characters of the major and minor alleles. It also fills a double array (passed to it by reference) with, in order, the frequencies of the major and minor alleles in the sample set.

    For each final LSH, a Pi-SMOR statistic is calculated to measure the likelihood that this LSH is inherited identical by descent (i.e., from a common ancestor). Basically, for every LSH obtained from the previous steps, the program scans it (from its leftmost marker, fixIndexP, to its rightmost marker, fixIndexQ) and sums up a specific LOD (log of odds) score for each marker that is covered by the LSH and included in the analysis. The LOD score for each marker is calculated as log10(P(observed genotypes at that marker | the two individuals are IBD = 1)/P(observed genotypes at that marker | IBD = 0, i.e., IBS)). For autosomes, the odds ratios for the different observed genotype combinations are shown in the table below (00 means the individual is homozygous for the major allele at the marker; 11 means homozygous for the minor allele; 01 means heterozygous, which can be 01 or 10; p0 is the major allele frequency; rows and columns are labeled "original genotype (recoded value)"):

    	original (coded)	00 (0)	01 (2)	11 (1)
    	00 (0)	1/p0	1/(2p0)	0
    	01 (2)	1/(2p0)	1/(4p0(1-p0))	1/(2(1-p0))
    	11 (1)	0	1/(2(1-p0))	1/(1-p0)

    The cumulative sum is Pi-SMOR, and the final result is saved in the "piSmore" field of that LSH (LSHDataElement). According to the table, the odds ratio is 0 when the two genotypes are homozygous for different alleles; the log is then negative infinity and would drag down the overall Pi-SMOR. This situation, however, should not happen within a truly shared haplotype, in which every marker should be at least compatible (homozygotes in different allele states are not compatible).
If we believe the LSH is truly IBD, such incompatibility can only be caused by genotyping error. Thus, instead of introducing negative infinity into the sum, CHAT replaces it with a predefined genotyping-error penalty of -2. Given a rare variant (i.e., p0 close to 1), most of the observed genotypes will be 00. For such samples the corresponding odds ratio (1/p0) is close to 1 and its log close to 0, meaning that this genotype (homozygous major) tells us nothing interesting. However, a pair of samples carrying that rare variant (genotype 01 or 11) will have a huge odds ratio and log value, greatly increasing Pi-SMOR. Thus, a large Pi-SMOR value indicates the existence of a rare or even unique LSH, which is exactly what we are interested in. Moreover, since Pi-SMOR is a cumulative sum, its value also grows as the LSH gets longer. In this way, Pi-SMOR reflects both the uniqueness and the length of a putative long shared haplotype. Given the same genotype (00, 01, or 11), the odds ratio of a homozygous compatible genotype (00, 00/11, and 11, respectively) is always larger than that of a heterozygous compatible genotype (01). Thus, the Pi-SMOR statistic can overcome the problem that an individual with a long string of heterozygous markers has the potential to share a haplotype with many other samples over that segment, which would then become a false-positive LSH. I don't know the history behind the name "Pi-SMOR"; it may have something to do with SMOR (standardized mortality odds ratio), but I'm not sure.
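    In code, the per-marker contribution could look like this (a toy sketch, not CHAT source; the codes and table entries are those given above):

      public final class PiSmorSketch {
          /** LOD contribution of one marker; gA/gB are recoded genotypes
           *  (0 = hom major, 1 = hom minor, 2 = het), p0 the major allele frequency. */
          static double lodAtMarker(int gA, int gB, double p0) {
              double q0 = 1.0 - p0;  // minor allele frequency
              double oddsRatio;
              if (gA == 0 && gB == 0) oddsRatio = 1.0 / p0;                    // 00 vs 00
              else if (gA == 1 && gB == 1) oddsRatio = 1.0 / q0;               // 11 vs 11
              else if (gA == 2 && gB == 2) oddsRatio = 1.0 / (4.0 * p0 * q0);  // 01 vs 01
              else if ((gA == 0 && gB == 2) || (gA == 2 && gB == 0)) oddsRatio = 1.0 / (2.0 * p0);
              else if ((gA == 1 && gB == 2) || (gA == 2 && gB == 1)) oddsRatio = 1.0 / (2.0 * q0);
              else return -2.0;  // 00 vs 11: incompatible homozygotes -> genotyping-error penalty
              return Math.log10(oddsRatio);
          }
          // Pi-SMOR for an LSH = sum of lodAtMarker over the markers fixIndexP..fixIndexQ.
      }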

    Step 5: Calculate Pi-SMOR threshold statistics

    The fifth bean, "CalculatePISMORThresholdsJobsRunnable", calculates summary statistics of Pi-SMOR sharing: the LDU floor and the median, mean, and standard deviation of Pi-SMOR near that floor. The results are stored in the chatDB database and also written to a file named PISMOR.out, organized as a table with 5 fields: Chr, LDUFloor, medianPISMORNearlduFloor, meanPISMORNearlduFloor, sdPISMORNearlduFloor. Recall that when identifying raw LSHs, we estimated their P(IBD) and put the value in the "pv" field, then retained those with pv > 0.5 (the default value of which parameter?) for further consideration. Here, when calculating the LDU floor, we use only LSHs (merged and adjusted) whose raw version has P(IBD) >= a probability threshold defined by rawPostProbCutOff (default 0.5; the possible range is thus [0.5, 1]). First, the length of each of these LSHs (in LDU) is rounded to the closest integer. Next, LSHs of the same rounded length from a single chromosome are put into one bin, identified by their common length. A map named lduCt is created to store these bins, with entries <LSH length represented by a bin, size of the bin>. Then the bins are sorted in ascending order of the length they represent; this amounts to ordering all LSHs from shortest to longest. The smallest length becomes LDUFloor. The program goes through the ordered bins one by one until the cumulative number of LSHs (adding up bin sizes along the way) > 5000 (predefined by PISMOR_NEAR_LDU_FLOOR_SAMPLE_SIZE), and records the LDU length represented by the bin where it stops (denoted curBin). Then the program goes through every LSH of that chromosome again and records the Pi-SMOR value of each LSH meeting the following criteria:
    • its pv > 0.5 (see above)
    • its length (in terms of LDU) + 1 > LDUFloor
    • its length < curBin + 1
    • its PI-SMOR > 0
    Basically, these criteria help us gather the Pi-SMOR values of the roughly 5000 shortest LSHs that are nevertheless still likely to be IBD given their pv and Pi-SMOR. Finally, the program calculates and outputs the mean, median, and standard deviation (with a divisor of n, not n-1) of these Pi-SMOR values. The next step is the most fascinating one in this version of CHAT...
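    Before moving on, here is a minimal sketch of the bin-and-count logic just described (the helper is hypothetical; the 5000 cutoff is PISMOR_NEAR_LDU_FLOOR_SAMPLE_SIZE):

      import java.util.Map;
      import java.util.TreeMap;

      public final class LduFloorSketch {
          /** Returns {LDUFloor, curBin} for one chromosome's filtered LSH lengths. */
          static long[] lduFloorAndCurBin(double[] lshLduLengths, int sampleSize) {
              TreeMap<Long, Integer> lduCt = new TreeMap<>();  // ascending by rounded length
              for (double len : lshLduLengths) {
                  lduCt.merge(Math.round(len), 1, Integer::sum);
              }
              long lduFloor = lduCt.firstKey();  // shortest rounded length
              long curBin = lduCt.lastKey();
              long cum = 0;
              for (Map.Entry<Long, Integer> e : lduCt.entrySet()) {  // walk bins short -> long
                  cum += e.getValue();
                  if (cum > sampleSize) {  // e.g. 5000
                      curBin = e.getKey();
                      break;
                  }
              }
              return new long[] {lduFloor, curBin};
          }
      }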