CHAT main process -- Part 2
Step 6:
The sixth bean “ParalleleCHATSetRunnable” calls “BuildChatSetRunnable” to help with its jobs. “ParalleleCHATSetRunnable” builds the regions over which LSH cliques will be formed, while “BuildChatSetRunnable” computes the cliques for one region. This step runs a long iterative process to decide which strand of each subject the different LSHs are shared on, and builds graphs of LSHs at many positions for several models. The models differ in how much error they allow and in their minimum PI-SMOR or LDU. This step gives a first peek at whether anything interesting is present: significance is called at cliques for many positions, and the results are collapsed into representative results.
Currently the program runs four different models (defined by “chatSetModels” in the chat.properties file) because Kirk has not decided which one is best; we are therefore expected to compare their results (in the future we might even average over the models or run permutations on all four). Based on the values they take for eight key parameters (separated by “_”), the four models are named:
• chatset_false_0.5_0.005_10_10_false_0_-100
• chatset_true_0.5_0.005_10_10_false_0_-100
• chatset_false_0.5_0.01_5_10_false_0_-100
• chatset_true_0.5_0.01_5_10_false_0_-100
The eight key parameters are listed below. Their meanings and usage will be introduced as we describe this step of CHAT.
- LSHElement sorting (true or false): true – LSHs sorted primarily by PI-SMOR, false – sorted primarily by LDU sharing
- segLDUSize (0.5 for all four models): LDU length of the chromosomal segments (bins)
- fractionMatchScoreLshPhaseFilter (0.005 or 0.01): threshold on the haplotype difference score used to declare a match when merging (see below)
- fractionMatchScoreLshPhaseFilterMultiplier (10 or 5): multiplier applied to the above threshold when distinguishing the cis and trans ways of merging (see below)
- filterSegOverlapCt (5 or 10): minimum number of common bins two map pairs must share before a merge is attempted (see below)
- usePismoreFilter: false for all four models
- pismoreSdFilter (usually 0)
- lduFilter: -100 for all four models
In this step of CHAT, every chromosome being analyzed is first divided into small segments of the same LDU length. Each segment serves as a bin holding all the markers whose LDU map positions fall into it. Currently all four models use a segment size of 0.5 LDU; that is, the map distance between the first and the last marker in a bin is <= 0.5 LDU. After dividing a chromosome into same-size segments, the program uses two sliding windows, one inside the other, to parse the entire chromosome into multiple jobs. Both windows are expected to cover one or more contiguous segments (i.e., marker bins) entirely, yet their sizes are determined by the number of SNP markers inside the segments, so the size of either window varies along the chromosome with marker density. By default, the smaller internal window is expected to contain at least 1000 SNP markers, whereas the bigger window extends in both directions by at least another x times 1000 markers, unless the extension hits the end of the chromosome in that direction. The external window is referred to as the flanking window and x as the flanking multiplier, since the extra regions at both ends are often called flanking regions; by default x is set to 2.5, presumably to avoid edge effects. Because windows end on bin boundaries, the number of markers they contain may exceed the above thresholds, but only by up to one bin, i.e., 0.5 LDU. The internal window moves along the chromosome without overlap between its consecutive locations, while the external window, by its definition, produces overlaps. Each job analyzes all segments covered by the external window, but only reports segments inside the internal window. With the default settings, 6000 SNPs are analyzed per job to report data on segments holding 1000 SNPs, assuming that neither end of the chromosome is reached (chat-common/DataSet.java provides lastUsedMapPositionIndex() and firstUsedMapPositionIndex() to find the index of the last and the first marker with regard to the entire chromosome, respectively).
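A minimal sketch of this two-window partitioning follows; all names here are illustrative and not Kirk's actual code, and for simplicity the flanks are counted in raw markers rather than snapped to bin boundaries as the real code does:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the two-window job partitioning (names are not from Kirk's code). */
public class WindowPlanner {

    /** One job: a report (internal) window plus an analysis (external) window, in marker indices. */
    public static class Job {
        public final int reportStart, reportEnd, analysisStart, analysisEnd;
        Job(int rs, int re, int as, int ae) {
            reportStart = rs; reportEnd = re; analysisStart = as; analysisEnd = ae;
        }
    }

    /**
     * binStarts[i] is the index of the first marker of bin i (bins are 0.5-LDU segments);
     * markerCount is the total number of markers on the chromosome.
     */
    public static List<Job> plan(int[] binStarts, int markerCount,
                                 int minReportSnps, double flankMultiplier) {
        List<Job> jobs = new ArrayList<>();
        int flankSnps = (int) (flankMultiplier * minReportSnps); // 2.5 * 1000 by default
        int bin = 0;
        while (bin < binStarts.length) {
            int reportStart = binStarts[bin];
            // Grow the internal window bin by bin until it holds >= minReportSnps markers.
            int endBin = bin;
            while (endBin + 1 < binStarts.length
                    && binStarts[endBin + 1] - reportStart < minReportSnps) {
                endBin++;
            }
            int reportEnd = (endBin + 1 < binStarts.length)
                    ? binStarts[endBin + 1] - 1 : markerCount - 1;
            // The external window adds ~flankSnps markers on each side, clipped at chromosome ends.
            int analysisStart = Math.max(0, reportStart - flankSnps);
            int analysisEnd = Math.min(markerCount - 1, reportEnd + flankSnps);
            jobs.add(new Job(reportStart, reportEnd, analysisStart, analysisEnd));
            bin = endBin + 1; // consecutive internal windows never overlap
        }
        return jobs;
    }
}
```

With minReportSnps = 1000 and flankMultiplier = 2.5, interior jobs come out to roughly the 6000-marker analysis windows described above.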
When running jobs, each model takes a thread and all models run in parallel. Each thread executes the BuildChatSetRunnable class. Currently each executing thread is scheduled to output a recovery file every 2 hours (BuildChatSetRunnable.java, run() function, closeToEnd variable), but Kirk has not implemented the recovery function yet.
Each job runs BuildChatSetRunnable to compute cliques for a chromosomal region defined by the start and end positions of the report and analysis windows. The program first reads in the chromosome-specific data set to load a DataSet object that holds all marker and subject information (openChromSpecificDataSet() function). Then the program selectively stores the pieces of information critical for this job in a ChatSet object while discarding the rest to save memory (DataSet.trimDataSetPosBasedOnSetSeeds() function). Basically, it further extends the analysis window by 1000 LDU on each side and only keeps markers inside this extended region. Every marker in this region is reassigned a map index starting from 0, and a new DataSet object is created to store the genotypes of only these markers. Then markers inside the analysis window are binned into 0.5-LDU segments again and boundary markers, i.e., the first and last markers of each bin, are identified (ChatSet.loadChatSetSnpMapData() function). The ChatSet object records the new DataSet object (in its field curData) and the boundary markers (the first marker of each bin in the field segSnpKeyList; the first-last marker pair of each bin in the field segSnpEnds). Note, however, that this trimming may not actually reduce the data: all markers rather than a subset of them may end up being included. For example, if the LDU map of the data set is short (i.e., the LDU/kb ratio is small) for some reason (e.g., a low recombination rate), the analysis window plus 2000 LDU may cover the map positions of all markers.
Then the program selects, from all LSHs identified and adjusted in previous steps, those useful for the current job and loads them into the ChatSet object (specifically into an object of the CHATSetStatus class, an internal class of ChatSet, via the loadLsh() function). A loaded LSH must meet three criteria. First, the PI-SMOR value of the LSH must be >= 0 and >= the PI-SMOR threshold. Second, the LDU length of the LSH must be >= the LDU threshold. The PI-SMOR threshold is either an extremely small value (Float.MIN_VALUE = 1.4E-45) or calculated from the PI-SMOR mean and standard deviation obtained in the previous step, depending on the value of usePismoreFilter (currently false for all four models). If usePismoreFilter == true, the parameter pismoreSdFilter indicates whether the PI-SMOR sd is used: the PI-SMOR threshold is either the PI-SMOR mean (pismoreSdFilter = 0) or the PI-SMOR mean plus one PI-SMOR standard deviation (pismoreSdFilter = 1). The LDU threshold is either the LDUfloor (the smallest LSH length, rounded to the closest integer) or the LDUfloor plus lduFilter (-100 for all four models). Finally, the LSH must at least partly overlap the particular region handled by this job.
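The three loading criteria can be summarized as a single predicate; this is a sketch with invented names, not the actual loadLsh() code:

```java
/** Sketch of the three LSH-loading criteria (field and variable names are illustrative). */
public class LshFilter {
    public static boolean accept(double pismor, double lduLength,
                                 int lshStart, int lshEnd,        // LSH span, marker indices
                                 int regionStart, int regionEnd,  // region handled by this job
                                 boolean usePismoreFilter,
                                 double pismoreMean, double pismoreSd, int pismoreSdFilter,
                                 double lduFloor, double lduFilter) {
        // Criterion 1: PI-SMOR must be non-negative and above the threshold.
        double pismoreThreshold = usePismoreFilter
                ? pismoreMean + pismoreSdFilter * pismoreSd   // mean, or mean + 1 sd
                : Float.MIN_VALUE;                            // effectively no filter (1.4E-45)
        if (pismor < 0 || pismor < pismoreThreshold) return false;
        // Criterion 2: LDU length must reach the floor (lduFilter is -100 for all four models).
        if (lduLength < lduFloor + lduFilter) return false;
        // Criterion 3: the LSH must at least partly overlap the region handled by this job.
        return lshStart <= regionEnd && lshEnd >= regionStart;
    }
}
```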
Then the program adjusts the p/q-end markers (fixIndexP and fixIndexQ) of the loaded LSHs. If the p-end (left-most) marker of a loaded LSH falls into any bin except the last one in the analysis window, the LSH's new p-end marker becomes the first marker of the next bin to its right. If the old p-end marker falls into the last bin, the new p-end marker becomes the first marker of the last bin. If the q-end (right-most) marker of a loaded LSH falls into any bin except the first one in the analysis window, the LSH's new q-end marker becomes the last marker of the immediately preceding bin to its left. If the old q-end marker falls into the first bin, the new q-end marker becomes the last marker of the first bin. In short, each LSH's overlap with the target region is trimmed to fit bin boundaries, except when the overlap covers the entire region (trimming would be pointless) or covers only part of the first or the last bin (trimming would remove the overlap entirely).
Then, if the current model uses PI-SMOR as the primary sorting criterion, the program sorts all loaded LSHs, stored as a list of LSHDataElement objects in a ChatSetStatus object named css, in descending order of first PI-SMOR, then LDU, then IBD p-value. If the current model uses LDU length as the primary sorting criterion, the sorting is based on first LDU, then PI-SMOR, and finally IBD p-value. Either way, the first several LSHs are the ones most likely to reflect IBD sharing.
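In code, these two orderings might look like the following comparators (a sketch; LSHDataElement's actual accessor names are not known to me, so LshView is a stand-in interface):

```java
import java.util.Comparator;

/** Sketch of the two sort orders; LshView stands in for LSHDataElement's real accessors. */
public class LshOrdering {
    public interface LshView { double getPismor(); double getLdu(); double getIbdP(); }

    /** PI-SMOR first: descending PI-SMOR, then descending LDU, then IBD p-value.
     *  The description says "descending order" for all three keys; if smaller IBD
     *  p-values should rank first instead, drop the final reversed(). */
    public static final Comparator<LshView> BY_PISMOR_FIRST =
            Comparator.<LshView>comparingDouble(e -> e.getPismor()).reversed()
                    .thenComparing(Comparator.<LshView>comparingDouble(e -> e.getLdu()).reversed())
                    .thenComparing(Comparator.<LshView>comparingDouble(e -> e.getIbdP()).reversed());

    /** LDU first: descending LDU, then descending PI-SMOR, then IBD p-value. */
    public static final Comparator<LshView> BY_LDU_FIRST =
            Comparator.<LshView>comparingDouble(e -> e.getLdu()).reversed()
                    .thenComparing(Comparator.<LshView>comparingDouble(e -> e.getPismor()).reversed())
                    .thenComparing(Comparator.<LshView>comparingDouble(e -> e.getIbdP()).reversed());
}
```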
Then the program loops through all LSHs in their sorted order (css.lshCt counts and IDs every processed LSH). Recall that each LSH is the product of an earlier pairwise comparison between two individual subjects (denoted A and B). For every LSH, the program extracts the IDs of the involved subjects and infers their shared and non-shared haplotypes. Human beings are diploid. Assume two individuals A and B share an IBD haplotype. Each of them carries this haplotype on one strand of their two chromosomes, while their haplotypes on the other strand tend to differ, given that A and B are unrelated subjects in a population (as opposed to family) dataset. Thus, from an LSH shared by A and B we should be able to infer three haplotypes at the same chromosomal region (one of the previously segregated bins; a haplotype here is the combination of alleles at all the markers falling in a bin): a haplotype A shares with B (denoted ASB), a haplotype A does not share with B (ANSB), and a haplotype B does not share with A (BNSA).
At each bin, the program infers four haplotypes ASB, ANSB, BSA, and BNSA from A and B's genotypes at the within-bin markers (recall that genotype information has been loaded into a DataSet object named curData), via the function phaseAssumingPaired. The table below lists the possible genotype combinations and the corresponding inference results (“0” and “2” indicate homozygous in either allele; “1” indicates heterozygous; “X” indicates unknown genotype; “Z” means genotype incompatibility, i.e., A and B are homozygous for different alleles at the same marker).

Geno A | Geno B | ASB | ANSB | BSA | BNSA
0/2    | 2/0    | Z   | Z    | Z   | Z
X      | X      | X   | X    | X   | X
0/2    | X      | 0/2 | 0/2  | 0/2 | X
1      | X      | 1   | 1    | X   | X
X      | 0/2    | 0/2 | X    | 0/2 | 0/2
X      | 1      | X   | X    | 1   | 1
1      | 0/2    | 0/2 | 2/0  | 0/2 | 0/2
1      | 1      | 1   | 1    | 1   | 1
0/2    | 1      | 0/2 | 0/2  | 0/2 | 2/0
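A sketch of this per-marker inference implementing the table above (genotypes and allele states are encoded as chars; class and method names are mine, not the real phaseAssumingPaired signature):

```java
/** Sketch of the per-marker inference table ('0'/'2' hom, '1' het, 'X' unknown, 'Z' incompatible). */
public class PairPhase {
    /** Returns {ASB, ANSB, BSA, BNSA} for one marker given the genotypes of A and B. */
    public static char[] phase(char a, char b) {
        if (a == 'X' && b == 'X') return new char[] {'X', 'X', 'X', 'X'};
        boolean aHom = (a == '0' || a == '2'), bHom = (b == '0' || b == '2');
        if (aHom && bHom) {
            if (a != b) return new char[] {'Z', 'Z', 'Z', 'Z'};  // opposite homozygotes: incompatible
            return new char[] {a, a, b, b};                      // same homozygote (implicit in the table)
        }
        if (b == 'X') { // only A is informative
            return aHom ? new char[] {a, a, a, 'X'} : new char[] {'1', '1', 'X', 'X'};
        }
        if (a == 'X') { // only B is informative
            return bHom ? new char[] {b, 'X', b, b} : new char[] {'X', 'X', '1', '1'};
        }
        if (a == '1' && b == '1') return new char[] {'1', '1', '1', '1'};
        if (a == '1') { // A het, B hom: shared allele is B's; A's other allele is non-shared
            char other = (b == '0') ? '2' : '0';
            return new char[] {b, other, b, b};
        }
        // A hom, B het: shared allele is A's; B's other allele is non-shared
        char other = (a == '0') ? '2' : '0';
        return new char[] {a, a, a, other};
    }
}
```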
The ASB and BSA haplotypes are further analyzed to get a consensus form. If they share the same allele state (0, 1, 2, Z, or X) at a marker, that state is output as the consensus state. Otherwise, if one of the two states is “Z”, the consensus state is “Z”; if one of them is “X”, the other is used; if any undefined symbol appears, the consensus is “X”. Now consider the case where the two differing states are among 0, 1, and 2. Here the program determines the consensus based on each allele state's set of compatible homozygous genotypes (i.e., {0} for 0, {2} for 2, and {0, 2} for 1) and their intersection, as shown in the table below. The results are stored in a map, so that when the same combination is encountered later the result can be fetched directly rather than recalculated; this saves time when the shared haplotype extends over many markers.

Allele at a marker in ASB    | 0/2     | 1       | 1     | 0/2     | 0/2
Compatible homozygote set    | {0}/{2} | {0,2}   | {0,2} | {0}/{2} | {0}/{2}
Allele at the marker in BSA  | 2/0     | 0/2     | 1     | 0/2     | 1
Compatible homozygote set    | {2}/{0} | {0}/{2} | {0,2} | {0}/{2} | {0,2}
Intersection of the two sets | ∅       | {0}/{2} | {0,2} | {0}/{2} | {0}/{2}
Consensus allele             | Z       | 0/2     | X     | 0/2     | 0/2
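A minimal Java sketch of this consensus rule with the memoization map (class and method names are mine; note the prose and the table disagree on the 1-vs-1 case, and this sketch follows the table, which yields “X”):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the two-state consensus rule with the memoization map described above. */
public class Consensus {
    private final Map<String, Character> cache = new HashMap<>();

    public char consensus(char x, char y) {
        if (x == y && x != '1') return x;          // same conclusive state: keep it
        if (x == 'Z' || y == 'Z') return 'Z';      // incompatibility dominates
        if (x == 'X') return y;                    // unknown defers to the other state
        if (y == 'X') return x;
        // Both states are among 0, 1, 2: intersect their compatible homozygote sets.
        String key = x <= y ? "" + x + y : "" + y + x;   // order-insensitive cache key
        return cache.computeIfAbsent(key, k -> {
            boolean can0 = compatible(x, '0') && compatible(y, '0');
            boolean can2 = compatible(x, '2') && compatible(y, '2');
            if (can0 && can2) return 'X';   // intersection {0,2}: ambiguous (the 1-vs-1 case)
            if (can0) return '0';
            if (can2) return '2';
            return 'Z';                     // empty intersection, e.g., 0 vs 2
        });
    }

    /** A state's compatible homozygote set: {0} for 0, {2} for 2, {0,2} for 1. */
    private static boolean compatible(char state, char homozygote) {
        return state == homozygote || state == '1';
    }
}
```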
One bin after another, three haplotypes (one shared and two non-shared) from the two subjects are phased. The sequence of each haplotype is stored in a ConsensusSet object, which also keeps the IDs of the involved LSH and subjects (one subject ID for a non-shared haplotype and two subject IDs for the shared haplotype). This is implemented by naming each haplotype “subject ID + LSH ID + 0/1” (0 – shared, 1 – non-shared). Each haplotype also contributes a clique object to the current bin, which is later used to build a graph (discussed below); thus, one clique is associated with one ConsensusSet object. These objects are created to store the relevant information as the program phases shared and non-shared haplotypes at each bin for each LSH. In order to find this information quickly, all clique and ConsensusSet objects are organized into two maps in the aforementioned ChatSetStatus object (named css). The first map, css.graphs, is defined as <first marker of a bin (as bin identifier), <clique ID, the ConsensusSet object associated with this clique>>; the second map, css.sub, is defined as <first marker of a bin (as bin identifier), <subject ID, the set of cliques this subject is involved in>>.
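In Java terms, the two bookkeeping maps might be declared roughly as follows (a sketch inferred from the description; the actual field types in ChatSetStatus may differ):

```java
import java.util.Map;
import java.util.Set;

/** Sketch of the two bookkeeping maps in ChatSetStatus (css); types are inferred from the text. */
class ChatSetStatusSketch {
    /** bin identifier (first marker of the bin) -> (clique ID -> ConsensusSet of that clique) */
    Map<Integer, Map<Integer, ConsensusSet>> graphs;
    /** bin identifier (first marker of the bin) -> (subject ID -> set of clique IDs the subject is in) */
    Map<Integer, Map<Integer, Set<Integer>>> sub;

    /** Placeholder for the real ConsensusSet: haplotype sequence plus subject and LSH IDs. */
    static class ConsensusSet {
        char[] haplotype;   // consensus allele states over the bin's markers
        int lshId;          // originating LSH
        int[] subjectIds;   // one ID for a non-shared haplotype, two for a shared one
    }
}
```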
So far the program does haplotype inference/phasing bin by bin, but the ultimate goal is phasing each individual subject (recall that the input of CHAT is unphased genotype data). Although each subject has only two chromosomes and thus two haplotypes, the information needed to phase them is scattered across multiple bins: a subject may be involved in multiple LSHs, each of which may span multiple bins. Intuitively, we need to run two threads across the entire chromosomal region (the focus of one job, marked by the analysis window) to connect these bins, so as to merge the fragmentary, bin-specific haplotypes into two long haplotypes per subject. It is relatively easy to merge haplotypes inferred from the same LSH, as the ConsensusSet object of each haplotype contains the subject ID and the LSH ID. The bigger challenge is to merge one subject's (say A's) haplotypes inferred from LSHs that all involve A but differ in the second subject, say B versus C. Suppose there are two such LSHs – LSH (A vs B) and LSH (A vs C) – and they overlap at some bin (or bins), say Bin X. As discussed, an LSH results from a pairwise comparison between two subjects and contributes three cliques to every covered bin, two of which relate to each subject. So Bin X has two cliques from LSH (A vs B) and two from LSH (A vs C) that relate to A. Since all four cliques, which represent the bin-specific haplotypes ASB-X, ANSB-X, ASC-X, and ANSC-X, originate from Subject A's two chromosomes, ASB should be identical to either ASC or ANSC, and ANSB should be identical to whichever one is left. We need to decide which way to pair them.
The program processes the LSHs obtained from previous steps iteratively. The last step of each iteration is to explore the aforementioned merges with regard to the haplotypes of both subjects (one at a time), using the bookkeeping information previously stored in css (BuildChatSetRunnable.exploreMerge). Say we try to merge all cliques related to Subject A. First, the program identifies bins to which A has contributed more than 2 cliques (so that there is indeed something to merge). Then A-related cliques at each bin are paired up consecutively. Given the order in which cliques were previously added (see ChatSet.addNextLshToChatSet) and are now revisited at each bin, the first clique in a pair usually corresponds to Subject A's shared haplotype inferred from one comparison (i.e., an LSH, say LSH A vs B) and the second clique to A's non-shared haplotype from the same comparison. The program retrieves the existing information from css (note that more information becomes available as more LSHs are processed) and puts it into a map of <bin identifier (first marker in the bin), clique pairs contributed by Subject A to this bin, inferred from already processed LSHs>. Each clique pair is identified by its two clique IDs. Recall that clique IDs are assigned consecutively within each bin, so the same clique ID at different bins does not necessarily come from the same LSH.
This map (if not empty) is then used to populate a three-dimensional array (number of bins × max number of clique pairs over all bins × 2; the last dimension holds the IDs of the two paired cliques) by the function ChatSet.mkHapPairs. Each element of the array thus encodes <bin# (a simple count, not the first marker), clique pair#, [clique1#][clique2#]>. This array only keeps information for bins that have the maximal number of clique pairs and filters out the others. For bins with fewer than the maximal number of clique pairs (including bins with zero pairs), all corresponding array elements are assigned the value -9. For bins that do have the maximal number of clique pairs, the corresponding elements are populated with clique IDs. The reason for this reorganization is that as more information comes in, memory may not suffice to keep information for every bin; bins with more cliques are more likely to actually capture part of a region-wide long haplotype, and the maximal number of clique pairs suggests the maximal number of long haplotypes there might be.
With the help of this 3-dimensional array, the program traverses the chromosomal region we segregated into bins and fetches information from the bins that have the longest list of clique pairs. Using this information the program constructs two maps (denoted i0 and i1) with entries <bin#, clique#>. Each entry is a clique at a bin that corresponds to a bin-specific short haplotype of Subject A. Map i0 keeps adding the shared haplotype at each bin, whereas Map i1 keeps adding the non-shared haplotype. These two maps serve as threads that go through contiguous bins and connect cliques (one clique per bin in each map). The program stops extending the maps when it can no longer find information in the next bin, that is, when there is no clique pair (shared and non-shared haplotypes) contributed by Subject A from the same LSH in the next bin. Notably, the 3-dimensional array is revisited repeatedly during this process to look for the next clique pair to add to the maps; visited elements are marked as -9 to avoid reuse.
Each LSH thus gives us two maps as described above; each map is a collection of cliques (short haplotypes) related to one single subject. Recall that LSHs were ordered by their PI-SMOR scores or LDU lengths, and both criteria tend to put longer LSHs first; thus we expect to obtain longer maps earlier. As more LSHs are processed, more cliques are added to each bin and more information is stored in css. As a result, when the program retrieves the existing information to configure Maps i0 and i1 for one subject (say A), there may be multiple such pairs of maps (each covering multiple bins) inferred from different LSHs, with or without overlapping bins. For example, there may be one pair of maps inferred from ASB and ANSB and another pair from ASC and ANSC. Since one subject can have only one i0/i1 pair corresponding to her two chromosomes, we need to merge these different pairs (i.e., figure out a representative pair for all). The program uses a greedy strategy to do so: it merges two pairs at a time and takes the first merging result that meets certain criteria.
To implement this strategy, the program first determines which two pairs of maps to try merging each time. It uses an index permutation algorithm (see chat-common/IndexPermuter.java) to select the pairs. Say you have N slots and M pebbles: the algorithm generates all the ways of filling the N slots with the M pebbles. Suppose you have 6 pairs of Maps i0 and i1 and would like to select 2 pairs to test a merge; then N = 6 and M = 2. The algorithm generates all possible selections one by one: the first is [true,true,false,false,false,false], the second [true,false,true,false,false,false], the third [true,false,false,true,false,false], and the last [false,false,false,false,true,true]. Thus the program tries merging Pair 0 and Pair 1 first, then Pair 0 and Pair 2, and so on, until it finds the first satisfactory merge, i.e., one that meets several criteria.
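A minimal re-implementation of this enumeration (not the actual IndexPermuter.java code) might look like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the slot-filling enumeration attributed to IndexPermuter. */
public class SlotCombinations {
    /** All ways of placing M pebbles in N slots, as boolean arrays, in the order described above. */
    public static List<boolean[]> enumerate(int n, int m) {
        List<boolean[]> out = new ArrayList<>();
        fill(new boolean[n], 0, m, out);
        return out;
    }

    private static void fill(boolean[] slots, int from, int left, List<boolean[]> out) {
        if (left == 0) { out.add(slots.clone()); return; }
        // Place the next pebble at each position that still leaves room for the rest.
        for (int i = from; i <= slots.length - left; i++) {
            slots[i] = true;
            fill(slots, i + 1, left - 1, out);
            slots[i] = false;
        }
    }
}
```

For N = 6 and M = 2, enumerate(6, 2) yields [true,true,false,false,false,false] first and [false,false,false,false,true,true] last, matching the order described.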
Let's start with the first criterion. As mentioned, the program attempts to merge a specific subject's haplotypes every time it processes a new LSH involving that subject. The new LSH is expected to provide new information, in the form of newly added cliques, for phasing that subject's two region-wide haplotypes. Since the merging result is stored back into css each time (discussed below), we should focus on new cliques only. The program keeps a list of the new cliques inferred from the LSH currently being processed, in the form of a map named hapInPlay with entries <bin identifier (first marker in the bin), set of new clique IDs>. The first criterion is thus that at least one of the four to-be-merged maps contains at least one new clique recorded in hapInPlay. In other words, at least one newly inferred within-bin haplotype must be covered by at least one of the four cross-bin haplotypes inferred from the bookkeeping data (old and new information).
If the first criterion is met, the next criterion is that the two pairs of maps have enough overlapping (common) bins (>= the predefined threshold filterSegOverlapCt). Since the two maps in a pair come from the same LSH and therefore cover the same bins, the program compares only Map i0 of each pair. Then the program compares two ways of merging (reusing the earlier example of Subjects A, B, and C): (a) ASB merges with ASC and ANSB with ANSC (ASB+ASC & ANSB+ANSC), or (b) ASB merges with ANSC and ANSB with ASC (ASB+ANSC & ASC+ANSB). (a) and (b) are referred to as cis and trans, respectively, as they differ by switching the merged maps. Since maps from different pairs are not necessarily of the same length, the evaluation is limited to their common bins for comparability. This limitation is also necessary at the bin level: since a haplotype is a combination of marker alleles, two haplotypes are comparable only if they contain the same number of markers.
The program calculates the across-bin difference scores of both merging approaches (say scoreCis and scoreTrans). If both scores are > the predefined score threshold fractionMatchScoreLshPhaseFilter, the two map pairs are so different that we cannot form a representative pair (which is possible given genotyping and phasing uncertainty). If both scores are <= the threshold, we cannot effectively distinguish them, so there is no merge either. If one score is <= the threshold while the other is > the threshold times a predefined multiplier (fractionMatchScoreLshPhaseFilterMultiplier), we select the former way of merging. If the separation is smaller than that, again there is no merge, because we cannot effectively distinguish the two ways.
Now let's look at how scoreCis and scoreTrans are calculated. Basically, scoreCis (scoreTrans) is the average of the within-bin difference scores over all common bins for the cis (trans) merging; the average rather than the sum is used because the sum grows with the number of common bins, which conflicts with the meaning of a difference score. As mentioned, at each common bin reside 4 within-bin haplotypes inferred from ASB, ANSB, ASC, and ANSC, respectively. The within-bin difference score for cis = (number of incompatible alleles between ASB and ASC + number of incompatible alleles between ANSB and ANSC) / (number of conclusively phased alleles in all four haplotypes). Similarly, the within-bin difference score for trans = (number of incompatible alleles between ASB and ANSC + number of incompatible alleles between ANSB and ASC) / (number of conclusively phased alleles in all four haplotypes). Conclusively phased alleles are alleles inferred as neither “X” nor “Z”. The compatibility of different allele states at the same marker is defined by a truth matrix (note that the allele states on the four haplotypes are inferred and not necessarily identical to the actual ones).
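Putting the scoring and the decision rule together, a hedged sketch might look like this (names are illustrative; the truth matrix is simplified here to "only opposite homozygotes conflict", which may differ from the actual matrix):

```java
/** Sketch of the per-bin scoring and the cis/trans merge decision. */
public class MergeDecision {
    public enum Choice { CIS, TRANS, NO_MERGE }

    /** Within-bin difference score for one direction, e.g. cis = binScore(ASB, ASC, ANSB, ANSC). */
    public static double binScore(char[] h1, char[] h2, char[] h3, char[] h4) {
        int mismatches = 0, phased = 0;
        for (int i = 0; i < h1.length; i++) {
            if (incompatible(h1[i], h2[i])) mismatches++;
            if (incompatible(h3[i], h4[i])) mismatches++;
            for (char c : new char[] { h1[i], h2[i], h3[i], h4[i] })
                if (c != 'X' && c != 'Z') phased++;   // conclusively phased alleles
        }
        return phased == 0 ? 0.0 : (double) mismatches / phased;
    }

    /** Simplified truth matrix: only opposite homozygotes conflict. */
    private static boolean incompatible(char a, char b) {
        return (a == '0' && b == '2') || (a == '2' && b == '0');
    }

    /** scoreCis/scoreTrans are averages of binScore over the common bins (see above). */
    public static Choice decide(double scoreCis, double scoreTrans,
                                double filter, double multiplier) {
        if (scoreCis <= filter && scoreTrans > filter * multiplier) return Choice.CIS;
        if (scoreTrans <= filter && scoreCis > filter * multiplier) return Choice.TRANS;
        return Choice.NO_MERGE;  // both high, both low, or not separable enough
    }
}
```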
Suppose a merge plan meets all the criteria, say Map i0 inferred from LSH (A&B) should be merged with Map i0 inferred from LSH (A&C). We believe the two maps can be merged because they are effectively identical; thus the corresponding short haplotypes (currently two cliques) in each common bin of the two maps should be identical too. Map merging is conducted by merging the two corresponding cliques at each common bin of the two maps, sequentially. Recall that clique and haplotype information is stored in css in two data structures: css.graphs, which stores <first marker of a bin (as bin identifier), <clique ID, the ConsensusSet object associated with this clique>>, and css.sub, which stores <first marker of a bin (as bin identifier), <subject ID, the set of cliques this subject is involved in>>. For each merge, the relevant information in these data structures must be updated accordingly (referred to as BookKeeping in Kirk's code).
If ASB merges with ASC, this is a cis merge, so ANSB must be merged with ANSC too. Likewise, in a trans merge, ASB merges with ANSC while ASC merges with ANSB. When two cliques merge, the one added to the bin later (with the bigger clique ID) is merged into the one added earlier (remember that earlier-added cliques come from a more promising LSH). Let's call the former the source clique and the latter the target clique. BookKeeping first extracts the haplotype information of the subject(s) related to the source clique (recall that there is only one haplotype if the clique represents a non-shared haplotype, e.g., ANSB, and two haplotypes if it represents a shared haplotype, e.g., ASB and BSA). Any new subject that comes up is added to the subjectToMerge list for future processing (see below). For each subject related to the source clique, the source clique ID is removed from, and the target clique ID is added to, css.sub.get(current bin).get(that subject), which is the set of cliques this subject is involved in at the current bin. Then the haplotype info of all these subjects is added to the target clique at css.graphs.get(current bin).get(target clique), and the source clique record at css.graphs.get(current bin).get(source clique) is removed. Also, the source clique ID is removed from the hapInPlay map (recall that it keeps track of valid cliques related to the current LSH inference) and the target clique ID is added to it.
Since each clique corresponds to one and only one haplotype, after we add new subjects (each with a haplotype) to a clique we can, and should, recalculate the representative haplotype of that clique in light of the new information. The resulting representative haplotype is a consensus inferred from all haplotypes currently attached to the clique (note that these haplotypes all have the same length, equal to the total number of SNPs in the bin). We described earlier how to reach a consensus for two haplotypes one SNP at a time; with potentially multiple haplotypes the logic is similar. For each SNP, if all haplotypes share the same allele, the consensus haplotype takes that allele. If all haplotypes miss data at that SNP, the consensus allele is "unknown". If different alleles appear at one SNP (recall that CHAT only allows bi-allelic states, so the maximal number of different alleles is 2), the program uses the same rules described above to determine the consensus allele, without considering how many times either allele appears among the haplotypes. As a result, if an early-added haplotype has a genotyping error at a SNP (say it shows A while it should be T), it will bias the subsequent computation (say all later-added haplotypes show T at the same spot) and the program has no way to self-correct. (This is a potential improvement for CHAT.) Lastly, the source clique (the one being merged away) of the first map-pair merge may be the target clique of the second map-pair merge. To avoid losing information, the program records which clique was the source in the first merge and checks whether this recorded clique is the target of the second merge; if so, instead of bookkeeping on the to-be-eliminated source clique, the program bookkeeps on the target clique of the previous merge.
After a merge has happened, the program stops trying to merge other pairs of maps related to the current subject (recall that a greedy strategy is used). Instead, it moves on to the next subject in the subjectToMerge list and explores whether that subject has multiple maps and, if so, whether any two pairs can be merged. It basically repeats the process introduced above, starting from finding all clique pairs of the new subject. After the program has processed all subjects initially on the list (every processed subject is removed from the list), the list is not necessarily empty, because new subjects may have been added along the way. These newly added subjects are then processed following the same procedure. When the program finishes the current LSH and moves on to the next one, it clears the subjectToMerge list and fills it with the new subjects discovered while analyzing the next LSH. Again, these subjects contribute cliques to the bins in the chromosomal region covered by the new LSH. Since it is the same chromosome divided into the same set of bins, cliques accumulate at bins, along with the subjects whose phased haplotypes those cliques represent. The logic can be summarized by the following pseudocode:

For each LSH (and the corresponding pair of subjects) {
    For each bin covered by this LSH, add cliques that represent the pair of subjects
    Add the pair of subjects to a check-to-merge list
    For each subject on the list {
        Check whether its cliques can merge with other cliques at all relevant bins
        Whenever there is a merge, add the involved subjects to the list
        Remove the current subject from the list
    }
}
When the conditions above are met, different cliques at a bin merge into one. The merging involves calculating a consensus haplotype supposedly shared by all the subjects that contributed the original cliques; these subjects are thus grouped together by their haplotype sharing. Haplotype sharing is actually quite common in human populations, and we are only interested in the very few shared regions that harbor disease-causing rare variants. Remember that in CHAT haplotype sharing is estimated from LSHs, and the PI-SMOR statistic lets us identify the most "interesting" LSHs (high PI-SMOR scores) and process them first. By doing so, we hope to find bins where a merged clique corresponds to a group of mostly affected subjects (i.e., cases). The location of such a bin then tells us where the rare variant might be, and the consensus haplotype can further help identify it. CHAT locates such "mostly-case cliques" by running a Fisher's exact test on the numbers of in-clique cases and controls against the numbers of out-of-clique cases and controls (i.e., the cases and controls in the dataset that do not belong to the current clique).
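For illustration, the Fisher's exact test on the clique's 2x2 table can be computed from the hypergeometric tail; this is a sketch using a one-sided test (case enrichment), and CHAT's actual tail choice and implementation may differ:

```java
/** Sketch of a one-sided Fisher's exact test on the clique's 2x2 table. */
public class CliqueFisher {

    /** -log10 P(observing >= inCases cases in the clique under the hypergeometric null). */
    public static double negLog10P(int inCases, int inControls, int outCases, int outControls) {
        int cliqueSize = inCases + inControls;
        int totalCases = inCases + outCases;
        int n = cliqueSize + outCases + outControls;   // total subjects
        double p = 0.0;
        int hi = Math.min(cliqueSize, totalCases);
        for (int k = inCases; k <= hi; k++) {          // hypergeometric upper tail
            p += Math.exp(logChoose(totalCases, k)
                    + logChoose(n - totalCases, cliqueSize - k)
                    - logChoose(n, cliqueSize));
        }
        return -Math.log10(Math.min(1.0, p));
    }

    private static double logChoose(int n, int k) {
        return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
    }

    private static double logFactorial(int n) {        // O(n); memoize for production use
        double s = 0.0;
        for (int i = 2; i <= n; i++) s += Math.log(i);
        return s;
    }
}
```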
The merging also triggers possible merges of these subjects' representative cliques at other bins (a potential downside?).
Since the above process is expected to be time-consuming, CHAT constantly reports progress and outputs intermediate files. Every time the number of processed LSHs (curVal) reaches a multiple of 1000, CHAT writes to a progress report file how many LSHs have been processed, how many merges have happened, and the elapsed time since the last report. On completion of the whole process, CHAT outputs two types of result files. The same two types of files are also output every time a new LSH is added, once the total elapsed time has passed a predefined value (closeToEnd). One result is named "GraphSet" and the other "ChatSet" (or "Final" upon completion of the process), plus chromID plus jobID (chromID + jobID identifies a unique job). Recall that we have 4 models and likely multiple jobs per model (a long chromosome is divided into sets of segments, with each set handled by a job). Each model has a separate output folder (in the ChatSet dir) and a job plan folder per cycle (in the chatSetPlan0/1/2 dir), both named after the model. Each job produces at least two result files (i.e., GraphSet & ChatSet).
A job's GraphSet file records all resultant cliques in all the bins handled by that job. The file has multiple sections, one per bin. Each section starts with an axis comprising all the markers inside that bin, separated by "…" (the interval symbol). Next come lines of subject records, one line per subject, with the fields:
• the subject's original sample name in the CHAT input data set (i.e., the chromosome-specific file)
• the subject's ID
• the subject's disease affection status (as in the chromosome-specific file)
• the subject's gender
• the subject's genotypes at the markers inside the current bin (i.e., the markers shown in the axis at the beginning), with a white space every 10 genotypes
Next in each section come lines of haplotype information, one line per haplotype that CHAT phased from an LSH at this bin, with the fields:
• ID of the clique the haplotype now belongs to, i.e., the one it finally merged into (in ascending order, i.e., earlier-added cliques first)
• ID of an involved subject
• ID of the original LSH
• sharing status of the haplotype, i.e., whether it is shared (0) or non-shared (1)
• sequence of the haplotype, with a white space every 10 alleles
The other result of a job, the ChatSet file, contains clique information, one line per clique. Each line has 9 fields describing the following attributes of a clique:
• the base pair position of the first/start marker of the current bin
• the base pair position of the last/end marker of the current bin
• ID of the first/start marker of the current bin
• ID of the last/end marker of the current bin
• clique ID
• -Log10P: the P value from a Fisher's exact test on the current clique
• how many subjects involved in this clique are cases
• how many subjects involved in this clique are controls
• a list of all subjects involved in this clique (separated by ",")
The lines are sorted by significance, with the more important cliques coming first. Significance is measured by two fields: -Log10P and clique ID. A high negative log p-value indicates a clique with a significantly different case/control ratio than the rest of the dataset; a small clique ID indicates a clique that was added early, i.e., obtained from a more interesting LSH.
The file ends with a line indicating the total number of cliques in the file, which should equal the total number of clique lines; this is for sanity-check purposes.
Step 7:
The seventh bean "ParalleleLSHCHATSimulationRunnable" calls "LSHCHATSimulationJobRunnable" to help with its jobs. "ParalleleLSHCHATSimulationRunnable" constructs the list of simulations to run, while "LSHCHATSimulationJobRunnable" runs each simulation: it permutes the phenotypes and recalculates the Fisher statistic for the cliques under each permutation.
The eighth bean "ParalleleLocalAdjustChatSetRunnable" calls "LSHCHATLocalAdjustJobRunnable" to run all the planned jobs, correcting for local excess sharing.
The last bean "ParalleleGenomeAdjustChatSetRunnable" calls "LSHCHATGenomeAdjustJobRunnable" to run all the planned jobs, correcting for the size of the genome for each model and for multiple models (?).
ParalleleLSHCHATSimulationRunnable At this step, CHAT concatenates all the files from the CHATSet output directory for each model, permutes the phenotypes (i.e., the disease affection status), conducts a Fisher's exact test for each clique based on the permuted phenotypes, and saves the most interesting clique. The outputs of this step go to a directory named "LSHCHATSimulation", with one subdirectory per model (same names as in the previous step). There are two types of output files. The first type is named "SimDx_#.txt" and produced at the job-planning phase by the writePermutedAffectionStatus() function in ParalleleLSHCHATSimulationRunnable.java. Each such file represents one permutation, with "#" being the permutation ID (starting from 1). Inside the file, each row contains a subject ID followed by the permuted dx for this subject (= 2 affected, = 1 otherwise). The last line of the file is "lct" followed by the total number of subjects in the panel plus 1; it is used for checking the completeness of the file. By default there are 1000 "SimDx" files (i.e., 1000 permutations) per model.
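A minimal sketch of producing one SimDx file (the field separator and I/O details are my assumptions; only the row layout and the trailing "lct" line follow the description above):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of writing one SimDx_#.txt permutation file (not the actual writePermutedAffectionStatus). */
public class SimDxWriter {
    public static void write(Path file, List<String> subjectIds, List<Integer> dxValues,
                             long seed) throws IOException {
        List<Integer> permuted = new ArrayList<>(dxValues);
        Collections.shuffle(permuted, new Random(seed));   // permute the affection status labels
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            for (int i = 0; i < subjectIds.size(); i++) {
                out.println(subjectIds.get(i) + "\t" + permuted.get(i)); // dx: 2 affected, 1 otherwise
            }
            out.println("lct\t" + (subjectIds.size() + 1));  // completeness-check line
        }
    }
}
```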
The second type of output file is stored in the same subdirectory (of each model) and named "SimRaw_#.gz", with "#" again being the permutation ID. These are produced at the job-running phase. Usually multiple jobs are planned, each handling a certain number of permutation scenarios defined by "SimDx_permutationID.txt" files (the number of permutations per job depends on the predefined job size) and producing one "SimRaw" file per permutation. Thus, one permutation scenario corresponds to one "SimDx_permutationID.txt" file and one "SimRaw_permutationID.gz" file.
The code for running a job is in LSHCHATSimulationJobRunnable.java. Basically, for every permutation assigned to the current job, the program first reads the corresponding "SimDx_permutationID" file to get the permuted phenotypes (subject ID + permuted dx value). Then it goes through all "Final.gz" files produced in the previous step and recalculates the -log10P value for each previously identified clique using the permuted phenotypes (the numbers of cases and controls inside a clique change with the specific permutation scenario). "Final.gz" files record cliques (i.e., small subsets of IBD individuals), one clique per line. The first two columns indicate the specific bin where a clique was found; the IDs of all individuals in the clique are given in the last column/field of the line. Each "Final.gz" file was produced by a job in the previous step, which handled a specific and unique region of a chromosome (there may be more than one chromosome), so different "Final.gz" files do not have overlapping bins.
There may be multiple cliques per bin identified in the previous step and stored in Final_#.gz. At the current step, however, for each permutation scenario only the clique with the highest recalculated -log10P value among all cliques at the same bin is picked out and recorded in the corresponding "SimRaw_#.gz" file (# is the ID of the permutation scenario). In this record, the numbers of cases and controls as well as the -log10P value are updated to be consistent with the permutation result, and the number of the chromosome being processed is added to the beginning of the record.
ParalleleLocalAdjustChatSetRunnable At this step, the "SimRaw_#.gz" files from the previous step are the inputs, and the outputs go to two folders – "LSHCHATSimulationLocal" and "CHATSetLocalAdjust". In each of the two folders there are again four subdirectories, one per model, named as before. Multiple jobs may be planned, each job handling one or more bins (delineated by the start marker ID of the first bin and the end marker ID of the last bin) on a specific chromosome for one of the four models. Although job outputs go to four different subdirectories, the current code puts all job-running files under one folder "LSHCHATSimulationLocalPlan_#" and names them sequentially ("Local_#").
Each job focuses only on the bins assigned to it and is conducted through a series of functions defined in LSHCHATLocalAdjustJobRunnable.java. First, the function populateEvData() extracts the maximal -log10P value for each bin (denoted maxLogP hereafter) from each "SimRaw_permutationID" file (P being the Fisher's exact test p-value recalculated under that specific permutation) and stores them in several map data structures (the names and contents of these maps are listed at the beginning of LSHCHATLocalAdjustJobRunnable.java). Then, collectChatSetData() collects clique records from Final*.gz files (i.e., outputs of the ChatSet step) and stores them in a map data structure, but only cliques with -log10P no smaller than a threshold referred to as curPvWriteFilter (i.e., the original Fisher's exact test p no larger than a threshold). The default value is curPvWriteFilter = 0.001, which means Fisher's test p <= 0.001 and -log10P >= 3. Next, the function calcEVDAndAdjust() fits the maxLogP values from all permutations (one value per permutation) for a specific bin (identified by its start marker ID) to a generalized extreme value (GEV) distribution and to a Gumbel distribution. The function first fits the GEV distribution, which has three parameters (location, scale, and shape), and then uses the same data to fit a Gumbel distribution, which has two parameters (location and scale) and is supposedly easier to fit. According to Kirk, in case fitting the first distribution fails (the returned parameters are NaN), fitting the second one always works. (If both distributions are available, GEV is preferred over Gumbel in later analyses where only one of them is needed.) Maximum likelihood estimation (MLE) is used to fit both distributions. Then, for each clique record stored in those map structures, the maxLogP value is replaced by -log10P(x > maxLogP) if maxLogP > 0, and by -maxLogP otherwise. P(x > maxLogP) is calculated as 1 - P(x <= maxLogP), where P(x <= maxLogP) is obtained from the CDF of the GEV or Gumbel distribution (depending on whether the GEV fit is available) at the specific maxLogP value.
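A sketch of the adjustment itself, given fitted parameters (the formulas are the standard Gumbel and GEV CDFs; parameter names follow the description above, and the numerics are simplified):

```java
/** Sketch of the local adjustment maxLogP -> -log10 P(X > maxLogP) using a fitted EVD. */
public class LocalAdjust {

    /** Gumbel CDF: F(x) = exp(-exp(-(x - mu) / beta)). */
    static double gumbelCdf(double x, double mu, double beta) {
        return Math.exp(-Math.exp(-(x - mu) / beta));
    }

    /** GEV CDF for shape zeta != 0: F(x) = exp(-(1 + zeta*(x - mu)/sigma)^(-1/zeta)). */
    static double gevCdf(double x, double mu, double sigma, double zeta) {
        double s = 1.0 + zeta * (x - mu) / sigma;
        if (s <= 0) return zeta > 0 ? 0.0 : 1.0;   // x outside the distribution's support
        return Math.exp(-Math.pow(s, -1.0 / zeta));
    }

    /** Returns -log10 P(X > maxLogP) when maxLogP > 0, and -maxLogP otherwise (as described above). */
    static double adjust(double maxLogP, boolean haveGev,
                         double mu, double sigma, double zeta) {
        if (maxLogP <= 0) return -maxLogP;
        double cdf = haveGev ? gevCdf(maxLogP, mu, sigma, zeta) : gumbelCdf(maxLogP, mu, sigma);
        return -Math.log10(1.0 - cdf);   // P(X > x) = 1 - P(X <= x)
    }
}
```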
Next, the writeLocalGevParameterData() function outputs a file named "LocalGevParameter_ChromID_jobID.gz" to the folder "LSHCHATSimulationLocal" (under the appropriate subdirectory) for the current job. Each line of this file describes the extreme value distributions fitted for one bin, with the following fields:
• ID of the chromosome being processed
• ID of the job on the chromosome
• ID of the start marker of a bin processed in the current job
• type of the extreme value distribution used for this bin: generalized ('E') or Gumbel ('G')
• parameters of the fitted GEV distribution (GevMu, GevSigma, and GevZeta)
• GevMLE: final MLE of the fitted GEV distribution; -99.99 if this distribution could not be fit
• parameters of the fitted Gumbel distribution (Gumleft and Gumright)
• GumMLE: final MLE of the fitted Gumbel distribution; -99.99 if this distribution could not be fit
Next, the writeLocalAdjustSimData() function outputs a file named "LocalAdjustSim_ChromID_jobID.gz" to the folder "LSHCHATSimulationLocal" (under the appropriate subdirectory) for the current job. Each bin processed by the current job has two lines in this file. The first line starts with the string "Raw" and stores the original maxLogP values for the current bin in every permutation, with the following fields:
• ID of the chromosome being processed
• the base pair position of the first/start marker of the current bin
• the base pair position of the last/end marker of the current bin
• ID of the first/start marker of the current bin
• ID of the last/end marker of the current bin
• maxLogP values for the current bin in every permutation
The second line is very similar to the first, except that it does not start with "Raw" and that it replaces each maxLogP with -log10P(x > maxLogP), also referred to as extreme local p-values or locally adjusted p-values.
Next, the writeLocalAdjustChatSet() function outputs a file named "LocalAdjustChatSet_ChromID_jobID.gz" to the folder "CHATSetLocalAdjust" (under the appropriate subdirectory) for the current job. Each line of this file is an almost identical copy of a clique record extracted from the previous "Final" ChatSet files (i.e., "Final_JobID.gz") with -log10P >= 3 (i.e., p <= 0.001), except that it inserts the locally adjusted -log10P(x > maxLogP) right after the original -log10P (P being the p-value of Fisher's exact test). The original -log10P is the 6th field (counting from the 0th field) and the locally adjusted p-value is the 7th field.
ParalleleGenomeAdjustChatSetRunnable The inputs of this step are the previous outputs in "LSHCHATSimulationLocal" and "CHATSetLocalAdjust". There are two output directories – "LSHCHATSimulationGenome" and "CHATSetGenomeAdjust". This step has only four jobs, one per model. The code that runs each job is defined in LSHCHATGenomeAdjustJobRunnable.java and consists of a series of functions identically named and similarly defined as those in the previous step. Basically, we do a second round of fitting. While the previous fitting of p-values to extreme value distributions was "local" and bin-specific, i.e., limited to each bin of each chromosome for each model across permutations, the current fitting is "global" and genome-wide, i.e., across different chromosomes per model. Currently CHAT only conducts these two rounds of fitting, although Kirk mentioned that there could be a third round across the different models. Now let's get into the details.
First, the populateEvData() function reads the "LocalAdjustSim_ChromID_jobID.gz" files from the corresponding model subdirectory of the folder "LSHCHATSimulationLocal". In every file, the function searches the lines not starting with "Raw" – that is, the lines containing locally adjusted p-values – for the maximum locally adjusted p-value of each permutation (case-control label swap) across all bins on all chromosomes. The information is stored in a map data structure (denoted evData) of the form <permutation ID, maximum locally adjusted p-value for this permutation across all bins and all chromosomes>. Then, the collectChatSetData() function reads the "LocalAdjustChatSet_ChromID_jobID.gz" files from the corresponding model subdirectory of the folder "CHATSetLocalAdjust". It gathers the clique information from these files and puts it into a map data structure (denoted ChatSetData) of the form <ChromID, <start marker ID of the bin that contains the clique, the clique record line from the Final ChatSet files with the locally adjusted p-value>>. Next, the calcEVDAndAdjust() function fits extreme value distributions to all the locally adjusted p-values stored in evData and creates a map (denoted newEvData) of the same form as evData, but with the original p-values replaced by globally adjusted ones obtained from the fitted global extreme value distribution. calcEVDAndAdjust() also creates a map (denoted ChatSetNewP) of the same form as ChatSetData, with the original p-values in ChatSetData replaced by globally adjusted p-values obtained from the same global extreme value distribution.
Next, the writeGenomicGevParameterData() function outputs a file named "GenomeGevParameter.gz" to the folder "LSHCHATSimulationGenome" (under the appropriate subdirectory) for the current job. This file has a single line describing the parameters of the globally fitted extreme value distributions, with the following fields:
• type of the extreme value distribution used for this fit: generalized ('E') or Gumbel ('G')
• parameters of the fitted GEV distribution (GevMu, GevSigma, and GevZeta)
• GevMLE: final MLE of the fitted GEV distribution; -99.99 if this distribution could not be fit
• parameters of the fitted Gumbel distribution (Gumleft and Gumright)
• GumMLE: final MLE of the fitted Gumbel distribution; -99.99 if this distribution could not be fit
Next, the writeGenomicAdjustSimData() function outputs a file named "GenomicAdjustSim.gz" to the folder "LSHCHATSimulationGenome" (under the appropriate subdirectory) for the current job. Each line of this file corresponds to one permutation and has three fields:
• permutation ID
• globally adjusted p-value for this permutation (as in newEvData)
• maximal locally adjusted p-value for this permutation (as in evData)
Next, the writeGenomicAdjustChatSet() function outputs a file named "GenomeAdjustChatSet.gz" to the folder "CHATSetGenomeAdjust" (under the appropriate subdirectory) for the current job. Each line of this file is an almost identical copy of a clique (also referred to as a "ChatSet") record extracted from the previous "Final" ChatSet files (i.e., "Final_JobID.gz") with -log10P >= 3 (where P is the p-value of Fisher's exact test, so p <= 0.001), except that two adjusted p-values now follow the original -log10P (6th field), both derived from the 6th-field p-value but using different extreme value distributions. The first (7th field) was adjusted using the locally fitted extreme value distribution and inserted at the previous step; the second (8th field) was adjusted using the globally fitted extreme value distribution and inserted at the current step.
Weak points
CHAT can only handle biallelic SNPs (see the filterHWE function in chat/common/DataSet.java). How can we improve that? Is there a Hardy-Weinberg equilibrium testing method that accommodates multiallelic SNPs?
As mentioned, CHAT searches for long shared haplotypes (LSHs) with error tolerance and groups LSHs into cliques based on their individual hosts. A clique can include too many and too long haplotypes, and therefore incur extensive computing time, under either of two conditions:
- the tolerance for genotyping errors is set too high
- there are too few recombinations or mutations along the chromosome (for a synthetic data set, this means the simulation model that generated it has an inappropriate recombination or mutation rate)