Plink

15 Jul 2016

I need a place to write down some notes on running plink. BTW, I moved into a new apartment yesterday. Met two very polite and professional movers. Dished on their boss together (>100F but no AC in their truck. This is inhuman. I allowed their boss to charge me for longer time so that they can take breaks, enjoying the AC in my new apartment and some ice water.) I kinda lacked of sleeps three days in a row because of all the packing, moving and cleaning stuffs (not donw yet), so please bear with me. Anyway, let’s get to work. First of all, –memory [main workspace size, in MB] By default, PLINK 1.9 tries to reserve half of your system’s RAM for its main workspace (so maybe I need to get on a compute node and then lauch plink). If this amount is insufficient for your current job, or if it causes unwanted interference with other running processes (e.g. you’re using GNU parallel to run single-threaded instances of PLINK on each chromosome simultaneously), you can use –memory to change this behavior.

LD pruning

The following command computes the LD of rs1048535 to every SNP in a window of size 500 kb around the SNP. The option –ld-window-r2 0 ensures all pair-wise LD calculations are printed to the output file.

plink --bfile CEU_HapMap_GWAS_data --r2
--ld-snp rs1048535 --ld-window-kb 5000
--ld-window-r2 0 --out rs1048535_ld 

Sometimes it is useful to generate a pruned subset of SNPs that are in approximate linkage equilibrium with each other. This can be achieved via two commands: –indep which prunes based on the variance inflation factor (VIF), which recursively removes SNPs within a sliding window; second, –indep-pairwise which is similar, except it is based only on pairwise genotypic correlation. (Note: PLINK keeps the SNP with higher minor allele frequency, if the two MAFs are not identical. When there is a tie, the first SNP is kept.) The output of either of these commands is two lists of SNPs: those that are pruned out and those that are not. A separate command using the –extract or –exclude option is necessary to actually perform the pruning. The VIF pruning routine is performed:

plink --file data --indep 50 5 2

will create files plink.prune.in and plink.prune.out. Each is a simlpe list of SNP IDs; both these files can subsequently be specified as the argument for a –extract or –exclude command. The parameters for –indep are: window size in SNPs (e.g. 50), the number of SNPs to shift the window at each step (e.g. 5), the VIF threshold. The VIF is 1/(1-R^2) where R^2 is the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously. That is, this considers the correlations between SNPs but also between linear combinations of SNPs. A VIF of 10 is often taken to represent near collinearity problems in standard multiple regression analyses (i.e. implies R^2 of 0.9). A VIF of 1 would imply that the SNP is completely independent of all other SNPs. Practically, values between 1.5 and 2 should probably be used; particularly in small samples, if this threshold is too low and/or the window size is too large, too many SNPs may be removed. The second procedure is performed:

plink --file data --indep-pairwise 50 5 0.5

This generates the same output files as the first version; the only difference is that a simple pairwise threshold is used. The first two parameters (50 and 5) are the same as above (window size and step); the third parameter represents the r^2 threshold. Note: this represents the pairwise SNP-SNP metric now, not the multiple correlation coefficient; also note, this is based on the genotypic correlation, i.e. it does not involve phasing. To give a concrete example: the command above that specifies 50 5 0.5 would a) consider a window of 50 SNPs, b) calculate LD between each pair of SNPs in the window, b) remove one of a pair of SNPs if the LD is greater than 0.5, c) shift the window 5 SNPs forward and repeat the procedure. To make a new, pruned file, then use something like (in this example, we also convert the standard PED fileset to a binary one): plink –file data –extract plink.prune.in –make-bed –out pruneddata

If you want to check whether there are other SNPs that can tag forcefully excluded ones, you can use the following plink function.

plink --bfile mydata --show-tags mysnps.txt

where mysnps.txt is just a list of SNP IDs. The function generates a file plink.tags that lists all the SNPs in the dataset that tag the SNPs in mysnps.txt (including the SNPs in the original file). A message is also written to the LOG file that indicates how many new SNPs were added:


Reading SNPs to tag from [ mysnps.txt ]
Read 10 SNPs to tag, of which 10 are unique and present
In total, added 2 tag SNPs
Writing tag list to [ plink.tags ]

meaning that plink.tags will contain 12 SNPs. The mysnps.txt would be the differences between dataset 1 map file and dataset 2 map file, which you can get using

comm -23 <(sort DataMerge_testcase_Set1.map) <(sort DataMerge_testcase_Set2.map) > mysnps.txt 

Back to plink, the option –tag-r2 0.5 to specify a minimum r-squared (based on the genotypic correlation, see above); in this case it is set to a value of 0.5 as being necessary to declare that one SNP tags another (the default is 0.8). plink LD pruning does not pick the minimal set of SNPs required to tag all common variation in a region, in the way tagging algorithms typically work (e.g. such as Tagger). Rather, this utility function is designed merely to indicate which other SNPs tag a one or more of a pre-specified list of SNPs.

Calculate allele frequency spectrum

To generate a list of minor allele frequencies (MAF) for each SNP, based on all founders in the sample: plink –noweb –file data –freq –out will create a file: .frq with five columns: CHR Chromosome SNP SNP identifier A1 Allele 1 code (minor allele) A2 Allele 2 code (major allele) MAF Minor allele frequency NCHROBS Non-missing allele count An example


cut -f 1,2 DataMerge_testcase_Set1.fam > tmp
awk 'NR <= 300' tmp > DataMerge_testcase_Set1_caseID.txt
awk 'NR > 300' tmp > DataMerge_testcase_Set1_ctrlID.txt
...#similarly for Set0 and Set2
plink --noweb --file DataMerge_testcase_Set1_plink --keep DataMerge_testcase_Set1_caseID.txt --freq --out DataMerge_testcase_Set1_case
plink --noweb --file DataMerge_testcase_Set1_plink --keep DataMerge_testcase_Set1_ctrlID.txt --freq --out DataMerge_testcase_Set1_ctrl
...#similarly for Set0 and Set2

File format conversion

The following code will read lgen file (note that there should also be corresponding fam and map file in the same directory with the same prefix file name DataMerge_testcase_Set0_noErr) and create .tped and .tmap files, both with the same file name prefix

plink --noweb --lfile DataMerge_testcase_Set0_noErr --noweb --recode --transpose --out DataMerge_testcase_Set0_noErr

LGEN file can be reformatted as a standard PED file using the following command (it implies mydata.lgen, mydata.map and mydata.map exist):

plink --noweb --lfile test --recode

This will create a PED file, plink.recode.ped, and the MAP file, plink.recode.map. To work with Kirk’s code, sometimes you need to convert back, i.e., from PED file to Lgen file. Plink provides an option –lgen

plink --noweb --file test --recode-lgen

This will create plink.lgen plink.fam and plink.map. As always, you can change the prefix using –out XXX in the above command.

To save space and time, you can make a binary ped file (.bed). This will store the pedigree/phenotype information in separate file (.fam) and create an extended MAP file (*.bim) (which contains information about the allele names, which would otherwise be lost in the BED file). The .fam and .bim files are still plain text files: these can be viewed with a standard text editor. Do not try to view the .bed file however: it is a compressed file and you’ll only see lots of strange characters on the screen.The follow code converts BED files to PED files

plink --noweb --bfile test --recode --out test

The following script separate one genotype file into several smaller files based on chromosomes.


for chr in $(seq 1 22); do
     plink --noweb \
           --bfile /projects/sequence_analysis/vol4/CHAT_simGWAS/realData/EC/2000GWAS/ESCC_Genotype/ESCC_afterQC \
           --chr $chr \
           --recode \
           --out /projects/sequence_analysis/vol4/CHAT_simGWAS/realData/EC/2000GWAS/ESCC_Chroms/$chr ;
done

Plink tutorial tried out using BED (Binary PED) files

To load a binary file, use –bfile instead of –file

plink --noweb --bfile ESCC_afterQC

When creating a binary ped file, the MAF and missingness filters are set to include everybody and all SNPs. If you want to change these, use –maf, –geno, etc, to manually specify these options: for example, Summary statistics

plink --noweb --bfile ESCC_afterQC --missing --out miss_stat --freq --out freq_stat 

Basic association test with multi-test adjustment

plink --noweb --bfile ESCC_afterQC --assoc --adjust --out as1

It generates two output files. The as1.assoc file contains the following fields

The as1.assoc.adjusted file contains the following fields

We can calculate association statistics based on the 2-by-3 genotype table as well as the standard allelic test; we can also calculate tests that assume dominant or recessive action of the minor allele; finally, we can perform the Cochran-Armitage trend test instead of the basic allelic test. All these tests are performed with the single command –model. Just as the –assoc command, this can be easily applied to all or specified SNPs.

plink --noweb --bfile ESCC_afterQC --model --snp rs1988293 --out mod1

Logistic regression