RNA seq data analysis

23 Jul 2017

Basics

“Read” refers to both single-end or paired-end reads. The concept of counting is the same with either type of read, as each read represents a fragment that was sequenced. “Feature” refers to an expression feature, which is a genomic region containing a sequence that can normally appear in an RNA-Seq experiment (e.g. gene, isoform, exon). In the rest of this post, the random variable X_i to denote the counts you observe from a feature of interest i. With alternative splicing you do not directly observe X_i, so often E[X_i] is used, which is estimated using the EM algorithm by a method like eXpress, RSEM, Sailfish, Cufflinks, or one of many other tools.

RNA-seq expression measures

It used to be when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t c ount this fragment twice).

The calculation of RPKM

  1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
  2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).
  3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

I found this post very helpful for rookies like me to understand the terminology.

Tools

Ballgown Perhaps most importantly, there is a component called pData that should hold a data frame of phenotype information for the samples in the experiment. This must be created manually. It is very important that the rows of pData are in the correct order. Each row corresponds to a sample, and the rows of pData should be ordered the same as the tables in the expr slot. You can check that order by running sampleNames(bg).

Q1: How to get gene abundance estimate using StringTie? Use the -A flag when running StringTie, e.g.,

stringtie -p 8 -G $RNA_REF_GTF -e -B -o UHR_Rep1/transcripts.gtf -A UHR_Rep1/gene_abundances.tsv $RNA_ALIGN_DIR/UHR_Rep1.bam

• ‘-p 8’ tells Stringtie to use eight CPUs • ‘-G ' reference annotation to use for guiding the assembly process (GTF/GFF3) • '-e' only estimate the abundance of given reference transcripts (requires -G) • '-B' enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended) • '-o' output path/file name for the assembled transcripts GTF (default: stdout) • '-A' output path/file name for gene abundance estimates Gene and transcript level expression values can be viewed in these two files: less -S UHR_Rep1/t_data.ctab less -S UHR_Rep1/gene_abundances.tsv You can also create a tidy expression matrix files for the StringTie results using this script. This will be done at both the gene and transcript level and also will take into account the various expression measures produced: coverage, FPKM, and TPM.


wget "https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial/master/scripts/stringtie_expression_matrix.pl"
chmod +x stringtie_expression_matrix.pl
./stringtie_expression_matrix.pl --expression_metric=TPM --result_dirs='HBR_Rep1,HBR_Rep2,HBR_Rep3,UHR_Rep1,UHR_Rep2,UHR_Rep3' --transcript_matrix_file=transcript_tpm_all_samples.tsv --gene_matrix_file=gene_tpm_all_samples.tsv
./stringtie_expression_matrix.pl --expression_metric=FPKM --result_dirs='HBR_Rep1,HBR_Rep2,HBR_Rep3,UHR_Rep1,UHR_Rep2,UHR_Rep3' --transcript_matrix_file=transcript_fpkm_all_samples.tsv --gene_matrix_file=gene_fpkm_all_samples.tsv
./stringtie_expression_matrix.pl --expression_metric=Coverage --result_dirs='HBR_Rep1,HBR_Rep2,HBR_Rep3,UHR_Rep1,UHR_Rep2,UHR_Rep3' --transcript_matrix_file=transcript_coverage_all_samples.tsv --gene_matrix_file=gene_coverage_all_samples.tsv
head transcript_tpm_all_samples.tsv gene_tpm_all_samples.tsv

You can also use the R package Ballgown, as suggested in the nature protocol paper “Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown”. More specifically,


bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
gene_expression = gexpr(bg)

Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene.

FAQs

Q: Why does base quality of reads generally decreases at the end of the read? A: https://www.biostars.org/p/177027/ Q: How to treat the unpaired data after trimmomatic? A: Some background knowledge: When a trimming program trims data one of the reads (assuming you are using paired-end data) may become short and fail a criteria you have set (e.g. minimum length 25 bp). At the point that read is removed by the trimming program. Let us say that was from R2 file. A trimming program should remove the corresponding R1 read from R1-file (even though that may have passed) when R2 read is dropped. If it does not do that you are left with an unpaired read in R1 file. Aligners expect reads to be in the same order in R1/R2 files. If they are not then you can get strange results (e.g. discordant mapping). Presence of unpaired reads in main sequence files (software like trimmomatic collect them in separate files) generally signifies improper use of a trimming program (or using a trimmer that is not paired-end aware).

References

  1. rnaseq tutorial(highly recommended)
  2. RPKM, FPKM and TPM, clearly explained
  3. What the FPKM? A review of RNA-Seq expression units
  4. TCGA clustering example
  5. Introducing Clumpify