Skip to content

Expression

Malachi Griffith edited this page Mar 22, 2017 · 96 revisions

RNA-seq Flowchart - Module 4

3-i. Expression

Use Stringtie to generate expression estimates from the SAM/BAM files generated by HISAT2 in the previous module

Note on de novo transcript discovery and differential expression using Stringtie

In this module, we will run Stringtie in 'reference only' mode. For simplicity and to reduce run time, it is sometime useful to perform expression analysis with only known transcript models. However, Stringtie can predict the transcripts present in each library instead (by dropping the '-G' option in stringtie commands as described in the next module). Stringtie will then assign arbitrary transcript IDs to each transcript assembled from the data and estimate expression for those transcripts. One complication with this method is that in each library a different set of transcripts is likely to be predicted for each library. There may be a lot of similarities but the number of transcripts and their exact structure will differ in the output files for each library. Before you can compare across libraries you therefore need to determine which transcripts correspond to each other across the libraries.

  • Stringtie provides a merge command to combine predicted transcript GTF files from across different libraries
  • Once you have a merged GTF file you can run Stringtie again with this instead of the known transcripts GTF file we used above
  • Stringtie also provides 'gffcompare' to compare predicted transcripts to known transcripts
  • Refer to the Stringtie manual for a more detailed explanation:
  • https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

Stringtie basic usage:

stringtie <aligned_reads.bam> [options]*

Extra options specified below:

  • '-p 8' tells Stringtie to use eight CPUs
  • '-G ' reference annotation to use for guiding the assembly process (GTF/GFF3)
  • '-e' only estimate the abundance of given reference transcripts (requires -G)
  • '-B' enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended)
  • '-o' output path/file name for the assembled transcripts GTF (default: stdout)
cd $RNA_HOME/
mkdir -p expression/stringtie/ref_only/
cd expression/stringtie/ref_only/

stringtie -p 8 -G $RNA_REF_GTF -e -B -o HBR_Rep1/transcripts.gtf $RNA_ALIGN_DIR/HBR_Rep1.bam
stringtie -p 8 -G $RNA_REF_GTF -e -B -o HBR_Rep2/transcripts.gtf $RNA_ALIGN_DIR/HBR_Rep2.bam
stringtie -p 8 -G $RNA_REF_GTF -e -B -o HBR_Rep3/transcripts.gtf $RNA_ALIGN_DIR/HBR_Rep3.bam

stringtie -p 8 -G $RNA_REF_GTF -e -B -o UHR_Rep1/transcripts.gtf $RNA_ALIGN_DIR/UHR_Rep1.bam
stringtie -p 8 -G $RNA_REF_GTF -e -B -o UHR_Rep2/transcripts.gtf $RNA_ALIGN_DIR/UHR_Rep2.bam
stringtie -p 8 -G $RNA_REF_GTF -e -B -o UHR_Rep3/transcripts.gtf $RNA_ALIGN_DIR/UHR_Rep3.bam

What does the raw output from Stringtie look like?

less UHR_Rep1/transcripts.gtf

awk '{if ($3=="transcript") print}' UHR_Rep1/transcripts.gtf | cut -f 1,4,9 | less

Press 'q' to exit the 'less' display

PRACTICAL EXERCISE 7

Assignment: Use StringTie to Calculate transcript-level expression estimates for the alignments (bam files) you created in Practical Exercise 5.

  • Hint: You should have six commands for 3 replicates each of tumor and normal samples.

Solution: When you are ready you can check your approach against the Solutions


HTSEQ-COUNT

[OPTIONAL]

Run htseq-count on alignments instead to produce raw counts instead of FPKM/TPM values for differential expression analysis

Refer to the HTSeq documentation for a more detailed explanation:

htseq-count basic usage:

htseq-count [options] <sam_file> <gff_file>

Extra options specified below:

  • '--format' specify the input file format one of BAM or SAM. Since we have BAM format files, select 'bam' for this option.
  • '--order' provide the expected sort order of the input file. Previously we generated position sorted BAM files so use 'pos'.
  • '--mode' determines how to deal with reads that overlap more than one feature. We believe the 'intersection-strict' mode is best.
  • '--stranded' specifies whether data is stranded or not. The TruSeq strand-specific RNA libraries suggest the 'reverse' option for this parameter.
  • '--minaqual' will skip all reads with alignment quality lower than the given minimum value
  • '--type' specifies the feature type (3rd column in GFF file) to be used. (default, suitable for RNA-Seq and Ensembl GTF files: exon)
  • '--idattr' The feature ID used to identity the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.

Run htseq-count and calculate gene-level counts:

cd $RNA_HOME/
mkdir -p expression/htseq_counts
cd expression/htseq_counts

htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id $RNA_ALIGN_DIR/UHR_Rep1.bam $RNA_REF_GTF > UHR_Rep1_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id $RNA_ALIGN_DIR/UHR_Rep2.bam $RNA_REF_GTF > UHR_Rep2_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id $RNA_ALIGN_DIR/UHR_Rep3.bam $RNA_REF_GTF > UHR_Rep3_gene.tsv

htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id $RNA_ALIGN_DIR/HBR_Rep1.bam $RNA_REF_GTF > HBR_Rep1_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id $RNA_ALIGN_DIR/HBR_Rep2.bam $RNA_REF_GTF > HBR_Rep2_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id $RNA_ALIGN_DIR/HBR_Rep3.bam $RNA_REF_GTF > HBR_Rep3_gene.tsv

Merge results files into a single matrix for use in edgeR:

join UHR_Rep1_gene.tsv UHR_Rep2_gene.tsv | join - UHR_Rep3_gene.tsv | join - HBR_Rep1_gene.tsv | join - HBR_Rep2_gene.tsv | join - HBR_Rep3_gene.tsv > gene_read_counts_table_all.tsv

Based on the above read counts, plot the linearity of the ERCC spike-in read counts versus the known concentration of the ERCC spike-in Mix:

cd $RNA_HOME/expression/htseq_counts
cp $RNA_EXT_DATA_DIR/rnaseq-tutorial/ERCC_Controls_Analysis.txt .
cat ERCC_Controls_Analysis.txt

wget https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial/master/scripts/Tutorial_Module4_ERCC_expression.pl
chmod +x Tutorial_Module4_ERCC_expression.pl
./Tutorial_Module4_ERCC_expression.pl
cat $RNA_HOME/expression/htseq_counts/ercc_read_counts.tsv

wget https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial/master/scripts/Tutorial_Module4_ERCC_expression.R
chmod +x Tutorial_Module4_ERCC_expression.R
./Tutorial_Module4_ERCC_expression.R ercc_read_counts.tsv

To view the resulting figure, navigate to the below URL replacing YOUR_IP_ADDRESS with your amazon instance IP address:

  • http://YOUR_IP_ADDRESS/rnaseq/expression/htseq_counts/Tutorial_Module4_ERCC_expression.pdf

| Previous Section | This Section | Next Section | |:-------------------------------------:|:---------------------------:|:--------------------------------------------------------:| | Alignment QC | Expression | Differential Expression |