This simple RNA-seq pipeline processes most RNA-seq protocols. It uses kallisto, an ultra-fast pseudoalignment-based tool, to estimate transcript abundances from unmapped reads.
Python packages. This pipeline uses pypiper and looper. You can do a user-specific install of these like this:
pip install --user https://github.com/epigen/pypiper/zipball/v0.6
pip install --user https://github.com/epigen/looper/zipball/v0.7.2
Required executables. You will need some common bioinformatics software installed. The required tools are listed in the tools section of the pipeline configuration file (rnaseq.yaml).
Static files. This pipeline requires static files which are specified in the pipeline configuration file.
- Clone the pipeline repository: git clone [email protected]:epigen/open_pipelines.git;
- Adapt the pipeline configuration file to point to specific software if needed;
- Create a sample annotation sheet containing the variables sample_name, protocol, and organism;
- Create a project configuration file that points to the pipeline interface file and the sample annotation sheet;
- Run pipelines using looper: looper run project_config.yaml.
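As a minimal sketch of the annotation sheet step, the following Python snippet writes a sheet with the three required columns. It assumes the sheet is a CSV file; the sample names, the protocol and organism values, and the helper name write_annotation_sheet are made up for illustration, so check the Looper documentation for the values your setup expects.

```python
import csv
import io

# Columns named in the steps above.
REQUIRED_COLUMNS = ["sample_name", "protocol", "organism"]

# Hypothetical samples, for illustration only.
EXAMPLE_SAMPLES = [
    {"sample_name": "sample_A", "protocol": "RNA-seq", "organism": "human"},
    {"sample_name": "sample_B", "protocol": "RNA-seq", "organism": "human"},
]


def write_annotation_sheet(samples, handle):
    """Write samples as CSV, with the required columns first."""
    for sample in samples:
        missing = [c for c in REQUIRED_COLUMNS if c not in sample]
        if missing:
            raise ValueError(f"sample is missing columns: {missing}")
    # Keep any extra columns after the required ones.
    extra = sorted({k for s in samples for k in s} - set(REQUIRED_COLUMNS))
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS + extra, restval="")
    writer.writeheader()
    writer.writerows(samples)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_annotation_sheet(EXAMPLE_SAMPLES, buffer)
    print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("sample_annotation.csv", "w", newline="") as the handle.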
More detailed instructions on creating a project configuration file and sample annotation sheet can be found in the Looper documentation.
In the particular case of the RNA-seq pipeline, one special column in the annotation sheet can be used to pair samples for peak calling. Add a column named "compare_sample" containing the name ("sample_name" column) of the sample to use as background.
If given more than one BAM file as input, the pipeline will merge them before beginning processing. The merged, unmapped input BAM file will be output in $sample_name/unmapped. This file is temporary and will be removed if the pipeline finishes successfully.
FastQC is run on the unaligned input BAM files for quality control.
An HTML report and accompanying zip file will be output in the root directory of each sample.
Reads are trimmed for adapters prior to alignment.
Adapter sequences to be trimmed can be specified in a FASTA file, whose path is set in the pipeline configuration file under resources: adapters.
Two trimming programs can be selected in the pipeline configuration file under parameters: trimmer: trimmomatic and skewer. Rigorous benchmarking of the two trimmers could still be performed; the reasons to prefer skewer are its considerable speed advantage over trimmomatic and its availability as a binary executable rather than a Java jar.
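To illustrate how the parameters: trimmer setting might be checked, here is a small sketch that validates the choice against the two supported trimmers. The configuration is represented as a plain Python dictionary mirroring the keys named above; the function name select_trimmer and the example adapter path are hypothetical and not part of the pipeline.

```python
SUPPORTED_TRIMMERS = ("trimmomatic", "skewer")


def select_trimmer(config):
    """Return the trimmer named under config['parameters']['trimmer']."""
    trimmer = config.get("parameters", {}).get("trimmer")
    if trimmer not in SUPPORTED_TRIMMERS:
        raise ValueError(
            f"trimmer must be one of {SUPPORTED_TRIMMERS}, got {trimmer!r}"
        )
    return trimmer


# Hypothetical configuration mirroring the keys of the pipeline configuration file.
example_config = {
    "resources": {"adapters": "/path/to/adapters.fa"},
    "parameters": {"trimmer": "skewer"},
}

print(select_trimmer(example_config))
```

Failing early on an unknown trimmer name avoids discovering a typo in the configuration only after the earlier steps have already run.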
The trimmers produce FASTQ files for each read pair and one file with unpaired reads, which are stored under $sample_name/unmapped/. These files are temporary and will be removed if the pipeline finishes successfully.
This pipeline uses Kallisto for transcript quantification without the need for alignment.
Kallisto needs a transcriptome index, which should be specified in the pipeline configuration file under resources: kallisto_index. The index can be created with the kallisto index command.
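A minimal sketch of building the index from Python, assuming kallisto is installed and on the PATH and that you have a transcriptome FASTA file. The file names and the helper name build_kallisto_index are placeholders; the -i option names the index file to write.

```python
import subprocess


def build_kallisto_index(transcriptome_fasta, index_path, dry_run=True):
    """Build the `kallisto index` command and optionally run it."""
    cmd = ["kallisto", "index", "-i", index_path, transcriptome_fasta]
    if not dry_run:
        # Requires the kallisto executable; raises if the command fails.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: only show the command that would be executed.
print(" ".join(build_kallisto_index("transcriptome.fa", "transcriptome.idx")))
```

The resulting index path is what resources: kallisto_index should point to.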
A TSV file containing the estimated transcript abundances is created under $sample_name/kallisto/.
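For orientation, a quantification call for paired-end reads could be assembled roughly as below. This is a sketch, not the pipeline's actual invocation: the options the pipeline passes are not stated here, and the function name and file names are hypothetical. It uses kallisto quant with -i (index), -o (output directory) and -t (threads).

```python
def kallisto_quant_command(index_path, output_dir, fastq_pair, threads=1):
    """Build a paired-end `kallisto quant` command as an argument list."""
    if len(fastq_pair) != 2:
        # Single-end runs need extra options and are outside this sketch.
        raise ValueError("expected exactly two FASTQ files (read 1 and read 2)")
    return [
        "kallisto", "quant",
        "-i", index_path,
        "-o", output_dir,
        "-t", str(threads),
        *fastq_pair,
    ]


# Hypothetical file names, for illustration only.
cmd = kallisto_quant_command(
    "transcriptome.idx",
    "sample_A/kallisto",
    ["sample_A.1.trimmed.fastq", "sample_A.2.trimmed.fastq"],
    threads=4,
)
print(" ".join(cmd))
```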
You can easily collect statistics from all runs using looper: looper summarize project_config.yaml
Due to the minimal pipeline size/steps, statistics produced are limited. Here are the reported statistics and their description:
- fastqc_GC_perc: GC percentage of all sequenced reads, from the FastQC report.
- fastqc_read_length: read length as determined from the FastQC report.
- fastqc_total_pass_filter_reads: number of pass-filter reads, from the FastQC report.
- fastqc_poor_quality_perc: percentage of poor quality reads, from the FastQC report.
- trim_short_perc: percentage of reads dropped because they were too short after trimming.
- trim_empty_perc: percentage of reads dropped because they were empty after trimming.
- trim_trim_loss_perc: percentage of reads lost during trimming.
- trim_surviving_perc: percentage of reads surviving after trimming.
- trim_trimmed_perc: percentage of reads that were trimmed.
- trim_untrimmed_perc: percentage of reads that were untrimmed.
- transcripts: number of transcripts quantified.
- zero-count_transcripts: number of transcripts with estimated zero counts.
- non-zero-count_transcripts: number of transcripts with more than zero estimated counts.
- log2tpm_mean: mean expression (log2(1 + TPM)) of all transcripts.
- log2tpm_median: median expression (log2(1 + TPM)) of all transcripts.
- log2tpm_iqr: interquartile range (IQR) of the expression (log2(1 + TPM)) of all transcripts.
- non-zero_log2tpm_mean: mean expression (log2 TPM) of transcripts with >0 counts.
- non-zero_log2tpm_median: median expression (log2 TPM) of transcripts with >0 counts.
- non-zero_log2tpm_iqr: interquartile range (IQR) of the expression (log2(1 + TPM)) of transcripts with >0 counts.
- Time: pipeline run time.
- Success: time the pipeline run finished.
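The transcript-level statistics can be sketched as follows. This is an illustration rather than the pipeline's own code. It assumes kallisto's standard abundance table layout (columns target_id, length, eff_length, est_counts, tpm), applies log2(1 + TPM) to every statistic including the non-zero ones, treats "non-zero" as est_counts > 0, and uses the inclusive quartile method for the IQR; the function name and the example values are made up.

```python
import csv
import io
import math
import statistics


def summarize_abundance(handle):
    """Compute transcript-level summary statistics from a kallisto abundance table."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    counts = [float(r["est_counts"]) for r in rows]
    log_tpm = [math.log2(1 + float(r["tpm"])) for r in rows]
    # Expression values of transcripts with more than zero estimated counts.
    expressed = [v for v, c in zip(log_tpm, counts) if c > 0]

    def iqr(values):
        # Needs at least two values on older Python versions.
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        return q3 - q1

    return {
        "transcripts": len(rows),
        "zero-count_transcripts": sum(c == 0 for c in counts),
        "non-zero-count_transcripts": sum(c > 0 for c in counts),
        "log2tpm_mean": statistics.mean(log_tpm),
        "log2tpm_median": statistics.median(log_tpm),
        "log2tpm_iqr": iqr(log_tpm),
        "non-zero_log2tpm_mean": statistics.mean(expressed),
        "non-zero_log2tpm_median": statistics.median(expressed),
        "non-zero_log2tpm_iqr": iqr(expressed),
    }


# Made-up example table, for illustration only.
EXAMPLE_TSV = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1000\t800\t0\t0\n"
    "tx2\t1500\t1300\t10\t1\n"
    "tx3\t2000\t1800\t30\t3\n"
    "tx4\t500\t300\t70\t7\n"
)

stats = summarize_abundance(io.StringIO(EXAMPLE_TSV))
for name, value in stats.items():
    print(f"{name}: {value}")
```

On a real run, open the TSV file under $sample_name/kallisto/ and pass the file handle instead of the in-memory example.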
Pull requests welcome. Active development should occur in the development branch.
- Andre Rendeiro, [email protected]