ariahahn edited this page Dec 14, 2015 · 21 revisions

Welcome to the SOFA wiki! Here we have a knowledge base detailing installation and usage, and suggesting tools for downstream analysis.

SOFA: Short-ORF Functional Annotation Pipeline

Overview

Accurate description of the microbial communities driving matter and energy transformation in complex ecosystems such as soils cannot yet be effectively accomplished using assembly-based approaches, despite the rise of next-generation sequencing technologies. Here we present SOFA, a modular open-source pipeline enabling comparative functional annotation of unassembled short-read data. The pipeline attempts to merge mate pairs in FASTQ files, predicts open reading frames (ORFs) on merged and unmerged reads as small as 70 bp, and completes an additional step that we term 'deduplication'. Deduplication prevents the double counting of ORFs predicted from unmerged paired-end reads by checking for homologous annotations that span the same gene, allowing for quantitatively accurate gene counts. SOFA enables downstream processing stages within the existing MetaPathways pipeline.

Setup and Dependencies

SOFA has the following dependencies:

  • Python 2.7 (modules optparse, sys, re, csv, traceback, pickle, os, glob)
  • FLASH - the manual and the paper
  • FragGeneScan+ - a threaded version of FGS that is 5 times faster using 1 thread and 50 times faster using 8 hyper-threaded cores
  • LAST - a version of LAST that has been adapted to produce BLAST-like tabular output and provide BLAST-like e-values and bit-scores
  • A reference database. We recommend using the RefSeq database clustered at 85% similarity.

Source code for the binaries required by SOFA is included as separate directories in the project root directory. Please run make in the project root directory to generate the required binaries. The most recent versions of the required binaries can be found in the aforementioned repositories.
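Before a first run it can help to confirm that the binaries built by make are actually where you expect them. The following is a minimal sketch of such a check, not part of SOFA itself: the function name missing_executables and the paths are hypothetical placeholders, and the snippet is written for Python 3 (it uses shutil.which), independently of the Python 2.7 that SOFA requires.

```python
import shutil


def missing_executables(candidates):
    """Return the labels whose path is neither an executable file nor a command on PATH.

    `candidates` maps a human-readable label to a path or command name.
    """
    missing = []
    for label, path in candidates.items():
        # shutil.which returns None when nothing executable is found at `path`.
        if shutil.which(path) is None:
            missing.append(label)
    return missing


if __name__ == "__main__":
    # Hypothetical locations: replace with wherever the binaries live on your system.
    tools = {
        "FLASH": "/PATH/TO/FLASH-1.2.11/flash",
        "FragGeneScan+": "/PATH/TO/FGS+/FGS+",
        "LAST": "/PATH/TO/LAST/lastal",
    }
    for label in missing_executables(tools):
        print("Not found or not executable:", label)
```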

Running SOFA

You can see the following messages by running "python SOFA1.2.py -h":

This script takes an interleaved FASTQ file and uses the SOFA pipeline:

         Stage 1. : SOFA uses FLASH to merge pairs of reads when possible and produces two
		 FASTQ file containing merged and unmerged reads

         Stage 2. : SOFA concatenates the FASTQ files from stage 1 and converts
		 the FASTQ to FASTA format and provides a mapping file

         Stage 3. : Predicts ORFs using FragGeneScan+ and produces a .faa file 

         Stage 4. : The resulting unmerged reads in the .faa file are LASTED against a 
                    clustered reference protein database (REFSEQ proteins at 85% similarity). 
                    If both the reads of a pairs have hits with the same protein function then
                    one of the reads is removed. The final ORFs are in a .FINAL.faa file

Usage: SOFA1.2.py -i -s -o --FlashExec <FLASH_executable> --FragGeneScanExec <FragGeneScan_executable> --LASTExec <LAST_executable> --stage s (optional flags: --tempdirs t --bitScore bs --evalue ev --tempdirs )

Options: -h, --help Show this help message and exit

   -i INPUTFOLDER 	     The path to the input folder

   -s SAMPLE_NAME 	     Sample name - DO NOT include the file extension

   -o OUTPUTFOLDER	     The output foldername

   --FlashExec	     FLASH Executable

   --LASTExec	     LAST Executable

   --refdb 	             RefDB for the de-deduplication

   --FragGeneScanExec    FragGeneScan+ Executable

   --stage 	             The stages to execute 1 : FLASH; 2 : Format files; 3 : FragGeneScan+ and  4 : Deduplication

   --threads 	     The number of threads to use with FragGeneScan+

   --bitScore	     Optional bit-Score cutoff value (default is 20)

   --evalue	             Optional e-value cutoff value (0.000001)

   --tempdirs 	     The temp directories to use - default is usually fine (no need to specify)
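To illustrate what the --bitScore and --evalue cutoffs mean, here is a small sketch that filters BLAST-like tabular hits. It is not SOFA's own code. It assumes the common 12-column BLAST tabular layout (e-value in column 11, bit-score in column 12) and assumes that a hit is kept when its e-value is at or below the cutoff and its bit-score is at or above the cutoff. The function names are hypothetical.

```python
import csv
import io


def passes_cutoffs(evalue, bit_score, max_evalue=1e-6, min_bit_score=20.0):
    """Assumed rule: keep hits with a small enough e-value and a large enough bit-score."""
    return evalue <= max_evalue and bit_score >= min_bit_score


def filter_hits(handle, max_evalue=1e-6, min_bit_score=20.0):
    """Yield (query, subject, evalue, bit_score) for rows that pass both cutoffs.

    Assumes tab-delimited rows in the 12-column BLAST tabular layout.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue, bit_score = float(row[10]), float(row[11])
        if passes_cutoffs(evalue, bit_score, max_evalue, min_bit_score):
            yield row[0], row[1], evalue, bit_score


if __name__ == "__main__":
    # Two made-up rows: the first passes the default cutoffs, the second does not.
    demo = io.StringIO(
        "readA_1\tref1\t90.0\t30\t3\t0\t1\t30\t5\t34\t1e-10\t55.0\n"
        "readA_2\tref2\t40.0\t25\t15\t0\t1\t25\t5\t29\t0.5\t18.0\n"
    )
    for hit in filter_hits(demo):
        print(hit)
```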

You can specify one or more stages and they will run consecutively (e.g., --stage 1 --stage 2 --stage 3 --stage 4). We suggest testing each stage individually on your computer using a small FASTQ file to start, but once you're up and running it's usually most efficient to run all stages in the same command.
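When running many samples, it can be convenient to assemble the command line programmatically. The sketch below builds the argument list from the flags documented above, repeating --stage once per requested stage. The helper build_sofa_command is hypothetical (it is not shipped with SOFA), and the paths in the demo are placeholders.

```python
def build_sofa_command(input_folder, sample_name, output_folder,
                       flash_exec, fgs_exec, last_exec, refdb,
                       stages=(1, 2, 3, 4), threads=None,
                       bit_score=None, evalue=None, script="SOFA1.2.py"):
    """Return the SOFA command as a list of arguments (nothing is executed here)."""
    cmd = [
        "python", script,
        "-i", input_folder,
        "-s", sample_name,  # sample name without the file extension
        "-o", output_folder,
        "--FlashExec", flash_exec,
        "--LASTExec", last_exec,
        "--refdb", refdb,
        "--FragGeneScanExec", fgs_exec,
    ]
    # Optional flags are only added when a value was given.
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if bit_score is not None:
        cmd += ["--bitScore", str(bit_score)]
    if evalue is not None:
        cmd += ["--evalue", str(evalue)]
    # One --stage flag per stage, in the order requested.
    for stage in stages:
        cmd += ["--stage", str(stage)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; pass evalue as a string to keep its formatting unchanged.
    command = build_sofa_command(
        "/PATH/TO/INPUT/FOLDER", "samplename", "/PATH/TO/OUTPUT/FOLDER",
        "/PATH/TO/FLASH-1.2.11/flash", "/PATH/TO/FGS+/FGS+",
        "/PATH/TO/LAST/lastal", "/PATH/TO/Refdb",
        threads=8, bit_score=20, evalue="0.000001",
    )
    print(" ".join(command))
```

The returned list can be handed to subprocess.call, which avoids shell quoting problems with paths that contain spaces.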

SOFA will produce a folder with the name specified by the -o option. Inside it there will be four additional folders:

 -stage 1 - contains the FLASH outputs

 -stage 2 - contains a mapping file (with the original sequence names and the SOFA names; merged sequences get an "_0" suffix and unmerged reads get "_1" or "_2") and a ".sofa" file, a FASTA file created by combining and converting the combined (merged) and uncombined (unmerged) FASTQ files

 -stage 3 - FGS+ output: a single FASTA file of amino acid sequences (ORFs predicted by FGS+)

 -stage 4 - deduplication outputs:
       .LASTout and .LASTout.tmp: files used for the deduplication
       .status: a file containing the name and status (not a duplicate, duplicate, no hit) of every sequence
       .dups.removed.faa: a FASTA file with duplicate read pairs removed
       .nohit.faa: a FASTA file containing sequences with no significant hits in the reference database
       .SOFA.final.faa: a FASTA file with all non-duplicate sequences that also had hits in the reference database
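The three status values in the stage 4 .status file can be illustrated with a simplified sketch of the deduplication idea. This is not SOFA's actual implementation. It assumes one ORF per read, assumes sequence names end in the "_0", "_1" or "_2" suffix described for stage 2, and uses an identical reference identifier as a stand-in for "the same protein function". The names pair_key and assign_status are hypothetical.

```python
def pair_key(sofa_name):
    """Split a name such as 'read7_1' into its base ('read7') and suffix ('1')."""
    base, _, suffix = sofa_name.rpartition("_")
    return base, suffix


def assign_status(names, best_hit):
    """Map every sequence name to 'not a duplicate', 'duplicate' or 'no hit'.

    names    -- iterable of sequence names, in file order
    best_hit -- dict from name to the reference it hit (absent means no significant hit)
    """
    status = {}
    kept = {}  # read-pair base name -> reference hit of the mate already kept
    for name in names:
        hit = best_hit.get(name)
        if hit is None:
            status[name] = "no hit"
            continue
        base, suffix = pair_key(name)
        is_unmerged_mate = suffix in ("1", "2")
        if is_unmerged_mate and kept.get(base) == hit:
            # The other mate already counted this reference, so drop this one.
            status[name] = "duplicate"
        else:
            status[name] = "not a duplicate"
            if is_unmerged_mate:
                kept[base] = hit
    return status


if __name__ == "__main__":
    demo_names = ["r1_0", "r2_1", "r2_2", "r3_1", "r3_2"]
    demo_hits = {"r1_0": "refA", "r2_1": "refB", "r2_2": "refB", "r3_1": "refC"}
    for name, state in assign_status(demo_names, demo_hits).items():
        print(name, state)
```

In this sketch the first mate encountered is the one that is kept; merged reads ("_0") are never marked as duplicates because they have no mate.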

Here is an example:

python SOFA1.2.py -i /PATH/TO/INPUT/FOLDER -s samplename -o /PATH/TO/OUTPUT/FOLDER --FlashExec /PATH/TO/FLASH-1.2.11/flash --LASTExec /PATH/TO/LAST/lastal --refdb /PATH/TO/Refdb --FragGeneScanExec /PATH/TO/FGS+/FGS+ --threads 8 --bitScore 20 --evalue 0.000001 --stage 1 --stage 2 --stage 3 --stage 4

Downstream Analysis

Ultimately, SOFA provides a ".SOFA.final.faa" file: a FASTA file of amino acid sequences that is deduplicated and in which every sequence had at least one hit in the reference database used for deduplication. This means SOFA outputs are compatible with any software requiring a FASTA file of amino acid sequences, making downstream analysis flexible.
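Because the final output is a plain amino acid FASTA file, it can also be inspected with a few lines of code. The sketch below counts the ORFs and their total length. The helper names read_fasta and summarize are hypothetical, and the sequences in the demo are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first whitespace-delimited token as the sequence name.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(handle):
    """Return (number of sequences, total number of residues)."""
    count = total = 0
    for _, sequence in read_fasta(handle):
        count += 1
        total += len(sequence)
    return count, total


if __name__ == "__main__":
    # Made-up example; for a real run, open your .SOFA.final.faa file instead.
    demo = io.StringIO(">orf1 example\nMKV\nLLA\n>orf2\nMTT\n")
    print(summarize(demo))
```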

We recommend using MetaPathways: it's simple, and the ".SOFA.final.faa" file can be used directly as input. MetaPathways can then be used to annotate the sequences, produce environmental Pathway Genome Databases (ePGDBs), and create easily exportable tables of ORF counts and annotations.

FAQ

1. Why FLASH? FLASH is, in our opinion, a very well-implemented piece of software. It's fast, easy to install, and easy to use.

2. Why FGS+? FGS+ is fast, and when processing large data sets, speed is important. Let us know if you need a version of SOFA compatible with other versions of FGS.

3. Why LAST? Again, speed: it's ~80 times faster than BLAST, and with the implementation found here the user can specify the bit-score and e-value cutoffs and produce BLAST-like tabular output. Let us know if you need a version of SOFA compatible with other versions of LAST.

4. Large SOFA Outputs?

Large SOFA outputs (long lists of ORFs) can cause functional annotation with LAST to fail. Alternative annotation tools such as DIAMOND can be used as a workaround. To do this, make sure your DIAMOND output is in tab-delimited files, renamed to LASTout.txt files. The paired gff files need to be created from the MetaPathways RefScores files with the companion utility script my_script.py. The SOFA mapping file from Stage 2 should be renamed to mapping.txt and provided to MetaPathways in the preprocessed/ directory of the output.

References