Home
Welcome to the SOFA wiki! This knowledge base covers installation and usage, and suggests tools for downstream analysis.
SOFA: Short-ORF Functional Annotation Pipeline
Accurate description of the microbial communities driving matter and energy transformation in complex ecosystems such as soils cannot yet be effectively accomplished using assembly-based approaches, despite the rise of next-generation sequencing technologies. Here we present SOFA, a modular open-source pipeline enabling comparative functional annotation of unassembled short-read data. The pipeline attempts to merge mate pairs in FASTQ files, predicts open reading frames (ORFs) on merged and unmerged reads as small as 70 bp, and completes an additional step that we term 'deduplication'. Deduplication prevents the double counting of ORFs predicted from unmerged paired-end reads by checking for homologous annotations that span the same gene, allowing for quantitatively accurate gene counts. SOFA enables downstream processing stages within the existing MetaPathways pipeline.
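The deduplication idea can be illustrated with a minimal Python sketch. This is not SOFA's actual implementation. It assumes read names that end in "_1" or "_2" for the two mates of an unmerged pair and "_0" for merged reads (the naming SOFA uses in its stage 2 mapping file), and it uses "both mates have the same reference hit ID" as a simplified stand-in for "homologous annotations that span the same gene". The function name `deduplicate` and the `hits` dictionary are hypothetical.

```python
def deduplicate(hits):
    """Assign a status to every read.

    hits: dict mapping a read name (ending in _0, _1 or _2) to the ID of
    its reference hit, or None when the read has no significant hit.
    This layout is an assumption made for illustration only.
    """
    status = {}
    for name, hit in hits.items():
        base, _, mate = name.rpartition("_")
        if hit is None:
            status[name] = "no hit"
        elif mate == "2" and hits.get(base + "_1") == hit:
            # Mate 1 already counts this gene, so mate 2 is the duplicate.
            status[name] = "duplicate"
        else:
            status[name] = "not a duplicate"
    return status


example = {
    "read1_0": "protA",                      # merged read
    "read2_1": "protB", "read2_2": "protB",  # both mates hit the same protein
    "read3_1": "protC", "read3_2": "protD",  # mates hit different proteins
    "read4_1": None,                         # no significant hit
}
result = deduplicate(example)
```

In this sketch only `read2_2` is marked as a duplicate, so the gene hit by the `read2` pair is counted once rather than twice.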
SOFA has the following dependencies:
- Python 2.7 (modules optparse, sys, re, csv, traceback, pickle, os, glob)
- FLASH - the manual and the paper
- FragGeneScan+ - a threaded version of FGS that is 5 times faster using 1 thread and 50 times faster using 8 hyper-threaded cores
- LAST - a version of LAST that has been adapted to produce BLAST-like tabular output and provide BLAST-like e-values and bit-scores
- A reference database - we recommend using the RefSeq database clustered at 85% similarity
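Before a long run, it can help to confirm that the three external executables are where you expect them. The following is a small sketch and not part of SOFA; the function name `missing_executables` and the example paths are hypothetical.

```python
import os


def missing_executables(paths):
    """Return the labels whose path is not an existing, executable file.

    paths: dict mapping a label (e.g. "FLASH") to a filesystem path.
    """
    return [label for label, path in paths.items()
            if not (os.path.isfile(path) and os.access(path, os.X_OK))]


# Placeholder paths; substitute the locations on your own system.
problems = missing_executables({
    "FLASH": "/PATH/TO/FLASH-1.2.11/flash",
    "FragGeneScan+": "/PATH/TO/FGS+/FGS+",
    "LAST": "/PATH/TO/LAST/lastal",
})
if problems:
    print("Not found or not executable: " + ", ".join(problems))
```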
You can see the following messages by running "python SOFA1.1.py -h":
This script takes an interleaved FASTQ file and uses the SOFA pipeline:
Stage 1. : SOFA uses FLASH to merge pairs of reads when possible and produces two
FASTQ file containing merged and unmerged reads
Stage 2. : SOFA concatenates the FASTQ files from stage 1 and converts
the FASTQ to FASTA format and provides a mapping file
Stage 3. : Predicts ORFs using FragGeneScan+ and produces a .faa file
Stage 4. : The resulting unmerged reads in the .faa file are LASTED against a
clustered reference protein database (REFSEQ proteins at 85% similarity).
If both the reads of a pairs have hits with the same protein function then
one of the reads is removed. The final ORFs are in a .FINAL.faa file
Usage: SOFA1.1.py -i -s -o --FlashExec <FLASH_executable> --FragGeneScanExec <FragGeneScan_executable> --LASTExec <LAST_executable> --stage s (optional flags: --tempdirs t --bitScore bs --evalue ev --tempdirs )
Options:
-h, --help Show this help message and exit
-i INPUTFOLDER The path to the input folder
-s SAMPLE_NAME Sample name - DO NOT include the file extension
-o OUTPUTFOLDER The output foldername
--FlashExec FLASH Executable
--LASTExec LAST Executable
--refdb RefDB for the de-deduplication
--FragGeneScanExec FragGeneScan+ Executable
--stage The stages to execute 1 : FLASH; 2 : Format files; 3 : FragGeneScan+ and 4 : Deduplication
--threads The number of threads to use with FragGeneScan+
--bitScore Optional bit-Score cutoff value (default is 20)
--evalue Optional e-value cutoff value (0.000001)
--tempdirs The temp directories to use - default is usually fine (no need to specify)
You can specify one or more stages and they will run consecutively (e.g., --stage 1 --stage 2 --stage 3 --stage 4). We suggest testing each stage individually on your computer, starting with a small FASTQ file. Once you're up and running, it's usually most efficient to run all stages in the same command.
SOFA will produce a folder with the name specified in the -o option. Inside this there will be four additional folders.
-stage 1 - contains FLASH outputs
-stage 2 - contains a mapping file (with original sequence names and SOFA names; merged sequences get an "_0" and unmerged reads get "_1" or "_2") and a ".sofa" file, a FASTA file created by combining and converting the combined and uncombined FASTQ files
-stage 3 - FGS+ output: a single FASTA file of amino acid sequences (ORFs predicted by FGS+)
-stage 4 - .LASTout and .LASTout.tmp, used for the deduplication, plus the following files:
.status - a file containing the name and status (not a duplicate, duplicate, no hit) of every sequence
.dups.removed.faa - a FASTA file containing duplicate read pairs removed
.nohit.faa - a FASTA file containing sequences with no significant hits in the reference database
.SOFA.final.faa - a FASTA file with all non-duplicate sequences that also had hits in the reference database
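The "_0", "_1" and "_2" suffixes make it easy to check how many reads were merged by FLASH. The sketch below tallies SOFA names by suffix. It assumes each name ends with one of these suffixes, as described for the stage 2 mapping file; names produced by later stages may carry extra fields and would need adapting. The function name `tally_read_kinds` is hypothetical.

```python
from collections import Counter

# Suffix meanings as described for the stage 2 mapping file.
KINDS = {"0": "merged", "1": "unmerged mate 1", "2": "unmerged mate 2"}


def tally_read_kinds(names):
    """Count SOFA sequence names by their trailing _0/_1/_2 suffix."""
    counts = Counter()
    for name in names:
        suffix = name.rstrip().rpartition("_")[2]
        counts[KINDS.get(suffix, "unknown")] += 1
    return counts


counts = tally_read_kinds(["seqA_0", "seqB_1", "seqB_2", "seqC_0"])
```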
Here is an example:
python SOFA1.1.py -i /PATH/TO/INPUT/FOLDER -s samplename -o /PATH/TO/OUTPUT/FOLDER --FlashExec /PATH/TO/FLASH-1.2.11/flash --LASTExec /PATH/TO/LAST/lastal --refdb /PATH/TO/Refdb --FragGeneScanExec /PATH/TO/FGS+/FGS+ --threads 8 --bitScore 20 --evalue 0.000001 --stage 1 --stage 2 --stage 3 --stage 4
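If you run SOFA on many samples, the same command can be assembled programmatically. The sketch below only builds the argument list, using the flags shown in the help message above. The helper name `build_sofa_command` and its parameter names are hypothetical, and `python` is assumed to resolve to a Python 2.7 interpreter.

```python
def build_sofa_command(input_dir, sample, output_dir, flash, last, refdb, fgs,
                       stages=(1, 2, 3, 4), threads=8,
                       bit_score="20", evalue="0.000001",
                       script="SOFA1.1.py"):
    """Return the SOFA command line as a list of arguments."""
    cmd = ["python", script,
           "-i", input_dir, "-s", sample, "-o", output_dir,
           "--FlashExec", flash, "--LASTExec", last, "--refdb", refdb,
           "--FragGeneScanExec", fgs,
           "--threads", str(threads),
           "--bitScore", str(bit_score), "--evalue", str(evalue)]
    # --stage is repeated once per requested stage.
    for stage in stages:
        cmd += ["--stage", str(stage)]
    return cmd


cmd = build_sofa_command("/PATH/TO/INPUT/FOLDER", "samplename",
                         "/PATH/TO/OUTPUT/FOLDER",
                         "/PATH/TO/FLASH-1.2.11/flash", "/PATH/TO/LAST/lastal",
                         "/PATH/TO/Refdb", "/PATH/TO/FGS+/FGS+")
# To launch it: import subprocess; subprocess.check_call(cmd)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces. To test a single stage, pass for example `stages=(1,)`.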
Ultimately, SOFA provides a ".SOFA.final.faa" file: a FASTA file of amino acid sequences that is deduplicated and in which every sequence had at least one hit in the reference database used for deduplication. This means SOFA outputs are compatible with any software requiring a FASTA file of amino acid sequences, making downstream analysis flexible.
We recommend using MetaPathways. It's simple: the ".SOFA.final.faa" file can be used directly as input. MetaPathways can then be used to annotate the sequences, produce environmental Pathway Genome Databases (ePGDBs), and create easily exportable tables of ORF counts and annotations.
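Because the final output is plain FASTA, it can also be read with a few lines of Python for a quick sanity check. The following is a minimal sketch that assumes standard FASTA layout (a ">" header line followed by one or more sequence lines). The function name `read_fasta` is hypothetical and is not part of SOFA.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) tuples from an open FASTA file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # The name is the first whitespace-delimited token of the header.
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Small in-memory example; for a real file use open("<sample>.SOFA.final.faa").
demo = io.StringIO(">orf1_0 example\nMKV\nLL\n\n>orf2_1\nMA\n")
records = list(read_fasta(demo))
```

Counting the records in the final file then gives the number of deduplicated ORFs that had a reference hit.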
1. Why FLASH? FLASH is, in our opinion, very well-implemented software. It's fast and easy to install.
2. Why FGS+? FGS+ is fast, and when processing large data sets, speed is important. Let us know if you need a version of SOFA compatible with other versions of FGS.
3. Why LAST? Again, speed: it's ~80 times faster than BLAST, and with the implementation found here the user can specify the bit-score and e-value cutoffs and produce BLAST-like tabular output. Let us know if you need a version of SOFA compatible with other versions of LAST.