1. Scalable Delineation of Conserved Segments Along a Reference Scaffold
The main program for finding conserved segments is called delineateSegmentsOnReference.py. In brief, this program finds lengthy, contiguous segments that are shared (conserved) between a reference scaffold and some set of alternate scaffolds.
It first uses mega-BLASTn to identify lengthy HSPs (high-scoring segment pairs) between the reference and each scaffold in the database of alternate scaffolds, in a pairwise fashion. The database of alternate scaffolds should also include the reference scaffold itself under the same name, because the reflexive (self) alignment is used to identify regions along the reference that have missing base calls or are difficult to align to.
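The BLAST step above can be sketched as follows. This is a minimal illustration, not the program's actual code: the helper names `megablast_command` and `parse_hsps`, the `min_length` cutoff, and the choice of tabular output (`-outfmt 6`) are all assumptions. The reference is passed as the BLAST query so that the `qstart`/`qend` columns are reference coordinates.

```python
from typing import Iterable, List, NamedTuple


class Hsp(NamedTuple):
    """One HSP, keeping only the fields needed for a scan along the reference."""
    ref: str          # reference scaffold name (BLAST query column)
    alt: str          # alternate scaffold name (BLAST subject column)
    identity: float   # percent identity
    length: int       # alignment length
    ref_start: int    # 1-based start on the reference
    ref_end: int      # 1-based end on the reference


def megablast_command(ref_fasta: str, alt_multifasta: str, out_tsv: str) -> List[str]:
    """Build a blastn call (megablast task, tabular output).

    Assumption: the real program may invoke BLAST with different options.
    """
    return ["blastn", "-task", "megablast",
            "-query", ref_fasta, "-subject", alt_multifasta,
            "-outfmt", "6", "-out", out_tsv]


def parse_hsps(lines: Iterable[str], min_length: int) -> List[Hsp]:
    """Parse standard 12-column tabular BLAST rows, keeping lengthy HSPs.

    min_length is a hypothetical cutoff; the source does not state one.
    """
    hsps = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hsp = Hsp(f[0], f[1], float(f[2]), int(f[3]), int(f[6]), int(f[7]))
        if hsp.length >= min_length:
            hsps.append(hsp)
    return hsps
```

Because the alternate multi-FASTA also contains the reference, rows where `alt == ref` are the reflexive hits described above.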
Next, a sliding-window scan is performed across the reference scaffold to identify contiguous stretches at high identity (by default, a single HSP covering the full window at 99% identity). Windows found to be conserved between the reference and at least one alternate scaffold are clustered and ordered in adjacent blocks along the reference. A simple algorithm then iterates through the windows and identifies larger conserved segments spanning multiple adjacent windows with similar conservation profiles. In other words, a segment can encapsulate two windows that are adjacent in positional order along the reference scaffold and have some set of aligning alternate scaffolds in common.
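The scan and merge described above can be sketched roughly as below. This is an assumption-laden illustration rather than the program's implementation: the function names are hypothetical, "99% identity" is read as "at least 99%", adjacency is taken to mean that the next window overlaps or abuts the current segment, and "similar conservation profiles" is simplified to "at least one alternate scaffold in common" (the real program exposes a `-c` cutoff whose exact semantics are not described here).

```python
from typing import Dict, FrozenSet, List, Tuple

# (ref_start, ref_end, percent_identity), 1-based inclusive reference coordinates
Interval = Tuple[int, int, float]
Window = Tuple[int, int, FrozenSet[str]]


def conserved_windows(ref_length: int,
                      hsps_by_alt: Dict[str, List[Interval]],
                      window: int, step: int,
                      min_identity: float = 99.0) -> List[Window]:
    """Slide a window along the reference and record, for each window,
    the alternate scaffolds with a single HSP covering it entirely."""
    windows = []
    for start in range(1, ref_length - window + 2, step):
        end = start + window - 1
        alts = frozenset(
            alt for alt, hsps in hsps_by_alt.items()
            if any(s <= start and e >= end and ident >= min_identity
                   for s, e, ident in hsps)
        )
        if alts:  # keep only windows conserved in at least one alternate
            windows.append((start, end, alts))
    return windows


def merge_windows(windows: List[Window]) -> List[Window]:
    """Greedily extend a segment while the next window is adjacent and
    still shares at least one alternate scaffold with the segment."""
    segments: List[Window] = []
    for start, end, alts in windows:
        if segments:
            seg_start, seg_end, seg_alts = segments[-1]
            shared = seg_alts & alts
            if start <= seg_end + 1 and shared:
                segments[-1] = (seg_start, end, shared)
                continue
        segments.append((start, end, alts))
    return segments
```

In this sketch a segment's scaffold set is the running intersection of its windows, so a segment narrows to the scaffolds that support every window it spans.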
Currently, the program is written in Python and is not very fast. Future to-dos include implementing certain components in C++ to provide a significant speed boost. However, even in its current state, if an HPC cluster is available, this step can be parallelized across all scaffolds of interest, enabling a comprehensive identification of conserved segments.
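One way to parallelize across reference scaffolds is to build one command per reference and run them concurrently, or submit each to a cluster scheduler. The sketch below uses only the documented `-r`, `-q` and `-o` options. The helper names, the one-output-directory-per-reference layout, and launching the script via `python` are assumptions made for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Sequence


def build_commands(ref_fastas: Sequence[str], query_multifasta: str,
                   out_root: str) -> List[List[str]]:
    """Build one delineateSegmentsOnReference.py call per reference FASTA.

    Assumption: each run writes to its own directory named after the
    reference file (without extension) under out_root.
    """
    commands = []
    for ref in ref_fastas:
        outdir = Path(out_root) / Path(ref).stem
        commands.append(["python", "delineateSegmentsOnReference.py",
                         "-r", str(ref),
                         "-q", str(query_multifasta),
                         "-o", str(outdir)])
    return commands


def run_all(commands: List[List[str]], max_workers: int = 4) -> List[int]:
    """Run the commands concurrently on one machine; returns exit codes.

    Threads suffice here because the work happens in the child processes.
    """
    def run(cmd: List[str]) -> int:
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

On a cluster, the lists returned by `build_commands` could instead be written out as one job per line for the local scheduler.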
usage: delineateSegmentsOnReference.py [-h] -r REF_FASTA -q QUERY_MULTIFASTA
-o OUTDIR [-rc] [-qc QUERY_CIRCULAR]
[-w SCAN_WINDOW_SIZE]
[-s SCAN_SLIDE_STEP]
[-c MATCHES_PERCENTAGE] [-f]
Program: delineateSegmentsOnReference.py
Author: Rauf Salamzade
The Broad Institute of MIT and Harvard
Earl Lab / Bacterial Genomics Group
This program will take a reference scaffold and identify conserved
and contiguous segments it shares with one or more query scaffolds.
If facing difficulties, please raise issues on the github page:
https://github.com/broadinstitute/consequences
optional arguments:
-h, --help show this help message and exit
-r REF_FASTA, --ref_fasta REF_FASTA
FASTA (single entry) for reference scaffold upon which to call windows on.
-q QUERY_MULTIFASTA, --query_multifasta QUERY_MULTIFASTA
Multi-FASTA for query scaffolds to use in search (should include
reference scaffold as well).
-o OUTDIR, --outdir OUTDIR
-rc, --ref_circular Is the reference scaffold circular/complete?
-qc QUERY_CIRCULAR, --query_circular QUERY_CIRCULAR
A file specifying which of the query scaffolds are circular/complete.
Each query identifier should be a separate line and match an identifier
in the query Multi-FASTA file
-w SCAN_WINDOW_SIZE, --scan_window_size SCAN_WINDOW_SIZE
length of windows to use.
-s SCAN_SLIDE_STEP, --scan_slide_step SCAN_SLIDE_STEP
granularity of sliding.
-c MATCHES_PERCENTAGE, --matches_percentage MATCHES_PERCENTAGE
cutoff for number of matches in common.
-f, --unfilter_segs Do not filter delineated segments for those representative
of the sample set of a specific window.