Bioinformatic sequence recovery for universal target-capture bait kits can be substantially improved by tailoring target files to the group under study. To enable the best possible locus recovery from Angiosperms353 capture data, we have developed an expanded target file (mega353.fasta) incorporating sequences from over 550 transcriptomes from the 1KP project. To maximise computational efficiency, we provide the script filter_mega353.py, which can be used to subsample the mega353.fasta file based on user-selected taxa or taxon groups. These groups can be defined using unique 1KP transcriptome codes, species, families, orders, or broader groups (e.g. Basal Eudicots, Monocots). In addition, we provide the script BYO_transcriptome.py, which can be used to add sequences from any transcriptome to any protein-coding nucleotide target file. These tailored and customised target files can be used directly in target-capture pipelines such as HybPiper.
Data files
- mega353.fasta. A target file for use with target enrichment datasets captured using the Angiosperms353 bait kit.
- filtering_options.csv. A comma-separated values file listing the options available for filtering the mega353.fasta file. This reference file can also be produced by the filter_mega353.py script (see below).
Scripts
- filter_mega353.py. A script to filter the mega353.fasta target file.
- BYO_transcriptome.py. A script to add sequences from any transcriptome dataset to any target file containing protein-coding sequences.
Manuscript
Dependencies for filter_mega353.py
Dependencies for BYO_transcriptome.py
- Python 3.7 or higher
- EXONERATE 2.4.0
- HMMER 3.2.1 or higher
- MAFFT 7.407 or higher
- BioPython 1.76 or higher
Please see the Wiki page Installing dependencies for further details.
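As a quick pre-flight check, the sketch below reports which of the listed dependencies cannot be found. It is not part of the NewTargets package. It assumes the external programs are on the PATH under their usual executable names (exonerate, hmmsearch, mafft) and that BioPython is importable as Bio. It checks for presence only, not for the minimum versions listed above.

```python
import importlib.util
import shutil
import sys

# Assumed executable names for EXONERATE, HMMER and MAFFT.
REQUIRED_EXECUTABLES = ("exonerate", "hmmsearch", "mafft")


def find_missing_dependencies(executables=REQUIRED_EXECUTABLES, module="Bio"):
    """Return a list describing each dependency that could not be found."""
    missing = []
    # The scripts require Python 3.7 or higher.
    if sys.version_info < (3, 7):
        missing.append("Python 3.7 or higher")
    # External programs must be discoverable on the PATH.
    for name in executables:
        if shutil.which(name) is None:
            missing.append(name)
    # BioPython is imported as the top-level module "Bio".
    if importlib.util.find_spec(module) is None:
        missing.append("Python module " + module)
    return missing


if __name__ == "__main__":
    problems = find_missing_dependencies()
    if problems:
        print("Missing dependencies:", ", ".join(problems))
    else:
        print("All dependencies found (versions not checked).")
```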
Assuming all dependencies are installed, either:
- Download the NewTargets package directly from the repository home page and unzip it. Note that the mega353.fasta file in the unzipped package is also provided as a .zip file, and will need to be unzipped separately.
- Clone the repository using the command git clone https://github.com/chrisjackson-pellicle/NewTargets.git. Unzip the mega353.zip file.
Input:
- mega353.fasta. The expanded Angiosperms353 target file.
- select_file. A text file containing a list of IDs; sequences from these samples will be retained in the filtered output target file.
Output:
- filtered_target_file. The filtered target file containing the default Angiosperms353 sequences and any additional sequences corresponding to IDs in the select_file.
- report_file. A report file in .csv format, listing samples with sequences retained in the filtered target file (excluding default Angiosperms353 samples).
Quick usage:
python filter_mega353.py [-h] [-filtered_target_file FILTERED_TARGET_FILE]
                         [-report_filename REPORT_FILENAME]
                         [-list_filtering_options]
                         mega353_file select_file
Example command line:
python filter_mega353.py mega353.fasta select_asparagales.txt -filtered_target_file asparagales_targetfile.fasta -report_filename asparagales_report.csv
To generate the filtering_options.csv file:
python filter_mega353.py -list_filtering_options
Please see the Wiki page filter_mega353 for further details.
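After filtering, it can be useful to confirm that the output target file is smaller than mega353.fasta. The sketch below is not part of the package. It counts fasta records by their header lines and tallies sequences per gene. The per-gene tally assumes that headers follow a sample-gene pattern (for example >sampleA-gene1, the convention HybPiper uses for target files); if the headers differ, only the total count is meaningful.

```python
from collections import Counter
from pathlib import Path


def count_fasta_records(fasta_path):
    """Count sequences by counting header lines (lines starting with '>')."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_records_per_gene(fasta_path):
    """Tally sequences per gene, ASSUMING headers look like '>sample-gene'."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            fields = line[1:].split()
            if not fields:
                continue  # skip an empty header line
            # The gene name is taken as the text after the last hyphen.
            gene = fields[0].rsplit("-", 1)[-1]
            counts[gene] += 1
    return counts


if __name__ == "__main__":
    # File names taken from the example command line above.
    for name in ("mega353.fasta", "asparagales_targetfile.fasta"):
        if Path(name).exists():
            print(name, count_fasta_records(name), "sequences")
```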
Input:
- target_file. A target file containing protein-coding nucleotide sequences in fasta format.
- transcriptomes_folder. A directory containing one or more transcriptomes in fasta format.
Output:
These are the main results and reports folders; see the Wiki page for full output details.
- 17_mega_target_file. A folder containing the final target file BYO_target.fasta.
- 18_reports. A folder containing the general report file summary_report.csv, and the file report_per_gene.csv containing a presence/absence matrix of transcriptome hits for each gene/transcriptome.
Quick usage:
python BYO_transcriptome.py [-h] [-num_hits_to_recover <integer>]
                            [-python_threads <integer>]
                            [-external_program_threads <integer>]
                            [-length_percentage <float>]
                            [-hmmsearch_evalue <number in scientific notation; default is 1e-50>]
                            [-trim_to_refs <taxon_name>] [-no_n]
                            [-skip_exonerate_frameshift_fix]
                            [-discard_short]
                            target_file transcriptomes_folder
Example command line:
python BYO_transcriptome.py asparagales_targetfile.fasta additional_asparagales_transcriptomes_folder -python_threads 4 -external_program_threads 4
Please see the Wiki page BYO_transcriptome for further details.
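The two example command lines can be chained, because the filtered target file written by filter_mega353.py is a valid input target file for BYO_transcriptome.py. The sketch below is one possible wrapper, not part of the package. The function names are hypothetical; it uses only the options shown in the usage summaries above and assumes both scripts are in the current working directory.

```python
import subprocess
import sys


def build_filter_command(mega353_file, select_file, filtered_target_file, report_filename):
    """Build the filter_mega353.py command as an argument list."""
    return [
        sys.executable, "filter_mega353.py", mega353_file, select_file,
        "-filtered_target_file", filtered_target_file,
        "-report_filename", report_filename,
    ]


def build_byo_command(target_file, transcriptomes_folder,
                      python_threads=4, external_program_threads=4):
    """Build the BYO_transcriptome.py command as an argument list."""
    return [
        sys.executable, "BYO_transcriptome.py", target_file, transcriptomes_folder,
        "-python_threads", str(python_threads),
        "-external_program_threads", str(external_program_threads),
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    filtered = "asparagales_targetfile.fasta"
    commands = [
        build_filter_command("mega353.fasta", "select_asparagales.txt",
                             filtered, "asparagales_report.csv"),
        build_byo_command(filtered, "additional_asparagales_transcriptomes_folder"),
    ]
    # Print the commands for inspection; call run_pipeline(commands) to execute them.
    for command in commands:
        print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.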