Add busco #6

Merged
merged 57 commits into from
Jan 31, 2024
Changes from 18 commits
43615ad
config file and test data
dorien-er Jan 24, 2024
22217d3
organize config in argument groups and add script
dorien-er Jan 25, 2024
3fbb5ff
add busco help text
dorien-er Jan 25, 2024
2f692f0
add tests
dorien-er Jan 25, 2024
7c9c42d
add examples to config
dorien-er Jan 25, 2024
38d93b3
add version script
dorien-er Jan 25, 2024
52f8d95
Merge remote-tracking branch 'origin/main' into add-busco
rcannood Jan 26, 2024
92b83aa
Update src/busco/config.vsh.yaml
dorien-er Jan 26, 2024
3492a11
Update src/busco/script.sh
dorien-er Jan 26, 2024
21d90d7
Update src/busco/script.sh
dorien-er Jan 26, 2024
c8182ea
Update src/busco/config.vsh.yaml
dorien-er Jan 26, 2024
cbb4b0f
Update version
dorien-er Jan 26, 2024
f96a31d
add script to obtain test data
dorien-er Jan 26, 2024
8146482
add changelog entry
dorien-er Jan 26, 2024
06a0d0d
Delete src/busco/version.sh
dorien-er Jan 26, 2024
7308f7c
update cpus input
dorien-er Jan 29, 2024
c14c29c
Merge branch 'main' into add-busco
rcannood Jan 30, 2024
f2404b3
Merge branch 'add-busco' of github.com:dorien-er/biobase into add-busco
rcannood Jan 30, 2024
4312d39
Update src/busco/config.vsh.yaml
dorien-er Jan 30, 2024
cab5eab
fix version
dorien-er Jan 30, 2024
367d10d
Update src/busco/config.vsh.yaml
dorien-er Jan 30, 2024
deb1268
Update src/busco/script.sh
dorien-er Jan 30, 2024
ef9f244
Update src/busco/script.sh
dorien-er Jan 30, 2024
6bb94fe
move into separate module
dorien-er Jan 30, 2024
f7acecc
merge
dorien-er Jan 30, 2024
557f540
add outputs
dorien-er Jan 30, 2024
eec3aad
update tests
dorien-er Jan 30, 2024
032b78c
remove download flags - to be a separate component
dorien-er Jan 30, 2024
12722d1
modify description of list dataset
dorien-er Jan 30, 2024
dd97b4f
fix tests
dorien-er Jan 30, 2024
a416c21
remove files new comp
dorien-er Jan 30, 2024
82eabd4
remove defaults
dorien-er Jan 30, 2024
2243994
Update src/busco/busco/config.vsh.yaml
dorien-er Jan 30, 2024
26cf05b
Update src/busco/busco/config.vsh.yaml
dorien-er Jan 30, 2024
9a148f0
Update src/busco/busco/config.vsh.yaml
dorien-er Jan 30, 2024
af60ba0
Update src/busco/busco/config.vsh.yaml
dorien-er Jan 30, 2024
620c28b
Update src/busco/busco/script.sh
dorien-er Jan 30, 2024
75533f9
fix typo
dorien-er Jan 30, 2024
0f7d03a
remove unrequired params
dorien-er Jan 30, 2024
4af981a
remove unused vars
dorien-er Jan 30, 2024
b4edd71
opt out of run stats by default
dorien-er Jan 30, 2024
8e1700d
update tests
dorien-er Jan 30, 2024
41c0926
update test
dorien-er Jan 30, 2024
0f625a6
remove directory level
dorien-er Jan 31, 2024
f4dfed2
add mkdir
dorien-er Jan 31, 2024
2b60091
enable copying from symlink
dorien-er Jan 31, 2024
d130e08
remove sleep command
dorien-er Jan 31, 2024
1edbcb2
Update src/busco/config.vsh.yaml
dorien-er Jan 31, 2024
ca30ba6
Update src/busco/test.sh
dorien-er Jan 31, 2024
0b957d3
Update src/busco/test.sh
dorien-er Jan 31, 2024
1a50400
add output tests
dorien-er Jan 31, 2024
699631f
fix typo
dorien-er Jan 31, 2024
2e0e486
add genome test data and script
dorien-er Jan 31, 2024
db01bfd
fix typo
dorien-er Jan 31, 2024
ae6e81e
typo
dorien-er Jan 31, 2024
fb2b32f
Merge remote-tracking branch 'origin/main' into add-busco
rcannood Jan 31, 2024
6bee501
use smaller genome
rcannood Jan 31, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -5,6 +5,7 @@
## NEW FEATURES

* `arriba`: Detect gene fusions from RNA-seq data (PR #1).
* `busco`: Assess genome assembly and annotation completeness with single copy orthologs (PR #6).

* `fastp`: An ultra-fast all-in-one FASTQ preprocessor (PR #3).

210 changes: 210 additions & 0 deletions src/busco/config.vsh.yaml
@@ -0,0 +1,210 @@
functionality:
  name: busco
  description: Assessment of genome assembly and annotation completeness with single copy orthologs
  info:
    keywords: [Genome assembly, quality control]
    homepage: https://busco.ezlab.org/
    documentation: https://busco.ezlab.org/busco_userguide.html
    repository: https://gitlab.com/ezlab/busco
    reference: "10.1007/978-1-4939-9173-0_14"
    licence: MIT
  argument_groups:
    - name: Inputs
      arguments:
        - name: --input
          alternatives: ["-i"]
          type: file
          description: |
            Input fasta file or directory containing input fasta files to analyse. Fasta files can either be a nucleotide or protein fasta file, depending on the BUSCO mode.
          required: true
          example: file.fasta
        - name: --mode
          alternatives: ["-m"]
          type: string
          choices: ["protein", "genome", "transcriptome"]
          required: true
          description: |
            BUSCO assessment mode.
          example: protein
        - name: --lineage_dataset
          alternatives: ["-l"]
          type: string
          required: false
          description: |
            Specify a BUSCO lineage dataset that is most closely related to the assembly or gene set being assessed.
            The full list of available datasets can be viewed using "busco --list-datasets".
            When unsure, the "--auto-lineage" flag can be set to automatically find the optimal lineage path.
            Requested datasets will automatically be downloaded if not already present in the download folder.
          example: stramenopiles_odb10

    - name: Outputs
      arguments:
        - name: --output_dir
          alternatives: ["-o"]
          required: true
          direction: output
          type: file
          description: |
            Path to output directory for publishing BUSCO results.
          example: output
        - name: --output_prefix
          type: string
          required: false
          description: |
            Name of the analysis run. Output folders and files will be labeled with this name. If not specified, the input file name is used.
          example: busco_protein

    - name: Resource and Run Settings
      arguments:
        - name: --force
          type: boolean_true
          description: |
            Force rewriting of existing files. Must be used when output files with the provided name already exist.
        - name: --offline
          type: boolean_true
          description: |
            In offline mode BUSCO will not attempt to download files. Ensure all required dataset files are already downloaded and available.
        - name: --opt_out_run_stats
          type: boolean_true
          description: |
            Opt out of data collection (from v5.6.0). Collected data is used to improve BUSCO.
            All collected data is anonymised and includes the pipelines used, the datasets selected, options used and runtime statistics.
        - name: --quiet
          alternatives: ["-q"]
          type: boolean_true
          description: |
            Disable the info logs and display only errors.
        - name: --restart
          alternatives: ["-r"]
          type: boolean_true
          description: |
            Continue a run that had already partially completed. Restarting skips calls to tools that have completed but performs all pre- and post-processing steps.
        - name: --tar
          type: boolean_true
          description: |
            Compress some subdirectories with many files to save space.

    - name: Download Settings
      arguments:
        - name: --download
          type: string
          required: false
          description: |
            Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus".
        - name: --download_base_url
          type: string
          description: |
            Set the url to the remote BUSCO dataset location.
        - name: --download_path
          type: string
          description: |
            Specify filepath for storing BUSCO dataset downloads. The default is a busco_downloads subdirectory in the current working directory.

    - name: Lineage Dataset Settings
      arguments:
        - name: --auto_lineage
          type: boolean_true
          description: |
            Run the auto-lineage pipeline to automatically determine the BUSCO lineage dataset that is most closely related to the assembly or gene set being assessed.
        - name: --auto_lineage_euk
          type: boolean_true
          description: |
            Run auto-placement just on the eukaryota tree to find the optimal lineage path.
        - name: --auto_lineage_prok
          type: boolean_true
          description: |
            Run auto-lineage just on the prokaryota trees to find the optimal lineage path.
        - name: --datasets_version
          type: string
          required: false
          description: |
            Specify the version of BUSCO datasets.
          example: odb10

    - name: Augustus Settings
      arguments:
        - name: --augustus
          type: boolean_true
          description: |
            Use augustus gene predictor for eukaryote runs.
        - name: --augustus_parameters
          type: string
          required: false
          description: |
            Additional parameters to be passed to Augustus (see Augustus documentation: https://github.com/Gaius-Augustus/Augustus/blob/master/docs/RUNNING-AUGUSTUS.md).
            Parameters should be contained within a single string, without whitespace and separated by commas.
          example: "--PARAM1=VALUE1,--PARAM2=VALUE2"
        - name: --augustus_species
          type: string
          required: false
          description: |
            Specify the augustus species.
        - name: --long
          type: boolean_true
          description: |
            Optimize Augustus self-training mode. This adds considerably to the run time, but can improve results for some non-model organisms.

    - name: BBTools Settings
      arguments:
        - name: --contig_break
          type: integer
          default: 10
          description: |
            Number of contiguous Ns to signify a break between contigs in BBTools analysis.
        - name: --limit
          type: integer
          default: 3
          description: |
            Number of candidate regions (contig or transcript) from the BLAST output to consider per BUSCO.
            This option is only effective in pipelines using BLAST, i.e. the genome pipeline (see --augustus) or the prokaryota transcriptome pipeline.
        - name: --scaffold_composition
          type: boolean_true
          description: |
            Writes ACGTN content per scaffold to a file scaffold_composition.txt.

    - name: BLAST Settings
      arguments:
        - name: --e_value
          type: double
          default: 0.001
          description: |
            E-value cutoff for BLAST searches.

    - name: Protein Gene Prediction settings
      arguments:
        - name: --miniprot
          type: boolean_true
          description: |
            Use Miniprot gene predictor.

    - name: MetaEuk Settings
      arguments:
        - name: --metaeuk_parameters
          type: string
          description: |
            Pass additional arguments to Metaeuk for the first run (see Metaeuk documentation https://github.com/soedinglab/metaeuk).
            All parameters should be contained within a single string with no white space, with each parameter separated by a comma.
          example: "--max-overlap=15,--min-exon-aa=15"
        - name: --metaeuk_rerun_parameters
          type: string
          description: |
            Pass additional arguments to Metaeuk for the second run (see Metaeuk documentation https://github.com/soedinglab/metaeuk).
            All parameters should be contained within a single string with no white space, with each parameter separated by a comma.
          example: "--max-overlap=15,--min-exon-aa=15"

  resources:
    - type: bash_script
      path: script.sh
  test_resources:
    - type: bash_script
      path: test.sh
    - type: file
      path: test_data
platforms:
  - type: docker
    image: quay.io/biocontainers/busco:5.6.1--pyhdfd78af_0
    setup:
      - type: docker
        run: |
          echo "busco: \"$(busco --version 2>&1 | sed -n 's/BUSCO \([0-9.]*\)/\1/p')\"" > /var/software_versions.txt
  - type: nextflow
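The Docker setup step records the BUSCO version by piping `busco --version` through `sed`. As a minimal sketch of what that filter does, the snippet below feeds it a sample banner instead of calling the real binary. The function name `extract_busco_version` and the sample banner string are illustrative assumptions, not part of this PR.

```bash
#!/bin/bash
# Hypothetical helper: reads a version banner on stdin and prints only the version number.
extract_busco_version() {
  # -n suppresses default output; the trailing `p` prints only lines where the substitution matched.
  sed -n 's/BUSCO \([0-9.]*\)/\1/p'
}

# Sample banner standing in for `busco --version` output (assumed format).
version="$(echo "BUSCO 5.6.1" | extract_busco_version)"
echo "busco: \"${version}\""
```

Because of `-n` and `p`, input lines that do not contain a `BUSCO <number>` banner produce no output at all, so unrelated log lines would not end up in the versions file.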
60 changes: 60 additions & 0 deletions src/busco/help.txt
@@ -0,0 +1,60 @@
```bash
busco -h
```

Welcome to BUSCO 5.6.1: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide. Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO

optional arguments:
-i SEQUENCE_FILE, --in SEQUENCE_FILE
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
-o OUTPUT, --out OUTPUT
Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. The path to the output folder is set with --out_path.
-m MODE, --mode MODE Specify which BUSCO analysis mode to run.
There are three valid modes:
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
-l LINEAGE, --lineage_dataset LINEAGE
Specify the name of the BUSCO lineage to be used.
--augustus Use augustus gene predictor for eukaryote runs
--augustus_parameters --PARAM1=VALUE1,--PARAM2=VALUE2
Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--augustus_species AUGUSTUS_SPECIES
Specify a species for Augustus training.
--auto-lineage Run auto-lineage to find optimum lineage path
--auto-lineage-euk Run auto-placement just on eukaryote tree to find optimum lineage path
--auto-lineage-prok Run auto-lineage just on non-eukaryote trees to find optimum lineage path
-c N, --cpu N Specify the number (N=integer) of threads/cores to use.
--config CONFIG_FILE Provide a config file
--contig_break n Number of contiguous Ns to signify a break between contigs. Default is n=10.
--datasets_version DATASETS_VERSION
Specify the version of BUSCO datasets, e.g. odb10
--download [dataset [dataset ...]]
Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus". If used together with other command line arguments, make sure to place this last.
--download_base_url DOWNLOAD_BASE_URL
Set the url to the remote BUSCO dataset location
--download_path DOWNLOAD_PATH
Specify local filepath for storing BUSCO dataset downloads
-e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
-f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist.
-h, --help Show this help message and exit
--limit N How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
--list-datasets Print the list of available BUSCO datasets
--long Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms
--metaeuk_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--metaeuk_rerun_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--miniprot Use miniprot gene predictor
--skip_bbtools Skip BBTools for assembly statistics
--offline To indicate that BUSCO cannot attempt to download files
--opt-out-run-stats Opt out of data collection. Information on the data collected is available in the user guide.
--out_path OUTPUT_PATH
Optional location for results folder, excluding results folder name. Default is current working directory.
-q, --quiet Disable the info logs, displays only errors
-r, --restart Continue a run that had already partially completed.
--scaffold_composition
Writes ACGTN content per scaffold to a file scaffold_composition.txt
--tar Compress some subdirectories with many files to save space
-v, --version Show this version and exit
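The help text asks for `--augustus_parameters` and the two Metaeuk options to be given as a single comma-separated string with no white space. The check below is a sketch of how a wrapper could validate that format before calling BUSCO. `is_valid_param_string` is a hypothetical helper, not something BUSCO or this PR provides.

```bash
#!/bin/bash
# Hypothetical validator: succeeds if the string is non-empty and contains no whitespace.
is_valid_param_string() {
  local params="$1"
  [[ -n "$params" && ! "$params" =~ [[:space:]] ]]
}

# Split a valid string on commas to show the individual arguments.
params="--max-overlap=15,--min-exon-aa=15"
if is_valid_param_string "$params"; then
  IFS=',' read -r -a parts <<< "$params"
  printf '%s\n' "${parts[@]}"
fi
```

A value such as `"--max-overlap=15, --min-exon-aa=15"` (note the space after the comma) would be rejected by this check.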
60 changes: 60 additions & 0 deletions src/busco/script.sh
@@ -0,0 +1,60 @@
#!/bin/bash

## VIASH START
## VIASH END

# Unset boolean flags that are "false" so the ${var:+...} expansions below skip them
[[ "$par_tar" == "false" ]] && unset par_tar
[[ "$par_force" == "false" ]] && unset par_force
[[ "$par_offline" == "false" ]] && unset par_offline
[[ "$par_opt_out_run_stats" == "false" ]] && unset par_opt_out_run_stats
[[ "$par_quiet" == "false" ]] && unset par_quiet
[[ "$par_restart" == "false" ]] && unset par_restart
[[ "$par_auto_lineage" == "false" ]] && unset par_auto_lineage
[[ "$par_auto_lineage_euk" == "false" ]] && unset par_auto_lineage_euk
[[ "$par_auto_lineage_prok" == "false" ]] && unset par_auto_lineage_prok
[[ "$par_augustus" == "false" ]] && unset par_augustus
[[ "$par_long" == "false" ]] && unset par_long
[[ "$par_scaffold_composition" == "false" ]] && unset par_scaffold_composition
[[ "$par_miniprot" == "false" ]] && unset par_miniprot

# Use the input file name as the run name unless a prefix was given
if [[ -n "$par_output_prefix" ]]; then
  prefix="$par_output_prefix"
else
  prefix="$(basename -- "$par_input")"
fi

busco \
  --in "$par_input" \
  --mode "$par_mode" \
  --out "$prefix" \
  ${meta_cpus:+--cpu "${meta_cpus}"} \
  ${par_lineage_dataset:+--lineage_dataset "$par_lineage_dataset"} \
  ${par_augustus:+--augustus} \
  ${par_augustus_parameters:+--augustus_parameters "$par_augustus_parameters"} \
  ${par_augustus_species:+--augustus_species "$par_augustus_species"} \
  ${par_auto_lineage:+--auto-lineage} \
  ${par_auto_lineage_euk:+--auto-lineage-euk} \
  ${par_auto_lineage_prok:+--auto-lineage-prok} \
  ${par_contig_break:+--contig_break "$par_contig_break"} \
  ${par_datasets_version:+--datasets_version "$par_datasets_version"} \
  ${par_e_value:+--evalue "$par_e_value"} \
  ${par_force:+--force} \
  ${par_limit:+--limit "$par_limit"} \
  ${par_long:+--long} \
  ${par_metaeuk_parameters:+--metaeuk_parameters "$par_metaeuk_parameters"} \
  ${par_metaeuk_rerun_parameters:+--metaeuk_rerun_parameters "$par_metaeuk_rerun_parameters"} \
  ${par_miniprot:+--miniprot} \
  ${par_offline:+--offline} \
  ${par_opt_out_run_stats:+--opt-out-run-stats} \
  ${par_quiet:+--quiet} \
  ${par_restart:+--restart} \
  ${par_scaffold_composition:+--scaffold_composition} \
  ${par_tar:+--tar} \
  ${par_download_base_url:+--download_base_url "$par_download_base_url"} \
  ${par_download_path:+--download_path "$par_download_path"} \
  ${par_download:+--download "$par_download"}

# Move the BUSCO results into the requested output directory
mkdir -p "$par_output_dir"
mv "$prefix"/* "$par_output_dir"
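The script relies on two Bash idioms: clearing boolean parameters that arrive as the string `false`, and `${var:+...}` expansions that emit a flag only when the variable is set and non-empty. A self-contained sketch follows, using made-up parameter values and a hypothetical `build_args` function that prints the arguments instead of running BUSCO.

```bash
#!/bin/bash
# Hypothetical demo: prints, one per line, the optional arguments that would be passed on.
build_args() {
  local par_force="$1" par_limit="$2"
  # A boolean that arrives as "false" is cleared so the expansion below skips it.
  [[ "$par_force" == "false" ]] && par_force=""
  local args=(
    ${par_force:+--force}
    ${par_limit:+--limit "$par_limit"}
  )
  # Only print when at least one argument was produced.
  if (( ${#args[@]} > 0 )); then
    printf '%s\n' "${args[@]}"
  fi
}

build_args true 3     # prints --force, --limit and 3 on separate lines
build_args false ""   # prints nothing
```

The inner quotes in `${par_limit:+--limit "$par_limit"}` keep the value as a single word even if it contains spaces, while the flag and its value still expand to two separate arguments.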

29 changes: 29 additions & 0 deletions src/busco/test.sh
@@ -0,0 +1,29 @@
test_dir="$meta_resources_dir/test_data"

echo "> Running busco"
echo "$(busco --version 2>&1 | sed -n 's/BUSCO \([0-9.]*\)/\1/p')"

"$meta_executable" \
  --input "$test_dir/protein.fasta" \
  --mode protein \
  --lineage_dataset stramenopiles_odb10 \
  --output_dir output

echo ">> Checking output"
[ ! -f "output/short_summary.specific.stramenopiles_odb10.protein.fasta.json" ] && echo "specific_short_summary.json does not exist" && exit 1
[ ! -f "output/short_summary.specific.stramenopiles_odb10.protein.fasta.txt" ] && echo "specific_short_summary.txt does not exist" && exit 1
[ ! -f "output/run_stramenopiles_odb10/full_table.tsv" ] && echo "full_table.tsv does not exist" && exit 1
[ ! -f "output/run_stramenopiles_odb10/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv does not exist" && exit 1
[ ! -f "output/run_stramenopiles_odb10/short_summary.json" ] && echo "short_summary.json does not exist" && exit 1
[ ! -f "output/run_stramenopiles_odb10/short_summary.txt" ] && echo "short_summary.txt does not exist" && exit 1

echo ">> Checking if output is empty"
[ ! -s "output/short_summary.specific.stramenopiles_odb10.protein.fasta.json" ] && echo "specific_short_summary.json is empty" && exit 1
[ ! -s "output/short_summary.specific.stramenopiles_odb10.protein.fasta.txt" ] && echo "specific_short_summary.txt is empty" && exit 1
[ ! -s "output/run_stramenopiles_odb10/full_table.tsv" ] && echo "full_table.tsv is empty" && exit 1
[ ! -s "output/run_stramenopiles_odb10/missing_busco_list.tsv" ] && echo "missing_busco_list.tsv is empty" && exit 1
[ ! -s "output/run_stramenopiles_odb10/short_summary.json" ] && echo "short_summary.json is empty" && exit 1
[ ! -s "output/run_stramenopiles_odb10/short_summary.txt" ] && echo "short_summary.txt is empty" && exit 1


rm -r output/
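The existence and non-emptiness checks in this test repeat the same pattern for each file. One way to express that pattern is a small helper that loops over the expected paths. The sketch below uses a hypothetical `check_outputs` function and placeholder files in a temporary directory rather than real BUSCO output.

```bash
#!/bin/bash
# Hypothetical helper: reports the first path that is missing or empty and returns non-zero.
check_outputs() {
  local f
  for f in "$@"; do
    [ -f "$f" ] || { echo "$f does not exist"; return 1; }
    [ -s "$f" ] || { echo "$f is empty"; return 1; }
  done
}

# Demo with placeholder files (not real BUSCO results).
tmp_dir="$(mktemp -d)"
echo "dummy" > "$tmp_dir/short_summary.txt"
: > "$tmp_dir/full_table.tsv"   # deliberately empty

check_outputs "$tmp_dir/short_summary.txt" && echo "summary ok"
check_outputs "$tmp_dir/full_table.tsv" || echo "empty file detected"

rm -r "$tmp_dir"
```

In a test script the helper would typically be called once with all expected paths, followed by `|| exit 1`, so the first failing file stops the test with a message naming it.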