add checkm2 #6542

Open: wants to merge 22 commits into base: main. Showing changes from 6 commits.
9 changes: 9 additions & 0 deletions tools/checkm2/.shed.yml
@@ -0,0 +1,9 @@
name: checkm2
owner: iuc
description: Rapid assessment of genome bin quality using machine learning
long_description: Enhanced version of CheckM that uses machine learning models for greater speed and accuracy
homepage_url: https://github.com/chklovski/CheckM2
remote_repository_url: https://github.com/galaxyproject/tools-iuc/
categories:
- Metagenomics
type: unrestricted
10 changes: 10 additions & 0 deletions tools/checkm2/README.md
@@ -0,0 +1,10 @@
For server admins:

The databases for sylph have associated metadata files. Each metadata file MUST be paired with its matching database, otherwise the output will be incorrect. The databases and metadata files can be downloaded from:

- Databases: https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases
- Metadata: https://github.com/bluenote-1577/sylph-utils

The tool assumes that the directory referenced by the data table has the following layout:

- `<name_of_organism>/`
  - `database.syldb`
  - `metadata.tsv.gz`
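
A minimal sketch of how an admin might check a directory against the layout listed above before registering it in the data table. The function name `missing_files` is hypothetical and is not part of this tool or of Galaxy; only the two file names come from this README.

```python
from pathlib import Path

# File names taken from the directory layout described in this README.
REQUIRED_FILES = ("database.syldb", "metadata.tsv.gz")


def missing_files(organism_dir):
    """Return the required file names that are absent from organism_dir.

    Hypothetical helper for illustration only; an empty list means the
    directory matches the expected layout.
    """
    root = Path(organism_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling `missing_files("/path/to/<name_of_organism>")` returns an empty list when both files are present, so it can be used as a quick pre-flight check.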
126 changes: 126 additions & 0 deletions tools/checkm2/checkm2.xml
@@ -0,0 +1,126 @@
<tool id="checkm2" name="checkm2" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@">
<description>Rapid assessment of genome bin quality using machine learning</description>
<macros>
<token name="@TOOL_VERSION@">1.0.2</token>
<token name="@VERSION_SUFFIX@">1</token>
</macros>
<xrefs>
<xref type="bio.tools">checkm2</xref>
</xrefs>
<requirements>
<requirement type="package" version="@TOOL_VERSION@">checkm2</requirement>
</requirements>
<command detect_errors="exit_code"><![CDATA[
mkdir input_dir &&
#for $file in $input:
ln -s '$file' 'input_dir/${file.element_identifier}.dat' &&
#end for
checkm2 predict
--input input_dir
$model
$genes
#if $ttable_manual.set_ttable == "yes":
--ttable $ttable_manual.ttable
#end if
-x .dat
--threads "\${GALAXY_SLOTS:-1}"
--database_path "\${CHECKM2_DB_PATH:-$__tool_directory__/tool-data/CheckM2_database/uniref100.KO.1.dmnd}"
--output-directory output
]]></command>
<inputs>
<param name="input" type="data" format="fasta" label="Input MAG/SAG datasets" multiple="true"/>
<param argument="--genes" type="boolean" truevalue="--genes" falsevalue="" label="Treat input files as protein files"/>
<param name="model" type="select" label="Model options">
<option value="">None</option>
<option value="--general">Force the use of the general quality prediction model (gradient boost)</option>
<option value="--specific">Force the use of the specific quality prediction model (neural network)</option>
<option value="--allmodels">Output quality prediction for both models for each genome.</option>
</param>
<conditional name="ttable_manual">
<param name="set_ttable" type="select" label="Manually set a specific Prodigal translation table?" help="If not set, the tool automatically chooses translation table 11 or 4.">
<option value="no">No</option>
<option value="yes">Yes</option>
</param>
<when value="no"/>
<when value="yes">
<!-- The valid table IDs are not a contiguous range and CheckM2 checks the value against a fixed list, so every option is spelled out -->
<param argument="--ttable" type="select" label="Prodigal translation table">
<option value="1">1</option>
<option value="2">2</option>
<option value="3">3</option>
<option value="4">4</option>
<option value="5">5</option>
<option value="6">6</option>
<option value="9">9</option>
<option value="10">10</option>
<option value="11">11</option>
<option value="12">12</option>
<option value="13">13</option>
<option value="14">14</option>
<option value="16">16</option>
<option value="21">21</option>
<option value="22">22</option>
<option value="23">23</option>
<option value="24">24</option>
<option value="25">25</option>
<option value="26">26</option>
<option value="27">27</option>
<option value="28">28</option>
<option value="29">29</option>
<option value="30">30</option>
<option value="31">31</option>
<option value="33">33</option>
</param>
</when>
</conditional>
</inputs>
<outputs>
<data name="quality" label="${tool.name} on ${on_string}: Quality report" format="tabular" from_work_dir="output/quality_report.tsv"/>
<collection name="protein_files" label="${tool.name} on ${on_string}: protein files" type="list">
<discover_datasets pattern="__name__" format="fasta" directory="output/protein_files"/>
</collection>
<collection name="diamond_files" label="${tool.name} on ${on_string}: Diamond files" type="list">
<discover_datasets pattern="__name__" ext="tsv" directory="output/diamond_output"/>
</collection>
</outputs>
<tests>
<!-- These tests cannot run without a multi-GB database, so they are expected to fail; only the generated command line is checked. See the README for details. -->
<test expect_exit_code="1" expect_failure="true">
<param name="input" value="test1.faa,test2.faa"/>
<param name="model" value="--allmodels"/>
<param name="genes" value="true"/>
<conditional name="ttable_manual">
<param name="set_ttable" value="yes"/>
<param name="ttable" value="13"/>
</conditional>
<assert_command>
<has_text text="checkm2 predict --input input_dir"/>
<has_text text="--allmodels --genes --ttable 13 -x .dat"/>
<has_text text="--output-directory output"/>
</assert_command>
</test>
<test expect_exit_code="1" expect_failure="true">
<param name="input" value="test1.tst,test2.tst"/>
<param name="model" value="--specific"/>
<assert_command>
<has_text text="checkm2 predict --input input_dir"/>
<has_text text="--specific -x .dat"/>
<has_text text="--output-directory output"/>
</assert_command>
</test>
</tests>
<help><![CDATA[
Unlike CheckM1, CheckM2 applies universally trained machine learning models, regardless of taxonomic lineage, to predict the completeness and contamination of genomic bins.
This lets its training set include many lineages that have few, or even just one, high-quality genomic representatives, by placing them in the context of all other organisms in the training set.
Because of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.

CheckM2 uses two distinct machine learning models to predict genome completeness. The 'general' gradient boost model is able to generalize well and is intended to be used on organisms not well
represented in GenBank or RefSeq (roughly, when an organism is novel at the level of order, class or phylum). The 'specific' neural network model is more accurate when predicting completeness
of organisms more closely related to the reference training set (roughly, when an organism belongs to a known species, genus or family). CheckM2 uses a cosine similarity calculation to automatically
determine the appropriate completeness model for each input genome, but you can also force the use of a particular completeness model, or get the prediction outputs for both. There is only one contamination
model (based on gradient boost) which is applied regardless of taxonomic novelty and works well across all cases.
]]></help>
<citations>
<citation type="doi">10.1038/s41592-023-01940-w</citation>
</citations>
</tool>
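
The help text above says CheckM2 uses a cosine similarity calculation to pick a completeness model per genome. The sketch below only illustrates that general idea and is not CheckM2's actual implementation: the feature vectors, the `choose_model` helper, and the `0.8` threshold are all assumptions made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def choose_model(genome_vec, reference_vecs, threshold=0.8):
    """Pick 'specific' when the genome resembles a reference, else 'general'.

    Hypothetical helper: the threshold value and the use of the single best
    match are illustrative assumptions, not taken from CheckM2.
    """
    best = max(
        (cosine_similarity(genome_vec, ref) for ref in reference_vecs),
        default=0.0,
    )
    return "specific" if best >= threshold else "general"
```

In the tool wrapper, the `--general`, `--specific`, and `--allmodels` options bypass this kind of automatic choice; leaving the model option at `None` keeps CheckM2's own selection in effect.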