Add Fairy #6647
base: main
Changes from 4 commits
2617364
f93520a
9071a35
6a9629b
fecf9b3
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
name: Fairy | ||
owner: iuc | ||
description: Fast approximate contig coverage for metagenomic binning | ||
homepage_url: https://github.com/bluenote-1577/fairy | ||
long_description: | | ||
Fairy is a software package, written in Rust, which creates coverage files | ||
for metagenomic binning. This tool can create coverage files 100x-1000x | ||
faster than read alignment. | ||
remote_repository_url: https://github.com/bluenote-1577/fairy | ||
type: unrestricted | ||
categories: | ||
- Metagenomics |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
<tool id="fairy_cov" name="Fairy coverage" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@"> | ||
<description>Create coverage file for specific binners</description> | ||
<macros> | ||
<import>macros.xml</import> | ||
</macros> | ||
<expand macro="requirements"/> | ||
<command detect_errors="exit_code"> | ||
<![CDATA[ | ||
|
||
ln -s '$contig' '$contig.element_identifier' && | ||
ln -s '$bcsp_file' '$bcsp_file.element_identifier' && | ||
|
||
fairy coverage | ||
'$contig.element_identifier' | ||
'$bcsp_file.element_identifier' | ||
-t \${GALAXY_SLOTS:-8} | ||
Review comment: Double quotes around bash variables are often a good idea.
Reply: Should be done, if I understood it correctly. |
||
-m ${minimum_ani} | ||
-M ${min_number_kmers} | ||
-c ${c} | ||
-k ${k.value} | ||
Review comment: This just
Reply: Changed it. |
||
--min-spacing ${min_spacing} | ||
${full_contig_name} | ||
#if $output_type.value == 'semi': | ||
Review comment: Also here no You could simplify this by using the command-line parameter as the value in the options and then just use
Reply: Changed it; there is no need for the value. I just forgot to remove it, since it worked with it. |
||
--aemb-format | ||
#end if | ||
#if $output_type.value == 'max': | ||
--maxbin-format | ||
#end if | ||
-o '$output' | ||
|
||
]]> | ||
</command> | ||
<inputs> | ||
<param name="contig" type="data" format="fasta,fasta.gz" label="Input fasta contig file" help="Input the raw FASTA contig file. It may be gzip-compressed."/> | ||
<param name="bcsp_file" type="data" format="bcsp" label="Input the pre-sketched file (.bcsp file)" help="This file will be generated with the fairy sketch tool."/> | ||
<param argument="--minimum-ani" type="integer" optional="true" min="0" max="100" value="95" label="Set minimum ANI" help="Set the minimum adjusted ANI for the coverage calculation. Caution: only adjust this if you know what it does."/> | ||
<param argument="--min-number-kmers" type="integer" value="8" optional="true" label="Genome filter" help="Exclude genomes with fewer than this number of sampled k-mers."/> | ||
Review comment: I don't understand this parameter. With the default, are sequences with fewer than 8 k-mers removed?
Reply: I rewrote this help text; maybe it is easier to understand now. But yes, this parameter is a filter that removes genomes with fewer k-mers than its value.
Review comment: I still do not understand this parameter. What does it mean if a genome has x k-mers?
Reply: It means that if a genome has fewer than x (the value of the parameter) k-mers, it is excluded from the algorithm and should not appear in the coverage file generated by Fairy. Is this clearer? Sorry for the misunderstanding.
Review comment: Unfortunately no. I just do not understand what it means that a genome has a k-mer. A genome of length n has n-k+1 k-mers. There must be something special about the x k-mers the help text is talking about.
Reply: This x is a lower bound: genomes with fewer than x k-mers are filtered out and do not appear in the coverage file. For example, suppose there are 3 genomes in a contig. The first has 5 k-mers, the second has 10 k-mers, and the last has 8 k-mers. If we set this parameter to 7, the first genome with 5 k-mers is not included, since it has fewer than 7 k-mers. If we set it to 2, no genome is excluded, since all 3 have more than 2 k-mers. Is this explanation better? |
||
<param argument="-c" type="integer" value="50" optional="true" label="Set subsampling rate" help="This value does not interact with the .bcsp file which was used as input."/> | ||
<param argument="-k" type="select" label="Select k-mer size" help="This value does not interact with the .bcsp file which was used as input."> | ||
<option value="31">31</option> | ||
<option value="21">21</option> | ||
</param> | ||
<param argument="--min-spacing" type="integer" value="30" label="Set spacing between k-mers" help="Minimum spacing between selected k-mers on the contigs."/> | ||
<param argument="--full-contig-name" type="boolean" falsevalue="" truevalue="--full-contig-name" label="Full contig name"/> | ||
<param name="output_type" type="select" label="Select for which binner the output should be generated"> | ||
<option value="meta">MetaBAT2</option> | ||
<option value="semi">SemiBin2</option> | ||
<option value="max">MaxBin2</option> | ||
</param> | ||
</inputs> | ||
<outputs> | ||
<data name="output" format="tabular" label="${tool.name} on ${on_string}"/> | ||
</outputs> | ||
<tests> | ||
<test> | ||
<param name="contig" value="single_test.fasta.gz" ftype="fasta.gz"/> | ||
<param name="bcsp_file" value="single_test.fasta.gz.bcsp" ftype="bcsp"/> | ||
<output name="output" value="normal_test.tsv"/> | ||
</test> | ||
<test> | ||
<param name="contig" value="single_test.fasta.gz" ftype="fasta.gz"/> | ||
<param name="bcsp_file" value="single_test.fasta.gz.bcsp" ftype="bcsp"/> | ||
<param name="minimum-ani" value="99"/> | ||
<param name="min-number-kmers" value="2"/> | ||
<param name="full-contig-name" value="true"/> | ||
<param name="output_type" value="semi"/> | ||
<output name="output" value="test_2.tsv"/> | ||
</test> | ||
<test> | ||
<param name="contig" value="single_test.fasta.gz" ftype="fasta.gz"/> | ||
<param name="bcsp_file" value="single_test.fasta.gz.bcsp" ftype="bcsp"/> | ||
<param name="k" value="21"/> | ||
<param name="c" value="45"/> | ||
<param name="min-spacing" value="10"/> | ||
<param name="output_type" value="max"/> | ||
<output name="output" value="test_3.tsv"/> | ||
</test> | ||
</tests> | ||
<help> | ||
<![CDATA[ | ||
|
||
Fairy computes multi-sample contig coverage for metagenome-assembled genome (MAG) binning. | ||
|
||
Fairy is used after metagenomic assembly and before binning. It can | ||
|
||
- Calculate coverage 100x-1000x faster than read alignment (e.g. BWA) | ||
- Give comparable bins for multi-sample binning (short read or nanopore reads) | ||
- Output formats that are compatible with MetaBAT2, MaxBin2, SemiBin2, and more | ||
|
||
Caveats: | ||
|
||
- Don't use fairy for single-sample binning | ||
- Don't use fairy for PacBio HiFi | ||
|
||
For more information visit `the wiki site on GitHub <https://github.com/bluenote-1577/fairy/wiki/Introduction-to-fairy>`_. | ||
|
||
.. class:: infomark | ||
|
||
Fairy usage for SemiBin2 is different than other tools: SemiBin2 requires separate coverage files for each read sample -- other tools require a single coverage matrix. | ||
|
||
.. class:: infomark | ||
|
||
The default output format from Fairy is the MetaBAT2 format. Any tool that uses this format, or the format of one of the other two binners, also works with Fairy's coverage files. | ||
]]> | ||
</help> | ||
<expand macro="citations"/> | ||
</tool> |
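The review suggestion on the `output_type` conditionals (use the command-line parameter as the option value) could look roughly like the sketch below. It is not the final wrapper: it only reuses the flags `--aemb-format` and `--maxbin-format` from the diff above, assumes that an empty value leaves Fairy on its default MetaBAT2 format, and leaves out the symlink steps and the other options for brevity.

```xml
<!-- Sketch only: each option value is the command-line flag itself. -->
<param name="output_type" type="select" label="Select for which binner the output should be generated">
    <!-- Assumption: an empty value keeps the default MetaBAT2 format. -->
    <option value="" selected="true">MetaBAT2</option>
    <option value="--aemb-format">SemiBin2</option>
    <option value="--maxbin-format">MaxBin2</option>
</param>

<!-- The command block then needs no #if branches for the format. -->
<command detect_errors="exit_code"><![CDATA[
    fairy coverage
        '$contig.element_identifier'
        '$bcsp_file.element_identifier'
        ## Expands to the selected flag, or to nothing for MetaBAT2.
        $output_type
        -o '$output'
]]></command>
```

With this layout the tests would have to pass the flag as the parameter value (for example `value="--aemb-format"` instead of `value="semi"`).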
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
<tool id="fairy_sketch" name="Fairy sketch" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@"> | ||
<description>sketch k-mers into a hash table for coverage calculation</description> | ||
<macros> | ||
<import>macros.xml</import> | ||
</macros> | ||
<expand macro="requirements"/> | ||
<command detect_errors="exit_code"> | ||
<![CDATA[ | ||
|
||
mkdir -p res && | ||
|
||
#if $input.is_select == "single": | ||
#set $filename = $reads.element_identifier + '.bcsp' | ||
ln -s '$reads' '$reads.element_identifier' && | ||
#else | ||
#set $filename = $first_pairs.element_identifier + '.paired.bcsp' | ||
ln -s '$first_pairs' '$first_pairs.element_identifier' && | ||
ln -s '$second_pairs' '$second_pairs.element_identifier' && | ||
#end if | ||
|
||
fairy sketch | ||
-t \${GALAXY_SLOTS:-8} | ||
-c ${c} | ||
-k ${k.value} | ||
-d 'res' | ||
#if $input.is_select == "single": | ||
-r '$reads.element_identifier' | ||
#else | ||
-1 '$first_pairs.element_identifier' | ||
-2 '$second_pairs.element_identifier' | ||
#end if | ||
&& | ||
|
||
cp './res/${filename}' $output | ||
|
||
]]> | ||
</command> | ||
<inputs> | ||
<conditional name="input"> | ||
<param name="is_select" type="select" label="Single or paired-end reads"> | ||
<option value="single">Single</option> | ||
<option value="pair">Paired</option> | ||
</param> | ||
<when value="single"> | ||
<param argument="--reads" type="data" format="fastqsanger,fasta,fastq,fasta.gz,fastq.gz" label="Input single-end reads"/> | ||
</when> | ||
<when value="pair"> | ||
<param argument="--first_pairs" type="data" format="fastqsanger,fasta,fastq,fasta.gz,fastq.gz" label="Input first paired-end reads"/> | ||
Review comment: I would prefer input as a paired collection; at least add it as an option.
Reply: Added this option with a test! |
||
<param argument="--second_pairs" type="data" format="fastqsanger,fasta,fastq,fasta.gz,fastq.gz" label="Input second paired-end reads"/> | ||
</when> | ||
</conditional> | ||
<param argument="-c" type="integer" value="50" optional="true" label="Set the subsampling rate"/> | ||
<param argument="-k" type="select" label="Select k-mer size"> | ||
<option value="31">31</option> | ||
<option value="21">21</option> | ||
</param> | ||
</inputs> | ||
<outputs> | ||
<data name="output" format="bcsp" label="${tool.name} on ${on_string}"/> | ||
</outputs> | ||
<tests> | ||
<test> | ||
<conditional name="input"> | ||
<param name="is_select" value="single"/> | ||
<param name="reads" value="single_test.fasta.gz" ftype="fasta.gz"/> | ||
</conditional> | ||
<output name="output" file="single_test.fasta.gz.bcsp"/> | ||
</test> | ||
<test> | ||
<conditional name="input"> | ||
<param name="is_select" value="pair"/> | ||
<param name="first_pairs" value="test_paired_1.fq.gz" ftype="fastq.gz"/> | ||
<param name="second_pairs" value="test_paired_2.fq.gz" ftype="fastq.gz"/> | ||
</conditional> | ||
<output name="output" file="test_paired_1.fq.gz.paired.bcsp"/> | ||
</test> | ||
</tests> | ||
<help> | ||
<![CDATA[ | ||
|
||
This tool sketches the k-mers of the reads into a hash table, which the fairy coverage tool needs to create the coverage file. | ||
|
||
.. class:: infomark | ||
|
||
This tool can either use single-end or paired-end reads as input in multiple file formats. | ||
|
||
Fairy currently supports only k-mer sizes of 21 and 31. | ||
|
||
]]> | ||
</help> | ||
<expand macro="citations"/> | ||
</tool> |
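The paired-collection input requested in the review could be added along the lines of the fragments below. This is a hedged sketch rather than the code that was eventually committed: the option value `paired_collection` and the parameter name `reads_pair` are hypothetical, and the derivation of `$filename` for the output copy step is not covered.

```xml
<!-- Sketch only: "paired_collection" and "reads_pair" are hypothetical names. -->

<!-- New option inside the existing is_select param -->
<option value="paired_collection">Paired collection</option>

<!-- New branch inside the existing conditional -->
<when value="paired_collection">
    <param name="reads_pair" type="data_collection" collection_type="paired" format="fastqsanger,fasta,fastq,fasta.gz,fastq.gz" label="Input paired-end reads (collection)"/>
</when>

<!-- Matching branch in the command block -->
<command detect_errors="exit_code"><![CDATA[
    #if $input.is_select == "paired_collection":
        -1 '$input.reads_pair.forward'
        -2 '$input.reads_pair.reverse'
    #end if
]]></command>

<!-- Matching test input, reusing the test files from the diff above -->
<conditional name="input">
    <param name="is_select" value="paired_collection"/>
    <param name="reads_pair">
        <collection type="paired">
            <element name="forward" value="test_paired_1.fq.gz" ftype="fastq.gz"/>
            <element name="reverse" value="test_paired_2.fq.gz" ftype="fastq.gz"/>
        </collection>
    </param>
</conditional>
```

A paired collection exposes its two members as `forward` and `reverse`, so the user picks one collection instead of two separate datasets.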
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
<macros> | ||
<xml name="requirements"> | ||
<requirements> | ||
<requirement type="package" version="@TOOL_VERSION@">fairy</requirement> | ||
<yield/> | ||
</requirements> | ||
</xml> | ||
<token name="@TOOL_VERSION@">0.5.7</token> | ||
<token name="@VERSION_SUFFIX@">0</token> | ||
<token name="@PROFILE@">24.1</token> | ||
<xml name="citations"> | ||
<citations> | ||
<citation type="doi">10.1101/2024.04.23.590803</citation> | ||
<yield/> | ||
</citations> | ||
</xml> | ||
</macros> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
contigName contigLen totalAvgDepth single_test.fasta.gz single_test.fasta.gz-var | ||
NZ_CP017438.1 3123040 0.05509718146076748 0.05509718146076748 0.014618167653679848 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
NZ_CP017438.1 0.05509718146076748 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
contigName | ||
NZ_CP017438.1 |
Review comment: It's not a good idea to just use the element identifiers, since they might contain dangerous or forbidden characters. We usually solve this like so:
tools-iuc/tools/coverm/macros.xml
Line 55 in 295026b
Reply: Changed it in both wrappers.
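The referenced coverm line is not reproduced here. As a rough illustration of the general idea, a common approach is to replace every character outside a small allow-list before using the identifier as a file name. The variable names `$safe_contig` and `$safe_bcsp` and the exact character class are assumptions for this sketch, and the remaining `fairy coverage` options are left out.

```xml
<!-- Sketch only: sanitize element identifiers before using them as file names. -->
<command detect_errors="exit_code"><![CDATA[
    #import re
    ## Keep letters, digits, "_", "-" and "." and replace everything else with "_".
    #set $safe_contig = re.sub('[^\w\-\.]', '_', str($contig.element_identifier))
    #set $safe_bcsp = re.sub('[^\w\-\.]', '_', str($bcsp_file.element_identifier))

    ln -s '$contig' '$safe_contig' &&
    ln -s '$bcsp_file' '$safe_bcsp' &&

    fairy coverage
        '$safe_contig'
        '$safe_bcsp'
        -o '$output'
]]></command>
```

Keeping "." in the allow-list preserves file extensions such as `.gz` and `.bcsp` in the link names.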