Develop a swarm-plugin for Qiime 2 #89
Sounds like a good idea!
Any update?
With swarm 3.0 fast approaching (#122), the increasing popularity of Exact Sequence Variants, and the publication of the Qiime 2 paper, this might be the perfect time to build a q2-swarm plugin. 🚀
@colinbrislawn you are right, but I don't really know where to begin. Would you help me kickstart that plugin?
Thanks @frederic-mahe! I'm honored you reached out to me, but I'm not sure where to begin either. I guess I would look to the q2-vsearch plugin as a template, then build from there. https://github.com/qiime2/q2-vsearch @thermokarst, could you make us an official q2-swarm repo and invite us as contributors?
Hey there @colinbrislawn! This plugin idea sounds really interesting, and good news, no need for us to make you a repo! Since QIIME 2 is decentralized, you can create the plugin wherever you want, then you can share it with users by registering it at the QIIME 2 Library! The Library entry can contain instructions letting users know how to get your plugin and install it. |
Summary of steps in Fred's metabarcoding pipeline, as I understand it, and what's already wrapped in Qiime2:
This is a fully featured pipeline that differs from what's already in Qiime2 in a number of ways, specifically the per-sample derep. One easy way forward is to make a q2-swarm plugin that replaces only the vsearch `cluster-features-de-novo` step. This is in contrast to the DADA2 plugin, which implements its full, unique SOP. Either way, adding
I should have pointed that out sooner; here is my current swarm-based pipeline. The way the pipeline is described (and the scripts numbered) might be confusing. The beginning is quite similar to the old pipeline you were referring to:
I realize that replicating the whole pipeline in Qiime2 might not be easy, so I agree we should aim for an easier first target.
In my own work, I only use:

```sh
list_local_clusters() {
    # retain only clusters with more than 2 reads
    # (do not use the fastidious option here)
    "${SWARM}" \
        --differences 1 \
        --threads "${THREADS}" \
        --usearch-abundance \
        --log /dev/null \
        --output-file /dev/null \
        --statistics-file - \
        "${SAMPLE}.fas" | \
        awk 'BEGIN {FS = OFS = "\t"} $2 > 2' > "${SAMPLE}.stats"
}

clustering() {
    # swarm 3 or more recent
    "${SWARM}" \
        --differences 1 \
        --fastidious \
        --usearch-abundance \
        --threads "${THREADS}" \
        --internal-structure "${OUTPUT_STRUCT}" \
        --output-file "${OUTPUT_SWARMS}" \
        --statistics-file "${OUTPUT_STATS}" \
        --seeds "${OUTPUT_REPRESENTATIVES}" \
        "${FINAL_FASTA}" 2> "${OUTPUT_LOG}"
}
```

The input file is a dereplicated fasta file with abundance annotations (
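For a future plugin, the awk filter in `list_local_clusters()` could be ported to Python. Here is a minimal sketch, assuming (as the awk program does) that swarm's statistics file is tab-separated with each cluster's total read abundance in the second column; the function name is hypothetical:

```python
def filter_clusters(stats_lines):
    """Keep statistics lines for clusters with more than 2 reads,
    mirroring `awk 'BEGIN {FS = OFS = "\\t"} $2 > 2'`."""
    return [line for line in stats_lines
            if int(line.rstrip("\n").split("\t")[1]) > 2]


# Example: a 120-read cluster is kept, a 2-read cluster is dropped.
stats = ["3\t120\tseed1", "1\t2\tseed2"]
kept = filter_clusters(stats)
```

In a plugin this would run over the statistics file produced by each per-sample swarm call, rather than over an in-memory list.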
Thank you, this is extremely helpful! I like the idea of starting small with the q2-swarm plugin.
Naturally! I'm not sure how best to track feature counts through per-sample derep and clustering. I understand what per-sample derep does and why it's faster to do this double-derep step. We could get counts for the feature table by remapping reads, like we did historically, but that loses the efficiency of the per-sample derep and ignores the internal structure of the swarms.
My pipeline must be confusing for anyone but me, sorry about that. The loop processes each pair of fastq files in 6 steps:
The double-derep step allows me to keep track of the origin of each unique sequence. The fasta files are parsed when building the final occurrence table.
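To illustrate how per-sample counts can survive the double-derep, here is a minimal Python sketch. It is only an illustration of the bookkeeping, not Fred's actual implementation (which parses the per-sample fasta files): each unique sequence maps to its per-sample abundances, which is exactly the information needed for the final occurrence table after clustering.

```python
from collections import defaultdict

def dereplicate_samples(samples):
    """samples: {sample_name: [read_sequence, ...]}.
    Returns {sequence: {sample_name: count}}, i.e. a global
    dereplication that still records each sequence's origin."""
    table = defaultdict(lambda: defaultdict(int))
    for sample, reads in samples.items():
        # per-sample dereplication: identical reads collapse to a count
        for seq in reads:
            table[seq][sample] += 1
    return table


# Example: "ACGT" occurs in both samples, "TTTT" only in s1.
table = dereplicate_samples({"s1": ["ACGT", "ACGT", "TTTT"],
                             "s2": ["ACGT"]})
```

Summing a sequence's inner dict gives its global abundance (the value swarm sees), while the inner dict itself provides the per-sample breakdown for the feature table.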
Qiime 2 now offers an interface for third-party plugins. Plugin creation does not seem complicated: the plugin is a Python 3 wrapper exposing some or all of swarm's functionality.
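As a starting point for such a wrapper, here is a hedged sketch of how a plugin might assemble the swarm command line before handing it to `subprocess.run`. The function name and signature are hypothetical; the flags are the ones used in the `clustering()` function earlier in this thread (note that swarm's `--fastidious` mode only applies with `--differences 1`):

```python
def swarm_command(input_fasta, output_swarms, seeds, statistics,
                  internal_structure, threads=1, differences=1,
                  fastidious=True):
    """Build the argument list for a swarm invocation; a plugin
    would pass this list to subprocess.run()."""
    cmd = ["swarm", "--differences", str(differences)]
    if fastidious:
        # fastidious clustering is only valid with d = 1
        cmd.append("--fastidious")
    cmd += ["--usearch-abundance",
            "--threads", str(threads),
            "--internal-structure", internal_structure,
            "--output-file", output_swarms,
            "--statistics-file", statistics,
            "--seeds", seeds,
            input_fasta]
    return cmd


# Example invocation mirroring clustering():
cmd = swarm_command("final.fas", "out.swarms", "reps.fas",
                    "out.stats", "struct.txt", threads=4)
```

Keeping command construction in a pure function like this makes the wrapper easy to unit-test without running swarm itself.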