Develop a swarm-plugin for Qiime 2 #89

Open
frederic-mahe opened this issue Oct 17, 2016 · 10 comments

@frederic-mahe
Collaborator

frederic-mahe commented Oct 17, 2016

Qiime 2 now offers an interface for third-party plugins. Creating a plugin does not seem complicated: the plugin is a Python 3 wrapper exposing some or all of swarm's functionality.

@torognes
Owner

Sounds like a good idea!

@stheil15

Any update?

@colinbrislawn

With swarm 3.0 fast approaching (#122), the increasing popularity of Exact Sequence Variants, and the publication of the Qiime 2 paper, this might be the perfect time to build a q2-swarm plugin.

🚀
Colin

@frederic-mahe
Collaborator Author

@colinbrislawn you are right, but I don't really know where to begin. Would you help me kickstart that plugin?

@colinbrislawn

Thanks @frederic-mahe! I'm honored you reached out to me, but I'm not sure where to begin either. I guess I would look to the q2-vsearch plugin as a template, then build from there. https://github.com/qiime2/q2-vsearch

@thermokarst, could you make us an official q2-swarm repo and invite us as contributors?

@thermokarst

Hey there @colinbrislawn! This plugin idea sounds really interesting, and good news, no need for us to make you a repo! Since QIIME 2 is decentralized, you can create the plugin wherever you want, then you can share it with users by registering it at the QIIME 2 Library! The Library entry can contain instructions letting users know how to get your plugin and install it.

@colinbrislawn

Summary of steps in Fred's-metabarcoding-pipeline, as I understand it, and what's already wrapped in Qiime2:

| program | idea | existing q2 plugin | what |
|---|---|---|---|
| qiime tools import | | | |
| cutadapt | | q2-cutadapt | |
| vsearch | fastq_mergepairs | q2-vsearch | |
| vsearch | fastq_filter | extend q2-vsearch | add --fastq_filter |
| vsearch | per-sample derep | extend q2-vsearch | |
| sed | per-read quality | none | the lowest expected error rate observed for each unique sequence |
| vsearch | global derep | q2-vsearch | |
| swarm | make ASVs! | none | then sort with vsearch |
| vsearch | uchime_denovo | q2-vsearch | |
| OTU_contingency_table.py | make feature table | | |

This is a fully featured pipeline that differs from what's already in Qiime2 in a number of ways, specifically the per-sample derep...

One easy way forward is to make a q2-swarm plugin that replaces only the vsearch cluster-features-de-novo.

This is in contrast to the DADA2 plugin that implements its full, unique SOP.
This may be more powerful; per-sample derep and per-feature quality are interesting ideas!

Either way, adding --fastq_filter to q2-vsearch seems like a natural first step.

@frederic-mahe
Collaborator Author

I should have pointed this out sooner: here is my current swarm-based pipeline.

The way the pipeline is described (and scripts numbered) might be confusing. The beginning is quite similar to the old pipeline you were referring to: --fastq_filter is indeed required.

> One easy way forward is to make a q2-swarm plugin that replaces only the vsearch cluster-features-de-novo.

I realize that replicating the whole pipeline in Qiime2 might not be easy, so I agree we should aim for an easier first target.

swarm has three major modes: --differences 0 (dereplication), --differences 1 (fast, high-resolution clustering), and --differences 2+ (slower, lower-resolution clustering).

In my own work, I only use --differences 1, with the --fastidious option when clustering the whole project (clustering()), or without the --fastidious option when working at the sample level (list_local_clusters()).

list_local_clusters() {
    # sample-level clustering: retain only clusters with more than 2 reads
    # (do not use the fastidious option here)
    # statistics are written to stdout (-) and filtered on column 2 by awk
    "${SWARM}" \
        --differences 1 \
        --threads "${THREADS}" \
        --usearch-abundance \
        --log /dev/null \
        --output-file /dev/null \
        --statistics-file - \
        "${SAMPLE}.fas" | \
        awk 'BEGIN {FS = OFS = "\t"} $2 > 2' > "${SAMPLE}.stats"
}

clustering() {
    # project-level clustering (swarm 3 or more recent)
    "${SWARM}" \
        --differences 1 \
        --fastidious \
        --usearch-abundance \
        --threads "${THREADS}" \
        --internal-structure "${OUTPUT_STRUCT}" \
        --output-file "${OUTPUT_SWARMS}" \
        --statistics-file "${OUTPUT_STATS}" \
        --seeds "${OUTPUT_REPRESENTATIVES}" \
        "${FINAL_FASTA}" 2> "${OUTPUT_LOG}"
}
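
For a Python 3 wrapper, the two shell functions above might translate into argument lists along these lines. This is only a sketch: `build_swarm_command`, its parameter names and the file names are hypothetical, and it uses only the swarm options already shown in the shell functions (`--log` is left out for brevity). Nothing is executed here.

```python
from typing import List, Optional


def build_swarm_command(
    fasta: str,
    threads: int = 1,
    fastidious: bool = False,
    output_file: str = "/dev/null",
    statistics_file: Optional[str] = None,
    seeds: Optional[str] = None,
    internal_structure: Optional[str] = None,
    swarm: str = "swarm",
) -> List[str]:
    """Build a swarm argument list (hypothetical helper, --differences 1 only)."""
    cmd = [swarm, "--differences", "1", "--usearch-abundance",
           "--threads", str(threads), "--output-file", output_file]
    if fastidious:
        cmd.append("--fastidious")
    optional = (
        ("--statistics-file", statistics_file),
        ("--seeds", seeds),
        ("--internal-structure", internal_structure),
    )
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    cmd.append(fasta)  # input file goes last, as in the shell functions
    return cmd


# Project-level clustering, mirroring clustering() (file names are made up)
global_cmd = build_swarm_command(
    "final.fas", threads=4, fastidious=True,
    output_file="final.swarms", statistics_file="final.stats",
    seeds="final_representatives.fas", internal_structure="final.struct",
)

# Sample-level clustering, mirroring list_local_clusters()
local_cmd = build_swarm_command("sample.fas", threads=4, statistics_file="-")
```

Such a list could then be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.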

The input file is a dereplicated fasta file with abundance annotations (;size=123[;]) (option --usearch-abundance), and only ACGT nucleotides. The command line options listed in these shell functions are the most relevant, at least in my opinion.
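
A wrapper could check these input requirements before calling swarm. The sketch below is an assumption about how that might look, not part of swarm or Qiime 2: `read_abundances` is a hypothetical name, and treating lowercase nucleotides as valid is my own choice.

```python
import re
from typing import Dict, Iterable

# Abundance annotation at the end of the header: ;size=123 or ;size=123;
SIZE_RE = re.compile(r";size=(\d+);?$")
ACGT_RE = re.compile(r"[ACGT]+")


def read_abundances(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {identifier: abundance}; raise ValueError on invalid input."""
    abundances: Dict[str, int] = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = SIZE_RE.search(line)
            if match is None:
                raise ValueError(f"missing ;size= annotation: {line}")
            name = line[1:match.start()]
            abundances[name] = int(match.group(1))
        else:
            if name is None:
                raise ValueError("sequence found before any header")
            # Assumption: case is ignored; anything other than ACGT is rejected
            if not ACGT_RE.fullmatch(line.upper()):
                raise ValueError(f"non-ACGT symbol in sequence of {name}")
    return abundances


example = [">a;size=3;", "ACGT", ">b;size=1", "acgt"]
abundances = read_abundances(example)
```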

@frederic-mahe frederic-mahe added this to the swarm 4.0 milestone Apr 18, 2024
@colinbrislawn

Thank you, this is extremely helpful!

I like the idea of starting small with the q2-swarm plugin.

> I only use --differences 1, with the --fastidious option when clustering the whole project (clustering())...

Naturally!

I'm not sure how best to track feature counts through per-sample derep and clustering.

I understand what per-sample derep does and why it's faster to do this double-derep step.
For some reason, I can't wrap my head around what q2-type it should be. Is this just another feature table?
(I also had trouble with this last time.)

We could get counts for the feature table by remapping reads, like we did historically, but that loses the efficiency of the per-sample derep and ignores the internal structure of the swarms.

@frederic-mahe
Collaborator Author

> I understand what per-sample derep does and why it's faster to do this double-derep step.
> For some reason, I can't wrap my head around what q2-type it should be. Is this just another feature table?

My pipeline must be confusing for anyone but me, sorry about that. The loop processes each pair of fastq files in 6 steps:

  • merge_fastq_pair (R1-R2 merging with vsearch)
  • trim_primers (cutadapt)
  • convert_fastq_to_fasta (vsearch)
  • extract_expected_error_values
  • dereplicate_fasta (dereplication with vsearch)
  • list_local_clusters (local clustering with swarm)

The double-derep step allows me to keep track of the origin of each unique sequence. The fasta files are parsed when building the final occurrence table.
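
As a rough illustration of how per-sample dereplicated fasta files can yield an occurrence table, here is a minimal sketch. It assumes that identical sequences carry the same identifier in every sample (for example a hash of the sequence); `add_sample` and the sample names are hypothetical.

```python
import re
from collections import defaultdict
from typing import DefaultDict, Iterable

# Header of a dereplicated entry: >identifier;size=123 or >identifier;size=123;
HEADER_RE = re.compile(r"^>(.+?);size=(\d+);?$")

Table = DefaultDict[str, DefaultDict[str, int]]


def add_sample(table: Table, sample: str, fasta_lines: Iterable[str]) -> None:
    """Add the abundances found in one sample's fasta headers to the table."""
    for line in fasta_lines:
        match = HEADER_RE.match(line.strip())
        if match:  # sequence lines do not match and are skipped
            table[match.group(1)][sample] += int(match.group(2))


# table[sequence identifier][sample] = number of reads
table: Table = defaultdict(lambda: defaultdict(int))
add_sample(table, "S1", [">seq1;size=10", "ACGT", ">seq2;size=2", "ACGA"])
add_sample(table, "S2", [">seq1;size=4;", "ACGT"])
```

In this form the result is close to a feature table keyed by unique sequence, before cluster membership is applied.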

extract_expected_error_values produces a table containing the best (lowest) expected error (EE) observed for each unique sequence. This table is also used when building the occurrence table, to filter out low-quality observations (a cluster whose seed sequence has a high EE is discarded). The goal is to delay quality-based filtering until after the clustering step.
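
The idea of keeping the lowest expected error per unique sequence, then filtering clusters on their seed, could be sketched as follows. The function names and the `max_ee` threshold are hypothetical; the actual cutoff used in the pipeline is not stated here.

```python
from typing import Dict, Iterable, List, Tuple


def lowest_expected_errors(
    observations: Iterable[Tuple[str, float]],
) -> Dict[str, float]:
    """Keep the lowest expected error seen for each sequence identifier."""
    best: Dict[str, float] = {}
    for seq_id, ee in observations:
        if seq_id not in best or ee < best[seq_id]:
            best[seq_id] = ee
    return best


def keep_clusters(
    seeds: Iterable[str], best_ee: Dict[str, float], max_ee: float
) -> List[str]:
    """Discard clusters whose seed has no EE value or an EE above max_ee."""
    return [s for s in seeds if best_ee.get(s, float("inf")) <= max_ee]


# seq1 is observed twice (e.g. in two samples); its best EE is kept
best = lowest_expected_errors([("seq1", 0.8), ("seq1", 0.1), ("seq2", 2.5)])
kept = keep_clusters(["seq1", "seq2"], best, max_ee=1.0)  # arbitrary threshold
```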

list_local_clusters produces a table. The goal is to create a list of cluster seeds for each sample. This list is used to cleave clusters into subclusters when the subclusters display different patterns of distribution. In practice, it makes it possible to distinguish entities that differ by a single nucleotide, but only when there is an ecological signal.
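
Collecting the local seeds from a statistics table might look like the sketch below, which mirrors the `awk` filter (`$2 > 2`) in list_local_clusters. It assumes tab-separated rows with the cluster abundance in column 2, as the awk filter implies, and the seed identifier in column 3; the latter is my reading of the swarm manual and should be checked. `local_cluster_seeds` is a hypothetical name.

```python
from typing import Iterable, List


def local_cluster_seeds(stats_lines: Iterable[str], min_reads: int = 2) -> List[str]:
    """Return seeds of clusters with more than min_reads reads."""
    seeds: List[str] = []
    for line in stats_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip empty or malformed rows
        # Assumed layout: column 2 = cluster abundance, column 3 = seed
        if int(fields[1]) > min_reads:
            seeds.append(fields[2])
    return seeds


stats = [
    "3\t25\tseq1\t20\t1\t1\t2\n",
    "1\t2\tseq9\t2\t0\t0\t0\n",
]
seeds = local_cluster_seeds(stats)
```

The cleaving step itself, which compares distribution patterns across samples, is not shown.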
