Initial docs update before v5

Remove tutorial - Vamb's Python interface has been internal since v4, and I really don't want to keep it up to date. That takes too much time.

Add new sections in Documentation, and move some material from the README to these doc pages. Small tweaks to the documentation.

The docs are not done yet, grep for "TODO" before v5 launch, including:
* No link to TaxVamb preview in README
* No description of how to run AVAMB and TaxVamb
* No description of the Taxonomy input to TaxVamb or Taxometer
RasmussenLab · Jul 11, 2024 · 7dc2f98
1 parent adeb157
commit 7dc2f98
Showing 8 changed files with 200 additions and 1,017 deletions.
diff --git a/README.md b/README.md
diff --git a/doc/how_to_run.md b/doc/how_to_run.md
@@ -0,0 +1,85 @@
+## Running Vamb
+Most users will want to copy and adapt the commands from the quickstart section below.
+Users with more complex data, or who really want to dig into Vamb to get the most out of it, should read the in-depth sections below.
+
+### Quickstart
+#### Vamb
+```shell
+$ # Assemble your reads, one assembly per sample, e.g. with SPAdes
+$ for sample in 1 2 3; do
+      spades.py --meta ${sample}.{fw,rv}.fq.gz  -t 24 -m 100gb -o asm_${sample};
+  done
+
+$ # Concatenate your assemblies, and rename the contigs to the naming scheme
+$ # S{sample}C{original contig name}. This can be done with a script provided by Vamb:
+$ python concatenate.py contigs.fna.gz asm_{1,2,3}/contigs.fasta
+
+$ # Estimate sample-wise abundance by mapping reads to the contigs.
+$ # Any mapper will do, but we recommend strobealign with the --aemb flag
+$ mkdir aemb
+$ for sample in 1 2 3; do
+      strobealign -t 8 --aemb contigs.fna.gz ${sample}.{fw,rv}.fq.gz > aemb/${sample}.tsv;
+  done
+
+$ # Run Vamb using the contigs and the directory with abundance files
+$ vamb bin default --outdir vambout --fasta contigs.fna.gz --aemb aemb
+```
+
+#### TaxVamb
+[TODO]
+
+#### AVAMB
+[TODO]
+
+### The different Vamb inputs
+All modes of Vamb take various _inputs_ and produce various _outputs_.
+Currently, all modes take the following two central inputs:
+
+* The kmer-composition of the sequence (the _composition_).
+* The abundance of the contigs in each sample (the _abundance_).
+
+For inputs that take significant time to produce, Vamb will serialize the parsed input to a file, such that future runs of Vamb can use that instead of re-computing it.
+
+#### Composition
+The composition is computed from the input contig file in FASTA format (the 'catalogue').
+From the command line, this looks like:
+
+```
+--fasta contigs.fna.gz
+```
+
+The catalogue may be either gzipped or a plain file.
+Once the catalogue is parsed, Vamb will write the composition to the output file `composition.npz`.
+Future runs can then instead use:
+
+```
+--composition composition.npz
+```
+
+#### Abundance
+The abundance may be computed from:
+* [recommended] A directory containing TSV files obtained by mapping reads from
+  each individual sample against the contig catalogue using `strobealign --aemb`
+* A directory of BAM files generated the same way, except using any aligner that
+  produces a BAM file, e.g. `minimap2`
+
+Thus, it can be specified as:
+```
+--aemb dir_with_aemb_files
+```
+or
+```
+--bamdir dir_with_bam_files
+```
+
+When parsed, Vamb will produce the file `abundance.npz`, which can be used for future
+Vamb runs instead:
+```
+--rpkm abundance.npz
+```
+
+__Note:__ Vamb will check that the sequence names in the TSV / BAM files correspond
+to the names in the composition.
+
+#### Taxonomy
+[TODO]
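
Note on the name check mentioned under "Abundance" in `doc/how_to_run.md`: the following is a minimal Python sketch of how one might compare sequence names before starting a run. It is not Vamb's own check. It assumes the abundance TSV has the sequence name in its first tab-separated column and no header line, and the function names `fasta_names`, `tsv_names` and `name_mismatches` are hypothetical.

```python
import io


def fasta_names(handle):
    """Return the identifier (first word of the header) of every FASTA record."""
    return [line[1:].split()[0] for line in handle if line.startswith(">")]


def tsv_names(handle):
    """Return the first column of every non-empty line.

    Assumption: the sequence name is in column one and there is no header line.
    """
    return [line.rstrip("\n").split("\t")[0] for line in handle if line.strip()]


def name_mismatches(fasta_handle, tsv_handle):
    """Return (names only in the FASTA, names only in the TSV)."""
    in_fasta = set(fasta_names(fasta_handle))
    in_tsv = set(tsv_names(tsv_handle))
    return in_fasta - in_tsv, in_tsv - in_fasta


# Tiny in-memory example instead of real files.
fasta = io.StringIO(">S1C1 len=5\nACGTA\n>S1C2\nGGCC\n")
tsv = io.StringIO("S1C1\t3.5\nS1C2\t0.0\n")
missing, extra = name_mismatches(fasta, tsv)
print(missing, extra)
```

Two empty sets mean every name appears in both files. This sketch ignores order, and whether Vamb also requires identical order is not stated here.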
diff --git a/doc/index.rst b/doc/index.rst
@@ -1,4 +1,4 @@
-Variational Autoencoder for Metagenomic Binning (VAMB)
+Variational Autoencoders for Metagenomic Binning (VAMB)
 =======================================================
 
 .. include:: ../README.md
@@ -7,14 +7,27 @@ Variational Autoencoder for Metagenomic Binning (VAMB)
 
 .. toctree::
    :maxdepth: 2
-   :caption: Tutorial
+   :caption: How to run
 
-   tutorial.md
+   how_to_run.md
 
 .. toctree::
-   :caption: Setup Documentation
+   :maxdepth: 2
+   :caption: Vamb output
+
+   outputs.md
 
-   README
+.. toctree::
+   :maxdepth: 2
+   :caption: Tips for running Vamb
+
+   tips.md
+
+.. toctree::
+   :maxdepth: 2
+   :caption: In-depth walkthrough
+
+   tutorial.md
 
 Indices and tables
 ==================

diff --git a/doc/outputs.md b/doc/outputs.md
@@ -0,0 +1,36 @@
+# Outputs
+
+### Vamb
+- `log.txt`: A text file with information about the Vamb run. Look here (and at stderr) if you experience errors.
+- `composition.npz`: A Numpy .npz file that contains all kmer composition information computed by Vamb from the FASTA file. This can be provided to another run of Vamb to skip the composition calculation step.
+- `abundance.npz`: Similar to `composition.npz`, but this file contains information calculated from the BAM files. Using this as input instead of BAM files will skip re-parsing the BAM files, which takes a significant amount of time.
+- `model.pt`: A file containing the trained VAE model. When running Vamb from a Python interpreter, the VAE can be loaded from this file to skip training.
+- `latent.npz`: This contains the output of the VAE model.
+- `vae_clusters_unsplit.tsv`: A two-column text file with one row per sequence:
+  Left column for the cluster (i.e. bin) name, right column for the sequence name.
+  You can create the FASTA-file bins themselves using the script in `src/create_fasta.py`.
+- `vae_clusters_split.tsv` (only if binsplitting is enabled): Similar to the unsplit version, but after binsplitting.
+- `vae_clusters_metadata.tsv`: A file with some metadata about clusters.
+    - Name: The name of the cluster
+    - Radius: Cosine radius in latent space. Clusters with a small radius are usually more likely to be pure.
+    - Peak/valley ratio: A small PVR means the cluster's edges are better defined, and hence the cluster is more likely pure
+    - Kind: Currently, Vamb produces three kinds of clusters:
+        - Normal: Defined by a local density in latent space. Most good clusters are of this type
+        - Loner: A contig far away from everything else in latent space.
+        - Fallback: After failing to produce good clusters for some time, these (usually poor) clusters are created
+          to not get stuck in an infinite loop when clustering
+    - Bp: Sum of the lengths of all sequences in the cluster
+    - Ncontigs: Number of sequences in the cluster
+
+### TaxVamb
+[TODO]
+
+### Taxometer
+[TODO]
+
+### AVAMB
+Same as VAMB, but also:
+- `aae_y_clusters_{split,unsplit}.tsv`: The clusters obtained from the categorical latent space
+- `aae_z_latent.npz`: Like `latent.npz`, but of the adversarial Z latent space
+- `aae_z_clusters_{metadata,split,unsplit}.tsv`: Like the corresponding `vae_clusters*` files, but from the adversarial Z latent space
+
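
Note on `vae_clusters_metadata.tsv`: as a rough illustration of how the metadata fields listed above could be used to filter clusters, here is a minimal Python sketch. The column headers used here (`Name`, `Kind`, `Bp`, `Peak/valley ratio`) are copied from the field list in this document and are an assumption; the header strings in a real file may differ, so check its first line. The thresholds and the function name `keep_clusters` are illustrative only.

```python
import csv
import io


def keep_clusters(handle, min_bp=250_000, max_pvr=0.5):
    """Yield the names of clusters that pass three simple filters.

    Assumptions: tab-separated input with a header line using the column
    names below; min_bp and max_pvr are arbitrary example thresholds.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Kind"] == "Fallback":
            continue  # usually poor clusters
        if int(row["Bp"]) < min_bp:
            continue  # too small to be a reasonably complete genome
        if float(row["Peak/valley ratio"]) > max_pvr:
            continue  # poorly defined cluster edges
        yield row["Name"]


# Made-up metadata rows for demonstration.
example = io.StringIO(
    "Name\tRadius\tPeak/valley ratio\tKind\tBp\tNcontigs\n"
    "1\t0.05\t0.10\tNormal\t3100000\t212\n"
    "2\t0.20\t0.45\tFallback\t900000\t80\n"
    "3\t0.00\t0.00\tLoner\t4000\t1\n"
    "4\t0.08\t0.70\tNormal\t2500000\t150\n"
)
print(list(keep_clusters(example)))  # ['1']
```

Only cluster 1 survives: cluster 2 is a fallback cluster, cluster 3 is too small, and cluster 4 exceeds the example peak/valley threshold.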
diff --git a/doc/tips.md b/doc/tips.md
@@ -0,0 +1,47 @@
+# Tips for running Vamb
+
+### Garbage in, garbage out
+For the best results when running Vamb, make sure the inputs to Vamb are as good as they can be.
+In particular, the assembly process is a main bottleneck in the total binning workflow, so improving assembly
+by e.g. preprocessing reads, using a better assembler, or switching to long read technology can make a big difference.
+
+### Postprocess your bins
+In principle, Vamb will bin every single input contig.
+Currently, Vamb's bins are also _disjoint_, meaning each contig is present in only one bin.
+
+Having to place every contig into a bin, even those with a weak binning signal,
+means that a large number of contigs will be binned poorly.
+Often, these poor-quality contigs are put in a bin of their own, or with just one or two smaller contigs.
+Practically speaking, this means _most bins produced by Vamb will be of poor quality_.
+
+Hence, to use bins you can rely on, you will need to postprocess your bins:
+* You may filter the bins by size, if you are only looking for organisms
+  and not e.g. plasmids.
+  For example, removing all bins < 250,000 bp in size will remove most poor quality bins,
+  while keeping all bacterial genomes with a reasonable level of completeness.
+* Using tools such as CheckM2 to score your bins, you can keep only the bins
+  that pass some scoring criteria
+* You may use the information in the `vae_clusters_metadata.tsv` file (see Output),
+  and e.g. remove all clusters marked as "Fallback", below a certain size, or with too
+  high a peak/valley ratio.
+
+### How binsplitting works
+In the recommended workflow, each sample is assembled independently, then the contigs are pooled
+and binned together.
+After Vamb has encoded the input features into the embedding (latent space), the embedding is
+clustered.
+The clusters thus may contain contigs from multiple samples, and may represent the same genome assembled
+in different samples.
+To obtain mono-sample bins from the clusters, the clusters are then split by their sample of origin in a process we call binsplitting.
+This reduces duplication in the output bins, and better preserves inter-sample diversity.
+
+Binsplitting is done by looking at the identifiers (headers) of the contigs in the FASTA file:
+They are assumed to be named according to the scheme `<sample identifier><separator><contig identifier>`,
+where:
+* The sample identifier uniquely identifies the sample that the contig came from,
+* The separator separates the sample and contig identifiers, and is guaranteed to not be contained in the sample identifier
+* The contig identifier uniquely identifies the contig within the sample.
+When using the provided `src/concatenate.py` script, the names conform to this scheme, being named e.g.
+`S5C1042`, for sample 5, contig 1042. In this case, the separator is 'C'.
+
+The separator can be set on the command line with the flag `-o`.
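
Note on the naming scheme described in `doc/tips.md`: the splitting step can be sketched in a few lines of Python. This is an illustration of the idea, not Vamb's implementation. The function name `binsplit` and the way the split bins are named here (`<sample><separator><cluster>`) are assumptions made for the example.

```python
from collections import defaultdict


def binsplit(clusters, separator="C"):
    """Split each cluster into one bin per sample of origin.

    `clusters` maps a cluster name to a list of contig names following the
    scheme <sample identifier><separator><contig identifier>.
    """
    bins = defaultdict(list)
    for cluster, contigs in clusters.items():
        for contig in contigs:
            # The separator never occurs in the sample identifier, so the
            # first occurrence marks the end of the sample identifier.
            sample, sep, _ = contig.partition(separator)
            if not sep:
                raise ValueError(f"separator {separator!r} not found in {contig!r}")
            # Hypothetical naming of the resulting mono-sample bin.
            bins[f"{sample}{separator}{cluster}"].append(contig)
    return dict(bins)


clusters = {"12": ["S1C7", "S1C19", "S5C1042"], "13": ["S5C3"]}
print(binsplit(clusters))
```

Cluster 12 contains contigs from samples S1 and S5, so it is split into two mono-sample bins, while cluster 13 stays a single bin.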