diff --git a/workflow_taxVamb/README.md b/workflow_taxVamb/README.md
deleted file mode 100644
index fb30ebe7..00000000
--- a/workflow_taxVamb/README.md
+++ /dev/null
@@ -1,170 +0,0 @@
-# taxVamb snakemake workflow
-This is a snakemake workflow that performs all the necessary steps to run TaxVamb. As input it takes quality-control-filtered paired-end reads and individual per-sample _de novo_ assemblies, processes them through the best-practice (multi-split) pipeline, obtains taxonomic annotations with MMseqs2, runs vaevae, and runs CheckM2 to assess the quality of the bins.
-
-In short it will:
-
-```
-1. Filter contigs for a minimum length of 2000 bp and rename them to conform with the multi-split workflow
-2. Index the resulting contig file with minimap2 and create a sequence dictionary
-3. Map reads with minimap2 to the combined contig set
-4. Sort the bam files
-5. Run MMseqs2
-6. Run vaevae and reclustering
-7. Determine completeness and contamination of the bins using CheckM2
-```
-
-The nice thing about using snakemake for this is that it keeps track of which jobs have finished, and it allows the workflow to be run on different hardware such as a laptop, a Linux workstation or an HPC facility (currently with qsub). Keep in mind that there are three different paths (called directed acyclic graphs in snakemake) that snakemake can execute depending on the outputs generated during the workflow, which complicates the interpretation of the snakemake file a bit. That is why we added a comment to each rule briefly explaining its purpose. Feel free to reach out to us if you encounter any problems.
-
-## Installation
-To run the workflow first install a Python3 version of [Miniconda](https://docs.conda.io/en/latest/miniconda.html) and [mamba](https://mamba.readthedocs.io/en/latest/installation.html#fresh-install). TaxVamb uses CheckM2 to score the bins. Unfortunately, due to dependency conflicts, CheckM2 cannot be installed in the same environment as taxVamb, so a separate environment has to be created for CheckM2 and for taxVamb:
-
-```
- # Install taxVamb and mmseqs2 in the taxVamb environment
- git clone https://github.com/sgalkina/vamb.git --branch mmseq-vae
- mamba create -n taxVamb python=3.9.16
- mamba env update -n taxVamb --file vamb/workflow_taxVamb/envs/taxVamb.yaml
- conda activate taxVamb
- cd vamb && pip install -e . && cd ..
- # mmseqs relies on the GTDB database, so if it is not present already, it has to be downloaded.
- mmseqs databases GTDB databases/
-
-# Install CheckM2 in the checkm2 environment
- git clone https://github.com/chklovski/CheckM2.git
- mamba create -n checkm2 python=3.8.15
- mamba env update -n checkm2 --file vamb/workflow_taxVamb/envs/checkm2.yml
- conda activate checkm2
- cd CheckM2 && git checkout e563159 && python setup.py install && cd ..
- checkm2 database --download
- conda deactivate
-```
-Although taxVamb and CheckM2 live in different environments, snakemake will take care of selecting the right environment for each task, so now we should be ready to move forward and configure the input data for our workflow. Installation should not take more than 30 minutes on a normal laptop, depending on your internet connection and computer.
-
-## Set up configuration with your data
-
-To run the snakemake workflow you need to set up three files: the configuration file (`config.json`), a file with paths to your contig files (`contigs.txt`) and a file with paths to your reads (`samples2data.tsv`).
-Example files are included and described here for an example dataset of four samples:
-
-`contigs.txt` contains paths to each of the per-sample assemblies:
-```
-assemblies/contigs.sample_1.fna.gz
-assemblies/contigs.sample_2.fna.gz
-assemblies/contigs.sample_3.fna.gz
-assemblies/contigs.sample_4.fna.gz
-```
-
-`samples2data.tsv` contains the sample name, the path to read-pair 1 and the path to read-pair 2 (tab-separated):
-```
-sample_1 reads/sample_1.r1.fq.gz reads/sample_1.r2.fq.gz
-sample_2 reads/sample_2.r1.fq.gz reads/sample_2.r2.fq.gz
-sample_3 reads/sample_3.r1.fq.gz reads/sample_3.r2.fq.gz
-sample_4 reads/sample_4.r1.fq.gz reads/sample_4.r2.fq.gz
-
-```
-
-Finally, there is the configuration file (`config.json`). The first two entries point to the files containing the contig and read information; `index_size` is the size of the minimap2 index(es); `min_contig_size` defines the minimum contig length considered for binning; `min_bin_size` defines the minimum bin size that will be written as a fasta file; `min_identity` refers to the minimum read alignment identity used when computing the contig abundances; the `mem` and `ppn` entries set the amount of memory and the number of cores set aside for minimap2, MMseqs2, vaevae and CheckM2; `vaevae_params` gives the parameters passed to vaevae; finally, `outdir` points to the directory that will be created and where all output files will be stored. Here we tell it to use the multi-split approach (`-o C`) and to run the vaevae model. With regard to `index_size`, this is the number of Gbases that minimap2 will map against at the same time. You can increase it if you have more memory available on your machine (e.g. an index of `12G` can be run using `"minimap_mem": "12gb"`). A small sanity-check sketch for these input files is included at the end of this README.
-
-```
-{
-    "contigs": "contigs.txt",
-    "sample_data": "samples2data.tsv",
-    "index_size": "3G",
-    "min_contig_size": "2000",
-    "min_bin_size": "200000",
-    "min_identity": "0.95",
-    "minimap_mem": "15GB",
-    "minimap_ppn": "15",
-    "vaevae_mem": "50GB",
-    "vaevae_ppn": "30",
-    "checkm2_mem": "15GB",
-    "checkm2_ppn": "15",
-    "mmseq_mem": "260GB",
-    "mmseq_ppn": "30",
-    "vaevae_params": " -l 64 -e 500 -q 25 75 150 -pe 100 -pq 25 75 --model vaevae -o C ",
-    "vaevae_preload": "",
-    "outdir": "taxVamb_outdir",
-    "min_comp": "0.9",
-    "max_cont": "0.05"
-}
-
-```
-## Example run
-
-To run the workflow, give snakemake the maximum number of cores you want to use and the paths to the config file and the snakefile. You can then run snakemake on a laptop or workstation as below - remember to activate the conda environment first before running snakemake.
-
-```
-conda activate taxVamb
-snakemake --cores 20 --configfile /path/to/vamb/workflow_taxVamb/config.json --snakefile /path/to/vamb/workflow_taxVamb/taxVamb.snake.conda.smk --use-conda
-```
-
-If you want to use snakemake on a compute cluster with `qsub`, add the `--cluster` option as below.
-```
-
-conda activate taxVamb
-snakemake --jobs 20 --configfile /path/to/vamb/workflow_taxVamb/config.json --snakefile /path/to/vamb/workflow_taxVamb/taxVamb.snake.conda.smk --latency-wait 60 --use-conda --cluster "qsub -l walltime={params.walltime} -l nodes=1:ppn={params.ppn} -l mem={resources.mem} -e {log.e} -o {log.o}"
-```
-
-Or, if you want to use snakemake on a compute cluster with `slurm`, add the `--slurm` option as below.
-```
-
-conda activate taxVamb
-snakemake --slurm --cores 20 --jobs 10 --configfile /path/to/vamb/workflow_taxVamb/config.json --snakefile /path/to/vamb/workflow_taxVamb/taxVamb.snake.conda.smk --latency-wait 60 --use-conda
-```
-
-Note 1: If you want to re-run taxVamb with different parameters you can change `vaevae_params` in the config file, but remember to also change the `outdir` entry, otherwise the previous output will be overwritten.
-
-## Outputs
-taxVamb produces the following output files:
-- `Final_bins`: folder containing the final set of bins, produced by running CheckM2 on the taxVamb bins per sample. This folder also contains a `quality_report.tsv` file with the completeness and contamination of the final set of near-complete bins as determined by CheckM2. The contamination and completeness thresholds for the bins can be modified by setting `max_cont` and `min_comp` in the `config.json` file.
-- `Final_clusters.tsv`: the final clusters file, likewise a product of running CheckM2 on the taxVamb bins per sample. `Final_bins` and `Final_clusters.tsv` contain the same contigs per bin/cluster, and the same `max_cont` and `min_comp` thresholds apply.
-- `tmp`: folder containing the intermediate files and directories generated during the workflow:
-  - `mapped`: folder containing the sorted BAM files per sample. These BAM files were used by TaxVamb to generate the `abundance.npz` file, which expresses the contig abundances across samples. They can be quite large, so they may be deleted to free space; on the other hand they take a while to generate, so it might be worth keeping them.
-  - `abundance.npz`: file aggregating the contig abundances across samples from the BAM files. Using this as input instead of the BAM files skips re-parsing the BAM files, which takes a significant amount of time.
-  - `contigs.flt.fna.gz`: a gzipped fasta file containing the contigs aggregated from all samples, renamed and filtered by the minimum contig length defined by `min_contig_size` in the `config.json`.
-  - `snakemake_tmp`: folder containing all the log files from the snakemake workflow, especially from the binning part. These files are the evidence that each workflow rule has been executed, and some of them contain the actual `stderr` and `stdout` of the commands executed by the rules. If you encounter any problem with snakemake, check this folder for debugging.
-- `log`: folder containing all the log files from running the snakemake workflow on a high-performance computing cluster, as well as the log files from the non-binning parts of the workflow, i.e. mapping, contig filtering, etc. If any snakemake rule fails, check its log for debugging.
-- `vaevae`: folder containing the actual binning output files:
-  - `bins`: folder containing the bins per sample obtained by clustering the VAEVAE latents, before reclustering.
-  - `bins_reclustered`: folder containing the bins per sample obtained by clustering the VAEVAE latents and reclustering them taking single-copy genes into account.
-  - `composition.npz`: a Numpy .npz file that contains all the k-mer composition information computed by TaxVamb from the FASTA file. This can be provided to another run of TaxVamb to skip the composition calculation step.
-  - `contignames`: text file listing the contigs that remain after filtering by the minimum contig size defined by `min_contig_size` in the `config.json` file.
-  - `lengths.npz`: Numpy object that contains the contig lengths, in the same order as `contignames`.
-  - `log.txt`: a text file with information about the TaxVamb run. Look here (and at stderr) if you experience errors.
-  - `mask.npz`: some contigs may have been filtered out before binning based on their computed abundances and tetranucleotide frequencies; this boolean Numpy object contains that mask.
-  - `model.pt`: a file containing the trained VAEVAE model. When running TaxVamb from a Python interpreter, the VAEVAE can be loaded from this file to skip training.
-  - `vae_clusters.tsv`: file generated by clustering the VAEVAE latent space, where each row is a sequence: the left column holds the cluster (i.e. bin) name, the right column the sequence name. You can create the FASTA-file bins themselves using the script in `src/create_fasta.py` (a minimal sketch of this is shown after this README).
-
-## Using a GPU to speed up taxVamb
-
-Using a GPU can speed up taxVamb considerably - especially when you are binning millions of contigs. In order to enable it you need to make a couple of changes to the configuration file. Basically, we need to add `--cuda` to `vaevae_params` to tell taxVamb to use the GPU. Then, if you are using the `--cluster` option, you also need to update `vaevae_ppn` accordingly - e.g. on our system (qsub) we exchange `"vaevae_ppn": "30"` for `"vaevae_ppn": "30:gpus=1"`. The `config.json` file therefore looks like this if you want to use GPU acceleration:
-
-```
-{
-    "contigs": "contigs.txt",
-    "sample_data": "samples2data.tsv",
-    "index_size": "3G",
-    "min_contig_size": "2000",
-    "min_bin_size": "200000",
-    "min_identity": "0.95",
-    "minimap_mem": "15GB",
-    "minimap_ppn": "15",
-    "vaevae_mem": "50GB",
-    "vaevae_ppn": "30:gpus=1",
-    "checkm2_mem": "15GB",
-    "checkm2_ppn": "15",
-    "mmseq_mem": "260GB",
-    "mmseq_ppn": "30",
-    "vaevae_params": " -l 64 -e 500 -q 25 75 150 -pe 100 -pq 25 75 --model vaevae -o C --cuda ",
-    "vaevae_preload": "",
-    "outdir": "taxVamb_outdir",
-    "min_comp": "0.9",
-    "max_cont": "0.05"
-}
-
-```
-
-Note that we could not get `taxVamb` to work with `cuda` on our cluster when installing from bioconda. Therefore we added the `vaevae_preload` entry to the configuration file, which can be used to load the cuda toolkit module when running `taxVamb`.
-
-Please let us know if you have any issues and we can try to help out.
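
As referenced in the `vae_clusters.tsv` bullet above, bins can be rebuilt from a clusters file. The snippet below is a minimal, illustrative stand-in for `src/create_fasta.py` (the actual script is not shown in this diff and may differ): it reads a clusters TSV together with the filtered contig FASTA and writes one FASTA file per cluster. The output naming and the `min_bin_size` filtering are assumptions made for this sketch.

```
import gzip
import os
import sys
from collections import defaultdict


def read_fasta(path):
    """Yield (name, sequence) tuples from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        name, seq = None, []
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line)
        if name is not None:
            yield name, "".join(seq)


def write_bins(clusters_tsv, contigs_fna, outdir, min_bin_size=200000):
    # Left column is the cluster (bin) name, right column the contig name.
    # A header line, if present, is harmless: it will match no contig.
    contig2cluster = {}
    with open(clusters_tsv) as fh:
        for line in fh:
            cluster, contig = line.rstrip("\n").split("\t")
            contig2cluster[contig] = cluster

    # Group the sequences of the filtered contig catalogue by cluster.
    bins = defaultdict(list)
    for name, seq in read_fasta(contigs_fna):
        if name in contig2cluster:
            bins[contig2cluster[name]].append((name, seq))

    os.makedirs(outdir, exist_ok=True)
    for cluster, seqs in bins.items():
        if sum(len(s) for _, s in seqs) < min_bin_size:
            continue  # drop small bins, analogous to --minfasta
        with open(os.path.join(outdir, cluster + ".fna"), "w") as out:
            for name, seq in seqs:
                out.write(f">{name}\n{seq}\n")


if __name__ == "__main__":
    # Usage: python create_fasta_sketch.py vae_clusters.tsv contigs.flt.fna.gz bins/
    write_bins(sys.argv[1], sys.argv[2], sys.argv[3])
```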
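
And, as promised in the configuration section: a short, optional sanity check for the input files, since a mistyped path in `contigs.txt` or `samples2data.tsv` otherwise only surfaces once indexing or mapping starts. This helper is not part of the workflow; the file names follow the config defaults and the three-column tab-separated layout described above.

```
import os
import sys


def check_inputs(contigs_txt="contigs.txt", sample_tsv="samples2data.tsv"):
    """Exit with an error listing any referenced assembly or read file that is missing."""
    missing = []
    with open(contigs_txt) as fh:
        for path in (line.strip() for line in fh):
            if path and not os.path.isfile(path):
                missing.append(path)
    with open(sample_tsv) as fh:
        for n, line in enumerate(fh, 1):
            if not line.strip():
                continue  # tolerate trailing blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                sys.exit(f"{sample_tsv}, line {n}: expected 3 tab-separated fields, got {len(fields)}")
            missing.extend(p for p in fields[1:] if not os.path.isfile(p))
    if missing:
        sys.exit("Missing input files:\n" + "\n".join(missing))
    print("All input files found.")


if __name__ == "__main__":
    check_inputs()
```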
- - diff --git a/workflow_taxVamb/config.json b/workflow_taxVamb/config.json deleted file mode 100644 index 15533983..00000000 --- a/workflow_taxVamb/config.json +++ /dev/null @@ -1,24 +0,0 @@ -{ - "contigs": "contigs.txt", - "sample_data": "samples2data.tsv", - "index_size": "3G", - "min_contig_size": "2000", - "min_bin_size": "200000", - "min_identity": "0.95", - "minimap_mem": "15GB", - "minimap_ppn": "15", - "vaevae_mem": "50GB", - "vaevae_ppn": "30", - "reclust_mem": "50GB", - "reclust_ppn": "30", - "checkm2_mem": "15GB", - "checkm2_ppn": "15", - "mmseq_mem": "260GB", - "mmseq_ppn": "30", - "mmseq_db": "/path/to/mmseq2/gtdb", - "vaevae_params": " -l 64 -e 500 -q 25 75 150 -pe 100 -pq 25 75 --model vaevae -o C ", - "vaevae_preload": "", - "outdir": "taxVamb_outdir", - "min_comp": "0.9", - "max_cont": "0.05" -} diff --git a/workflow_taxVamb/envs/checkm2.yml b/workflow_taxVamb/envs/checkm2.yml deleted file mode 100644 index 95b0d61d..00000000 --- a/workflow_taxVamb/envs/checkm2.yml +++ /dev/null @@ -1,19 +0,0 @@ -channels: - - conda-forge - - bioconda - - defaults -dependencies: - - python=3.8.15 - - scikit-learn=0.23.2 - - h5py=2.10.0 - - numpy=1.23.2 - - tensorflow=2.9.1 - - lightgbm=3.3.2 - - pandas=1.4.3 - - scipy=1.9.0 - - setuptools=65.3.0 - - requests=2.28.1 - - packaging=21.3 - - tqdm=4.64.0 - - diamond=2.0.15 - - prodigal=2.6.3 diff --git a/workflow_taxVamb/envs/minimap2.yaml b/workflow_taxVamb/envs/minimap2.yaml deleted file mode 100644 index bbe81442..00000000 --- a/workflow_taxVamb/envs/minimap2.yaml +++ /dev/null @@ -1,6 +0,0 @@ -name: minimap2 -channels: - - bioconda -dependencies: - - minimap2 - - samtools diff --git a/workflow_taxVamb/envs/samtools.yaml b/workflow_taxVamb/envs/samtools.yaml deleted file mode 100644 index 61dfbc38..00000000 --- a/workflow_taxVamb/envs/samtools.yaml +++ /dev/null @@ -1,5 +0,0 @@ -name: samtools -channels: - - bioconda -dependencies: - - samtools diff --git a/workflow_taxVamb/envs/taxVamb.yaml b/workflow_taxVamb/envs/taxVamb.yaml deleted file mode 100644 index a8d0a466..00000000 --- a/workflow_taxVamb/envs/taxVamb.yaml +++ /dev/null @@ -1,19 +0,0 @@ -name: vamb -channels: - - conda-forge - - bioconda - - defaults -dependencies: -- python=3.9.16 -- snakemake=7.22.0 -- pip=23.0.1 -- biopython=1.81 -- networkx=3.0 -- scikit-learn=1.2.2 -- pandas=2.0.0 -- mmseqs2 -- prodigal -- hmmer -- pip: - - dadaptation==3.0 - - loguru==0.7.2 diff --git a/workflow_taxVamb/src/computerome_vaevae.sh b/workflow_taxVamb/src/computerome_vaevae.sh deleted file mode 100755 index 4b8843bc..00000000 --- a/workflow_taxVamb/src/computerome_vaevae.sh +++ /dev/null @@ -1,12 +0,0 @@ -#!/usr/bin/bash -module load tools computerome_utils/2.0 -module load anaconda3/2023.03 -module unload gcc -module load gcc/11.1.0 -module load minimap2/2.17r941 samtools/1.10 - -source ~/.bashrc -conda init bash -conda activate /home/projects/cpr_10006/people/svekut/.conda/vamb - -./$1 $2 \ No newline at end of file diff --git a/workflow_taxVamb/src/create_cluster_scores_bin_path_dict.py b/workflow_taxVamb/src/create_cluster_scores_bin_path_dict.py deleted file mode 100644 index 80e7cd3a..00000000 --- a/workflow_taxVamb/src/create_cluster_scores_bin_path_dict.py +++ /dev/null @@ -1,65 +0,0 @@ -import numpy as np -import os -import json -import argparse - -from typing import cast - - -def get_cluster_score_bin_path( - path_checkm_all: str, path_bins: str, bins: set[str] -) -> tuple[dict[str, tuple[float, float]], dict[str, str]]: - """Given CheckM has been run for all samples, create 2 
dictionaries: - - {bin:path_bin} - - {bin:[completeness, contamination]}""" - cluster_score: dict[str, tuple[float, float]] = dict() - bin_path: dict[str, str] = dict() - for sample in os.listdir(path_checkm_all): - path_quality_s = os.path.join(path_checkm_all, sample, "quality_report.tsv") - c_com_con = np.loadtxt( - path_quality_s, - delimiter="\t", - skiprows=1, - usecols=(0, 1, 2), - dtype=str, - ndmin=2, - ) - - for row in c_com_con: - cluster, com, con = row - cluster = cast(str, cluster) - com, con = float(com), float(con) - bin_name = cluster + ".fna" - if bin_name in bins: - cluster_score[cluster] = (com, con) - bin_path[cluster + ".fna"] = os.path.abspath( - os.path.join(path_bins, sample, cluster + ".fna") - ) - return cluster_score, bin_path - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("--s", type=str, help="path checkm2 that contains all samples") - parser.add_argument("--b", type=str, help="path all bins ") - parser.add_argument( - "--cs_d", type=str, help="cluster_score dictionary will be stored here" - ) - parser.add_argument( - "--bp_d", type=str, help="bin_path dictionary will be stored here " - ) - - opt = parser.parse_args() - - bins_set = set() - for sample in os.listdir(opt.b): - for bin_ in os.listdir(os.path.join(opt.b, sample)): - if ".fna" in bin_: - bins_set.add(bin_) - - cluster_score, bin_path = get_cluster_score_bin_path(opt.s, opt.b, bins_set) - with open(opt.cs_d, "w") as f: - json.dump(cluster_score, f) - - with open(opt.bp_d, "w") as f: - json.dump(bin_path, f) diff --git a/workflow_taxVamb/src/longread_human.sh b/workflow_taxVamb/src/longread_human.sh deleted file mode 100755 index 41eae369..00000000 --- a/workflow_taxVamb/src/longread_human.sh +++ /dev/null @@ -1,24 +0,0 @@ -#!/usr/bin/bash - -vamb \ - --model vaevae \ - --outdir /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevaeout \ - --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \ - --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \ - --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/longread_taxonomy_2023.tsv \ - -l 64 \ - -e 1000 \ - -q 25 75 150 500 \ - -pe 100 \ - -pq 25 75 \ - --cuda \ - --minfasta 200000 - -vamb \ - --model reclustering \ - --latent_path /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevaeout/vaevae_latent.npy \ - --clusters_path /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevaeout/vaevae_clusters.tsv \ - --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \ - --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \ - --outdir /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevaeout_reclustering \ - --minfasta 200000 diff --git a/workflow_taxVamb/src/longread_sludge.sh b/workflow_taxVamb/src/longread_sludge.sh deleted file mode 100755 index eb9f9607..00000000 --- a/workflow_taxVamb/src/longread_sludge.sh +++ /dev/null @@ -1,27 +0,0 @@ -#!/usr/bin/bash - -# --taxonomy /home/projects/cpr_10006/people/paupie/vaevae/mmseq2_annotations/long_read_sludge/lr_sludge_taxonomy.tsv \ - -vamb \ - --model vaevae \ - --outdir /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout \ - --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \ - --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \ - --taxonomy_predictions 
/home/projects/cpr_10006/people/svekut/vamb/results_taxonomy_predictor.csv \ - -l 64 \ - -e 500 \ - -q 150 \ - -pe 100 \ - -pq 25 75 \ - --cuda \ - --minfasta 200000 - -vamb \ - --model reclustering \ - --latent_path /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout/vaevae_latent.npy \ - --clusters_path /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout/vaevae_clusters.tsv \ - --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \ - --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \ - --outdir /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout_reclustering2 \ - --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_sludge.hmmout \ - --minfasta 200000 diff --git a/workflow_taxVamb/src/shortread_CAMI2.sh b/workflow_taxVamb/src/shortread_CAMI2.sh deleted file mode 100755 index cbb20a41..00000000 --- a/workflow_taxVamb/src/shortread_CAMI2.sh +++ /dev/null @@ -1,26 +0,0 @@ -#!/usr/bin/bash -dataset=$1 - -vamb \ - --model vaevae \ - --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_out \ - --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \ - --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \ - --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy_2023.tsv \ - -l 64 \ - -e 500 \ - -q 25 75 150 \ - -pe 100 \ - -pq 25 75 \ - --cuda \ - --minfasta 200000 - -vamb \ - --model reclustering \ - --latent_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_out/vaevae_latent.npy \ - --clusters_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_out/vaevae_clusters.tsv \ - --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \ - --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \ - --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_out_reclustering2 \ - --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_cami_${dataset}.hmmout \ - --minfasta 200000 diff --git a/workflow_taxVamb/src/shortread_almeida.sh b/workflow_taxVamb/src/shortread_almeida.sh deleted file mode 100755 index 0dcef2e2..00000000 --- a/workflow_taxVamb/src/shortread_almeida.sh +++ /dev/null @@ -1,17 +0,0 @@ -#!/usr/bin/bash - -vamb \ - --model vaevae \ - --outdir /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout \ - --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fa.gz \ - --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \ - --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/almeida_taxonomy.tsv \ - -l 32 \ - -t 512 \ - -e 200 \ - -q \ - -pe 50 \ - -pq \ - --n_species 100 \ - --cuda \ - --minfasta 200000 diff --git a/workflow_taxVamb/src/symlink_nc_bins.py b/workflow_taxVamb/src/symlink_nc_bins.py deleted file mode 100644 index 6c0020f5..00000000 --- a/workflow_taxVamb/src/symlink_nc_bins.py +++ /dev/null @@ -1,39 +0,0 @@ -import os -import json -import argparse - - -parser = argparse.ArgumentParser() -parser.add_argument("-o", type=str, help="outdir where symlinks will be stores ") -parser.add_argument("--cs_d", type=str, help="cluster_score dictionary") -parser.add_argument("--bp_d", type=str, help="bin_path dictionary") - -parser.add_argument("-x", type=str, default=".fna", help="bins extensions ") - -parser.add_argument("--min_comp", type=float, 
help="Minimum bin completeness") -parser.add_argument("--max_cont", type=float, help="Maximum bin contamination") - -opt = parser.parse_args() - -with open(opt.cs_d, "r") as f: - cs_d = json.load(f) - - -with open(opt.bp_d, "r") as f: - bp_d = json.load(f) - -try: - os.mkdir(opt.o) -except: - pass - -bin_format = opt.x -min_comp = opt.min_comp * 100 -max_cont = opt.max_cont * 100 - -for cluster, comp_cont in cs_d.items(): - comp, cont = comp_cont - bin_ = cluster + bin_format - if comp >= min_comp and cont <= max_cont: - # print(bp_d[bin_]) - os.symlink(bp_d[bin_], os.path.join(opt.o, bin_)) diff --git a/workflow_taxVamb/src/write_clusters_from_final_bins.sh b/workflow_taxVamb/src/write_clusters_from_final_bins.sh deleted file mode 100644 index 93ceab42..00000000 --- a/workflow_taxVamb/src/write_clusters_from_final_bins.sh +++ /dev/null @@ -1,29 +0,0 @@ -#!/usr/bin/bash - - -while getopts "d:o:" opt; do - case $opt in - d) drep_dir=$OPTARG ;; - o) clusters_file=$OPTARG ;; - *) echo 'error' >&2 - exit 1 - esac -done -echo 'creating z y v clusters from the final set of bins' -cd $drep_dir -for bin in $(ls . 2> /dev/null) - -do -if [[ $bin == **".fna" ]] -then - -cluster_name=$(echo $bin | sed 's=.fna==g' | sed 's=.fa==g') -echo $cluster_name -#for contig in $(grep '>' $bin | sed 's=>==g') -#do -#echo -e "$cluster_name""\t""$contig" >> $clusters_file -#done - - -fi -done diff --git a/workflow_taxVamb/taxVamb.snake.conda.smk b/workflow_taxVamb/taxVamb.snake.conda.smk deleted file mode 100644 index 77b4532b..00000000 --- a/workflow_taxVamb/taxVamb.snake.conda.smk +++ /dev/null @@ -1,644 +0,0 @@ -import re -import os -import sys -from vamb.vambtools import concatenate_fasta, hash_refnames -import numpy as np -SNAKEDIR = os.path.dirname(workflow.snakefile) - -sys.path.append(os.path.join(SNAKEDIR, 'src')) - - -def get_config(name, default, regex): - res = config.get(name, default).strip() - m = re.match(regex, res) - if m is None: - raise ValueError( - f"Config option \"{name}\" is \"{res}\", but must conform to regex \"{regex}\"") - return res - - -##### set configurations - -CONTIGS = get_config("contigs", "contigs.txt", r".*") # each line is a contigs path from a given sample -SAMPLE_DATA = get_config("sample_data", "samples2data.txt", r".*") # each line is composed by 3 elements: sample id, forward_reads_path , backward_reads_path -INDEX_SIZE = get_config("index_size", "12G", r"[1-9]\d*[GM]$") -MIN_CONTIG_SIZE = int(get_config("min_contig_size", "2000", r"[1-9]\d*$")) -MIN_BIN_SIZE = int(get_config("min_bin_size", "200000", r"[1-9]\d*$")) -MIN_IDENTITY = float(get_config("min_identity", "0.95", r".*")) -MM_MEM = get_config("minimap_mem", "35gb", r"[1-9]\d*GB$") -MM_PPN = get_config("minimap_ppn", "10", r"[1-9]\d*$") -MMSEQ_MEM = get_config("mmseq_mem", "260gb", r"[1-9]\d*GB$") -MMSEQ_PPN = get_config("mmseq_ppn", "30", r"[1-9]\d*$") -MMSEQ_DB = get_config("mmseq_db", "", r".*") -VAEVAE_PARAMS = get_config("vaevae_params"," -o C --minfasta 200000 ", r".*") -VAEVAE_PRELOAD = get_config("vaevae_preload", "", r".*") -VAEVAE_MEM = get_config("vaevae_mem", "20gb", r"[1-9]\d*GB$") -VAEVAE_PPN = get_config("vaevae_ppn", "10", r"[1-9]\d*(:gpus=[1-9]\d*)?$") -RECLUST_MEM = get_config("reclust_mem", "20gb", r"[1-9]\d*GB$") -RECLUST_PPN = get_config("reclust_ppn", "10", r"[1-9]\d*(:gpus=[1-9]\d*)?$") -CHECKM_MEM = get_config("checkm2_mem", "10gb", r"[1-9]\d*GB$") -CHECKM_PPN = get_config("checkm2_ppn", "10", r"[1-9]\d*$") -MIN_COMP = get_config("min_comp", "0.9", r".*") -MAX_CONT = get_config("max_cont", 
"0.05", r".*") -OUTDIR = get_config("outdir", "taxVamb_outdir", r".*") - -try: - os.makedirs(os.path.join(OUTDIR,"log"), exist_ok=True) -except FileExistsError: - pass - - -# parse if GPUs is needed # -vaevae_threads, sep, vaevae_gpus = VAEVAE_PPN.partition(":gpus=") -VAEVAE_PPN = vaevae_threads -CUDA = len(vaevae_gpus) > 0 - -## read in sample information ## - -# read in sample2path -IDS = [] -sample2path = {} -fh_in = open(SAMPLE_DATA, 'r') -for line in fh_in: - line = line.rstrip() - fields = line.split('\t') - IDS.append(fields[0]) - sample2path[fields[0]] = [fields[1], fields[2]] - -# read in list of per-sample assemblies -contigs_list = [] -fh_in = open(CONTIGS, 'r') -for line in fh_in: - line = line.rstrip() - contigs_list.append(line) - -# target rule -rule all: - input: - os.path.join(OUTDIR,"log/workflow_finished_taxvamb.log") - -# Filter contigs for 2000bp and rename them to conform with the multi-split workflow -rule cat_contigs: - input: - contigs_list - output: - os.path.join(OUTDIR,"contigs.flt.fna.gz") - params: - path=os.path.join(os.path.dirname(SNAKEDIR), "src", "concatenate.py"), - walltime="864000", - nodes="1", - ppn="1", - resources: - mem="5GB" - threads: - 1 - log: - o = os.path.join(OUTDIR,"log/contigs/catcontigs.o"), - e = os.path.join(OUTDIR,"log/contigs/catcontigs.e") - - conda: - "taxVamb" - shell: "python {params.path} {output} {input} -m {MIN_CONTIG_SIZE}" - - - - - -# Run mmseq2 over the contigs -rule mmseq2: - input: - os.path.join(OUTDIR,"contigs.flt.fna.gz") - output: - taxonomytsv = os.path.join(OUTDIR,"tmp/mmseq/taxonomy.tsv"), - out_log_file = os.path.join(OUTDIR,"tmp/mmseq_finished.log") - - params: - walltime="86400", - nodes="1", - ppn=MMSEQ_PPN - resources: - mem=MMSEQ_MEM - threads: - int(MMSEQ_PPN) - log: - o=os.path.join(OUTDIR,'log','mmseq.out'), - e=os.path.join(OUTDIR,'log','mmseq.err') - - conda: - "taxVamb" - shell: - """ - mmseqs createdb {input} {OUTDIR}/tmp/mmseq/qdb - mmseqs taxonomy {OUTDIR}/tmp/mmseq/qdb {MMSEQ_DB} {OUTDIR}/tmp/mmseq/taxonomy {OUTDIR}/tmp/mmseq/tmp --threads {threads} --tax-lineage 1 - mmseqs createtsv {OUTDIR}/tmp/mmseq/qdb {OUTDIR}/tmp/mmseq/taxonomy {output.taxonomytsv} - touch {output.out_log_file} - """ - - - -# Index resulting contig-file with minimap2 -rule index: - input: - contigs = os.path.join(OUTDIR,"contigs.flt.fna.gz") - output: - mmi = os.path.join(OUTDIR,"contigs.flt.mmi") - params: - walltime="864000", - nodes="1", - ppn="1" - resources: - mem="90GB" - threads: - 1 - log: - out_ind = os.path.join(OUTDIR,"log/contigs/index.log"), - o = os.path.join(OUTDIR,"log/contigs/index.o"), - e = os.path.join(OUTDIR,"log/contigs/index.e") - - - conda: - "envs/minimap2.yaml" - shell: - "minimap2 -I {INDEX_SIZE} -d {output} {input} 2> {log.out_ind}" - -# This rule creates a SAM header from a FASTA file. -# We need it because minimap2 for truly unknowable reasons will write -# SAM headers INTERSPERSED in the output SAM file, making it unparseable. -# To work around this mind-boggling bug, we remove all header lines from -# minimap2's SAM output by grepping, then re-add the header created in this -# rule. 
-rule dict: - input: - contigs = os.path.join(OUTDIR,"contigs.flt.fna.gz") - output: - dict = os.path.join(OUTDIR,"contigs.flt.dict") - params: - walltime="864000", - nodes="1", - ppn="1" - resources: - mem="10GB" - threads: - 1 - log: - out_dict= os.path.join(OUTDIR,"log/contigs/dict.log"), - o = os.path.join(OUTDIR,"log/contigs/dict.o"), - e = os.path.join(OUTDIR,"log/contigs/dict.e") - - conda: - "envs/samtools.yaml" - shell: - "samtools dict {input} | cut -f1-3 > {output} 2> {log.out_dict}" - -# Generate bam files -rule minimap: - input: - fq = lambda wildcards: sample2path[wildcards.sample], - mmi = os.path.join(OUTDIR,"contigs.flt.mmi"), - dict = os.path.join(OUTDIR,"contigs.flt.dict") - output: - bam = temp(os.path.join(OUTDIR,"mapped/{sample}.bam")) - params: - walltime="864000", - nodes="1", - ppn=MM_PPN - resources: - mem=MM_MEM - threads: - int(MM_PPN) - log: - out_minimap = os.path.join(OUTDIR,"log/map/{sample}.minimap.log"), - o = os.path.join(OUTDIR,"log/map/{sample}.minimap.o"), - e = os.path.join(OUTDIR,"log/map/{sample}.minimap.e") - - conda: - "envs/minimap2.yaml" - shell: - # See comment over rule "dict" to understand what happens here - "minimap2 -t {threads} -ax sr {input.mmi} {input.fq} -N 5" - " | grep -v '^@'" - " | cat {input.dict} - " - " | samtools view -F 3584 -b - " # supplementary, duplicate read, fail QC check - " > {output.bam} 2> {log.out_minimap}" - -# Sort bam files -rule sort: - input: - os.path.join(OUTDIR,"mapped/{sample}.bam") - output: - os.path.join(OUTDIR,"mapped/{sample}.sort.bam") - params: - walltime="864000", - nodes="1", - ppn="2", - prefix=os.path.join(OUTDIR,"mapped/tmp.{sample}") - resources: - mem="15GB" - threads: - 2 - log: - out_sort = os.path.join(OUTDIR,"log/map/{sample}.sort.log"), - o = os.path.join(OUTDIR,"log/map/{sample}.sort.o"), - e = os.path.join(OUTDIR,"log/map/{sample}.sort.e") - - conda: - "envs/samtools.yaml" - shell: - "samtools sort {input} -T {params.prefix} --threads 1 -m 3G -o {output} 2> {log.out_sort}" - -# Extract header lengths from a BAM file in order to determine which headers -# to filter from the abundance (i.e. 
get the mask) -rule get_headers: - input: - os.path.join(OUTDIR, "mapped", f"{IDS[1]}.sort.bam") - output: - os.path.join(OUTDIR,"abundances/headers.txt") - params: - walltime = "86400", - nodes = "1", - ppn = "1" - resources: - mem = "4GB" - threads: - 1 - conda: - "envs/samtools.yaml" - log: - head = os.path.join(OUTDIR,"log/abundance/headers.log"), - o = os.path.join(OUTDIR,"log/abundance/get_headers.o"), - e = os.path.join(OUTDIR,"log/abundance/get_headers.e") - - shell: - "samtools view -H {input}" - " | grep '^@SQ'" - " | cut -f 2,3" - " > {output} 2> {log.head} " - -# Using the headers above, compute the mask and the refhash -rule abundance_mask: - input: - os.path.join(OUTDIR,"abundances/headers.txt") - output: - os.path.join(OUTDIR,"abundances/mask_refhash.npz") - - log: - mask = os.path.join(OUTDIR,"log/abundance/mask.log"), - o = os.path.join(OUTDIR,"log/abundance/mask.o"), - e = os.path.join(OUTDIR,"log/abundance/mask.e") - params: - path = os.path.join(SNAKEDIR, "src", "abundances_mask.py"), - walltime = "86400", - nodes = "1", - ppn = "4" - resources: - mem = "1GB" - threads: - 4 - conda: - "taxVamb" - - shell: - """ - python {params.path} --h {input} --msk {output} --minsize {MIN_CONTIG_SIZE} 2> {log.mask} - """ - - -# For every sample, compute the abundances given the mask and refhash above -rule bam_abundance: - input: - bampath=os.path.join(OUTDIR,"mapped/{sample}.sort.bam"), - mask_refhash=os.path.join(OUTDIR,"abundances/mask_refhash.npz") - output: - os.path.join(OUTDIR,"abundances/{sample}.npz") - params: - path = os.path.join(SNAKEDIR, "src", "write_abundances.py"), - walltime = "86400", - nodes = "1", - ppn = "4" - resources: - mem = "1GB" - threads: - 4 - conda: - "taxVamb" - log: - bam = os.path.join(OUTDIR,"log/abundance/bam_abundance_{sample}.log"), - o = os.path.join(OUTDIR,"log/abundance/bam_abundance.{sample}.o"), - e = os.path.join(OUTDIR,"log/abundance/bam_abundance.{sample}.e") - - shell: - """ - python {params.path} --msk {input.mask_refhash} --b {input.bampath} --min_id {MIN_IDENTITY} --out {output} 2> {log.bam} - """ - -# Merge the abundances to a single Abundance object and save it -rule create_abundances: - input: - npzpaths=expand(os.path.join(OUTDIR,"abundances","{sample}.npz"), sample=IDS), - mask_refhash=os.path.join(OUTDIR,"abundances","mask_refhash.npz") - output: - os.path.join(OUTDIR,"abundance.npz") - params: - path = os.path.join(SNAKEDIR, "src", "create_abundances.py"), - abundance_dir = os.path.join(OUTDIR, "abundances"), - walltime = "86400", - nodes = "1", - ppn = "4" - resources: - mem = "1GB" - threads: - 4 - conda: - "taxVamb" - log: - create_abs = os.path.join(OUTDIR,"log/abundance/create_abundances.log"), - o = os.path.join(OUTDIR,"log/abundance/create_abundances.o"), - e = os.path.join(OUTDIR,"log/abundance/create_abundances.e") - - shell: - "python {params.path} --msk {input.mask_refhash} --ab {input.npzpaths} --min_id {MIN_IDENTITY} --out {output} 2> {log.create_abs} && " - "rm -r {params.abundance_dir}" - -# run vaevae -rule run_vaevae: - input: - contigs = os.path.join(OUTDIR,"contigs.flt.fna.gz"), - abundance = os.path.join(OUTDIR,"abundance.npz"), - taxonomytsv = os.path.join(OUTDIR,"tmp/mmseq/taxonomy.tsv"), - mmsq_log = os.path.join(OUTDIR,"tmp/mmseq_finished.log") - output: - clusters = os.path.join(OUTDIR,"vaevae/vaevae_clusters.tsv"), - latents = os.path.join(OUTDIR,"vaevae/vaevae_latent.npy"), - vaevae_log = os.path.join(OUTDIR,"tmp/vaevae_finished.log") - params: - walltime="86400", - nodes="1", - ppn=VAEVAE_PPN, - 
cuda="--cuda" if CUDA else "" - resources: - mem=VAEVAE_MEM - threads: - int(vaevae_threads) - conda: - "taxVamb" - log: - o = os.path.join(OUTDIR,'log','run_vaevae.out'), - e = os.path.join(OUTDIR,'log','run_vaevae.err'), - - shell: - """ - #rm -rf {OUTDIR}/abundances - #rm -rf {OUTDIR}/contigs.flt.dict - #rm -f {OUTDIR}/contigs.flt.mmi - rm -rf {OUTDIR}/vaevae - {VAEVAE_PRELOAD} - mkdir -p {OUTDIR}/Final_bins - vamb \ - --outdir {OUTDIR}/vaevae \ - --fasta {input.contigs} \ - --rpkm {input.abundance} \ - --taxonomy {input.taxonomytsv} \ - --minfasta {MIN_BIN_SIZE} \ - -p {threads} \ - -m {MIN_CONTIG_SIZE} \ - {params.cuda} \ - {VAEVAE_PARAMS} - - touch {output.vaevae_log} - - """ - -rule reclustering: - input: - clusters = os.path.join(OUTDIR,"vaevae/vaevae_clusters.tsv"), - latents = os.path.join(OUTDIR,"vaevae/vaevae_latent.npy"), - contigs = os.path.join(OUTDIR,"contigs.flt.fna.gz"), - abundance = os.path.join(OUTDIR,"abundance.npz"), - vaevae_log = os.path.join(OUTDIR,"tmp/vaevae_finished.log") - - - output: - #reclustered_bins = directory(os.path.join(OUTDIR,"vaevae/bins_reclustered")), - #hmm_markers = os.path.join(OUTDIR,"tmp/markers.hmmout"), - recluster_log = os.path.join(OUTDIR,"tmp/reclustering_finished.log") - - resources: - mem = RECLUST_MEM - - threads: - int(RECLUST_PPN) - - conda: - "taxVamb" - log: - o = os.path.join(OUTDIR,'log','run_reclustering.out'), - e = os.path.join(OUTDIR,'log','run_reclustering.err') - shell: - """ - rm -rf {OUTDIR}/vaevae/bins_reclustered - vamb \ - --model reclustering \ - --latent_path {input.latents} \ - --clusters_path {input.clusters} \ - --fasta {input.contigs} \ - --rpkm {input.abundance} \ - --outdir {OUTDIR}/vaevae/bins_reclustered \ - --minfasta {MIN_BIN_SIZE} - touch {output.recluster_log} - """ - # --hmmout_path {output.hmm_markers} \ -# Evaluate in which samples bins were reconstructed -checkpoint samples_with_bins: - input: - os.path.join(OUTDIR,"tmp/reclustering_finished.log") - output: - os.path.join(OUTDIR,"tmp/samples_with_bins.txt") - params: - walltime="300", - nodes="1", - ppn="1" - resources: - mem="1GB" - log: - o=os.path.join(OUTDIR,'log','samples_with_bins.out'), - e=os.path.join(OUTDIR,'log','samples_with_bins.err') - - threads: - 1 - shell: - "find {OUTDIR}/vaevae/bins_reclustered/bins/*/ -type d ! 
-empty | sed 's=.*bins/==g' |sed 's=/==g' > {output}" - - -def samples_with_bins_f(wildcards): - # decision based on content of output file - with checkpoints.samples_with_bins.get().output[0].open() as f: - samples_with_bins = [sample.strip() for sample in f.readlines()] - samples_with_bins_paths=expand(os.path.join(OUTDIR,"tmp/checkm2_all_{sample}_bins_finished.log"),sample=samples_with_bins) - return samples_with_bins_paths - -# Run CheckM2 for each sample with bins -rule run_checkm2_per_sample_all_bins: - output: - out_log_file=os.path.join(OUTDIR,"tmp/checkm2_all_{sample}_bins_finished.log") - params: - walltime="86400", - nodes="1", - ppn=CHECKM_PPN - resources: - mem=CHECKM_MEM - threads: - int(CHECKM_PPN) - log: - o=os.path.join(OUTDIR,'log','checkm2_{sample}.out'), - e=os.path.join(OUTDIR,'log','checkm2_{sample}.err') - - conda: - "checkm2" - shell: - "checkm2 predict --threads {threads} --input {OUTDIR}/vaevae/bins_reclustered/bins/{wildcards.sample}/*.fna --output-directory {OUTDIR}/tmp/checkm2_all/{wildcards.sample} > {output.out_log_file}" - -# this rule will be executed when all CheckM2 runs per sample finish, so it can move to the next step -rule cat_checkm2_all: - input: - samples_with_bins_f - output: - os.path.join(OUTDIR,"tmp/checkm2_finished.txt") - params: - walltime="86400", - nodes="1", - ppn="1" - resources: - mem="1GB" - threads: - 1 - log: - o=os.path.join(OUTDIR,'log','cat_checkm2.out'), - e=os.path.join(OUTDIR,'log','cat_checkm2.err') - - shell: - "touch {output}" - -# Generate a 2 python dictionaries stored in json files: -# - {bin : [completeness, contamination]} -# - {bin : bin_path} -rule create_cluster_scores_bin_path_dictionaries: - input: - checkm2_finished_log_file = os.path.join(OUTDIR,"tmp/checkm2_finished.txt") - output: - cluster_score_dict_path = os.path.join(OUTDIR,"tmp/cs_d.json"), - bin_path_dict_path = os.path.join(OUTDIR,"tmp/bp_d.json"), - params: - path = os.path.join(SNAKEDIR, "src", "create_cluster_scores_bin_path_dict.py"), - walltime = "86400", - nodes = "1", - ppn = "4" - resources: - mem = "1GB" - threads: - 4 - conda: - "taxVamb" - log: - o=os.path.join(OUTDIR,'log','cs_bp_dicts.out'), - e=os.path.join(OUTDIR,'log','cs_bp_dicts.err') - - shell: - "python {params.path} --s {OUTDIR}/tmp/checkm2_all --b {OUTDIR}/vaevae/bins_reclustered/bins --cs_d {output.cluster_score_dict_path} --bp_d {output.bin_path_dict_path} " - - - - -rule sym_NC_bins: - input: - cluster_score_dict_path = os.path.join(OUTDIR,"tmp/cs_d.json"), - bin_path_dict_path = os.path.join(OUTDIR,"tmp/bp_d.json") - output: - nc_bins_dir = directory(os.path.join(OUTDIR,"Final_bins")) - params: - path = os.path.join(SNAKEDIR, "src", "symlink_nc_bins.py"), - walltime = "86400", - nodes = "1", - ppn = "4" - resources: - mem = "1GB" - threads: - 4 - conda: - "taxVamb" - log: - ncs_syml=os.path.join(OUTDIR,'tmp','sym_final_bins_finished.log'), - o=os.path.join(OUTDIR,'log','sym_NC_bins.out'), - e=os.path.join(OUTDIR,'log','sym_NC_bins.err') - - shell: - """ - python {params.path} --cs_d {input.cluster_score_dict_path} --bp_d {input.bin_path_dict_path} --min_comp {MIN_COMP} --max_cont {MAX_CONT} -o {output.nc_bins_dir} - touch {log.ncs_syml} - """ - - -# # Write final clusters from the Final_bins folder # NC_bins contains symlinks and bins from all samples in the same dir -# rule write_clusters_from_nc_folders: -# input: -# ncs_syml = os.path.join(OUTDIR,'tmp','sym_final_bins_finished.log') - -# output: -# os.path.join(OUTDIR,"Final_clusters.tsv") -# log: -# log_fin = 
os.path.join(OUTDIR,"tmp/final_taxvamb_clusters_written.log"), -# o=os.path.join(OUTDIR,'log','create_final_clusters.out'), -# e=os.path.join(OUTDIR,'log','create_final_clusters.err') - -# params: -# path = os.path.join(SNAKEDIR, "src", "write_clusters_from_final_bins.sh"), -# walltime = "86400", -# nodes = "1", -# ppn = "1" -# resources: -# mem = "1GB" -# threads: -# 1 -# conda: -# "taxVamb" - -# shell: -# "sh {params.path} -d {OUTDIR}/Final_bins -o {output} ;" -# "touch {log.log_fin} " - - -# Rename and move some files and folders -rule workflow_finished: - input: - #log_fin = os.path.join(OUTDIR,'tmp','final_taxvamb_clusters_written.log') - ncs_syml=os.path.join(OUTDIR,'tmp','sym_final_bins_finished.log') - output: - os.path.join(OUTDIR,"log/workflow_finished_taxvamb.log") - params: - walltime = "86400", - nodes = "1", - ppn = "1" - resources: - mem = "1GB" - threads: - 1 - log: - o=os.path.join(OUTDIR,'log','workflow_finished_taxvamb.out'), - e=os.path.join(OUTDIR,'log','workflow_finished_taxvamb.err') - shell: - """ - #rm -r {OUTDIR}/tmp/checkm2_all/*/protein_files - - #mkdir {OUTDIR}/tmp/snakemake_tmp/ - #mv {OUTDIR}/tmp/*log {OUTDIR}/tmp/snakemake_tmp/ - #mv {OUTDIR}/tmp/*json {OUTDIR}/tmp/snakemake_tmp/ - #mv {OUTDIR}/tmp/*tsv {OUTDIR}/tmp/snakemake_tmp/ - #mv {OUTDIR}/tmp/*txt {OUTDIR}/tmp/snakemake_tmp/ - #mv {OUTDIR}/tmp/checkm2_all {OUTDIR}/tmp/snakemake_tmp/ - - mv {OUTDIR}/abundance.npz {OUTDIR}/tmp/ - mv {OUTDIR}/mapped {OUTDIR}/tmp/ - mv {OUTDIR}/contigs.flt.fna.gz {OUTDIR}/tmp/ - touch {output} - """ - - diff --git a/workflow_vaevae/envs/vaevae.yaml b/workflow_vaevae/envs/vaevae.yaml deleted file mode 100644 index 6896d0ea..00000000 --- a/workflow_vaevae/envs/vaevae.yaml +++ /dev/null @@ -1,16 +0,0 @@ -name: vamb -channels: - - conda-forge - - bioconda - - defaults -dependencies: -- python=3.9.16 -- snakemake=7.22.0 -- pip=23.0.1 -- biopython=1.81 -- networkx=3.0 -- scikit-learn=1.2.2 -- pandas=2.0.0 - -- pip: - - ordered-set==4.1.0 diff --git a/workflow_vaevae/src/computerome_vaevae.sh b/workflow_vaevae/src/computerome_vaevae.sh deleted file mode 100755 index 1833a96d..00000000 --- a/workflow_vaevae/src/computerome_vaevae.sh +++ /dev/null @@ -1,12 +0,0 @@ -#!/usr/bin/bash -module load tools computerome_utils/2.0 -module load anaconda3/2023.03 -module unload gcc -module load gcc/11.1.0 -module load minimap2/2.17r941 samtools/1.10 - -source ~/.bashrc -conda init bash -conda activate /home/projects/cpr_10006/people/svekut/.conda/vamb - -./$1 $2 $3 $4 diff --git a/workflow_vaevae/src/eval.jl b/workflow_vaevae/src/eval.jl deleted file mode 100644 index 8a978a81..00000000 --- a/workflow_vaevae/src/eval.jl +++ /dev/null @@ -1,114 +0,0 @@ -# module load tools -# module load julia/1.9.0 -# export JULIA_DEPOT_PATH=/home/projects/cpr_10006/people/svekut/julia_packages -# julia --startup-file=no --project=@vambbench - -using DataFrames, CSV -using VambBenchmarks - - -function get_nbins(binning::Binning, recall::Real, precision::Real; tax_level::Integer=0, assembly::Bool=false)::Integer - ri = searchsortedfirst(binning.recalls, recall) - ri > length(binning.recalls) && error("Binning did not benchmark at that high recall") - pi = searchsortedfirst(binning.precisions, precision) - pi > length(binning.precisions) && error("Binning did not benchmark at that high precision") - matrices = assembly ? 
binning.recovered_asms : binning.recovered_genomes - if tax_level + 1 ∉ eachindex(matrices) - error(lazy"Requested bins at taxonomic level $tax_level but have only level 0:$(lastindex(matrices)-1)") - end - m = matrices[tax_level + 1] - m[pi, ri] -end - -result_df = DataFrame(folder=String[], genomes=Int[], assemblies=Int[], level=String[]) - -vars = ["Oral", "Skin", "Urogenital", "Airways", "Gastrointestinal"] -rec = 0.9 -prec = 0.95 -exp = "0.6" -for var in vars - ref_file = "/home/projects/cpr_10006/people/paupie/vaevae/spades_ef_refs_/ref_spades_$(var).json" - ref = open(i -> Reference(i), ref_file) - bins = gold_standard(ref) - output_genomes = get_nbins(bins, rec, prec, tax_level=0, assembly=false) - output_as = get_nbins(bins, rec, prec, tax_level=0, assembly=true) - - output_genomes_sp = get_nbins(bins, rec, prec, tax_level=1, assembly=false) - output_as_sp = get_nbins(bins, rec, prec, tax_level=1, assembly=true) - - output_genomes_g = get_nbins(bins, rec, prec, tax_level=2, assembly=false) - output_as_g = get_nbins(bins, rec, prec, tax_level=2, assembly=true) - - # Append the result to the DataFrame - push!(result_df, ("gold_standard_$(var)", output_genomes, output_as, "Strain")) - push!(result_df, ("gold_standard_$(var)", output_genomes_sp, output_as_sp, "Species")) - push!(result_df, ("gold_standard_$(var)", output_genomes_g, output_as_g, "Genus")) - for file_path in [ - "/home/projects/cpr_10006/people/svekut/cami2_$(var)_reassembled_$(exp)/vaevae_clusters.tsv", - "/home/projects/cpr_10006/people/svekut/cami2_$(var)_reclustering_reassembled_$(exp)/clusters_reclustered.tsv", - "/home/projects/cpr_10006/people/paupie/vaevae/semibin2_ptracker/semibin2_$(var)_multy_easy_bin_070723/samples/postreclustered_clusters.tsv", - "/home/projects/cpr_10006/people/paupie/vaevae/semibin2_ptracker/semibin2_$(var)_multy_easy_bin_070723/samples/prereclustered_clusters.tsv", - "/home/projects/cpr_10006/people/svekut/cami2_$(var)_vamb_1/vae_clusters.tsv", - "/home/projects/cpr_10006/people/svekut/cami2_$(var)_reclustering_vamb_1/clusters_reclustered.tsv", - "/home/projects/cpr_10006/people/svekut/cami2_$(var)_vamb_2/vae_clusters.tsv", - "/home/projects/cpr_10006/people/svekut/cami2_$(var)_reclustering_vamb_2/clusters_reclustered.tsv", - ] - # Check if the file exists before calling the function - if isfile(file_path) - bins = open(i -> Binning(i, ref), file_path) - - output_genomes = get_nbins(bins, rec, prec, tax_level=0, assembly=false) - output_as = get_nbins(bins, rec, prec, tax_level=0, assembly=true) - - output_genomes_sp = get_nbins(bins, rec, prec, tax_level=1, assembly=false) - output_as_sp = get_nbins(bins, rec, prec, tax_level=1, assembly=true) - - output_genomes_g = get_nbins(bins, rec, prec, tax_level=2, assembly=false) - output_as_g = get_nbins(bins, rec, prec, tax_level=2, assembly=true) - - # Append the result to the DataFrame - push!(result_df, (file_path, output_genomes, output_as, "Strain")) - push!(result_df, (file_path, output_genomes_sp, output_as_sp, "Species")) - push!(result_df, (file_path, output_genomes_g, output_as_g, "Genus")) - else - println("File $file_path not found") - end - end -end - - -CSV.write("/home/projects/cpr_10006/people/svekut/results_$(exp).csv", result_df) - -vars = ["Oral", "Skin", "Urogenital", "Airways", "Gastrointestinal"] - -for var in vars - result_df = DataFrame(genome_id=String[], source_id=String[], scgs=Int[], total=Int[]) - println("$var") - ref_file = "/home/projects/cpr_10006/people/paupie/vaevae/spades_ef_refs_/ref_spades_$(var).json" - 
ref = open(i -> Reference(i), ref_file) - marker_path = "/home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/cami_reassembled_$(var).hmmout" - df = CSV.File(marker_path, comment="#") |> DataFrame - new_column = [split(string, ".")[2] for string in df[!, 1]] - contigs = Set([split(string, "_")[1] for string in new_column]) - for genome in genomes(ref) - for source in genome.sources - seqs = Set([seq[1].name for seq in source.sequences]) - n = source.name - println("$n") - push!(result_df, (genome.name, source.name, length(intersect(seqs, contigs)), length(seqs))) - end - end - CSV.write("/home/projects/cpr_10006/people/svekut/scgs_$(var).csv", result_df) -end - - -for var in vars - result_df = DataFrame(genome_id=String[], source_id=String[], scgs=Int[], total=Int[]) - println("$var") - ref_file = "/home/projects/cpr_10006/people/paupie/vaevae/spades_ef_refs_/ref_spades_$(var).json" - ref = open(i -> Reference(i), ref_file) - VambBenchmarks.subset!(ref; genomes=g -> Flags.organism in flags(g)) - res = Array(Float64, n) - for i in 1:9 - c = top_clade(ref) - diff --git a/workflow_vaevae/src/longread_human.sh b/workflow_vaevae/src/longread_human.sh deleted file mode 100755 index 01d5ff63..00000000 --- a/workflow_vaevae/src/longread_human.sh +++ /dev/null @@ -1,34 +0,0 @@ -#!/usr/bin/bash -annotator=$1 -thres=$2 - # --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevaeout/results_taxonomy_predictor.csv \ - - -vamb bin taxvamb \ - --outdir /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevae_${annotator}_predictor_${thres} \ - --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \ - --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \ - --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/human_longread_taxonomy_${annotator}.tsv \ - -l 64 \ - -e 1000 \ - -t 1024 \ - -q \ - -pe 100 \ - -pt 1024 \ - -pq \ - -pthr ${thres} \ - -o C \ - --cuda \ - --minfasta 200000 - -# vamb \ -# --model reclustering \ -# --latent_path /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevae_${annotator}_predictor_${thres}/vaevae_latent.npy \ -# --clusters_path /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevae_${annotator}_predictor_${thres}/vaevae_clusters.tsv \ -# --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \ -# --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \ -# --outdir /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevae_${annotator}_predictor_${thres}_reclustering \ -# --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_human.hmmout \ -# --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevae_${annotator}_predictor_${thres}/results_taxonomy_predictor.csv \ -# --algorithm dbscan \ -# --minfasta 200000 diff --git a/workflow_vaevae/src/longread_human_no_predictor.sh b/workflow_vaevae/src/longread_human_no_predictor.sh deleted file mode 100755 index 7a3d04a1..00000000 --- a/workflow_vaevae/src/longread_human_no_predictor.sh +++ /dev/null @@ -1,36 +0,0 @@ -#!/usr/bin/bash -annotator=$1 -thres=$2 - - # --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevaeout/results_taxonomy_predictor.csv \ - # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/longread_taxonomy_2023.tsv \ - - # --cuda \ - - 
-vamb \
- --model vaevae \
- --outdir /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevae_${annotator} \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/human_longread_taxonomy_${annotator}.tsv \
- --no_predictor \
- -l 64 \
- -e 500 \
- -t 1024 \
- -q \
- -o C \
- --minfasta 200000
-
-# vamb \
-# --model reclustering \
-# --latent_path /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/raw_dbscan_full/vaevae_latent.npy \
-# --clusters_path /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/raw_dbscan_full/vaevae_clusters.tsv \
-# --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \
-# --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \
-# --outdir /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/raw_dbscan_reclustering \
-# --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_human.hmmout \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevaeout/results_taxonomy_predictor.csv \
-# --algorithm dbscan \
-# --minfasta 200000
-
diff --git a/workflow_vaevae/src/longread_human_predictor.sh b/workflow_vaevae/src/longread_human_predictor.sh
deleted file mode 100755
index a94d3d25..00000000
--- a/workflow_vaevae/src/longread_human_predictor.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-#!/usr/bin/bash
-# --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/almeida_taxonomy.tsv \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout/results_taxonomy_predictor.csv \
-run_id=$1
- # --cuda \
-
-
-vamb taxometer \
- --outdir /home/projects/cpr_10006/people/svekut/long_read_human_kfold_predictor_v207_${run_id}_gpu \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/human_longread_taxonomy_${run_id}.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- --cuda \
- -ploss flat_softmax
diff --git a/workflow_vaevae/src/longread_human_vamb.sh b/workflow_vaevae/src/longread_human_vamb.sh
deleted file mode 100755
index b328d4cd..00000000
--- a/workflow_vaevae/src/longread_human_vamb.sh
+++ /dev/null
@@ -1,24 +0,0 @@
-#!/usr/bin/bash
-
-# vamb \
-# --outdir /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout64_20102023 \
-# --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \
-# --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \
-# -l 64 \
-# -e 500 \
-# -q 25 75 150 \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout64_20102023/latent.npz \
- --clusters_path /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout64_20102023/vae_clusters.tsv \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout/abundance.npz \
- --outdir /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vambout64_20102023_reclustering \
- --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_human.hmmout \
- --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/human_longread/vaevae_full_predictor_0.5/results_taxonomy_predictor.csv \
- --algorithm dbscan \
- --minfasta 200000
diff --git a/workflow_vaevae/src/longread_sludge.sh b/workflow_vaevae/src/longread_sludge.sh
deleted file mode 100755
index 56d2d788..00000000
--- a/workflow_vaevae/src/longread_sludge.sh
+++ /dev/null
@@ -1,34 +0,0 @@
-#!/usr/bin/bash
-annotator=$1
-thres=$2
- # --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout_sp100/results_taxonomy_predictor.csv \
-
-
-vamb bin taxvamb \
- --outdir /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevae_${annotator}_predictor_${thres} \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/sludge_taxonomy_${annotator}.tsv \
- -l 64 \
- -e 1000 \
- -t 1024 \
- -q \
- -pe 100 \
- -pt 1024 \
- -pq \
- -pthr ${thres} \
- -o C \
- --cuda \
- --minfasta 200000
-
-# vamb \
-# --model reclustering \
-# --latent_path /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevae_${annotator}_predictor_${thres}/vaevae_latent.npy \
-# --clusters_path /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevae_${annotator}_predictor_${thres}/vaevae_clusters.tsv \
-# --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \
-# --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \
-# --outdir /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevae_${annotator}_predictor_${thres}_reclustering \
-# --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_sludge.hmmout \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevae_${annotator}_predictor_${thres}/results_taxonomy_predictor.csv \
-# --algorithm dbscan \
-# --minfasta 200000
diff --git a/workflow_vaevae/src/longread_sludge_no_predictor.sh b/workflow_vaevae/src/longread_sludge_no_predictor.sh
deleted file mode 100755
index 2922d239..00000000
--- a/workflow_vaevae/src/longread_sludge_no_predictor.sh
+++ /dev/null
@@ -1,31 +0,0 @@
-#!/usr/bin/bash
-annotator=$1
- # --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout_sp100/results_taxonomy_predictor.csv \
-
-
-vamb \
- --model vaevae \
- --outdir /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevae_${annotator} \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/sludge_taxonomy_${annotator}.tsv \
- --no_predictor \
- -l 64 \
- -e 500 \
- -t 1024 \
- -q \
- -o C \
- --cuda \
- --minfasta 200000
-
-# vamb \
-# --model reclustering \
-# --latent_path /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout_dadam/vaevae_latent.npy \
-# --clusters_path /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout_dadam/vaevae_clusters.tsv \
-# --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \
-# --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \
-# --outdir /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevaeout_dadam_reclustering \
-# --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_sludge.hmmout \
-# --taxonomy_predictions /home/projects/cpr_10006/people/svekut/vamb/results_taxonomy_predictor_sludge.csv \
-# --algorithm dbscan \
-# --minfasta 200000
diff --git a/workflow_vaevae/src/longread_sludge_predictor.sh b/workflow_vaevae/src/longread_sludge_predictor.sh
deleted file mode 100755
index e85b324e..00000000
--- a/workflow_vaevae/src/longread_sludge_predictor.sh
+++ /dev/null
@@ -1,15 +0,0 @@
-#!/usr/bin/bash
-# --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/almeida_taxonomy.tsv \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout/results_taxonomy_predictor.csv \
-run_id=$1
-
-vamb taxometer \
- --outdir /home/projects/cpr_10006/people/svekut/long_read_sludge_kfold_predictor_v207_${run_id}_gpu \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/sludge_taxonomy_${run_id}.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- --cuda \
- -ploss flat_softmax
diff --git a/workflow_vaevae/src/longread_sludge_vamb.sh b/workflow_vaevae/src/longread_sludge_vamb.sh
deleted file mode 100755
index c16d5732..00000000
--- a/workflow_vaevae/src/longread_sludge_vamb.sh
+++ /dev/null
@@ -1,25 +0,0 @@
-#!/usr/bin/bash
-
-
-# vamb \
-# --outdir /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout64_20102023 \
-# --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \
-# --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \
-# -l 64 \
-# -e 500 \
-# -q 25 75 150 \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout64_20102023/latent.npz \
- --clusters_path /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout64_20102023/vae_clusters.tsv \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/sludge/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout/abundance.npz \
- --outdir /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vambout64_20102023_reclustering \
- --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_sludge.hmmout \
- --taxonomy_predictions /home/projects/cpr_10006/projects/semi_vamb/data/sludge/vaevae_full_predictor_0.5/results_taxonomy_predictor.csv \
- --algorithm dbscan \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_CAMI2.sh b/workflow_vaevae/src/shortread_CAMI2.sh
deleted file mode 100755
index 146e700f..00000000
--- a/workflow_vaevae/src/shortread_CAMI2.sh
+++ /dev/null
@@ -1,36 +0,0 @@
-#!/usr/bin/bash
-dataset=$1
-annotator=$2
-thres=$3
-
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy_2023.tsv \
- # --taxonomy_predictions /home/projects/cpr_10006/people/svekut/cami2_urog_out_32_667/results_taxonomy_predictor.csv
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy.tsv \
-
- # --cuda \
-
-vamb bin vaevae \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_${annotator}_${thres}_interface \
- --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
- --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/${dataset}_taxonomy_${annotator}.tsv \
- -l 32 \
- -e 1 \
- -q \
- -t 1024 \
- -pe 1 \
- -pq \
- -pt 1024 \
- -pthr ${thres} \
- -o C \
- --minfasta 200000
-
-# vamb recluster \
-# --latent_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_${annotator}_${thres}_interface/vaevae_latent.npy \
-# --clusters_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_${annotator}_${thres}_interface/vaevae_clusters.tsv \
-# --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
-# --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
-# --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reclustering_${annotator}_${thres}_interface \
-# --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_cami_${dataset}.hmmout \
-# --algorithm kmeans \
-# --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_CAMI2_no_predictor.sh b/workflow_vaevae/src/shortread_CAMI2_no_predictor.sh
deleted file mode 100755
index ec79e089..00000000
--- a/workflow_vaevae/src/shortread_CAMI2_no_predictor.sh
+++ /dev/null
@@ -1,39 +0,0 @@
-#!/usr/bin/bash
-dataset=$1
-run_id=$2
-keyword=$3
-
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy_2023.tsv \
- # --taxonomy_predictions /home/projects/cpr_10006/people/svekut/cami2_urog_out_32_667/results_taxonomy_predictor.csv
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy.tsv \
-
- # --no_predictor \
- # --cuda \
-
-
-vamb \
- --model vaevae \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_no_predictor_${run_id}_${keyword} \
- --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
- --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/${dataset}_taxonomy_${run_id}.tsv \
- --no_predictor \
- -l 32 \
- -e 300 \
- -t 1024 \
- -pq \
- -q \
- -o C \
- --cuda \
- --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_no_predictor_${run_id}_${keyword}/vaevae_latent.npy \
- --clusters_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_no_predictor_${run_id}_${keyword}/vaevae_clusters.tsv \
- --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
- --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_no_predictor_reclustering_${run_id}_${keyword} \
- --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_cami_${dataset}.hmmout \
- --algorithm kmeans \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_CAMI2_no_predictor_dadam.sh b/workflow_vaevae/src/shortread_CAMI2_no_predictor_dadam.sh
deleted file mode 100755
index 07a9691a..00000000
--- a/workflow_vaevae/src/shortread_CAMI2_no_predictor_dadam.sh
+++ /dev/null
@@ -1,35 +0,0 @@
-#!/usr/bin/bash
-dataset=$1
-run_id=$2
-
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy_2023.tsv \
- # --taxonomy_predictions /home/projects/cpr_10006/people/svekut/cami2_urog_out_32_667/results_taxonomy_predictor.csv
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy.tsv \
-
-
- # --cuda \
-
-vamb \
- --model vaevae \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_out_32_no_predictor_${run_id}_dadam__fix \
- --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
- --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/${dataset}_taxonomy_${run_id}.tsv \
- --no_predictor \
- -l 32 \
- -e 200 \
- -t 1024 \
- -q \
- -o C \
- --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_out_32_no_predictor_${run_id}_dadam__fix/vaevae_latent.npy \
- --clusters_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_out_32_no_predictor_${run_id}_dadam__fix/vaevae_clusters.tsv \
- --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
- --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_out_32_no_predictor_reclustering_${run_id}_dadam__fix1 \
- --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_cami_${dataset}.hmmout \
- --algorithm kmeans \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_CAMI2_predictor.sh b/workflow_vaevae/src/shortread_CAMI2_predictor.sh
deleted file mode 100755
index 1629b63a..00000000
--- a/workflow_vaevae/src/shortread_CAMI2_predictor.sh
+++ /dev/null
@@ -1,18 +0,0 @@
-#!/usr/bin/bash
-dataset=$1
-run_id=$2
-keyword=$3
-
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy.tsv \
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy_${run_id}.tsv \
-
-vamb taxometer \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_predictor_${keyword}_${run_id} \
- --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
- --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/${dataset}_taxonomy_${run_id}.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- --cuda \
- -ploss ${keyword}
diff --git a/workflow_vaevae/src/shortread_CAMI2_reassembled.sh b/workflow_vaevae/src/shortread_CAMI2_reassembled.sh
deleted file mode 100755
index c159ef1c..00000000
--- a/workflow_vaevae/src/shortread_CAMI2_reassembled.sh
+++ /dev/null
@@ -1,31 +0,0 @@
-#!/usr/bin/bash
-dataset=$1
-run_id=$2
- # --taxonomy_predictions /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reassembled_1/results_taxonomy_predictor.csv \
-
-vamb \
- --model vaevae \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reassembled_${run_id} \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/contigs.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/paupie/vaevae/mmseq2_annotations/ptracker/${dataset}_taxonomy.tsv \
- -l 32 \
- -e 200 \
- -q 25 75 150 \
- -pe 100 \
- -pq 25 75 \
- -pthr ${run_id} \
- -o C \
- --cuda \
- --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reassembled_${run_id}/vaevae_latent.npy \
- --clusters_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reassembled_${run_id}/vaevae_clusters.tsv \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/contigs.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/vambout/abundance.npz \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reclustering_reassembled_${run_id} \
- --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/cami_reassembled_${dataset}.hmmout \
- --algorithm kmeans \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_CAMI2_reassembled_predictor.sh b/workflow_vaevae/src/shortread_CAMI2_reassembled_predictor.sh
deleted file mode 100755
index f5719db8..00000000
--- a/workflow_vaevae/src/shortread_CAMI2_reassembled_predictor.sh
+++ /dev/null
@@ -1,14 +0,0 @@
-#!/usr/bin/bash
-dataset=$1
-run_id=$2
-
-vamb \
- --model taxonomy_predictor \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_predictor_${run_id} \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/contigs.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/paupie/vaevae/mmseq2_annotations/ptracker/${dataset}_taxonomy.tsv \
- -pe 100 \
- -pq 25 75 \
- -pthr 0.7 \
- --cuda
diff --git a/workflow_vaevae/src/shortread_CAMI2_reassembled_vamb.sh b/workflow_vaevae/src/shortread_CAMI2_reassembled_vamb.sh
deleted file mode 100755
index de84a83c..00000000
--- a/workflow_vaevae/src/shortread_CAMI2_reassembled_vamb.sh
+++ /dev/null
@@ -1,24 +0,0 @@
-#!/usr/bin/bash
-dataset=$1
-run_id=$2
- # --taxonomy_predictions /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reassembled_1/results_taxonomy_predictor.csv \
-
-# vamb \
-# --model vae \
-# --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_vamb_${run_id} \
-# --fasta /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/contigs.fna \
-# --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/vambout/abundance.npz \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_vamb_${run_id}/latent.npz \
- --clusters_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_vamb_${run_id}/vae_clusters.tsv \
- --fasta /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/contigs.fna \
- --rpkm /home/projects/cpr_10006/projects/semi_vamb/data/cami_errorfree/${dataset}/vambout/abundance.npz \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reclustering_vamb_${run_id} \
- --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/cami_reassembled_${dataset}.hmmout \
- --algorithm kmeans \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_CAMI2_vamb.sh b/workflow_vaevae/src/shortread_CAMI2_vamb.sh
deleted file mode 100755
index 3b82d103..00000000
--- a/workflow_vaevae/src/shortread_CAMI2_vamb.sh
+++ /dev/null
@@ -1,28 +0,0 @@
-#!/usr/bin/bash
-dataset=$1
-
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy_2023.tsv \
- # --taxonomy_predictions /home/projects/cpr_10006/people/svekut/cami2_urog_out_32_667/results_taxonomy_predictor.csv
- # --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/${dataset}_taxonomy.tsv \
-
-# vamb \
-# --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_vamb \
-# --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
-# --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
-# -l 32 \
-# -e 300 \
-# -q 25 75 150 \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_vamb/latent.npz \
- --clusters_path /home/projects/cpr_10006/people/svekut/cami2_${dataset}_vamb/vae_clusters.tsv \
- --fasta /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/contigs_2kbp.fna.gz \
- --rpkm /home/projects/cpr_10006/projects/vamb/data/datasets/cami2_${dataset}/abundance.npz \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_${dataset}_reclustering_vamb \
- --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_cami_${dataset}.hmmout \
- --algorithm kmeans \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_MATRIX.sh b/workflow_vaevae/src/shortread_MATRIX.sh
deleted file mode 100755
index 71526cd2..00000000
--- a/workflow_vaevae/src/shortread_MATRIX.sh
+++ /dev/null
@@ -1,39 +0,0 @@
-#!/usr/bin/bash
-thres=$1
-
-vamb bin default \
- --outdir /home/projects/cpr_10006/projects/vamb/MATRIX_vamb_no_split \
- --fasta /home/projects/ku_00200/data/matrix/mgx22_assembly/avamb_outdir/contigs.flt.fna.gz \
- --rpkm /home/projects/ku_00200/data/matrix/mgx22_assembly/avamb_outdir/abundance.npz \
- -l 32 \
- -e 300 \
- -t 256 \
- -q 25 75 100 150 \
- --cuda \
- --minfasta 200000
-
-# vamb bin taxvamb \
-# --outdir /home/projects/cpr_10006/projects/vamb/MATRIX_taxvamb_${thres} \
-# --fasta /home/projects/ku_00200/data/matrix/mgx22_assembly/avamb_outdir/contigs.flt.fna.gz \
-# --rpkm /home/projects/ku_00200/data/matrix/mgx22_assembly/avamb_outdir/abundance.npz \
-# --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/MATRIX_taxonomy_v207.tsv \
-# -l 32 \
-# -e 300 \
-# -t 1024 \
-# -q \
-# -pe 100 \
-# -pt 1024 \
-# -pq \
-# -pthr ${thres} \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-# vamb recluster \
-# --latent_path /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_predictor_${thres}_mem/vaevae_latent.npy \
-# --clusters_path /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_predictor_${thres}_mem/vaevae_clusters.tsv \
-# --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fa.gz \
-# --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \
-# --outdir /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_predictor_${thres}_mem_reclustering \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout/results_taxonomy_predictor.csv \
-# --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_almeida.sh b/workflow_vaevae/src/shortread_almeida.sh
deleted file mode 100755
index b159868a..00000000
--- a/workflow_vaevae/src/shortread_almeida.sh
+++ /dev/null
@@ -1,28 +0,0 @@
-#!/usr/bin/bash
-thres=$1
-
-vamb bin taxvamb \
- --outdir /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_predictor_${thres}_mem \
- --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fna \
- --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/almeida_taxonomy.tsv \
- -l 32 \
- -e 300 \
- -t 1024 \
- -q \
- -pe 100 \
- -pt 1024 \
- -pq \
- -pthr ${thres} \
- -o C \
- --cuda \
- --minfasta 200000
-
-# vamb recluster \
-# --latent_path /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_predictor_${thres}_mem/vaevae_latent.npy \
-# --clusters_path /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_predictor_${thres}_mem/vaevae_clusters.tsv \
-# --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fa.gz \
-# --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \
-# --outdir /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_predictor_${thres}_mem_reclustering \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout/results_taxonomy_predictor.csv \
-# --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_almeida10.sh b/workflow_vaevae/src/shortread_almeida10.sh
deleted file mode 100755
index 46525cbc..00000000
--- a/workflow_vaevae/src/shortread_almeida10.sh
+++ /dev/null
@@ -1,31 +0,0 @@
-#!/usr/bin/bash
-annotator=$1
-thres=$2
-
-# vamb \
-# --model vaevae \
-# --outdir /home/projects/cpr_10006/people/svekut/almeida10_${annotator}_predictor_${thres}_fix \
-# --fasta /home/projects/cpr_10006/people/paupie/vaevae/almeida_10_samples/03_abundances/abundances/contigs.flt.fna.gz \
-# --rpkm /home/projects/cpr_10006/people/paupie/vaevae/abundances_compositions/almeida10/abundance.npz \
-# --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/almeida_10_samples_taxonomy_${annotator}.tsv \
-# -l 32 \
-# -e 300 \
-# -t 1024 \
-# -q \
-# -pe 100 \
-# -pt 1024 \
-# -pq \
-# -pthr ${thres} \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/people/svekut/almeida10_${annotator}_predictor_${thres}_fix/vaevae_latent.npy \
- --clusters_path /home/projects/cpr_10006/people/svekut/almeida10_${annotator}_predictor_${thres}_fix/vaevae_clusters.tsv \
- --fasta /home/projects/cpr_10006/people/paupie/vaevae/almeida_10_samples/03_abundances/abundances/contigs.flt.fna.gz \
- --rpkm /home/projects/cpr_10006/people/paupie/vaevae/abundances_compositions/almeida10/abundance.npz \
- --outdir /home/projects/cpr_10006/people/svekut/almeida10_${annotator}_predictor_${thres}_fix_reclustering \
- --algorithm kmeans \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_almeida10_no_predictor.sh b/workflow_vaevae/src/shortread_almeida10_no_predictor.sh
deleted file mode 100755
index 863a6b49..00000000
--- a/workflow_vaevae/src/shortread_almeida10_no_predictor.sh
+++ /dev/null
@@ -1,29 +0,0 @@
-#!/usr/bin/bash
-annotator=$1
-thres=$2
-
-vamb \
- --model vaevae \
- --outdir /home/projects/cpr_10006/people/svekut/almeida10_${annotator}_fix \
- --fasta /home/projects/cpr_10006/people/paupie/vaevae/almeida_10_samples/03_abundances/abundances/contigs.flt.fna.gz \
- --rpkm /home/projects/cpr_10006/people/paupie/vaevae/abundances_compositions/almeida10/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/almeida_10_samples_taxonomy_${annotator}.tsv \
- --no_predictor \
- -l 32 \
- -e 300 \
- -t 1024 \
- -q \
- -o C \
- --cuda \
- --minfasta 200000
-
-
-# vamb \
-# --model reclustering \
-# --latent_path /home/projects/cpr_10006/people/svekut/cami2_almeida10/vaevae_latent.npy \
-# --clusters_path /home/projects/cpr_10006/people/svekut/cami2_almeida10/vaevae_clusters.tsv \
-# --fasta /home/projects/cpr_10006/people/paupie/vaevae/almeida_10_samples/03_abundances/abundances/contigs.flt.fna.gz \
-# --rpkm /home/projects/cpr_10006/people/paupie/vaevae/abundances_compositions/almeida10/abundance.npz \
-# --outdir /home/projects/cpr_10006/people/svekut/cami2_almeida10_reclustering \
-# --algorithm kmeans \
-# --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_almeida10_vamb.sh b/workflow_vaevae/src/shortread_almeida10_vamb.sh
deleted file mode 100755
index a681c0d8..00000000
--- a/workflow_vaevae/src/shortread_almeida10_vamb.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-#!/usr/bin/bash
-
-# vamb \
-# --outdir /home/projects/cpr_10006/people/svekut/cami2_almeida10_vamb_20102023 \
-# --fasta /home/projects/cpr_10006/people/paupie/vaevae/almeida_10_samples/03_abundances/abundances/contigs.flt.fna.gz \
-# --rpkm /home/projects/cpr_10006/people/paupie/vaevae/abundances_compositions/almeida10/abundance.npz \
-# -l 32 \
-# -e 300 \
-# -q 25 75 150 \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/people/svekut/cami2_almeida10_vamb_20102023/latent.npz \
- --clusters_path /home/projects/cpr_10006/people/svekut/cami2_almeida10_vamb_20102023/vae_clusters.tsv \
- --fasta /home/projects/cpr_10006/people/paupie/vaevae/almeida_10_samples/03_abundances/abundances/contigs.flt.fna.gz \
- --rpkm /home/projects/cpr_10006/people/paupie/vaevae/abundances_compositions/almeida10/abundance.npz \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_almeida10_vamb_20102023_reclustering \
- --algorithm kmeans \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_almeida_N_samples_1.sh b/workflow_vaevae/src/shortread_almeida_N_samples_1.sh
deleted file mode 100755
index 70dfc96a..00000000
--- a/workflow_vaevae/src/shortread_almeida_N_samples_1.sh
+++ /dev/null
@@ -1,31 +0,0 @@
-#!/usr/bin/bash
-folder=$(awk -v var="$PBS_ARRAYID" 'NR==var {print; exit}' /home/projects/cpr_10006/people/svekut/almeida_N_samples/folder_list_1.txt)
-
-# vamb bin default \
-# --outdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/vambout \
-# --fasta /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/contigs.fna.gz \
-# --bamfiles /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/bam_files/*.bam \
-# -l 32 \
-# -e 300 \
-# -q 25 75 150 \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb bin taxvamb \
- --outdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/taxvambout \
- --fasta /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/contigs.fna.gz \
- --bamdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/bam_files \
- --taxonomy /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/taxonomy_mmseqs.tsv \
- -l 32 \
- -e 300 \
- -t 1024 \
- -q \
- -pe 100 \
- -pt 1024 \
- -pq \
- -pthr 0.95 \
- -o C \
- --cuda \
- --minfasta 200000
-
diff --git a/workflow_vaevae/src/shortread_almeida_N_samples_10.sh b/workflow_vaevae/src/shortread_almeida_N_samples_10.sh
deleted file mode 100755
index 83eb5464..00000000
--- a/workflow_vaevae/src/shortread_almeida_N_samples_10.sh
+++ /dev/null
@@ -1,31 +0,0 @@
-#!/usr/bin/bash
-folder=$(awk -v var="$PBS_ARRAYID" 'NR==var {print; exit}' /home/projects/cpr_10006/people/svekut/almeida_N_samples/folder_list_10.txt)
-
-# vamb bin default \
-# --outdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/vambout \
-# --fasta /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/contigs.fna.gz \
-# --bamfiles /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/bam_files/*.bam \
-# -l 32 \
-# -e 300 \
-# -q 25 75 150 \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb bin taxvamb \
- --outdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/taxvambout \
- --fasta /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/contigs.fna.gz \
- --bamdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/bam_files \
- --taxonomy /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/taxonomy_mmseqs.tsv \
- -l 32 \
- -e 300 \
- -t 1024 \
- -q \
- -pe 100 \
- -pt 1024 \
- -pq \
- -pthr 0.95 \
- -o C \
- --cuda \
- --minfasta 200000
-
diff --git a/workflow_vaevae/src/shortread_almeida_N_samples_100 copy.sh b/workflow_vaevae/src/shortread_almeida_N_samples_100 copy.sh
deleted file mode 100755
index 6c8417c8..00000000
--- a/workflow_vaevae/src/shortread_almeida_N_samples_100 copy.sh
+++ /dev/null
@@ -1,31 +0,0 @@
-#!/usr/bin/bash
-folder=$(awk -v var="$PBS_ARRAYID" 'NR==var {print; exit}' /home/projects/cpr_10006/people/svekut/almeida_N_samples/folder_list_100.txt)
-
-# vamb bin default \
-# --outdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/vambout \
-# --fasta /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/contigs.fna.gz \
-# --bamfiles /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/bam_files/*.bam \
-# -l 32 \
-# -e 300 \
-# -q 25 75 150 \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb bin taxvamb \
- --outdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/taxvambout \
- --fasta /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/contigs.fna.gz \
- --bamdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/bam_files \
- --taxonomy /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/taxonomy_mmseqs.tsv \
- -l 32 \
- -e 300 \
- -t 1024 \
- -q \
- -pe 100 \
- -pt 1024 \
- -pq \
- -pthr 0.95 \
- -o C \
- --cuda \
- --minfasta 200000
-
diff --git a/workflow_vaevae/src/shortread_almeida_N_samples_3_100.sh b/workflow_vaevae/src/shortread_almeida_N_samples_3_100.sh
deleted file mode 100755
index 00f0478b..00000000
--- a/workflow_vaevae/src/shortread_almeida_N_samples_3_100.sh
+++ /dev/null
@@ -1,33 +0,0 @@
-#!/usr/bin/bash
-# folder=$(awk -v var="$PBS_ARRAYID" 'NR==var {print; exit}' /home/projects/cpr_10006/people/svekut/almeida_N_samples/folder_list_100.txt)
-
-folder='3_100'
-
-# vamb bin default \
-# --outdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/vambout \
-# --fasta /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/contigs.fna.gz \
-# --bamfiles /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/bam_files/*.bam \
-# -l 32 \
-# -e 300 \
-# -q 25 75 150 \
-# -o C \
-# --cuda \
-# --minfasta 200000
-
-vamb bin taxvamb \
- --outdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/taxvambout \
- --fasta /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/contigs.fna.gz \
- --bamdir /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/bam_files \
- --taxonomy /home/projects/cpr_10006/people/svekut/almeida_N_samples/data_folders/${folder}/taxonomy_mmseqs.tsv \
- -l 32 \
- -e 300 \
- -t 1024 \
- -q \
- -pe 100 \
- -pt 1024 \
- -pq \
- -pthr 0.95 \
- -o C \
- --cuda \
- --minfasta 200000
-
diff --git a/workflow_vaevae/src/shortread_almeida_no_predictor.sh b/workflow_vaevae/src/shortread_almeida_no_predictor.sh
deleted file mode 100755
index eea5abc8..00000000
--- a/workflow_vaevae/src/shortread_almeida_no_predictor.sh
+++ /dev/null
@@ -1,31 +0,0 @@
-#!/usr/bin/bash
-
-# --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/almeida_taxonomy.tsv \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout/results_taxonomy_predictor.csv \
-
-vamb \
- --model vaevae \
- --outdir /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_flatsoftmax \
- --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fna \
- --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/almeida_taxonomy_metabuli_otu.tsv \
- --no_predictor \
- -l 32 \
- -t 512 \
- -e 200 \
- -q \
- -o C \
- --cuda \
- --minfasta 200000
-
-
-vamb \
- --model reclustering \
- --latent_path /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_flatsoftmax/vaevae_latent.npy \
- --clusters_path /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_flatsoftmax/vaevae_clusters.tsv \
- --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fna \
- --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \
- --outdir /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_flatsoftmax_reclustering \
- --hmmout_path /home/projects/cpr_10006/projects/semi_vamb/data/marker_genes/markers_almeida.hmmout \
- --algorithm kmeans \
- --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_almeida_predictor.sh b/workflow_vaevae/src/shortread_almeida_predictor.sh
deleted file mode 100755
index 759a4a13..00000000
--- a/workflow_vaevae/src/shortread_almeida_predictor.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-#!/usr/bin/bash
-# --taxonomy /home/projects/cpr_10006/people/svekut/mmseq2/almeida_taxonomy.tsv \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout/results_taxonomy_predictor.csv \
-run_id=$1
-
-vamb \
- --model taxonomy_predictor \
- --outdir /home/projects/cpr_10006/people/svekut/almeida_kfold_predictor_flat_softmax_${run_id} \
- --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fna \
- --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/almeida_taxonomy_${run_id}.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- --cuda \
- -ploss flat_softmax
diff --git a/workflow_vaevae/src/shortread_almeida_vamb.sh b/workflow_vaevae/src/shortread_almeida_vamb.sh
deleted file mode 100755
index 5f34d3ac..00000000
--- a/workflow_vaevae/src/shortread_almeida_vamb.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-#!/usr/bin/bash
-
-vamb \
- --outdir /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_vamb \
- --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fna \
- --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \
- -l 32 \
- -e 300 \
- -q 25 75 150 \
- -o C \
- --cuda \
- --minfasta 200000
-
-# vamb \
-# --model reclustering \
-# --latent_path /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_e50_dadam/vaevae_latent.npy \
-# --clusters_path /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_e50_dadam/vaevae_clusters.tsv \
-# --fasta /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.fa.gz \
-# --rpkm /home/projects/cpr_10006/projects/vamb/analysis/almeida/data/almeida.jgi.depth.npz \
-# --outdir /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout_e50_dadam_reclustering \
-# --taxonomy_predictions /home/projects/cpr_10006/projects/vamb/almeida_vaevaeout/results_taxonomy_predictor.csv \
-# --minfasta 200000
diff --git a/workflow_vaevae/src/shortread_marine_predictor.sh b/workflow_vaevae/src/shortread_marine_predictor.sh
deleted file mode 100755
index d0347759..00000000
--- a/workflow_vaevae/src/shortread_marine_predictor.sh
+++ /dev/null
@@ -1,13 +0,0 @@
-#!/usr/bin/bash
-keyword=$1
- # --cuda \
-
-vamb taxometer \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_marine_predictor_taxometer_${keyword}_cpu \
- --fasta /home/projects/cpr_10006/data/CAMI2/marine/vamb_output/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/data/CAMI2/marine/vamb_output/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/marine_taxonomy_${keyword}.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- -ploss flat_softmax
diff --git a/workflow_vaevae/src/shortread_marine_predictor_cpu.sh b/workflow_vaevae/src/shortread_marine_predictor_cpu.sh
deleted file mode 100755
index 0de43d73..00000000
--- a/workflow_vaevae/src/shortread_marine_predictor_cpu.sh
+++ /dev/null
@@ -1,12 +0,0 @@
-#!/usr/bin/bash
-keyword=$1
-
-vamb taxometer \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_marine_predictor_taxometer_${keyword}_cpu \
- --fasta /home/projects/cpr_10006/data/CAMI2/marine/vamb_output/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/data/CAMI2/marine/vamb_output/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/marine_taxonomy_${keyword}.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- -ploss flat_softmax
diff --git a/workflow_vaevae/src/shortread_rhizo500_predictor.sh b/workflow_vaevae/src/shortread_rhizo500_predictor.sh
deleted file mode 100755
index b562d093..00000000
--- a/workflow_vaevae/src/shortread_rhizo500_predictor.sh
+++ /dev/null
@@ -1,12 +0,0 @@
-#!/usr/bin/bash
-vamb \
- --model taxonomy_predictor \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_rhizo_500_predictor \
- --fasta /home/projects/cpr_10006/people/svekut/rhizo_data/rhizo.fna.gz \
- --rpkm /home/projects/cpr_10006/people/svekut/rhizo_data/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/rhizo_500_taxonomy_metabuli_otu.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- --cuda \
- -ploss flat_softmax
diff --git a/workflow_vaevae/src/shortread_rhizo_predictor.sh b/workflow_vaevae/src/shortread_rhizo_predictor.sh
deleted file mode 100755
index 3efd535c..00000000
--- a/workflow_vaevae/src/shortread_rhizo_predictor.sh
+++ /dev/null
@@ -1,13 +0,0 @@
-#!/usr/bin/bash
-keyword=$1
- # --cuda \
-
-vamb taxometer \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_rhizo_predictor_taxometer_${keyword}_cpu \
- --fasta /home/projects/cpr_10006/data/CAMI2/rhizo/vamb_output/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/data/CAMI2/rhizo/vamb_output/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/rhizo_taxonomy_${keyword}.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- -ploss flat_softmax
diff --git a/workflow_vaevae/src/shortread_rhizo_predictor_cpu.sh b/workflow_vaevae/src/shortread_rhizo_predictor_cpu.sh
deleted file mode 100755
index acf9f676..00000000
--- a/workflow_vaevae/src/shortread_rhizo_predictor_cpu.sh
+++ /dev/null
@@ -1,12 +0,0 @@
-#!/usr/bin/bash
-keyword=$1
-
-vamb taxometer \
- --outdir /home/projects/cpr_10006/people/svekut/cami2_rhizo_predictor_taxometer_${keyword}_cpu \
- --fasta /home/projects/cpr_10006/data/CAMI2/rhizo/vamb_output/contigs_2kbp.fna \
- --rpkm /home/projects/cpr_10006/data/CAMI2/rhizo/vamb_output/vambout/abundance.npz \
- --taxonomy /home/projects/cpr_10006/people/svekut/04_mmseq2/taxonomy_cami_kfold/rhizo_taxonomy_${keyword}.tsv \
- -pe 100 \
- -pq \
- -pt 1024 \
- -ploss flat_softmax