Merge pull request #51 from gbouras13/clusterdb
v0.2.0
gbouras13 authored Jul 13, 2024
2 parents e65cd49 + e52b220 commit 14e916c
Showing 46 changed files with 27,979 additions and 606 deletions.
28 changes: 28 additions & 0 deletions HISTORY.md
@@ -1,5 +1,33 @@
# History

0.2.0
------------------

**You will need to re-install the updated phold database for v0.2.0 using `phold install`**
**You will also need to upgrade Foldseek to v9.427df8a**

v0.2.0 is a large update adding:

* Improved sensitivity and faster runtime for the `foldseek` search. The Phold database is now clustered at `--min-seq-id 0.3 -c 0.8` and a cluster database is created before searching with `foldseek`
    * Overall, just over 1.1M structures are clustered into around 372k clusters
    * The `--cluster-search 1` parameter is added to `foldseek search` to search against the cluster representatives first and then within each cluster, which increases sensitivity and reduces resource usage compared to `phold v0.1.4`
* Changed default `--max_seqs` from 1000 to 10000 to improve sensitivity at little resource usage cost
* Phold database is expanded adding:
    * Extremely conservative high confidence [efam](https://doi.org/10.1093/bioinformatics/btab451) proteins with hits to PHROGs.
    * 95% dereplicated diversity-generating retroelements (DGRs) from [Roux et al](https://www.nature.com/articles/s41467-021-23402-7).
    * 7153 netflax toxin-antitoxin system proteins from [Ernits et al](https://doi.org/10.1073/pnas.2305393120).
* Adds `--ultra_sensitive` flag which turns off Foldseek prefiltering for maximum sensitivity. Recommended for small datasets/single phages only.
    * This passes the `--exhaustive-search` parameter to `foldseek search`
* Adds the ability to save ProstT5 embeddings with `--save_per_residue_embeddings` and `--save_per_protein_embeddings`
* Adds support for `.cif` structure files (e.g. from the Alphafold3 server) in addition to the `.pdb` file format
* Removes some experimental parameters from v0.1.4 (`--split` etc)
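
To make the interaction between these search options concrete, here is a minimal Python sketch of how the `foldseek search` arguments could be assembled from the settings listed above. It is not phold's actual implementation: the `SearchOptions` class and `build_foldseek_search_args` function are hypothetical names, the `1` passed after `--exhaustive-search` is an assumption, and so is the mapping of phold's `--max_seqs` onto Foldseek's `--max-seqs`.

```python
from dataclasses import dataclass


@dataclass
class SearchOptions:
    """Hypothetical container for the phold search settings described above."""
    evalue: float = 1e-3
    sensitivity: float = 9.5
    max_seqs: int = 10000               # v0.2.0 default (previously 1000)
    only_representatives: bool = False  # True turns off --cluster-search 1
    ultra_sensitive: bool = False       # True skips the Foldseek prefilter
    threads: int = 1


def build_foldseek_search_args(query_db, target_db, result_db, tmp_dir, opts):
    """Return an argument list for `foldseek search` (illustrative only)."""
    args = [
        "foldseek", "search", query_db, target_db, result_db, tmp_dir,
        "--threads", str(opts.threads),
        "-e", str(opts.evalue),
        "-s", str(opts.sensitivity),
        "--max-seqs", str(opts.max_seqs),  # assumed Foldseek spelling of --max_seqs
    ]
    if not opts.only_representatives:
        # Search cluster representatives first, then within each cluster.
        args += ["--cluster-search", "1"]
    if opts.ultra_sensitive:
        # Assumption: the flag takes a 0/1 value.
        args += ["--exhaustive-search", "1"]
    return args


if __name__ == "__main__":
    print(" ".join(build_foldseek_search_args(
        "query_db", "phold_db", "result_db", "tmp", SearchOptions())))
```

Returning a list rather than a single string means the arguments can be passed to `subprocess.run` without shell quoting problems.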

Breaking CLI parameter changes

* `--pdb` has changed to `--structures`
* `--pdb_dir` has changed to `--structure_dir`
* `--filter_pdbs` has changed to `--filter_structures`
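
If you have wrapper scripts written for v0.1.4, the renames above can be applied mechanically. The following is a small illustrative sketch, not part of phold: `RENAMED_FLAGS` and `migrate_args` are hypothetical names, and the example command line is invented for demonstration.

```python
# Old v0.1.4 flag -> new v0.2.0 flag, taken from the list above.
RENAMED_FLAGS = {
    "--pdb": "--structures",
    "--pdb_dir": "--structure_dir",
    "--filter_pdbs": "--filter_structures",
}


def migrate_args(argv):
    """Rewrite renamed flags in an argument list, leaving everything else as-is."""
    migrated = []
    for arg in argv:
        # Handle both "--flag value" and "--flag=value" spellings.
        flag, sep, value = arg.partition("=")
        migrated.append(RENAMED_FLAGS.get(flag, flag) + sep + value)
    return migrated


if __name__ == "__main__":
    old = ["phold", "compare", "--pdb", "--pdb_dir", "structs/"]
    print(migrate_args(old))
    # → ['phold', 'compare', '--structures', '--structure_dir', 'structs/']
```

The lookup is by exact flag name, so `--pdb` and `--pdb_dir` are never confused with each other.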

0.1.4 (2024-03-26)
------------------

101 changes: 63 additions & 38 deletions README.md
@@ -7,13 +7,23 @@

# phold - Phage Annotation using Protein Structures

<p align="center">
<img src="img/phold_logo.png" alt="phold Logo" height=200>
</p>

`phold` is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology.

`phold` uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to translate protein amino acid sequences to the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek is then used to search these against a database of 803k protein structures mostly predicted using [Colabfold](https://github.com/sokrypton/ColabFold).
`phold` uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek is then used to search these against a database of over 1 million phage protein structures mostly predicted using [Colabfold](https://github.com/sokrypton/ColabFold).

Alternatively, you can specify protein structures that you have pre-computed for your phage(s) instead of using ProstT5 with `phold compare`.

Alternatively, you can specify protein structures that you have pre-computed for your phage(s) instead of using ProstT5.
Benchmarking is ongoing, but `phold` strongly outperforms [Pharokka](https://github.com/gbouras13/pharokka), particularly for less characterised phages such as those from metagenomic datasets.

Benchmarking is ongoing but `phold` strongly outperforms [Pharokka](https://github.com/gbouras13/pharokka), particularly for less characterised phages such as those from metagenomic datasets.
The plot below shows the percentage of annotated coding sequences (CDS) for 179 metagenomic phage genomes assembled with [phables](https://github.com/Vini2/phables). Phold v0.2.0, run both with default settings (using ProstT5) and with pre-computed protein structures (predicted with Colabfold), was compared against Pharokka v1.7.0.

<p align="center">
<img src="img/phables_bench.jpeg" alt="phables benchmarking" height=200>
</p>

If you have already annotated your phage(s) with Pharokka, `phold` takes the Genbank output of Pharokka as an input option, so you can easily update the annotation with more functional predictions!

@@ -25,13 +35,11 @@ Check out the `phold` tutorial at [https://phold.readthedocs.io/en/latest/tutori

If you don't want to install `phold` locally, you can run it without any code using one of the following Google Colab notebooks:

* To run `pharokka` + `phold` + `phynteny` (recommended) use [this link](https://colab.research.google.com/github/gbouras13/phold/blob/main/run_pharokka_and_phold_and_phynteny.ipynb)
* To run `pharokka` + `phold` + `phynteny` use [this link](https://colab.research.google.com/github/gbouras13/phold/blob/main/run_pharokka_and_phold_and_phynteny.ipynb)
    * [phynteny](https://github.com/susiegriggo/Phynteny) uses a long short-term memory (LSTM) model trained on phage synteny (the conserved gene order across phages) to assign hypothetical phage proteins to a PHROG category. It might help you add extra PHROG category annotations to hypothetical genes remaining after you run `phold`.
    * Note: Phynteny will work only if your phage has fewer than 120 predicted proteins
    * You can still use this notebook to run `phold` if your phage(s) are too big - just don't run the Phynteny step!

* To run only `phold` use [this link](https://colab.research.google.com/github/gbouras13/phold/blob/main/run_phold.ipynb)

# Table of Contents

- [phold - Phage Annotation using Protein Structures](#phold---phage-annotation-using-protein-structures)
@@ -87,11 +95,11 @@ Once `phold` is installed, to download and install the database run:
phold install
```

* Note: You will need at least 8GB of free space (the `phold` databases including ProstT5 are 7.7GB uncompressed).
* Note: You will need at least 8GB of free space (the `phold` databases including ProstT5 are just over 8GB uncompressed).

# Quick Start

* `phold` takes a GenBank format file output from [pharokka](https://github.com/gbouras13/pharokka) as its input by default.
* `phold` takes a GenBank format file output from [pharokka](https://github.com/gbouras13/pharokka) or from [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/) as its input by default.
* If you are running `phold` on a local workstation with a GPU available, using `phold run` is recommended. It runs both `phold predict` and `phold compare`.

``` bash
@@ -137,10 +145,12 @@ Commands:
citation Print the citation(s) for this tool
compare Runs Foldseek vs phold db
createdb Creates foldseek DB from AA FASTA and 3Di FASTA input...
install Installs ProstT5 model and phold database
plot Creates Phold Circular Genome Plots
predict Uses ProstT5 to predict 3Di tokens - GPU recommended
proteins-compare Runs Foldseek vs phold db on proteins input
proteins-predict Runs ProstT5 on a multiFASTA input - GPU recommended
remote Uses foldseek API to run ProstT5 then foldseek locally
remote Uses Foldseek API to run ProstT5 then Foldseek locally
run phold predict then comapare all in one - GPU recommended
```

@@ -150,34 +160,46 @@ Usage: phold run [OPTIONS]
phold predict then comapare all in one - GPU recommended

Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Genbank format or nucleotide
FASTA format [required]
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
--batch_size INTEGER batch size for ProstT5. 1 is usually fastest.
[default: 1]
--cpu Use cpus only.
--omit_probs Do not output 3Di probabilities from ProstT5
--finetune Use finetuned ProstT5 model
--finetune_path TEXT Path to finetuned model weights
-e, --evalue FLOAT Evalue threshold for Foldseek [default: 1e-3]
-s, --sensitivity FLOAT sensitivity parameter for Foldseek [default: 9.5]
--keep_tmp_files Keep temporary intermediate files, particularly
the large foldseek_results.tsv of all Foldseek
hits
--split Split the Foldseek searches by ProstT5 probability
--split_threshold FLOAT ProstT5 probability to split by [default: 60]
--card_vfdb_evalue FLOAT Stricter Evalue threshold for Foldseek CARD and
VFDB hits [default: 1e-10]
--separate Output separate genbank files for every contig
--max_seqs INTEGER Maximum results per query sequence allowed to pass
the prefilter. You may want to reduce this to save
disk space for enormous datasets [default: 1000]
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Genbank format or
nucleotide FASTA format [required]
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
--batch_size INTEGER batch size for ProstT5. 1 is usually fastest.
[default: 1]
--cpu Use cpus only.
--omit_probs Do not output 3Di probabilities from ProstT5
--finetune Use finetuned ProstT5 model (PhrostT5).
Experimental and not recommended for now
--finetune_path TEXT Path to finetuned model weights
--save_per_residue_embeddings Save the ProstT5 embeddings per resuide in a
h5 file
--save_per_protein_embeddings Save the ProstT5 embeddings as means per
protein in a h5 file
-e, --evalue FLOAT Evalue threshold for Foldseek [default:
1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep_tmp_files Keep temporary intermediate files,
particularly the large foldseek_results.tsv
of all Foldseek hits
--card_vfdb_evalue FLOAT Stricter Evalue threshold for Foldseek CARD
and VFDB hits [default: 1e-10]
--separate Output separate GenBank files for each contig
--max_seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 10000]
--only_representatives Foldseek search only against the cluster
representatives (i.e. turn off --cluster-
search 1 Foldseek parameter)
--ultra_sensitive Runs phold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
```

# Plotting
@@ -194,10 +216,11 @@ phold plot -i tests/test_data/NC_043029_phold_output.gbk -o NC_043029_phold_plo

# Citation

`phold` is a work in progress, a preprint will be coming hopefully soon - if you use it please cite the GitHub repository https://github.com/gbouras13/phold for now.
`phold` is a work in progress and a preprint is coming soon. If you use it, please cite the GitHub repository https://github.com/gbouras13/phold for now.

Please be sure to cite the following core dependencies and PHROGs database:

* Pharokka - (https://github.com/gbouras13/pharokka) [Bouras G, Nepal R, Houtak G, Psaltis AJ, Wormald P-J, Vreugde S. Pharokka: a fast scalable bacteriophage annotation tool. Bioinformatics, Volume 39, Issue 1, January 2023, btac776](https://doi.org/10.1093/bioinformatics/btac776)
* Foldseek - (https://github.com/steineggerlab/foldseek) [van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist C, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)](https://www.nature.com/articles/s41587-023-01773-0)
* ProstT5 - (https://github.com/mheinzinger/ProstT5) [Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, Burkhard Rost. ProstT5: Bilingual Language Model for Protein Sequence and Structure. bioRxiv doi:10.1101/2023.07.23.550085 (2023)](https://www.biorxiv.org/content/10.1101/2023.07.23.550085v1)
* Colabfold - (https://github.com/sokrypton/ColabFold) [Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold: Making protein folding accessible to all. Nature Methods (2022) doi: 10.1038/s41592-022-01488-1 ](https://www.nature.com/articles/s41592-022-01488-1)
@@ -209,5 +232,7 @@ Please also consider citing these supplementary databases where relevant:
* [VFDB](http://www.mgc.ac.cn/VFs/main.htm) - [Chen L., Yang J., Yao Z., Sun L., Shen Y., Jin Q., "VFDB: a reference database for bacterial virulence factors", Nucleic Acids Research (2005) https://doi.org/10.1093/nar/gki008](https://doi.org/10.1093/nar/gki008)
* [Defensefinder](https://defensefinder.mdmlab.fr) - [ F. Tesson, R. Planel, A. Egorov, H. Georjon, H. Vaysset, B. Brancotte, B. Néron, E. Mordret, A Bernheim, G. Atkinson, J. Cury. A Comprehensive Resource for Exploring Antiphage Defense: DefenseFinder Webservice, Wiki and Databases. bioRxiv (2024) https://doi.org/10.1101/2024.01.25.577194](https://doi.org/10.1101/2024.01.25.577194)
* [acrDB](https://bcb.unl.edu/AcrDB/) - please cite the original acrDB database paper [Le Huang, Bowen Yang, Haidong Yi, Amina Asif, Jiawei Wang, Trevor Lithgow, Han Zhang, Fayyaz ul Amir Afsar Minhas, Yanbin Yin, AcrDB: a database of anti-CRISPR operons in prokaryotes and viruses. Nucleic Acids Research (2021) https://doi.org/10.1093/nar/gkaa857](https://doi.org/10.1093/nar/gkaa857) AND the paper that generated the structures for these protein used by `phold` [Harutyun Sahakyan, Kira S. Makarova, and Eugene V. Koonin. Search for Origins of Anti-CRISPR Proteins by Structure Comparison. The CRISPR Journal (2023)](https://doi.org/10.1089/crispr.2023.0011)
* [Netflax](http://netflax.webflags.se) - [Karin Ernits, Chayan Kumar Saha, Tetiana Brodiazhenko, Bhanu Chouhan, Aditi Shenoy, Jessica A. Buttress, Julián J. Duque-Pedraza, Veda Bojar, Jose A. Nakamoto, Tatsuaki Kurata, Artyom A. Egorov, Lena Shyrokova, Marcus J. O. Johansson, Toomas Mets, Aytan Rustamova, Jelisaveta Džigurski, Tanel Tenson, Abel Garcia-Pino, Henrik Strahl, Arne Elofsson, Vasili Hauryliuk, and Gemma C. Atkinson, The structural basis of hyperpromiscuity in a core combinatorial network of type II toxin–antitoxin and related phage defense systems. PNAS (2023) https://doi.org/10.1073/pnas.2305393120](https://doi.org/10.1073/pnas.2305393120)
* [Netflax](http://netflax.webflags.se) - [Karin Ernits, Chayan Kumar Saha, Tetiana Brodiazhenko, Bhanu Chouhan, Aditi Shenoy, Jessica A. Buttress, Julián J. Duque-Pedraza, Veda Bojar, Jose A. Nakamoto, Tatsuaki Kurata, Artyom A. Egorov, Lena Shyrokova, Marcus J. O. Johansson, Toomas Mets, Aytan Rustamova, Jelisaveta Džigurski, Tanel Tenson, Abel Garcia-Pino, Henrik Strahl, Arne Elofsson, Vasili Hauryliuk, and Gemma C. Atkinson, The structural basis of hyperpromiscuity in a core combinatorial network of type II toxin–antitoxin and related phage defense systems. PNAS (2023) https://doi.org/10.1073/pnas.2305393120](https://doi.org/10.1073/pnas.2305393120)

