Add FAQ entry about antibody sequencing #304

Merged
merged 3 commits into from
Feb 24, 2024
41 changes: 24 additions & 17 deletions docs/faq.md
@@ -34,40 +34,40 @@ You can avoid this error by explicitly specifying the model file using the `--mo
**Casanovo is very slow even when running on the GPU. How can I speed it up?**

It is highly recommended to run Casanovo on the GPU to get the maximum performance.
-If Casanovo is slow despite that your system has a GPU, the GPU might not be configured correctly for Casanovo.
+If Casanovo is slow despite your system having a GPU, then the GPU might not be configured correctly.
A quick test to verify that Casanovo is using your (CUDA-enabled) GPU is to run `watch nvidia-smi` in your terminal.
-If Casanovo has access to the GPU, you should see it listed in the bottom process table and the "Volatile GPU-Util" column at the top right should show activity while Casanovo is processing the data.
+If Casanovo has access to the GPU, then you should see it listed in the bottom process table, and the "Volatile GPU-Util" column at the top right should show activity while Casanovo is processing the data.

-If Casanovo is not listed in the `nvidia-smi` output, it is not using your GPU.
+If Casanovo is not listed in the `nvidia-smi` output, then it is not using your GPU.
This is commonly caused by an incompatibility between your NVIDIA drivers and Pytorch.
-Although Pytorch is installed automatically when installing Casanovo, in this case we recommend to reinstall it manually according to the following steps:
+Although Pytorch is installed automatically when installing Casanovo, in this case we recommend reinstalling it manually according to the following steps:

1. Uninstall the current version of Pytorch: `pip uninstall torch`
2. Install the latest version of the NVIDIA drivers using the [official CUDA Toolkit](https://developer.nvidia.com/cuda-downloads). If supported by your system, an easy alternative can be conda using `conda install -c nvidia cuda-toolkit`.
3. Install the latest version of Pytorch according to the [instructions on the Pytorch website](https://pytorch.org/get-started/locally/).

Try to run Casanovo again and use `watch nvidia-smi` to inspect whether it can use the GPU now.
-If this is still not yet the case, please [open an issue on GitHub](https://github.com/Noble-Lab/casanovo/issues/new).
-Include full information about your system set-up, the installed CUDA toolkit and Pytorch versions, and the troubleshooting steps you have performed.
+If this is still not the case, please [open an issue on GitHub](https://github.com/Noble-Lab/casanovo/issues/new).
+Include full information about your system setup, the installed CUDA toolkit and Pytorch versions, and the troubleshooting steps you have performed.
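
The `nvidia-smi` check described in this FAQ entry can also be scripted. The following is a minimal sketch, assuming `nvidia-smi` is on your `PATH`; the helper names `parse_compute_apps` and `gpu_processes` are illustrative and not part of Casanovo. Note that Casanovo may be listed under the name of the Python interpreter that runs it rather than as `casanovo`.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse the CSV text printed by
    `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader`
    into a list of (pid, process_name) string pairs."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first comma only, in case the process name contains one.
        pid, _, name = line.partition(",")
        apps.append((pid.strip(), name.strip()))
    return apps


def gpu_processes():
    """Return the processes currently using the GPU, or None if
    `nvidia-smi` is unavailable or exits with an error."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)
```

Calling `gpu_processes()` while Casanovo is running and getting back an empty list (or `None`) suggests that Casanovo is not using the GPU, in which case the reinstallation steps above apply.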

-**I get a "CUDA out of memory" error when trying to run Casanovo. Help!**
+**Why do I get a "CUDA out of memory" error when trying to run Casanovo?**

This means that there was not enough (free) memory available on your GPU to run Casanovo, which is especially likely to happen when you are using a smaller, consumer-grade GPU.
-We recommend trying to decrease the `train_batch_size` or `predict_batch_size` options in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (depending on whether the error occurred during `train` or `denovo` mode) to reduce the number of spectra that are processed simultaneously.
+Depending on whether the error occurred during `train` or `denovo` mode, we recommend decreasing the `train_batch_size` or `predict_batch_size` options, respectively, in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) to reduce the number of spectra that are processed simultaneously.
Additionally, we recommend shutting down any other processes that may be running on the GPU, so that Casanovo can exclusively use the GPU.

**How can I run Casanovo on a specific GPU device?**

You can control which GPU(s) Casanovo uses by setting the `devices` option in the [configuration file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml).
-Analogously, this setting also controls the number of cores to use when running on a CPU only (which can be specified using the `accelerator` option).
+This setting also controls the number of cores to use when running on a CPU only (which can be specified using the `accelerator` option).

By default, Casanovo will automatically try to use the maximum number of devices available.
-I.e., if your system has multiple GPUs, Casanovo will utilize all of those for maximum efficiency.
+I.e., if your system has multiple GPUs, then Casanovo will use all of them for maximum efficiency.
Alternatively, you can select a specific GPU by specifying the GPU number as the value for `devices`.
For example, if you have a four-GPU system, when specifying `devices: 1` in your config file Casanovo will only use the GPU with identifier `1`.

The config file functionality only allows specifying a single GPU, by setting its id under `devices`, or all GPUs, by setting `devices: -1`.
-If you want more fine-grained control to use some but not all GPUs on a multi-GPU system, the `CUDA_VISIBLE_DEVICES` environment variable can be used instead.
+If you want more fine-grained control to use some but not all GPUs on a multi-GPU system, then the `CUDA_VISIBLE_DEVICES` environment variable can be used instead.
For example, by setting `CUDA_VISIBLE_DEVICES=1,3`, only GPUs `1` and `3` will be visible to Casanovo, and specifying `devices: -1` will allow it to utilize both of these.

Note that when using `CUDA_VISIBLE_DEVICES`, the GPU numbers (potentially to be specified under `devices`) are reset to consecutively increase from `0`.
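
The renumbering can be illustrated with a small sketch. It assumes that `CUDA_VISIBLE_DEVICES` holds a comma-separated list of GPU ids; the function name `visible_device_map` is hypothetical and only mirrors the renumbering rule described above, without querying any GPU.

```python
def visible_device_map(cuda_visible_devices):
    """Map the renumbered device indices (as seen by the process)
    to the ids listed in CUDA_VISIBLE_DEVICES."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    # Visible devices are renumbered consecutively, starting from 0.
    return dict(enumerate(ids))


print(visible_device_map("1,3"))  # {0: '1', 1: '3'}
```

With `CUDA_VISIBLE_DEVICES=1,3`, specifying `devices: 1` in the config file would therefore select the GPU originally numbered `3`, not the one originally numbered `1`.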
@@ -86,7 +86,7 @@ This will need to be set with each new shell session, or you can add it to your

**Where can I find the data that Casanovo was trained on?**

-The [Casanovo results reported](https://doi.org/10.1101/2023.01.03.522621) were obtained by training on two different datasets: (i) a commonly used nine-species benchmark dataset, and (ii) a large-scale training dataset derived from the MassIVE Knowledge Base (MassIVE-KB).
+The [reported Casanovo results](https://doi.org/10.1101/2023.01.03.522621) were obtained by training on two different datasets: (i) a commonly used nine-species benchmark dataset, and (ii) a large-scale training dataset derived from the MassIVE Knowledge Base (MassIVE-KB).

All data for the _nine-species benchmark_ is available as annotated MGF files [on MassIVE](https://doi.org/doi:10.25345/C52V2CK8J).
Using these data, Casanovo was trained in a cross-validated fashion, training on eight species and testing on the remaining species.
@@ -104,12 +104,12 @@ Training, validation, and test splits for the non-enzymatic dataset are availabl

By default, Casanovo saves a snapshot of the model weights after every 50,000 training steps.
Note that the number of samples that are processed during a single training step depends on the batch size.
-Therefore, when using the default training batch size of 32, this corresponds to saving a model snapshot after every 1.6 million training samples.
+Therefore, the default training batch size of 32 corresponds to saving a model snapshot after every 1.6 million training samples.
You can optionally modify the snapshot (and validation) frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `val_check_interval`), depending on your dataset size.
-Note that taking very frequent model snapshots will result in somewhat slower training time because Casanovo will evaluate its performance on the validation data for every snapshot.
+Note that taking very frequent model snapshots will result in slower training time because Casanovo will evaluate its performance on the validation data for every snapshot.
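
The relationship between training steps, batch size, and samples per snapshot is simple arithmetic, sketched below. The function names and the one-million-spectra dataset size are illustrative assumptions; only the 50,000-step interval and the batch size of 32 come from the text above.

```python
import math


def samples_per_snapshot(steps_between_snapshots, batch_size):
    """Number of training samples processed between two model snapshots."""
    return steps_between_snapshots * batch_size


def steps_per_epoch(num_samples, batch_size):
    """Number of training steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


# Defaults from the text: a snapshot every 50,000 steps, batch size 32.
print(samples_per_snapshot(50_000, 32))  # 1600000

# Hypothetical dataset of one million spectra: one epoch is 31,250 steps,
# so the default interval would not produce a snapshot within the first epoch.
print(steps_per_epoch(1_000_000, 32))  # 31250
```

For a smaller dataset like the hypothetical one here, a lower `val_check_interval` may be more suitable, at the cost of more frequent validation passes.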

When saving a model snapshot, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file.
-After your training job is finished, you can identify the best performing model that achieves the maximum peptide and amino acid precision from the log file and use the corresponding model snapshot.
+After your training job is finished, you can identify the model that achieves the maximum peptide and amino acid precision from the log file and use the corresponding model snapshot.

**Even though I added new post-translational modifications to the configuration file, Casanovo didn't identify those peptides.**

@@ -121,7 +121,7 @@ By default, this includes oxidation of methionine, deamidation of asparagine and
(Additionally, cysteines are _always_ considered to be carbamidomethylated.)
Simply making changes to the `residues` alphabet in the configuration file is insufficient to identify new types of PTMs with Casanovo, however.
This is indicated by the fact that this option is not marked with `(I)` in the configuration file, which indicates options that can be modified during inference.
-Al remaining options require training a new Casanovo model.
+All remaining options require training a new Casanovo model.

Therefore, to learn the spectral signature of previously unknown PTMs, a new Casanovo version needs to be _trained_.
To include new PTMs in Casanovo, you need to:
@@ -134,7 +134,7 @@ Instead, such a model must be trained from scratch.

**How can I change the learning rate schedule used during training?**

-By default, Casanovo uses a learning rate schedule that combines linear warm up followed by a cosine wave shaped decay (as implemented in `CosineWarmupScheduler` in `casanovo/denovo/model.py`) during training.
+By default, Casanovo uses a learning rate schedule that combines linear warm up followed by a cosine decay (as implemented in `CosineWarmupScheduler` in `casanovo/denovo/model.py`) during training.
To use a different learning rate schedule, you can specify an alternative learning rate scheduler as follows (in the `lr_scheduler` variable in function `Spec2Pep.configure_optimizers` in `casanovo/denovo/model.py`):

```
@@ -145,6 +145,13 @@ You can use any of the scheduler classes available in [`torch.optim.lr_scheduler

## Miscellaneous

+**Can I use Casanovo to sequence antibodies?**
+
+Yes, antibody sequencing is one of the popular uses for de novo sequencing technology.
+[This article](https://academic.oup.com/bib/article/24/1/bbac542/6955273) carried out a systematic comparison of six de novo sequencing tools (Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo). Casanovo fared very well in this comparison: "Casanovo exhibits the highest number of correct peptide predictions compared with all other de novo algorithms across all enzymes demonstrating the advantage of using transformers for peptide sequencing. Furthermore, Casanovo predicts amino acids with overall superior precision."
+
+In practice, you may want to try providing your Casanovo output file to the [Stitch software](https://github.com/snijderlab/stitch), which performs template-based assembly of de novo peptide reads to reconstruct antibody sequences ([Schulte and Snijder 2024](https://www.biorxiv.org/content/10.1101/2024.02.20.581155v1)).
+
**Where can I find Casanovo model weights trained on the nine-species benchmark?**

You can find the Casanovo weights corresponding to the nine-species benchmark [on Zenodo](https://doi.org/10.5281/zenodo.10694984), compatible with Casanovo v4.x.x.