
Commit

Merge branch 'dev' into dev_db_search
VarunAnanth2003 authored Nov 3, 2024
2 parents c9eb8b7 + 0d1df14 commit 2b6198b
Showing 9 changed files with 39 additions and 20 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -18,6 +18,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- The `--output` option has been split into two options, `--output_dir` and `--output_root`.
- The `--validation_peak_path` is now optional when training; if `--validation_peak_path` is not set then the `train_peak_path` will also be used for validation.
- The `tb_summarywriter` config option is now a boolean config option, and if set to true the TensorBoard summary will be written to a sub-directory of the output directory named `tensorboard`.
- The Casanovo peptide-level score is now reported as the geometric mean of the raw amino acid scores, rather than the arithmetic mean.

### Fixed

2 changes: 1 addition & 1 deletion casanovo/denovo/model.py
@@ -1264,7 +1264,7 @@ def _aa_pep_score(
peptide_score : float
The peptide score.
"""
peptide_score = np.mean(aa_scores)
peptide_score = np.exp(np.mean(np.log(aa_scores)))
aa_scores = (aa_scores + peptide_score) / 2
if not fits_precursor_mz:
peptide_score -= 1
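
The scoring change in this hunk can be sketched in isolation. The block below is a minimal, self-contained approximation of the updated logic and not the actual `_aa_pep_score` implementation; the function name `aa_pep_score_sketch` is hypothetical. It illustrates one consequence of the switch: a single amino acid score of zero drives the geometric mean, and therefore the peptide score, to zero, whereas the arithmetic mean of the same scores would be 0.5.

```python
import numpy as np


def aa_pep_score_sketch(aa_scores, fits_precursor_mz):
    """Hypothetical stand-in mirroring the updated scoring logic in the diff."""
    aa_scores = np.asarray(aa_scores, dtype=float)
    # Geometric mean computed as exp(mean(log(x))). log(0) is -inf, so any
    # zero score yields a geometric mean of 0; the divide warning is silenced.
    with np.errstate(divide="ignore"):
        peptide_score = float(np.exp(np.mean(np.log(aa_scores))))
    # Each amino acid score is averaged with the peptide score.
    aa_scores = (aa_scores + peptide_score) / 2
    # Peptides that do not fit the precursor m/z are penalized by 1.
    if not fits_precursor_mz:
        peptide_score -= 1
    return aa_scores, peptide_score


raw = [0.0, 0.5, 1.0]
print("arithmetic mean:", np.mean(raw))
print("sketch output:", aa_pep_score_sketch(raw, True))
```

Because the geometric mean is sensitive to low individual scores, peptides containing one poorly supported residue are ranked lower than they would be under the arithmetic mean.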
15 changes: 9 additions & 6 deletions docs/faq.md
@@ -103,14 +103,17 @@ Training, validation, and test splits for the non-enzymatic dataset are availabl

**How do I know which model to use after training Casanovo?**

By default, Casanovo saves a snapshot of the model weights after every 50,000 training steps.
When running model validation, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file.
At the end of each validation run and at the end of each training epoch (one complete run over the training data), Casanovo will take a snapshot of the current model weights.
After the training job is finished, the validation snapshot that achieved the lowest **validation loss** will be saved to the output directory as `<output_root>.best.ckpt`.
Additionally, a snapshot of the model weights at the end of each **training** epoch will be saved to the output directory as `epoch=<epoch>-step=<step>.ckpt`.
Snapshots from previous training epochs will be overwritten with the latest training snapshot at the end of each training epoch.

By default, Casanovo runs model validation every 50,000 training steps.
Note that the number of samples that are processed during a single training step depends on the batch size.
Therefore, the default training batch size of 32 corresponds to saving a model snapshot after every 1.6 million training samples.
You can optionally modify the snapshot (and validation) frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `val_check_interval`), depending on your dataset size.
Note that taking very frequent model snapshots will result in slower training time because Casanovo will evaluate its performance on the validation data for every snapshot.

When saving a model snapshot, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file.
After your training job is finished, you can identify the model that achieves the maximum peptide and amino acid precision from the log file and use the corresponding model snapshot.
You can optionally modify the validation run frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `val_check_interval`), depending on your dataset size.
Note that running model validation very frequently will result in slower training time because Casanovo will evaluate its performance on the validation data for every validation check.
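
The relationship between `val_check_interval`, batch size, and the number of training samples seen between validation runs is simple arithmetic. As a rough sketch (the helper name `samples_per_validation` is hypothetical and not part of Casanovo), using the default values quoted in the FAQ text:

```python
def samples_per_validation(val_check_interval, batch_size):
    """Number of training samples processed between two validation runs."""
    return val_check_interval * batch_size


# Defaults quoted in the FAQ text: 50,000 steps and a batch size of 32.
print(samples_per_validation(50_000, 32))  # 1600000
```

For a dataset much smaller than 1.6 million spectra, this suggests lowering `val_check_interval` so that validation runs at least once per epoch.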

**Even though I added new post-translational modifications to the configuration file, Casanovo didn't identify those peptides.**

2 changes: 1 addition & 1 deletion docs/file_formats.md
@@ -253,6 +253,6 @@ Similarly, in Casanovo evaluation mode only annotated MGF files are supported.
<!-- TODO: when index files can be reused, document this here -->

During training, Casanovo will save **checkpoint files** every `val_check_interval` steps, as specified in the configuration.
Model checkpoints will be saved in the `model_save_folder_path` folder with filename format `epoch=EPOCH-step=STEP.ckpt`, with `EPOCH` the epoch and `STEP` the training step at which the checkpoint was taken, helping you track progress and select the best model based on validation performance.
Model checkpoints will be saved to the folder specified by the `--output_dir` command line option with filename format `epoch=EPOCH-step=STEP.ckpt`, with `EPOCH` the epoch and `STEP` the training step at which the checkpoint was taken, helping you track progress and select the best model based on validation performance.
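
The `epoch=EPOCH-step=STEP.ckpt` naming scheme makes it straightforward to order checkpoints programmatically. The following is a minimal sketch, assuming filenames follow exactly that pattern; the helper names and the example paths are hypothetical and not part of Casanovo.

```python
import re
from pathlib import Path

# Matches file names such as "epoch=2-step=150000.ckpt".
CKPT_PATTERN = re.compile(r"epoch=(\d+)-step=(\d+)\.ckpt$")


def parse_checkpoint_name(path):
    """Return (epoch, step) for a checkpoint path, or None if it does not match."""
    match = CKPT_PATTERN.search(Path(path).name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def latest_checkpoint(paths):
    """Return the path with the highest training step, or None if none match."""
    parsed = [(parse_checkpoint_name(p), p) for p in paths]
    parsed = [(key, p) for key, p in parsed if key is not None]
    if not parsed:
        return None
    return max(parsed, key=lambda item: item[0][1])[1]


# Hypothetical file names used purely for illustration.
example = ["out/epoch=0-step=50000.ckpt", "out/epoch=1-step=100000.ckpt"]
print(latest_checkpoint(example))
```

Files that do not follow the pattern are ignored rather than raising an error, so the helper can be pointed at a directory listing that contains other outputs.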

<!-- TODO: when checkpointing is made more flexible, update this information -->
17 changes: 10 additions & 7 deletions docs/getting_started.md
@@ -81,9 +81,12 @@ This model file or a custom one can then be specified using the `--model` comman

Not all releases will have a model file included on the [Releases page](https://github.com/Noble-Lab/casanovo/releases), in which case model weights for alternative releases with the same major version number can be used.

The most recent model weights for Casanovo version 3.x are currently provided under [Casanovo v3.0.0](https://github.com/Noble-Lab/casanovo/releases/tag/v3.0.0):
- `casanovo_massivekb.ckpt`: Default Casanovo weights to use when analyzing tryptic data. These weights will be downloaded automatically if no weights are explicitly specified.
- `casanovo_non-enzy.checkpt`: Casanovo weights to use when analyzing non-tryptic data, obtained by fine-tuning the tryptic model on multi-enzyme data. These weights need to be downloaded manually.
The most recent model weights for Casanovo version 4.2 and above are currently provided under [Casanovo v4.2.0](https://github.com/Noble-Lab/casanovo/releases/tag/v4.2.0):
- `casanovo_v4_2_0.ckpt`: Default Casanovo weights to use as described in [Melendez et al.](https://pubs.acs.org/doi/full/10.1021/acs.jproteome.4c00422). These weights will be downloaded automatically if no weights are explicitly specified.

Alternatively, model weights for Casanovo version 4.x as described in [Yilmaz et al.](https://www.nature.com/articles/s41467-024-49731-x) are currently provided under [Casanovo v4.0.0](https://github.com/Noble-Lab/casanovo/releases/tag/v4.0.0):
- `casanovo_massivekb.ckpt`: Casanovo weights to use when analyzing tryptic data. These weights need to be downloaded manually.
- `casanovo_nontryptic.ckpt`: Casanovo weights to use when analyzing non-tryptic data, obtained by fine-tuning the tryptic model on multi-enzyme data. These weights need to be downloaded manually.

## Running Casanovo

@@ -96,7 +99,7 @@ We recommend a Linux system with a dedicated GPU to achieve optimal runtime perf
To sequence your own mass spectra with Casanovo, use the `casanovo sequence` command:

```sh
casanovo sequence -o results.mztab spectra.mgf
casanovo sequence spectra.mgf
```
![`casanovo sequence --help`](images/sequence-help.svg)

@@ -105,10 +108,10 @@ This will write peptide predictions for the given MS/MS spectra to the specified

### Evaluate *de novo* sequencing performance

To evaluate _de novo_ sequencing performance based on known mass spectrum annotations, use the `casanovo evaluate` command:
To evaluate _de novo_ sequencing performance based on known mass spectrum annotations, use the `casanovo sequence` command with the `--evaluate` option:

```sh
casanovo evaluate annotated_spectra.mgf
casanovo sequence annotated_spectra.mgf --evaluate
```
![`casanovo evaluate --help`](images/evaluate-help.svg)

Expand Down Expand Up @@ -144,7 +147,7 @@ casanovo sequence [PATH_TO]/sample_preprocessed_spectra.mgf
```

```{note}
If you want to store the output mzTab file in a different location than the current working directory, specify an alternative output location using the `--output` parameter.
If you want to store the output mzTab file in a different location than the current working directory, specify an alternative output location using the `--output_dir` parameter.
```
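
When Casanovo is launched from a script, the options named in this commit can be assembled into a command line programmatically. The sketch below only builds and prints the argument list; it does not run Casanovo. The helper `build_sequence_command` is hypothetical, and it assumes that `--output_dir` and `--output_root` each take a single value and that `--evaluate` is a plain flag, as the changelog and documentation changes above suggest. Check `casanovo sequence --help` for the authoritative option list.

```python
import shlex


def build_sequence_command(peak_file, output_dir=None, output_root=None, evaluate=False):
    """Assemble a `casanovo sequence` argument list (hypothetical helper)."""
    cmd = ["casanovo", "sequence", peak_file]
    if output_dir is not None:
        # Assumed to take a directory path, per the changelog entry.
        cmd += ["--output_dir", output_dir]
    if output_root is not None:
        # Assumed to take a file name root for the output files.
        cmd += ["--output_root", output_root]
    if evaluate:
        cmd.append("--evaluate")
    return cmd


cmd = build_sequence_command("spectra.mgf", output_dir="results", output_root="run1")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once Casanovo is installed and the options have been confirmed against the installed version.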

This job should complete in < 1 minute.
1 change: 1 addition & 0 deletions docs/index.md
@@ -17,6 +17,7 @@ maxdepth: 1
Getting Started <getting_started.md>
File Formats <file_formats.md>
Command Line Interface <cli.rst>
Nextflow Workflow <nextflow.md>
FAQs <faq.md>
Contributing <CONTRIBUTING.md>
Code of Conduct <CODE_OF_CONDUCT.md>
6 changes: 6 additions & 0 deletions docs/nextflow.md
@@ -0,0 +1,6 @@
# Casanovo Nextflow Workflow

To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://www.nextflow.io/) workflow is available.
In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo-compatible format using [msconvert](https://proteowizard.sourceforge.io/tools/msconvert.html), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
The workflow can be used on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -20,7 +20,7 @@ classifiers = [
requires-python = ">=3.8"
dependencies = [
"appdirs",
"lightning>=2.0",
"lightning>=2.1",
"click",
"depthcharge-ms>=0.2.3,<0.3.0",
"natsort",
13 changes: 9 additions & 4 deletions tests/unit_tests/test_unit.py
@@ -440,12 +440,17 @@ def test_aa_pep_score():
aa_scores_raw = np.asarray([0.0, 0.5, 1.0])

aa_scores, peptide_score = _aa_pep_score(aa_scores_raw, True)
np.testing.assert_array_equal(aa_scores, np.asarray([0.25, 0.5, 0.75]))
assert peptide_score == pytest.approx(0.5)
np.testing.assert_array_equal(aa_scores, np.asarray([0.0, 0.25, 0.5]))
assert peptide_score == pytest.approx(0.0)

aa_scores, peptide_score = _aa_pep_score(aa_scores_raw, False)
np.testing.assert_array_equal(aa_scores, np.asarray([0.25, 0.5, 0.75]))
assert peptide_score == pytest.approx(-0.5)
np.testing.assert_array_equal(aa_scores, np.asarray([0.0, 0.25, 0.5]))
assert peptide_score == pytest.approx(-1.0)

aa_scores_raw = np.asarray([1.0, 0.25])
aa_scores, peptide_score = _aa_pep_score(aa_scores_raw, True)
np.testing.assert_array_equal(aa_scores, np.asarray([0.75, 0.375]))
assert peptide_score == pytest.approx(0.5)


def test_calc_match_score():
