From 8ab39b0d767155dc18dbcfe6ec4632f230e1dce9 Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Thu, 10 Oct 2024 14:16:40 -0700
Subject: [PATCH 01/14] nextflow documentation
---
 docs/index.md | 1 +
 docs/nextflow.md | 5 +++++
 2 files changed, 6 insertions(+)
 create mode 100644 docs/nextflow.md
diff --git a/docs/index.md b/docs/index.md
index f0c5700f..97a10325 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -17,6 +17,7 @@ maxdepth: 1
 Getting Started
 File Formats
 Command Line Interface
+Casanovo Nextflow Workflow
 FAQs
 Contributing
 Code of Conduct
diff --git a/docs/nextflow.md b/docs/nextflow.md
new file mode 100644
index 00000000..f8c44e27
--- /dev/null
+++ b/docs/nextflow.md
@@ -0,0 +1,5 @@
+# Casanovo Nextflow Workflow
+
+To simplify the process of setting up and running Casanovo, a dedicated Nextflow workflow is available.
+This Casanovo Nextflow workflow automates the installation and execution of Casanovo and its dependencies.
+For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
\ No newline at end of file
From 505a8cd2d6bf88c2eab30da29a9eedebc5b6e9d6 Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Tue, 15 Oct 2024 16:22:27 -0700
Subject: [PATCH 02/14] nextflow nav prefix, more nextflow docs details
---
 docs/index.md | 2 +-
 docs/nextflow.md | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/docs/index.md b/docs/index.md
index 97a10325..130e7cfa 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -17,7 +17,7 @@ maxdepth: 1
 Getting Started
 File Formats
 Command Line Interface
-Casanovo Nextflow Workflow
+Nextflow Workflow
 FAQs
 Contributing
 Code of Conduct
diff --git a/docs/nextflow.md b/docs/nextflow.md
index f8c44e27..2fc3837d 100644
--- a/docs/nextflow.md
+++ b/docs/nextflow.md
@@ -1,5 +1,6 @@
 # Casanovo Nextflow Workflow
-To simplify the process of setting up and running Casanovo, a dedicated Nextflow workflow is available.
-This Casanovo Nextflow workflow automates the installation and execution of Casanovo and its dependencies.
+To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://www.nextflow.io/) workflow is available.
+In addition to automatic the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
+The workflow may be ran on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
 For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
\ No newline at end of file
From 24b298ebaffa9d2ab443839477b2aec1aad98259 Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Mon, 21 Oct 2024 15:59:28 -0700
Subject: [PATCH 03/14] grammatical fixes
---
 docs/nextflow.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/docs/nextflow.md b/docs/nextflow.md
index 2fc3837d..33de2f46 100644
--- a/docs/nextflow.md
+++ b/docs/nextflow.md
@@ -1,6 +1,6 @@
 # Casanovo Nextflow Workflow
-To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://www.nextflow.io/) workflow is available.
-In addition to automatic the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
-The workflow may be ran on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
+To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://proteowizard.sourceforge.io/tools/msconvert.html) workflow is available.
+In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
+The workflow can be used on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
 For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
\ No newline at end of file
From 50283395819340cc38a7018ef07cd4ba4a0822db Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Tue, 22 Oct 2024 17:00:53 -0700
Subject: [PATCH 04/14] grammatical fixes
---
 docs/nextflow.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/nextflow.md b/docs/nextflow.md
index 33de2f46..7104ece3 100644
--- a/docs/nextflow.md
+++ b/docs/nextflow.md
@@ -1,6 +1,6 @@
 # Casanovo Nextflow Workflow
 To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://proteowizard.sourceforge.io/tools/msconvert.html) workflow is available.
-In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
+In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo-compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
 The workflow can be used on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
 For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
\ No newline at end of file
From 2b550fae9b0797435ec402cafe55940177974c02 Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Tue, 29 Oct 2024 16:46:59 -0700
Subject: [PATCH 05/14] fixed links
---
 docs/nextflow.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/nextflow.md b/docs/nextflow.md
index 7104ece3..b459e40e 100644
--- a/docs/nextflow.md
+++ b/docs/nextflow.md
@@ -1,6 +1,6 @@
 # Casanovo Nextflow Workflow
-To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://proteowizard.sourceforge.io/tools/msconvert.html) workflow is available.
\ No newline at end of file From 592c84ff1cbab15f3c6edd07516607e989dc8425 Mon Sep 17 00:00:00 2001 From: Lilferrit Date: Thu, 10 Oct 2024 14:16:40 -0700 Subject: [PATCH 06/14] nextflow documentation --- docs/index.md | 1 + docs/nextflow.md | 5 +++++ 2 files changed, 6 insertions(+) create mode 100644 docs/nextflow.md diff --git a/docs/index.md b/docs/index.md index f0c5700f..97a10325 100644 --- a/docs/index.md +++ b/docs/index.md @@ -17,6 +17,7 @@ maxdepth: 1 Getting Started File Formats Command Line Interface +Casanovo Nextflow Workflow FAQs Contributing Code of Conduct diff --git a/docs/nextflow.md b/docs/nextflow.md new file mode 100644 index 00000000..f8c44e27 --- /dev/null +++ b/docs/nextflow.md @@ -0,0 +1,5 @@ +# Casanovo Nextflow Workflow + +To simplify the process of setting up and running Casanovo, a dedicated Nextflow workflow is available. +This Casanovo Nextflow workflow automates the installation and execution of Casanovo and its dependencies. +For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#). \ No newline at end of file From 179b03b8e30c24746d47f18037507c027b07b674 Mon Sep 17 00:00:00 2001 From: Lilferrit Date: Tue, 15 Oct 2024 16:22:27 -0700 Subject: [PATCH 07/14] nextflow nav prefix, more nextflow docs details --- docs/index.md | 2 +- docs/nextflow.md | 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/index.md b/docs/index.md index 97a10325..130e7cfa 100644 --- a/docs/index.md +++ b/docs/index.md @@ -17,7 +17,7 @@ maxdepth: 1 Getting Started File Formats Command Line Interface -Casanovo Nextflow Workflow +Nextflow Workflow FAQs Contributing Code of Conduct diff --git a/docs/nextflow.md b/docs/nextflow.md index f8c44e27..2fc3837d 100644 --- a/docs/nextflow.md +++ b/docs/nextflow.md @@ -1,5 +1,6 @@ # Casanovo Nextflow Workflow -To simplify the process of setting up and running Casanovo, a dedicated Nextflow workflow is available. 
-This Casanovo Nextflow workflow automates the installation and execution of Casanovo and its dependencies.
+To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://www.nextflow.io/) workflow is available.
+In addition to automatic the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
+The workflow may be ran on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
 For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
\ No newline at end of file
From f6a7c74493c1699b5682a52207adac74529ac4a0 Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Mon, 21 Oct 2024 15:59:28 -0700
Subject: [PATCH 08/14] grammatical fixes
---
 docs/nextflow.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/docs/nextflow.md b/docs/nextflow.md
index 2fc3837d..33de2f46 100644
--- a/docs/nextflow.md
+++ b/docs/nextflow.md
@@ -1,6 +1,6 @@
 # Casanovo Nextflow Workflow
-To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://www.nextflow.io/) workflow is available.
-In addition to automatic the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
-The workflow may be ran on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
+To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://proteowizard.sourceforge.io/tools/msconvert.html) workflow is available.
+In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
+The workflow can be used on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
 For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
\ No newline at end of file
From 5c77436b567c38810ad42c2cf9ea0311f44f31cf Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Tue, 22 Oct 2024 17:00:53 -0700
Subject: [PATCH 09/14] grammatical fixes
---
 docs/nextflow.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/nextflow.md b/docs/nextflow.md
index 33de2f46..7104ece3 100644
--- a/docs/nextflow.md
+++ b/docs/nextflow.md
@@ -1,6 +1,6 @@
 # Casanovo Nextflow Workflow
 To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://proteowizard.sourceforge.io/tools/msconvert.html) workflow is available.
-In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
+In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo-compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
 The workflow can be used on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
 For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
\ No newline at end of file
From 0634f90b3f61bd277d67b217c403e6d686af4586 Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Tue, 29 Oct 2024 16:46:59 -0700
Subject: [PATCH 10/14] fixed links
---
 docs/nextflow.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/nextflow.md b/docs/nextflow.md
index 7104ece3..b459e40e 100644
--- a/docs/nextflow.md
+++ b/docs/nextflow.md
@@ -1,6 +1,6 @@
 # Casanovo Nextflow Workflow
-To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://proteowizard.sourceforge.io/tools/msconvert.html) workflow is available.
-In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo-compatible format using [msconvert](https://bioinformaticshome.com/tools/proteomics/descriptions/msconvert.html#gsc.tab=0), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
+To simplify the process of setting up and running Casanovo, a dedicated [Nextflow](https://www.nextflow.io/) workflow is available.
+In addition to simplifying the installation of Casanovo and its dependencies, the Casanovo Nextflow workflow provides an automated mass spectrometry data pipeline that converts input data files to a Casanovo-compatible format using [msconvert](https://proteowizard.sourceforge.io/tools/msconvert.html), infers peptide sequences using Casanovo, and (optionally) uploads the results to [Limelight](https://limelight-ms.org/) - a platform for sharing and visualizing proteomics results.
 The workflow can be used on POSIX-compatible (UNIX) systems, Windows using WSL, or on a cloud platform such as AWS.
 For more details, refer to the [Casanovo Nextflow Workflow Documentation](https://nf-ms-dda-casanovo.readthedocs.io/en/latest/#).
\ No newline at end of file
From a0c332db354144758f58aa06f42b46b51ae86f07 Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Tue, 8 Oct 2024 14:56:02 -0700
Subject: [PATCH 11/14] update Read the Docs with new functionality
---
 docs/faq.md | 15 +++++++++------
 docs/getting_started.md | 6 +++---
 2 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/docs/faq.md b/docs/faq.md
index 3462ae9d..c5264603 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -103,14 +103,17 @@ Training, validation, and test splits for the non-enzymatic dataset are availabl
 **How do I know which model to use after training Casanovo?**
-By default, Casanovo saves a snapshot of the model weights after every 50,000 training steps.
+By default, Casanovo evaluates the model validation performance every 50,000 training steps.
 Note that the number of samples that are processed during a single training step depends on the batch size.
 Therefore, the default training batch size of 32 corresponds to saving a model snapshot after every 1.6 million training samples.
-You can optionally modify the snapshot (and validation) frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `val_check_interval`), depending on your dataset size.
-Note that taking very frequent model snapshots will result in slower training time because Casanovo will evaluate its performance on the validation data for every snapshot.
-
-When saving a model snapshot, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file.
-After your training job is finished, you can identify the model that achieves the maximum peptide and amino acid precision from the log file and use the corresponding model snapshot.
+You can optionally modify the validation run frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `val_check_interval`), depending on your dataset size.
+Note that running model validation very frequently will result in slower training time because Casanovo will evaluate its performance on the validation data for every validation check.
+
+When running model validation, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file.
+At the end of each validation run and training epoch (one complete run over the training data), Casanovo will take a snapshot of the current model weights.
+After the training job is finished, the validation snapshot that achieved the lowest **validation loss** will ba saved to the output directory as `best.ckpt` if no custom output prefix is specified.
+Additionally, a snapshot of the model weights at the end of each **training** epoch will be saved to the output directory as `epoch=-step=.ckpt`.
+Snapshots from previous training epochs will be overwritten with the latest training snapshot at the end of each training epoch.
 **Even though I added new post-translational modifications to the configuration file, Casanovo didn't identify those peptides.**
diff --git a/docs/getting_started.md b/docs/getting_started.md
index 73cbd5f0..8e98a51e 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -96,7 +96,7 @@ We recommend a Linux system with a dedicated GPU to achieve optimal runtime perf
 To sequence your own mass spectra with Casanovo, use the `casanovo sequence` command:
 ```sh
-casanovo sequence -o results.mztab spectra.mgf
+casanovo sequence spectra.mgf
 ```
 ![`casanovo sequence --help`](images/sequence-help.svg)
@@ -105,10 +105,10 @@ This will write peptide predictions for the given MS/MS spectra to the specified
 ### Evaluate *de novo* sequencing performance
-To evaluate _de novo_ sequencing performance based on known mass spectrum annotations, use the `casanovo evaluate` command:
+To evaluate _de novo_ sequencing performance based on known mass spectrum annotations, use the `casanovo sequence` command with the `--evaluate` option:
 ```sh
-casanovo evaluate annotated_spectra.mgf
+casanovo sequence annotated_spectra.mgf --evaluate
 ```
 ![`casanovo evaluate --help`](images/evaluate-help.svg)
From 93130f0b995454c625530f731e81f5c053ebfad6 Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Tue, 8 Oct 2024 14:58:27 -0700
Subject: [PATCH 12/14] rephrasing
---
 docs/faq.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/faq.md b/docs/faq.md
index c5264603..ccd5622a 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -103,7 +103,7 @@ Training, validation, and test splits for the non-enzymatic dataset are availabl
 **How do I know which model to use after training Casanovo?**
-By default, Casanovo evaluates the model validation performance every 50,000 training steps.
+By default, Casanovo runs model validation every 50,000 training steps.
 Note that the number of samples that are processed during a single training step depends on the batch size.
 Therefore, the default training batch size of 32 corresponds to saving a model snapshot after every 1.6 million training samples.
 You can optionally modify the validation run frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `val_check_interval`), depending on your dataset size.
From f7e66db0cdf6cdb234326c03ce9db441148dbf9c Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Thu, 10 Oct 2024 13:55:01 -0700
Subject: [PATCH 13/14] update file formats section
---
 docs/file_formats.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/file_formats.md b/docs/file_formats.md
index b01e4c02..cc5ffcff 100644
--- a/docs/file_formats.md
+++ b/docs/file_formats.md
@@ -253,6 +253,6 @@ Similarly, in Casanovo evaluation mode only annotated MGF files are supported.
 During training, Casanovo will save **checkpoint files** at every `val_check_interval` steps, specified in the configuration.
-Model checkpoints will be saved in the `model_save_folder_path` folder with filename format `epoch=EPOCH-step=STEP.ckpt`, with `EPOCH` the epoch and `STEP` the training step at which the checkpoint was taken, helping you track progress and select the best model based on validation performance.
+Model checkpoints will be saved to the folder specified by the `--output_dir` command line option with filename format `epoch=EPOCH-step=STEP.ckpt`, with `EPOCH` the epoch and `STEP` the training step at which the checkpoint was taken, helping you track progress and select the best model based on validation performance.
From 3042ef7543b69aacfc34547f0a342f006d3e3b0d Mon Sep 17 00:00:00 2001
From: Lilferrit
Date: Tue, 29 Oct 2024 16:55:52 -0700
Subject: [PATCH 14/14] updated faq note; training faq section changes
---
 docs/faq.md | 12 ++++++------
 docs/getting_started.md | 2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/docs/faq.md b/docs/faq.md
index ccd5622a..614d0876 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -103,18 +103,18 @@ Training, validation, and test splits for the non-enzymatic dataset are availabl
 **How do I know which model to use after training Casanovo?**
+When running model validation, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file.
+At the end of each validation run and at the end of each training epoch (one complete run over the training data), Casanovo will take a snapshot of the current model weights.
+After the training job is finished, the validation snapshot that achieved the lowest **validation loss** will be saved to the output directory as `.best.ckpt`.
+Additionally, a snapshot of the model weights at the end of each **training** epoch will be saved to the output directory as `epoch=-step=.ckpt`.
+Snapshots from previous training epochs will be overwritten with the latest training snapshot at the end of each training epoch.
+
 By default, Casanovo runs model validation every 50,000 training steps.
 Note that the number of samples that are processed during a single training step depends on the batch size.
 Therefore, the default training batch size of 32 corresponds to saving a model snapshot after every 1.6 million training samples.
 You can optionally modify the validation run frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `val_check_interval`), depending on your dataset size.
 Note that running model validation very frequently will result in slower training time because Casanovo will evaluate its performance on the validation data for every validation check.
-When running model validation, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file.
-At the end of each validation run and training epoch (one complete run over the training data), Casanovo will take a snapshot of the current model weights.
-After the training job is finished, the validation snapshot that achieved the lowest **validation loss** will ba saved to the output directory as `best.ckpt` if no custom output prefix is specified.
-Additionally, a snapshot of the model weights at the end of each **training** epoch will be saved to the output directory as `epoch=-step=.ckpt`.
-Snapshots from previous training epochs will be overwritten with the latest training snapshot at the end of each training epoch.
-
 **Even though I added new post-translational modifications to the configuration file, Casanovo didn't identify those peptides.**
 Casanovo can only make predictions using post-translational modifications (PTMs) that were included when training the model.
diff --git a/docs/getting_started.md b/docs/getting_started.md
index 8e98a51e..58307175 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -144,7 +144,7 @@ casanovo sequence [PATH_TO]/sample_preprocessed_spectra.mgf
 ```
 ```{note}
-If you want to store the output mzTab file in a different location than the current working directory, specify an alternative output location using the `--output` parameter.
+If you want to store the output mzTab file in a different location than the current working directory, specify an alternative output location using the `--output_dir` parameter.
 ```
 This job should complete in < 1 minute.