diff --git a/docs/index.rst b/docs/index.rst index 60c4a3f6..7f369928 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,5 +1,5 @@ -ProteoBench -=========== +Overview of ProteoBench ++++++++++++++++++++++++ .. toctree:: :maxdepth: 1 @@ -13,8 +13,44 @@ ProteoBench Contributing Changelog -Project Overview ------------------- + +ProteoBench is an open and collaborative platform for community-curated benchmarks for proteomics +data analysis pipelines. Our goal is to allow a continuous, easy, and controlled comparison of +proteomics data analysis workflows. + +ProteoBench provides a centralized web platform for developers and end-users to compare proteomics +data analysis pipelines. This community-curated effort will allow for an easy and controlled comparison +of tools developed or used by the participants with other state-of-the-art pipelines for specific +applications. The goal is not to select a single best one-fits-all data analysis workflow, but to allow: + +1. **end-users** to identify a good workflow to fulfill their specific needs +2. **developers** to identify the specific strengths and weaknesses in their workflows, guiding the development process +3. **the field** to easily position a newly published workflow within the context of the existing state-of-the-art + +Description +=========== + +Participants (both end-users and developers) can download a set of input files (e.g., raw MS files, +search databases, or spectral libraries) tailored to specific benchmark metrics. They can then analyze +the data with any workflow, and upload the results in a homogenized format. A set of metrics will be +retrieved or calculated from these results and can be visualized alongside all the other metrics +calculated from the results of other participants (or of curated benchmark runs). + +Goals and non-goals +=================== + +ProteoBench: + +1. Allows for an easy and controlled comparison of existing data analysis workflows +2. 
Provides a frame of reference for newly developed workflows +3. Documents and implements benchmarks that each highlight strengths or weaknesses of data analysis workflows (or individual workflow steps) +4. Evolves continuously, according to the needs of the field +5. **DOES NOT** point to a single best one-fits-all data analysis workflow +6. **SHOULD NOT** be used as evidence for generalized statements about a workflow’s performance +7. **SHOULD NOT** be used by developers as the single performance measure of their workflow + +Organization +============ The ProteoBench project is divided into two main parts: diff --git a/docs/modules/index.rst b/docs/modules/index.rst index 02d29481..c1be88f4 100644 --- a/docs/modules/index.rst +++ b/docs/modules/index.rst @@ -1,3 +1,10 @@ -####### -Modules -####### +############################# +ProteoBench benchmark modules +############################# + +.. toctree:: + :caption: ProteoBench benchmark modules + :glob: + :maxdepth: 1 + + * \ No newline at end of file diff --git a/docs/modules/module-life-cycle.md b/docs/modules/module-life-cycle.md new file mode 100644 index 00000000..91c4ece9 --- /dev/null +++ b/docs/modules/module-life-cycle.md @@ -0,0 +1,56 @@ +# Benchmark module life cycle + +## Proposal + +*Module proposals are not accepted yet. If you are interested, stay tuned.* + +Proposals can be started by opening a thread on GitHub Discussions, using a specific template. One of the ProteoBench maintainers will be assigned as editor. +At least two reviewers, independent of both existing ProteoBench contributors and the proposal submitters, should be contacted to review the proposal. + +Required information for a proposal: + +1. A **description of the new module**: + - Which aspect of proteomics data analysis is benchmarked? + - What is the added value compared to existing modules? +2. 
**Input data**: + - Provide a persistent identifier for the dataset (e.g., a PXD accession or DOI); if one does not exist yet, publish the data on Zenodo and provide the resulting DOI + - Optionally, provide the DOI of the dataset publication + - If only a subset of the referenced dataset is used, describe which subset. + - Describe why this dataset was selected. +3. **Workflow output data** (data to be uploaded to ProteoBench for metric calculation) +4. Specify the **minimal information needed from the workflow for metric calculation**. This can be an existing (standardized) file format, or a simple, well-described CSV file. +5. **Structured metadata**: What information is needed to sufficiently describe each benchmark run (e.g., workflow parameters)? +6. **Metrics**: + - Description of the benchmark metrics + - Methodology for calculating benchmark metrics +7. How can the metric for each benchmark run be shown in a **single visualization**? (Optionally, add a mock figure.) +8. **External reviewers**: Optionally propose at least two reviewers (see above) +9. Will you be able to work on the implementation (coding) yourself, with additional help from the ProteoBench maintainers? + +## Implementation + +*Implementation may or may not be done by the people who made the proposal.* + +Once fully reviewed and accepted, the editor moves the Proposal from Discussions to Issues. Based on this new issue, which describes the finalized Proposal and can be labeled “new benchmark module”, the module can be implemented and documented in the ProteoBench codebase. Finally, a pull request (PR) can be opened. + +After two positive code reviews by ProteoBench maintainers, the PR can be merged. The PR MUST meet the following requirements: +1. Proper documentation of the benchmarking module +2. Proper documentation of the code +3. All code should follow the Black code style +4. 
The latest commit of the PR should pass the continuous integration tests + +## Beta + +When the PR is merged, the new module enters a beta stage, where its code base is part of the Python package, and it is present on the web platforms. However, a prominent banner states that the module is still in “Beta”. After a minimum period of one month and approval by the initial proposers and external reviewers, the beta label can be removed. + +## Live + +The benchmark module is accessible to the community without restriction. + +## Archived + +Benchmark modules that are still valid but superseded by a better alternative. We still display the module on the web platforms and in the stable code base, but no longer accept new reference runs. A banner is also displayed, stating the status. + +## Withdrawn + +Benchmark modules that in hindsight proved to be flawed in any way and should no longer be used in any context. Code is removed from the Python package, and the module and its results are removed from the web platforms. diff --git a/docs/user-guide/glossary.md b/docs/user-guide/glossary.md index e71c3364..6c709c46 100644 --- a/docs/user-guide/glossary.md +++ b/docs/user-guide/glossary.md @@ -1,2 +1,33 @@ # Glossary +We adopted the ontology proposed by [PSI-MS](https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo). +Here are the terms specific to ProteoBench: + +## Benchmark module +A benchmark module compares the performance of different data analysis workflows based on module-specific, predefined metrics. It provides a specific set of input files (e.g., mass spectrometry files and a sequence database) and requires specific workflow output files (e.g., identified peptides). Based on these workflow output files, metrics are calculated as defined by the module and can be compared to previously validated benchmark runs. 
As each benchmark module defines specific workflow input files and metrics, it evaluates only a limited set of characteristics of the data analysis workflow. + +## Metric +A single number resulting from an aggregated calculation of the workflow output, which allows for a comparison between different benchmark runs. + +## Workflow +A combination of data analysis tools with associated parameters that takes workflow input files (provided by a benchmark module) and generates workflow output files. Based on the workflow output files, metrics can be calculated describing the workflow performance. + +## Benchmark run +The result of running a workflow with specific parameter values and calculating the benchmark metrics based on the workflow output. + +## Workflow metadata +A set of parameter values (e.g., missed cleavages, mass tolerance), workflow properties (e.g., software name, software version), and workflow configuration files that include all information required to fully understand and re-execute a given workflow. This should include the workflow options, as well as a detailed description of the click sequence and/or potential supplemental parameters unique to the workflow. + +### Structured workflow metadata +A fixed set of metadata to be provided through a form with every benchmark run that is submitted for validation. + +### Unstructured workflow metadata +Additional metadata that is specific to a workflow and therefore cannot be presented in a structured submission form; it requires a free-form text field instead. The metadata does not need to be written as full text, but should be fully comprehensible. + +## Workflow configuration files +Files that contain parameters for a workflow or for a data analysis tool within a workflow. These files can be specific to the workflow or to the data analysis tool and help to re-execute it with the same parameters (e.g., mqpar.xml). 
+ +## Validated benchmark run +A benchmark run accepted by the ProteoBench team to be made publicly available as part of the ProteoBench repository. For validation, the submission must include the workflow output files, structured metadata, unstructured metadata, and (if applicable) workflow configuration files. The workflow metadata must include all information needed to fully understand and re-execute the workflow; i.e., the benchmark run must be fully reproducible. + +
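To make the "metric" concept from the glossary concrete, here is a minimal, hypothetical Python sketch of how a module could aggregate a homogenized workflow output file into a single comparable number. The CSV column names, the mean-absolute-ratio-error metric, and the function name are all illustrative assumptions; they are not part of the actual ProteoBench codebase or of any existing module.

```python
import csv
import io
import statistics

# Hypothetical homogenized workflow output (NOT an actual ProteoBench
# format): one row per precursor, with an observed and an expected
# quantification ratio.
WORKFLOW_OUTPUT = """precursor,observed_ratio,expected_ratio
PEPTIDEA/2,1.9,2.0
PEPTIDEB/2,2.2,2.0
PEPTIDEC/3,0.6,0.5
"""


def mean_absolute_ratio_error(csv_text: str) -> float:
    """Aggregate a workflow's output into a single number (a "metric")
    that can be compared across benchmark runs."""
    rows = csv.DictReader(io.StringIO(csv_text))
    errors = [
        abs(float(row["observed_ratio"]) - float(row["expected_ratio"]))
        for row in rows
    ]
    return statistics.mean(errors)


metric = mean_absolute_ratio_error(WORKFLOW_OUTPUT)
```

Because every participant uploads results in the same homogenized format, the same aggregation can be applied to each submission, and the resulting single numbers can be plotted side by side with those of previously validated benchmark runs.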