Benchmarking of search engines with the ground truth of spatial proteomics datasets #404
JuliaS92 started this conversation in "Potential new module to discuss"
- Reply: I would suggest also providing a FASTA file with the raw files, so that everybody works with the same sequences (target sequences plus contaminants).
aim of the new module
Benchmarks of search engines are crucial for selecting optimal tools in computational proteomics. Current benchmarks typically assess depth, coefficient of variation, and accuracy in mixed-species experiments. While these mixed-species benchmarks represent significant progress in software evaluation, they address only one of many use cases in proteomics and diverge from the more common single-species experiments. Establishing a testable ground truth in real-life datasets remains challenging. However, spatial proteomics and SEC-MS experiments offer an inherent biochemical ground truth, as members of bona fide protein complexes exhibit near-identical profiles. We propose leveraging this principle for software benchmarking, building upon previous work with dynamic organellar maps. By combining established measures with a carefully selected set of reference datasets, we aim to develop a comprehensive ProteoBench module. This will provide an additional software benchmark that specifically addresses single-species performance and usability in profiling experiments, thereby enhancing the evaluation of proteomics software tools.
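To make the principle concrete, here is a minimal sketch of how within-complex profile similarity could be scored. It assumes a pandas DataFrame of normalized protein profiles (proteins as rows, fractions as columns) and a hypothetical complex annotation mapping; none of these names reflect an existing DOM-ABC or ProteoBench interface.

```python
import itertools
import numpy as np
import pandas as pd

def within_complex_similarity(profiles: pd.DataFrame,
                              complexes: dict[str, list[str]]) -> pd.Series:
    """Median pairwise Pearson correlation of member profiles per complex.

    profiles: proteins (rows) x fractions (columns), already normalized.
    complexes: hypothetical mapping of complex name -> member protein IDs.
    """
    scores = {}
    for name, members in complexes.items():
        present = [p for p in members if p in profiles.index]
        if len(present) < 2:
            continue  # need at least two members to form a pair
        # After transposing, columns are proteins, so corr() gives
        # protein-by-protein correlations across fractions.
        corr = profiles.loc[present].T.corr(method="pearson")
        pairs = [corr.iloc[i, j]
                 for i, j in itertools.combinations(range(len(present)), 2)]
        scores[name] = float(np.median(pairs))
    return pd.Series(scores, name="median_within_complex_r")
```

Members of bona fide complexes should score close to 1; correlations for random protein pairs would provide a null distribution to compare against.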
full description of the new module
DDA DOMS: https://www.ebi.ac.uk/pride/archive/projects/PXD034962
DIA DOMS: https://www.ebi.ac.uk/pride/archive/projects/PXD034971
In theory, several datasets would be suitable for benchmarking different quantification methods with the same concept.
Process the data and upload protein group quantifications
Protein group files
See https://www.nature.com/articles/s41467-023-41000-7
The library can be configured for different data sources and can generate all metrics. One big open question is how to compare robustly between runs, since protein complex coverage can differ. DOM-ABC always requires all files to make the benchmark comparable, so a reduced amount of data would need to be stored for every run. Deciding how to handle this is probably the biggest bottleneck.
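One possible way to handle this, sketched below under the assumption that a small per-run summary is acceptable, is to persist only per-complex scores and a few run-level aggregates instead of the full quantification matrix. The field names and layout are illustrative, not an existing ProteoBench format.

```python
import json
import pandas as pd

def reduced_run_summary(profiles: pd.DataFrame,
                        complex_scores: pd.Series,
                        run_id: str) -> dict:
    """Compact, comparable summary of one search-engine run (illustrative)."""
    return {
        "run_id": run_id,
        "n_proteins": int(profiles.shape[0]),
        # Profiled depth: proteins quantified in every fraction.
        "n_complete_profiles": int(profiles.dropna().shape[0]),
        # Per-complex scores keep the benchmark comparable even when
        # complex coverage differs between runs.
        "complex_scores": complex_scores.round(4).to_dict(),
    }

# Hypothetical usage:
# summary = reduced_run_summary(profiles, scores, "PXD034962_engineX")
# with open("run_summary.json", "w") as fh:
#     json.dump(summary, fh, indent=2)
```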
Several metrics are possible; I would suggest three: profiled depth, complex scatter, and reproducibility, as in Figure S1B of https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-023-41000-7/MediaObjects/41467_2023_41000_MOESM1_ESM.pdf
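A hedged sketch of these three measures, following the figure descriptions rather than the actual DOM-ABC implementation (the distance measure in complex_scatter is a placeholder):

```python
import numpy as np
import pandas as pd

def profiled_depth(profiles: pd.DataFrame) -> int:
    """Number of proteins with a complete (no missing values) profile."""
    return int(profiles.dropna().shape[0])

def complex_scatter(profiles: pd.DataFrame, members: list[str]) -> float:
    """Mean Euclidean distance of member profiles to the complex centroid
    (placeholder for the scatter measure used in the paper)."""
    sub = profiles.loc[[m for m in members if m in profiles.index]].dropna()
    centroid = sub.mean(axis=0)
    return float(np.linalg.norm(sub - centroid, axis=1).mean())

def reproducibility(profiles_a: pd.DataFrame,
                    profiles_b: pd.DataFrame) -> float:
    """Median per-protein Pearson correlation between two replicate maps."""
    shared = profiles_a.index.intersection(profiles_b.index)
    r = [profiles_a.loc[p].corr(profiles_b.loc[p]) for p in shared]
    return float(np.median(r))
```

Only profiled_depth is unambiguous as written; the other two would need to match DOM-ABC's exact definitions before being used for a ranked benchmark.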
potential reviewers
No response
Will you be able to work on the implementation (coding) yourself, with additional help from the ProteoBench maintainers?
any other information
This will hopefully be addressed at the EuBIC Developer meeting 2025 - see the related hackathon proposal here: EuBIC/EuBIC2025#9