DeepScore: Community curated scoring, supercharged with AI
(Image taken from https://twitter.com/afterglow2046/status/1197271037009973251) The trained deep learning classifier predicts that this image of a dog is an image of Harrison Ford with a 99% probability. The probability that the image of the dog is NOT Harrison Ford is predicted to be 1%. If Harrison Ford is a target and NOT Harrison Ford a decoy, how would this example translate to proteomics?
Abstract
The state of the art for identification in mass-spectrometry-based proteomics is to generate “nonsense” decoy data and to determine a score cutoff based on a false discovery rate (FDR). Nowadays, machine learning (ML) and deep learning (DL) algorithms are used to learn how to optimally distinguish targets from decoys. While this drastically increases sensitivity, it comes at the cost of explainability, and human-chosen acceptance criteria are replaced with black-box models. This can hinder acceptance in clinical practice, e.g., in peptidomics, where upregulated proteins are validated by inspecting raw data peaks. While other DL-driven domains allow straightforward human validation of the models, e.g., imaging, speech, text, or inspecting a predicted structure from AlphaFold, this is much more challenging for proteomics. Here, we aim to explore the limitations of the current scoring approach and provide potential solutions. We revisit the idea of confidence by trying to artificially increase identifications with nonsense features, hard decoys, or leaked data. Next, we will build an interactive tool to validate identifications manually and assign human confidence scores. With this, we create a training dataset and build an ML or DL model to rescore identifications and assign predicted human-level confidence scores.
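To make the target-decoy idea concrete, here is a minimal sketch of the classical FDR estimate the abstract refers to: rank all hits by score and, at each cutoff, estimate the FDR as the number of accepted decoys over the number of accepted targets. The function name and toy data are illustrative, not from any specific tool.

```python
import numpy as np

def target_decoy_fdr(scores, is_decoy):
    """Estimate the FDR at each score cutoff as #decoys / #targets
    among all hits scoring at or above the cutoff (classical
    target-decoy estimate), then enforce monotonicity for q-values."""
    order = np.argsort(scores)[::-1]          # best score first
    is_decoy = np.asarray(is_decoy)[order]
    decoys = np.cumsum(is_decoy)              # decoys accepted so far
    targets = np.cumsum(~is_decoy)            # targets accepted so far
    fdr = decoys / np.maximum(targets, 1)
    qvals = np.minimum.accumulate(fdr[::-1])[::-1]
    return order, qvals

# toy example: four confident targets, then a decoy sneaks in
scores = [9.0, 8.5, 8.0, 7.5, 7.0, 6.5]
decoy = [False, False, False, False, True, False]
order, q = target_decoy_fdr(scores, decoy)
print(q)  # [0. 0. 0. 0. 0.2 0.2]
```

The score cutoff for a given FDR (e.g., 1%) is then simply the lowest-ranked hit whose q-value stays below that threshold.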
Project Plan
The hackathon will be organized similarly to a SCRUM process. First, we create a user story map to agree on a prioritized list of things we would like to do. Next, we will assess the workload for each task with story points and, depending on the team size and skill set, make a sprint plan for what we can achieve in the given timeframe.
Some of the potential tasks could be:
Assessing Confidence
Hacks: Supplement random numbers to an ML-scoring system and investigate the performance
Hard decoys: Provide harder decoys and investigate the performance
Data Leakage: Gradually leak training data to a scoring system and investigate the performance
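A toy illustration of why the "nonsense features" and "data leakage" experiments above matter: with a memorizing model evaluated on its own training data, even purely random features appear perfectly discriminative, while an honest held-out evaluation exposes them as chance-level. The 1-nearest-neighbour setup here is a deliberately simple stand-in, not any real scoring system.

```python
import numpy as np

rng = np.random.default_rng(0)

# purely random "nonsense" features for 200 PSMs, half labeled target
X = rng.normal(size=(200, 16))
y = np.array([1] * 100 + [0] * 100)

def nn_predict(X_train, y_train, X_eval):
    """1-nearest-neighbour prediction (memorizes the training set)."""
    d = ((X_eval[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

# "leaky" evaluation: score the training data itself
train_acc = (nn_predict(X, y, X) == y).mean()

# honest evaluation: hold out half the data
X_tr, X_te, y_tr, y_te = X[::2], X[1::2], y[::2], y[1::2]
test_acc = (nn_predict(X_tr, y_tr, X_te) == y_te).mean()

print(train_acc)  # 1.0 — nonsense features look perfect
print(test_acc)   # roughly chance level on held-out data
```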
Interactive Tool
Frontend that shows raw data and accepts user input to assign confidence scores
Backend with database or functionality to merge multiple user sessions
Deep-learning Score
Extract raw data identifications
Train a model based on the human-supplied confidence scores
Perform rescoring on existing studies
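As a minimal sketch of the rescoring step, assume tabular per-identification features (e.g., search-engine score, mass error) and human-assigned confidence in [0, 1]; a simple linear least-squares fit stands in here for the eventual ML/DL model, and all data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic per-identification features and human confidence labels
X = rng.normal(size=(300, 3))
true_w = np.array([0.8, -0.3, 0.1])
human_conf = np.clip(
    0.5 + X @ true_w * 0.2 + rng.normal(scale=0.05, size=300), 0, 1
)

# least-squares fit: predict human confidence from the features
A = np.hstack([X, np.ones((300, 1))])   # add an intercept column
w, *_ = np.linalg.lstsq(A, human_conf, rcond=None)

# rescore "new" identifications with the fitted model
X_new = rng.normal(size=(5, 3))
pred = np.clip(np.hstack([X_new, np.ones((5, 1))]) @ w, 0, 1)
print(pred)  # predicted human-level confidence scores in [0, 1]
```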
Technical Details
The default language should be Python, but we are open to anything that gets the job done better.
The tools mentioned below are suggestions – there are a lot of great tools out there that I am not aware of, and we should collect and decide what to use at the beginning of the hackathon.
GitHub to host the code and manage the project board
Frontend: https://streamlit.io
ML/DL: e.g., PyTorch, or following the tutorials at https://www.proteomicsml.org
I use Visual Studio Code with Anaconda and have a couple of DL/ML environments ready to go.
Hardware
• Laptop; a decent GPU is a plus (set up drivers etc. for compute)
• There is always Google Colab as a fallback
• For heavier workloads I have access to a high-performance cluster
• Alternatively, we could rent compute on Amazon or a similar provider.
Datasets
There are a lot of datasets out there we could use, but we will probably narrow this down once we have discussed the scope.
Feasibility
I have some preliminary human-curated data and some preliminary tools, so we could start with existing code or from scratch. Most of the modules can be worked on in parallel.
Contact Information
Maximilian Strauss [email protected] or [email protected]
Mann Group
Novo Nordisk Foundation Center for Protein Research
University of Copenhagen
Feel free to reply to this issue with questions or comments. Thanks!
I am happy to inform you that your proposal has been selected for the DevMeeting2023! Participants will decide which hackathon to join after the pitch on Monday.
DeepScore: Community curated scoring, supercharged with AI
Machine learning (ML) and, in particular, deep learning (DL) can be successfully applied across the entire analysis pipeline in MS-based proteomics. We investigated potential pitfalls that can arise when they are applied incorrectly. Motivated by a dummy decoy classifier that seemingly boosts identifications yet is actually unable to learn, we systematically assessed how we could identify potential data leakage. We found that comparing decoy and false-positive distributions is an indicative feature for detecting leaked decoy information. Moreover, we found that a Kolmogorov-Smirnov-type test allows sensitive quantification of this.
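The comparison described above rests on the target-decoy assumption that decoy scores and the scores of false-positive targets follow the same distribution; if decoy information has leaked into training, the two distributions drift apart, which a two-sample Kolmogorov-Smirnov test can detect. The synthetic scores below (a simple mean shift standing in for leakage) are for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# under the target-decoy assumption, decoy scores and false-positive
# target scores should be drawn from the same distribution
decoy_scores = rng.normal(loc=0.0, scale=1.0, size=500)
fp_scores_ok = rng.normal(loc=0.0, scale=1.0, size=500)

# if decoy information leaked into training, the classifier pushes
# decoys down and the distributions diverge (modelled here as a shift)
fp_scores_leaky = rng.normal(loc=0.6, scale=1.0, size=500)

_, p_ok = ks_2samp(decoy_scores, fp_scores_ok)
_, p_leaky = ks_2samp(decoy_scores, fp_scores_leaky)

print(p_ok)     # no evidence the distributions differ
print(p_leaky)  # tiny p-value: leakage suspected
```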
Our investigation also revealed that up to 5% of identified peptides with oxidation of methionine are reassigned when searched again without allowing this modification, calling the set false discovery rate of 1% into question.
Additionally, our manual inspection of 1618 timsTOF identifications from a DIA-NN search with a 5% FDR revealed that approximately 34% of the results would not be considered confidently identified by human inspection.
Lastly, we investigated the potential of clustering as an unsupervised method for distinguishing targets from decoys in scoring.