DeepScore: Community curated scoring, supercharged with AI #14

straussmaximilian opened this issue Oct 13, 2022 · 3 comments


[Image: a dog that a trained deep learning classifier classifies as Harrison Ford]
(Image taken from https://twitter.com/afterglow2046/status/1197271037009973251)
The trained deep learning classifier predicts that this image of a dog is an image of Harrison Ford with a 99% probability. The probability that the image of the dog is NOT Harrison Ford is predicted to be 1%. If Harrison Ford is a target and NOT Harrison Ford a decoy, how would this example translate to proteomics?

Abstract

The state of the art for identification in mass-spectrometry-based proteomics is to generate "nonsense" decoy data and to determine a score cutoff based on a false discovery rate (FDR). Nowadays, machine learning (ML) and deep learning (DL) algorithms are used to learn how to optimally distinguish targets from decoys. While this drastically increases sensitivity, it comes at the cost of explainability, and human-chosen acceptance criteria are replaced with black-box models. This can hinder acceptance in clinical practice, e.g., in peptidomics, where upregulated proteins are validated by inspecting raw data peaks. While other DL-driven domains allow straightforward human validation of the models, e.g., imaging, speech, text, or inspecting a predicted structure from AlphaFold, this is much more challenging for proteomics. Here, we aim to explore the limitations of the current scoring approach and provide potential solutions. We revisit the idea of confidence by trying to artificially increase identifications with nonsense features, hard decoys, or data leakage. Next, we will build an interactive tool to validate identifications manually and assign human confidence scores. With this, we create a training dataset and build an ML- or DL-based model to rescore identifications and assign predicted human-level confidence scores.
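
For orientation, a minimal sketch of the classical target-decoy score cutoff described above; the array names and the simple, unmonotonized FDR estimate are assumptions for illustration, not an existing implementation:

```python
import numpy as np

def score_cutoff_at_fdr(scores, is_decoy, fdr=0.01):
    """Return the lowest score cutoff whose decoy-estimated FDR stays below `fdr`."""
    order = np.argsort(scores)[::-1]                     # best-scoring PSMs first
    decoy_hits = np.cumsum(is_decoy[order])
    target_hits = np.cumsum(~is_decoy[order])
    est_fdr = decoy_hits / np.maximum(target_hits, 1)    # decoys / targets above cutoff
    passing = np.where(est_fdr <= fdr)[0]
    return scores[order][passing[-1]] if passing.size else None

# toy usage with simulated scores: targets score slightly higher than decoys
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1, 1, 1000), rng.normal(0, 1, 1000)])
is_decoy = np.concatenate([np.zeros(1000, bool), np.ones(1000, bool)])
print("1% FDR score cutoff:", score_cutoff_at_fdr(scores, is_decoy))
```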

Project Plan

The hackathon will be organized similarly to a SCRUM process. First, we create a user story map to agree on a prioritized list of things we would like to do. Next, we will assess the workload for each task with story points and, depending on the team size and skillset, make a sprint plan for what we can achieve in the given timeframe.
Some of the potential tasks could be:
Some of the potential tasks could be:

Assessing Confidence

  • Hacks: Supplement random numbers to an ML scoring system and investigate the performance (see the sketch after this list)
  • Hard decoys: Provide harder decoys and investigate the performance
  • Data Leakage: Gradually leak training data to a scoring system and investigate the performance
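
For the "Hacks" item, a minimal sketch of how random features could be supplemented; psm_features.csv, its is_decoy column, and the scikit-learn classifier are hypothetical placeholders for whatever scoring system we end up probing:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# one row per PSM: search-engine features plus a boolean is_decoy label (assumed layout)
psms = pd.read_csv("psm_features.csv")
X = psms.drop(columns=["spectrum_id", "is_decoy"], errors="ignore").to_numpy()
y = psms["is_decoy"].to_numpy()

# append purely random "nonsense" columns that cannot carry real signal
X_nonsense = np.hstack([X, rng.normal(size=(X.shape[0], 10))])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
auc_base = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
auc_hack = cross_val_score(clf, X_nonsense, y, cv=5, scoring="roc_auc").mean()

# if the random columns noticeably change performance, the evaluation is suspect
print(f"AUC without nonsense features: {auc_base:.3f}")
print(f"AUC with nonsense features:    {auc_hack:.3f}")
```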

Interactive Tool

  • Frontend that shows raw data and accepts user input to assign confidence scores (see the sketch after this list)
  • Backend with database or functionality to merge multiple user sessions
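
One possible shape for the frontend, shown here as a minimal streamlit sketch; spectra.parquet, its columns, and annotations.csv are hypothetical, and the multi-user backend is left out:

```python
import pandas as pd
import streamlit as st

# raw-data export with spectrum_id, mz, and intensity columns (assumed layout)
spectra = pd.read_parquet("spectra.parquet")

st.title("DeepScore manual curation")
spectrum_id = st.selectbox("Identification", spectra["spectrum_id"].unique())
peaks = spectra[spectra["spectrum_id"] == spectrum_id]

# show the raw peaks of the selected identification
st.bar_chart(peaks.set_index("mz")["intensity"])

# collect a human confidence score and append it to a local annotation file
confidence = st.slider("Human confidence", 0.0, 1.0, 0.5)
if st.button("Save annotation"):
    pd.DataFrame([{"spectrum_id": spectrum_id, "confidence": confidence}]).to_csv(
        "annotations.csv", mode="a", header=False, index=False
    )
    st.success("Annotation saved")
```

Such a script would be launched with `streamlit run app.py`; merging multiple user sessions would then be handled by the backend.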

Deep-learning Score

  • Extract raw data identifications
  • Train a model based on the human-supplied confidence scores (see the sketch after this list)
  • Perform rescoring on existing studies
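
A minimal sketch of the rescoring step, reusing the hypothetical psm_features.csv and annotations.csv files from the sketches above; a scikit-learn regressor stands in for a later DL model:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

features = pd.read_csv("psm_features.csv")
labels = pd.read_csv("annotations.csv", names=["spectrum_id", "confidence"])
data = features.merge(labels, on="spectrum_id")

X = data.drop(columns=["spectrum_id", "confidence", "is_decoy"], errors="ignore")
y = data["confidence"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# learn to predict the human-supplied confidence score from the engine features
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# held-out error gives a first impression of how well human confidence is recoverable
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```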

Technical Details

The default language should be Python, but I am open to everything that gets the job done better.
The tools mentioned below are suggestions – there are a lot of great tools out there that I am not aware of, and we should collect and decide what to use at the beginning of the hackathon.

Hardware

• Laptop; a decent GPU is a plus (set up drivers etc. for compute)
• There is always Google Colab as a fallback
• For heavier workloads I have access to a high-performance cluster
• Alternatively, we could rent compute from Amazon or a similar provider.

Datasets

There are a lot of datasets out there we could use, but we will probably narrow this down once we have discussed the scope.

Feasibility

I have some preliminary human-curated data and some preliminary tools, so we could start with existing code or start from scratch. Most of the modules can be worked on in parallel.

Contact Information

Maximilian Strauss [email protected] or [email protected]
Mann Group
Novo Nordisk Foundation Center for Protein Research
University of Copenhagen

Also, feel free to reply to this issue with questions or comments. Thanks!

@tobiasko added the enhancement and selected labels and removed the enhancement label on Nov 8, 2022
tobiasko (Contributor) commented on Nov 8, 2022:

Dear @straussmaximilian ,

I am happy to inform you that your proposal has been selected for the DevMeeting2023! Participants will decide which hackathon to join after the pitch on Monday.

Best,
Tobi

tobiasko (Contributor) commented:

Hello everyone,

I just created a Slack workspace for the DevMeeting and a channel named deepscore for this hack. You should receive an invite to join by email.

Best,
Tobi

straussmaximilian (Author) commented:

Summary:

DeepScore: Community curated scoring, supercharged with AI

Machine learning (ML) and, in particular, deep learning (DL) can be successfully applied across the entire analysis pipeline in MS-based proteomics. We investigated potential pitfalls that can arise when they are applied incorrectly. Motivated by creating a dummy decoy classifier that seemingly boosts identifications yet is actually unable to learn, we systematically assessed how potential data leakage could be identified. We found that comparing the decoy and false-positive score distributions is indicative of leaked decoy information, and that a Kolmogorov-Smirnov-type test allows this to be quantified sensitively.
Our investigation also revealed that up to 5% of peptides identified with methionine oxidation are reassigned when searched again without allowing this modification, calling the set false discovery rate of 1% into question.
Additionally, our manual inspection of 1618 timsTOF identifications from a DIA-NN search with a 5% FDR revealed that approximately 34% of the results would not be considered confidently identified by human inspection.
Lastly, we investigated the potential of clustering as an unsupervised method for distinguishing targets from decoys in scoring.
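
For illustration, a minimal sketch of the decoy vs. false-positive comparison mentioned above, using scipy's two-sample Kolmogorov-Smirnov test on placeholder score arrays (real scores would come from a search):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
decoy_scores = rng.normal(loc=0.0, scale=1.0, size=5000)          # placeholder decoy scores
false_positive_scores = rng.normal(loc=0.2, scale=1.0, size=500)  # placeholder FP scores

# with no leakage the two distributions should look alike; leaked decoy
# information tends to pull them apart, which the KS statistic picks up
stat, p_value = ks_2samp(decoy_scores, false_positive_scores)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3g}")
```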
