Discussion about results #56

Open

ypriverol opened this issue Mar 18, 2021 · 17 comments
@ypriverol
Collaborator

ypriverol commented Mar 18, 2021

Hi all: We have a student analyzing the data to finish the project 🚀. We have made some major advances in the data analysis, included in PR #55 (Excel files and figures in the results folder). Here are some details about what we want to do in terms of the analysis:

  • Take a set of RAW files and perform peptide/protein identification using MSGF+. We have selected PXD004732.
  • Cluster the mzML or MGF files derived from the RAW files using MaRaCluster or PRIDE Cluster.
    - Apply multiple consensus methods to each cluster (bin, average, best, most_similar); a sketch of the idea follows below.
    - Repeat the search from step one, but using the consensus spectra.
  • We then want to compare the peptide identification results between the different analysis methods (original spectra vs. all the consensus methods).

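To make the consensus step concrete, here is a minimal sketch of the idea behind the four methods (purely illustrative: the spectrum representation as (m/z array, intensity array) pairs and the helper names are assumptions, not the actual pipeline code):

```python
import numpy as np

def bin_consensus(spectra, bin_width=0.02, average=False):
    """'bin' / 'average' methods: pool all member peaks into fixed m/z bins and
    sum (or average) their intensities."""
    mz_all = np.concatenate([mz for mz, _ in spectra])
    int_all = np.concatenate([inten for _, inten in spectra])
    edges = np.arange(mz_all.min(), mz_all.max() + bin_width, bin_width)
    idx = np.digitize(mz_all, edges)
    mz_cons, int_cons = [], []
    for b in np.unique(idx):
        mask = idx == b
        mz_cons.append(mz_all[mask].mean())
        total = int_all[mask].sum()
        int_cons.append(total / len(spectra) if average else total)
    return np.array(mz_cons), np.array(int_cons)

def best_consensus(spectra, scores):
    """'best' method: keep the member spectrum with the highest quality score."""
    return spectra[int(np.argmax(scores))]

def most_similar_consensus(spectra, similarity):
    """'most_similar' method: keep the member closest to all other members under
    a pairwise similarity function (e.g. a normalized dot product)."""
    totals = [sum(similarity(s, t) for t in spectra if t is not s) for s in spectra]
    return spectra[int(np.argmax(totals))]
```
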
We have found some issues related to clustering that we need to solve first, which @jgriss can probably comment on:

  • The number of clusters for PRIDE Cluster is lower than for MaRaCluster. It would be good to start from similar numbers, because we are comparing consensus methods, not clustering methods. The parameters we used to run the clustering are discussed in results/supplement.docx.
  • In the results file results/identification-results-v4-10.xlsx we have added all the results for each RAW file.
  • We have added the MSGF+ q-value score distribution for each method (results/ PRIDE distribution of decoy(True) VS target(False)-test.png).

Some preliminary results show:

  • Searching the original spectra yields more peptide identifications than the clustering-based methods.
  • The binning consensus method performs better than the other consensus methods.
@ypriverol self-assigned this Mar 18, 2021
@ypriverol added the enhancement (New feature or request) and question (Further information is requested) labels Mar 18, 2021
@ypriverol
Collaborator Author

We have found the issue with PRIDE Cluster's generation of clusters. We will provide an update on this this week.

@jgriss
Collaborator

jgriss commented Mar 21, 2021

@ypriverol Thanks for the update! Was just starting to set up the tests!

Anything I can fix in the spectra-cluster code?

@jgriss
Collaborator

jgriss commented Mar 21, 2021

@ypriverol The results are in line with what's in the literature. If I'm not mistaken, we even had this in our first paper in 2012.

The theory back then was that a consensus spectrum always contains some noise and will therefore never be as good as the best measured spectrum in the dataset.

The question is whether we can use this to create an even better consensus algorithm.

@ypriverol
Collaborator Author

@jgriss we have found the right PRIDE Cluster threshold to produce the same number of clusters as MaRaCluster. I will update the issue this week.

@ypriverol
Collaborator Author

@jgriss @percolator :

We have the first results of the consensus clustering data analysis. The idea of this research is to compare different consensus methods through peptide identifications.

We found small differences between PRIDE Cluster and MaRaCluster. I would really like to remove one of them, @percolator, because any reviewer will focus on the clustering results rather than on the consensus spectrum generation. I don't mind making the comparison with MaRaCluster rather than PRIDE Cluster, but I really don't want to make the comparison about clustering algorithms. The original results can be seen here: https://github.com/ypriverol/specpride/blob/dev/results/discussion.adoc

What do you think? @percolator @jgriss

@jgriss
Collaborator

jgriss commented Apr 6, 2021

Hi @ypriverol ,

I'm just looking through the clustering nextflow workflow: In the current version in the repo, there are no arguments passed to MaRaCluster.

The default precursor tolerance, for example, is 20 ppm for MaRaCluster while you set 10 ppm for spectra-cluster.

Also, I failed to find the step where you merge the clustering results with the MGF files. Which MaRaCluster threshold are you using?
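In case it is useful, here is a minimal sketch of how such a merge step could look. It assumes MaRaCluster's blank-line-separated cluster output (one `file<TAB>scan` line per spectrum) and pyteomics for MGF parsing; the helper names are hypothetical and would need adapting to the actual workflow:

```python
from collections import defaultdict
from pyteomics import mgf

def read_maracluster_clusters(path):
    """Assumed format: one 'file<TAB>scan' line per spectrum, with blank lines
    separating clusters; returns {(file, scan): cluster_id}."""
    assignment, cluster_id = {}, 0
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                cluster_id += 1  # blank line marks the start of a new cluster
                continue
            raw_file, scan = line.split("\t")[:2]
            assignment[(raw_file, int(scan))] = cluster_id
    return assignment

def group_mgf_by_cluster(mgf_path, raw_file, assignment):
    """Group the spectra of one MGF file by their assigned cluster id."""
    clusters = defaultdict(list)
    for spectrum in mgf.read(mgf_path):
        # Assumes a SCANS entry in the MGF spectrum header; adapt to the real files.
        scan = int(spectrum["params"].get("scans", -1))
        cluster_id = assignment.get((raw_file, scan))
        if cluster_id is not None:
            clusters[cluster_id].append(spectrum)
    return clusters
```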

Finally, both MaRaCluster and spectra-cluster have built-in consensus spectrum algorithms. How about evaluating them as well?

Kind regards,
Johannes

@ypriverol
Collaborator Author

ypriverol commented Apr 6, 2021

After a brief discussion today with @percolator and @jgriss we decided the following things:

  • The MaRaCluster results will be used in the main manuscript and the PRIDE Cluster results will be moved to the supplement, to put the focus on the consensus algorithms rather than on a clustering benchmark.
  • We need a larger dataset (~1M spectra) for a better benchmark; the current data is about 50k spectra per file.
  • We need to update and better document the pipelines, as suggested by @jgriss.

@bittremieux
Collaborator

I have recently compared a few clustering tools, including spectra-cluster and MaRaCluster: https://www.biorxiv.org/content/10.1101/2021.02.05.429957v2

[embedded figure: clustering tool benchmark from the preprint linked above]

Both of these can generate a very high number of small clusters compared to other tools (Figure 2). This is an important aspect to keep in mind. For example, spectra-cluster and MaRaCluster might split large clusters corresponding to the same peptide into several small clusters more often than other clustering tools. As a result, searching the clustered data will take longer while the number of unique peptides that can be identified should be similar.

Rather than trying to get the same number of clusters out of each tool, I think it could be relevant to get a clustering result from each tool at a comparable number of incorrectly clustered spectra, as in my evaluation. Next, as you already suggest, I think it's a good idea not to approach this by contrasting the different tools with each other, but rather to evaluate which representative strategy works best for each tool.
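As a rough illustration (a sketch with assumed inputs, not the code behind the figure), the fraction of incorrectly clustered spectra can be computed from the cluster assignments and the peptide identifications like this:

```python
from collections import Counter

def incorrect_cluster_fraction(cluster_to_peptides):
    """`cluster_to_peptides` maps a cluster id to the peptide assignments of its
    identified member spectra; a spectrum counts as incorrectly clustered when
    its peptide differs from the majority peptide of its cluster."""
    incorrect, total = 0, 0
    for peptides in cluster_to_peptides.values():
        if len(peptides) < 2:
            continue  # singletons cannot be incorrectly clustered
        majority_count = Counter(peptides).most_common(1)[0][1]
        incorrect += len(peptides) - majority_count
        total += len(peptides)
    return incorrect / total if total else 0.0
```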

I still think it's valuable to include multiple tools, not only MaRaCluster. A single tool might do something funky or have some special properties. By getting a trend for multiple tools we'll have less "overfitting" and get a more generalizable insight.

Different tools produce clusters with different characteristics that may be suited for alternative downstream applications. There likely won't be a single best answer. Instead, learning general trends of what works well and what doesn't will be valuable, without having to explicitly compare different tools to each other.

@bittremieux
Collaborator

bittremieux commented Apr 29, 2021

The analysis above was performed on the Kim draft human proteome, which consists of ~25M spectra. I already have clustering results from all these different tools on that dataset, so we might be able to re-use those if that fits as a bigger dataset.

The clustering data is actually also available here: https://zenodo.org/record/4721496

@jgriss
Collaborator

jgriss commented Apr 30, 2021

@bittremieux Thanks a lot for sharing! That's a very nice benchmark!

@ypriverol I very much like @bittremieux 's idea to use these results as a basis for the consensus spectrum test. The dataset is well known and sufficiently large.

Would it be possible to adapt the pipeline to use these results as well?

@percolator
Collaborator

percolator commented Apr 30, 2021 via email

@jgriss
Collaborator

jgriss commented Apr 30, 2021

Hi Lukas,

One aspect that should maybe be discussed is that, based on Wout's results, both msCRUSH and falcon create larger clusters than our tools. This could improve the consensus spectrum quality.

How about keeping the dataset but adding falcon and msCRUSH? If the results are very similar, everything can go into the supplementary material. But if they are not, they could point in the right direction.

Kind regards,
Johannes

@bittremieux
Collaborator

[...] we came to the conclusion that we might include multiple clustering tools, but that we should strive to keep such results in the supplement unless the results are very consistent.

We seem to be pretty much in agreement. It's a valid concern that we don't want a competition between different clustering tools. However, it's my hope that by having results from multiple tools, a general trend will emerge. That would make for a stronger message than just having a single tool, with its particular idiosyncrasies and cluster characteristics. And a key message will probably be that there is no single "best" tool, but that different tools produce different results that might be more or less applicable to different use cases.

I'm not the biggest fan of using the ProteomeTools dataset, because it's not a realistically complex sample. Unfortunately, of course with the Kim dataset (and other biological datasets), there is no ground truth. It's maybe a bit inconvenient, but I think that using peptide identifications as a proxy for the ground truth should be fine. After all, that's what the field has been doing for decades.
There will indeed be some noise in the labels. When performing the above benchmark I inspected several clusters manually, and some "incorrectly" clustered spectra have highly unlikely, dissimilar peptide assignments that seemed to be a scoring artifact rather than genuinely different spectra. However, this is likely a similar problem for all clustering tools, and should presumably not favor one tool over another.

Let me see if I can find a bit of time this weekend to export representative spectra for the clustering results with the different tools that I already have. If it works, we might be able to relatively quickly check what the results look like and decide how to frame the story based on that.

In the figure above I only considered clusters with at least two spectra as valid clusters (otherwise you're always able to cluster 100% of the data of course). I guess for this analysis we also want to include singleton clusters, because the goal is to maximize the number of identifications?

@bittremieux
Collaborator

@percolator Do you mind giving me direct commit rights to this repository? Thanks.

@ypriverol
Collaborator Author

First of all, thanks @bittremieux @jgriss and @percolator for the discussion. Some thoughts here:

1- First, we will use multiple clustering algorithms, as we originally agreed, @bittremieux @jgriss @percolator. However, within the main manuscript we will discuss one clustering algorithm, and all the other tools can be moved to the supplement. The major reason is that we don't want to drift toward clustering benchmarking; we have done that already, and multiple papers have done it as well. But I like the idea of finding out which consensus algorithms work better in combination with which clustering algorithms.

2- Currently, the student is testing a 1M MS/MS spectrum dataset in addition to the two datasets already tested. While I like the idea of testing the 25M spectra, in the current manuscript we always compare against the original identification numbers (without clustering). Do you have those numbers, @bittremieux?

3- The student is now re-running all the data through the identification pipeline because we have some inconsistencies with the identifications and MSGF+. I will update the results at the end of the week.

@bittremieux
Collaborator

Yes, the identifications I'm using are from the MassIVE reanalysis of the draft human proteome dataset (RMSV000000091.3). That's also what I used for the comparisons in the figure above.

@jgriss
Collaborator

jgriss commented May 4, 2021

Hi everyone,

I very much agree with @ypriverol and @percolator that we should be careful not to let the focus move to the clustering algorithms. Nevertheless, the size and purity of clusters may be an important factor in the quality of consensus spectra.

In order to strengthen the focus on the consensus algorithm, why don't we add a completely artificial dataset where the consensus spectrum is calculated based on a certain portion of spectra that were identified as the same peptide? This fraction could be changed in order to create "clusters" with different sizes. This could allow us to explicitly study the phenomenon of cluster size without using any clustering algorithm.
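A sketch of what I have in mind (purely illustrative; the input format and function names are assumptions): group the identified spectra by peptide and draw a chosen fraction of each group as an artificial "cluster", so that the cluster size can be controlled directly without running any clustering tool.

```python
import random
from collections import defaultdict

def build_artificial_clusters(psms, fraction, min_size=2, seed=42):
    """`psms` is an iterable of (spectrum_id, peptide) pairs. For every peptide,
    a random `fraction` of its identified spectra is drawn to form one artificial
    cluster; varying `fraction` directly varies the cluster sizes."""
    by_peptide = defaultdict(list)
    for spectrum_id, peptide in psms:
        by_peptide[peptide].append(spectrum_id)
    rng = random.Random(seed)
    clusters = {}
    for peptide, spectrum_ids in by_peptide.items():
        size = int(round(fraction * len(spectrum_ids)))
        if size >= min_size:
            clusters[peptide] = rng.sample(spectrum_ids, size)
    return clusters
```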

I don't suggest this should be the only comparison, but an additional one to what's already planned.

Kind regards,
Johannes
