Discussion about results #56

Open

ypriverol opened this issue Mar 18, 2021 · 17 comments
@ypriverol
Collaborator

ypriverol commented Mar 18, 2021

Hi all: We have a student analyzing the data to finish the project 🚀. We have made some major advances in the data analysis, included in PR #55 (Excel files and figures in the results folder). Here are some details about what we want to do in terms of the analysis:

  • Take a set of RAW files and perform peptide/protein identification using MSGF+. We have selected PXD004732.
  • Cluster the mzML or MGF files derived from the RAW files using MaRaCluster or PRIDE Cluster.
    - Apply multiple consensus methods to each cluster (bin, average, best, most_similar); a sketch of the idea follows below.
    - Repeat the search from step one, but using the consensus spectra.
  • We then want to compare the peptide identification results between the different analysis methods (original spectra vs. all the consensus methods).

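To make the consensus step concrete, here is a minimal sketch of the idea behind the four methods (purely illustrative: the spectrum representation as (m/z array, intensity array) pairs and the helper names are assumptions, not the actual pipeline code):

```python
import numpy as np

def bin_consensus(spectra, bin_width=0.02, average=False):
    """'bin' / 'average' methods: pool all member peaks into fixed m/z bins and
    sum (or average) their intensities."""
    mz_all = np.concatenate([mz for mz, _ in spectra])
    int_all = np.concatenate([inten for _, inten in spectra])
    edges = np.arange(mz_all.min(), mz_all.max() + bin_width, bin_width)
    idx = np.digitize(mz_all, edges)
    mz_cons, int_cons = [], []
    for b in np.unique(idx):
        mask = idx == b
        mz_cons.append(mz_all[mask].mean())
        total = int_all[mask].sum()
        int_cons.append(total / len(spectra) if average else total)
    return np.array(mz_cons), np.array(int_cons)

def best_consensus(spectra, scores):
    """'best' method: keep the member spectrum with the highest quality score."""
    return spectra[int(np.argmax(scores))]

def most_similar_consensus(spectra, similarity):
    """'most_similar' method: keep the member closest to all other members under
    a pairwise similarity function (e.g. a normalized dot product)."""
    totals = [sum(similarity(s, t) for t in spectra if t is not s) for s in spectra]
    return spectra[int(np.argmax(totals))]
```
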
We have found some issues related to clustering that we need to solve first, which @jgriss can probably comment on:

  • The number of clusters for PRIDE Cluster is lower than for MaRaCluster. It would be good to start from similar numbers, because we are comparing consensus methods, not clustering methods. The parameters we used to run the clustering are discussed in results/supplement.docx.
  • In the results file results/identification-results-v4-10.xlsx we have added all the results for each RAW file.
  • We have added the MSGF+ q-value score distribution for each method (results/ PRIDE distribution of decoy(True) VS target(False)-test.png).

Some preliminary results show:

  • Searching the original spectra yields more peptide identifications than the clustering-based methods.
  • The binning consensus method performs better than the other consensus methods.
@ypriverol self-assigned this Mar 18, 2021
@ypriverol added the enhancement (New feature or request) and question (Further information is requested) labels Mar 18, 2021
@ypriverol
Collaborator Author

We have found the issue with PRIDE Cluster's generation of clusters. We will provide an update on this this week.

@jgriss
Collaborator

jgriss commented Mar 21, 2021

@ypriverol Thanks for the update! Was just starting to set up the tests!

Anything I can fix in the spectra-cluster code?

@jgriss
Collaborator

jgriss commented Mar 21, 2021

@ypriverol The results are in line with what's in the literature. If I'm not mistaken, we even had this in our first paper in 2012.

The theory back then was that a consensus spectrum always contains some noise and will therefore never be as good as the best measured spectrum in the dataset.

The question is whether we can use this to create an even better consensus algorithm.

@ypriverol
Collaborator Author

@jgriss we have found the right PRIDE Cluster threshold to produce the same number of clusters as MaRaCluster. I will update the issue this week.

@ypriverol
Collaborator Author

@jgriss @percolator :

We have the first results of the consensus clustering data analysis. The idea of this research is to compare different consensus methods through peptide identifications.

We found small differences between PRIDE Cluster and MaRaCluster. I would really like to remove one of them, @percolator, because any reviewer will focus on the clustering results rather than on the consensus spectrum generation. I don't mind making the comparison with MaRaCluster rather than PRIDE Cluster, but I really don't want to make the comparison about clustering algorithms. The original results can be seen here: https://github.com/ypriverol/specpride/blob/dev/results/discussion.adoc

What do you think? @percolator @jgriss

@jgriss
Collaborator

jgriss commented Apr 6, 2021

Hi @ypriverol ,

I'm just looking through the clustering nextflow workflow: In the current version in the repo, there are no arguments passed to MaRaCluster.

The default precursor tolerance, for example, is 20 ppm for MaRaCluster while you set 10 ppm for spectra-cluster.

Also, I failed to find the step where you merge the clustering results with the MGF files. Which MaRaCluster threshold are you using?
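In case it is useful, here is a minimal sketch of how such a merge step could look. It assumes MaRaCluster's blank-line-separated cluster output (one `file<TAB>scan` line per spectrum) and pyteomics for MGF parsing; the helper names are hypothetical and would need adapting to the actual workflow:

```python
from collections import defaultdict
from pyteomics import mgf

def read_maracluster_clusters(path):
    """Assumed format: one 'file<TAB>scan' line per spectrum, with blank lines
    separating clusters; returns {(file, scan): cluster_id}."""
    assignment, cluster_id = {}, 0
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                cluster_id += 1  # blank line marks the start of a new cluster
                continue
            raw_file, scan = line.split("\t")[:2]
            assignment[(raw_file, int(scan))] = cluster_id
    return assignment

def group_mgf_by_cluster(mgf_path, raw_file, assignment):
    """Group the spectra of one MGF file by their assigned cluster id."""
    clusters = defaultdict(list)
    for spectrum in mgf.read(mgf_path):
        # Assumes a SCANS entry in the MGF spectrum header; adapt to the real files.
        scan = int(spectrum["params"].get("scans", -1))
        cluster_id = assignment.get((raw_file, scan))
        if cluster_id is not None:
            clusters[cluster_id].append(spectrum)
    return clusters
```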

Finally, both MaRaCluster and spectra-cluster have built-in consensus spectrum algorithms. How about evaluating them as well?

Kind regards,
Johannes

@ypriverol
Collaborator Author

ypriverol commented Apr 6, 2021

After a brief discussion today with @percolator and @jgriss we decided the following things:

  • The MaRaCluster results will be used in the main manuscript and the PRIDE Cluster results will be moved to the supplement, to put the focus on the consensus algorithms rather than on a clustering benchmark.
  • We need a larger dataset (~1M spectra) for a better benchmark; the current data is about 50k spectra per file.
  • We need to update and better document the pipelines, as suggested by @jgriss.

@bittremieux
Collaborator

I have recently compared a few clustering tools, including spectra-cluster and MaRaCluster: https://www.biorxiv.org/content/10.1101/2021.02.05.429957v2

[embedded figure: clustering tool benchmark from the preprint linked above]

Both of these can generate a very high number of small clusters compared to other tools (Figure 2). This is an important aspect to keep in mind. For example, spectra-cluster and MaRaCluster might split large clusters corresponding to the same peptide into several small clusters more often than other clustering tools. As a result, searching the clustered data will take longer while the number of unique peptides that can be identified should be similar.

Rather than trying to get the same number of clusters out of each tool, I think it could be relevant to get a clustering result from each tool at a comparable number of incorrectly clustered spectra, as in my evaluation. Next, as you already suggest, I think it's a good idea not to approach this by contrasting the different tools with each other, but rather to evaluate which representative strategy works best for each tool.
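As a rough illustration (a sketch with assumed inputs, not the code behind the figure), the fraction of incorrectly clustered spectra can be computed from the cluster assignments and the peptide identifications like this:

```python
from collections import Counter

def incorrect_cluster_fraction(cluster_to_peptides):
    """`cluster_to_peptides` maps a cluster id to the peptide assignments of its
    identified member spectra; a spectrum counts as incorrectly clustered when
    its peptide differs from the majority peptide of its cluster."""
    incorrect, total = 0, 0
    for peptides in cluster_to_peptides.values():
        if len(peptides) < 2:
            continue  # singletons cannot be incorrectly clustered
        majority_count = Counter(peptides).most_common(1)[0][1]
        incorrect += len(peptides) - majority_count
        total += len(peptides)
    return incorrect / total if total else 0.0
```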

I still think it's valuable to include multiple tools, not only MaRaCluster. A single tool might do something funky or have some special properties. By getting a trend for multiple tools we'll have less "overfitting" and get a more generalizable insight.

Different tools produce clusters with different characteristics that may be suited for alternative downstream applications. There likely won't be a single best answer. Instead, learning general trends of what works well and what doesn't will be valuable, without having to explicitly compare different tools to each other.

@bittremieux
Collaborator

bittremieux commented Apr 29, 2021

The analysis above was performed on the Kim draft human proteome, which consists of ~25M spectra. I already have clustering results from all these different tools on that dataset, so we might be able to re-use those if that fits as a bigger dataset.

The clustering data is actually also available here: https://zenodo.org/record/4721496

@jgriss
Collaborator

jgriss commented Apr 30, 2021

@bittremieux Thanks a lot for sharing! That's a very nice benchmark!

@ypriverol I very much like @bittremieux 's idea to use these results as a basis for the consensus spectrum test. The dataset is well known and sufficiently large.

Would it be possible to adapt the pipeline to use these results as well?

@percolator
Collaborator

percolator commented Apr 30, 2021 via email

@jgriss
Collaborator

jgriss commented Apr 30, 2021

Hi Lukas,

One aspect that should maybe be discussed is that, based on Wout's results, both msCRUSH and falcon create larger clusters than our tools. This could improve the consensus spectrum quality.

How about keeping the dataset but adding falcon and msCRUSH? If the results are very similar, everything can go into the supplementary material. But if they are not, they could point in the right direction.

Kind regards,
Johannes

@bittremieux
Collaborator

[...] we came to the conclusion that we might include multiple clustering tools, but that we should strive to keep such results in the supplement unless the results are very consistent.

We seem to be pretty much in agreement. It's a valid concern that we don't want a competition between different clustering tools. However, it's my hope that by having results from multiple tools, a general trend will emerge. That would make for a stronger message than just having a single tool, with its particular idiosyncrasies and cluster characteristics. And a key message will probably be that there is no single "best" tool, but that different tools produce different results that might be more or less applicable to different use cases.

I'm not the biggest fan of using the ProteomeTools dataset, because it's not a realistically complex sample. Unfortunately, of course with the Kim dataset (and other biological datasets), there is no ground truth. It's maybe a bit inconvenient, but I think that using peptide identifications as a proxy for the ground truth should be fine. After all, that's what the field has been doing for decades.
There will indeed be some noise in the labels. When performing the above benchmark I inspected several clusters manually, and some "incorrectly" clustered spectra have highly unlikely, dissimilar peptide assignments that seemed to be a scoring artifact rather than genuinely different spectra. However, this is likely a similar problem for all clustering tools, and should presumably not favor one tool over another.

Let me see if I can find a bit of time this weekend to export representative spectra for the clustering results with the different tools that I already have. If it works, we might be able to relatively quickly check what the results look like and decide how to frame the story based on that.

In the figure above I only considered clusters with at least two spectra as valid clusters (otherwise you're always able to cluster 100% of the data of course). I guess for this analysis we also want to include singleton clusters, because the goal is to maximize the number of identifications?

@bittremieux
Collaborator

@percolator Do you mind giving me direct commit rights to this repository? Thanks.

@ypriverol
Collaborator Author

First of all, thanks @bittremieux @jgriss and @percolator for the discussion. Some thoughts here:

1- First, we will use multiple clustering algorithms, as we originally agreed, @bittremieux @jgriss @percolator. However, within the main manuscript we will discuss one clustering algorithm, and all the other tools can be moved to the supplement. The major reason is that we don't want to drift toward clustering benchmarking; we have done that already, and multiple papers have done it as well. But I like the idea of finding out which consensus algorithms work better in combination with which clustering algorithms.

2- Currently, the student is testing a 1M MS/MS spectrum dataset in addition to the two datasets already tested. While I like the idea of testing the 25M spectra, in the current manuscript we always compare against the original identification numbers (without clustering). Do you have those numbers, @bittremieux?

3- The student is now re-running all the data through the identification pipeline because we have some inconsistencies with the identifications and MSGF+. I will update the results at the end of the week.

@bittremieux
Collaborator

Yes, the identifications I'm using are from the MassIVE reanalysis of the draft human proteome dataset (RMSV000000091.3). That's also what I used for the comparisons in the figure above.

@jgriss
Collaborator

jgriss commented May 4, 2021

Hi everyone,

I very much agree with @ypriverol and @percolator that we should be careful not to let the focus move to the clustering algorithms. Nevertheless, the size and purity of clusters may be an important factor in the quality of consensus spectra.

In order to strengthen the focus on the consensus algorithm, why don't we add a completely artificial dataset where the consensus spectrum is calculated based on a certain portion of spectra that were identified as the same peptide? This fraction could be changed in order to create "clusters" with different sizes. This could allow us to explicitly study the phenomenon of cluster size without using any clustering algorithm.
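A sketch of what I have in mind (purely illustrative; the input format and function names are assumptions): group the identified spectra by peptide and draw a chosen fraction of each group as an artificial "cluster", so that the cluster size can be controlled directly without running any clustering tool.

```python
import random
from collections import defaultdict

def build_artificial_clusters(psms, fraction, min_size=2, seed=42):
    """`psms` is an iterable of (spectrum_id, peptide) pairs. For every peptide,
    a random `fraction` of its identified spectra is drawn to form one artificial
    cluster; varying `fraction` directly varies the cluster sizes."""
    by_peptide = defaultdict(list)
    for spectrum_id, peptide in psms:
        by_peptide[peptide].append(spectrum_id)
    rng = random.Random(seed)
    clusters = {}
    for peptide, spectrum_ids in by_peptide.items():
        size = int(round(fraction * len(spectrum_ids)))
        if size >= min_size:
            clusters[peptide] = rng.sample(spectrum_ids, size)
    return clusters
```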

I don't suggest this should be the only comparison, but an additional one to what's already planned.

Kind regards,
Johannes
