Discussion about results #56
Comments
We have found the issue with PRIDE Cluster's generation of clusters. We will post an update on this later this week. |
@ypriverol Thanks for the update! I was just starting to set up the tests! Is there anything I can fix in the spectra-cluster code? |
@ypriverol The results are in line with what's in the literature. If I'm not mistaken, we even reported this in our first paper in 2012. The theory back then was that a consensus spectrum always contains some noise and will therefore never be as good as the best measured spectrum in the dataset. The question is whether we can use this to create an even better consensus algorithm. |
@jgriss We have found the right PRIDE Cluster threshold to produce the same number of clusters with MaRaCluster and PRIDE Cluster. I will post an update in this issue this week. |
We have the first results of the data analysis of consensus clustering. The idea of this research is to compare different consensus spectrum methods through peptide identifications. We found only small differences between PRIDE Cluster and MaRaCluster.

I would really like to remove one of them, @percolator, because any reviewer will focus on the clustering results rather than on the consensus spectra generation. I don't mind making the comparison with MaRaCluster rather than PRIDE Cluster, but I really don't want the comparisons to be about clustering algorithms.

The original results can be seen here: https://github.com/ypriverol/specpride/blob/dev/results/discussion.adoc

What do you think? @percolator @jgriss |
Hi @ypriverol,

I'm just looking through the clustering Nextflow workflow. In the current version in the repo, no arguments are passed to MaRaCluster, so the defaults (for example, the default precursor tolerance) are used.

Also, I failed to find the step where you merge the clustering results with the MGF files. Which MaRaCluster threshold are you using?

Finally, both MaRaCluster and spectra-cluster have built-in consensus spectrum algorithms. How about evaluating them as well?

Kind regards, |
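As a minimal sketch of the merge step mentioned above, assuming the clustering tool writes a simple two-column TSV mapping each spectrum TITLE to a cluster id (MaRaCluster's own output may instead reference spectra by file and scan number, so the key would need to be adapted):

```python
from collections import defaultdict

def read_cluster_assignments(tsv_path):
    """Read an assumed two-column TSV: spectrum TITLE <tab> cluster id."""
    assignments = {}
    with open(tsv_path) as handle:
        for line in handle:
            title, cluster_id = line.rstrip("\n").split("\t")
            assignments[title] = cluster_id
    return assignments

def read_mgf_blocks(mgf_path):
    """Yield (title, lines) for every BEGIN IONS ... END IONS block."""
    block, title = [], None
    with open(mgf_path) as handle:
        for line in handle:
            if line.startswith("BEGIN IONS"):
                block, title = [line], None
            elif line.startswith("TITLE="):
                title = line.strip()[len("TITLE="):]
                block.append(line)
            elif line.startswith("END IONS"):
                block.append(line)
                yield title, block
                block, title = [], None
            elif block:
                block.append(line)

def group_spectra_by_cluster(mgf_path, tsv_path):
    """Return {cluster id: [raw MGF blocks]} ready for consensus building."""
    assignments = read_cluster_assignments(tsv_path)
    clusters = defaultdict(list)
    for title, block in read_mgf_blocks(mgf_path):
        if title in assignments:
            clusters[assignments[title]].append(block)
    return clusters
```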
After a brief discussion today with @percolator and @jgriss, we decided on the following:
|
I have recently compared a few clustering tools, including spectra-cluster and MaRaCluster: https://www.biorxiv.org/content/10.1101/2021.02.05.429957v2

Both of these can generate a very high number of small clusters compared to other tools (Figure 2). This is an important aspect to keep in mind. For example, spectra-cluster and MaRaCluster might split large clusters corresponding to the same peptide into several small clusters more often than other clustering tools. As a result, searching the clustered data will take longer, while the number of unique peptides that can be identified should be similar. Rather than trying to get the same number of clusters out of each tool, I think it could be relevant to get a clustering result from different tools with a comparable level of incorrectly clustered spectra, as in my evaluation.

Next, as you already suggest, I think it's a good idea not to frame this as contrasting different tools against each other, but rather to evaluate which representative strategy works best for each tool. I still think it's valuable to include multiple tools, not only MaRaCluster. A single tool might do something funky or have some special properties; by getting a trend for multiple tools we'll have less "overfitting" and a more generalizable insight.

Different tools produce clusters with different characteristics that may be suited to alternative downstream applications. There likely won't be a single best answer. Instead, learning general trends about what works well and what doesn't will be valuable, without having to explicitly compare different tools to each other. |
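For reference, a rough sketch of how such an "incorrectly clustered spectra" criterion could be computed from identifications; the data layout and names are assumptions, not the evaluation code from the preprint:

```python
from collections import Counter, defaultdict

def incorrect_cluster_fraction(cluster_ids, peptide_ids):
    """Fraction of identified spectra whose peptide differs from the majority
    peptide of their cluster. cluster_ids and peptide_ids are parallel lists;
    peptide_ids may contain None for unidentified spectra, which are ignored."""
    by_cluster = defaultdict(list)
    for cluster, peptide in zip(cluster_ids, peptide_ids):
        if peptide is not None:
            by_cluster[cluster].append(peptide)
    incorrect = total = 0
    for peptides in by_cluster.values():
        majority_count = Counter(peptides).most_common(1)[0][1]
        incorrect += len(peptides) - majority_count
        total += len(peptides)
    return incorrect / total if total else 0.0
```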
The analysis above was performed on the Kim draft human proteome, which consists of ~25M spectra. I already have clustering results from all these different tools on that dataset, so we might be able to re-use those if a bigger dataset is needed. The clustering data is also available here: https://zenodo.org/record/4721496 |
@bittremieux Thanks a lot for sharing! That's a very nice benchmark! @ypriverol I very much like @bittremieux's idea to use these results as a basis for the consensus spectrum test. The dataset is well known and sufficiently large. Would it be possible to adapt the pipeline to use these results as well? |
I agree that these are really nice plots, Wout!

My worry is just that it might be hard to get a message through if we discuss results from several clustering methods. I discussed this some time ago with Yasset, and we came to the conclusion that we might include multiple clustering tools, but that we should strive to keep such results in the supplement unless they are very consistent.

We used the Zolg et al. set in our benchmark mostly so that we know which peptide is behind each cluster. This is not the case for the Kim et al. set. Is there a constructive way to define what we would consider a correct result for Kim et al.?

In all honesty, the approach with the Zolg et al. set has an inherent problem in that Yasset's tests produce lower unique peptide identification rates with clustering than without clustering. That is likely not the case for a larger set, like the Kim et al. set.

Yours,
--Lukas
|
Hi Lukas,

One aspect that should maybe be discussed is that, based on Wout's results, both msCRUSH and falcon create larger clusters than our tools. This could improve the consensus spectrum quality. How about keeping the dataset but adding falcon and msCRUSH? If the results are very similar, everything can go into the supplement. But if they are not, they could point in the right direction.

Kind regards, |
We seem to be pretty much in agreement. It's a valid concern that we don't want a competition between different clustering tools. However, it's my hope that by having results from multiple tools, a general trend will emerge. That would make for a stronger message than just having a single tool, with its particular idiosyncrasies and cluster characteristics. And a key message will probably be that there is no single "best" tool, but that different tools produce different results that might be more or less applicable to different use cases.

I'm not the biggest fan of using the ProteomeTools dataset, because it's not a realistically complex sample. Unfortunately, of course, with the Kim dataset (and other biological datasets) there is no ground truth. It's maybe a bit inconvenient, but I think that using peptide identifications as a proxy for the ground truth should be fine. After all, that's what the field has been doing for decades.

Let me see if I can find a bit of time this weekend to export representative spectra for the clustering results with the different tools that I already have. If it works, we might be able to relatively quickly check what the results look like and decide how to frame the story based on that.

In the figure above I only considered clusters with at least two spectra as valid clusters (otherwise you're always able to cluster 100% of the data, of course). I guess for this analysis we also want to include singleton clusters, because the goal is to maximize the number of identifications? |
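To make the singleton point concrete, a tiny illustrative sketch (names are made up) of how the clustered fraction depends on whether size-one clusters count:

```python
from collections import Counter

def clustered_fraction(cluster_ids, min_cluster_size=2):
    """Fraction of spectra in clusters with at least min_cluster_size members."""
    sizes = Counter(cluster_ids)
    clustered = sum(n for n in sizes.values() if n >= min_cluster_size)
    return clustered / len(cluster_ids)

# min_cluster_size=2 excludes singletons (as in the figure above);
# min_cluster_size=1 counts every spectrum, so the fraction is always 1.0.
```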
@percolator Do you mind giving me direct commit rights to this repository? Thanks. |
First of all, thanks @bittremieux @jgriss and @percolator for the discussion. Some thoughts:

1. We will use multiple clustering algorithms, as we originally agreed, @bittremieux @jgriss @percolator. However, within the main manuscript we will discuss one clustering algorithm, and all the other tools can be moved to the supplement. The major reason is that we don't want to drift toward clustering benchmarking; we have already done that, and multiple papers have done it as well. But I like the idea of studying which consensus algorithms work better in combination with which clustering algorithms.

2. Currently, the student is testing a 1M MS/MS dataset in addition to the two datasets already tested. While I like the idea of testing the 25M spectra, in the current manuscript we always compare with the original identification numbers (without clustering); do you have those numbers, @bittremieux?

3. The student is now re-running all the data through the identification pipeline, because we had some inconsistencies with the identifications and MSGF+. I will post updated results at the end of the week. |
Yes, the identifications I'm using are from the MassIVE reanalysis of the draft human proteome dataset (RMSV000000091.3). That's also what I used for the comparisons in the figure above. |
Hi everyone,

I very much agree with @ypriverol and @percolator that we should be careful not to let the focus shift to the clustering algorithms. Nevertheless, the size and purity of clusters may be an important factor in the quality of consensus spectra.

In order to strengthen the focus on the consensus algorithm, why don't we add a completely artificial dataset where the consensus spectrum is calculated from a certain fraction of the spectra that were identified as the same peptide? This fraction could be varied in order to create "clusters" of different sizes. That would allow us to explicitly study the effect of cluster size without using any clustering algorithm. I don't suggest this should be the only comparison, but an additional one to what's already planned.

Kind regards, |
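A hedged sketch of the artificial-cluster idea described above; the function name, data layout, and sampling scheme are illustrative assumptions only:

```python
import random
from collections import defaultdict

def make_artificial_clusters(identifications, fraction, seed=42):
    """identifications: iterable of (spectrum_id, peptide) pairs for
    confidently identified spectra. For each peptide, sample `fraction`
    of its spectra and treat the sample as one artificial "cluster"."""
    rng = random.Random(seed)
    per_peptide = defaultdict(list)
    for spectrum_id, peptide in identifications:
        per_peptide[peptide].append(spectrum_id)
    clusters = {}
    for peptide, spectra in per_peptide.items():
        size = max(1, round(fraction * len(spectra)))
        clusters[peptide] = rng.sample(spectra, size)
    return clusters

# Sweeping fraction over e.g. [0.1, 0.25, 0.5, 1.0] and building a consensus
# spectrum per artificial cluster isolates the effect of cluster size.
```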
Hi @ALL: We have assigned one student to analyze the data and finish the project 🚀. We have made some major advances in the data analysis, included in PR #55 in the results folder (Excel files and figures). Here are some details about what we want to do in terms of analysis:
- Apply multiple consensus methods to each cluster (bin, average, best, most_similar); a minimal sketch of one such method follows after this list.
- Re-run the search from step one, but using the consensus spectra.
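As an illustration only (not the pipeline's actual implementation), a minimal sketch of a binned-average consensus spectrum, one of the simpler representative strategies listed above:

```python
from collections import defaultdict

def binned_average_consensus(spectra, bin_width=0.02):
    """Average all cluster members on a fixed m/z grid.
    spectra: list of peak lists, each a list of (mz, intensity) pairs."""
    bins = defaultdict(float)
    for peaks in spectra:
        for mz, intensity in peaks:
            bins[round(mz / bin_width)] += intensity
    n = len(spectra)
    return sorted((index * bin_width, total / n) for index, total in bins.items())
```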
We have found some issues related to the clustering that we need to solve first; @jgriss can probably comment on them.
Some preliminary results show: