MS2 spectra without a precursor charge are ignored #26

Open
YasinEl opened this issue May 5, 2024 · 14 comments
@YasinEl

YasinEl commented May 5, 2024

Hello, I was trying to cluster mzXML files from https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=88a7dfeeecb74131a6d6bfb7a9db0a46 in WSL:Ubuntu-22.04 but it does not seem to recognize any spectra. My parameters and output are below:

falcon BAX89_BA1_01_23240.mzXML falcon
2024-05-04 18:26:51,147 INFO [falcon/MainProcess] falcon.main : falcon version 0.1.3
2024-05-04 18:26:51,147 DEBUG [falcon/MainProcess] falcon.main : work_dir = None
2024-05-04 18:26:51,147 DEBUG [falcon/MainProcess] falcon.main : overwrite = False
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : export_representatives = True
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : usi_pxd = USI000000
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : precursor_tol = 20.00 ppm
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : rt_tol = None
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : fragment_tol = 0.05
2024-05-04 18:26:51,149 DEBUG [falcon/MainProcess] falcon.main : eps = 0.100
2024-05-04 18:26:51,149 DEBUG [falcon/MainProcess] falcon.main : min_samples = 2
2024-05-04 18:26:51,149 DEBUG [falcon/MainProcess] falcon.main : mz_interval = 1
2024-05-04 18:26:51,149 DEBUG [falcon/MainProcess] falcon.main : hash_len = 800
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : n_neighbors = 64
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : n_neighbors_ann = 128
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : batch_size = 65536
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : n_probe = 32
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : min_peaks = 5
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : min_mz_range = 10.00
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : min_mz = 40.00
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : max_mz = 1500.00
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : remove_precursor_tol = 1.50
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : min_intensity = 0.01
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : max_peaks_used = 50
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : scaling = off
2024-05-04 18:26:51,156 INFO [falcon/MainProcess] falcon._prepare_spectra : Read spectra from 1 peak file(s)
2024-05-04 18:27:02,645 DEBUG [falcon/MainProcess] falcon._prepare_spectra : 0 spectra written to 0 buckets by precursor charge and precursor m/z
2024-05-04 18:27:02,655 ERROR [falcon/MainProcess] falcon.main : No valid spectra found for clustering

I tried converting the mzXML to mzML and MGF via ProteoWizard 3.0.24124, but that did not solve the issue. I have confirmed that the files do indeed contain MS2 spectra.

Thank you for the support!

bittremieux added the bug label May 6, 2024
@Janne98
Collaborator

Janne98 commented May 6, 2024

Hi!
Have you checked if there's a charge specified in the files? The current version of falcon discards spectra with a missing charge value.
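One quick way to run this check on an MGF export is to count how many spectra carry a `CHARGE=` line. The sketch below uses only the standard library and assumes the usual MGF layout (`BEGIN IONS` / `END IONS` blocks with an optional `CHARGE=` header). The function name `count_mgf_charges` and the example data are made up for illustration and are not part of falcon.

```python
from collections import Counter
from typing import Iterable


def count_mgf_charges(lines: Iterable[str]) -> Counter:
    """Count spectra per precursor charge; the key None means no CHARGE line."""
    counts: Counter = Counter()
    in_spectrum = False
    charge = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_spectrum, charge = True, None
        elif line == "END IONS":
            if in_spectrum:
                counts[charge] += 1
            in_spectrum = False
        elif in_spectrum and line.upper().startswith("CHARGE="):
            # Keep the raw value (e.g. "1+"); an empty value counts as missing.
            charge = line.split("=", 1)[1].strip() or None
    return counts


# Illustrative data only: one spectrum with a charge, one without.
example = """BEGIN IONS
PEPMASS=301.14
CHARGE=1+
100.0 10
END IONS
BEGIN IONS
PEPMASS=455.29
120.0 5
END IONS
"""
charge_counts = count_mgf_charges(example.splitlines())
print(charge_counts)
```

If every spectrum ends up under the `None` key, the file reports no precursor charge at all, which matches the "0 spectra written to 0 buckets" line in the log above.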

@YasinEl
Author

YasinEl commented May 6, 2024

Hey, yes, it seems like there is no charge reported. Unfortunately, that's common in public metabolomics data. Are you using the charge information in some way, or are you using it basically as a noise filter?
Thanks!

@bittremieux
Owner

We're mainly using the charge to split the spectra into charge-disjoint groups, to prevent spectra with different charge states from being clustered together. This is more relevant for proteomics data of course, where you'll encounter a wider range of charges.
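To make the idea of charge-disjoint groups concrete, here is a minimal sketch of bucketing spectra by precursor charge and precursor m/z, with missing charges kept in their own bucket instead of being discarded. This is not falcon's actual code: the `Spectrum` record, the `bucket_spectra` function, and the use of a plain m/z bin are assumptions made for illustration (the `mz_interval` name is borrowed from the log output above).

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Spectrum(NamedTuple):
    """Hypothetical minimal spectrum record."""
    identifier: str
    precursor_mz: float
    precursor_charge: Optional[int]  # None when the file reports no charge


def bucket_spectra(
    spectra: Iterable[Spectrum], mz_interval: float = 1.0
) -> Dict[Tuple[Optional[int], int], List[str]]:
    """Group spectrum identifiers by (charge, m/z bin)."""
    buckets: Dict[Tuple[Optional[int], int], List[str]] = defaultdict(list)
    for spec in spectra:
        mz_bin = math.floor(spec.precursor_mz / mz_interval)
        # Spectra with an unknown charge share the charge key None.
        buckets[(spec.precursor_charge, mz_bin)].append(spec.identifier)
    return dict(buckets)


spectra = [
    Spectrum("a", 500.2, 2),
    Spectrum("b", 500.7, 2),
    Spectrum("c", 500.3, 3),
    Spectrum("d", 500.4, None),
]
buckets = bucket_spectra(spectra)
```

Because clustering would then run within each bucket, spectrum `d` can never end up in a cluster with `a`, `b`, or `c` in this scheme, even though their precursor m/z values fall in the same bin.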

We can look into how we can generalize the code a bit so that this information is no longer mandatory.

bittremieux changed the title from "No MS2 spectra read from mzXML" to "MS2 spectra without a precursor charge are ignored" May 7, 2024
@YasinEl
Author

YasinEl commented Sep 28, 2024

Hey, thank you for working on this! I tried out the branch where you implemented it and it seems to work now. However, it appears you won't allow clusters to be formed between spectra with and without charge.

Just putting that here in case you have not thought of it. Thank you!

@bittremieux
Owner

Good point. Considering that we're clustering spectra by charge but in this case that information is unknown, it's not immediately obvious how this should best be handled.

  1. First cluster within each charge separately, including the unknown charge. Then, as post-processing, merge all clusters with an unknown charge with clusters with a known charge based on some criteria (e.g. if the distance between the cluster centroids is small enough). This is slightly dissatisfying because the cluster merging would be done using a different comparison than the initial clustering.
  2. First cluster all spectra in one go. Then, as post-processing, split each cluster that has members with different charges into subclusters. This is somewhat dissatisfying because when splitting clusters you can assign spectra with known charges, but those without charges would need to be assigned to the subclusters based on some specific criteria.
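As a rough illustration of the second option, the sketch below takes cluster labels from a single clustering pass and splits each cluster by precursor charge. The criterion for spectra without a charge is exactly the open question, so the rule used here (assign them to the most common known charge in their cluster) is only a placeholder assumption, and `split_clusters_by_charge` is a hypothetical helper, not something that exists in falcon.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def split_clusters_by_charge(
    labels: Sequence[int], charges: Sequence[Optional[int]]
) -> List[Tuple[int, Optional[int]]]:
    """Return one (cluster label, charge) subcluster key per spectrum."""
    # Tally the known charges that occur in each original cluster.
    known: Dict[int, Counter] = defaultdict(Counter)
    for label, charge in zip(labels, charges):
        if charge is not None:
            known[label][charge] += 1

    new_labels: List[Tuple[int, Optional[int]]] = []
    for label, charge in zip(labels, charges):
        if charge is None and known[label]:
            # Placeholder criterion (assumption): follow the most common
            # known charge in this cluster.
            charge = known[label].most_common(1)[0][0]
        # Clusters with no known charge at all keep None as their charge.
        new_labels.append((label, charge))
    return new_labels


labels = [0, 0, 0, 0, 1, 1]
charges = [2, 2, 3, None, None, None]
subclusters = split_clusters_by_charge(labels, charges)
```

In this toy input, cluster 0 is split into a charge 2 and a charge 3 subcluster, the uncharged spectrum in cluster 0 follows the charge 2 majority, and cluster 1 stays intact because none of its members has a charge.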

@Janne98 what do you think? Or maybe you have a better solution?

Splitting the clusters by charge is pretty essential though, as Ming has demonstrated in the MS-RT paper that cluster quality is significantly degraded when you don't do that.

@YasinEl
Author

YasinEl commented Sep 29, 2024

Thanks! Interesting that the charge state splitting was the main reason for falcon's better performance. I did not completely grasp that from the MS-RT manuscript.

If charge states are that essential, what do you think about deriving them from the precursor MS1 scan where possible? I am actually surprised that, for most vendors, this information is already reported in the mzML files.

I am not sure about the proportion of (public) metabolomics data where you have this information available, but I think I have stumbled across such data randomly twice at this point, so I imagine it's not that rare. Hence it would probably be an issue for larger-scale public data reanalysis efforts.

@bittremieux
Owner

> Thanks! Interesting that the charge state splitting was the main reason for falcon's better performance. I did not completely grasp that from the MS-RT manuscript.

Ah, I wouldn't say that this is the main reason, but rather one of the contributing factors. We know it makes some difference, so we don't want to make falcon worse by ignoring the relevant charge information.

We know that the actual clustering approach in falcon is also significantly superior to what MS-Cluster does (and we are improving it further in recent developments), so there are multiple things at play to explain the performance difference.

> If charge states are that essential, what do you think about deriving them from the precursor MS1 scan where possible? I am actually surprised that, for most vendors, this information is already reported in the mzML files.

Afaik charge state is missing when vendor/on-board instrument software can't determine it. So in that case, how likely is it that it can be derived by post-processing? I actually don't know, I haven't really looked at that yet.

We have some plans that would require going back to the MS1 data (#19), so such a task could be tackled at the same time. I'm not sure we'll manage to do that for the next release though.

> I am not sure about the proportion of (public) metabolomics data where you have this information available, but I think I have stumbled across such data randomly twice at this point, so I imagine it's not that rare. Hence it would probably be an issue for larger-scale public data reanalysis efforts.

Indeed. It should mostly be older data though, right? Afaik the charge is missing when no isotopic envelope can be detected, which is (much) more often the case for older instruments with a lower resolution.
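For context on why the isotopic envelope determines the charge: neighbouring isotope peaks of an ion with charge z are spaced roughly 1.00335/z apart in m/z (the 13C/12C mass difference divided by the charge). The sketch below guesses the charge from that spacing in a list of MS1 m/z values. It is a simplified illustration, not vendor or falcon code: `infer_charge`, its defaults, and the tolerance are assumptions, and peak intensities are ignored entirely.

```python
from typing import Optional, Sequence

C13_C12_DELTA = 1.0033548  # mass difference between 13C and 12C, in Da


def infer_charge(
    precursor_mz: float,
    ms1_mz: Sequence[float],
    max_charge: int = 4,
    tol: float = 0.01,
) -> Optional[int]:
    """Guess the charge from the spacing to the first isotope peak.

    Returns None when no candidate isotope peak is found.
    """
    # Try the highest charge first: a doubly charged ion also shows a peak
    # at +1.00335 (its second isotope), which would otherwise be mistaken
    # for the first isotope of a singly charged ion.
    for charge in range(max_charge, 0, -1):
        expected = precursor_mz + C13_C12_DELTA / charge
        if any(abs(mz - expected) <= tol for mz in ms1_mz):
            return charge
    return None


singly = infer_charge(500.0, [500.0, 501.0034, 502.0067])
doubly = infer_charge(500.0, [500.0, 500.5017, 501.0034])
unknown = infer_charge(500.0, [500.0])
```

The third call shows the failure mode discussed here: without a detectable isotope peak, no charge can be assigned.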

@YasinEl
Author

YasinEl commented Sep 30, 2024

Interesting.

Files like the ones I shared at the beginning of the issue, and also files such as this, do not appear to have charge information for any of the MS2 scans. So this must be due to what is saved into the file rather than what can be derived from the isotope pattern. It also cannot be due to resolution, since in metabolomics we mostly have charge 1, which requires only a bit more than unit resolution.

@bittremieux
Owner

Yeah, fair enough. I think the vendor also matters: Thermo files typically include the charge, but Waters/Bruker files less often.

One good thing though: if none of the spectra have a charge, there won't be any charge splitting and all spectra will be clustered together. So maybe it's less of an issue than feared?

@YasinEl
Author

YasinEl commented Sep 30, 2024

I run into the issue because I like clustering spectra from files of different studies, which lets me align things despite different retention times (with admittedly high tolerances when clustering across vendors).

@bittremieux
Owner

Yes, that's a valid use case that we definitely want to support. Looking at the last example, the MS1 actually has a nice isotopic pattern. So it should be possible to correct the charge information for at least some spectra.

This might be functionality for version x+1 though. In the meantime, maybe the precursorRefine filter when using msConvert to convert to mzML could already help alleviate the issue in many cases? I don't know how that's implemented, but I'd assume our code would have to do something very similar to what is already implemented there.

@YasinEl
Author

YasinEl commented Sep 30, 2024

Ah, that would be a good intermediate solution. Unfortunately, precursorRefine only works for Orbitrap/FT data though.

@bittremieux
Owner

The documentation here says that TOF data should also work.

In any case, this is an interesting feature that we want to support, but it might not be for the next release.

@YasinEl
Author

YasinEl commented Oct 1, 2024

Oh, it seems I have an outdated version.

Thanks!
