Speedup of peptide parsing and annotation #61

Merged

jspaezp
Contributor

@jspaezp jspaezp commented Apr 2, 2024

This PR implements four main changes, all aimed at speeding up spectrum annotation workflows.

  1. A fast-pass for unmodified peptides during the parsing.
  2. The option for a simpler parsing grammar.
  3. LRU caching of the parser (the grammar is read once per session, not once per parse of a ProForma sequence).
  4. The option to annotate spectra by passing a list of proteoforms directly (instead of a sequence).
    • This feature is critical for me, since I have a workflow that uses both the proteoforms directly and the annotated spectra. By itself, it makes my workflow 2x faster.
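
Items 1 and 3 can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `get_parser`, `parse`, `build_count`, and the "uppercase letters only" definition of an unmodified peptide are all assumptions made for the example, and the "parser" is a placeholder rather than a real grammar-based one.

```python
import re
from functools import lru_cache

# Assumption: an "unmodified" peptide is just a run of uppercase residue letters.
_UNMODIFIED = re.compile(r"[A-Z]+")

# Counts how often the (pretend) expensive parser construction runs.
build_count = {"n": 0}


@lru_cache(maxsize=1)
def get_parser():
    """Stand-in for reading the grammar file and building the parser once."""
    build_count["n"] += 1

    def full_parse(sequence):
        # Placeholder for the real grammar-based parse.
        return ("full-parse", sequence)

    return full_parse


def parse(sequence):
    """Skip the grammar entirely for plain sequences; otherwise reuse the cached parser."""
    if _UNMODIFIED.fullmatch(sequence):
        return ("fast-pass", list(sequence))
    return get_parser()(sequence)
```

With this shape, `parse("PEPTIDE")` never touches the parser, while repeated calls such as `parse("PEPT[Phospho]IDE")` build the parser only on the first call.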

Benchmarks

Using some dummy peptide examples, the speedup I see in parsing is:

With mods

29.51it/s -> (baseline), greedy loading, no fastpass
137.54it/s -> + unmod fastpass, cached full parser (~4.7x improvement)
168.48it/s -> + simple parser (1.22x improvement, ~6x from baseline)

Without mods

34.18it/s -> (baseline), greedy loading, no fastpass
995089.92it/s -> + unmod fastpass, cached full parser (~30000x improvement)
1081006.19it/s -> + simple parser (equivalent for practical purposes)
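
A throughput figure of this kind (iterations per second) can be measured with something like the sketch below. The helper name `iterations_per_second` and the best-of-`repeats` timing strategy are assumptions for illustration; the numbers above were not necessarily produced this way.

```python
import time


def iterations_per_second(func, inputs, repeats=3):
    """Call func on every input and return the best observed throughput."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            func(item)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse clocks.
    return len(inputs) / max(best, 1e-12)
```

For example, `iterations_per_second(len, ["PEPTIDE"] * 1000)` times a trivial function over 1000 dummy sequences; swapping in the actual parsing function gives numbers comparable to the ones listed above.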

On a heavy annotation workflow I have, these changes dropped the run time from 45 mins to 2.20 :P

LMK what you think!
Best

@jspaezp
Contributor Author

jspaezp commented Apr 2, 2024

btw the tests that involve reading from USI are also breaking on master on my local system.

Member

@bittremieux bittremieux left a comment

Thanks for the contribution. The fast-pass for unmodified peptides is a great addition, as well as annotating proteoforms directly. I have a few comments related to the code that should be relatively easy fixes.

I'm a bit less convinced about the simplified grammar though. This can be discussed in a bit more detail, see the respective comments.

Resolved review threads (now outdated):
spectrum_utils/proforma.py (5)
spectrum_utils/spectrum.py (3)
spectrum_utils/proforma_simple.ebnf (2)

@jspaezp
Contributor Author

jspaezp commented Apr 16, 2024

@bittremieux added the suggestions, LMK what you think!

@bittremieux bittremieux merged commit 40151e2 into bittremieux-lab:main Apr 17, 2024
1 of 16 checks passed