Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track line and column data incrementally when parsing strings #71

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mmcqd
Copy link

@mmcqd mmcqd commented Feb 4, 2023

This PR replaces the integer index into a stream with a triple of integers representing offset, line and column information.

offset is what index originally was: how many individual elements of the stream have been consumed.

When parser input is a string line and column are updated incrementally with each character of the input string.

When parser input is not a string line and column are both set to -1 and ignored.

This adds a huge speedup on large inputs to parsers which make frequent use of the line_info parser. For reference, using the example JSON parser updated to track line/col data on each node, a 100k line JSON file goes from taking ~210 seconds to ~7 seconds on my machine. We've gone from quadratic time position tracking to linear time position tracking.

I added no new tests, since this PR does not add a new feature, only changes the implementation of an existing one. I'm happy to add more tests if needed though.

Let me know if there's anything I can do to improve this!

column: int


@dataclass(frozen=True)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made Result and Position frozen dataclasses, because it seems there's no reason to treat them mutably, and they ought to be hashable if they can be.

@@ -516,6 +513,17 @@ def fail(expected: str) -> Parser:
return Parser(lambda _, index: Result.failure(index, expected))


def make_index_update(consumed: str) -> Callable[[Position], Position]:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is curried to avoid recomputing the count and rfind methods every single time someone uses the string parser

@codecov-commenter
Copy link

Codecov Report

Base: 94.44% // Head: 94.43% // Decreases project coverage by -0.02% ⚠️

Coverage data is based on head (6f66f49) compared to base (da4593e).
Patch coverage: 100.00% of modified lines in pull request are covered.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
- Coverage   94.44%   94.43%   -0.02%     
==========================================
  Files           9        9              
  Lines        1027     1025       -2     
==========================================
- Hits          970      968       -2     
  Misses         57       57              
Impacted Files Coverage Δ
src/parsy/__init__.py 100.00% <100.00%> (ø)
tests/test_parsy.py 99.37% <100.00%> (-0.02%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@spookylukey
Copy link
Member

Thanks for the PR, this looks like it could be a very valuable contribution for some users.

I haven't had time to look at it in depth, but it looks like it is significantly backwards incompatible. The Result class is publicly documented, as are a number of the methods whose signatures have changed. For example, see https://parsy.readthedocs.io/en/latest/ref/parser_instances.html . At the very minimum the example on that page would need to work unchanged.

Would there be any way to get this PR to work without these backwards incompatibilities? For example, could you make your Position object subclass from int for backwards compat, or some other solution that checks the type of values and converts as necessary?

If not, that would probably be a stow stopper. A breaking change of this size would require a new major version at least, or a fork.

@mmcqd
Copy link
Author

mmcqd commented Feb 6, 2023

Ah, yeah I see the problem. I think it should be possible to keep this backwards compatible, let me see what I can do

@underyx
Copy link

underyx commented Feb 6, 2023

One idea for better backwards compat: Store the three-tuple on a different attribute, e.g. Result.index_position, and make Result.index a @property that returns an index based on the data in index_position.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants