Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RE2Parser bug with regex begin line and end line markers #459

Merged
merged 2 commits into from
Nov 18, 2024

Conversation

Adda0
Copy link
Collaborator

@Adda0 Adda0 commented Nov 15, 2024

This PR attempts to fix parsing ^ and $ in regexes parsed by RE2. There are a lot of missing features that remain unimplemented and will have to be resolved in the future.

The PR is supposed to fix issues from #457, #437, and #450. Whether it actually fixes the issues, that remains to be seen.

@Adda0 Adda0 requested review from jurajsic and koniksedy November 15, 2024 08:51
Copy link
Collaborator

@koniksedy koniksedy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jurajsic
Copy link
Member

What does increment_current_state do? It just ignores the flag?

@Adda0
Copy link
Collaborator Author

Adda0 commented Nov 18, 2024

Can you clarify on what do you want to better explain? RE2 can increment its internal state multiple times while still operating on a single Mata NFA state. Therefore, we need to omit incrementing the current Mata NFA state when, for example, RE2 increments its state for ^ symbol (the beginning of the line). This bool flag just allows us to specify that when we get such a state, we should not increment the current Mata NFA state.

@jurajsic
Copy link
Member

If I understand correctly, it just ignores the flags ^, $, \b, etc., right?

@Adda0
Copy link
Collaborator Author

Adda0 commented Nov 18, 2024

As of right now, yes. But it is meant to be generally usable if something like this appears again. It is a mechanism to make the NFA state independent of RE2 state.

@jurajsic
Copy link
Member

I would have some discussion whether it is not better to throw an error for some of the flags, that we cannot handle them or something, but I am still ok with this.

@Adda0
Copy link
Collaborator Author

Adda0 commented Nov 18, 2024

Definitely. We have not tested \b etc. And even more, RE2 seems to throw states with state kind EndOfText or BeginOfText for $ and ^ even though the kind should be EndOfLine and BeginOfLine. I do not understand it, but as of now, these non-printable characters are skipped.

This will play a role when we open a discussion about regex interpretation, that is, a{2}b should match aab, but also aab inside fffaabfff. We have two matching approaches: the first is just an automaton matching a{2}b precisely, the other is .*a{2}b.*, which is what normal regex matchers do. We should have a flag (by default, set to the first approach), where the user can define which matching approach they want (what kind of NFA they get from the regex). Then, the ^ and $ will play a role. In the first approach, they are irrelevant.

@jurajsic
Copy link
Member

The EndOfLine vs EndOfText could be related with whether multi-line mode is enabled or not, by default I think it is disabled.

@Adda0
Copy link
Collaborator Author

Adda0 commented Nov 18, 2024

Good point. I believe this is correct. I will add the comment to the linked issue.

@Adda0 Adda0 merged commit e806b8d into devel Nov 18, 2024
18 checks passed
@Adda0 Adda0 deleted the re2parser-bug-begin-end-line-markers branch November 18, 2024 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants