RE2Parser bug with regex begin line and end line markers #459

Adda0 · 2024-11-15T08:51:29Z

This PR attempts to fix the handling of ^ and $ in regexes parsed by RE2. Many features remain unimplemented and will have to be addressed in the future.

The PR is supposed to fix the issues from #457, #437, and #450. Whether it actually fixes them remains to be seen.

koniksedy

LGTM.

jurajsic · 2024-11-18T08:08:51Z

What does increment_current_state do? It just ignores the flag?

Adda0 · 2024-11-18T08:13:24Z

Can you clarify what you would like explained better? RE2 can increment its internal state multiple times while still operating on a single Mata NFA state. Therefore, we need to skip incrementing the current Mata NFA state when, for example, RE2 increments its state for the ^ symbol (the beginning of the line). This bool flag just allows us to specify that when we get such a state, we should not increment the current Mata NFA state.

jurajsic · 2024-11-18T08:16:05Z

If I understand correctly, it just ignores the flags ^, $, \b, etc., right?

Adda0 · 2024-11-18T08:21:15Z

As of right now, yes. But it is meant to be generally usable if something like this appears again. It is a mechanism to make the NFA state independent of RE2 state.
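The idea of decoupling the NFA state counter from the RE2 state counter can be sketched with a toy model. This is not the code from this PR and not RE2's or Mata's API: the `Inst` class, the `"empty_width"` kind, and `map_states` are hypothetical names, and the program is assumed to be a simple linear sequence of instructions.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    # Hypothetical stand-in for an RE2 program instruction.
    # "byte" consumes one symbol; "empty_width" stands for ^, $, \b, etc.
    kind: str
    symbol: str = ""


def map_states(program):
    """Assign an NFA state id to each instruction of a linear program.

    Empty-width instructions do not advance the NFA state counter, so
    several RE2-like states can share a single NFA state.
    """
    mapping = []
    nfa_state = 0
    for inst in program:
        mapping.append(nfa_state)
        # The flag discussed above: skip the increment for empty-width states.
        increment_current_state = inst.kind != "empty_width"
        if increment_current_state:
            nfa_state += 1
    return mapping


if __name__ == "__main__":
    # Toy program for the regex ^ab$
    program = [
        Inst("empty_width"),
        Inst("byte", "a"),
        Inst("byte", "b"),
        Inst("empty_width"),
    ]
    print(map_states(program))
```

In this toy model, the instruction for ^ and the instruction for `a` both map to NFA state 0, which is the sense in which the marker is ignored.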

jurajsic · 2024-11-18T08:26:50Z

We could discuss whether it would be better to throw an error for some of the flags, saying that we cannot handle them, but I am still OK with this.

Adda0 · 2024-11-18T08:35:25Z

Definitely. We have not tested \b and similar markers. What is more, RE2 seems to produce states of kind EndOfText or BeginOfText for $ and ^ even though the kind should be EndOfLine and BeginOfLine. I do not understand why, but as of now, these non-printable characters are skipped.

This will play a role when we open a discussion about regex interpretation: should a{2}b match only aab, or also aab inside fffaabfff? We have two matching approaches: the first is an automaton matching a{2}b precisely, the other is .*a{2}b.*, which is what normal regex matchers do. We should have a flag (set to the first approach by default) with which the user can choose the matching approach they want (what kind of NFA they get from the regex). Then ^ and $ will play a role. In the first approach, they are irrelevant.
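The two interpretations can be illustrated with Python's standard `re` module. This is only an analogy, not Mata's or RE2's API: `re.fullmatch` plays the role of the automaton accepting exactly the language of the regex, and `re.search` behaves like the .*a{2}b.* interpretation. The helper names `accepts_exactly` and `accepts_inside_text` are hypothetical.

```python
import re

PATTERN = r"a{2}b"


def accepts_exactly(pattern: str, word: str) -> bool:
    """First approach: the whole word must be in the language of the regex."""
    return re.fullmatch(pattern, word) is not None


def accepts_inside_text(pattern: str, text: str) -> bool:
    """Second approach: the regex may match anywhere, like .*pattern.*"""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    for word in ("aab", "fffaabfff"):
        print(word, accepts_exactly(PATTERN, word), accepts_inside_text(PATTERN, word))
    # With anchors, the second approach no longer finds aab inside the text.
    print(accepts_inside_text(r"^a{2}b$", "fffaabfff"))
```

In the first approach, adding ^ and $ does not change which words are accepted; in the second, they restrict where the match may start and end.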

jurajsic · 2024-11-18T09:03:26Z

The EndOfLine vs EndOfText difference could be related to whether multi-line mode is enabled; by default, I think it is disabled.
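The general difference between text anchors and line anchors can be seen in Python's standard `re` module, used here only as an analogy; RE2's own options and defaults are not shown. Without `re.MULTILINE`, ^ and $ anchor to the text as a whole; with it, they also match at line boundaries. The helper `match_offsets` is a hypothetical name.

```python
import re

TEXT = "foo\nbar"


def match_offsets(pattern: str, text: str, flags: int = 0) -> list:
    """Return the start offsets of all matches of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]


if __name__ == "__main__":
    # Multi-line mode disabled: anchors refer to the begin/end of the text.
    print(match_offsets(r"^", TEXT), match_offsets(r"$", TEXT))
    # Multi-line mode enabled: anchors also match at each line boundary.
    print(match_offsets(r"^", TEXT, re.MULTILINE), match_offsets(r"$", TEXT, re.MULTILINE))
```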

Adda0 · 2024-11-18T09:09:24Z

Good point. I believe this is correct. I will add the comment to the linked issue.

Adda0 added 2 commits November 15, 2024 09:51

feat(re2): Add tests for ^ and $ regex symbols

0d12da5

fix(re2): Fix creating NFAs from regexes with ^ and $ regex symbols

459c6cd

Adda0 requested review from jurajsic and koniksedy November 15, 2024 08:51

koniksedy approved these changes Nov 15, 2024

jurajsic approved these changes Nov 18, 2024

Adda0 mentioned this pull request Nov 18, 2024

Construct NFAs for regex matching inside text #464

Open

Adda0 merged commit e806b8d into devel Nov 18, 2024
18 checks passed

Adda0 deleted the re2parser-bug-begin-end-line-markers branch November 18, 2024 09:14

This was referenced Nov 18, 2024

Unexpected behavior of regex parser with $ and | #457

Open

Combining ^ with | in regex leads to unexpected behavior #450

Closed
