Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subtraction of ascii_shift_value caused overflow. #463

Merged
merged 1 commit into from
Nov 16, 2024
Merged

Conversation

koniksedy
Copy link
Collaborator

@koniksedy koniksedy commented Nov 16, 2024

This PR improves the handling of case-insensitive matches in the re2parser. Originally, ascii_shift_value = 32 was subtracted from each symbol, which caused overflow and unexpected behavior. Now, ascii_shift_value is subtracted only from symbols between 'a' and 'z' (to convert them to uppercase).

Additionally, this PR resolves issue #456, where this bug was initially reported.

@koniksedy koniksedy requested review from jurajsic and Adda0 November 16, 2024 11:35
@koniksedy koniksedy linked an issue Nov 16, 2024 that may be closed by this pull request
Copy link
Collaborator

@Adda0 Adda0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I wondered about the handling of the case insensitive match, too. I will immediately merge it. Juraj can review retrospectively, if he wishes to.

@Adda0 Adda0 merged commit 2dc8437 into devel Nov 16, 2024
18 checks passed
@Adda0 Adda0 deleted the foldcase_bug branch November 16, 2024 12:10
@jurajsic
Copy link
Member

Hmm, looking at the regex from the issue, why is the case-insensitive matching even enabled? inst->foldcase() should be false and no transitions should be added even before this fix.

@jurajsic
Copy link
Member

jurajsic commented Nov 18, 2024

Can you also add to test cases something that contains only lowercase and a test that contains only uppercase characters?

@koniksedy
Copy link
Collaborator Author

Hmm, looking at the regex from the issue, why is the case-insensitive matching even enabled? inst->foldcase() should be false and no transitions should be added even before this fix.

The process by which RE2 decides whether to use foldcase mode or not is complex. Below is a detailed explanation:

1. Bitvector Computation for the Content Inside [ ... ]

bool CharClassBuilder::AddRange(Rune lo, Rune hi) {
if (hi < lo)
return false;
if (lo <= 'z' && hi >= 'A') {
// Overlaps some alpha, maybe not all.
// Update bitmaps telling which ASCII letters are in the set.
Rune lo1 = std::max<Rune>(lo, 'A');
Rune hi1 = std::min<Rune>(hi, 'Z');
if (lo1 <= hi1)
upper_ |= ((1 << (hi1 - lo1 + 1)) - 1) << (lo1 - 'A');
lo1 = std::max<Rune>(lo, 'a');
hi1 = std::min<Rune>(hi, 'z');
if (lo1 <= hi1)
lower_ |= ((1 << (hi1 - lo1 + 1)) - 1) << (lo1 - 'a');
}

RE2 calculates two bitvectors, upper_ and lower_, for all intervals specified within square brackets together (e.g., [A-Za-z]):

  • The upper_ bitvector represents the set of uppercase letters used within the square brackets.
  • The lower_ bitvector represents the set of lowercase letters.

2. Bitvector Equality Test

bool CharClassBuilder::FoldsASCII() {
return ((upper_ ^ lower_) & AlphaMask) == 0;
}

An XOR operation is performed on lower_ and upper_ to determine if they are equal.

  • If the bitvectors are equal, RE2 enables foldcase mode.
  • foldcase is activated only if the lowercase range represents exactly the same letters as the uppercase range.
    • [A-Ma-m] will activate foldcase.
    • [A-Ma-o] will not activate foldcase.

3. Disabling foldcase (Optimization)

// If this range contains all of A-Za-z or none of it,
// the fold flag is unnecessary; don't bother.
bool fold = foldascii;
if ((i->lo <= 'A' && 'z' <= i->hi) || i->hi < 'A' || 'z' < i->lo ||
('Z' < i->lo && i->hi < 'a'))
fold = false;

RE2 can disable foldcase mode if it determines that all letters are represented in a single interval.

  • This check applies to individual intervals within the square brackets.
  • For example:
    • [\\0x00-\\0x7F] will deactivate foldcase.
    • [\\0x00-\\0x5A\\0x5C-\\0x7F] will not deactivate foldcase because it consists of two intervals, and neither fulfills the condition from the referenced code.

Examples

Regex Range foldcase After Step 2 foldcase After Step 3
[A-Za-z] true false
[A-Ma-m] true true
[A-Ma-o] false false
[\\0x00-\\0x7F] true false
[\\0x00-\\0x5A\\0x5C-\\0x7F] true true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible overflow in regex parser
3 participants