-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subtraction of ascii_shift_value
caused overflow.
#463
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I wondered about the handling of the case insensitive match, too. I will immediately merge it. Juraj can review retrospectively, if he wishes to.
Hmm, looking at the regex from the issue, why is the case-insensitive matching even enabled? |
Can you also add to test cases something that contains only lowercase and a test that contains only uppercase characters? |
The process by which RE2 decides whether to use 1. Bitvector Computation for the Content Inside
|
bool CharClassBuilder::AddRange(Rune lo, Rune hi) { | |
if (hi < lo) | |
return false; | |
if (lo <= 'z' && hi >= 'A') { | |
// Overlaps some alpha, maybe not all. | |
// Update bitmaps telling which ASCII letters are in the set. | |
Rune lo1 = std::max<Rune>(lo, 'A'); | |
Rune hi1 = std::min<Rune>(hi, 'Z'); | |
if (lo1 <= hi1) | |
upper_ |= ((1 << (hi1 - lo1 + 1)) - 1) << (lo1 - 'A'); | |
lo1 = std::max<Rune>(lo, 'a'); | |
hi1 = std::min<Rune>(hi, 'z'); | |
if (lo1 <= hi1) | |
lower_ |= ((1 << (hi1 - lo1 + 1)) - 1) << (lo1 - 'a'); | |
} |
RE2 calculates two bitvectors,
upper_
and lower_
, for all intervals specified within square brackets together (e.g., [A-Za-z]
):
- The
upper_
bitvector represents the set of uppercase letters used within the square brackets. - The
lower_
bitvector represents the set of lowercase letters.
2. Bitvector Equality Test
mata/3rdparty/re2/re2/regexp.cc
Lines 642 to 644 in e806b8d
bool CharClassBuilder::FoldsASCII() { | |
return ((upper_ ^ lower_) & AlphaMask) == 0; | |
} |
An XOR operation is performed on
lower_
and upper_
to determine if they are equal.
- If the bitvectors are equal, RE2 enables
foldcase
mode. foldcase
is activated only if the lowercase range represents exactly the same letters as the uppercase range.[A-Ma-m]
will activatefoldcase
.[A-Ma-o]
will not activatefoldcase
.
3. Disabling foldcase
(Optimization)
mata/3rdparty/re2/re2/compile.cc
Lines 920 to 926 in e806b8d
// If this range contains all of A-Za-z or none of it, | |
// the fold flag is unnecessary; don't bother. | |
bool fold = foldascii; | |
if ((i->lo <= 'A' && 'z' <= i->hi) || i->hi < 'A' || 'z' < i->lo || | |
('Z' < i->lo && i->hi < 'a')) | |
fold = false; | |
RE2 can disable
foldcase
mode if it determines that all letters are represented in a single interval.
- This check applies to individual intervals within the square brackets.
- For example:
[\\0x00-\\0x7F]
will deactivatefoldcase
.[\\0x00-\\0x5A\\0x5C-\\0x7F]
will not deactivatefoldcase
because it consists of two intervals, and neither fulfills the condition from the referenced code.
Examples
Regex Range | foldcase After Step 2 |
foldcase After Step 3 |
---|---|---|
[A-Za-z] |
true | false |
[A-Ma-m] |
true | true |
[A-Ma-o] |
false | false |
[\\0x00-\\0x7F] |
true | false |
[\\0x00-\\0x5A\\0x5C-\\0x7F] |
true | true |
This PR improves the handling of case-insensitive matches in the
re2parser
. Originally,ascii_shift_value
= 32 was subtracted from each symbol, which caused overflow and unexpected behavior. Now,ascii_shift_value
is subtracted only from symbols between 'a' and 'z' (to convert them to uppercase).Additionally, this PR resolves issue #456, where this bug was initially reported.