Describe the bug
Rules that contain multibyte UTF-8 characters do not behave as you would expect.
To Reproduce
Here is a sample file, utf8rules.yml:
rules:

- name: UTF-8 Test Rule
  id: utf8.1
  # regular single-byte characters work without surprise
  pattern: |
    (?x)
    (
      good\ day
    )
  examples:
  - 'good day'
  negative_examples:
  - 'Good Day!'

# literal utf-8 multibyte characters also seem to work without surprise
- name: UTF-8 Test Rule
  id: utf8.2
  pattern: |
    (?x)
    (
      Güten\ Tag
    )
  examples:
  - 'Güten Tag'
  negative_examples:
  - 'güten Tag'

# When you use the case-insensitive (?i) flag, utf-8 multibyte characters DON'T
# work as you expect; presumably the single bytes of the `ü` are individually
# handled case-insensitively, which is the wrong thing
- name: UTF-8 Test Rule
  id: utf8.3
  pattern: |
    (?x)(?i)
    (
      Güten\ Tag
    )
  examples:
  - 'Güten Tag'
  - 'güten tag'
  negative_examples:
  # one would like this to actually match, but the (?i) flag doesn't interact
  # properly with multibyte characters in Nosey Parker
  - 'GÜTEN TAG'

# You can explicitly specify different multibyte UTF-8 characters using regex
# alternation, so as to verbosely approximate case-insensitivity.
# But you have to use regex alternation, not character classes, to avoid a
# Vectorscan error about `Unicode not allowed here`.
- name: UTF-8 Test Rule
  id: utf8.4
  pattern: |
    (?x)(?i)
    (
      G (?: ü | Ü ) ten\ Tag
    )
  examples:
  - 'Güten Tag'
  - 'güten tag'
  - 'GÜTEN TAG'

rulesets:
- name: UTF-8 Tests
  description: 'Tests for UTF-8 rule and input handling'
  id: utf8
  include_rule_ids:
  - utf8.1
  - utf8.2
  - utf8.3
  - utf8.4
Validate with noseyparker rules check --rules-path utf8rules.yml
Expected behavior
Multibyte UTF-8 sequences would work as expected in all pattern contexts (character classes, etc), without surprise.
Actual behavior
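Several workarounds are required. The (?i) flag does not case-fold multibyte characters, so each case variant has to be listed explicitly, and that has to be done with regex alternation rather than a character class to avoid a Vectorscan error about `Unicode not allowed here`.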
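To make the suspected cause concrete, here is a minimal sketch of what byte-wise case folding does to `ü` and `Ü`. It is not Nosey Parker or Vectorscan code; it only illustrates the hypothesis from the rule comments that the bytes of a multibyte character are folded individually. The helper names `hex_bytes`, `bytewise_ci_eq`, and `unicode_ci_eq` are made up for this example.

```rust
/// Render the UTF-8 bytes of `s` as space-separated hex (illustrative helper).
fn hex_bytes(s: &str) -> String {
    s.bytes()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Compare two strings byte by byte, folding case for ASCII letters only.
/// This models the assumed "each byte handled individually" behavior.
fn bytewise_ci_eq(a: &str, b: &str) -> bool {
    a.as_bytes().eq_ignore_ascii_case(b.as_bytes())
}

/// Compare two strings after Unicode-aware lowercasing.
fn unicode_ci_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    // The two characters share a lead byte but differ in the second byte,
    // and that second byte is not an ASCII letter, so ASCII folding skips it.
    println!("ü = {}", hex_bytes("ü"));
    println!("Ü = {}", hex_bytes("Ü"));

    for candidate in ["güten tag", "GÜTEN TAG"] {
        println!(
            "{:?}: bytewise={} unicode={}",
            candidate,
            bytewise_ci_eq("Güten Tag", candidate),
            unicode_ci_eq("Güten Tag", candidate)
        );
    }
}
```

Under this model, `güten tag` still matches because only ASCII letters change case, while `GÜTEN TAG` does not, which is consistent with the examples in rule utf8.3.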
Output of noseyparker --version
This applies to all versions of Nosey Parker
Internally, Nosey Parker uses two regex engines to do its matching.
First, Vectorscan simultaneously matches all the patterns of the enabled rules against the input. This runs VERY fast (something like 4GB/s per core), but only provides the ID of the pattern that matched and the end byte offset of the match. It also has "all matches" semantics, unlike most other regex engines, so a further pass is needed to discard all but the longest of the matches it reports. This all happens in the Matcher::scan_blob function.
The Vectorscan C++ library is exposed via the vectorscan-rs crate.
Second, Rust's regex crate is run with the appropriate pattern on each of the Vectorscan matches to determine the start of the match and the content of the capture groups.
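The two stages described above can be sketched with a toy in plain Rust. This is not the code in Matcher::scan_blob and it uses neither vectorscan-rs nor the regex crate: the single pattern is hard-coded as `[0-9]+`, `scan_all_matches` stands in for the first stage, `find_start` stands in for the second stage, and the rule for discarding shorter matches (drop any span contained in one already kept) is an assumption made for illustration. All names here are hypothetical.

```rust
/// A first-stage hit: which pattern matched and where the match ends (exclusive).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Hit {
    pattern_id: usize,
    end: usize,
}

/// A fully resolved match with both offsets.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    pattern_id: usize,
    start: usize,
    end: usize,
}

/// Toy first stage with "all matches" semantics for the pattern `[0-9]+`:
/// every offset at which some match ends is reported, with no start offset.
fn scan_all_matches(input: &[u8]) -> Vec<Hit> {
    input
        .iter()
        .enumerate()
        .filter(|(_, b)| b.is_ascii_digit())
        .map(|(i, _)| Hit { pattern_id: 0, end: i + 1 })
        .collect()
}

/// Toy second stage: recover the start of the longest `[0-9]+` match ending at `end`.
fn find_start(input: &[u8], end: usize) -> usize {
    let mut start = end;
    while start > 0 && input[start - 1].is_ascii_digit() {
        start -= 1;
    }
    start
}

/// Resolve hits from the largest end offset down and drop any span that is
/// contained in an already kept span of the same pattern (assumed policy).
fn keep_longest(input: &[u8], hits: &[Hit]) -> Vec<Span> {
    let mut sorted = hits.to_vec();
    sorted.sort_by(|a, b| b.end.cmp(&a.end));

    let mut kept: Vec<Span> = Vec::new();
    for hit in sorted {
        let start = find_start(input, hit.end);
        let contained = kept
            .iter()
            .any(|k| k.pattern_id == hit.pattern_id && k.start <= start && hit.end <= k.end);
        if !contained {
            kept.push(Span { pattern_id: hit.pattern_id, start, end: hit.end });
        }
    }
    kept.reverse(); // return spans in input order
    kept
}

fn main() {
    let input = b"id 123 and 45";
    let hits = scan_all_matches(input);
    println!("first stage reported {} hits", hits.len());
    for span in keep_longest(input, &hits) {
        println!("{:?}", span);
    }
}
```

The point of the toy is only the shape of the data: the first stage yields many end offsets per real match, and the second stage is what turns each surviving hit into a start offset and, in the real tool, capture groups.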
Anyway, both of these regex engines support UTF-8 patterns and inputs. It should be possible to enhance Nosey Parker so that multibyte UTF-8 characters appearing in rules behave without surprise. However, this will take some thought and implementation work.
Note that Vectorscan's UTF-8 support is limited to matching on well-formed UTF-8 inputs. This is NOT the case for the regex crate, whose bytes::Regex doesn't have such restrictions.
The possible implementation that seems like it would have the best quality is this:
Add a proper regex parser / frontend to Nosey Parker
Have the frontend compile away (?i) flags from the patterns, explicitly transforming them into character classes for single-byte characters, or into regex alternation over byte sequences for multibyte UTF-8 characters
With this implementation:
The pattern strings given to either vectorscan-rs or regex for matching would then not contain any multibyte characters or (?i) flags
Vectorscan would be able to match UTF-8 patterns even on invalid UTF-8 inputs (on which its built-in UTF-8 support has undefined behavior)
There would be no surprises with multibyte character handling
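As a rough sketch of what "compiling away (?i)" could look like for a plain literal, the function below rewrites each character into an alternation over the UTF-8 byte sequences of its case variants, written as `\xNN` escapes. It is an assumption-laden illustration, not the proposed frontend: it handles only literals (no regex syntax), only one-to-one case mappings from Rust's standard library (so special cases such as `ß` keep a single form), and it assumes the resulting pattern would be compiled in a byte-oriented, non-Unicode mode so that `\xNN` means a single byte. The function names are hypothetical.

```rust
/// Return the only item of an iterator, or None if it yields zero or several.
fn single(mut it: impl Iterator<Item = char>) -> Option<char> {
    match (it.next(), it.next()) {
        (Some(c), None) => Some(c),
        _ => None,
    }
}

/// Collect `c` plus its one-to-one lowercase and uppercase forms, without duplicates.
fn case_variants(c: char) -> Vec<char> {
    let mut variants = vec![c];
    let lower = single(c.to_lowercase());
    let upper = single(c.to_uppercase());
    for cand in lower.into_iter().chain(upper) {
        if !variants.contains(&cand) {
            variants.push(cand);
        }
    }
    variants
}

/// Write the UTF-8 encoding of `c` as a sequence of `\xNN` byte escapes.
fn byte_escapes(c: char) -> String {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf)
        .bytes()
        .map(|b| format!("\\x{:02x}", b))
        .collect()
}

/// Expand a literal into a pattern that matches it case-insensitively
/// without relying on the (?i) flag or on any non-ASCII pattern text.
fn expand_case_insensitive(literal: &str) -> String {
    let mut out = String::new();
    for c in literal.chars() {
        let variants = case_variants(c);
        if variants.len() == 1 {
            out.push_str(&byte_escapes(c));
        } else {
            let alts: Vec<String> = variants.iter().map(|&v| byte_escapes(v)).collect();
            out.push_str(&format!("(?:{})", alts.join("|")));
        }
    }
    out
}

fn main() {
    println!("{}", expand_case_insensitive("Güten Tag"));
}
```

Escaping every byte, including spaces, keeps the output safe under `(?x)` and free of metacharacters, at the cost of readability. A real frontend would more likely emit character classes for ASCII letters and would need a proper case-folding table rather than `to_lowercase` and `to_uppercase`.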