Rules with multibyte UTF-8 characters do not work right #246

Open
bradlarsen opened this issue Jan 16, 2025 · 2 comments

Labels
bug: Something isn't working
detection: Related to rules or detection of sensitive information

Comments

@bradlarsen
Collaborator

Describe the bug
Rules that contain multibyte UTF-8 characters do not behave as you would expect.

To Reproduce
Here is a sample file, utf8rules.yml:

rules:

- name: UTF-8 Test Rule
  id: utf8.1

  # regular single-byte characters work without surprise
  pattern: |
    (?x)
    (
      good\ day
    )

  examples:
  - 'good day'

  negative_examples:
  - 'Good Day!'


# literal utf-8 multibyte characters also seem to work without surprise
- name: UTF-8 Test Rule
  id: utf8.2

  pattern: |
    (?x)
    (
      Güten\ Tag
    )

  examples:
  - 'Güten Tag'

  negative_examples:
  - 'güten Tag'


# When you use the case-insensitive (?i) flag, utf-8 multibyte characters DON'T
# work as you expect; presumably the single bytes of the `ü` are individually
# handled case-insensitively, which is the wrong thing
- name: UTF-8 Test Rule
  id: utf8.3

  pattern: |
    (?x)(?i)
    (
      Güten\ Tag
    )

  examples:
  - 'Güten Tag'
  - 'güten tag'

  negative_examples:
  # one would like this to actually match, but the (?i) flag doesn't interact
  # properly with multibyte characters in Nosey Parker
  - 'GÜTEN TAG'


# You can explicitly specify different multibyte UTF-8 characters using regex
# alternation, so as to verbosely approximate case-insensitivity.
# But you have to use regex alternation, not character classes, to avoid a
# Vectorscan error about `Unicode not allowed here`.
- name: UTF-8 Test Rule
  id: utf8.4

  pattern: |
    (?x)(?i)
    (
      G (?: ü | Ü ) ten\ Tag
    )

  examples:
  - 'Güten Tag'
  - 'güten tag'
  - 'GÜTEN TAG'


rulesets:

- name: UTF-8 Tests
  description: 'Tests for UTF-8 rule and input handling'
  id: utf8

  include_rule_ids:
  - utf8.1
  - utf8.2
  - utf8.3
  - utf8.4

Validate with `noseyparker rules check --rules-path utf8rules.yml`.

Expected behavior
Multibyte UTF-8 sequences would work as expected in all pattern contexts (character classes, case-insensitive flags, etc.), without surprise.

Actual behavior
Workarounds are required: case-insensitive matching of multibyte characters fails to match as expected, and multibyte characters in character classes are rejected by Vectorscan, so rules have to fall back to explicit regex alternation.

Output of noseyparker --version
This applies to all versions of Nosey Parker.

@bradlarsen added the bug and detection labels on Jan 16, 2025
@bradlarsen
Collaborator Author

Internally, Nosey Parker uses two regex engines to do its matching.

First, Vectorscan matches all the patterns of the enabled rules simultaneously against the input. This runs VERY fast (something like 4GB/s per core), but it only reports the ID of the pattern that matched and the end byte offset of the match. It also has "all matches" semantics, unlike most other regex engines, so a separate pass is required to discard all but the longest of each group of overlapping matches. This all happens in the Matcher::scan_blob function.

The Vectorscan C++ library is exposed via the vectorscan-rs crate.

Second, Rust's regex crate is run with the appropriate pattern on each of the Vectorscan matches to determine the start of the match and the content of the capture groups.
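
To make the two-pass flow concrete, here is a minimal sketch of that second pass (hypothetical names only, not the actual Matcher::scan_blob code), assuming Vectorscan has already reported a candidate end offset for a given rule's pattern:

use regex::bytes::{Captures, Regex};

// Hypothetical sketch of the second pass: Vectorscan only reports the rule id
// and the end byte offset of a match, so the corresponding `regex` pattern is
// re-run over the blob prefix ending at that offset to recover the match start
// and the capture groups. The real logic lives in Matcher::scan_blob and is
// more careful about overlapping candidates.
fn confirm_candidate<'b>(
    pattern: &Regex,
    blob: &'b [u8],
    end_offset: usize,
) -> Option<Captures<'b>> {
    // Only the bytes up to the Vectorscan-reported end offset matter here.
    let prefix = &blob[..end_offset.min(blob.len())];
    // The candidate match ends at the end of this prefix, so take the last match.
    pattern.captures_iter(prefix).last()
}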

Anyway, both of these regex engines support UTF-8 patterns and inputs. It should be possible to enhance Nosey Parker so that multibyte UTF-8 characters appearing in rules behave without surprise. However, this will take some thought and implementation work.

Note that Vectorscan's UTF-8 support is limited to matching on well-formed UTF-8 inputs. This is NOT the case for the regex crate, whose bytes::Regex doesn't have such restrictions.
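
For example (a standalone illustration, not Nosey Parker code), the regex crate's bytes::Regex will happily find a pattern inside a haystack that is not well-formed UTF-8 as a whole:

use regex::bytes::Regex;

fn main() {
    // The pattern contains the multibyte character `ü` (0xC3 0xBC in UTF-8).
    let re = Regex::new(r"Güten Tag").unwrap();
    // The haystack contains invalid UTF-8 bytes (0xFF, 0xFE) around the match;
    // bytes::Regex does not require the input to be well-formed UTF-8.
    let haystack: &[u8] = b"\xFF\xFE G\xC3\xBCten Tag \xFF";
    assert!(re.is_match(haystack));
}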

@bradlarsen
Collaborator Author

The implementation approach that seems like it would give the best quality is this:

  • Add a proper regex parser / frontend to Nosey Parker
  • Have the frontend compile away (?i) flags from the patterns, explicitly transforming them into character classes or regex alternation plus byte sequences for multibyte UTF-8 characters (a rough sketch of this expansion appears at the end of this comment)

With this implementation:

  • The pattern strings given to either vectorscan-rs or regex for matching would then not contain any multibyte characters or (?i) flags
  • Vectorscan would be able to do UTF-8 matching even on invalid UTF-8 inputs (matching such inputs with its built-in UTF-8 support is undefined behavior)
  • There would be no surprises with multibyte character handling

It would be a bit of work though.
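
As a rough sketch of the case-folding step (a hypothetical helper, not an existing Nosey Parker function), a literal could be expanded into explicit alternations over the upper/lower variants of each character, so the pattern no longer needs (?i); a fuller version would also rewrite the remaining non-ASCII characters into explicit \xNN byte escapes so the final pattern contains only single-byte characters, per the bullets above.

use std::fmt::Write;

// Hypothetical sketch: expand a literal into a case-insensitive pattern by
// alternating over the simple upper/lower variants of each character. Full
// Unicode case folding has more cases than this (e.g. ß), so a real
// implementation would use proper case-folding tables.
fn expand_case_insensitive(literal: &str) -> String {
    let mut out = String::new();
    for ch in literal.chars() {
        let lower: String = ch.to_lowercase().collect();
        let upper: String = ch.to_uppercase().collect();
        if lower == upper {
            // No case distinction (digits, spaces, punctuation): emit as-is, escaped.
            out.push_str(&regex::escape(&lower));
        } else {
            write!(out, "(?:{}|{})", regex::escape(&lower), regex::escape(&upper)).unwrap();
        }
    }
    out
}

// expand_case_insensitive("Güten Tag") produces something like
// (?:g|G)(?:ü|Ü)(?:t|T)(?:e|E)(?:n|N) (?:t|T)(?:a|A)(?:g|G)

The same expanded pattern could then be handed to both vectorscan-rs and regex, keeping the two engines' behavior consistent.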
