Case-insensitive literal matches throw off error position #361

Closed · emk opened this issue Oct 6, 2023 · 6 comments

emk (Contributor) commented Oct 6, 2023

First, I want to thank you for my absolute favorite Rust parser library. I've written so many little peg parsers for all kinds of things in so many open source tools. peg is an amazing tool.

I'm trying to write a parser that records token locations and whitespace, and I've run into a weird problem that throws off error positions. I'm trying to use the basic trick in #216, and it winds up looking like this:

        /// Keywords. These use case-insensitive matching, and may not be
        /// followed by a valid identifier character.
        rule k(kw: &'static str) -> Token
            = s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            }
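
For reference, the Token type itself isn't shown anywhere in the thread. A minimal definition consistent with how the rule constructs it might look like this (hypothetical, inferred from the fields used above):

    use std::ops::Range;

    // Hypothetical Token type, inferred from the grammar snippet above; the
    // issue never shows the real definition.
    struct Token {
        /// Byte range covering the keyword plus its trailing whitespace.
        span: Range<usize>,
        /// Offset within `text` at which the trailing whitespace starts.
        ws_offset: usize,
        /// The matched keyword text followed by its trailing whitespace.
        text: String,
    }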

However, $([_]*<{kw.len()}>) unconditionally consumes kw.len() characters before the {? ... } block gets a chance to reject them. In a language that has some long keywords, this means errors may be reported far to the right of where they actually occur.
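
To make the failure mode concrete, here is a minimal, self-contained sketch (my own illustration, not code from the issue; the demo grammar, rule names, and input are made up, assuming the peg 0.8 crate as a dependency):

    // The repeat `[_]*<{kw.len()}>` consumes kw.len() characters before the
    // `{? ... }` check ever runs, so when the check rejects the text, the
    // failure is recorded after the consumed characters rather than at the
    // start of the would-be keyword.
    peg::parser! {
        grammar demo() for str {
            rule k(kw: &'static str)
                = input:$([_]*<{kw.len()}>) {?
                    if input.eq_ignore_ascii_case(kw) { Ok(()) } else { Err(kw) }
                }

            pub rule stmt() = k("SELECT")
        }
    }

    fn main() {
        // "CREATE" is rejected, but only after six characters have already
        // been consumed, so the reported offset sits to the right of where
        // the bad token begins.
        let err = demo::stmt("CREATE").unwrap_err();
        println!("expected {} at offset {}", err.expected, err.location.offset);
    }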

This version of the k rule is better:

        rule k(kw: &'static str) -> Token
            = quiet! { s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            } }
            / expected!("keyword")

...but it can only ever report "expected keyword", never "expected SELECT", "expected DROP", or "expected CREATE".

I tried replacing / expected!("keyword") with / {? Err(keyword) }. Unfortunately, this trips peg's infinite-loop detection, because peg can't see the guaranteed failure and treats the alternative as one that could match empty input.

Then I started to try weird things, like positive lookahead:

/// A function that compares two strings for equality.
type StrCmp = dyn Fn(&str, &str) -> bool;

        /// Tricky zero-length rule for asserting the next token without
        /// advancing the parse location.
        rule want(s: &'static str, cmp: &StrCmp)
            = &(found:$([_]*<{s.len()}>) {?
                if cmp(found, s) {
                    Ok(())
                } else {
                    Err(s)
                }
            })

        rule k(kw: &'static str) -> Token
            = want(kw, &str::eq_ignore_ascii_case)
              s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!()
            {
                if !KEYWORDS.contains(kw) {
                    panic!("BUG: {:?} is not in KEYWORDS", kw);
                }
                Token {
                    span: s..e,
                    ws_offset: input.len(),
                    text: format!("{}{}", input, ws),
                }
            }

This does other weird things, like failing with "expected <unreported>", because apparently positive look-ahead is silent on failure?

Anyway, I'll try a few more things and see how it goes. Maybe want(..) should match twice, once as positive lookahead and a second time for real?

emk (Contributor, author) commented Oct 6, 2023

Hmm, there's no obvious way to fix want without either silently losing errors from k, or matching too far ahead.

Interestingly, I strongly suspect that support for / expected!(kw) would be enough to fix this pretty cleanly. Then I could use quiet! { ... } / expected!(...), which seems to provide the most accurate positions for errors out of all the alternatives.

Thank you for any thoughts or suggestions! And thank you for a great parser.

emk added a commit to faradayio/joinery referencing this issue on Oct 6, 2023, with the message:
See kevinmehall/rust-peg#361.

I have fixed the positions but lost the reporting of actual
tokens/keywords.

kevinmehall (Owner) commented:

expected!(non_literal_expr) seems like a good idea and shouldn't be hard to support.

If you want something that works on the current version, you could trick the infinite loop detection with {? Err(...) } "no_match" or even {? Err(...) } "" (literals are always considered non-nullable without checking if they're empty). The loop detection is supposed to conservatively avoid rejecting non-looping code, so {? ..} should probably be relaxed -- right now it's not handled separately from unconditional sequences.

This would also be a good use case for #284 once added.

emk (Contributor, author) commented Oct 7, 2023

Thank you for the suggestions!

The no_match trick seems to fail with an error, though:

error: expected one of "#", "/", ";", "crate", "pub", "rule", "use", "}"
    --> src/ast.rs:2510:27
     |
2510 |             / {? Err(s) } "no_match"
     |                           ^^^^^^^^^^

kevinmehall (Owner) commented:

Oops, try (({? Err(s) }) "no_match"). Normally there wouldn't be any reason to put another expression after a block, but what this does is give the loop checker something in the sequence that would consume input. It comes after the expression has already failed, so it doesn't affect what the rule matches.

kevinmehall (Owner) commented:

a85e71b allows expected!() to take an expression evaluating to &'static str rather than just a literal. Released in 0.8.2.
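
With that in place, the quiet!-wrapped rule from earlier can presumably drop the fixed "keyword" label and report the actual keyword by passing kw straight through. A sketch of the combination (my edit of the snippet above, not code taken from the thread):

        rule k(kw: &'static str) -> Token
            = quiet! { s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            } }
            / expected!(kw)

This would keep the accurate error positions of the quiet! version while restoring per-keyword "expected SELECT"-style messages.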

emk (Contributor, author) commented Oct 11, 2023

Thank you, this is excellent! This will allow me to give much better errors in certain parsers. As always, peg is fantastic.
