Case-insensitive literal matches throw off error position #361

Closed · emk opened this issue Oct 6, 2023 · 6 comments

emk (Contributor) commented Oct 6, 2023

First, I want to thank you for my absolute favorite Rust parser library. I've written so many little peg parsers for all kinds of things in so many open source tools. peg is an amazing tool.

I'm trying to write a parser that records token locations and whitespace, and I've run into a weird problem that throws off error positions. I'm trying to use the basic trick in #216, and it winds up looking like this:

        /// Keywords. These use case-insensitive matching, and may not be
        /// followed by a valid identifier character.
        rule k(kw: &'static str) -> Token
            = s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            }
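
For reference, the Token type itself isn't shown anywhere in the thread. A minimal definition consistent with how the rule constructs it might look like this (hypothetical, inferred from the fields used above):

    use std::ops::Range;

    // Hypothetical Token type, inferred from the grammar snippet above; the
    // issue never shows the real definition.
    struct Token {
        /// Byte range covering the keyword plus its trailing whitespace.
        span: Range<usize>,
        /// Offset within `text` at which the trailing whitespace starts.
        ws_offset: usize,
        /// The matched keyword text followed by its trailing whitespace.
        text: String,
    }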

However, $([_]*<{kw.len()}>) unconditionally consumes kw.len() characters before the {? ... } block gets a chance to reject them. In a language that has some long keywords, this means errors may be reported far to the right of where they actually occur.
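
To make the failure mode concrete, here is a minimal, self-contained sketch (my own illustration, not code from the issue; the demo grammar, rule names, and input are made up, assuming the peg 0.8 crate as a dependency):

    // The repeat `[_]*<{kw.len()}>` consumes kw.len() characters before the
    // `{? ... }` check ever runs, so when the check rejects the text, the
    // failure is recorded after the consumed characters rather than at the
    // start of the would-be keyword.
    peg::parser! {
        grammar demo() for str {
            rule k(kw: &'static str)
                = input:$([_]*<{kw.len()}>) {?
                    if input.eq_ignore_ascii_case(kw) { Ok(()) } else { Err(kw) }
                }

            pub rule stmt() = k("SELECT")
        }
    }

    fn main() {
        // "CREATE" is rejected, but only after six characters have already
        // been consumed, so the reported offset sits to the right of where
        // the bad token begins.
        let err = demo::stmt("CREATE").unwrap_err();
        println!("expected {} at offset {}", err.expected, err.location.offset);
    }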

This version of the k rule is better:

        rule k(kw: &'static str) -> Token
            = quiet! { s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            } }
            / expected!("keyword")

...but it can only ever report "expected keyword", never "expected SELECT", "expected DROP", or "expected CREATE".

I tried replacing / expected!("keyword") with / {? Err(keyword) }. Unfortunately, this trips peg's infinite-loop detection, because peg can't see the guaranteed failure and treats the alternative as one that could match empty input.

Then I started to try weird things, like positive lookahead:

/// A function that compares two strings for equality.
type StrCmp = dyn Fn(&str, &str) -> bool;

        /// Tricky zero-length rule for asserting the next token without
        /// advancing the parse location.
        rule want(s: &'static str, cmp: &StrCmp)
            = &(found:$([_]*<{s.len()}>) {?
                if cmp(found, s) {
                    Ok(())
                } else {
                    Err(s)
                }
            })

        rule k(kw: &'static str) -> Token
            = want(kw, &str::eq_ignore_ascii_case)
              s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!()
            {
                if !KEYWORDS.contains(kw) {
                    panic!("BUG: {:?} is not in KEYWORDS", kw);
                }
                Token {
                    span: s..e,
                    ws_offset: input.len(),
                    text: format!("{}{}", input, ws),
                }
            }

This does other weird things, like failing with "expected <unreported>", because apparently positive look-ahead is silent on failure?

Anyway, I'll try a few more things and see how it goes. Maybe want(..) should match twice, once as positive lookahead and a second time for real?

emk (Contributor, author) commented Oct 6, 2023

Hmm, there's no obvious way to fix want without either silently losing errors from k, or matching too far ahead.

Interestingly, I strongly suspect that support for / expected!(kw) would be enough to fix this pretty cleanly. Then I could use quiet! { ... } / expected!(...), which seems to provide the most accurate positions for errors out of all the alternatives.

Thank you for any thoughts or suggestions! And thank you for a great parser.

emk added a commit to faradayio/joinery referencing this issue on Oct 6, 2023, with the message:
See kevinmehall/rust-peg#361.

I have fixed the positions but lost the reporting of actual
tokens/keywords.

kevinmehall (Owner) commented:

expected!(non_literal_expr) seems like a good idea and shouldn't be hard to support.

If you want something that works on the current version, you could trick the infinite loop detection with {? Err(...) } "no_match" or even {? Err(...) } "" (literals are always considered non-nullable without checking if they're empty). The loop detection is supposed to conservatively avoid rejecting non-looping code, so {? ..} should probably be relaxed -- right now it's not handled separately from unconditional sequences.

This would also be a good use case for #284 once added.

emk (Contributor, author) commented Oct 7, 2023

Thank you for the suggestions!

The no_match trick seems to fail with an error, though:

error: expected one of "#", "/", ";", "crate", "pub", "rule", "use", "}"
    --> src/ast.rs:2510:27
     |
2510 |             / {? Err(s) } "no_match"
     |                           ^^^^^^^^^^

kevinmehall (Owner) commented:

Oops, try (({? Err(s) }) "no_match"). Normally there wouldn't be any reason to put another expression after a block, but what this does is give the loop checker something in the sequence that would consume input. It comes after the expression has already failed, so it doesn't affect what the rule matches.

kevinmehall (Owner) commented:

a85e71b allows expected!() to take an expression evaluating to &'static str rather than just a literal. Released in 0.8.2.
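
With that in place, the quiet!-wrapped rule from earlier can presumably drop the fixed "keyword" label and report the actual keyword by passing kw straight through. A sketch of the combination (my edit of the snippet above, not code taken from the thread):

        rule k(kw: &'static str) -> Token
            = quiet! { s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            } }
            / expected!(kw)

This would keep the accurate error positions of the quiet! version while restoring per-keyword "expected SELECT"-style messages.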

emk (Contributor, author) commented Oct 11, 2023

Thank you, this is excellent! This will allow me to give much better errors in certain parsers. As always, peg is fantastic.
