Deeply nested binary arithmetic segfaults under native parser #619

isidentical · 2022-01-22T15:28:41Z

This is definietly not a realistic code (but it is still used in projects like this), though I think it might point to an underlying stack overflow or something like that (no experience in debugging rust applications, so unfortunately can't provide any more insight than this)

Here is the exampe payload:

(
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T'
)

Works perfectly with the python parser, but the native one crashes:

 $ LIBCST_PARSER_TYPE=native python -m libcst.tool print t.py                                                 444ms
[1]    11295 segmentation fault (core dumped)  LIBCST_PARSER_TYPE=native python -m libcst.tool print t.py

The text was updated successfully, but these errors were encountered:

zsol · 2022-01-25T11:13:19Z

Hmmm, I can't reproduce this with the release build.
The debug build does throw a stack overflow, but that's kinda expected.

zsol · 2022-01-25T11:16:34Z

Alright I can repro with 5000 + operators :) We need this to not crash, but it'd be nice to handle arbitrarily nested expressions.

(BTW the original parser works fine even with this many operators, but printing the tree fails with a RecursionError, so I'm not sure we can reasonably support huge expressions like this at the moment)

isidentical · 2022-01-25T11:23:20Z

Ah, I can also confirm. The parser itself is working fine (or giving a stack overflow) but the problem seems to be at libcst.tool print which is just segfaulting (under LIBCST_PARSER_TYPE=native):

$ ./target/release/parse < t.py                                                                                3ms
(
   [...]
)

$ ./target/debug/parse < t.py                                                                                 15ms
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1]    18897 IOT instruction (core dumped)  ./target/debug/parse < t.py

 $ python -m libcst.tool print t.py                                                                             0ms
[1]    18999 segmentation fault (core dumped)  python -m libcst.tool print t.py

zsol · 2022-01-25T13:42:15Z

Looking at the core file, the bottleneck is not the parsing (strictly speaking), but the cloning of the excessively nested datastructure causing ~3500 stack frames. There are two things which we can do:

reduce the memory footprint of the rust-native CST objects. There are some low-hanging fruits here, that the rust linter even warns about. this enum for example is huge, but it doesn't need to be; and it appears on every stack of a BinaryOp's clone function.
Somehow limit the depth of the nesting in the parser itself. I'm not 100% sure how to do this, but we could have counters to increment on recursion.

isidentical · 2022-01-25T13:49:34Z

Somehow limit the depth of the nesting in the parser itself. I'm not 100% sure how to do this, but we could have counters to increment on recursion.

In CPython, that is exactly what we do. We hold an integer in the parser state, and increase it before each call (and check for the depth) and decrease it afterwards.
https://github.com/python/cpython/blob/ee60550e9ba3ab94ca43a890cf4313b63ffa1a81/Tools/peg_generator/pegen/c_generator.py#L368-L373

zsol · 2022-01-27T15:26:00Z

#632 should help with any real world use cases. There's kevinmehall/rust-peg#282 open to address this in the parser framework. I'll see if I can hack something up in the meanwhile.

isidentical · 2022-01-27T15:28:22Z

Amazing, thank you @zsol!

zsol added the bug Something isn't working label Jan 25, 2022

zsol mentioned this issue Jan 27, 2022

[native] Box most enums #632

Merged

zsol added the parsing Converting source code into CST nodes label Jun 16, 2022

zsol mentioned this issue Aug 26, 2023

Stack overflow when parsing an expression #1000

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deeply nested binary arithmetic segfaults under native parser #619

Deeply nested binary arithmetic segfaults under native parser #619

isidentical commented Jan 22, 2022

zsol commented Jan 25, 2022

zsol commented Jan 25, 2022 •

edited

Loading

isidentical commented Jan 25, 2022

zsol commented Jan 25, 2022

isidentical commented Jan 25, 2022

zsol commented Jan 27, 2022

isidentical commented Jan 27, 2022

Deeply nested binary arithmetic segfaults under native parser #619

Deeply nested binary arithmetic segfaults under native parser #619

Comments

isidentical commented Jan 22, 2022

zsol commented Jan 25, 2022

zsol commented Jan 25, 2022 • edited Loading

isidentical commented Jan 25, 2022

zsol commented Jan 25, 2022

isidentical commented Jan 25, 2022

zsol commented Jan 27, 2022

isidentical commented Jan 27, 2022

zsol commented Jan 25, 2022 •

edited

Loading