Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deeply nested binary arithmetic segfaults under native parser #619

Open
isidentical opened this issue Jan 22, 2022 · 7 comments
Open

Deeply nested binary arithmetic segfaults under native parser #619

isidentical opened this issue Jan 22, 2022 · 7 comments
Labels
bug Something isn't working parsing Converting source code into CST nodes

Comments

@isidentical
Copy link
Contributor

This is definietly not a realistic code (but it is still used in projects like this), though I think it might point to an underlying stack overflow or something like that (no experience in debugging rust applications, so unfortunately can't provide any more insight than this)

Here is the exampe payload:

(
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' + 
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T' +
    'X' + 'Y' + 'Z' + 'Q' + 'T'
)

Works perfectly with the python parser, but the native one crashes:

 $ LIBCST_PARSER_TYPE=native python -m libcst.tool print t.py                                                 444ms
[1]    11295 segmentation fault (core dumped)  LIBCST_PARSER_TYPE=native python -m libcst.tool print t.py
@zsol
Copy link
Member

zsol commented Jan 25, 2022

Hmmm, I can't reproduce this with the release build.
The debug build does throw a stack overflow, but that's kinda expected.

@zsol
Copy link
Member

zsol commented Jan 25, 2022

Alright I can repro with 5000 + operators :) We need this to not crash, but it'd be nice to handle arbitrarily nested expressions.

(BTW the original parser works fine even with this many operators, but printing the tree fails with a RecursionError, so I'm not sure we can reasonably support huge expressions like this at the moment)

@zsol zsol added the bug Something isn't working label Jan 25, 2022
@isidentical
Copy link
Contributor Author

Ah, I can also confirm. The parser itself is working fine (or giving a stack overflow) but the problem seems to be at libcst.tool print which is just segfaulting (under LIBCST_PARSER_TYPE=native):

$ ./target/release/parse < t.py                                                                                3ms
(
   [...]
)

$ ./target/debug/parse < t.py                                                                                 15ms
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1]    18897 IOT instruction (core dumped)  ./target/debug/parse < t.py

 $ python -m libcst.tool print t.py                                                                             0ms
[1]    18999 segmentation fault (core dumped)  python -m libcst.tool print t.py

@zsol
Copy link
Member

zsol commented Jan 25, 2022

Looking at the core file, the bottleneck is not the parsing (strictly speaking), but the cloning of the excessively nested datastructure causing ~3500 stack frames. There are two things which we can do:

  1. reduce the memory footprint of the rust-native CST objects. There are some low-hanging fruits here, that the rust linter even warns about. this enum for example is huge, but it doesn't need to be; and it appears on every stack of a BinaryOp's clone function.
  2. Somehow limit the depth of the nesting in the parser itself. I'm not 100% sure how to do this, but we could have counters to increment on recursion.

@isidentical
Copy link
Contributor Author

Somehow limit the depth of the nesting in the parser itself. I'm not 100% sure how to do this, but we could have counters to increment on recursion.

In CPython, that is exactly what we do. We hold an integer in the parser state, and increase it before each call (and check for the depth) and decrease it afterwards.
https://github.com/python/cpython/blob/ee60550e9ba3ab94ca43a890cf4313b63ffa1a81/Tools/peg_generator/pegen/c_generator.py#L368-L373

@zsol
Copy link
Member

zsol commented Jan 27, 2022

#632 should help with any real world use cases. There's kevinmehall/rust-peg#282 open to address this in the parser framework. I'll see if I can hack something up in the meanwhile.

@isidentical
Copy link
Contributor Author

Amazing, thank you @zsol!

@zsol zsol added the parsing Converting source code into CST nodes label Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working parsing Converting source code into CST nodes
Projects
None yet
Development

No branches or pull requests

2 participants