Use FSMs for scanning during grammar-guided generation #178
@@ -332,17 +368,17 @@ def _partial_match(
    if not terminated and state == fsm.initial:
        return None, None

    return None if not terminated else i, accepted_states
I'm surprised I did not see any bug related to this
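The comment does not say which behavior it has in mind. One thing that is easy to misread in the hunk, though, is how the final `return` line parses: in Python a conditional expression binds tighter than the comma, so that line always returns a 2-tuple. A minimal sketch of the difference follows; the function names are hypothetical stand-ins, not code from this PR:

```python
def partial_match_return(terminated, i, accepted_states):
    # Same expression shape as the return line in the hunk:
    # parsed as ((None if not terminated else i), accepted_states).
    return None if not terminated else i, accepted_states


def parenthesized_variant(terminated, i, accepted_states):
    # A different meaning: the whole tuple is the "else" branch,
    # so a bare None comes back when `terminated` is False.
    return None if not terminated else (i, accepted_states)


print(partial_match_return(False, 3, (1, 2)))   # (None, (1, 2))
print(parenthesized_variant(False, 3, (1, 2)))  # None
```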
It is I, Daniel, smasher of for loops, obfuscator of code, here to tell you that you can replace the for loop here (aka the one starting at line 108 of /examples/parsing.py) with the probably more efficient, if somewhat less clear
Is this a sensible change for an example? No. Will it be faster when
OOC how does this work and the corresponding paper relate to the "Parsing with Derivatives" from @mattmight et al?
We're aware of that work, but this PR does not use anything from it directly. The work being done here is almost exclusively concerned with the adaptation of existing LALR(1) parsers and the use of the "indices" described in our technical paper.
All the terminal symbol regexes in each parse-state-dependent lexer are combined/unioned into a single FSM, and scanning is performed according to those combined FSMs. The function `fsm_union` was added for that purpose. Since the parser needs to know exactly which terminal symbols were matched, we now need the ability to determine exactly which sub-FSM (i.e. one of the combined terminal symbol FSMs) accepted an input string. The function `get_sub_fsms_from_seq` serves this purpose.
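To make the idea concrete, here is a minimal sketch of unioning several FSMs while keeping track of which sub-FSM accepts a given string. It uses a plain product construction over toy dict-based DFAs; the names `toy_fsm_union` and `toy_accepting_sub_fsms` and the DFA representation are assumptions made for illustration, and this is not how `fsm_union` or `get_sub_fsms_from_seq` are implemented in the PR.

```python
from collections import deque


def toy_fsm_union(fsms):
    """Product construction: a union state is a tuple of sub-FSM states (None = dead)."""
    alphabet = {ch for f in fsms for (_, ch) in f["trans"]}
    start = tuple(f["initial"] for f in fsms)
    trans, seen, queue = {}, {start}, deque([start])
    while queue:
        state = queue.popleft()
        for ch in alphabet:
            nxt = tuple(
                None if s is None else f["trans"].get((s, ch))
                for f, s in zip(fsms, state)
            )
            if all(s is None for s in nxt):
                continue  # every sub-FSM is dead, so omit the transition
            trans[(state, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {"initial": start, "trans": trans}


def toy_accepting_sub_fsms(fsms, union, seq):
    """Return the indices of the sub-FSMs that accept `seq` in full."""
    state = union["initial"]
    for ch in seq:
        state = union["trans"].get((state, ch))
        if state is None:
            return set()
    return {i for i, (f, s) in enumerate(zip(fsms, state)) if s in f["finals"]}


# Hypothetical terminals: a keyword "if" and a name made of the letters i, f, x.
keyword = {"initial": 0, "finals": {2}, "trans": {(0, "i"): 1, (1, "f"): 2}}
name = {"initial": 0, "finals": {1},
        "trans": {(s, c): 1 for s in (0, 1) for c in "ifx"}}

fsms = [keyword, name]
union = toy_fsm_union(fsms)
print(toy_accepting_sub_fsms(fsms, union, "if"))  # {0, 1}
print(toy_accepting_sub_fsms(fsms, union, "ix"))  # {1}
```

Because each union state remembers the state of every component, one pass over the input is enough to report all terminals that match, which is the information a parser would need to pick the matched terminal symbol.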
I'm going to merge this after the tests pass and create follow-up issues for the remaining work mentioned in the description. Doing so will fix the flaky FSM tests.
I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?
Yes, and hopefully very soon! We have a development roadmap for integrating this into our
Thank you for the update! Seeing as #272 is closed, what are the next steps to get DSLs running? Regarding Sequence and a user interface, I'm not finding information on that in the repo or your website. Do you have some more information about your plans there?
Working on it!
This PR is the first step to closing #170 using the approach described in our paper "Efficient Guided Generation for Large Language Models".
In other words, the changes in this PR allow us to create "index" `dict`s for full grammars. By "index" we mean a `dict` that maps partial parse states to subsets of a vocabulary that are valid continuations of the parse state.

NOTE: The indexing work is being split off into its own PR.
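As a rough illustration of what such an index looks like, the sketch below builds one for a single toy FSM rather than for a full grammar: it maps each FSM state to the vocabulary strings that can be consumed from that state without hitting a dead end. The function name `toy_build_index`, the dict-based FSM, and the vocabulary are all assumptions made for this example; the PR's indices are keyed on partial parse states, which carry more information than a bare FSM state.

```python
def toy_build_index(fsm, vocabulary):
    """Map each FSM state to the vocabulary strings that can be consumed from it."""
    states = (
        {fsm["initial"]}
        | {s for (s, _) in fsm["trans"]}
        | set(fsm["trans"].values())
    )
    index = {}
    for state in states:
        allowed = set()
        for token in vocabulary:
            cur = state
            for ch in token:
                cur = fsm["trans"].get((cur, ch))
                if cur is None:
                    break  # no transition: token is not a valid continuation
            else:
                allowed.add(token)  # every character was consumed
        index[state] = allowed
    return index


# Toy FSM for an optionally negated binary number: state 0 is initial,
# state 1 follows a leading "-", state 2 follows at least one digit.
number = {
    "initial": 0,
    "trans": {(0, "-"): 1,
              **{(s, d): 2 for s in (0, 1, 2) for d in "01"}},
}
vocabulary = ["-", "0", "1", "-1", "10", "a", "1-"]

index = toy_build_index(number, vocabulary)
print(sorted(index[0]))  # ['-', '-1', '0', '1', '10']
print(sorted(index[1]))  # ['0', '1', '10']
```

During generation, a lookup like `index[state]` replaces a scan over the whole vocabulary, which is why the cost of constructing the index up front matters.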
TODO:
This will remove a large number of unnecessary states and decrease the overall cost of constructing indices.
`terminal_symbol_scan`. They are currently being completely ignored, which incorrectly excludes vocabulary strings consisting of only ignored tokens (e.g. `" "` in a Python grammar).