
Use FSMs for scanning during grammar-guided generation #178

Merged

Conversation

brandonwillard
Member

@brandonwillard brandonwillard commented Jul 10, 2023

This PR is the first step to closing #170 using the approach described in our paper "Efficient Guided Generation for Large Language Models".

In other words, the changes in this PR allow us to create "index" dicts for full grammars. By "index" we mean a dict that maps partial parse states to subsets of a vocabulary that are valid continuations of the parse state.

NOTE: The indexing work is being split off into its own PR.

TODO:

  • Use a more efficient DFA implementation.
  • Use DFA minimization on the parse state FSMs.
    This will remove a large number of unnecessary states and decrease the overall cost of constructing indices.
  • We need to do something about ignored tokens in terminal_symbol_scan.
    They are currently being completely ignored, which incorrectly excludes vocabulary strings consisting of only ignored tokens (e.g. " " in a Python grammar).
  • Convert the terminal symbol FSM states used to compute the indices to their corresponding parser state (aggregate/unioned) FSM states.
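The "index" described above can be illustrated with a minimal sketch (this is not the PR's implementation; the transition table, vocabulary, and helper names below are all hypothetical):

```python
# Toy FSM for the regex "ab*": state 0 is initial, state 1 is accepting.
transitions = {
    (0, "a"): 1,
    (1, "b"): 1,
}

# Toy vocabulary: token id -> token string.
vocab = {0: "a", 1: "b", 2: "ab", 3: "c"}


def walk(state, string):
    """Return the FSM state reached after consuming `string`, or None on a dead end."""
    for ch in string:
        state = transitions.get((state, ch))
        if state is None:
            return None
    return state


def build_index(states, vocab):
    """Map each FSM state to the set of token ids that are valid continuations."""
    index = {}
    for state in states:
        valid = {tid for tid, s in vocab.items() if walk(state, s) is not None}
        if valid:
            index[state] = valid
    return index


index = build_index({0, 1}, vocab)
# From state 0, "a" and "ab" keep the FSM alive; from state 1, only "b" does.
```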

@brandonwillard brandonwillard added enhancement text Linked to text generation labels Jul 10, 2023
@brandonwillard brandonwillard marked this pull request as draft July 10, 2023 03:36
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 4 times, most recently from 714cb7b to 955de6d Compare July 13, 2023 20:25
@rlouf rlouf added the structured generation Linked to structured generation label Jul 17, 2023
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 11 times, most recently from 99b8aaf to 9eaff91 Compare July 24, 2023 19:51
@@ -332,17 +368,17 @@ def _partial_match(
if not terminated and state == fsm.initial:
return None, None

return None if not terminated else i, accepted_states
Member

I'm surprised I did not see any bug related to this

@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 8 times, most recently from b65b38a to c9d223e Compare July 27, 2023 22:49
@dpsimpson

It is I, Daniel, smasher of for loops, obfuscator of code, here to tell you that you can replace the for loop here (i.e. the one starting at line 108 of /examples/parsing.py) with the probably more efficient, if somewhat less clear
(assuming that next_vocab is a list of integers; otherwise add some code to make it so)

ids = torch.tensor(next_vocab, dtype=torch.int64).unsqueeze(0)  # same number of dims as mask; scatter_ requires int64 indices
mask = mask.scatter_(1, ids, torch.zeros_like(ids, dtype=mask.dtype))  # trailing _ is the in-place version

Is this a sensible change for an example? No. Will it be faster when next_vocab is big and you're on a CUDA device? It should be.
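A self-contained version of the masking idea in the snippet above (the shapes, `vocab_size`, and `next_vocab` values are illustrative, not the example script's actual code):

```python
import torch

# Mask everything out with -inf, then zero out the logits of allowed token
# ids in one in-place scatter_ instead of a Python for loop.
vocab_size = 10
next_vocab = [1, 4, 7]  # ids of tokens that are valid continuations

mask = torch.full((1, vocab_size), float("-inf"))
ids = torch.tensor(next_vocab, dtype=torch.int64).unsqueeze(0)  # same ndim as mask
mask.scatter_(1, ids, torch.zeros_like(ids, dtype=mask.dtype))  # src dtype must match mask
```

Note the two details that matter for correctness: `scatter_` requires `int64` indices, and its source tensor must have the same dtype as the destination (hence `dtype=mask.dtype`).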

@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 2 times, most recently from 1b00a49 to b6afe29 Compare August 15, 2023 21:37
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 4 times, most recently from b3af5df to 4eb84b4 Compare August 18, 2023 00:50
@samuela

samuela commented Aug 23, 2023

OOC, how do this work and the corresponding paper relate to "Parsing with Derivatives" from @mattmight et al.?

@brandonwillard
Member Author

OOC, how do this work and the corresponding paper relate to "Parsing with Derivatives" from @mattmight et al.?

We're aware of that work, but this PR does not use anything from it directly. The work being done here is almost exclusively concerned with the adaptation of existing LALR(1) parsers and the use of the "indices" described in our technical paper.

@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 3 times, most recently from e3042d0 to 4df0c7c Compare September 6, 2023 18:58
@brandonwillard brandonwillard marked this pull request as ready for review September 11, 2023 20:59
All the terminal symbol regexes in each parse-state-dependent lexer are
combined/unioned into a single FSM, and scanning is performed according to those
combined FSMs.  The function `fsm_union` was added for that purpose.

Since the parser needs to know exactly which terminal symbols were matched, we
now need the ability to determine exactly which sub-FSM (i.e. one of the
combined terminal symbol FSMs) accepted an input string.  The function
`get_sub_fsms_from_seq` serves this purpose.
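The purpose of identifying which sub-FSM accepted a string can be sketched as follows (an illustration only, not the actual `fsm_union`/`get_sub_fsms_from_seq` implementation; the terminal FSMs below are hypothetical):

```python
# Each terminal symbol has its own FSM: name -> (transitions, initial, finals).
terminal_fsms = {
    "NAME": ({(0, "a"): 1, (1, "a"): 1}, 0, {1}),  # accepts "a"+
    "WS":   ({(0, " "): 1, (1, " "): 1}, 0, {1}),  # accepts " "+
}


def accepting_terminals(s):
    """Return the set of terminal symbols whose FSM accepts the string `s`."""
    result = set()
    for name, (trans, init, finals) in terminal_fsms.items():
        state = init
        for ch in s:
            state = trans.get((state, ch))
            if state is None:
                break  # dead end: this terminal cannot match
        else:
            if state in finals:
                result.add(name)
    return result
```

The real implementation does this in a single pass over one unioned FSM (tagging each unioned state with the sub-FSM states it corresponds to) rather than simulating each terminal FSM separately, which is where the efficiency gain comes from.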
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 2 times, most recently from 9b095bb to f5a62ac Compare September 11, 2023 21:06
@brandonwillard
Member Author

I'm going to merge this after the tests pass and create follow-up issues for the remaining work mentioned in the description. Doing so will fix the flaky FSM tests.

@brandonwillard brandonwillard merged commit e6ff583 into dottxt-ai:main Sep 11, 2023
4 checks passed
@brandonwillard brandonwillard deleted the lex-state-merged-fsm-approach branch September 11, 2023 21:14
@Wehzie

Wehzie commented Sep 21, 2023

I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?

@brandonwillard
Member Author

I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?

Yes, and hopefully very soon! We have a development roadmap for integrating this into our Sequence framework and user interface and providing a performant sampling-based approach (to be followed by indexing and more). The next steps on that path are taking place in #272 right now.

@Wehzie

Wehzie commented Oct 18, 2023

Thank you for the update! Seeing as #272 is closed, what are the next steps to get DSLs running? Regarding Sequence and a user interface, I'm not finding information on that in the repo or your website. Do you have some more information about your plans there?

@rlouf
Member

rlouf commented Oct 19, 2023

Working on it!
