Use FSMs for scanning during grammar-guided generation #178

brandonwillard · 2023-07-10T03:35:59Z

This PR is the first step to closing #170 using the approach described in our paper "Efficient Guided Generation for Large Language Models".

In other words, the changes in this PR allow us to create "index" dicts for full grammars. By "index" we mean a dict that maps partial parse states to subsets of a vocabulary that are valid continuations of the parse state.
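
To make the idea of an "index" concrete, here is a minimal sketch for a single regex-level DFA rather than a full grammar. The toy DFA, the vocabulary, and the names `walk` and `build_index` are hypothetical illustrations and are not taken from this PR; a real index would be keyed on partial parse states rather than on the states of one small DFA.

```python
from typing import Dict, Optional, Set, Tuple

# Toy DFA for the regex [0-9]+ : state 0 is initial, state 1 is accepting.
# Hypothetical example; not the FSM representation used in this PR.
TRANSITIONS: Dict[Tuple[int, str], int] = {}
for digit in "0123456789":
    TRANSITIONS[(0, digit)] = 1
    TRANSITIONS[(1, digit)] = 1
STATES = {0, 1}


def walk(state: int, token: str) -> Optional[int]:
    """Return the state reached after reading `token`, or None if the DFA rejects."""
    for char in token:
        key = (state, char)
        if key not in TRANSITIONS:
            return None
        state = TRANSITIONS[key]
    return state


def build_index(vocabulary: Dict[str, int]) -> Dict[int, Set[int]]:
    """Map each DFA state to the token ids that are valid continuations from it."""
    index: Dict[int, Set[int]] = {state: set() for state in STATES}
    for state in STATES:
        for token, token_id in vocabulary.items():
            if token and walk(state, token) is not None:
                index[state].add(token_id)
    return index


# Hypothetical vocabulary: token string -> token id.
vocabulary = {"1": 0, "23": 1, "a": 2, "4b": 3, " ": 4}
index = build_index(vocabulary)
print(index)
```

Once such a dict exists, finding the allowed tokens during generation is a lookup like `index[state]` instead of a scan over the whole vocabulary.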

NOTE: The indexing work is being split off into its own PR.

TODO:

- Use a more efficient DFA implementation.
- Use DFA minimization on the parse state FSMs.
  - This will remove a large number of unnecessary states and decrease the overall cost of constructing indices.
- We need to do something about ignored tokens in `terminal_symbol_scan`.
  - They are currently being completely ignored, which incorrectly excludes vocabulary strings consisting of only ignored tokens (e.g. " " in a Python grammar).
- Convert the terminal symbol FSM states used to compute the indices to their corresponding parser state (aggregate/unioned) FSM states.
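
As a rough illustration of the DFA minimization item, the sketch below merges equivalent states with Moore's partition refinement. It is only one possible approach, not necessarily the one this project will use, and the function name `equivalence_classes` and the toy DFA are hypothetical. It assumes a complete DFA, i.e. every (state, symbol) pair has a transition.

```python
from typing import Dict, Sequence, Set, Tuple


def equivalence_classes(
    states: Set[int],
    alphabet: Sequence[str],
    transitions: Dict[Tuple[int, str], int],
    finals: Set[int],
) -> Dict[int, int]:
    """Map each state to the id of its equivalence class (Moore's algorithm)."""
    # Initial partition: accepting vs. non-accepting states.
    block_of = {state: int(state in finals) for state in states}
    while True:
        # A state's signature is its current block plus the blocks it
        # reaches on each symbol; states with equal signatures stay together.
        signatures = {
            state: (block_of[state],)
            + tuple(block_of[transitions[(state, symbol)]] for symbol in alphabet)
            for state in states
        }
        ids: Dict[Tuple[int, ...], int] = {}
        new_block_of = {}
        for state in sorted(states):
            new_block_of[state] = ids.setdefault(signatures[state], len(ids))
        # Each round can only split blocks, so an unchanged count means
        # the partition is stable.
        if len(ids) == len(set(block_of.values())):
            return new_block_of
        block_of = new_block_of


# Toy DFA over {a, b} accepting strings that end in "a".
# States 0 and 2 behave identically, so they should be merged.
toy_transitions = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 2,
}
classes = equivalence_classes({0, 1, 2}, ["a", "b"], toy_transitions, {1})
print(classes)
```

States that share a class id can be collapsed into one state, which is where the reduction in index construction cost would come from.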

rlouf · 2023-07-26T14:35:41Z

outlines/text/parsing.py

@@ -332,17 +368,17 @@ def _partial_match(
        if not terminated and state == fsm.initial:
            return None, None

-        return None if not terminated else i, accepted_states


I'm surprised I did not see any bug related to this
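
For readers puzzling over the removed line: in Python a conditional expression binds more tightly than the comma, so that statement always returns a 2-tuple. The sketch below only demonstrates this precedence rule with hypothetical stand-in functions; which return value the PR intended is not stated in this thread.

```python
def removed_line(terminated, i, accepted_states):
    # Same shape as the removed return statement.
    return None if not terminated else i, accepted_states


def explicit_grouping(terminated, i, accepted_states):
    # The same expression with Python's implicit grouping written out.
    return (None if not terminated else i), accepted_states


result = removed_line(False, 3, (0, 1))
print(result)
```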

dpsimpson · 2023-08-09T04:13:22Z

It is I, Daniel, smasher of for loops, obfuscator of code, here to tell you that you can replace the for loop here (aka the one starting at line 108 of /examples/parsing.py) with the following, which is probably more efficient, if somewhat less clear
(assuming that next_vocab is a list of integers, otherwise add some code to make it so)

ids = torch.tensor(next_vocab, dtype=torch.long, device=mask.device).unsqueeze(0)  # scatter_ needs int64 indices; same number of dims as mask
mask = mask.scatter_(1, ids, torch.zeros_like(ids, dtype=mask.dtype))  # trailing _ is the in-place version

Is this a sensible change for an example? No. Will it be faster when next_vocab is big and you're on a cuda device? It should be.

samuela · 2023-08-23T21:21:17Z

OOC how does this work and the corresponding paper relate to the "Parsing with Derivatives" from @mattmight et al?

brandonwillard · 2023-08-23T21:37:53Z

> OOC how does this work and the corresponding paper relate to the "Parsing with Derivatives" from @mattmight et al?

We're aware of that work, but this PR does not use anything from it directly. The work being done here is almost exclusively concerned with the adaptation of existing LALR(1) parsers and the use of the "indices" described in our technical paper.

All the terminal symbol regexes in each parse-state-dependent lexer are combined/unioned into a single FSM, and scanning is performed according to those combined FSMs. The function `fsm_union` was added for that purpose. Since the parser needs to know exactly which terminal symbols were matched, we now need the ability to determine exactly which sub-FSM (i.e. one of the combined terminal symbol FSMs) accepted an input string. The function `get_sub_fsms_from_seq` serves this purpose.
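
The idea of scanning with a unioned FSM while still recovering which terminal matched can be sketched as follows. This is a toy product-style construction over hand-written DFAs; the helper names (`union_step`, `accepting_terminals`, and so on) and the two terminals are hypothetical and do not reflect the actual signatures or internals of `fsm_union` or `get_sub_fsms_from_seq`.

```python
from typing import Dict, List, Optional, Set, Tuple

# Each toy DFA: (transitions, initial state, accepting states).
# These are hypothetical stand-ins for terminal-symbol FSMs.
DFA = Tuple[Dict[Tuple[int, str], int], int, Set[int]]


def make_keyword_dfa(word: str) -> DFA:
    """DFA that accepts exactly `word`."""
    transitions = {(i, char): i + 1 for i, char in enumerate(word)}
    return transitions, 0, {len(word)}


def make_name_dfa() -> DFA:
    """DFA for one or more lowercase ASCII letters."""
    transitions = {}
    for char in "abcdefghijklmnopqrstuvwxyz":
        transitions[(0, char)] = 1
        transitions[(1, char)] = 1
    return transitions, 0, {1}


TERMINALS: Dict[str, DFA] = {"IF": make_keyword_dfa("if"), "NAME": make_name_dfa()}

# A state of the union FSM is a tuple with one entry per terminal:
# that terminal's current state, or None once it has rejected.
UnionState = Tuple[Optional[int], ...]


def union_initial() -> UnionState:
    return tuple(initial for _, initial, _ in TERMINALS.values())


def union_step(state: UnionState, char: str) -> UnionState:
    """Advance every live sub-DFA by one character."""
    return tuple(
        None if sub_state is None else transitions.get((sub_state, char))
        for sub_state, (transitions, _, _) in zip(state, TERMINALS.values())
    )


def accepting_terminals(text: str) -> List[str]:
    """Scan `text` with the union FSM and list the terminals that accept it."""
    state = union_initial()
    for char in text:
        state = union_step(state, char)
    return [
        name
        for sub_state, (name, (_, _, finals)) in zip(state, TERMINALS.items())
        if sub_state in finals
    ]


print(accepting_terminals("if"))
print(accepting_terminals("iffy"))
print(accepting_terminals("42"))
```

Because each union state remembers the state of every sub-FSM, a single pass over the input is enough to tell that "if" is accepted by both terminals while "iffy" is accepted only by the name terminal.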

brandonwillard · 2023-09-11T21:10:19Z

I'm going to merge this after the tests pass and create follow-up issues for the remaining work mentioned in the description. Doing so will fix the flaky FSM tests.

Wehzie · 2023-09-21T11:25:55Z

I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?

brandonwillard · 2023-09-21T16:14:09Z

> I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?

Yes, and hopefully very soon! We have a development roadmap for integrating this into our Sequence framework and user interface and providing a performant sampling-based approach (to be followed by indexing and more). The next steps on that path are taking place in #272 right now.

Wehzie · 2023-10-18T13:30:08Z

Thank you for the update! Seeing as #272 is closed, what are the next steps to get DSLs running? Regarding Sequence and a user interface, I'm not finding information on that in the repo or your website. Do you have some more information about your plans there?

rlouf · 2023-10-19T05:59:09Z

Working on it!

brandonwillard assigned brandonwillard and rlouf Jul 10, 2023

brandonwillard added the enhancement and text labels Jul 10, 2023

brandonwillard marked this pull request as draft July 10, 2023 03:36

brandonwillard force-pushed the lex-state-merged-fsm-approach branch 4 times, most recently from 714cb7b to 955de6d Compare July 13, 2023 20:25

rlouf added the structured generation label Jul 17, 2023

brandonwillard force-pushed the lex-state-merged-fsm-approach branch 11 times, most recently from 99b8aaf to 9eaff91 Compare July 24, 2023 19:51

rlouf reviewed Jul 26, 2023

brandonwillard force-pushed the lex-state-merged-fsm-approach branch 8 times, most recently from b65b38a to c9d223e Compare July 27, 2023 22:49

brandonwillard force-pushed the lex-state-merged-fsm-approach branch 2 times, most recently from 1b00a49 to b6afe29 Compare August 15, 2023 21:37

rlouf mentioned this pull request Aug 16, 2023

Change default generate behavior from fixed seed to random seed #228

Merged

brandonwillard force-pushed the lex-state-merged-fsm-approach branch 4 times, most recently from b3af5df to 4eb84b4 Compare August 18, 2023 00:50

brandonwillard force-pushed the lex-state-merged-fsm-approach branch 3 times, most recently from e3042d0 to 4df0c7c Compare September 6, 2023 18:58

brandonwillard mentioned this pull request Sep 6, 2023

Introduce Numba-based FSM utilities #272

Merged

7 tasks

brandonwillard force-pushed the lex-state-merged-fsm-approach branch from 4df0c7c to 2ef7d68 Compare September 11, 2023 20:58

brandonwillard marked this pull request as ready for review September 11, 2023 20:59

brandonwillard added 4 commits September 11, 2023 16:01

Use custom Lark objects

f4b29fd

Use deterministic FSM state labels

7be6220

Make parse tree/value computations optional

f5a62ac

brandonwillard force-pushed the lex-state-merged-fsm-approach branch 2 times, most recently from 9b095bb to f5a62ac Compare September 11, 2023 21:06

brandonwillard merged commit e6ff583 into dottxt-ai:main Sep 11, 2023
4 checks passed

brandonwillard deleted the lex-state-merged-fsm-approach branch September 11, 2023 21:14

brandonwillard mentioned this pull request Sep 11, 2023

Use DFA minimization on the parse state FSMs #278

Open
