Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

only allowing valid tokenizations in grammars #1

Open
3 tasks
mmoskal opened this issue Jul 26, 2024 · 0 comments
Open
3 tasks

only allowing valid tokenizations in grammars #1

mmoskal opened this issue Jul 26, 2024 · 0 comments
Assignees

Comments

@mmoskal
Copy link
Collaborator

mmoskal commented Jul 26, 2024

See https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html and https://github.com/vivien000/regex-constrained-decoding/blob/main/technical_appendix.pdf

Thoughts (unorganized):

  • there is an automaton that recognizes all valid sequences of tokens, and moreover this automaton only depends on the previous token; that is, there is a function from tokens that tells you the set of valid next tokens
  • by itself it's probably not that useful, since the automaton doesn't exclude enough states, for example if the regex is for a JSON string, and the previous token is foo then the set of allowed next tokens includes " and ",; however after sampling " it will not allow , anymore (so in other words we would have to look ahead in this automaton)
  • implement .forced_byte() method in derivre
  • use this for cheap .forced_byte() impl in llguidance
  • while walking token trie, remember all forced paths (there shouldn't be too many of them)

The tokens we most need to discard will be along the forced path, for example after " the , is forced. Note that if the grammar allows white space between " and ,, there is no forced paths and moreover the token " should be still allowed (unless there are tokens " , "\n, "\t etc covering all of white space; but I think this is very unlikely).

In toktrie walk, if we encounter a forced byte, we go into forced mode where we just chase all forced bytes. The first token we find on this path we put on some list. We do not add any of these tokens to the allow set.

Then, after token trie walk, for every token on this list we re-create the forced byte string, tokenize, chop excessive tokens, and add the first token from tokenization to allow set and remaining tokens (if any) as conditional
splice.

Transferred from hudson-ai/guidance#13

@mmoskal mmoskal self-assigned this Jul 26, 2024
mmoskal added a commit that referenced this issue Nov 30, 2024
…penssl-0.10.66

Bump openssl from 0.10.64 to 0.10.66 in /hf_tokenizers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant