
How does "Accelerated Guidance" work? #138

Answered by slundberg
RamiAwar asked this question in Q&A

GPT-style LLMs are all auto-regressive and process tokens in two modes:

  1. they suck in chunks of prompt tokens to fill a KV-cache that represents the prompt to the model.
  2. they generate a single new token (and add one position to the KV-cache). This is done by computing a probability vector over the vocabulary, then sampling from it (see the sketch after this list).
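
A minimal sketch of the two modes, using Hugging Face transformers purely as an illustration (the model name and the greedy argmax choice are stand-ins, not anything specific to guidance):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids

# Mode 1: prefill -- a whole chunk of prompt tokens goes through in one
# batched forward pass, filling the KV-cache.
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
past = out.past_key_values

# Mode 2: generation -- one new token per forward pass, each adding a single
# position to the KV-cache. Softmax of the last-position logits is the
# probability vector over the vocabulary; here we just take the argmax
# instead of sampling to keep the sketch short.
next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
new_ids = [next_id]
for _ in range(5):
    with torch.no_grad():
        out = model(next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    new_ids.append(next_id)

print(tok.decode(torch.cat(new_ids, dim=-1)[0]))
```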

Guidance can "accelerate" inference because "prompt tokens" are way cheaper/faster than "generation tokens" (due to lots of factors like GPU batching). Because a guidance program specifies much of the structure of the output, we can convert many of the output tokens into batches of prompt-like tokens that are cheaper. We also can use the struc…
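
As a hedged sketch of that acceleration idea (this is not guidance's actual implementation, just the shape of it): any span whose tokens the program already fixes, e.g. template text or JSON keys, can be pushed through as a prompt-like batch, while per-token generation only happens in the genuinely free spans. The `run_template` helper and segment format below are hypothetical, invented for illustration:

```python
import torch

def run_template(model, tok, segments, max_gen_per_gap=8):
    """segments: list of ("forced", text) or ("gen", None) items; the first
    segment is assumed to be "forced". Hypothetical helper, illustration only."""
    past, out_ids, next_id = None, [], None
    for kind, text in segments:
        if kind == "forced":
            # Structure the program already fixes: feed the whole chunk as a
            # batch of cheap prompt-like tokens in a single forward pass.
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, past_key_values=past, use_cache=True)
            out_ids.extend(ids[0].tolist())
        else:
            # Genuinely free text: fall back to one-token-at-a-time decoding.
            for _ in range(max_gen_per_gap):
                with torch.no_grad():
                    out = model(next_id, past_key_values=past, use_cache=True)
                past = out.past_key_values
                out_ids.append(next_id[0, 0].item())
                next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return tok.decode(out_ids)

# e.g. run_template(model, tok, [("forced", "Name: "), ("gen", None),
#                                ("forced", "\nAge: "), ("gen", None)])
```

The speed-up then comes from the forced spans hitting the cheap, batched prefill path (mode 1 above) instead of being decoded token by token.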

Answer selected by RamiAwar