
How does "Accelerated Guidance" work? #138

Answered by slundberg
RamiAwar asked this question in Q&A

GPT-style LLMs are all auto-regressive and process tokens in two modes:

  1. they suck in chunks of prompt tokens to fill a KV-cache that represents the prompt to the model.
  2. they generate a single new token (and add one position to the KV-cache). This is done by computing a probability vector over the vocabulary, then sampling from it (see the sketch after this list).
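
A minimal sketch of the two modes, using Hugging Face transformers purely as an illustration (the model name and the greedy argmax choice are stand-ins, not anything specific to guidance):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids

# Mode 1: prefill -- a whole chunk of prompt tokens goes through in one
# batched forward pass, filling the KV-cache.
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
past = out.past_key_values

# Mode 2: generation -- one new token per forward pass, each adding a single
# position to the KV-cache. Softmax of the last-position logits is the
# probability vector over the vocabulary; here we just take the argmax
# instead of sampling to keep the sketch short.
next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
new_ids = [next_id]
for _ in range(5):
    with torch.no_grad():
        out = model(next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    new_ids.append(next_id)

print(tok.decode(torch.cat(new_ids, dim=-1)[0]))
```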

Guidance can "accelerate" inference because "prompt tokens" are way cheaper/faster than "generation tokens" (due to lots of factors like GPU batching). Because a guidance program specifies much of the structure of the output, we can convert many of the output tokens into batches of prompt-like tokens that are cheaper. We also can use the struc…
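
As a hedged sketch of that acceleration idea (this is not guidance's actual implementation, just the shape of it): any span whose tokens the program already fixes, e.g. template text or JSON keys, can be pushed through as a prompt-like batch, while per-token generation only happens in the genuinely free spans. The `run_template` helper and segment format below are hypothetical, invented for illustration:

```python
import torch

def run_template(model, tok, segments, max_gen_per_gap=8):
    """segments: list of ("forced", text) or ("gen", None) items; the first
    segment is assumed to be "forced". Hypothetical helper, illustration only."""
    past, out_ids, next_id = None, [], None
    for kind, text in segments:
        if kind == "forced":
            # Structure the program already fixes: feed the whole chunk as a
            # batch of cheap prompt-like tokens in a single forward pass.
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, past_key_values=past, use_cache=True)
            out_ids.extend(ids[0].tolist())
        else:
            # Genuinely free text: fall back to one-token-at-a-time decoding.
            for _ in range(max_gen_per_gap):
                with torch.no_grad():
                    out = model(next_id, past_key_values=past, use_cache=True)
                past = out.past_key_values
                out_ids.append(next_id[0, 0].item())
                next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return tok.decode(out_ids)

# e.g. run_template(model, tok, [("forced", "Name: "), ("gen", None),
#                                ("forced", "\nAge: "), ("gen", None)])
```

The speed-up then comes from the forced spans hitting the cheap, batched prefill path (mode 1 above) instead of being decoded token by token.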

Answer selected by RamiAwar