Skip to content

Commit

Permalink
Fix broken imgs link on doc pages
Browse files Browse the repository at this point in the history
  • Loading branch information
mishig25 committed Apr 11, 2024
1 parent 48d302f commit 8a5673b
Show file tree
Hide file tree
Showing 2 changed files with 4 additions and 4 deletions.
4 changes: 2 additions & 2 deletions unit2/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,15 +28,15 @@ Fine-tuning typically works best if the new data somewhat resembles the base mod

Unconditional models don't give much control over what is generated. We can train a conditional model (more on that in the next section) that takes additional inputs to help steer the generation process, but what if we already have a trained unconditional model we'd like to use? Enter guidance, a process by which the model predictions at each step in the generation process are evaluated against some guidance function and modified such that the final generated image is more to our liking.

![guidance example image](guidance_eg.png)
![guidance example image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/guidance_eg.png)

This guidance function can be almost anything, making this a powerful technique! In the notebook, we build up from a simple example (controlling the color, as illustrated in the example output above) to one utilizing a powerful pre-trained model called CLIP which lets us guide generation based on a text description.

## Conditioning

Guidance is a great way to get some additional mileage from an unconditional diffusion model, but if we have additional information (such as a class label or an image caption) available during training then we can also feed this to the model for it to use as it makes its predictions. In doing so, we create a **conditional** model, which we can control at inference time by controlling what is fed in as conditioning. The notebook shows an example of a class-conditioned model which learns to generate images according to a class label.

![conditioning example](conditional_digit_generation.png)
![conditioning example](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/conditional_digit_generation.png)

There are a number of ways to pass in this conditioning information, such as
- Feeding it in as additional channels in the input to the UNet. This is often used when the conditioning information is the same shape as the image, such as a segmentation mask, a depth map or a blurry version of the image (in the case of a restoration/superresolution model). It does work for other types of conditioning too. For example, in the notebook, the class label is mapped to an embedding and then expanded to be the same width and height as the input image so that it can be fed in as additional channels.
Expand Down
4 changes: 2 additions & 2 deletions unit3/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,12 +37,12 @@ By applying the diffusion process on these **latent representations** rather tha

In Unit 2 we showed how feeding additional information to the UNet allows us to have some additional control over the types of images generated. We call this conditioning. Given a noisy version of an image, the model is tasked with predicting the denoised version **based on additional clues** such as a class label or, in the case of Stable Diffusion, a text description of the image. At inference time, we can feed in the description of an image we'd like to see and some pure noise as a starting point, and the model does its best to 'denoise' the random input into something that matches the caption.

![text encoder diagram](text_encoder_noborder.png)<br>
![text encoder diagram](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/text_encoder_noborder.png)<br>
_Diagram showing the text encoding process which transforms the input prompt into a set of text embeddings (the encoder_hidden_states) which can then be fed in as conditioning to the UNet._

For this to work, we need to create a numeric representation of the text that captures relevant information about what it describes. To do this, SD leverages a pre-trained transformer model based on something called CLIP. CLIP's text encoder was designed to process image captions into a form that could be used to compare images and text, so it is well suited to the task of creating useful representations from image descriptions. An input prompt is first tokenized (based on a large vocabulary where each word or sub-word is assigned a specific token) and then fed through the CLIP text encoder, producing a 768-dimensional (in the case of SD 1.X) or 1024-dimensional (SD 2.X) vector for each token. To keep things consistent prompts are always padded/truncated to be 77 tokens long, and so the final representation which we use as conditioning is a tensor of shape 77x1024 per prompt.

![conditioning diagram](sd_unet_color.png)
![conditioning diagram](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/sd_unet_color.png)

OK, so how do we actually feed this conditioning information into the UNet for it to use as it makes predictions? The answer is something called cross-attention. Scattered throughout the UNet are cross-attention layers. Each spatial location in the UNet can 'attend' to different tokens in the text conditioning, bringing in relevant information from the prompt. The diagram above shows how this text conditioning (as well as timestep-based conditioning) is fed in at different points. As you can see, at every level the UNet has ample opportunity to make use of this conditioning!

Expand Down

0 comments on commit 8a5673b

Please sign in to comment.