2D RoPE + CLIP updates #1973
Conversation
buttttt tests that run in CI?
This looks good to me, but should probably get @pbontrager 's eyes on this as well.
E.g. for ``patch_size=40``, a tile of shape (400, 400) will have 10x10 grid of patches.
tile_size (int): The size of your image tiles, if the image was tile-cropped in advance. Otherwise,
    the size of the input image. In this case, the function will consider your image as a single tile.
    with shape (40, 40) each.
a single tile with shape (400, 400)? if max_num_tiles=1, then tile size should be equal to patch_size?
ooh ya good catch, let me update this and the docstring where I copied this from
I think that "your image as a single tile. with shape (40, 40) each." ---> "your image as a single tile with shape (400, 400)."
I didn't read the code, just sharing what I know from Llama 3.2 vision. This explains it visually:
Image: shape (8x8)
An image is broken into tiles. E.g., if max_num_tiles=4 and tile_size=400, an 800x800 image is broken into 4 tiles of 400x400, i.e. (num images, num tiles, x, y) = (1, 4, 400, 400)
Then each tile is broken down into patches.
If max_num_tiles=1, then it means that 1 image = 1 tile. The tile dimension doesn't go away, e.g. (num images, num tiles, x, y) = (1, 1, x, y)
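To make the shapes concrete, here is a small shapes-only sketch. The variable names are illustrative, it assumes the image has already been resized to fit `max_num_tiles`, and it is not torchtune code:

```python
import torch

image = torch.randn(1, 3, 800, 800)  # one 800x800 RGB image
tile_size, patch_size = 400, 40

# Break the image into tiles: unfold height and width into 400x400 windows
tiles = image.unfold(2, tile_size, tile_size).unfold(3, tile_size, tile_size)
tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3, tile_size, tile_size)
print(tiles.shape)  # (1, 4, 3, 400, 400): (num images, num tiles, channels, x, y)

# Then each tile is broken into patches: 400 // 40 = 10 patches per side
patches_per_side = tile_size // patch_size
print(patches_per_side ** 2)  # 100 patches per tile
```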
Context
Two-dimensional rotary positional embeddings have been added to vision transformers to improve performance. This was explored in papers such as Eva-02-CLIP, which found that 2D RoPE improved performance and trained more stably than 1D RoPE for images. Another architecture, FiT (Flexible Vision Transformer for Diffusion Model), similarly employs 2D RoPE for image resolution generalization. Pixtral, a multimodal LLM, uses a similar 2D RoPE mechanism, as seen in Hugging Face. A full survey of 2D RoPE can be found in this paper.
Here, we add `VisionRotaryPositionalEmbeddings` as a general component. The forward is identical to `RotaryPositionalEmbeddings`; the major difference is in `build_rope_cache`, where we need to take into account the x and y positions of the patches in the image grid, as defined by `patch_size` and `tile_size`. It simply applies 1D RoPE with half the usual dim on x and y independently and concatenates them together.
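To make the mechanism concrete, here is a minimal sketch of how such a cache could be built. The function name `build_2d_rope_cache` and the exact cache layout are assumptions for illustration, not the torchtune API:

```python
import torch

def build_2d_rope_cache(
    patch_size: int, tile_size: int, dim: int, base: int = 10_000
) -> torch.Tensor:
    """Sketch: 1D RoPE frequencies with dim/2 per axis, concatenated."""
    grid = tile_size // patch_size                 # patches per side, e.g. 400 // 40 = 10
    half = dim // 2                                # per-axis rotary dim
    theta = 1.0 / (base ** (torch.arange(0, half, 2).float() / half))
    pos = torch.arange(grid)
    x_pos = pos.repeat(grid).float()               # x index of each patch in the flattened grid
    y_pos = pos.repeat_interleave(grid).float()    # y index of each patch
    x_freqs = torch.outer(x_pos, theta)            # (grid*grid, half/2)
    y_freqs = torch.outer(y_pos, theta)            # (grid*grid, half/2)
    freqs = torch.cat([x_freqs, y_freqs], dim=-1)  # (grid*grid, dim/2)
    # one (cos, sin) pair per rotated 2-dim subspace, as in a standard RoPE cache
    return torch.stack([freqs.cos(), freqs.sin()], dim=-1)
```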
This is exposed as `use_rope` in the `clip_vision_encoder` builder, which will enable 2D RoPE. I also include various fixes for CLIP and additional configurability.

Changelog
What are the changes made in this PR?

- Add `VisionRotaryPositionalEmbeddings` with tests
- `intermediate_act` is not used in the `clip_vision_encoder`. This is superseded by `activation`, so I just removed it
- `apply_lora_to_output` is not used in `lora_clip_vision_encoder`. There is a TODO to add this to the `CLSProjection`. I removed it so users don't accidentally configure it and get unexpected behavior
- `CLSEmbedding` adds the CLS token to the beginning of the input sequence. I expose this as a configurable parameter to allow the option of adding it to the end of the sequence instead, which some variations of CLIP may do (see the sketch after this list).
- `attn_bias` was not being propagated in `lora_clip_vision_encoder`, so I fixed this and changed the default to False to match current behavior.
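To illustrate the `CLSEmbedding` change, here is a minimal sketch of what prepend-vs-append could look like. The class `CLSEmbeddingSketch` is an illustrative assumption, not the torchtune implementation:

```python
import torch
import torch.nn as nn

class CLSEmbeddingSketch(nn.Module):
    """Adds a learned CLS token to the start or end of the patch sequence."""

    def __init__(self, embed_dim: int, append_cls_token: bool = False):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.append_cls_token = append_cls_token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        cls = self.cls.expand(x.shape[0], -1, -1)
        if self.append_cls_token:
            return torch.cat([x, cls], dim=1)  # CLS at the end of the sequence
        return torch.cat([cls, x], dim=1)      # default: CLS at the beginning
```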
Test plan

- Ran `clip_vision_encoder` with `append_cls_token=True` and manually verified that the CLS token embedding is in the correct position (see the sketch below); added a test
- Added tests for `VisionRotaryPositionalEmbeddings` and a parity check against an internal reference implementation
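As a rough illustration of the first check, reusing the hypothetical `CLSEmbeddingSketch` from the sketch above (not the actual test code):

```python
import torch

# With append_cls_token=True the CLS embedding should be the last token
emb = CLSEmbeddingSketch(embed_dim=16, append_cls_token=True)
x = torch.zeros(2, 9, 16)        # (batch, seq_len, embed_dim)
out = emb(x)
assert out.shape == (2, 10, 16)  # one extra token appended
assert torch.allclose(out[:, -1], emb.cls.squeeze(0).expand(2, -1))
```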