2D RoPE + CLIP updates #1973
Conversation
buttttt tests that run in CI?
This looks good to me, but should probably get @pbontrager 's eyes on this as well.
E.g. for ``patch_size=40``, a tile of shape (400, 400) will have 10x10 grid of patches.
tile_size (int): The size of your image tiles, if the image was tile-cropped in advance. Otherwise,
    the size of the input image. In this case, the function will consider your image as a single tile.
    with shape (40, 40) each.
a single tile with shape (400, 400)? if max_num_tiles=1, then tile size should be equal to patch_size?
ooh ya good catch, let me update this and the docstring where I copied this from
I think that "your image as a single tile. with shape (40, 40) each." ---> "your image as a single tile with shape (400, 400)."
I didn't read the code, just sharing what I know from Llama 3.2 vision. This explains it visually:
Image: shape (8x8)
An image is broken into tiles. E.g., if max_num_tiles=4 and tile_size=400, an 800x800 image is broken into 4 tiles of 400x400, i.e. (num images, num tiles, x, y) = (1, 4, 400, 400)
Then each tile is broken down into patches.
If max_num_tiles=1, then it means that 1 image = 1 tile. The tile dimension doesn't go away, e.g. (num images, num tiles, x, y) = (1, 1, x, y)
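To make the shapes concrete, here is a small shapes-only sketch. The variable names are illustrative, it assumes the image has already been resized to fit `max_num_tiles`, and it is not torchtune code:

```python
import torch

image = torch.randn(1, 3, 800, 800)  # one 800x800 RGB image
tile_size, patch_size = 400, 40

# Break the image into tiles: unfold height and width into 400x400 windows
tiles = image.unfold(2, tile_size, tile_size).unfold(3, tile_size, tile_size)
tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3, tile_size, tile_size)
print(tiles.shape)  # (1, 4, 3, 400, 400): (num images, num tiles, channels, x, y)

# Then each tile is broken into patches: 400 // 40 = 10 patches per side
patches_per_side = tile_size // patch_size
print(patches_per_side ** 2)  # 100 patches per tile
```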
Context
Two-dimensional rotary positional embeddings have been added to vision transformers to improve performance. This was explored in papers such as Eva-02-CLIP, which found that 2D RoPE improved performance and trained more stably than 1D RoPE for images. Another architecture, FiT (Flexible Vision Transformer for Diffusion Model), similarly employs 2D RoPE for image resolution generalization. Pixtral, a multimodal LLM, uses a similar 2D RoPE mechanism, as seen in Hugging Face. A full survey of 2D RoPE can be found in this paper.
Here, we add `VisionRotaryPositionalEmbeddings` as a general component. The forward is identical to `RotaryPositionalEmbeddings`; the major difference is in `build_rope_cache`, where we need to take into account the x and y positions of the patches in the image grid, as defined by `patch_size` and `tile_size`. It simply applies 1D RoPE with half the usual dim on x and y independently and concatenates them together.
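To make the mechanism concrete, here is a minimal sketch of how such a cache could be built. The function name `build_2d_rope_cache` and the exact cache layout are assumptions for illustration, not the torchtune API:

```python
import torch

def build_2d_rope_cache(
    patch_size: int, tile_size: int, dim: int, base: int = 10_000
) -> torch.Tensor:
    """Sketch: 1D RoPE frequencies with dim/2 per axis, concatenated."""
    grid = tile_size // patch_size                 # patches per side, e.g. 400 // 40 = 10
    half = dim // 2                                # per-axis rotary dim
    theta = 1.0 / (base ** (torch.arange(0, half, 2).float() / half))
    pos = torch.arange(grid)
    x_pos = pos.repeat(grid).float()               # x index of each patch in the flattened grid
    y_pos = pos.repeat_interleave(grid).float()    # y index of each patch
    x_freqs = torch.outer(x_pos, theta)            # (grid*grid, half/2)
    y_freqs = torch.outer(y_pos, theta)            # (grid*grid, half/2)
    freqs = torch.cat([x_freqs, y_freqs], dim=-1)  # (grid*grid, dim/2)
    # one (cos, sin) pair per rotated 2-dim subspace, as in a standard RoPE cache
    return torch.stack([freqs.cos(), freqs.sin()], dim=-1)
```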
This is exposed as `use_rope` in the `clip_vision_encoder` builder, which will enable 2D RoPE. I also include various fixes for CLIP and additional configurability.

Changelog
What are the changes made in this PR?

- Add `VisionRotaryPositionalEmbeddings` with tests
- `intermediate_act` is not used in the `clip_vision_encoder`. This is superseded by `activation`, so I just removed it
- `apply_lora_to_output` is not used in `lora_clip_vision_encoder`. There is a TODO to add this to the `CLSProjection`. I removed it so users don't accidentally configure it and get unexpected behavior
- `CLSEmbedding` adds the CLS token to the beginning of the input sequence. I expose this as a configurable parameter to allow the option of adding it to the end of the sequence instead, which some variations of CLIP may do (see the sketch after this list).
- `attn_bias` was not being propagated in `lora_clip_vision_encoder`, so I fixed this and changed the default to False to match current behavior.
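To illustrate the `CLSEmbedding` change, here is a minimal sketch of what prepend-vs-append could look like. The class `CLSEmbeddingSketch` is an illustrative assumption, not the torchtune implementation:

```python
import torch
import torch.nn as nn

class CLSEmbeddingSketch(nn.Module):
    """Adds a learned CLS token to the start or end of the patch sequence."""

    def __init__(self, embed_dim: int, append_cls_token: bool = False):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.append_cls_token = append_cls_token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        cls = self.cls.expand(x.shape[0], -1, -1)
        if self.append_cls_token:
            return torch.cat([x, cls], dim=1)  # CLS at the end of the sequence
        return torch.cat([cls, x], dim=1)      # default: CLS at the beginning
```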
Test plan

- Ran `clip_vision_encoder` with `append_cls_token=True` and manually verified that the CLS token embedding is in the correct position (see the sketch below); added a test
- Added tests for `VisionRotaryPositionalEmbeddings` and a parity check against an internal reference implementation
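As a rough illustration of the first check, reusing the hypothetical `CLSEmbeddingSketch` from the sketch above (not the actual test code):

```python
import torch

# With append_cls_token=True the CLS embedding should be the last token
emb = CLSEmbeddingSketch(embed_dim=16, append_cls_token=True)
x = torch.zeros(2, 9, 16)        # (batch, seq_len, embed_dim)
out = emb(x)
assert out.shape == (2, 10, 16)  # one extra token appended
assert torch.allclose(out[:, -1], emb.cls.squeeze(0).expand(2, -1))
```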