Added Support for Rotary Positional Embeddings (both non-fused and fused kernel) #99
base: main_perf
Conversation
# @pytest.mark.parametrize("rotary_fraction", [0.0]) | ||
# @pytest.mark.parametrize("rotary_interleaved", [True]) | ||
# @pytest.mark.parametrize("rotary_fraction", [0.0, 0.5, 1.0]) | ||
@pytest.mark.parametrize("rotary_fraction", [0.5, 1.0]) |
Have this match the original tests as much as possible.
@@ -1921,7 +1922,7 @@ def test_flash_attn_kvcache(
    device = "cuda"
    # set seed
    torch.random.manual_seed(0)
    batch_size = 2
    batch_size = 4
ditto
Good catch. Oddly enough `batch_size = 4` causes the tests to segfault randomly; `batch_size = 2` however passes just fine. Might be something to explore?
Rotary_interleaved = rotary_interleaved,
Rotary_conjugate = rotary_conjugate,
IS_SEQLEN_OFFSETS_TENSOR = isinstance(cache_seqlens, torch.Tensor),
IS_VARLEN = False,
Why is `IS_VARLEN` manually set to False?
Because with rotary you have the option to use varlen, but decode only uses batched. We don't have a varlen parameter to pass in from the decode tests.
COS,  # tensor of shape (seqlen (m), ro_dim // 2)
SIN,  # tensor of shape (seqlen (m), ro_dim // 2)
SEQLEN_OFFSET,  # we use this as an offset into COS and SIN to apply the correct rotation
SEQLEN_OFFSET_IS_TENSOR: tl.constexpr,  # if seqlen_offset is a tensor it has shape (num_batch, )
Why do we have two versions of SEQLEN_OFFSET? It seems it can be either an int or a tensor, which we then do a load on.
ro_dim_half = rotary_dim // 2  # length of cos/sin

if SEQLEN_OFFSET_IS_TENSOR:
Why do we have two versions?
# Misc
INTERLEAVED: tl.constexpr,
CONJUGATE: tl.constexpr,
TRANSPOSE: tl.constexpr,
I don't think we should do the transpose inside the rotary function. It probably makes sense for the caller to do that if it makes sense for them.
Motivation
Original Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding
Rotary Positional Embeddings (RoPEs) are a common positional embedding type used in many transformer models today.
RoPEs work by applying a unique rotation transformation to the vectors that represent each token within our q and k tensors, based on each token's respective position $$m$$ in the sequence.
To compute attention, we must first compute $$\text{matmul}(Q, K^T)$$, which effectively takes the dot product between the vector embeddings of tokens in $$Q$$ and $$K^T$$. Given two tokens at positions $$i$$ and $$j$$, the closer $$i$$ and $$j$$ are to each other, the more similarly their vector embeddings are rotated, and the dot product between the two token embedding vectors is largely unchanged. However, the further apart the two tokens are, the more the transformations applied to their vector embeddings diverge, which causes the dot product to decay. As the dot product decays, so does the attention weight between the two tokens, so the model effectively learns that, for a given token, nearby tokens should be paid more attention than tokens much further away.
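To make the relative-position property concrete, here is the standard identity from the RoFormer paper, written out for a single 2-dimensional pair (one plane of the embedding, rotating at frequency $$\theta$$):

$$q_m = R(m\theta)\,q, \qquad k_n = R(n\theta)\,k, \qquad R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$$

$$q_m^\top k_n = q^\top R(m\theta)^\top R(n\theta)\,k = q^\top R\big((n-m)\theta\big)\,k$$

The score contributed by this pair therefore depends only on the relative offset $$n - m$$; the decay with distance described above comes from summing these contributions over all the pairs, each rotating at a different frequency $$\theta_d$$ (see the paper for the formal decay argument).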
A more detailed explanation
Fundamentally, RoPEs work by dividing the embedding space of our q and k vectors (the $$\text{head\_dim}$$) into many chunks of two. Each 2-dimensional chunk can be thought of as a vector subcomponent of q and k projected onto a 2-dimensional plane that exists within the higher-dimensional space of the q and k embedding. RoPE "rotates" the planar chunks of our q and k vectors uniquely based on the index of the token in the sequence. Each "chunk" is rotated by a unique amount $$\theta_{m, d/2}$$ determined by the index $$m$$ of the token in the sequence and the dimension $$d$$ of the subcomponents of q and k being rotated.
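As an illustration of the non-fused path, here is a minimal PyTorch sketch of this chunk-and-rotate procedure. It assumes the non-interleaved layout (the two halves of each head vector form the 2-dimensional pairs), a full rotary fraction, and an integer sequence-length offset; the function and argument names are illustrative, not the PR's actual API.

```python
import torch

def apply_rotary_reference(x, cos, sin, seqlen_offset=0):
    """Minimal non-interleaved RoPE reference (illustrative, not the PR's kernel).

    x:   (batch, seqlen, nheads, head_dim) query or key tensor
    cos: (max_seqlen, rotary_dim // 2) precomputed cos(m * theta_d)
    sin: (max_seqlen, rotary_dim // 2) precomputed sin(m * theta_d)
    seqlen_offset: int offset into cos/sin (e.g. the cache length during decode)
    """
    ro_dim_half = cos.shape[-1]  # = rotary_dim // 2
    seqlen = x.shape[1]
    # Slice the rotation tables for the positions actually present.
    cos = cos[seqlen_offset : seqlen_offset + seqlen][None, :, None, :]
    sin = sin[seqlen_offset : seqlen_offset + seqlen][None, :, None, :]
    # Split each head vector into the two halves that form the 2-D planes.
    x1 = x[..., :ro_dim_half]
    x2 = x[..., ro_dim_half : 2 * ro_dim_half]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    out1 = x1 * cos - x2 * sin
    out2 = x1 * sin + x2 * cos
    # Dimensions beyond rotary_dim (rotary_fraction < 1.0) pass through unchanged.
    return torch.cat([out1, out2, x[..., 2 * ro_dim_half :]], dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    seqlen, head_dim = 16, 64
    # Standard RoPE frequencies: theta_d = 10000^(-2d / head_dim).
    inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.arange(seqlen, dtype=torch.float32)[:, None] * inv_freq[None, :]
    q = torch.randn(2, seqlen, 4, head_dim)
    q_rot = apply_rotary_reference(q, angles.cos(), angles.sin())
    print(q_rot.shape)  # torch.Size([2, 16, 4, 64])
```

The interleaved variant pairs adjacent even/odd dimensions instead of the two halves, which is what the `rotary_interleaved` flag in the tests above toggles.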