[JAX] Collective GEMM custom op with `nvte_cublas_gemm` (no comm. overlap) #1307

denera · 2024-11-02T02:30:10Z

Description

Implements both old-style and new FFI-based XLA custom calls in C++, and the corresponding JAX primitive including custom partitioning rules.

Custom partitioning rules for a LHS:([B,] M, K) x RHS:([B,] K, N) = OUT:([B,] M, N) batched mat-mul operation where [B] is the batch dimension:

Preserve the partitioning of the [B] dimension for all operands.
Always all-gather LHS along the M dimension.
Error out if RHS is partitioned in both K and N dimensions.
Force the K dimension of LHS to match the partitioning of the K dimension of RHS.
If K dimension is partitioned but M dimension is not, jax.lax.psum (all-reduce) the output over the TP mesh resource.
If both the M and K dimensions are partitioned, jax.lax.psum_scatter (reduce-scatter) the output over the TP mesh resource.
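
The output-reduction rules above can be condensed into a small decision function. This is a minimal pure-Python sketch under stated assumptions, not the partitioning code from this PR: `choose_output_collective` and its parameters are hypothetical names, and each argument is the mesh-axis name a dimension is sharded over, or `None` when that dimension is not sharded.

```python
from typing import Optional


def choose_output_collective(
    lhs_m_axis: Optional[str],
    rhs_k_axis: Optional[str],
    rhs_n_axis: Optional[str],
) -> Optional[str]:
    """Return the name of the collective applied to the GEMM output, or None.

    Hypothetical helper that mirrors the rules listed in the description.
    """
    # RHS may be sharded along K or along N, but not along both.
    if rhs_k_axis is not None and rhs_n_axis is not None:
        raise ValueError("RHS cannot be partitioned in both K and N dimensions")

    # The K dimension of LHS is forced to follow the K sharding of RHS,
    # so the RHS K axis alone decides whether partial sums must be reduced.
    k_is_sharded = rhs_k_axis is not None
    m_is_sharded = lhs_m_axis is not None

    if k_is_sharded and m_is_sharded:
        return "psum_scatter"  # reduce-scatter over the TP mesh resource
    if k_is_sharded:
        return "psum"  # all-reduce over the TP mesh resource
    return None  # no reduction of the output is needed
```

For example, `choose_output_collective(None, "tp", None)` selects `"psum"`, while sharding M as well selects `"psum_scatter"`.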

In practice, the RHS matrix (typically the weight tensor) should be allocated with transposed contracting dimensions ([B,] N, K) for optimal GEMM heuristics in cuBLASLt. This layout is also mandatory for FP8 inputs.
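
To illustrate what the (N, K) layout means for the contraction, here is a minimal pure-Python sketch on nested lists. It is illustrative only and does not call cuBLASLt; `matmul_rhs_nk` is a hypothetical name. With RHS stored as (N, K), every output element is a dot product of one LHS row with one RHS row, so both operands contract over their last dimension.

```python
def matmul_rhs_nk(lhs, rhs_nk):
    """Multiply lhs (M x K) by a matrix stored transposed as rhs_nk (N x K).

    Hypothetical helper: out[m][n] = sum over k of lhs[m][k] * rhs_nk[n][k].
    """
    return [
        [sum(a * b for a, b in zip(lhs_row, rhs_row)) for rhs_row in rhs_nk]
        for lhs_row in lhs
    ]


lhs = [[1, 2], [3, 4]]      # shape (M=2, K=2)
rhs_nk = [[5, 7], [6, 8]]   # shape (N=2, K=2), the transpose of [[5, 6], [7, 8]]
out = matmul_rhs_nk(lhs, rhs_nk)
```

Here `out` equals the ordinary product of `[[1, 2], [3, 4]]` and `[[5, 6], [7, 8]]`.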

This PR does NOT update fused ops or Flax/Praxis modules to use the new GEMM custom op over the existing XLA pattern matching approach.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Added XLA custom calls for nvte_cublas_gemm.
Added JAX primitive for the new XLA custom call.
Added new serial unit test.
Added distributed unit test.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

nouiz · 2024-11-04T16:44:13Z

Why? Normal JAX behavior is to do some gathering.

huanghua1994 · 2024-11-04T22:58:56Z

It seems that the batch size is currently not handled in the C++ code. Since JAX uses row-major storage for tensors by default, the batch dimension should probably be combined with the M dimension for LHS or the N dimension for RHS?
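
For the LHS case, the row-major argument can be checked with plain index arithmetic. The sketch below is illustrative and covers only folding the batch dimension into M for a ([B,] M, K) operand; the function names are hypothetical and nothing here is taken from the C++ code in this PR.

```python
def offset_3d(b: int, m: int, k: int, M: int, K: int) -> int:
    """Row-major offset of element (b, m, k) in a (B, M, K) tensor."""
    return (b * M + m) * K + k


def offset_2d(row: int, k: int, K: int) -> int:
    """Row-major offset of element (row, k) in a (rows, K) matrix."""
    return row * K + k


def batch_folds_into_rows(B: int, M: int, K: int) -> bool:
    """Check that (B, M, K) and (B * M, K) address the same memory locations."""
    return all(
        offset_3d(b, m, k, M, K) == offset_2d(b * M + m, k, K)
        for b in range(B)
        for m in range(M)
        for k in range(K)
    )
```

Because the offsets agree for every element, a row-major (B, M, K) buffer can be read as a (B * M, K) matrix without moving data.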

Signed-off-by: Alp Dener <[email protected]>
Added XLA FFI custom op for TE GEMM
Signed-off-by: Alp Dener <[email protected]>
finished GEMM custom op primitive and serial unit test
Signed-off-by: Alp Dener <[email protected]>
fixed GEMM custom op batcher
Signed-off-by: Alp Dener <[email protected]>
fixed output dtype error and contracting dimensions options
Signed-off-by: Alp Dener <[email protected]>
AG overlap working but executes scatter to match outer LHS dim
Signed-off-by: Alp Dener <[email protected]>
both all-gather and all-reduce are now working
Signed-off-by: Alp Dener <[email protected]>
code style
Signed-off-by: Alp Dener <[email protected]>
changed kwargs in abstract to be explicit
Signed-off-by: Alp Dener <[email protected]>
added fwd/bwd implementation for non-fp8 gemm
Signed-off-by: Alp Dener <[email protected]>

Signed-off-by: Alp Dener <[email protected]>

for more information, see https://pre-commit.ci

abhinavgoel95

@denera I have some questions about the PR.

abhinavgoel95 · 2024-11-15T00:05:08Z

transformer_engine/jax/cpp_extensions/gemm.py

+
+        # Validate operand layouts
+        lhs_inner_dim, rhs_inner_dim = map(
+            lambda inner_dim, ndims: (ndims - inner_dim) if inner_dim < 0 else inner_dim,


@denera should be ndims + inner_dim when inner_dim is negative, right?
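
The point of this comment can be shown with a short, self-contained sketch. `normalize_axis` is a hypothetical helper written for illustration, not code from `gemm.py`; it follows the usual Python convention that a negative axis counts from the end.

```python
def normalize_axis(axis: int, ndims: int) -> int:
    """Map a possibly negative axis index to its non-negative equivalent."""
    if not -ndims <= axis < ndims:
        raise ValueError(f"axis {axis} is out of range for {ndims} dimensions")
    # A negative axis counts from the end, so add ndims rather than subtract.
    return axis + ndims if axis < 0 else axis


ndims = 3
fixed = normalize_axis(-1, ndims)   # last axis of a 3-D operand
buggy = ndims - (-1)                # what `ndims - inner_dim` evaluates to
```

With `ndims = 3`, `fixed` is the valid index 2, whereas `buggy` is 4, which is outside the valid range for a 3-D operand.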

abhinavgoel95 · 2024-11-15T00:29:34Z

transformer_engine/jax/cpp_extensions/gemm.py

+        rhs_trans = contracting_dims[1] == rhs.ndim - 1
+        lhs = jnp.matrix_transpose(lhs) if lhs_trans and jax_dtype_is_fp8(lhs.dtype) else lhs
+        rhs = jnp.matrix_transpose(rhs) if not rhs_trans and jax_dtype_is_fp8(rhs.dtype) else rhs
+        contracting_dims = (1, 1)


@denera is there a need to hard-code this?

denera · 2024-11-15

cuBLASLt requires a non-transposed LHS and a transposed RHS for FP8 GEMM, but the batcher is not the right place to check or force that. Also, leaving contracting_dims=(1, 1) outside the FP8 conditional is a mistake. Thanks for catching it!

abhinavgoel95 · 2024-11-15T00:44:21Z

transformer_engine/jax/cpp_extensions/gemm.py

+            grad=grad,
+            accumulate=accumulate,
+            use_split_accumulator=use_split_accumulator,
+        )(lhs_bdims, out_amax_bdims, out_scale_bdims, gelu_input_bdims, bias_bdims)


This gives me an error.
Line: https://github.com/NVIDIA/TransformerEngine/pull/1307/files#diff-f5b74ca3c5a70acb3d764e9b8adea40b8bab554fe4d2362f3052b7b932c0464dR187-R194 returns a tuple.

TypeError: 'list' object is not callable

cc @denera
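
For reference, this class of error appears whenever call parentheses are applied to a value that is a list rather than a function. The sketch below is a generic reproduction, not the code from the PR: `make_bdims` is a hypothetical stand-in, and the thread does not state which object in the PR is the list.

```python
def make_bdims():
    """Hypothetical stand-in that returns a list instead of a callable."""
    return [0, None]


def call_result() -> str:
    """Apply call syntax to the returned list and capture the error message."""
    try:
        make_bdims()(0, None)  # the second pair of parentheses calls the list
    except TypeError as err:
        return str(err)
    return "no error"


message = call_result()
```

The captured `message` matches the text of the traceback quoted above.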

denera added the jax label Nov 2, 2024

denera requested review from nouiz and phu0ngng November 2, 2024 02:30

denera self-assigned this Nov 2, 2024

denera changed the title ~~[JAX] Collective GEMM custom op with nvte_cublas_gemm~~ [JAX] Collective GEMM custom op with nvte_cublas_gemm (no comm. overlap) Nov 2, 2024

denera force-pushed the jax-collective-gemm branch from bb2be56 to fea0728 Compare November 6, 2024 02:13

denera force-pushed the jax-collective-gemm branch from 3ec3eca to 941f5bb Compare November 14, 2024 09:30

fixed batching rules to accommodate batched RHS operand for GEMM

f440094

Signed-off-by: Alp Dener <[email protected]>

denera force-pushed the jax-collective-gemm branch from 6444211 to f440094 Compare November 14, 2024 18:14

[pre-commit.ci] auto fixes from pre-commit.com hooks

52af237

for more information, see https://pre-commit.ci

abhinavgoel95 reviewed Nov 15, 2024

denera mentioned this pull request Nov 15, 2024

[C/JAX] Comm+GEMM Overlap API for TE/JAX #1337

Draft

13 tasks

phu0ngng requested a review from huanghua1994 November 15, 2024 16:34

re-applied bug fixes to working older version, updated backward pass,…

b989641

… passing test Signed-off-by: Alp Dener <[email protected]>

denera force-pushed the jax-collective-gemm branch from 30b7b06 to b989641 Compare November 15, 2024 23:56

[pre-commit.ci] auto fixes from pre-commit.com hooks

a3d86f2

for more information, see https://pre-commit.ci
