[C] Normalization Refactor + Adding CUDNN backend #1315

Draft: phu0ngng wants to merge 59 commits into main

Conversation

@phu0ngng (Collaborator) commented Nov 5, 2024

Description

TODO:

  • Adapt normalization in JAX and Paddle.
  • Benchmark performance of new APIs.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <[email protected]>
dgamma_part.dtype());
dbeta_part = makeTransformerEngineTensor(dbeta_part_data.data_ptr(), dbeta_part.shape(),
dbeta_part.dtype());
if (!std::getenv("NVTE_BWD_LAYERNORM_USE_CUDNN")) {
Collaborator:

We would have unexpected behavior if the user sets NVTE_BWD_LAYERNORM_USE_CUDNN=0:

Suggested change
if (!std::getenv("NVTE_BWD_LAYERNORM_USE_CUDNN")) {
if (!std::getenv("NVTE_BWD_LAYERNORM_USE_CUDNN")
|| !std::atoi(std::getenv("NVTE_BWD_LAYERNORM_USE_CUDNN"))) {

Alternatively, we could use TE's getenv function:

T getenv(const char *variable);

However, this is delicate since it can run into issues if the core lib and framework lib are compiled with different C++ ABIs. A solution might be to make getenv a header-only impl.
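
For reference, a minimal sketch of what a header-only variant could look like, assuming the same semantics as TE's existing getenv<T> (an unset variable falls back to a default value). The names and details here are illustrative, not the actual TE implementation:

  // Illustrative header-only getenv<T> sketch; not the actual TE implementation.
  #include <cstdlib>
  #include <sstream>
  #include <string>

  namespace transformer_engine {

  template <typename T>
  inline T getenv(const char *variable, const T default_value = T()) {
    const char *env = std::getenv(variable);
    if (env == nullptr || env[0] == '\0') return default_value;
    T value{};
    std::istringstream ss{std::string(env)};
    ss >> value;  // e.g. "0" -> false and "1" -> true when T = bool
    return ss.fail() ? default_value : value;
  }

  }  // namespace transformer_engine

Being header-only, this would be instantiated in whichever translation unit uses it, sidestepping the cross-ABI concern between the core and framework libs.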

Collaborator (Author):

Hi, I had not yet introduced changes in the framework part.
With the new change on the PyTorch side in the latest commit, we no longer need this env check.

@timmoon10 self-requested a review November 8, 2024 00:47
phu0ngng and others added 8 commits November 8, 2024 17:32
Signed-off-by: Phuong Nguyen <[email protected]>
Comment on lines +138 to +140
enum class NVTE_Norm_Backend { Te, Cudnn };
enum class NVTE_Norm_Type { LayerNorm, RMSNorm };
enum class NVTE_Norm_Stage { Forward, Backward };
Collaborator:

Nit: We use the NVTE_ prefix for C-style enums to avoid ambiguity, but it's not necessary for these C++-style enums since they're defined within the transformer_engine::normalization namespace:

Suggested change
enum class NVTE_Norm_Backend { Te, Cudnn };
enum class NVTE_Norm_Type { LayerNorm, RMSNorm };
enum class NVTE_Norm_Stage { Forward, Backward };
enum class NormBackend { Te, Cudnn };
enum class NormType { LayerNorm, RMSNorm };
enum class NormStage { Forward, Backward };

Not relevant yet, but in the future we might want to have separate stages for ForwardTrain and ForwardInfer if cuDNN exposes an optimized inference impl.
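
For reference, that future split could look something like this (purely illustrative):

  enum class NormStage { ForwardTrain, ForwardInfer, Backward };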


NVTE_Norm_Backend norm_backend;
bool is_aligned = true;
if (std::getenv("NVTE_FWD_LAYERNORM_USE_CUDNN")) {
Collaborator:

We should use TE's getenv function so that users can specify NVTE_FWD_LAYERNORM_USE_CUDNN=0:

Suggested change
if (std::getenv("NVTE_FWD_LAYERNORM_USE_CUDNN")) {
if (getenv<bool>("NVTE_FWD_LAYERNORM_USE_CUDNN")) {

See:

T getenv(const char *variable);

Similar changes should be made in the backward function, as well as RMSNorm. We might also want to consider caching the value to avoid CPU overheads:

  static const bool use_cudnn_backend = getenv<bool>("NVTE_FWD_LAYERNORM_USE_CUDNN");
  if (use_cudnn_backend) {

Related: #1315 (comment)

Comment on lines +42 to +50
if (workspace->data.shape.empty()) {
CheckInputTensor(x, "x");
CheckInputTensor(gamma, "gamma");
CheckInputTensor(beta, "beta");

CheckOutputTensor(*z, "z");
CheckOutputTensor(*mu, "mu");
CheckOutputTensor(*rsigma, "rsigma");
}
Collaborator:

It's not immediately obvious why we only check tensor dimensions when the workspace is empty. If the checks are cheap, it would be simpler to run them every time. Otherwise, it would be helpful to add a comment noting that this function is expected to be called twice (once to query the workspace size and once to launch the kernel) and that we only check tensors on the first call.

Similar changes should be made in the backward function, as well as RMSNorm.
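
For instance, a comment along these lines (wording is only a suggestion) would make the two-call convention explicit:

  // NOTE: this function is expected to be called twice: first with an empty
  // workspace to query the required workspace size, then again with the
  // allocated workspace to launch the kernel. Tensors are only validated on
  // the first (query) call to avoid redundant checks.
  if (workspace->data.shape.empty()) {
    CheckInputTensor(x, "x");
    CheckInputTensor(gamma, "gamma");
    CheckInputTensor(beta, "beta");

    CheckOutputTensor(*z, "z");
    CheckOutputTensor(*mu, "mu");
    CheckOutputTensor(*rsigma, "rsigma");
  }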

* where
* @f[
* RMS_\varepsilon(x) = \sqrt{\frac{1}{n}\sum_{i=0}^{n-1} x_i^2 + \varepsilon}
* y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}}\gamma) + \beta
Collaborator:

Mismatched parenthesis:

Suggested change
* y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}}\gamma) + \beta
* y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}} \gamma + \beta

We could also make the LaTeX look nicer with:

 * y = \frac{x - \mathbb{E}[x]}{\sqrt{\text{Var}[x] + \varepsilon}} \gamma + \beta

Any changes here should also be made in the backward function and in RMSNorm.

* \param[in] multiprocessorCount Number of SMs in the device.
* \param[in] zero_centered_gamma If zero_centered_gamma is enabled
Collaborator:

We should provide more details since we don't include the formula in the main documentation.

Suggested change
* \param[in] zero_centered_gamma If zero_centered_gamma is enabled
* \param[in] zero_centered_gamma Multiply normalized values by @f$ \gamma+1 @f$ instead of @f$ \gamma @f$

Similar changes should be made to the backward function as well as RMSNorm.

Comment on lines +282 to +283
double atol_bwd = 1e-3;
double rtol_bwd = 1e-3;
Collaborator:

Do we need to relax tolerances?

Suggested change
double atol_bwd = 1e-3;
double rtol_bwd = 1e-3;
double atol_bwd = 1e-4;
double rtol_bwd = 1e-4;

Ideally we would use tight tolerances like:

if dtype == torch.float16:
    return dict(rtol=1e-3, atol=1e-5)
if dtype == torch.bfloat16:
    return dict(rtol=1.6e-2, atol=1e-5)
if dtype == torch.float32:
    return dict(rtol=1.3e-6, atol=1e-5)
if dtype == torch.float64:
    return dict(rtol=1e-7, atol=1e-7)
if dtype == torch.float8_e4m3fn:
    return dict(rtol=0.125, atol=0.0675)  # epsilon = 0.0625
if dtype == torch.float8_e5m2:
    return dict(rtol=0.25, atol=0.125)  # epsilon = 0.152

LaunchParams<KernelParamsType> _launch_params;
std::function<void(LaunchParams<KernelParamsType>&, const bool)> _kernel;

bool _is_layernorm;
Collaborator:

Nit: Caching this value makes the code slightly more concise, but I think it's clearer and more general to just store the original enum:

Suggested change
bool _is_layernorm;
NVTE_Norm_Type _norm_type;
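
Call sites that currently branch on _is_layernorm could then compare against the stored enum, e.g. (illustrative only):

  if (_norm_type == NVTE_Norm_Type::LayerNorm) {
    // LayerNorm-specific handling, e.g. mean/beta tensors
  }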

Comment on lines +205 to +211
virtual void execute(Tensor* z, void* x_dptr, void* gamma_dptr, void* beta_dptr, void* mean_dptr,
void* eps_dptr, void* rsigma_dptr, void* workspace_dptr,
cudaStream_t stream) = 0;

virtual void execute(void* x_dptr, void* gamma_dptr, void* mean_dptr, void* rsigma_dptr,
void* dx_dptr, void* dz_dptr, void* dbeta_dptr, void* dgamma_dptr,
void* workspace_dptr, cudaStream_t stream) = 0;
Collaborator:

It's a bit awkward that NormalizationPlanBase includes logic for both the forward and backward kernels. I feel the "right" solution is to implement separate abstract base classes for forward and backward. That said, this approach does allow some code reuse with the cuDNN backend, so I think it's fine for now. We can revisit in the future if needed, e.g. if we add a separate forward inference stage.
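
For the record, a rough sketch of what that split could look like, reusing the execute signatures quoted above (the class names are hypothetical):

  class NormalizationFwdPlanBase {
   public:
    virtual ~NormalizationFwdPlanBase() = default;
    // Forward pass: normalizes x with gamma/beta and writes z, mu, rsigma.
    virtual void execute(Tensor* z, void* x_dptr, void* gamma_dptr, void* beta_dptr,
                         void* mean_dptr, void* eps_dptr, void* rsigma_dptr,
                         void* workspace_dptr, cudaStream_t stream) = 0;
  };

  class NormalizationBwdPlanBase {
   public:
    virtual ~NormalizationBwdPlanBase() = default;
    // Backward pass: computes dx, dgamma, dbeta from dz and saved statistics.
    virtual void execute(void* x_dptr, void* gamma_dptr, void* mean_dptr, void* rsigma_dptr,
                         void* dx_dptr, void* dz_dptr, void* dbeta_dptr, void* dgamma_dptr,
                         void* workspace_dptr, cudaStream_t stream) = 0;
  };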


NVTE_Norm_Backend norm_backend;
bool is_aligned = true;
if (std::getenv("NVTE_FWD_LAYERNORM_USE_CUDNN")) {
Collaborator:

One problem with these envvars is that they're all-or-nothing. This is especially troublesome for testing, where we may want to test TE and cuDNN norm kernels within the same process. Perhaps we could have some function like nvte_enable_cudnn_layernorm that sets a global variable, and the initial value is based on an envvar.
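
A sketch of that idea (the function and variable names here are hypothetical):

  namespace {
  // Initial value comes from the envvar; tests can flip it at runtime.
  bool use_cudnn_layernorm = getenv<bool>("NVTE_FWD_LAYERNORM_USE_CUDNN");
  }  // namespace

  void nvte_enable_cudnn_layernorm(bool enable) { use_cudnn_layernorm = enable; }

  bool nvte_layernorm_uses_cudnn() { return use_cudnn_layernorm; }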

Comment on lines 329 to 337
INSTANTIATE_TEST_SUITE_P(
OperatorTest,
LNTestSuite,
NormTestSuite,
::testing::Combine(
::testing::Values(NormType::LayerNorm, NormType::RMSNorm),
::testing::Values(DType::kFloat32, DType::kBFloat16, DType::kFloat16),
::testing::Values(DType::kFloat32, DType::kBFloat16, DType::kFloat16, DType::kFloat8E4M3),
::testing::ValuesIn(test_cases),
::testing::Values(false, true)),
Collaborator:

It would be helpful to test both cuDNN and TE kernels. We can reduce the number of test shapes to keep the total number of tests manageable.
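
One way to do that (sketch only; the extra tuple index and the envvar toggles are hypothetical and meant to sit inside the existing test body) would be to add a backend parameter and switch the envvars per test case:

  // Hypothetical extra test parameter selecting the backend for this case.
  const bool use_cudnn = std::get<5>(GetParam());
  if (use_cudnn) {
    setenv("NVTE_FWD_LAYERNORM_USE_CUDNN", "1", 1);
    setenv("NVTE_BWD_LAYERNORM_USE_CUDNN", "1", 1);
  } else {
    unsetenv("NVTE_FWD_LAYERNORM_USE_CUDNN");
    unsetenv("NVTE_BWD_LAYERNORM_USE_CUDNN");
  }

Note this only works if the backend selection is re-read on each call rather than cached in a static, which ties into the earlier comment about an explicit runtime toggle.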

@timmoon10 self-requested a review November 14, 2024 03:04