attention_mask fill with -inf for UnfusedDotProductAttention #1268

Agoniii · 2024-10-18T03:36:00Z

Description

UnfusedDotProductAttention in TE fills masked positions in the attention scores with -10000, but in some cases this value is not negative enough, which leads to a large difference between TE and HF outputs. This PR fills with -inf instead.
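To illustrate one way a finite fill value can leak probability mass, here is a minimal pure-Python sketch of a masked softmax over a single row. It is not TE's implementation: `masked_softmax` is a hypothetical helper, the scores are made up and deliberately placed near -10000, and Python floats are double precision rather than FP32.

```python
import math


def masked_softmax(scores, mask, fill_value):
    """Softmax over one row; positions where mask is True are replaced by fill_value."""
    filled = [fill_value if m else s for s, m in zip(scores, mask)]
    row_max = max(filled)  # subtract the row maximum for numerical stability
    exps = [math.exp(x - row_max) for x in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores (not from this PR): the unmasked entries are close to
# the finite fill value, so the masked entry is not suppressed by -10000.
scores = [-10002.0, -10001.0, 5.0]
mask = [False, False, True]  # True marks a position that must be ignored

leaky = masked_softmax(scores, mask, -10000.0)
exact = masked_softmax(scores, mask, float("-inf"))

print("fill = -10000:", leaky)
print("fill = -inf  :", exact)
```

With the finite fill, the masked position ends up with the largest weight in the row, while with -inf its weight is exactly zero. A row in which every position is masked would produce NaN with -inf in this sketch, so that case would need separate handling.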

Logits from a model forward pass in FP32 are shown below.
HF baseline:

logits: tensor([[[-0.6598, -4.5184, -4.0881,  ..., -5.0321, -5.0322, -5.0315],
         [ 2.3831,  2.2049, -0.3057,  ..., -6.5912, -6.5907, -6.5909],
         [-2.3334, -2.7213, -4.7815,  ..., -9.8526, -9.8525, -9.8523],
         ...,
         [ 3.3538,  1.4944,  1.0958,  ..., -6.4574, -6.4574, -6.4577],
         [ 5.6523,  0.4127,  1.0138,  ..., -8.7066, -8.7066, -8.7068],
         [ 0.3297, -0.7106, -0.8580,  ..., -7.7929, -7.7928, -7.7930]]],
       device='cuda:0', grad_fn=<UnsafeViewBackward0>)

TE before:

logits: tensor([[[-0.6598, -4.5184, -4.0881,  ..., -5.0321, -5.0322, -5.0316],
         [ 2.1408,  2.3935, -0.1826,  ..., -6.7005, -6.7000, -6.7002],
         [-2.5307, -3.0045, -4.9231,  ..., -9.4020, -9.4019, -9.4017],
         ...,
         [ 3.3538,  1.4944,  1.0958,  ..., -6.4574, -6.4574, -6.4578],
         [ 5.6524,  0.4127,  1.0138,  ..., -8.7066, -8.7066, -8.7068],
         [ 0.3297, -0.7106, -0.8580,  ..., -7.7929, -7.7928, -7.7930]]],
       device='cuda:0', grad_fn=<TransposeBackward0>)

TE after:

logits: tensor([[[-0.6598, -4.5184, -4.0881,  ..., -5.0321, -5.0322, -5.0316],
         [ 2.3831,  2.2049, -0.3057,  ..., -6.5912, -6.5907, -6.5909],
         [-2.3334, -2.7213, -4.7815,  ..., -9.8526, -9.8525, -9.8523],
         ...,
         [ 3.3538,  1.4944,  1.0958,  ..., -6.4574, -6.4574, -6.4578],
         [ 5.6524,  0.4127,  1.0138,  ..., -8.7066, -8.7066, -8.7068],
         [ 0.3297, -0.7106, -0.8580,  ..., -7.7929, -7.7928, -7.7930]]],
       device='cuda:0', grad_fn=<TransposeBackward0>)
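The gap shown in the printouts above can be quantified with a small sketch. It uses only the first three displayed values of the second row from each tensor; `max_abs_diff` is a hypothetical helper and the full tensors are not available here.

```python
# First three printed values of the second row in each logits tensor above.
hf_baseline = [2.3831, 2.2049, -0.3057]
te_before = [2.1408, 2.3935, -0.1826]
te_after = [2.3831, 2.2049, -0.3057]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


diff_before = max_abs_diff(hf_baseline, te_before)
diff_after = max_abs_diff(hf_baseline, te_after)

print(f"TE before vs HF: {diff_before:.4f}")
print(f"TE after  vs HF: {diff_after:.4f}")
```

On these printed values the difference drops from roughly 0.24 to zero. Other rows still differ from the HF baseline in the last printed digit.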

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

xrennvidia · 2024-10-18T04:48:06Z

/te-ci pytorch

Signed-off-by: Agoniii <[email protected]>

xrennvidia · 2024-10-18T05:03:38Z

/te-ci pytorch

xrennvidia requested review from cyanguwa and xrennvidia October 18, 2024 04:46

attention_mask fill with -inf for UnfusedDotProductAttention

8d27109

Signed-off-by: Agoniii <[email protected]>

Agoniii force-pushed the xueh/fix branch from 78ffc1c to 8d27109 Compare October 18, 2024 04:57
