Non-causal sliding window mask? #73
Hi @Optimox, in the Gemma model's implementation, causal masks ensure that each token in a sequence can only attend to itself and to preceding tokens. -2.3819763e38 is used instead of -torch.inf for numerical stability: an extremely large but finite negative value avoids floating-point issues (such as NaNs arising from arithmetic on -inf) during computation. For more detail, please refer to this reference. Thank you.
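The numerical-stability point can be demonstrated with a plain-Python softmax (a minimal sketch, not the Gemma code; `softmax` and `NEG_BIG` are illustrative names):

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the row max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

NEG_BIG = -2.3819763e38  # large finite negative value, as in the Gemma code

# A fully masked attention row with the finite value degrades gracefully:
# x - max(x) = 0 for every entry, so the result is a uniform distribution.
row_big = softmax([NEG_BIG] * 4)   # [0.25, 0.25, 0.25, 0.25]

# The same row with -inf yields NaN, because -inf - (-inf) is NaN.
row_inf = softmax([float("-inf")] * 4)
```

A row that is entirely masked can occur in practice (e.g. padding-only rows), which is one reason a finite sentinel is preferred over -inf.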
Hi @Optimox, could you please confirm whether the comment above resolves this issue for you? If so, please feel free to close it. Thank you.
Yes, thank you for your help!
Hi,
I am trying to port Gemma 2 to the torchtune library.
When following the code for sliding-window mask generation, I first used a binary causal mask, but this seems to produce a non-causal sliding-window mask: 1s for previous tokens, 0s for future tokens inside the sliding window, and -inf for the remaining tokens.
I have shared minimal reproducible code demonstrating the problem here.
Could someone clarify what format is expected for the causal or block-causal masks in your implementation?
I would also be curious to know why you are using -2.3819763e38 instead of -torch.inf.
Thank you for helping me clarify the situation!
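For reference, one common additive-mask convention can be sketched as follows (a hypothetical illustration, not torchtune's or Gemma's actual code): visible positions carry a bias of 0.0 and blocked positions carry a large negative bias, so adding the mask to the attention scores before the softmax zeroes out disallowed keys.

```python
# Hypothetical helper: an additive causal sliding-window mask.
# mask[q][k] == 0.0 iff key k is visible from query q, i.e. the key is
# causal (k <= q) AND inside the sliding window (q - k < window).
NEG_BIG = -2.3819763e38  # large finite negative bias, as in the Gemma code

def sliding_window_causal_mask(seq_len, window):
    return [
        [0.0 if (k <= q and q - k < window) else NEG_BIG
         for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_causal_mask(5, window=3)
# Visualize: "x" = visible, "." = blocked. Each query sees at most
# `window` keys, all of them at or before its own position.
for row in mask:
    print("".join("x" if v == 0.0 else "." for v in row))
```

With `seq_len=5` and `window=3`, query 3 can see keys 1, 2, 3 but not key 0 (outside the window) or key 4 (in the future), which is what a causal sliding-window mask should look like.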