Non-causal sliding window mask? #73
Hi @Optimox, in the Gemma model's implementation, causal masks ensure that each token in a sequence can only attend to itself and to preceding tokens. -2.3819763e38 is used instead of -torch.inf for numerical stability: an extremely large but finite negative value avoids floating-point issues (such as NaNs arising from arithmetic on -inf) during computation. For more detail, please refer to this reference. Thank you.
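The numerical-stability point can be demonstrated with a plain-Python softmax (a minimal sketch, not the Gemma code; `softmax` and `NEG_BIG` are illustrative names):

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the row max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

NEG_BIG = -2.3819763e38  # large finite negative value, as in the Gemma code

# A fully masked attention row with the finite value degrades gracefully:
# x - max(x) = 0 for every entry, so the result is a uniform distribution.
row_big = softmax([NEG_BIG] * 4)   # [0.25, 0.25, 0.25, 0.25]

# The same row with -inf yields NaN, because -inf - (-inf) is NaN.
row_inf = softmax([float("-inf")] * 4)
```

A row that is entirely masked can occur in practice (e.g. padding-only rows), which is one reason a finite sentinel is preferred over -inf.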
Hi @Optimox, could you please confirm whether the comment above resolves this issue for you? If so, please feel free to close it. Thank you.
Yes, thank you for your help!
Hi,
I am trying to port Gemma 2 to the torchtune library.
When following the code for sliding-window mask generation, I first used a binary causal mask, but this seems to produce a non-causal sliding-window mask: 1s for previous tokens, 0s for future tokens inside the sliding window, and -inf for the remaining tokens.
I have shared minimal reproducible code demonstrating the problem here.
Could someone clarify what format is expected for the causal or block-causal masks in your implementation?
I would also be curious to know why you are using -2.3819763e38 instead of -torch.inf.
Thank you for helping me clarify the situation!
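For reference, one common additive-mask convention can be sketched as follows (a hypothetical illustration, not torchtune's or Gemma's actual code): visible positions carry a bias of 0.0 and blocked positions carry a large negative bias, so adding the mask to the attention scores before the softmax zeroes out disallowed keys.

```python
# Hypothetical helper: an additive causal sliding-window mask.
# mask[q][k] == 0.0 iff key k is visible from query q, i.e. the key is
# causal (k <= q) AND inside the sliding window (q - k < window).
NEG_BIG = -2.3819763e38  # large finite negative bias, as in the Gemma code

def sliding_window_causal_mask(seq_len, window):
    return [
        [0.0 if (k <= q and q - k < window) else NEG_BIG
         for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_causal_mask(5, window=3)
# Visualize: "x" = visible, "." = blocked. Each query sees at most
# `window` keys, all of them at or before its own position.
for row in mask:
    print("".join("x" if v == 0.0 else "." for v in row))
```

With `seq_len=5` and `window=3`, query 3 can see keys 1, 2, 3 but not key 0 (outside the window) or key 4 (in the future), which is what a causal sliding-window mask should look like.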