Support attention masking to prevent attention across EOT tokens #206
Nice find!
With the current implementation (#213), this only seems to help by about 0.1% compared to not using it. Not sure if we can get more benefit out of it.
@GeorgiosSmyrnis interesting! Do you know what the impact on speed is, say at 1B with 2-4 nodes?
Not 100% sure for 1B, but at the smaller scales it was running at about 1/3 of the speed, since the current implementation uses a different attention mask per document, so you cannot really use xformers directly as far as I understand. There is probably a way to do this with better performance, but given the marginal benefits in downstream performance I'm not sure it's worth it.
Closing this as wontfix for now, given that we are not seeing improvements in test accuracy and the current implementation slows down training. We can re-open if someone wants it or if we think there is a strong need for it in certain settings.
By default, openlm performs causal attention across all tokens in a sequence, even if the sequence contains multiple documents separated by EOT tokens. This might be related to #194. I think we can start by supporting this just for xformers, using BlockDiagonalCausalMask: https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask
cc @sagadre
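A minimal sketch of the idea, assuming xformers is installed and a CUDA device is available; the `eot_token_id` value, the `doc_lengths_from_tokens` helper, and the toy shapes are illustrative assumptions, not openlm's actual implementation. It splits a packed sequence into per-document lengths at EOT tokens and builds a `BlockDiagonalCausalMask` so attention is causal within each document and never crosses an EOT boundary:

```python
# Sketch: block-diagonal causal attention over documents packed into one sequence.
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask


def doc_lengths_from_tokens(tokens: torch.Tensor, eot_token_id: int) -> list[int]:
    """Split a packed token sequence into per-document lengths at EOT tokens."""
    lengths, current = [], 0
    for tok in tokens.tolist():
        current += 1
        if tok == eot_token_id:
            lengths.append(current)
            current = 0
    if current > 0:  # trailing partial document
        lengths.append(current)
    return lengths


# Toy inputs (illustrative): one packed sequence of 3 documents, 2 heads, head_dim 8.
eot_token_id = 0
tokens = torch.tensor([5, 6, 0, 7, 8, 9, 0, 3, 0])
seqlens = doc_lengths_from_tokens(tokens, eot_token_id)  # -> [3, 4, 2]

n, heads, dim = tokens.numel(), 2, 8
q = torch.randn(1, n, heads, dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal attention within each document, no attention across EOT boundaries.
mask = BlockDiagonalCausalMask.from_seqlens(seqlens)
out = memory_efficient_attention(q, k, v, attn_bias=mask)  # shape (1, n, heads, dim)
```

Note that the mask requires the packed sequences to sit along the token dimension with batch size 1, so batching multiple packed sequences would need their per-document lengths concatenated into a single `from_seqlens` call.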