
Support attention masking to prevent attention across EOT tokens #206

Closed
achalddave opened this issue Jan 25, 2024 · 5 comments

@achalddave
Collaborator

achalddave commented Jan 25, 2024

By default, openlm performs causal attention across all tokens in a sequence, even if the sequence contains multiple documents separated by EOT tokens. This might be related to #194. I think we can start by supporting this just for xformers, using BlockDiagonalCausalMask: https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask
cc @sagadre
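
For reference, a minimal sketch (not openlm's actual code) of how the xformers mask could be used: derive per-document lengths from EOT positions in a packed sequence, build a `BlockDiagonalCausalMask`, and pass it to `memory_efficient_attention`. The EOT token id, the `doc_lengths_from_eot` helper, and the toy tensor shapes below are illustrative assumptions.

```python
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask


def doc_lengths_from_eot(tokens: torch.Tensor, eot_token_id: int) -> list[int]:
    """Split a 1D packed token sequence into per-document lengths at EOT tokens."""
    lengths, start = [], 0
    eot_positions = (tokens == eot_token_id).nonzero(as_tuple=True)[0].tolist()
    for pos in eot_positions:
        lengths.append(pos - start + 1)  # include the EOT token in its document
        start = pos + 1
    if start < tokens.numel():           # trailing document without an EOT
        lengths.append(tokens.numel() - start)
    return lengths


# Toy example: one packed sequence of 3 documents, 2 heads, head_dim 8.
device = "cuda"  # memory_efficient_attention expects CUDA tensors in most builds
torch.manual_seed(0)
tokens = torch.tensor([5, 9, 2, 0, 7, 7, 0, 3, 1])   # 0 is the (assumed) EOT id
seqlens = doc_lengths_from_eot(tokens, eot_token_id=0)  # -> [4, 3, 2]

n, heads, head_dim = tokens.numel(), 2, 8
q = torch.randn(1, n, heads, head_dim, device=device)
k = torch.randn(1, n, heads, head_dim, device=device)
v = torch.randn(1, n, heads, head_dim, device=device)

# Block-diagonal causal mask: each token attends causally only within its document.
attn_bias = BlockDiagonalCausalMask.from_seqlens(seqlens)
out = memory_efficient_attention(q, k, v, attn_bias=attn_bias)
print(out.shape)  # torch.Size([1, 9, 2, 8])
```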

@sagadre
Collaborator

sagadre commented Jan 25, 2024

nice find!

@GeorgiosSmyrnis
Collaborator

With the current implementation (#213), this seems to help by only 0.1% compared to not using it. Not sure if we can get more benefit out of it.

@sagadre
Collaborator

sagadre commented Feb 29, 2024

@GeorgiosSmyrnis interesting! Do you know what the impact on speed is? Say at 1B with 2-4 nodes?

@GeorgiosSmyrnis
Collaborator

Not 100% sure for 1B, but at smaller scales it was running at 1/3 of the speed (since the current implementation relies on a different attention pattern per document, you cannot really use xformers directly, as far as I understand).

There should be a way to do this better as far as performance goes, but given the marginal benefits in downstream performance I'm not sure if it's worth it.

@achalddave
Collaborator Author

Closing this as wontfix for now, given that we're not seeing improvements in test accuracy and the current implementation slows down training. We can re-open if someone wants it or if we think there is a strong need for it in certain settings.
