Support attention masking to prevent attention across EOT tokens #206
Nice find!
With the current implementation (#213), this only seems to help by about 0.1% compared to not using it. Not sure if we can get more benefit out of it.
@GeorgiosSmyrnis interesting! Do you know what the impact on speed is, say at 1B with 2-4 nodes?
Not 100% sure for 1B, but at the smaller scales it was running at about 1/3 of the speed, since the current implementation uses a different attention mask per document, so you cannot really use xformers directly as far as I understand. There is probably a way to do this with better performance, but given the marginal benefits in downstream performance I'm not sure it's worth it.
Closing this as wontfix for now, given that we are not seeing improvements in test accuracy and the current implementation slows down training. We can re-open if someone wants it or if we think there is a strong need for it in certain settings.
By default, openlm performs causal attention across all tokens in a sequence, even if the sequence contains multiple documents separated by EOT tokens. This might be related to #194. I think we can start by supporting this just for xformers, using BlockDiagonalCausalMask: https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask
cc @sagadre
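A minimal sketch of the idea, assuming xformers is installed and a CUDA device is available; the `eot_token_id` value, the `doc_lengths_from_tokens` helper, and the toy shapes are illustrative assumptions, not openlm's actual implementation. It splits a packed sequence into per-document lengths at EOT tokens and builds a `BlockDiagonalCausalMask` so attention is causal within each document and never crosses an EOT boundary:

```python
# Sketch: block-diagonal causal attention over documents packed into one sequence.
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask


def doc_lengths_from_tokens(tokens: torch.Tensor, eot_token_id: int) -> list[int]:
    """Split a packed token sequence into per-document lengths at EOT tokens."""
    lengths, current = [], 0
    for tok in tokens.tolist():
        current += 1
        if tok == eot_token_id:
            lengths.append(current)
            current = 0
    if current > 0:  # trailing partial document
        lengths.append(current)
    return lengths


# Toy inputs (illustrative): one packed sequence of 3 documents, 2 heads, head_dim 8.
eot_token_id = 0
tokens = torch.tensor([5, 6, 0, 7, 8, 9, 0, 3, 0])
seqlens = doc_lengths_from_tokens(tokens, eot_token_id)  # -> [3, 4, 2]

n, heads, dim = tokens.numel(), 2, 8
q = torch.randn(1, n, heads, dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal attention within each document, no attention across EOT boundaries.
mask = BlockDiagonalCausalMask.from_seqlens(seqlens)
out = memory_efficient_attention(q, k, v, attn_bias=mask)  # shape (1, n, heads, dim)
```

Note that the mask requires the packed sequences to sit along the token dimension with batch size 1, so batching multiple packed sequences would need their per-document lengths concatenated into a single `from_seqlens` call.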