packed errors #2218
Comments
Hi @chg0901, thanks for creating the issue. Our packed dataset implementation uses flex attention under the hood to support the necessary block-causal mask while still retaining good performance. Unfortunately there are some nuances here: flex attention hardcodes some kernel configs depending on the type of hardware you're using, and these aren't currently optimized for the A6000. I would check this comment (along with others in the same thread for more context) for one way to get around this in the short term. This is a known issue in PyTorch core (see pytorch/pytorch#133254), and longer term the flex attention authors are working on fixing it (see pytorch/pytorch#137959).
Regarding InternLM, we aren't currently working on enabling it. Can you open a separate issue with a formal feature request? It'd be helpful to gauge community interest before opening a PR.
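For context, here's a minimal sketch of the kind of block-causal mask that packing needs, expressed with flex attention's mask_mod API. This is illustrative only, not torchtune's actual implementation; `doc_ids` is a made-up tensor marking which packed sample each token belongs to, and it assumes a CUDA device.
```python
# Sketch of a block-causal mask for packed sequences using flex attention.
# Not torchtune's code; doc_ids is a hypothetical per-position sample id.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

seq_len = 256
# Two packed samples of 128 tokens each.
doc_ids = torch.repeat_interleave(torch.arange(2, device="cuda"), 128)

def block_causal(b, h, q_idx, kv_idx):
    # Attend only within the same packed sample, and only to earlier positions.
    return (doc_ids[q_idx] == doc_ids[kv_idx]) & (q_idx >= kv_idx)

block_mask = create_block_mask(block_causal, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len)

q = k = v = torch.randn(1, 8, seq_len, 64, device="cuda", dtype=torch.bfloat16)
out = flex_attention(q, k, v, block_mask=block_mask)
```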
The following configuration encounters the same issue on L40s.
And it can work on Qwen2.5-0.5B-Instruct.
It likely works on 0.5B because the head_dim is smaller. As another temporary suggestion for anyone blocked, I'm pretty sure you can make it work on L40s if you change the way flex is compiled to let it find a kernel that is compatible with the CUDA smem of the machine, i.e. change
https://github.com/pytorch/torchtune/blob/27fd3a14b04b5c3d428c723ef4a3a27e1595102b/torchtune/modules/attention_utils.py#L25
to `flex_attention_compiled = torch.compile(flex_attention, dynamic=False, mode="max_autotune")`.
Alternatively, you can turn off flex attention by hard coding `_SUPPORTS_FLEX_ATTENTION = False`, which will still allow for `packed=True`.
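Concretely, the edit in `torchtune/modules/attention_utils.py` would look roughly like the sketch below (exact line placement may differ in your checkout):
```python
# Sketch of the two workarounds described above, for torchtune/modules/attention_utils.py.
import torch
from torch.nn.attention.flex_attention import flex_attention

# Option 1: compile with max-autotune so the compiler can pick a kernel config
# that fits this GPU's shared memory, instead of the hardcoded defaults.
flex_attention_compiled = torch.compile(
    flex_attention,
    dynamic=False,
    mode="max_autotune",
)

# Option 2: disable the flex-attention path entirely; packed=True still works,
# it just falls back to the non-flex masking path.
# _SUPPORTS_FLEX_ATTENTION = False
```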
Does this setting or configuration work for other models such as the Llama series?
To be clear, you will need to uninstall whatever version of torchtune you have, clone the repo, make the change I described (either compile with `mode="max_autotune"` or hard code `_SUPPORTS_FLEX_ATTENTION = False`), and then install from source.
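After reinstalling from your local clone, a quick sanity check (a sketch, using the attribute referenced above) that the edited module is actually the one being imported:
```python
# Verify the locally edited torchtune is picked up and the flag reflects your change.
import torchtune.modules.attention_utils as attention_utils

print(attention_utils.__file__)                  # should point into your local clone
print(attention_utils._SUPPORTS_FLEX_ATTENTION)  # False if you hard-coded it off
```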
I tested torchtune 0.5 on my Linux PC with 4x A6000 GPUs.
When I try to use a packed dataset to accelerate training, I hit the following error.
errors
configs
I also tried modifying the batch size.
You can check a blog post in Chinese for more details.
InternLM support
By the way, I want to open a PR for the InternLM LLM. Is there any work in progress?