[in progress] fp16 memory optimizations #96

Draft · wants to merge 1 commit into master

Conversation

@nwatx commented Apr 17, 2024

  • still need to bench performance accurately (will add a bench suite soon)

  • working torch.half() / floating point

  • model memory optimization

  • kv cache memory optimization

  • clean up code
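
Not the actual diff, just a minimal sketch of the kind of change the first items describe, using a placeholder `nn.TransformerDecoderLayer` in place of the repo's modules (all names and shapes here are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the decoder; the repo's actual modules differ.
device = "cuda"
layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True).to(device)

# Cast the weights to fp16 (roughly halves weight memory).
layer = layer.half()

# Float activations are cast to match the fp16 weights; integer token ids
# should stay int64 and never be .half()'d.
tgt = torch.randn(1, 16, 512, device=device).half()
mem = torch.randn(1, 32, 512, device=device).half()
out = layer(tgt, mem)   # fp16 in, fp16 out
print(out.dtype)        # torch.float16
```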

@rishikksh20 commented

@nwatx Hi, does fp16 give good output and a speedup compared to the default precision?

@nwatx (Author) commented Apr 19, 2024

I haven't measured the speedup, but from observation it seems to reduce memory consumption.
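
For reference, a minimal way to put a number on that (assumes a CUDA device; `run_generation` is a hypothetical stand-in for the actual generation call, not a function in this repo):

```python
import torch

def run_generation():
    # Hypothetical placeholder for the repo's actual inference call.
    x = torch.randn(1, 4096, 2048, device="cuda", dtype=torch.float16)
    return x @ x.transpose(-1, -2)

torch.cuda.reset_peak_memory_stats()
run_generation()
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")
```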

@nwatx (Author) commented Apr 19, 2024

The output seems to be of similar quality.

@jasonppy self-assigned this Apr 20, 2024
@Ph0rk0z (Contributor) commented Apr 27, 2024

I tested this and found no difference beyond just changing the KV_CACHE to fp16. The autocasting and related changes give no benefit that I can see. I was hoping this did something I hadn't already tried, but no such luck.

On a side note, I can generate on my 2080 22GB using the fp16 cache; previously it would OOM, but so far it has not.
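
For anyone reading along, the cache change being described is just the dtype of the cache buffers, and the saving follows directly from the element size (shapes below are illustrative, not the repo's actual ones):

```python
import torch

# Illustrative cache shapes; the repo's actual names and sizes differ.
batch, heads, max_len, head_dim = 1, 16, 4096, 64

kv_fp32 = torch.zeros(2, batch, heads, max_len, head_dim, dtype=torch.float32)
kv_fp16 = torch.zeros(2, batch, heads, max_len, head_dim, dtype=torch.float16)

# 4 bytes vs 2 bytes per element -> the fp16 cache is half the size.
print(kv_fp32.nelement() * kv_fp32.element_size() / 2**20, "MiB")  # 32.0 MiB
print(kv_fp16.nelement() * kv_fp16.element_size() / 2**20, "MiB")  # 16.0 MiB
```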

@jasonppy (Owner) commented

FlashAttention might help.
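
One way to try that without pulling in a separate kernel: PyTorch's fused `scaled_dot_product_attention` can dispatch to a FlashAttention backend on supported GPUs. This is only a sketch with placeholder shapes, not the repo's attention code:

```python
import torch
import torch.nn.functional as F

# Placeholder q/k/v in (batch, heads, seq, head_dim) layout; not the repo's code.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Fused attention: avoids materializing the full seq x seq score matrix and
# picks a FlashAttention / memory-efficient kernel automatically when available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```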

@Ph0rk0z (Contributor) commented Apr 28, 2024

There's a vLLM Triton implementation that works on all tensor-core cards: https://github.com/vllm-project/vllm/blob/main/vllm/attention/ops/triton_flash_attention.py

The current FlashAttention only supports Ampere and newer.

I'm not sure how to wrap it around your forward methods.

@aashay-sarvam commented

@nwatx I am getting an error while running; specifically, text_tokens.half() is causing an issue.
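
A guess at the cause (not a confirmed diagnosis): calling `.half()` on integer token ids turns them into fp16, and embedding lookups require integer indices, so only floating-point tensors should be cast:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 16)
text_tokens = torch.randint(0, 100, (1, 8))  # int64 token ids

# emb(text_tokens.half())  # fails: embedding indices must be an integer dtype
out = emb(text_tokens)     # keep the ids as int64
out = out.half()           # cast the resulting float embeddings instead
print(out.dtype)           # torch.float16
```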
