Quadro P6000 #8
-
The 8-bit path is slower and a balancing act between OOM and NaN errors. Our cards don't have the hardware int8 matrix multiplication that newer cards do, so 8-bit inference uses a workaround that performs like you're seeing. GPTQ, the way I have it implemented here, is fairly fast. Stock ooba doesn't support some of the other models and moved to GPTQv2, so it will depend on whether you have a lot of old v1 models or only v2 models, and whether you want GPT-NeoX support. I have not tried Windows yet; I assume that to use the autograd implementation you would have to compile the CUDA kernel from https://github.com/Ph0rk0z/GPTQ-Merged. If that works there, it should be fairly easy to compare. It's about to get interesting, because we are locked out of Triton and the newer CUDA implementation is about a third as fast.
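If you want to check whether your card has the fast int8 path, here's a minimal sketch, assuming PyTorch is installed; the 7.5 (Turing) threshold is what bitsandbytes' int8 tensor-core matmul expects, as far as I know:

```python
import torch

# A Quadro P6000 is Pascal and reports compute capability (6, 1).
# bitsandbytes' fast int8 tensor-core matmul expects Turing (7.5) or
# newer; anything older falls back to the slow workaround described above.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("fast int8 path:", (major, minor) >= (7, 5))
```

On this card that should print 6.1, which lines up with the slow 8-bit numbers below.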
-
I see that you have the same card as I do...
The 4-bit stuff works at approximately 5.13 tokens/s.
The 8-bit stuff works at approximately 1.13 tokens/s.
I am using Windows and have installed the regular Oobabooga in WSL and Windows "native"...
Should I try your code?
Does the 8-bit stuff just suck, with terrible performance, because bitsandbytes and GPTQ only fully support newer GPUs?
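For what it's worth, here is a rough sketch of how those tokens/s numbers could be measured for an apples-to-apples comparison. It assumes the standard transformers 8-bit loading path via bitsandbytes (plus accelerate for `device_map`); the model name is just a placeholder, and `load_in_8bit` may not match how your particular build loads models:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; substitute whatever you are actually running.
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit=True routes matmuls through bitsandbytes;
# drop it (or swap in a GPTQ loader) for the 4-bit comparison run.
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```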