Quadro P6000 #8
-
The 8-bit path is slower and a balancing act between OOM and NaN errors. Our cards don't have the hardware int8 matrix multiplication that newer cards do, so 8-bit inference uses a workaround that performs like you're seeing. GPTQ, the way I have it implemented here, is fairly fast. Stock ooba doesn't support some of the other models and moved to GPTQv2, so it will depend on whether you have a lot of old v1 models or only v2 models, and whether you want GPT-NeoX support. I have not tried Windows yet; I assume that to use the autograd implementation you would have to compile the CUDA kernel from https://github.com/Ph0rk0z/GPTQ-Merged. If that works there, it should be fairly easy to compare. It's about to get interesting, because we are locked out of Triton and the newer CUDA implementation is about a third as fast.
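If you want to check whether your card has the fast int8 path, here's a minimal sketch, assuming PyTorch is installed; the 7.5 (Turing) threshold is what bitsandbytes' int8 tensor-core matmul expects, as far as I know:

```python
import torch

# A Quadro P6000 is Pascal and reports compute capability (6, 1).
# bitsandbytes' fast int8 tensor-core matmul expects Turing (7.5) or
# newer; anything older falls back to the slow workaround described above.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("fast int8 path:", (major, minor) >= (7, 5))
```

On this card that should print 6.1, which lines up with the slow 8-bit numbers below.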
-
I see that you have the same card as I do...
The 4-bit stuff works at approximately 5.13 tokens/s.
The 8-bit stuff works at approximately 1.13 tokens/s.
I am using Windows and have installed the regular Oobabooga in WSL and Windows "native"...
Should I try your code?
Does the 8-bit stuff just suck, with terrible performance, because bitsandbytes and GPTQ only fully support newer GPUs?
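For what it's worth, here is a rough sketch of how those tokens/s numbers could be measured for an apples-to-apples comparison. It assumes the standard transformers 8-bit loading path via bitsandbytes (plus accelerate for `device_map`); the model name is just a placeholder, and `load_in_8bit` may not match how your particular build loads models:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; substitute whatever you are actually running.
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit=True routes matmuls through bitsandbytes;
# drop it (or swap in a GPTQ loader) for the 4-bit comparison run.
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```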