feat: Add deps to evaluate qLora tuned model #312
Conversation
Thanks Angel! Can you describe what testing you have done with this? I also left some questions that would be good to pose to Fabian and Aaron.
pyproject.toml
Outdated
@@ -45,6 +45,7 @@ dev = ["wheel>=0.42.0,<1.0", "packaging>=23.2,<25", "ninja>=1.11.1.1,<2.0", "sci
flash-attn = ["flash-attn>=2.5.3,<3.0"]
aim = ["aim>=3.19.0,<4.0"]
fms-accel = ["fms-acceleration>=0.1"]
gptq = ["auto_gptq>0.4.2", "optimum>=1.15.0"]
Let's call this gptq-dev instead, to note that these are not needed at training time but afterwards. Adding a note in the documentation that these are needed for HF loading and inference would also be useful.
build/Dockerfile
Outdated
RUN if [[ "${ENABLE_GPTQ}" == "true" ]]; then \
        python -m pip install --user "$(head bdist_name)[gptq]"; \
    fi
Although this is a nice flag to have, I don't think we need it, since users won't be enabling this flag. Because this is similar to a dev dependency, it's something users would have to manually install themselves if they want to use it. It doesn't hurt to have it here, but I wanted to note how it's different from aim and fms-acceleration.
)
if is_quantized:
    gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
Can you describe what this is doing? It loads with 4-bit GPTQ, but what is the exllama_config version? It looks like it's an exllama kernel?
Yes, this should be loading 4-bit GPTQ with exllama.
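For reference, a minimal sketch of how such a config is passed to HF when loading a quantized checkpoint (the path is illustrative and the surrounding arguments are assumptions drawn from the rest of this thread, not the exact PR code):

import torch
from transformers import AutoModelForCausalLM, GPTQConfig

# Select the faster exllama v2 kernel for a 4-bit GPTQ checkpoint.
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/model-gptq",            # illustrative quantized checkpoint path
    quantization_config=gptq_config,  # tells HF to load the GPTQ weights with exllama v2
    device_map="cuda",
    torch_dtype=torch.float16,        # exllama is an fp16 kernel (see discussion below)
)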
@aluu317 This should also be added so we can load a base or fine-tuned quantized model for inference, not just adapters. This logic should also exist in the else case.
Done. I added the case for loading a base model and tested with the command:
python run_inference.py --text "This is a text" --use_flash_attn --model /testing/models/granite-34b-gptq
It did not throw an error.
I'm not sure whether we can fine-tune a quantized model to run inference on; maybe we will have that support?
Agreed, I'm not sure about fine-tuning quantized models... but it's nice to be able to run inference against base models.
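To make the suggestion concrete, here is a rough sketch (not the PR's exact code) of how the quantized path could cover both the adapter and the base-model branches; it assumes the standard PEFT convention that adapter_config.json records base_model_name_or_path:

import json
import os
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, GPTQConfig

def load_for_inference(model_path, use_flash_attn=True):
    adapter_cfg = os.path.join(model_path, "adapter_config.json")
    if os.path.exists(adapter_cfg):
        # Tuned adapters: the (possibly quantized) base model is named in the adapter config.
        with open(adapter_cfg, encoding="utf-8") as f:
            base_path = json.load(f)["base_model_name_or_path"]
        is_quantized = os.path.exists(os.path.join(base_path, "quantize_config.json"))
        loader = AutoPeftModelForCausalLM
    else:
        # Plain base model, which may itself be a GPTQ checkpoint.
        is_quantized = os.path.exists(os.path.join(model_path, "quantize_config.json"))
        loader = AutoModelForCausalLM

    kwargs = dict(
        attn_implementation="flash_attention_2" if use_flash_attn else None,
        device_map="cuda",
        torch_dtype=torch.float16,
    )
    if is_quantized:
        # Same exllama v2 config in both branches, as requested above.
        kwargs["quantization_config"] = GPTQConfig(bits=4, exllama_config={"version": 2})
    return loader.from_pretrained(model_path, **kwargs)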
scripts/run_inference.py
Outdated
attn_implementation="flash_attention_2"
if use_flash_attn
else None,
device_map="cuda:0",
Is this specifying GPU 0? Can this just be "cuda"? Why would we specify the GPU? In addition, can quantized models be loaded on CPUs?
Yes, I think it's better to just set this to "cuda". If you specifically set "cuda:0", then the model might only be loaded on a single GPU, even if the machine has multiple GPUs. Alternatively, you might want to consider device_map="auto", which will automatically fill up all the GPUs: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference
I don't think quantized models can be loaded on CPUs without special libraries (like llamacpp) that are not part of HF.
@fabianlim When I set device_map="auto", I see this error from torch:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)
Am I missing something else?
@anhuong "cuda" works, updated!
scripts/run_inference.py
Outdated
if use_flash_attn
else None,
device_map="cuda:0",
torch_dtype=torch.float16 if use_flash_attn else None,
Does this always have to be torch.float16, and is this related to flash-attn being set? I know Fabian and Aaron set float16; is this something needed for loading quantized models?
I know the fms-acceleration docs say:
When setting --auto_gptq triton_v2, note to also pass --torch_dtype float16 and --fp16, or an exception will be raised. This is because these kernels only support this dtype.
Is this also true here for loading the model?
@anhuong we are in the process of making a new plugin release, and we re-opened an investigation into the exact conditions under which this is needed. We may have the flexibility to also tune with auto_gptq triton_v2 in bfloat16; we will report more when we find out very soon.
But in the meantime, we have tested float16 extensively and can confirm the above setting will work, though it would be nice if this restriction could also be relaxed.
Since we don't use fms-acceleration for loading tuned models for evaluation, using the kernels available from HF/AutoGPTQ directly will require the dtype to be compatible with whatever kernel you use. In this case, exllama is an fp16 kernel and requires dtype to be float16.
@anhuong sorry, I think our responses are a bit confusing:
- In the fms-accel docs, we set --auto_gptq triton_v2 because we are using different kernels for training (i.e., not exllama as in run_inference.py).
- When using --auto_gptq triton_v2, we specified in the docs to use float16, but this is under investigation and could be relaxed.
- When using exllama, you must use float16.
- However, the trained checkpoint, if loaded in some other inference framework (e.g., vLLM), can be loaded in other dtypes as they are supported.
To conclude, the dtype depends on what inference framework you load the checkpoint in. Here run_inference loads with exllama, so you must respect the dtype it supports.
Thank you, these details are very helpful for understanding.
Based on the details above, it sounds like this should always be set to float16 even if flash-attn is not used, @aluu317.
Ahh yup, updated!
scripts/run_inference.py
Outdated
is_quantized = os.path.exists(
    os.path.join(base_model_name_or_path, "quantize_config.json")
)
if is_quantized:
I would be interested to know whether these files are common across other quantized models and whether the configuration set below works for other quantization techniques.
The config here is specific to GPTQ, for setting the faster exllamav2 kernel. HF's from_pretrained will accept a quantization_config argument from any of these configs, so it might be better to generalize this in the future.
@achew010 If they do not use anything other than gptq, then this is fine.
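As a follow-up to the generalization point, one possible (hypothetical, not part of this PR) way to detect the quantization method more generically, assuming the HF convention that config.json embeds a quantization_config block with a quant_method field:

import json
import os

def detect_quant_method(model_path):
    # auto_gptq checkpoints ship a quantize_config.json next to the weights.
    if os.path.exists(os.path.join(model_path, "quantize_config.json")):
        return "gptq"
    # More generally, HF model configs embed a quantization_config block whose
    # quant_method field ("gptq", "awq", ...) could drive a generic dispatch.
    cfg_path = os.path.join(model_path, "config.json")
    if os.path.exists(cfg_path):
        with open(cfg_path, encoding="utf-8") as f:
            quant_cfg = json.load(f).get("quantization_config") or {}
        return quant_cfg.get("quant_method")
    return None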
Changes look good! The only question I have is that I think torch_dtype=torch.float16 if use_flash_attn else None should just be torch_dtype=torch.float16 for the quantized-model case; please correct me if I'm wrong.
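A minimal sketch of the dtype selection being suggested (variable names are taken from the surrounding discussion; the exact PR code may differ):

import torch

def pick_dtype(is_quantized: bool, use_flash_attn: bool):
    # The exllama kernel is fp16-only, so a quantized checkpoint forces float16
    # regardless of whether flash-attn is enabled.
    return torch.float16 if (is_quantized or use_flash_attn) else None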
Thanks Angel!
Description of the change
This PR:
- Adds a gptq subpackage to include the auto_gptq and optimum dependencies needed to load a quantized model; this is mostly needed to run inference with.
- Updates run_inference.py to properly load a quantized model.
- Updates the Dockerfile to add an additional flag for enabling gptq, which installs the subpackage mentioned above.
Related issue number
How to verify the PR
Was the PR tested