Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add deps to evaluate qLora tuned model #312

Merged
merged 12 commits into from
Sep 16, 2024

Conversation

aluu317
Copy link
Collaborator

@aluu317 aluu317 commented Aug 23, 2024

Description of the change

This PR:

  • adds a new subpackage gptq to include dependencies of auto_gptq and optimum needed to load a quantized model. Mostly needed to run inference with.
  • Updates the run_inference.py to properly load a quantized model
  • Updates the Dockerfile to add an additional flag for enabling gptq, and therefore installs the subpackage mentioned above.

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@aluu317 aluu317 force-pushed the qlora_dev_deps branch 3 times, most recently from 643b4db to af0b22e Compare August 26, 2024 17:58
@aluu317 aluu317 changed the title Add support to load qLora tuned model feat: Add support to load qLora tuned model Aug 26, 2024
@aluu317 aluu317 force-pushed the qlora_dev_deps branch 4 times, most recently from 23c71a2 to 6bde694 Compare August 30, 2024 14:30
@aluu317 aluu317 changed the title feat: Add support to load qLora tuned model feat: Add deps to evaluate qLora tuned model Aug 30, 2024
Signed-off-by: Angel Luu <[email protected]>
@aluu317 aluu317 marked this pull request as ready for review September 4, 2024 18:16
Signed-off-by: Angel Luu <[email protected]>
@anhuong anhuong requested review from fabianlim and removed request for alex-jw-brooks September 5, 2024 21:35
Copy link
Collaborator

@anhuong anhuong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Angel! Can you describe what testing you have done with this? And I left some questions that would be good to pose to Fabian and Aaron

pyproject.toml Outdated
@@ -45,6 +45,7 @@ dev = ["wheel>=0.42.0,<1.0", "packaging>=23.2,<25", "ninja>=1.11.1.1,<2.0", "sci
flash-attn = ["flash-attn>=2.5.3,<3.0"]
aim = ["aim>=3.19.0,<4.0"]
fms-accel = ["fms-acceleration>=0.1"]
gptq = ["auto_gptq>0.4.2", "optimum>=1.15.0"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's call this gptq-dev instead to note that these are not needed at training time but post. Adding a note in documentation to note that these are needed for HF loading and inference would also be useful.

build/Dockerfile Outdated
Comment on lines 149 to 152
RUN if [[ "${ENABLE_GPTQ}" == "true" ]]; then \
python -m pip install --user "$(head bdist_name)[gptq]"; \
fi

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is a nice flag to have, I don't think we need it as users won't be enabling this flag. Since this is similar to a dev dependency, it's something users would have to manually install themselves if they want to use it. Doesn't hurt to have it here but wanted to note how it's different from aim and fms-acceleration.

)
if is_quantized:
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe what this is doing? Loading with 4bit GPTQ, what is the exllama_config version? It looks like its an exllama kernel?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this should be loading 4 bit GPTQ with exllama

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aluu317 This should also be added to be able to load a base or fine tuned quantized model for inference not just for adapters. This logic should also exist in the else case

Copy link
Collaborator Author

@aluu317 aluu317 Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I added the case for loading a base model and tested with command:

python run_inference.py --text "This is a text" --use_flash_attn --model /testing/models/granite-34b-gptq

did not throw error.

Not sure about whether we can fine tune a quantized model to run inference on, maybe we will have that support?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed im not sure about fine tuning quanitized models...but its nice to have to run inference against base models

attn_implementation="flash_attention_2"
if use_flash_attn
else None,
device_map="cuda:0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this specifying GPU 0? Can this just be "cuda", why would we specify the GPU? In addition, can quantized models be loaded on CPUs?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i think its better to just set this to cuda. If you specifically set cuda:0, then it might only be loaded on a single GPU, even if the machine has multiple gpus.

Alternatively, you might want to consider device_map="auto", which will automatically fill up all the GPUs, https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference

I dont think quantized models can be loaded on CPUs without special libraries (like llamacpp) that are not part of HF.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabianlim When I did device_map="auto", I see this error from torch:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

Am I missing something else?

@anhuong "cuda" works, updated!

if use_flash_attn
else None,
device_map="cuda:0",
torch_dtype=torch.float16 if use_flash_attn else None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this always have to be torch.float16 and is this relevant to flash-attn being set? I know Fabian and Aaron set float16, is this something needed for loading quantized models?

I know in the fms-acceleration docs:

When setting --auto_gptq triton_v2 plus note to also pass --torch_dtype float16 and --fp16, or an exception will be raised. This is because these kernels only support this dtype.

Is this also true here for loading the model?

Copy link
Contributor

@achew010 achew010 Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anhuong we are in the process of making a new plugin release, and we re-opened an investigation on the exact conditions this will be needed. We may have the flexibility to also tune with auto_gptq triton_v2 in bfloat16. We will report more when we find out very soon.

But in the meantime, we have tested float16 extensively and can confirm the above setting will work, but it will be nice if this restriction can be also relaxed.

Copy link
Contributor

@achew010 achew010 Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't use fms-acceleration for loading tuned models for evaluation, then using the kernels available from HF/AutoGPTQ directly will require dtype to be compatible with whatever kernel you use. In this case, exllama is a fp16 kernel and requires dtype to be float16

Copy link
Collaborator

@fabianlim fabianlim Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anhuong sorry i think our responses are abit confusing

  1. in the fms-accel docs, we set --auto_gptq triton_v2 because we are using different kernels for training (i.e., not exllama as in run_inference.py.
  2. when usin--auto_gptq triton_v2 we specified in docs that use float16, but this is under investigation and could be relaxed.
  3. when using exllama, you must use float16.
  4. However, the trained checkpoint if loaded in some other inference framework (.e.g, vllm) can be loaded in other types as they are supported.

To conclude, the type depends on what inference framework you load the checkpoint in. Here run_inference loads in exllama, so you must respect the type it supports.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, these details are very helpful to understand

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on these details above it sounds like this should always be set to float16 even if flash-attn is not used, @aluu317

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yup, updated!

Comment on lines 187 to 190
is_quantized = os.path.exists(
os.path.join(base_model_name_or_path, "quantize_config.json")
)
if is_quantized:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be interested to know if these are common across other quantized models and if the configurations set below work for other quantize tuning techniques.

Copy link
Contributor

@achew010 achew010 Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config here is specific only to GPTQ for setting the faster exllamav2 kernel. HF's from_pretrained will accept a quantization_config argument from any of these configs, so it might be better to generalize this in future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achew010 If they do not use anything else other than gptq then this is fine.

Copy link
Collaborator

@anhuong anhuong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good! THe only question I have is that I think torch_dtype=torch.float16 if use_flash_attn else None, should just be torch_dtype=torch.float16 for loading quantized model case, please correct me if im wrong

Copy link
Collaborator

@anhuong anhuong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Angel!

@anhuong anhuong merged commit 5dd5494 into foundation-model-stack:main Sep 16, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants