fix: remove lm_head post processing #333
Conversation
Signed-off-by: Abhishek <[email protected]>
build/Dockerfile (Outdated)
```diff
@@ -105,7 +105,7 @@ FROM cuda-devel AS python-installations
 ARG WHEEL_VERSION
 ARG USER
 ARG USER_UID
-ARG ENABLE_FMS_ACCELERATION=false
+ARG ENABLE_FMS_ACCELERATION=true
```
Note that with this change, fms-acceleration is installed by default, thus enabling QLoRA support by default. The lm_head removal hack was blocking QLoRA enablement.
Change looks good to me, waiting on test results from Abhishek
After testing, found that the new accelerate version is not working as expected. The new logic introduced in get_state_dict also removes the top-level FSDP wrapper from the model. Since FSDP keeps flattened params, all the parameters managed by the top-level wrapper then remain flattened when model.state_dict is called. The child FSDP wrappers still protect their parameters, since when the state_dict call recurses into them, they use the FSDP version of state_dict to unwrap the wrappers. This results in an error.
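For context, here is a minimal sketch of how a full, unflattened state dict is normally gathered from an FSDP-wrapped model using the standard PyTorch FSDP APIs. This is illustrative only (function name is made up) and is not the accelerate get_state_dict code path.

```python
# Illustrative only: gather a full, unflattened state dict from the FSDP root.
# If the top-level FSDP wrapper is stripped first and state_dict() is called on
# the inner module, the root-managed parameters stay as flattened FlatParameters,
# which matches the failure described above.
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def gather_full_state_dict(model: FSDP) -> dict:
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    # Calling state_dict() inside this context lets FSDP's hooks unflatten and
    # all-gather every parameter managed by the root and child wrappers.
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        return model.state_dict()
```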
Signed-off-by: Anh Uong <[email protected]>
* fix: Removal of lm head hack
  Signed-off-by: Abhishek <[email protected]>
* set fms_accelerate to true by default
  Signed-off-by: Anh Uong <[email protected]>
---------
Signed-off-by: Abhishek <[email protected]>
Signed-off-by: Anh Uong <[email protected]>
Co-authored-by: Anh Uong <[email protected]>
Signed-off-by: Angel Luu <[email protected]>
Description of the change
Removes the lm_head hack that was added to work around an lm_head issue. That issue is now fixed in newer vllm versions, with the fix available as of v0.5.4.
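For readers unfamiliar with the hack, below is a hypothetical sketch of the kind of post-processing being removed: dropping the lm_head.weight tensor from a saved safetensors checkpoint. The file name, function name, and condition are illustrative assumptions, not the repository's actual implementation.

```python
# Hypothetical illustration only; not the repository's actual code.
import os
from safetensors import safe_open
from safetensors.torch import save_file

def drop_lm_head(checkpoint_dir: str) -> None:
    """Rewrite model.safetensors without the lm_head.weight tensor (assumed tied to embeddings)."""
    path = os.path.join(checkpoint_dir, "model.safetensors")  # assumed single-shard checkpoint
    tensors = {}
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            if name != "lm_head.weight":  # the kind of post-processing this PR deletes
                tensors[name] = f.get_tensor(name)
    save_file(tensors, path)
```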
Related issue number
#1166
How to verify the PR
Ran LoRA and full fine tuning of granite-3b and llama-8b models without the lm_head removal, and was able to run inference on the resulting checkpoints.
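As a rough illustration of that inference check (the checkpoint path and prompt are placeholders, not the actual test setup), something along these lines confirms the saved checkpoint loads and generates without any lm_head post-processing:

```python
# Sketch of the inference sanity check; checkpoint path and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "/path/to/tuned/checkpoint"  # full fine-tune or merged LoRA output (assumed)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

inputs = tokenizer("### Input:\nhello\n\n### Response:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```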
Was the PR tested