
refactor: move removal of lm_head to save method #313

Closed

Conversation

@anhuong (Collaborator) commented Aug 23, 2024

Description of the change

  • Perform the removal of the duplicated lm_head weight in granite models with llama architecture in sft_trainer itself, instead of only in the accelerate_launch script. The lm_head weight is still removed from the model written to save_model_dir OR the last checkpoint.
  • Now, on a call to sft_trainer.save(), the lm_head weight is removed automatically (a rough sketch of the idea follows below).
  • The checkpoint no longer needs to be reloaded just to remove lm_head.
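
The snippet below is only a minimal sketch of that idea, not the repository's actual save() implementation: the helper name save_without_duplicate_lm_head, the key names lm_head.weight / model.embed_tokens.weight, and the use of save_pretrained are assumptions for illustration.

```python
# Sketch (assumed, not the repo's save()): drop a duplicated lm_head weight
# from the state dict before writing the model to disk.
import torch


def save_without_duplicate_lm_head(model, output_dir: str) -> None:
    """Save `model` to `output_dir`, omitting lm_head.weight if it merely
    duplicates the input embedding weight."""
    state_dict = model.state_dict()
    lm_head = state_dict.get("lm_head.weight")
    embeddings = state_dict.get("model.embed_tokens.weight")
    if lm_head is not None and embeddings is not None and torch.equal(lm_head, embeddings):
        # The lm_head weight is an exact copy of the embedding weight, so the
        # checkpoint only needs to store the shared tensor once.
        del state_dict["lm_head.weight"]
    model.save_pretrained(output_dir, state_dict=state_dict)
```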

Related issue number

How to verify the PR

Currently testing
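
One possible manual check (an assumption, not taken from the PR): after a training run with save_model_dir set, confirm that the saved weights file no longer contains lm_head.weight. The file path below is illustrative.

```python
# Check whether a saved safetensors file still contains lm_head.weight.
from safetensors import safe_open


def has_lm_head(checkpoint_file: str) -> bool:
    """Return True if `checkpoint_file` contains an lm_head.weight tensor."""
    with safe_open(checkpoint_file, framework="pt", device="cpu") as f:
        return "lm_head.weight" in f.keys()


print(has_lm_head("save_model_dir/model.safetensors"))  # expect False after this PR
```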

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Comment on lines -616 to +663
```
-    if training_args.save_model_dir:
-        try:
+    try:
+        if training_args.save_model_dir:
             save(
                 path=training_args.save_model_dir,
                 trainer=trainer,
                 log_level=training_args.log_level,
             )
-        except Exception as e:  # pylint: disable=broad-except
-            logger.error(traceback.format_exc())
-            write_termination_log(
-                f"Failed to save model to {training_args.save_model_dir}: {e}"
-            )
+        else:
+            # if granite with llama arch, remove lm_head in last checkpoint
+            save(
+                path=get_highest_checkpoint(training_args.output_dir),
+                trainer=trainer,
+                log_level=training_args.log_level,
+            )
```
@anhuong (Collaborator, Author) commented:

This currently removes lm_head from the model being saved at save_model_dir OR the last checkpoint. A few questions...

  • Do we need to remove it from the last checkpoint? This covers the case where someone doesn't specify save_model_dir and preserves current behavior, but it is a little odd for the final checkpoint to differ from the rest of the checkpoints.
  • Do we want to remove it from each checkpoint? That would require loading each checkpoint again and resaving it, unless SFTTrainer.trainer still has access to the checkpoints, but AFAIK it only has access to the last one (see the sketch after this list).
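
To illustrate the cost behind that second question, here is a rough sketch of what removing lm_head from every checkpoint would involve: each checkpoint-* directory under output_dir would have to be reloaded and rewritten, since the trainer only holds the final model in memory. The directory layout and the save_without_duplicate_lm_head helper (sketched earlier) are assumptions, not the repo's code.

```python
# Sketch: strip the duplicated lm_head from every saved checkpoint by
# reloading each one from disk and resaving it.
import os

from transformers import AutoModelForCausalLM


def strip_lm_head_from_all_checkpoints(output_dir: str) -> None:
    for name in sorted(os.listdir(output_dir)):
        ckpt_path = os.path.join(output_dir, name)
        if not (name.startswith("checkpoint-") and os.path.isdir(ckpt_path)):
            continue
        # Reload the checkpoint, then rewrite it without the duplicated
        # lm_head weight (hypothetical helper defined above).
        model = AutoModelForCausalLM.from_pretrained(ckpt_path)
        save_without_duplicate_lm_head(model, ckpt_path)
```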

@anhuong force-pushed the refactor-remove-lmhead branch from b54eed5 to ee0299e on August 23, 2024 at 18:46
@anhuong (Collaborator, Author) commented Sep 5, 2024

Will close in favor of PR #333.

@anhuong closed this on Sep 5, 2024