Add file logger callback & export train loss json file #22
Conversation
- General question: should we follow the same pattern as used by caikit-nlp, from a consistency and compatibility perspective?
- We'll need to see how the callbacks work for steps vs epochs, since we may want to log individual steps even if checkpointing is set to epoch-wise decisions (see the sketch below). Another scenario we'll need to understand better is the multi-GPU case.
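For context, a minimal sketch of how step-level logging can be configured independently of epoch-based checkpointing via Hugging Face TrainingArguments (the values here are illustrative, not taken from this PR):

from transformers import TrainingArguments

# Logging cadence is controlled separately from checkpointing, so the on_log
# event can still fire per step even when saves happen per epoch.
training_args = TrainingArguments(
    output_dir="output",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="epoch",
)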
tuning/sft_trainer.py
Outdated
log_file_path = os.path.join(args.output_dir, "train_loss.json")
if logs is not None:
    try:
        # Take the subdict of the last log line; if any log_keys aren't part of this log
        # object, assume this line is something else, e.g., train completion, and skip.
        log_obj = {k: logs[k] for k in FileLoggingCallback.log_keys}
    except KeyError:
        return

    # Redump the json file in the checkpoint directory with the updated log line
    self.existing_logs.append(log_obj)
    with open(log_file_path, "w") as log_file:
        json.dump(self.existing_logs, log_file, sort_keys=True, indent=4)
One thing I would be a bit worried about is the need to open and close the file repeatedly. Also, it looks like we are overwriting the existing log file instead of appending to it - any particular reason for that?
Just that this is writing a JSON file, so we need to reparse it to add to the list. We could alternatively write a list of JSON objects (i.e., each line is a JSON object representing one log), and in that case just append per log?
I agree that it's a bit of a bummer to open and close the file so much, although the cost is still probably quite small compared to the actual training of the model, and the reparsing is the main problem. If we write a JSON object per log, we could keep the file open in append mode and flush on each written log, I guess?
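To make that alternative concrete, a rough sketch of per-log JSONL appending, with a hypothetical helper name rather than the code in this PR:

import json
import os

def append_log_line(output_dir, log_obj, filename="train_loss.jsonl"):
    # One JSON object per line: no need to re-parse previously written logs,
    # just append the new entry and flush so it is visible immediately.
    log_file_path = os.path.join(output_dir, filename)
    with open(log_file_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(log_obj, sort_keys=True) + "\n")
        log_file.flush()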
tuning/sft_trainer.py
Outdated
appends the subdict of the log & dumps the file.
"""

log_file_path = os.path.join(args.output_dir, "train_loss.json")
Can we add either a conversion or a check to make sure loss is actually a float instead of some other data type? Otherwise the json dump would fail.
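As an illustration of that check, a small sketch (hypothetical helper, not the PR's code) that coerces the value to a built-in float before dumping, so tensor or numpy scalar types don't break json.dump:

def coerce_loss(logs, key="loss"):
    # Return the loss as a built-in float, or None if it is missing or not numeric.
    try:
        return float(logs[key])
    except (KeyError, TypeError, ValueError):
        return None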
Hey @gkumbhat! Thanks for the review. Some thoughts on your broader questions:
- I would greatly prefer to avoid subclassing the trainer just for logging. The actual logging events look similar either way, so any logic in the logging should be pretty portable between the two approaches, and this project is already using callbacks, so a callback feels more natural. Plus, the SFT trainer is already a trainer subclass and pretty complex; subclassing it again just to write an extra file seems like unnecessary complexity to me, since it would basically be for this one function.
- For multi-GPU, we probably need to make sure that the file is only written by the master process, but I'm not sure how rank is taken into account for the logging event. I can look into this as well, although do you know if we formally support multi-GPU at the moment, or is it still experimental?
By "follow pattern", I actually meant the formatting of the output file. Sorry, I was not quite clear there.
Oops, I don't know 🤷
Yes, we can do multi-GPU using this code base, and it has been well tested.
Force-pushed from 98b18bf to 3b50b06
Thanks everyone - updated the PR to write in JSONL format and to only dump logs from process zero, to prevent clobbering when running multiprocess training.
Force-pushed from 3b50b06 to 5259c79
Hey @gkumbhat - updated the PR and description based on our discussions to match the legacy log format; it should be ready for another look when you have a moment.
LGTM
…del-stack#22)
* Add file logger callback & export train loss json file
  Signed-off-by: Alex-Brooks <[email protected]>
* only update logs from process 0
  Signed-off-by: Alex-Brooks <[email protected]>
* Export logs in jsonl format
  Signed-off-by: Alex-Brooks <[email protected]>
* Formatting
  Signed-off-by: Alex-Brooks <[email protected]>
---------
Signed-off-by: Alex-Brooks <[email protected]>
This PR adds a new callback, which is invoked on the logging event - if the most recent log contains specific keys, the subdictionary is extracted and the train_loss.json file is re-exported.

Example using Llama 7b with the twitter complaints example (tested on a single GPU with a V100). Produces a train_loss.json file in the output directory; here is an example of the json file that gets created.

Note that logs are only exported from the main process to avoid writing from multiple processes for multi-GPU tunings.
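For reference, a minimal sketch of a callback along the lines described here; the class name, tracked keys, and output file name are assumptions for illustration, not the merged implementation:

import json
import os

from transformers import TrainerCallback

class FileLoggingCallback(TrainerCallback):
    # Keys expected in a training-loss log record (assumed for illustration).
    log_keys = {"loss", "epoch"}

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Only the main process writes, to avoid clobbering in multi-GPU runs.
        if logs is None or not state.is_world_process_zero:
            return
        if not self.log_keys.issubset(logs):
            # Not a training-loss record (e.g., final train summary); skip it.
            return
        log_file_path = os.path.join(args.output_dir, "train_loss.jsonl")
        with open(log_file_path, "a", encoding="utf-8") as log_file:
            log_file.write(json.dumps(logs, default=float) + "\n")

Writing one JSON object per line keeps each update append-only, so there is no need to re-read the file on every logging event.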