feat: track validation loss in logs file #51

anhuong · 2024-02-20T19:00:16Z

Changes include:

In addition to the train_loss.jsonl file that is already written, a similar eval_loss.jsonl file is now created if eval loss is found in the logs.
Refactored the loss-tracking logic so that it can be reused.
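
For illustration only, here is a minimal sketch of how one shared helper could route training and validation loss to separate JSONL files. It is not the code from this PR: `track_loss`, `LOSS_FILES`, and the `loss` key for training loss are assumptions made for the example. The record layout mirrors the eval_loss.jsonl output shown in the test comment.

```python
import json
import os
import tempfile
from datetime import datetime

# Assumed mapping from a log key to its output file (hypothetical names).
LOSS_FILES = {"loss": "train_loss.jsonl", "eval_loss": "eval_loss.jsonl"}


def track_loss(logs, step, output_dir):
    """Append one JSONL record per loss key found in `logs`.

    Returns the list of file names that were written to.
    """
    written = []
    for key, filename in LOSS_FILES.items():
        if key not in logs:
            continue  # e.g. eval_loss is absent when no evaluation ran
        record = {
            "name": key,
            "data": {
                "epoch": round(logs.get("epoch", 0.0), 2),
                "step": step,
                "value": logs[key],
                "timestamp": datetime.now().isoformat(),
            },
        }
        path = os.path.join(output_dir, filename)
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")
        written.append(filename)
    return written


# Usage: the same helper handles a training log and an evaluation log.
demo_dir = tempfile.mkdtemp()
train_written = track_loss({"loss": 7.5713, "epoch": 0.92}, step=3, output_dir=demo_dir)
eval_written = track_loss({"eval_loss": 6.9889, "epoch": 0.92}, step=3, output_dir=demo_dir)
```

Keeping the key-to-file mapping in one place means a further metric would need only one more entry rather than a second copy of the write logic.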

Signed-off-by: Anh-Uong <[email protected]>

anhuong · 2024-02-20T19:04:19Z

Tested with the tiny bloom model and PT, setting validation_data_path to the same twitter dataset used for data_path. The eval_loss logs were printed, and the eval_loss.jsonl file was created alongside the checkpoints and training loss logs:

$ python tuning/sft_trainer.py --model_name_or_path BloomForCausalLM/ --data_path twitter_complaints.json --output_dir $OUTPUT_DIR --num_train_epochs 5 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 4 --evaluation_strategy "epoch" --save_strategy "epoch" --learning_rate 1e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --include_tokens_per_second  --packing False --response_template "\n### Label:" --dataset_text_field "output" --use_flash_attn False --tokenizer_name_or_path BloomForCausalLM --torch_dtype "float32" --peft_method "pt" --logging_strategy "epoch" --validation_data_path twitter_complaints.json

...
{'loss': 7.5713, 'learning_rate': 9.504844339512096e-06, 'epoch': 0.92}                                                   
{'eval_loss': 6.988901138305664, 'eval_runtime': 0.1005, 'eval_samples_per_second': 497.551, 'eval_steps_per_second': 129.363, 'epoch': 0.92}                                                                                                       
 20%|█████████████████▏                                                                    | 3/15 [00:01<00:04,  2.49it/s
...
{'loss': 7.5713, 'learning_rate': 7.169418695587791e-06, 'epoch': 1.85}                                                   
{'eval_loss': 6.988900184631348, 'eval_runtime': 0.0741, 'eval_samples_per_second': 674.716, 'eval_steps_per_second': 175.426, 'epoch': 1.85}                                                                                                       
 40%|██████████████████████████████████▍                                                   | 6/15 [00:01<00:01,  5.42it/s
...
{'loss': 6.9889, 'learning_rate': 0.0, 'epoch': 4.62}                                                                     
{'eval_loss': 6.988900184631348, 'eval_runtime': 0.0806, 'eval_samples_per_second': 620.329, 'eval_steps_per_second': 161.286, 'epoch': 4.62}                                                                                                       
100%|█████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:02<00:00,  7.08it/s
{'train_runtime': 3.3712, 'train_samples_per_second': 74.159, 'train_steps_per_second': 4.45, 'train_tokens_per_second': 6579.352, 'train_loss': 6.988884417215983, 'epoch': 4.62}
100%|█████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:02<00:00,  5.18it/s]

$ ls $OUTPUT_DIR
checkpoint-13  checkpoint-15  checkpoint-3  checkpoint-6  checkpoint-9	eval_loss.jsonl  train_loss.jsonl

$ cat $OUTPUT_DIR/eval_loss.jsonl
{"data": {"epoch": 0.92, "step": 3, "timestamp": "2024-02-20T13:48:17.023893", "value": 6.988901138305664}, "name": "eval_loss"}
{"data": {"epoch": 1.85, "step": 6, "timestamp": "2024-02-20T13:48:17.454848", "value": 6.988900184631348}, "name": "eval_loss"}
{"data": {"epoch": 2.77, "step": 9, "timestamp": "2024-02-20T13:48:17.847990", "value": 6.988900184631348}, "name": "eval_loss"}
{"data": {"epoch": 4.0, "step": 13, "timestamp": "2024-02-20T13:48:18.272102", "value": 6.988900184631348}, "name": "eval_loss"}
{"data": {"epoch": 4.62, "step": 15, "timestamp": "2024-02-20T13:48:18.584293", "value": 6.988900184631348}, "name": "eval_loss"}

alex-jw-brooks

looks great!

anhuong added 2 commits February 16, 2024 16:08

track validation loss in logs file

cb6c354

Signed-off-by: Anh-Uong <[email protected]>

fix separation of eval and train loss

1005f2f

Signed-off-by: Anh-Uong <[email protected]>

anhuong requested a review from alex-jw-brooks February 20, 2024 19:06

alex-jw-brooks approved these changes Feb 20, 2024

Merge branch 'main' into validation-loss-file

1e817ca

Signed-off-by: Anh-Uong <[email protected]>

anhuong merged commit b9380e4 into main Feb 20, 2024
4 checks passed
