feat: track validation loss in logs file #51

Merged
merged 3 commits into from
Feb 20, 2024

@anhuong anhuong commented Feb 20, 2024

Changes include:

  • In addition to the train_loss.jsonl file that is already written, a similar eval_loss.jsonl file is now created if an eval loss is found in the logs.
  • Refactored the loss-tracking logic so the same code is reused for both training and eval loss.
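
For readers who want a picture of the behavior described above, here is a rough sketch. It is not the code merged in this PR: the helper names `append_loss_record` and `track_losses`, the assumption that training loss arrives under the `loss` key, and the `train_loss` record name are all hypothetical. Only the `eval_loss.jsonl` record layout is taken from the test output posted in this thread.

```python
import json
import os
from datetime import datetime


def append_loss_record(log_path, logs, step, loss_key, metric_name):
    """Append one JSONL record if `loss_key` is present in `logs`.

    Hypothetical helper: returns True when a record was written,
    False when the key is absent and nothing was written.
    """
    if loss_key not in logs:
        return False
    record = {
        "name": metric_name,
        "data": {
            "epoch": logs.get("epoch"),
            "step": step,
            "timestamp": datetime.now().isoformat(),
            "value": logs[loss_key],
        },
    }
    # sort_keys reproduces the key order seen in the eval_loss.jsonl sample.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return True


def track_losses(output_dir, logs, step):
    """Route one log dict to train_loss.jsonl and/or eval_loss.jsonl.

    The same helper handles both files, which is the kind of reuse the
    refactor aims for. The "train_loss" record name is an assumption.
    """
    wrote_train = append_loss_record(
        os.path.join(output_dir, "train_loss.jsonl"), logs, step, "loss", "train_loss"
    )
    wrote_eval = append_loss_record(
        os.path.join(output_dir, "eval_loss.jsonl"), logs, step, "eval_loss", "eval_loss"
    )
    return wrote_train, wrote_eval
```

With this shape, a log dict that carries no `eval_loss` key never creates eval_loss.jsonl, so runs without a validation set would only produce train_loss.jsonl.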

anhuong commented Feb 20, 2024

Tested with the tiny bloom model and prompt tuning (PT), setting validation_data_path to the same twitter dataset used for data_path. The eval_loss logs were printed, and the eval_loss.jsonl file was created alongside the checkpoints and training loss logs:

$ python tuning/sft_trainer.py \
    --model_name_or_path BloomForCausalLM/ \
    --data_path twitter_complaints.json \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --include_tokens_per_second \
    --packing False \
    --response_template "\n### Label:" \
    --dataset_text_field "output" \
    --use_flash_attn False \
    --tokenizer_name_or_path BloomForCausalLM \
    --torch_dtype "float32" \
    --peft_method "pt" \
    --logging_strategy "epoch" \
    --validation_data_path twitter_complaints.json

...
{'loss': 7.5713, 'learning_rate': 9.504844339512096e-06, 'epoch': 0.92}                                                   
{'eval_loss': 6.988901138305664, 'eval_runtime': 0.1005, 'eval_samples_per_second': 497.551, 'eval_steps_per_second': 129.363, 'epoch': 0.92}                                                                                                       
 20%|█████████████████▏                                                                    | 3/15 [00:01<00:04,  2.49it/s
...
{'loss': 7.5713, 'learning_rate': 7.169418695587791e-06, 'epoch': 1.85}                                                   
{'eval_loss': 6.988900184631348, 'eval_runtime': 0.0741, 'eval_samples_per_second': 674.716, 'eval_steps_per_second': 175.426, 'epoch': 1.85}                                                                                                       
 40%|██████████████████████████████████▍                                                   | 6/15 [00:01<00:01,  5.42it/s
...
{'loss': 6.9889, 'learning_rate': 0.0, 'epoch': 4.62}                                                                     
{'eval_loss': 6.988900184631348, 'eval_runtime': 0.0806, 'eval_samples_per_second': 620.329, 'eval_steps_per_second': 161.286, 'epoch': 4.62}                                                                                                       
100%|█████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:02<00:00,  7.08it/s
{'train_runtime': 3.3712, 'train_samples_per_second': 74.159, 'train_steps_per_second': 4.45, 'train_tokens_per_second': 6579.352, 'train_loss': 6.988884417215983, 'epoch': 4.62}
100%|█████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:02<00:00,  5.18it/s]

$ ls $OUTPUT_DIR
checkpoint-13  checkpoint-15  checkpoint-3  checkpoint-6  checkpoint-9  eval_loss.jsonl  train_loss.jsonl

$ cat $OUTPUT_DIR/eval_loss.jsonl
{"data": {"epoch": 0.92, "step": 3, "timestamp": "2024-02-20T13:48:17.023893", "value": 6.988901138305664}, "name": "eval_loss"}
{"data": {"epoch": 1.85, "step": 6, "timestamp": "2024-02-20T13:48:17.454848", "value": 6.988900184631348}, "name": "eval_loss"}
{"data": {"epoch": 2.77, "step": 9, "timestamp": "2024-02-20T13:48:17.847990", "value": 6.988900184631348}, "name": "eval_loss"}
{"data": {"epoch": 4.0, "step": 13, "timestamp": "2024-02-20T13:48:18.272102", "value": 6.988900184631348}, "name": "eval_loss"}
{"data": {"epoch": 4.62, "step": 15, "timestamp": "2024-02-20T13:48:18.584293", "value": 6.988900184631348}, "name": "eval_loss"}

@alex-jw-brooks alex-jw-brooks left a comment

looks great!

@anhuong anhuong merged commit b9380e4 into main Feb 20, 2024
4 checks passed