Add file logger callback & export train loss json file #22
Conversation
- General question: should we follow the same pattern as used by caikit-nlp, from a consistency and compatibility perspective?
- We'll need to see how the callbacks work for steps vs epochs, since we may want to log individual steps even if checkpointing is set to epoch-wise decisions (see the sketch below). Another scenario we'll need to understand better is the multi-GPU case.
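For context, a minimal sketch of how step-level logging can be configured independently of epoch-based checkpointing via Hugging Face TrainingArguments (the values here are illustrative, not taken from this PR):

from transformers import TrainingArguments

# Logging cadence is controlled separately from checkpointing, so the on_log
# event can still fire per step even when saves happen per epoch.
training_args = TrainingArguments(
    output_dir="output",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="epoch",
)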
tuning/sft_trainer.py
Outdated
log_file_path = os.path.join(args.output_dir, "train_loss.json")
if logs is not None:
    try:
        # Take the subdict of the last log line; if any log_keys aren't part of this log
        # object, assume this line is something else, e.g., train completion, and skip.
        log_obj = {k: logs[k] for k in FileLoggingCallback.log_keys}
    except KeyError:
        return

    # Redump the json file in the checkpoint directory with the updated log line
    self.existing_logs.append(log_obj)
    with open(log_file_path, "w") as log_file:
        json.dump(self.existing_logs, log_file, sort_keys=True, indent=4)
One thing I would be a bit worried about is the need to open and close the file repeatedly. Also, it looks like we are overwriting the existing log file instead of appending to it - any particular reason for that?
Just that this is writing a JSON file, so we need to reparse it to add to the list. We could alternatively write a list of JSON objects (i.e., each line is a JSON object representing one log), and in that case just append per log?
I agree that it's a bit of a bummer to open and close the file so much, although the cost is still probably quite small compared to the actual training of the model, and the reparsing is the main problem. If we write a JSON object per log, we could keep the file open in append mode and flush on each written log, I guess?
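To make that alternative concrete, a rough sketch of per-log JSONL appending, with a hypothetical helper name rather than the code in this PR:

import json
import os

def append_log_line(output_dir, log_obj, filename="train_loss.jsonl"):
    # One JSON object per line: no need to re-parse previously written logs,
    # just append the new entry and flush so it is visible immediately.
    log_file_path = os.path.join(output_dir, filename)
    with open(log_file_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(log_obj, sort_keys=True) + "\n")
        log_file.flush()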
tuning/sft_trainer.py
Outdated
appends the subdict of the log & dumps the file.
"""

log_file_path = os.path.join(args.output_dir, "train_loss.json")
Can we add either a conversion or a check to make sure loss is actually a float instead of some other data type? Otherwise the json dump would fail.
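As an illustration of that check, a small sketch (hypothetical helper, not the PR's code) that coerces the value to a built-in float before dumping, so tensor or numpy scalar types don't break json.dump:

def coerce_loss(logs, key="loss"):
    # Return the loss as a built-in float, or None if it is missing or not numeric.
    try:
        return float(logs[key])
    except (KeyError, TypeError, ValueError):
        return None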
Hey @gkumbhat! Thanks for the review. Some thoughts on your broader questions:
- I would greatly prefer to avoid subclassing the trainer just for logging. The actual logging events look similar either way, so any logic in the logging should be pretty portable between the two approaches, and this project is already using callbacks, so a callback feels more natural. Plus, the SFT trainer is already a trainer subclass and pretty complex; subclassing it again just to write an extra file seems like unnecessary complexity to me, since it would basically be for this one function.
- For multi-GPU, we probably need to make sure that the file is only written by the master process, but I'm not sure how rank is taken into account for the logging event. I can look into this as well, although do you know if we formally support multi-GPU at the moment, or is it still experimental?
By "follow pattern", I actually meant the formatting of the output file. Sorry, I was not quite clear there.
Oops, I don't know 🤷
Yes, we can do multi-GPU using this code base, and it has been well tested.
Force-pushed from 98b18bf to 3b50b06
Thanks everyone - updated the PR to write in JSONL format and to only dump logs from process zero, to prevent clobbering when running multiprocess training.
Force-pushed from 3b50b06 to 5259c79
Hey @gkumbhat - updated the PR and description based on our discussions to match the legacy log format; it should be ready for another look when you have a moment.
LGTM
…del-stack#22)
* Add file logger callback & export train loss json file
  Signed-off-by: Alex-Brooks <[email protected]>
* only update logs from process 0
  Signed-off-by: Alex-Brooks <[email protected]>
* Export logs in jsonl format
  Signed-off-by: Alex-Brooks <[email protected]>
* Formatting
  Signed-off-by: Alex-Brooks <[email protected]>
---------
Signed-off-by: Alex-Brooks <[email protected]>
This PR adds a new callback, which is invoked on the logging event - if the most recent log contains specific keys, the subdictionary is extracted and the train_loss.json file is re-exported.

Example using Llama 7b with the twitter complaints example (tested on a single GPU with a V100). Produces a train_loss.json file in the output directory; here is an example of the json file that gets created.

Note that logs are only exported from the main process to avoid writing from multiple processes for multi-GPU tunings.
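For reference, a minimal sketch of a callback along the lines described here; the class name, tracked keys, and output file name are assumptions for illustration, not the merged implementation:

import json
import os

from transformers import TrainerCallback

class FileLoggingCallback(TrainerCallback):
    # Keys expected in a training-loss log record (assumed for illustration).
    log_keys = {"loss", "epoch"}

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Only the main process writes, to avoid clobbering in multi-GPU runs.
        if logs is None or not state.is_world_process_zero:
            return
        if not self.log_keys.issubset(logs):
            # Not a training-loss record (e.g., final train summary); skip it.
            return
        log_file_path = os.path.join(args.output_dir, "train_loss.jsonl")
        with open(log_file_path, "a", encoding="utf-8") as log_file:
            log_file.write(json.dumps(logs, default=float) + "\n")

Writing one JSON object per line keeps each update append-only, so there is no need to re-read the file on every logging event.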