feat: Add timing tracker to fms-hf-tuning #378

willmj · 2024-10-21T20:17:37Z

Description of the change

This change will add another tracker to fms-hf-tuning which will, per process, log the training time. Additionally, if tuning is run using accelerate_launch.py, it will log the entire run time from beginning to end (including pre and post processing).

Here is an example log of a run with two processes:

{"data": {"end_time": "2024-10-21T19:32:36.657076", "start_time": "2024-10-21T19:31:21.727124", "timestamp": "2024-10-21T19:32:36.657152", "total_duration_seconds": 74.929952, "train_runtime": 73.5228}, "name": "training_timing"}
{"data": {"end_time": "2024-10-21T19:32:36.674714", "start_time": "2024-10-21T19:31:21.724035", "timestamp": "2024-10-21T19:32:36.674749", "total_duration_seconds": 74.950679, "train_runtime": 74.8826}, "name": "training_timing"}
{"data": {"end_time": "2024-10-21T19:33:23.181481", "start_time": "2024-10-21T19:23:54.401312", "total_duration_seconds": 568.780169}, "name": "total_runtime"}

Questions:

Is it helpful to have timer be per process or should there just be one train_runtime?
Is there a better way to use the tracker for total runtime without passing it in the launch command?

Related issue number

How to verify the PR

Run these changes in a dev branch on any model and view the output directory once training is complete.

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

Signed-off-by: Will Johnson <[email protected]>

github-actions · 2024-10-21T20:17:49Z

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Signed-off-by: Will Johnson <[email protected]>

ashokponkumar · 2024-10-22T16:19:13Z

@willmj should we consider using https://github.com/foundation-model-stack/hf-resource-scanner for tracking?

Cc: @ChanderG

willmj · 2024-10-22T17:11:38Z

Yes definitely, thanks for the suggestion @ashokponkumar - I had not heard of this scanner. I can look into this implementation.

Signed-off-by: Will Johnson <[email protected]>

anhuong · 2024-10-23T20:59:24Z

@will Johnson few questions for you...

Do we need to create a new log file just for timing information? I thought the initial idea was to add this to the existing log file, would need to check with SW to ensure this wouldn’t break things for them since they pull the loss from this file to add to a graph in the UI
Do we need to create our own method of timing the training time? Is this in order to include the pre and post-processing that happens? The current trainer already calculates how long the training time takes and outputs it at the end of training. Could we just take this value and add it to the log?
I don’t think we need to also run https://github.com/foundation-model-stack/hf-resource-scanner since other metrics like memory consumption should be captured in AIM. This could be a future improvement as well if we want to gather other metrics and log them during training

will · 2024-10-24T05:24:19Z

@will Johnson few questions for you...

By putting a space in between will and Johnson it sends an alert to me and not Mr Johnson

willmj · 2024-11-05T17:23:47Z

Hey @anhuong sorry for the late response:

I see, my understanding was that it was going to be a separate log file but looking back at our initial discussion I see that understanding was incorrect. Who would be a good person to reach out to to see if this is breaking?
To get the train_runtime from trainer.train, it's my understanding we would have to change the code in sft_trainer in the following way:
- From
```
 trainer.train(resume_from_checkpoint)
```
- to
```
train_results = trainer.train(resume_from_checkpoint)
train_runtime = train_results.metrics.get("train_runtime", None)
```
- then we could save it by doing something like this in sft_trainer
```
with open("timing_logs.jsonl", "a") as f:
   json.dump({"train_runtime": train_runtime}, f)
```
I think this works but wasn't sure if we wanted to save the results from trainer.train, additionally I thought it might be nice to abstract the timing outside of the trainer. What do you think is best?
Sounds good. For now we won't move forward with hf-resource-scanner.

willmj added 4 commits October 15, 2024 10:17

feat: Timing tracker

915ca16

Signed-off-by: Will Johnson <[email protected]>

feat: total time

089ad23

Signed-off-by: Will Johnson <[email protected]>

docs: Add timer to tracking documentation

3605805

Signed-off-by: Will Johnson <[email protected]>

fix: Small changes

fb64827

Signed-off-by: Will Johnson <[email protected]>

fix: small changes

3868551

Signed-off-by: Will Johnson <[email protected]>

willmj changed the title ~~1371~~ feat: Add timing tracker to fms-hf-tuning Oct 21, 2024

github-actions bot added the feat label Oct 21, 2024

docs: Correct variable name in docs

48f4b51

Signed-off-by: Will Johnson <[email protected]>

willmj added 3 commits October 22, 2024 13:23

docs: Update docstring

0e01a6d

Signed-off-by: Will Johnson <[email protected]>

tests: Add unit tests similar to file logging tests for timer

5d18773

Signed-off-by: Will Johnson <[email protected]>

fmt

ace14c7

Signed-off-by: Will Johnson <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add timing tracker to fms-hf-tuning #378

feat: Add timing tracker to fms-hf-tuning #378

willmj commented Oct 21, 2024

github-actions bot commented Oct 21, 2024

ashokponkumar commented Oct 22, 2024

willmj commented Oct 22, 2024

anhuong commented Oct 23, 2024

will commented Oct 24, 2024

willmj commented Nov 5, 2024

feat: Add timing tracker to fms-hf-tuning #378

Are you sure you want to change the base?

feat: Add timing tracker to fms-hf-tuning #378

Conversation

willmj commented Oct 21, 2024

Description of the change

Related issue number

How to verify the PR

Was the PR tested

github-actions bot commented Oct 21, 2024

ashokponkumar commented Oct 22, 2024

willmj commented Oct 22, 2024

anhuong commented Oct 23, 2024

will commented Oct 24, 2024

willmj commented Nov 5, 2024