Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add timing tracker to fms-hf-tuning #378

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

willmj
Copy link
Collaborator

@willmj willmj commented Oct 21, 2024

Description of the change

This change will add another tracker to fms-hf-tuning which will, per process, log the training time. Additionally, if tuning is run using accelerate_launch.py, it will log the entire run time from beginning to end (including pre and post processing).

Here is an example log of a run with two processes:

{"data": {"end_time": "2024-10-21T19:32:36.657076", "start_time": "2024-10-21T19:31:21.727124", "timestamp": "2024-10-21T19:32:36.657152", "total_duration_seconds": 74.929952, "train_runtime": 73.5228}, "name": "training_timing"}
{"data": {"end_time": "2024-10-21T19:32:36.674714", "start_time": "2024-10-21T19:31:21.724035", "timestamp": "2024-10-21T19:32:36.674749", "total_duration_seconds": 74.950679, "train_runtime": 74.8826}, "name": "training_timing"}
{"data": {"end_time": "2024-10-21T19:33:23.181481", "start_time": "2024-10-21T19:23:54.401312", "total_duration_seconds": 568.780169}, "name": "total_runtime"}

Questions:

  • Is it helpful to have timer be per process or should there just be one train_runtime?
  • Is there a better way to use the tracker for total runtime without passing it in the launch command?

Related issue number

How to verify the PR

Run these changes in a dev branch on any model and view the output directory once training is complete.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Signed-off-by: Will Johnson <[email protected]>
@willmj willmj changed the title 1371 feat: Add timing tracker to fms-hf-tuning Oct 21, 2024
@github-actions github-actions bot added the feat label Oct 21, 2024
@ashokponkumar
Copy link
Collaborator

@willmj should we consider using https://github.com/foundation-model-stack/hf-resource-scanner for tracking?

Cc: @ChanderG

@willmj
Copy link
Collaborator Author

willmj commented Oct 22, 2024

Yes definitely, thanks for the suggestion @ashokponkumar - I had not heard of this scanner. I can look into this implementation.

@anhuong
Copy link
Collaborator

anhuong commented Oct 23, 2024

@will Johnson few questions for you...

  • Do we need to create a new log file just for timing information? I thought the initial idea was to add this to the existing log file, would need to check with SW to ensure this wouldn’t break things for them since they pull the loss from this file to add to a graph in the UI
  • Do we need to create our own method of timing the training time? Is this in order to include the pre and post-processing that happens? The current trainer already calculates how long the training time takes and outputs it at the end of training. Could we just take this value and add it to the log?
  • I don’t think we need to also run https://github.com/foundation-model-stack/hf-resource-scanner since other metrics like memory consumption should be captured in AIM. This could be a future improvement as well if we want to gather other metrics and log them during training

@will
Copy link

will commented Oct 24, 2024

@will Johnson few questions for you...

By putting a space in between will and Johnson it sends an alert to me and not Mr Johnson

@willmj
Copy link
Collaborator Author

willmj commented Nov 5, 2024

Hey @anhuong sorry for the late response:

  • I see, my understanding was that it was going to be a separate log file but looking back at our initial discussion I see that understanding was incorrect. Who would be a good person to reach out to to see if this is breaking?

  • To get the train_runtime from trainer.train, it's my understanding we would have to change the code in sft_trainer in the following way:

    • From
       trainer.train(resume_from_checkpoint)
      
    • to
      train_results = trainer.train(resume_from_checkpoint)
      train_runtime = train_results.metrics.get("train_runtime", None)
      
    • then we could save it by doing something like this in sft_trainer
      with open("timing_logs.jsonl", "a") as f:
         json.dump({"train_runtime": train_runtime}, f)
      

    I think this works but wasn't sure if we wanted to save the results from trainer.train, additionally I thought it might be nice to abstract the timing outside of the trainer. What do you think is best?

  • Sounds good. For now we won't move forward with hf-resource-scanner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants