Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: change tracker API to initialize tracker early and track additional metrics. #50

Closed
wants to merge 13 commits into from

Conversation

dushyantbehl
Copy link
Contributor

@dushyantbehl dushyantbehl commented Feb 19, 2024

  1. This PR adds a tracker API with Aimstack as the default tracker. This is simple plug and play architecture which can support multiple trackers.
  2. The tracker config is now taken as command line arguments (making it easier for any automation to pass tracker arguments).
  3. With the new API I have added support to track any additional metrics of interest.
  4. As an example I have added one single line to track model_load_time in AIM possibly fixing Add support for collecting metrics programmatically #33
  5. Also bumps Aim version to 3.18.1 which is more stable and newer release.
    See. Bump aim from 3.17.5 to 3.18.1 #42

Example of how new api can be invoked

 torchrun --nnodes=1 --nproc_per_node=2 --master_port=1234 tuning/sft_trainer.py --tokenizer_name_or_path ${MODEL_PATH} --model_name_or_path ${MODEL_PATH} --data_path ${DATA_PATH} --use_peft --bf16 True --output_dir ${OUTPUT_PATH} --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_config tuning/config/fsdp_config.json --bf16 True --response_template "\n### Response:" --dataset_text_field "output" --tracker aim --aim_repo /data/aim --experiment sft-llama7b-test 

@dushyantbehl
Copy link
Contributor Author

cc @VassilisVassiliadis comments are welcome

Tracker now takes command line arguments as config.
Aim stack is the default tracker and code contains example to measure
additional metrics seamlessly into aimstack like 'model_load_time'

Signed-off-by: Dushyant Behl <[email protected]>
@dushyantbehl dushyantbehl changed the title Generic Tracker API with command line arguments. Change tracker API to initialize tracker early and track additional metrics. Feb 20, 2024
Fix duplicate argument

Signed-off-by: Dushyant Behl <[email protected]>
Signed-off-by: Dushyant Behl <[email protected]>
@dushyantbehl dushyantbehl changed the title Change tracker API to initialize tracker early and track additional metrics. feat: change tracker API to initialize tracker early and track additional metrics. Feb 21, 2024
Signed-off-by: Dushyant Behl <[email protected]>
Copy link
Contributor

@VassilisVassiliadis VassilisVassiliadis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of something along these lines (need to double check the indentation as I used the web interface to suggest the changes)

tuning/sft_trainer.py Outdated Show resolved Hide resolved
tuning/sft_trainer.py Outdated Show resolved Hide resolved
FSDP.

Add tracker factory.

Signed-off-by: Dushyant Behl <[email protected]>
@dushyantbehl
Copy link
Contributor Author

@VassilisVassiliadis the design and some functions had changed due to inconsistency with FSDP runs. I feel most of your ask is incorporated now. Can you check again if you have any other questions.

This argument --exp_metadata '{"gpu_model": "A100-80GB", "dataset_id": "my-amazing-dataset"}' passed to the run will result in the parameters recorded like this.

image

@VassilisVassiliadis
Copy link
Contributor

Thanks! this looks good to me

@dushyantbehl dushyantbehl force-pushed the main branch 2 times, most recently from 4bbe5d3 to 8661cc0 Compare February 23, 2024 10:47
@ashokponkumar
Copy link
Collaborator

@alex-jw-brooks PTAL

def train(
model_args: configs.ModelArguments,
data_args: configs.DataArguments,
train_args: configs.TrainingArguments,
peft_config: Optional[
Union[peft_config.LoraConfig, peft_config.PromptTuningConfig]
] = None,
callbacks: Optional[List[TrainerCallback]] = None,
tracker: Optional[Tracker] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the train function originall accepts only configurations by design, I feel we need a strong reason to allow it to also include objects. The more natural way would be to accept one "tracker config" input

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a yaml as config was the first thought that came too. But given HF has chosen to use arguments for most configurations we went with it as a design choice. But if in fms-hf-tuning we choose to use config files for configs, we can do that too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Train function expects to take callbacks which need to be associated with the training hence main takes the config and initializes tracker (which is needed for the callback)
Tracker here is the tracker object which train function needs to track any extra metrics and metadata hence the design choice to pass tracker separately.

callbacks = [peft_saving_callback, file_logger_callback]

# Initialize the tracker
tracker = get_tracker(tracker_name, tracker_config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just have a tracker_install_callbacks function somewhere to reduce 3 lines into 1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Here too it's a design choice we have to make. We have tried to keep it consistent with how config is managed in fms-hf-tuning here too. This has tried to follow the pattern used in the peft config block above. But if we choose to have a different approach we can follow that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its simple choice to either perform things explicitly or run the code in some other funciton reverting to no callback under the hood. Does not affect the functionality. If you strongly feeling a meta function can help we can make that change.

tuning/config/configs.py Outdated Show resolved Hide resolved
tuning/sft_trainer.py Outdated Show resolved Hide resolved
requirements.txt Outdated
@@ -2,7 +2,7 @@ numpy
accelerate>=0.20.3
transformers>=4.34.1
torch
aim==3.17.5
aim==3.18.1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since there's a desire to implement multiple trackers, did we want to make the dependency (and imports) optional, just used when available and configured?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be done..but then do we still list these in the requirements.txt?

or do we throw an error and ask user to install the required tracker before importing it in the code.

@dushyantbehl dushyantbehl force-pushed the main branch 4 times, most recently from 1fd64ba to 93e2d03 Compare February 29, 2024 17:55
@dushyantbehl
Copy link
Contributor Author

I have split this PR into two and opened them separately as #89 and #90

Not closing this right away due to the comments.

@dushyantbehl
Copy link
Contributor Author

#89 and #90 is split version of this PR. We will track them for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants