Add formatting function alpaca #161
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti Sharma <[email protected]>
@alex-jw-brooks the PR is ready for review now. I will address your existing comments soon. Thank you.
I have manually verified that the formatted data matches the output of our pre-processing step and that no warnings occur during training. We will run another quality test to be safe before the 0.2.0 release, but the PR is safe to merge.
looks great! some small questions
@alex-jw-brooks the PR is ready for review.
LGTM, thanks Sukriti!
Description of the change
Allows passing a data formatter template that builds a single-sequence dataset_text_field internally from the supplied JSONL. This eliminates the need to preprocess and format the data to Alpaca style beforehand: https://github.com/foundation-model-stack/fms-hf-tuning?tab=readme-ov-file#data-format
Adds a new data_formatter field to the args, similar to the SFT Trainer formatting function: https://huggingface.co/docs/trl/en/sft_trainer#trl.SFTTrainer
As with SFT Trainer, users either pass a dataset_text_field in the JSON with the template already applied, or pass a formatter that does the formatting on the fly.
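For illustration only, here is a minimal sketch of what on-the-fly Alpaca-style formatting could look like. It is not the code from this PR: the function names `format_alpaca` and `add_text_field`, the `formatted_text` field name, and the assumption that each JSONL record has `instruction`, optional `input`, and `output` keys are all hypothetical, and the template wording follows the commonly used Stanford Alpaca prompt rather than anything confirmed by this repository.

```python
import json

# Assumed Alpaca-style templates (hypothetical; not taken from this PR).
TEMPLATE_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
TEMPLATE_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def format_alpaca(record: dict) -> str:
    """Render one record into a single training sequence."""
    # Use the shorter template when the record has no (or an empty) input.
    template = TEMPLATE_WITH_INPUT if record.get("input") else TEMPLATE_NO_INPUT
    return template.format(
        instruction=record.get("instruction", ""),
        input=record.get("input", ""),
        output=record.get("output", ""),
    )


def add_text_field(jsonl_lines, dataset_text_field="formatted_text"):
    """Parse JSONL lines and attach the formatted sequence under one field."""
    records = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        record[dataset_text_field] = format_alpaca(record)
        records.append(record)
    return records
```

With a helper like this, a raw JSONL file can be fed in directly and the single text field is produced at load time, instead of being written into the file by a separate preprocessing pass.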
Related issue number
https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/863
How to verify the PR
Was the PR tested