Add formatting function alpaca #161

Ssukriti · 2024-05-17T22:43:29Z

Description of the change

Allows passing a data formatter template so that a single-sequence dataset_text_field is created internally from the supplied JSONL. This eliminates the need to preprocess the data and format it to Alpaca style beforehand: https://github.com/foundation-model-stack/fms-hf-tuning?tab=readme-ov-file#data-format

Adds a new data_formatter field to args, similar to the SFT Trainer formatting function: https://huggingface.co/docs/trl/en/sft_trainer#trl.SFTTrainer
As with the SFT Trainer, users either pass a dataset_text_field in the JSON that already contains the preformatted template, or pass a formatter so the formatting is done on the fly.
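
To illustrate the idea of formatting on the fly, here is a minimal sketch and not the code merged in this PR. The function names `apply_template` and `format_jsonl`, the `{{field}}` placeholder syntax, and the default `output_text` field name are all assumptions made for the example. It takes JSONL records, fills a template from each record's fields, and produces one text field per record, rejecting templates that reference fields missing from the JSON.

```python
import json
import re

# Assumed placeholder syntax for this sketch: {{field_name}}
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def apply_template(record: dict, template: str) -> str:
    """Fill every {{field}} placeholder in the template from one JSON record."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in record:
            # Only fields present in the JSON record may appear in the template.
            raise KeyError(f"Template field '{key}' not found in record")
        return str(record[key])

    return _PLACEHOLDER.sub(substitute, template)


def format_jsonl(lines, template: str, text_field: str = "output_text") -> list:
    """Turn JSONL lines into records holding a single formatted text field."""
    formatted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        formatted.append({text_field: apply_template(record, template)})
    return formatted


# Hypothetical Alpaca-like template and a one-record JSONL input.
template = "### Input:\n{{input}}\n\n### Response:\n{{output}}"
jsonl = ['{"input": "2 + 2", "output": "4"}']
rows = format_jsonl(jsonl, template)
print(rows[0]["output_text"])
```

With a step like this, the JSONL only needs the raw fields; the single training sequence is assembled when the data is loaded rather than in a separate preprocessing pass.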

Related issue number

https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/863

How to verify the PR

Unit tests
README updated

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

Signed-off-by: Sukriti-Sharma4 <[email protected]>

tuning/utils/data_utils.py

tuning/sft_trainer.py

tests/utils/test_data_utils.py

tests/test_sft_trainer.py

Ssukriti · 2024-05-24T04:08:16Z

@alex-jw-brooks the PR is ready for review now. I will address your existing comments soon. Thank you

Signed-off-by: Sukriti-Sharma4 <[email protected]>

Ssukriti · 2024-05-24T22:15:23Z

I have manually verified that the formatted data matches the output of our pre-processing step and that no warnings occur during training. We will do another quality test to be safe before the 0.2.0 release. The PR is safe to merge, though.

alex-jw-brooks

looks great! some small questions

README.md

tuning/sft_trainer.py

tuning/utils/data_utils.py

tests/utils/test_data_utils.py

Signed-off-by: Sukriti-Sharma4 <[email protected]>

Ssukriti · 2024-05-29T00:07:28Z

@alex-jw-brooks the PR is ready for review

alex-jw-brooks

LGTM, thanks Sukriti!

Ssukriti and others added 3 commits May 16, 2024 22:44

utility functions to format datasets using template

54b2624

Signed-off-by: Sukriti-Sharma4 <[email protected]>

add tests and formatter as arg

d1cc787

Signed-off-by: Sukriti-Sharma4 <[email protected]>

Merge branch 'main' into add_formatting_function_alpaca

84e4997

alex-jw-brooks reviewed May 20, 2024

Ssukriti added 5 commits May 23, 2024 13:52

Merge branch 'main' into add_formatting_function_alpaca

b1fb8a1

merge main

7c6cce3

Signed-off-by: Sukriti-Sharma4 <[email protected]>

update tests to use template to avoid warnings

6d55934

Signed-off-by: Sukriti-Sharma4 <[email protected]>

update README and tests

f84a222

Signed-off-by: Sukriti-Sharma4 <[email protected]>

fix:formatter

e95661d

Signed-off-by: Sukriti-Sharma4 <[email protected]>

Ssukriti marked this pull request as ready for review May 24, 2024 02:15

Ssukriti requested a review from anhuong as a code owner May 24, 2024 02:15

Ssukriti and others added 4 commits May 23, 2024 21:33

Update README.md

5f1b3e1

Signed-off-by: Sukriti Sharma <[email protected]>

fix imports

2f70e34

Signed-off-by: Sukriti-Sharma4 <[email protected]>

fix pylint

45827ce

Signed-off-by: Sukriti-Sharma4 <[email protected]>

fix tests

1f6bb04

Signed-off-by: Sukriti-Sharma4 <[email protected]>

Ssukriti added 2 commits May 24, 2024 15:52

address review comments- function names

4579f6f

Signed-off-by: Sukriti-Sharma4 <[email protected]>

formatting fix

6cc6d41

Signed-off-by: Sukriti-Sharma4 <[email protected]>

alex-jw-brooks requested changes May 28, 2024

README.md (resolved)

tuning/sft_trainer.py (outdated, resolved)

tuning/sft_trainer.py (resolved)

tuning/utils/data_utils.py (outdated, resolved)

tests/utils/test_data_utils.py (resolved)

Ssukriti and others added 3 commits May 28, 2024 17:11

update error message

6c09d68

Signed-off-by: Sukriti-Sharma4 <[email protected]>

Merge branch 'main' into add_formatting_function_alpaca

5d8e643

restrict JSON fields templates

3f5cc6b

Signed-off-by: Sukriti-Sharma4 <[email protected]>

alex-jw-brooks approved these changes May 29, 2024

Ssukriti merged commit 3d0c4f3 into main May 29, 2024
8 checks passed
