Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluation Data Format Scripts #115

Merged
merged 13 commits into from
May 21, 2024

Conversation

alex-jw-brooks
Copy link
Collaborator

@alex-jw-brooks alex-jw-brooks commented Apr 10, 2024

Description of the change

This PR adds scripts for:

  1. Pulling and formatting tone/entities/TSA data in Alpaca format
  2. Converting the Alpaca format data to SFT format - this can use the "verbose" instruction template, or the (default) "simple" template.

It also some docs for how to run them.

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@alex-jw-brooks alex-jw-brooks force-pushed the format_scripts branch 6 times, most recently from e54e53f to 7959c71 Compare April 16, 2024 22:19
@alex-jw-brooks alex-jw-brooks force-pushed the format_scripts branch 4 times, most recently from 147799d to e0e6868 Compare May 2, 2024 23:42
@alex-jw-brooks alex-jw-brooks marked this pull request as ready for review May 2, 2024 23:49
@alex-jw-brooks alex-jw-brooks changed the title DO NOT MERGE - Format scripts Evaluation Data Format Scripts May 2, 2024
@@ -0,0 +1,73 @@
apiVersion: v1
kind: ConfigMap
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't have to commit this file

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it and pointed it at the wiki for how to run something in the cluster

Copy link
Collaborator

@anhuong anhuong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments on some additional cleanup, but have been using the scripts and they work well! Making the pull_and_format script generic is a nice future todo

scripts/cluster_k8s.yaml Outdated Show resolved Hide resolved
@@ -0,0 +1,140 @@
# Formatting / Tuning / Evaluating Notes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to add this as well to the PR? There is useful info here on how to run the scripts but there are also some parts, like running on a fork and specifying IBM datasets that may not be needed?

Copy link
Collaborator Author

@alex-jw-brooks alex-jw-brooks May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did mean to commit this, as otherwise it's a bit unclear what the role of each script actually is, and these scripts are only really applicable to IBM datasets anyway, since the format that they're consuming is internal formats that aren't standardized (i.e., if they were standardized, we wouldn't probably would not need this script to begin with and could cleanly consolidate it into the Alpaca -> SFT one).

I am open to changing the guidance here if you have specific thoughts though! Or, we could reduce this by a bit by not including any information on how to run in a cluster / k8s, but IMO that is relatively useful knowledge, at least within our group

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also our decision on what to do with this probably affects what to do with the cluster_k8s.yaml - I think if we want k8s related guidance here, having some kind of minimal pod spec is sort of useful, but if we don't want it here, deleting it is probably for the best

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if the deployment docs are needed since we are starting to have it in multiple places already. In the build docs it shows an example of using a ConfigMap and Pod spec to deploy the image and in the wiki it shows an example ConfigMap and PytorchJob to deploy. But I can see how this is a third more developer use case of running and testing inside the pod. So I like the doc, but curious on your thoughts on how they compare to the other existing docs.

I definitely agree that the parts on how to use the evaluation and formatting scripts are super useful!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for the links! I updated the docs to point it at the wiki for some guidance on the cluster stuff & deleted the k8s config, but left some small notes in about how you might go about running things in a pod, and also added a short section on how to add new datasets

scripts/pull_and_format_datasets.py Outdated Show resolved Hide resolved
scripts/pull_and_format_datasets.py Outdated Show resolved Hide resolved
scripts/pull_and_format_datasets.py Show resolved Hide resolved
@alex-jw-brooks
Copy link
Collaborator Author

alex-jw-brooks commented May 13, 2024

Thanks @anhuong, I'll make the changes! I am curious what your thoughts are in terms of making the dataset pulling / formatting generic - since the intent is really to format the dataset into something that is generic (i.e., alpaca, which can then be converted to SFT), the lambdas to do the actual conversion are pretty dataset / task specific. The script itself could be expanded, e.g., to make the lambdas/COS paths pickable through a CLI or something like that, but given that the input format of what is being consumed is arbitrary, IMO it's not worth it since it is unlikely that multiple datasets will be in the same nonstandard format, especially within the same COS instance

@anhuong
Copy link
Collaborator

anhuong commented May 13, 2024

In terms of making the dataset pulling / formatting generic, I feel like this would be nice but since it's so specific to the dataset on how to parse and reformat it, I agree that it's not worth it and this is a nice reference that someone can use to help format their datasets as needed.

@alex-jw-brooks alex-jw-brooks force-pushed the format_scripts branch 3 times, most recently from 078f674 to 1f6121d Compare May 21, 2024 18:37
Copy link
Collaborator

@anhuong anhuong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks Alex!

@anhuong anhuong merged commit 04350fa into foundation-model-stack:main May 21, 2024
6 checks passed
jbusche pushed a commit to jbusche/fms-hf-tuning that referenced this pull request May 21, 2024
* Add tentative script for formatting eval data

Signed-off-by: Alex-Brooks <[email protected]>

* Add formatting for entities and tsa

Signed-off-by: Alex-Brooks <[email protected]>

* Add docstrings to format script

Signed-off-by: Alex-Brooks <[email protected]>

* Escape commas in tsa / entities

Signed-off-by: Alex-Brooks <[email protected]>

* Add subsampling to script for pulling data

Signed-off-by: Alex-Brooks <[email protected]>

* Add script to convert alpaca format to sft format

Signed-off-by: Alex-Brooks <[email protected]>

* Add notes on formatting and evaluation

Signed-off-by: Alex-Brooks <[email protected]>

* Fix venv guidance

Signed-off-by: Alex-Brooks <[email protected]>

* Add hack for simple formatting

Signed-off-by: Alex-Brooks <[email protected]>

* Add boto3 to dev deps

Signed-off-by: Alex-Brooks <[email protected]>

* Add file and verbose flags to sft formatter

Signed-off-by: Alex-Brooks <[email protected]>

* lint pull and fmt script

Signed-off-by: Alex-Brooks <[email protected]>

* Update docs & review comments

Signed-off-by: Alex-Brooks <[email protected]>

---------

Signed-off-by: Alex-Brooks <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants