Evaluation Data Format Scripts #115

alex-jw-brooks · 2024-04-10T15:41:42Z

Description of the change

This PR adds scripts for:

Pulling and formatting tone/entities/TSA data in Alpaca format
Converting the Alpaca format data to SFT format - this can use the "verbose" instruction template, or the (default) "simple" template.

It also some docs for how to run them.

Related issue number

How to verify the PR

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

anhuong · 2024-05-09T14:44:36Z

scripts/cluster_k8s.yaml

@@ -0,0 +1,73 @@
+apiVersion: v1
+kind: ConfigMap


Probably don't have to commit this file

Removed it and pointed it at the wiki for how to run something in the cluster

anhuong

Few comments on some additional cleanup, but have been using the scripts and they work well! Making the pull_and_format script generic is a nice future todo

scripts/cluster_k8s.yaml

anhuong · 2024-05-09T14:47:07Z

scripts/cluster_stuff.md

@@ -0,0 +1,140 @@
+# Formatting / Tuning / Evaluating Notes


Did you mean to add this as well to the PR? There is useful info here on how to run the scripts but there are also some parts, like running on a fork and specifying IBM datasets that may not be needed?

I did mean to commit this, as otherwise it's a bit unclear what the role of each script actually is, and these scripts are only really applicable to IBM datasets anyway, since the format that they're consuming is internal formats that aren't standardized (i.e., if they were standardized, we wouldn't probably would not need this script to begin with and could cleanly consolidate it into the Alpaca -> SFT one).

I am open to changing the guidance here if you have specific thoughts though! Or, we could reduce this by a bit by not including any information on how to run in a cluster / k8s, but IMO that is relatively useful knowledge, at least within our group

Also our decision on what to do with this probably affects what to do with the cluster_k8s.yaml - I think if we want k8s related guidance here, having some kind of minimal pod spec is sort of useful, but if we don't want it here, deleting it is probably for the best

I wasn't sure if the deployment docs are needed since we are starting to have it in multiple places already. In the build docs it shows an example of using a ConfigMap and Pod spec to deploy the image and in the wiki it shows an example ConfigMap and PytorchJob to deploy. But I can see how this is a third more developer use case of running and testing inside the pod. So I like the doc, but curious on your thoughts on how they compare to the other existing docs.

I definitely agree that the parts on how to use the evaluation and formatting scripts are super useful!

Cool, thanks for the links! I updated the docs to point it at the wiki for some guidance on the cluster stuff & deleted the k8s config, but left some small notes in about how you might go about running things in a pod, and also added a short section on how to add new datasets

scripts/pull_and_format_datasets.py

alex-jw-brooks · 2024-05-13T13:37:42Z

Thanks @anhuong, I'll make the changes! I am curious what your thoughts are in terms of making the dataset pulling / formatting generic - since the intent is really to format the dataset into something that is generic (i.e., alpaca, which can then be converted to SFT), the lambdas to do the actual conversion are pretty dataset / task specific. The script itself could be expanded, e.g., to make the lambdas/COS paths pickable through a CLI or something like that, but given that the input format of what is being consumed is arbitrary, IMO it's not worth it since it is unlikely that multiple datasets will be in the same nonstandard format, especially within the same COS instance

anhuong · 2024-05-13T16:27:55Z

In terms of making the dataset pulling / formatting generic, I feel like this would be nice but since it's so specific to the dataset on how to parse and reformat it, I agree that it's not worth it and this is a nice reference that someone can use to help format their datasets as needed.

Signed-off-by: Alex-Brooks <[email protected]>

anhuong

Awesome, thanks Alex!

* Add tentative script for formatting eval data Signed-off-by: Alex-Brooks <[email protected]> * Add formatting for entities and tsa Signed-off-by: Alex-Brooks <[email protected]> * Add docstrings to format script Signed-off-by: Alex-Brooks <[email protected]> * Escape commas in tsa / entities Signed-off-by: Alex-Brooks <[email protected]> * Add subsampling to script for pulling data Signed-off-by: Alex-Brooks <[email protected]> * Add script to convert alpaca format to sft format Signed-off-by: Alex-Brooks <[email protected]> * Add notes on formatting and evaluation Signed-off-by: Alex-Brooks <[email protected]> * Fix venv guidance Signed-off-by: Alex-Brooks <[email protected]> * Add hack for simple formatting Signed-off-by: Alex-Brooks <[email protected]> * Add boto3 to dev deps Signed-off-by: Alex-Brooks <[email protected]> * Add file and verbose flags to sft formatter Signed-off-by: Alex-Brooks <[email protected]> * lint pull and fmt script Signed-off-by: Alex-Brooks <[email protected]> * Update docs & review comments Signed-off-by: Alex-Brooks <[email protected]> --------- Signed-off-by: Alex-Brooks <[email protected]>

alex-jw-brooks force-pushed the format_scripts branch 6 times, most recently from e54e53f to 7959c71 Compare April 16, 2024 22:19

alex-jw-brooks force-pushed the format_scripts branch 4 times, most recently from 147799d to e0e6868 Compare May 2, 2024 23:42

alex-jw-brooks marked this pull request as ready for review May 2, 2024 23:49

alex-jw-brooks requested review from anhuong and Ssukriti as code owners May 2, 2024 23:49

alex-jw-brooks changed the title ~~DO NOT MERGE - Format scripts~~ Evaluation Data Format Scripts May 2, 2024

anhuong reviewed May 9, 2024

View reviewed changes

anhuong requested changes May 9, 2024

View reviewed changes

alex-jw-brooks force-pushed the format_scripts branch 3 times, most recently from 078f674 to 1f6121d Compare May 21, 2024 18:37

alex-jw-brooks added 9 commits May 21, 2024 12:48

Add tentative script for formatting eval data

fae2c95

Signed-off-by: Alex-Brooks <[email protected]>

Add formatting for entities and tsa

a967ff3

Signed-off-by: Alex-Brooks <[email protected]>

Add docstrings to format script

502d45e

Signed-off-by: Alex-Brooks <[email protected]>

Escape commas in tsa / entities

a9e2ad0

Signed-off-by: Alex-Brooks <[email protected]>

Add subsampling to script for pulling data

0e8c4d5

Signed-off-by: Alex-Brooks <[email protected]>

Add script to convert alpaca format to sft format

151adc3

Signed-off-by: Alex-Brooks <[email protected]>

Add notes on formatting and evaluation

ba075b9

Signed-off-by: Alex-Brooks <[email protected]>

Fix venv guidance

6828780

Signed-off-by: Alex-Brooks <[email protected]>

Add hack for simple formatting

df8121c

Signed-off-by: Alex-Brooks <[email protected]>

alex-jw-brooks added 4 commits May 21, 2024 12:48

Add boto3 to dev deps

f6e27db

Signed-off-by: Alex-Brooks <[email protected]>

Add file and verbose flags to sft formatter

cf92a3f

Signed-off-by: Alex-Brooks <[email protected]>

lint pull and fmt script

bd7e521

Signed-off-by: Alex-Brooks <[email protected]>

Update docs & review comments

43fa7a5

Signed-off-by: Alex-Brooks <[email protected]>

alex-jw-brooks force-pushed the format_scripts branch from 1f6121d to 43fa7a5 Compare May 21, 2024 18:48

anhuong approved these changes May 21, 2024

View reviewed changes

anhuong merged commit 04350fa into foundation-model-stack:main May 21, 2024
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluation Data Format Scripts #115

Evaluation Data Format Scripts #115

alex-jw-brooks commented Apr 10, 2024 •

edited

Loading

anhuong May 9, 2024

alex-jw-brooks May 21, 2024

anhuong left a comment

anhuong May 9, 2024

alex-jw-brooks May 13, 2024 •

edited

Loading

alex-jw-brooks May 13, 2024

anhuong May 13, 2024

alex-jw-brooks May 21, 2024

alex-jw-brooks commented May 13, 2024 •

edited

Loading

anhuong commented May 13, 2024

anhuong left a comment

Evaluation Data Format Scripts #115

Evaluation Data Format Scripts #115

Conversation

alex-jw-brooks commented Apr 10, 2024 • edited Loading

Description of the change

Related issue number

How to verify the PR

Was the PR tested

anhuong May 9, 2024

Choose a reason for hiding this comment

alex-jw-brooks May 21, 2024

Choose a reason for hiding this comment

anhuong left a comment

Choose a reason for hiding this comment

anhuong May 9, 2024

Choose a reason for hiding this comment

alex-jw-brooks May 13, 2024 • edited Loading

Choose a reason for hiding this comment

alex-jw-brooks May 13, 2024

Choose a reason for hiding this comment

anhuong May 13, 2024

Choose a reason for hiding this comment

alex-jw-brooks May 21, 2024

Choose a reason for hiding this comment

alex-jw-brooks commented May 13, 2024 • edited Loading

anhuong commented May 13, 2024

anhuong left a comment

Choose a reason for hiding this comment

alex-jw-brooks commented Apr 10, 2024 •

edited

Loading

alex-jw-brooks May 13, 2024 •

edited

Loading

alex-jw-brooks commented May 13, 2024 •

edited

Loading