-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Evaluation Data Format Scripts #115
Evaluation Data Format Scripts #115
Conversation
e54e53f
to
7959c71
Compare
147799d
to
e0e6868
Compare
scripts/cluster_k8s.yaml
Outdated
@@ -0,0 +1,73 @@ | |||
apiVersion: v1 | |||
kind: ConfigMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably don't have to commit this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it and pointed it at the wiki for how to run something in the cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments on some additional cleanup, but have been using the scripts and they work well! Making the pull_and_format script generic is a nice future todo
scripts/cluster_stuff.md
Outdated
@@ -0,0 +1,140 @@ | |||
# Formatting / Tuning / Evaluating Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to add this as well to the PR? There is useful info here on how to run the scripts but there are also some parts, like running on a fork and specifying IBM datasets that may not be needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did mean to commit this, as otherwise it's a bit unclear what the role of each script actually is, and these scripts are only really applicable to IBM datasets anyway, since the format that they're consuming is internal formats that aren't standardized (i.e., if they were standardized, we wouldn't probably would not need this script to begin with and could cleanly consolidate it into the Alpaca -> SFT one).
I am open to changing the guidance here if you have specific thoughts though! Or, we could reduce this by a bit by not including any information on how to run in a cluster / k8s, but IMO that is relatively useful knowledge, at least within our group
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also our decision on what to do with this probably affects what to do with the cluster_k8s.yaml
- I think if we want k8s related guidance here, having some kind of minimal pod spec is sort of useful, but if we don't want it here, deleting it is probably for the best
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if the deployment docs are needed since we are starting to have it in multiple places already. In the build docs it shows an example of using a ConfigMap and Pod spec to deploy the image and in the wiki it shows an example ConfigMap and PytorchJob to deploy. But I can see how this is a third more developer use case of running and testing inside the pod. So I like the doc, but curious on your thoughts on how they compare to the other existing docs.
I definitely agree that the parts on how to use the evaluation and formatting scripts are super useful!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for the links! I updated the docs to point it at the wiki for some guidance on the cluster stuff & deleted the k8s config, but left some small notes in about how you might go about running things in a pod, and also added a short section on how to add new datasets
Thanks @anhuong, I'll make the changes! I am curious what your thoughts are in terms of making the dataset pulling / formatting generic - since the intent is really to format the dataset into something that is generic (i.e., alpaca, which can then be converted to SFT), the lambdas to do the actual conversion are pretty dataset / task specific. The script itself could be expanded, e.g., to make the lambdas/COS paths pickable through a CLI or something like that, but given that the input format of what is being consumed is arbitrary, IMO it's not worth it since it is unlikely that multiple datasets will be in the same nonstandard format, especially within the same COS instance |
In terms of making the dataset pulling / formatting generic, I feel like this would be nice but since it's so specific to the dataset on how to parse and reformat it, I agree that it's not worth it and this is a nice reference that someone can use to help format their datasets as needed. |
078f674
to
1f6121d
Compare
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
1f6121d
to
43fa7a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks Alex!
* Add tentative script for formatting eval data Signed-off-by: Alex-Brooks <[email protected]> * Add formatting for entities and tsa Signed-off-by: Alex-Brooks <[email protected]> * Add docstrings to format script Signed-off-by: Alex-Brooks <[email protected]> * Escape commas in tsa / entities Signed-off-by: Alex-Brooks <[email protected]> * Add subsampling to script for pulling data Signed-off-by: Alex-Brooks <[email protected]> * Add script to convert alpaca format to sft format Signed-off-by: Alex-Brooks <[email protected]> * Add notes on formatting and evaluation Signed-off-by: Alex-Brooks <[email protected]> * Fix venv guidance Signed-off-by: Alex-Brooks <[email protected]> * Add hack for simple formatting Signed-off-by: Alex-Brooks <[email protected]> * Add boto3 to dev deps Signed-off-by: Alex-Brooks <[email protected]> * Add file and verbose flags to sft formatter Signed-off-by: Alex-Brooks <[email protected]> * lint pull and fmt script Signed-off-by: Alex-Brooks <[email protected]> * Update docs & review comments Signed-off-by: Alex-Brooks <[email protected]> --------- Signed-off-by: Alex-Brooks <[email protected]>
Description of the change
This PR adds scripts for:
It also some docs for how to run them.
Related issue number
How to verify the PR
Was the PR tested