[WIP] Automatic script to fetch current datasets. #257
base: master
Conversation
I tried this a couple of days ago; IMO it looks good. I would not delete intermediate files by default: for most people it is more difficult to download 200 GB than to store it, even though you might have different constraints on your server. Let's try to complete the list of datasets for the current master nets. I can definitely upload more data as needed.
the current master stage2 dataset is composed of:
i'll take a closer look at stage3 later. the current L1-2048 master final stage dataset is an unshuffled 800GB+ binpack that i'm no longer using since it's too inconvenient. i'm working on replacing it with a fully minimized ~330GB dataset, which i'll document later as well.
the current master stage3 dataset is:
the current master stage4/5 dataset is composed of:

the uploaded dataset components are all minimized. parts of the dataset were unminimized to increase randomness during training; however, it's unclear how much of an elo benefit this brings. see official-stockfish/Stockfish#4606 for more details on this particular dataset.

as of now, all datasets for training the current master net (nn-c38c3d8d3920.nnue) are documented in this PR.
the current master stage6 dataset is composed of:
since this was a retraining of the master net, all datasets for training the current master net (nn-1ee1aba5ed4c.nnue) are documented in this PR. more details about this dataset are in official-stockfish/Stockfish#4782
The purpose of this script is to always document and fetch the datasets required to replicate the training of the current Stockfish master network.
Right now this is mostly a skeleton with a DSL that allows defining how the datasets are combined. Downloading from Kaggle and concatenation should work but are untested. Interleaving is not yet implemented. It is only meant to be used in the dry-run form right now.
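To make the intent a bit more concrete, here is a minimal sketch of what such a declarative dataset definition could look like in Python. The names (`KaggleDataset`, `Concatenate`, `Interleave`, the example slug and file names) are illustrative assumptions, not the actual DSL used in this PR:

```python
# Hypothetical sketch of a declarative dataset spec; all names and slugs below
# are placeholders, not the real DSL or datasets from this PR.
from dataclasses import dataclass
from typing import List


@dataclass
class KaggleDataset:
    # Kaggle dataset slug, e.g. "user/some-binpack-collection" (placeholder).
    slug: str


@dataclass
class Concatenate:
    # Concatenate all files of the source dataset, in alphabetical order,
    # into a single destination binpack.
    source: KaggleDataset
    destination: str


@dataclass
class Interleave:
    # Interleave several binpacks into one destination (not implemented yet).
    sources: List[str]
    destination: str


# Example stage definition (purely illustrative).
STAGE1 = [
    Concatenate(KaggleDataset("user/stage1-data"), "stage1.binpack"),
]
```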
A single Kaggle dataset is always combined into a single destination file by concatenation in alphabetical sort order. If this is too rigid a requirement we can work on relaxing it, but I think it works for all the current datasets.
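As a sketch of that concatenation rule (assuming the downloaded dataset files sit in one directory; the function and flag names are hypothetical, not taken from the script in this PR):

```python
# Minimal sketch: append all files of a downloaded Kaggle dataset to one
# destination file in alphabetical sort order. With dry_run=True, only print
# what would be done, matching the dry-run-only intent of this PR.
import shutil
from pathlib import Path


def concatenate_alphabetically(src_dir: str, dst_file: str, dry_run: bool = True) -> None:
    parts = sorted(p for p in Path(src_dir).iterdir() if p.is_file())  # alphabetical order
    if dry_run:
        for p in parts:
            print(f"would append {p} -> {dst_file}")
        return
    with open(dst_file, "wb") as out:
        for p in parts:
            with open(p, "rb") as f:
                shutil.copyfileobj(f, out)
```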
@linrock Could you please add a full specification for the currently used datasets? I included an example for the dataset used in the first stage of the training. If any needed functionality is missing, let me know.
Even the first stage requires downloading 200 GB of data, so I'm unable to verify correctness right now. We'll see about it after we have the full process documented.