[WIP] Automatic script to fetch current datasets. #257

Draft
wants to merge 1 commit into
base: master

Conversation

Sopel97
Member

@Sopel97 Sopel97 commented Jul 6, 2023

The intended purpose of this script is to always document and fetch the datasets required for replicating the training of the current Stockfish master network.

Right now this is mostly a skeleton with a DSL for defining how the datasets are combined. Downloading from Kaggle and concatenation should work but are untested. Interleaving is not yet implemented. For now the script is only meant to be used in dry-run mode.
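To make the idea of a combining DSL with a dry-run mode concrete, here is a minimal sketch. It is not the script from this PR: the class names `KaggleDataset` and `Interleave`, the function `dry_run`, and the dataset slugs are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class KaggleDataset:
    """Hypothetical node: one Kaggle dataset that becomes one destination file."""
    slug: str          # placeholder slug such as "owner/dataset-name"
    destination: str   # file that the dataset's parts are concatenated into


@dataclass
class Interleave:
    """Hypothetical node: several sources combined into one file by interleaving."""
    destination: str
    sources: List[Union[KaggleDataset, "Interleave"]] = field(default_factory=list)


def dry_run(node, depth=0):
    """Return the planned steps as indented lines; nothing is downloaded or written."""
    pad = "  " * depth
    if isinstance(node, KaggleDataset):
        return [f"{pad}download {node.slug} -> concatenate into {node.destination}"]
    lines = [f"{pad}interleave into {node.destination}:"]
    for source in node.sources:
        lines.extend(dry_run(source, depth + 1))
    return lines


# Placeholder plan; the slugs and file names are made up for this example.
plan = Interleave(
    destination="combined.binpack",
    sources=[
        KaggleDataset("example-owner/dataset-a", "a.binpack"),
        KaggleDataset("example-owner/dataset-b", "b.binpack"),
    ],
)
print("\n".join(dry_run(plan)))
```

Describing the plan as plain data means the same tree could later drive real downloads, while the dry run only walks it and reports what would happen.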

A single Kaggle dataset is always combined into a single destination file by concatenating its files in alphabetical sort order. If this is too rigid a requirement we can work on relaxing it, but I think it works for all the current datasets.
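The concatenation rule can be sketched as follows. This is an illustration under assumptions rather than the code in this PR: the function name `concatenate_sorted` and the `*.binpack` glob pattern are hypothetical, and "alphabetical" is taken here to mean Python's default string ordering on file names.

```python
import shutil
from pathlib import Path


def concatenate_sorted(source_dir, destination, pattern="*.binpack"):
    """Append every matching file in source_dir to destination, ordered by file name."""
    destination = Path(destination)
    # Collect and sort before opening the output, and skip the output file itself
    # in case it lives in the same directory and matches the pattern.
    parts = sorted(
        (p for p in Path(source_dir).glob(pattern)
         if p.resolve() != destination.resolve()),
        key=lambda p: p.name,  # default string order: uppercase sorts before lowercase
    )
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # copies in chunks, not all at once
    return [p.name for p in parts]
```

Returning the ordered file names makes it easy to log or check the order that was actually used.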

@linrock Could you please add a full specification for the currently used datasets? I included an example for the dataset used in the first stage of the training. If there is any needed functionality missing let me know.

Even the first stage requires downloading 200GB of data, so I'm unable to verify correctness right now. We'll revisit that once the full process is documented.

@vondele
Member

vondele commented Jul 12, 2023

I tried this a couple of days ago; IMO it looks good. I would not delete intermediate files by default. I think for most people it is more difficult to download 200GB than to store it, even though you might have different constraints on your server.
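One way to express "keep intermediate files unless asked otherwise" is an opt-in command-line flag. The sketch below assumes an `argparse`-based CLI; the flag name `--delete-intermediate` is hypothetical and not taken from the script in this PR.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Fetch and combine training datasets (sketch)."
    )
    # Hypothetical flag: downloaded parts are kept unless this is passed explicitly.
    parser.add_argument(
        "--delete-intermediate",
        action="store_true",
        help="remove downloaded parts once the destination file has been written",
    )
    return parser


# With no arguments the flag is off, so intermediate files would be kept.
args = build_parser().parse_args([])
print(args.delete_intermediate)
```

A server with tight disk limits could then pass `--delete-intermediate`, while the default stays safe for users who would rather not download the data again.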

Let's try to complete the list of data sets for current master nets. I can definitely upload more data as needed.

@linrock
Contributor

linrock commented Jul 12, 2023

the current master stage2 dataset is composed of:
https://www.kaggle.com/datasets/joostvandevondele/t60t70wisrightfarseert60t74t75t76
https://www.kaggle.com/datasets/linrock/t78juntoaugt79mart80dec-16tb7p

   LeelaFarseer-T78juntoaugT79marT80dec.binpack (141G)
     T60T70wIsRightFarseerT60T74T75T76.binpack
     test78-junjulaug2022-16tb7p.no-db.min.binpack
     test79-mar2022-16tb7p.no-db.min.binpack
     test80-dec2022-16tb7p.no-db.min.binpack

i'll take a closer look at stage3 later. the current L1-2048 master final stage dataset is an unshuffled 800GB+ binpack that i'm no longer using since it's too inconvenient. i'm working on replacing it with a fully minimized ~330GB dataset which i'll document later as well.

@linrock
Contributor

linrock commented Jul 22, 2023

the current master stage3 dataset is:
https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
https://www.kaggle.com/datasets/linrock/sfnnv7-s3

   leela96-dfrc99-v2-T80dectofeb-sk20-mar-v6-T77decT78janfebT79apr.binpack (223G)
     leela96-filt-v2.min.binpack
     dfrc99-16tb7p-eval-filt-v2.min.binpack
     test80-dec2022-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-jan2023-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-feb2023-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-mar2023-2tb7p-filter-v6.min.binpack
     test77-dec2021-16tb7p.no-db.min.binpack
     test78-janfeb2022-16tb7p.no-db.min.binpack
     test79-apr2022-16tb7p.no-db.min.binpack

@linrock
Contributor

linrock commented Sep 10, 2023

the current master stage4/5 dataset is composed of:
https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
https://www.kaggle.com/datasets/linrock/0dd1cebea57-test80-v6-dd
https://www.kaggle.com/datasets/linrock/0dd1cebea57-misc-v6-dd
https://www.kaggle.com/datasets/linrock/test80-apr2023-2tb7p-no-db

the uploaded dataset components are all minimized. parts of the dataset were unminimized to increase randomness during training. however, it's unclear how much of an elo benefit this brings. see official-stockfish/Stockfish#4606 for more details on this particular dataset.

as of now, all datasets for training the current master net (nn-c38c3d8d3920.nnue) are documented in this PR.

@linrock
Contributor

linrock commented Sep 14, 2023


3 participants