This tutorial demonstrates how to use NeMo Curator's Python API to curate the TinyStories dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4, using only words that 3- to 4-year-olds typically understand. Its small size makes it ideal for creating and validating data curation pipelines on a local machine.
For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.
For a detailed walkthrough of this tutorial, please see the following blog post:
After installing the NeMo Curator package, you can simply run the following command:
python tutorials/tinystories/main.py
This will download the validation split of the TinyStories dataset and begin the data curation pipeline.
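To give a rough idea of what a "download, then split into documents" step can look like, here is a minimal standalone sketch. It is not the code in `main.py` and does not use the NeMo Curator API. The helper names `download_if_missing` and `split_stories` are hypothetical, the download URL must be supplied by the caller, and the `<|endoftext|>` story delimiter is an assumption about the raw file format that should be checked against the downloaded file.

```python
import os
import urllib.request


def download_if_missing(url, dest_path):
    """Download `url` to `dest_path` unless the file already exists.

    Hypothetical helper for illustration; the caller supplies the URL.
    """
    if not os.path.exists(dest_path):
        # Create the parent directory if the path has one.
        os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, dest_path)
    return dest_path


def split_stories(text, delimiter="<|endoftext|>"):
    """Split raw text into individual stories.

    Assumes stories are separated by `delimiter` (an assumption about the
    file format). Surrounding whitespace is stripped and empty chunks are
    dropped.
    """
    return [chunk.strip() for chunk in text.split(delimiter) if chunk.strip()]


if __name__ == "__main__":
    sample = "Once upon a time.\n<|endoftext|>\nTom had a ball.\n<|endoftext|>\n"
    for story in split_stories(sample):
        print(story)
```

Skipping the download when the file already exists keeps repeated local runs fast, which is useful when iterating on a pipeline.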