TinyStories

This tutorial demonstrates how to use NeMo Curator's Python API to curate the TinyStories dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4, using only words that 3- to 4-year-olds typically understand. Its small size makes it ideal for creating and validating data curation pipelines on a local machine.

For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.

Walkthrough

For a detailed walkthrough of this tutorial, please see the following blog post:

Usage

After installing the NeMo Curator package, run the following command from the root of the repository:

python tutorials/tinystories/main.py

This downloads the validation split of the TinyStories dataset and then runs the data curation pipeline.
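To give a rough idea of what the download step involves, here is a minimal sketch using only the Python standard library. It is not the code in `main.py` and does not use the NeMo Curator API. The URL, the `<|endoftext|>` story separator, and the helper names `download_validation_split` and `split_stories` are assumptions made for illustration; check them against the dataset's hosting page before relying on them.

```python
from __future__ import annotations

from pathlib import Path
from urllib.request import urlretrieve

# Assumed location of the validation split; this README does not state it.
VALID_URL = (
    "https://huggingface.co/datasets/roneneldan/TinyStories/"
    "resolve/main/TinyStories-valid.txt"
)
# Assumed separator between stories in the raw text file.
DELIMITER = "<|endoftext|>"


def download_validation_split(target_dir: str = "data") -> Path:
    """Download the validation file once and return its local path (hypothetical helper)."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "TinyStories-valid.txt"
    if not out_path.exists():  # skip the download if the file is already cached
        urlretrieve(VALID_URL, out_path)
    return out_path


def split_stories(raw_text: str, delimiter: str = DELIMITER) -> list[str]:
    """Split the raw text into individual stories, dropping empty chunks."""
    return [chunk.strip() for chunk in raw_text.split(delimiter) if chunk.strip()]
```

Under these assumptions, `split_stories(download_validation_split().read_text(encoding="utf-8"))` would return one string per story, which could then be handed to whatever curation steps follow.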