TinyStories

This tutorial demonstrates how to use NeMo Curator's Python API to curate the TinyStories dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4 that use only words typically understood by 3- to 4-year-olds. Because the dataset is small, it is well suited to creating and validating data curation pipelines on a local machine.

For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.

Walkthrough

For a detailed walkthrough of this tutorial, please see the following blog post:

Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator.

Usage

After installing the NeMo Curator package, run the following command from the root of the repository:

python tutorials/tinystories/main.py

This downloads the validation split of the TinyStories dataset and then runs the data curation pipeline on it.
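
To give a rough idea of what a curation pipeline does to a dataset like this, here is a minimal plain-Python sketch. It is not the tutorial's implementation and does not use the NeMo Curator API. The function names, the `<|endoftext|>` separator, and the `min_words` threshold are all assumptions made for illustration. It splits raw text into stories, drops very short ones, and removes exact duplicates.

```python
import hashlib
from typing import Iterable, Iterator, List

# Assumption: stories in the raw text are separated by this token.
SEPARATOR = "<|endoftext|>"


def split_stories(raw_text: str, separator: str = SEPARATOR) -> Iterator[str]:
    """Yield each non-empty story, stripped of surrounding whitespace."""
    for chunk in raw_text.split(separator):
        story = chunk.strip()
        if story:
            yield story


def filter_short(stories: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Keep only stories with at least `min_words` whitespace-separated words."""
    for story in stories:
        if len(story.split()) >= min_words:
            yield story


def drop_exact_duplicates(stories: Iterable[str]) -> Iterator[str]:
    """Keep the first occurrence of each story, compared by content hash."""
    seen = set()
    for story in stories:
        digest = hashlib.sha256(story.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield story


def curate(raw_text: str, min_words: int = 5) -> List[str]:
    """Chain the three hypothetical stages into one pipeline."""
    stories = split_stories(raw_text)
    stories = filter_short(stories, min_words)
    return list(drop_exact_duplicates(stories))
```

Each stage is a generator, so stories flow through one at a time and new stages can be slotted into `curate` without changing the others. The actual tutorial script may organize its steps differently.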
