- Image Curation
- Image Embedding Creation
- Aesthetic Classifier
- NSFW Classifier
- Semantic Deduplication
- Text Curation
- Quality Classifier
- Aegis Classifier
- FineWeb-Edu Classifier
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.5.0
- Add spacy<3.8 pin to r0.4.1 by @ayushdg in NVIDIA#279
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/compare/v0.4.0...v0.4.1
- Semantic Deduplication
- Resiliparse for Text Extraction
- Improved Distributed Data Classification: the domain classifier is 1.55x faster through intelligent batching
- Synthetic data generation for fine-tuning
- Update README by @ryantwolf in NVIDIA#6
- [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in NVIDIA#5
- Add workflow for running cpu pytests by @ayushdg in NVIDIA#13
- Add pre-commit style checks by @ayushdg in NVIDIA#14
- Add citation by @ryantwolf in NVIDIA#15
- Fix Noisy CUDA Shutdown by @ryantwolf in NVIDIA#20
- Bump Python and RAPIDS versions by @ryantwolf in NVIDIA#16
- Add batched decorator by @ryantwolf in NVIDIA#18
- Add issue templates by @ayushdg in NVIDIA#22
- Add dependency to fix justext by @ryantwolf in NVIDIA#24
- Fix metadata inference with pandas and dask by @ryantwolf in NVIDIA#35
- Disable PyTorch Compile Multiprocessing by @ryantwolf in NVIDIA#34
- Improve speed of AddId module by @ryantwolf in NVIDIA#36
- Make GPU dependencies optional by @ayushdg in NVIDIA#27
- Fix failing GPU tests with latest pandas bump by @ayushdg in NVIDIA#41
- Adds Nemo Curator K8s example by @terrykong in NVIDIA#40
- Move common dedup utils and remove unused code by @ayushdg in NVIDIA#42
- Fix lang id example by @ryantwolf in NVIDIA#37
- Add dataset blending tool by @ryantwolf in NVIDIA#32
- High level fuzzy duplicates module by @ayushdg in NVIDIA#46
- Fix indexing in PII Modifier by @ryantwolf in NVIDIA#55
- Disable string conversion globally by @ryantwolf in NVIDIA#56
- Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in NVIDIA#57
- [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in NVIDIA#45
- Only import PII constants during Curator import by @ayushdg in NVIDIA#61
- Align `extract_partitioning_index` logic with upstream shuffling by @rjzamora in NVIDIA#60
- [REVIEW] Switch Models to use Crossfit by @VibhuJawa in NVIDIA#58
- Remove argparse from get_client function signature by @ryantwolf in NVIDIA#12
- Fuzzy Dedup: Use text_field instead of hardcoded text column by @ayushdg in NVIDIA#74
- Add pull request template by @ayushdg in NVIDIA#78
- Add Jupyter notebook tutorial for single node multilingual dataset by @nicoleeeluo in NVIDIA#30
- Update issue templates by @ryantwolf in NVIDIA#81
- Fix #91 - Incorrect reference to domain_classifier_example.py by @miguelusque in NVIDIA#92
- Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes by @miguelusque in NVIDIA#75
- Update readme by @ayushdg in NVIDIA#93
- Update documentation for new version by @ryantwolf in NVIDIA#83
- Update requirements documentation. by @ayushdg in NVIDIA#98
- Make sure query-planning is disabled for now by @rjzamora in NVIDIA#97
- Applying SEO Best Practices by @aschilling-nv in NVIDIA#104
- Shuffle CC result on group before writing out by @ayushdg in NVIDIA#110
- Added tutorials to index.rst by @jgerh in NVIDIA#113
- Pin to numpy<2 to avoid spacy compat issues by @ayushdg in NVIDIA#119
- Fix #116. Fix broken links by @miguelusque in NVIDIA#117
- Update index.rst by @aschilling-nv in NVIDIA#129
- Fix nemo_curator import in CPU only environment when GPU packages are installed. by @ayushdg in NVIDIA#123
- Improve Common Crawl download by @ryantwolf in NVIDIA#82
- Update README.md by @Maghoumi in NVIDIA#126
- Allow multiple filenames per partition when separating by metadata by @ayushdg in NVIDIA#99
- [REVIEW] Add Resiliparse option for text extraction by @sarahyurick in NVIDIA#128
- Fix #69 - Refactor how arguments are added to scripts by @miguelusque in NVIDIA#102
- Stricter check for query planning. by @ayushdg in NVIDIA#107
- Add DataFrame example to Distributed Data Classification tutorial by @sarahyurick in NVIDIA#137
- Enable Sem-dedup by @VibhuJawa in NVIDIA#130
- Remove lxml installation by @ryantwolf in NVIDIA#140
- Nemotron 340 SDG Pipeline Tutorial by @chrisalexiuk-nvidia in NVIDIA#144
- Add Synthetic Data Generation Module by @ryantwolf in NVIDIA#136
- Skip explicit comms shuffle for dask-cuda 24.06 by @ayushdg in NVIDIA#147
- Add support for NeMo SDK by @ryantwolf in NVIDIA#131
- [REVIEW] Fix SemDedup bugs by @VibhuJawa in NVIDIA#151
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in NVIDIA#135
- Fix bug with torch rmm and nemo by @ryantwolf in NVIDIA#155
- @ayushdg made their first contribution in NVIDIA#13
- @terrykong made their first contribution in NVIDIA#40
- @rjzamora made their first contribution in NVIDIA#60
- @nicoleeeluo made their first contribution in NVIDIA#30
- @aschilling-nv made their first contribution in NVIDIA#104
- @pre-commit-ci made their first contribution in NVIDIA#135
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.4.0
- Update README by @ryantwolf in NVIDIA#6
- [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in NVIDIA#5
- Add workflow for running cpu pytests by @ayushdg in NVIDIA#13
- Add pre-commit style checks by @ayushdg in NVIDIA#14
- Add citation by @ryantwolf in NVIDIA#15
- Fix Noisy CUDA Shutdown by @ryantwolf in NVIDIA#20
- Bump Python and RAPIDS versions by @ryantwolf in NVIDIA#16
- Add batched decorator by @ryantwolf in NVIDIA#18
- Add issue templates by @ayushdg in NVIDIA#22
- Add dependency to fix justext by @ryantwolf in NVIDIA#24
- Fix metadata inference with pandas and dask by @ryantwolf in NVIDIA#35
- Disable PyTorch Compile Multiprocessing by @ryantwolf in NVIDIA#34
- Improve speed of AddId module by @ryantwolf in NVIDIA#36
- Make GPU dependencies optional by @ayushdg in NVIDIA#27
- Fix failing GPU tests with latest pandas bump by @ayushdg in NVIDIA#41
- Adds Nemo Curator K8s example by @terrykong in NVIDIA#40
- Move common dedup utils and remove unused code by @ayushdg in NVIDIA#42
- Fix lang id example by @ryantwolf in NVIDIA#37
- Add dataset blending tool by @ryantwolf in NVIDIA#32
- High level fuzzy duplicates module by @ayushdg in NVIDIA#46
- Fix indexing in PII Modifier by @ryantwolf in NVIDIA#55
- Disable string conversion globally by @ryantwolf in NVIDIA#56
- Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in NVIDIA#57
- [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in NVIDIA#45
- Only import PII constants during Curator import by @ayushdg in NVIDIA#61
- Align `extract_partitioning_index` logic with upstream shuffling by @rjzamora in NVIDIA#60
- @Maghoumi made their first contribution in NVIDIA#5
- @terrykong made their first contribution in NVIDIA#40
- @miguelusque made their first contribution in NVIDIA#57
- @rjzamora made their first contribution in NVIDIA#60
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.3.0