Changelog

NeMo Curator 0.5.0

Highlights

  • Image Curation
    • Image Embedding Creation
    • Aesthetic Classifier
    • NSFW Classifier
    • Semantic Deduplication
  • Text Curation
    • Quality Classifier
    • Aegis Classifier
    • FineWeb-Edu Classifier

Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.5.0

NeMo Curator 0.4.1

What's Changed

  • Add spacy<3.8 pin to r0.4.1 by @ayushdg in NVIDIA#279

Full Changelog: https://github.com/NVIDIA/NeMo-Curator/compare/v0.4.0...v0.4.1

NeMo Curator 0.4.0

Highlights

  • Semantic Deduplication
  • Resiliparse for Text Extraction
  • Improved Distributed Data Classification: the domain classifier is 1.55x faster through intelligent batching
  • Synthetic data generation for fine-tuning

What's Changed

  • Update README by @ryantwolf in NVIDIA#6
  • [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in NVIDIA#5
  • Add workflow for running cpu pytests by @ayushdg in NVIDIA#13
  • Add pre-commit style checks by @ayushdg in NVIDIA#14
  • Add citation by @ryantwolf in NVIDIA#15
  • Fix Noisy CUDA Shutdown by @ryantwolf in NVIDIA#20
  • Bump Python and RAPIDS versions by @ryantwolf in NVIDIA#16
  • Add batched decorator by @ryantwolf in NVIDIA#18
  • Add issue templates by @ayushdg in NVIDIA#22
  • Add dependency to fix justext by @ryantwolf in NVIDIA#24
  • Fix metadata inference with pandas and dask by @ryantwolf in NVIDIA#35
  • Disable PyTorch Compile Multiprocessing by @ryantwolf in NVIDIA#34
  • Improve speed of AddId module by @ryantwolf in NVIDIA#36
  • Make GPU dependencies optional by @ayushdg in NVIDIA#27
  • Fix failing GPU tests with latest pandas bump by @ayushdg in NVIDIA#41
  • Adds Nemo Curator K8s example by @terrykong in NVIDIA#40
  • Move common dedup utils and remove unused code by @ayushdg in NVIDIA#42
  • Fix lang id example by @ryantwolf in NVIDIA#37
  • Add dataset blending tool by @ryantwolf in NVIDIA#32
  • High level fuzzy duplicates module by @ayushdg in NVIDIA#46
  • Fix indexing in PII Modifier by @ryantwolf in NVIDIA#55
  • Disable string conversion globally by @ryantwolf in NVIDIA#56
  • Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in NVIDIA#57
  • [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in NVIDIA#45
  • Only import PII constants during Curator import by @ayushdg in NVIDIA#61
  • Align extract_partitioning_index logic with upstream shuffling by @rjzamora in NVIDIA#60
  • [REVIEW] Switch Models to use Crossfit by @VibhuJawa in NVIDIA#58
  • Remove argparse from get_client function signature by @ryantwolf in NVIDIA#12
  • Fuzzy Dedup: Use text_field instead of hardcoded text column by @ayushdg in NVIDIA#74
  • Add pull request template by @ayushdg in NVIDIA#78
  • Add Jupyter notebook tutorial for single node multilingual dataset by @nicoleeeluo in NVIDIA#30
  • Update issue templates by @ryantwolf in NVIDIA#81
  • Fix #91 - Incorrect reference to domain_classifier_example.py by @miguelusque in NVIDIA#92
  • Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes by @miguelusque in NVIDIA#75
  • Update readme by @ayushdg in NVIDIA#93
  • Update documentation for new version by @ryantwolf in NVIDIA#83
  • Update requirements documentation. by @ayushdg in NVIDIA#98
  • Make sure query-planning is disabled for now by @rjzamora in NVIDIA#97
  • Applying SEO Best Practices by @aschilling-nv in NVIDIA#104
  • Shuffle CC result on group before writing out by @ayushdg in NVIDIA#110
  • Added tutorials to index.rst by @jgerh in NVIDIA#113
  • Pin to numpy<2 to avoid spacy compat issues by @ayushdg in NVIDIA#119
  • Fix #116. Fix broken links by @miguelusque in NVIDIA#117
  • Update index.rst by @aschilling-nv in NVIDIA#129
  • Fix nemo_curator import in CPU only environment when GPU packages are installed. by @ayushdg in NVIDIA#123
  • Improve Common Crawl download by @ryantwolf in NVIDIA#82
  • Update README.md by @Maghoumi in NVIDIA#126
  • Allow multiple filenames per partition when separating by metadata by @ayushdg in NVIDIA#99
  • [REVIEW] Add Resiliparse option for text extraction by @sarahyurick in NVIDIA#128
  • Fix #69 - Refactor how arguments are added to scripts by @miguelusque in NVIDIA#102
  • Stricter check for query planning. by @ayushdg in NVIDIA#107
  • Add DataFrame example to Distributed Data Classification tutorial by @sarahyurick in NVIDIA#137
  • Enable Sem-dedup by @VibhuJawa in NVIDIA#130
  • Remove lxml installation by @ryantwolf in NVIDIA#140
  • Nemotron 340 SDG Pipeline Tutorial by @chrisalexiuk-nvidia in NVIDIA#144
  • Add Synthetic Data Generation Module by @ryantwolf in NVIDIA#136
  • Skip explicit comms shuffle for dask-cuda 24.06 by @ayushdg in NVIDIA#147
  • Add support for NeMo SDK by @ryantwolf in NVIDIA#131
  • [REVIEW] Fix SemDedup bugs by @VibhuJawa in NVIDIA#151
  • [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in NVIDIA#135
  • Fix bug with torch rmm and nemo by @ryantwolf in NVIDIA#155

New Contributors

  • @ayushdg made their first contribution in NVIDIA#13
  • @terrykong made their first contribution in NVIDIA#40
  • @rjzamora made their first contribution in NVIDIA#60
  • @nicoleeeluo made their first contribution in NVIDIA#30
  • @aschilling-nv made their first contribution in NVIDIA#104
  • @pre-commit-ci made their first contribution in NVIDIA#135

Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.4.0

NeMo Curator 0.3.0

What's Changed

  • Update README by @ryantwolf in NVIDIA#6
  • [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in NVIDIA#5
  • Add workflow for running cpu pytests by @ayushdg in NVIDIA#13
  • Add pre-commit style checks by @ayushdg in NVIDIA#14
  • Add citation by @ryantwolf in NVIDIA#15
  • Fix Noisy CUDA Shutdown by @ryantwolf in NVIDIA#20
  • Bump Python and RAPIDS versions by @ryantwolf in NVIDIA#16
  • Add batched decorator by @ryantwolf in NVIDIA#18
  • Add issue templates by @ayushdg in NVIDIA#22
  • Add dependency to fix justext by @ryantwolf in NVIDIA#24
  • Fix metadata inference with pandas and dask by @ryantwolf in NVIDIA#35
  • Disable PyTorch Compile Multiprocessing by @ryantwolf in NVIDIA#34
  • Improve speed of AddId module by @ryantwolf in NVIDIA#36
  • Make GPU dependencies optional by @ayushdg in NVIDIA#27
  • Fix failing GPU tests with latest pandas bump by @ayushdg in NVIDIA#41
  • Adds Nemo Curator K8s example by @terrykong in NVIDIA#40
  • Move common dedup utils and remove unused code by @ayushdg in NVIDIA#42
  • Fix lang id example by @ryantwolf in NVIDIA#37
  • Add dataset blending tool by @ryantwolf in NVIDIA#32
  • High level fuzzy duplicates module by @ayushdg in NVIDIA#46
  • Fix indexing in PII Modifier by @ryantwolf in NVIDIA#55
  • Disable string conversion globally by @ryantwolf in NVIDIA#56
  • Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in NVIDIA#57
  • [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in NVIDIA#45
  • Only import PII constants during Curator import by @ayushdg in NVIDIA#61
  • Align extract_partitioning_index logic with upstream shuffling by @rjzamora in NVIDIA#60

New Contributors

  • @Maghoumi made their first contribution in NVIDIA#5
  • @terrykong made their first contribution in NVIDIA#40
  • @miguelusque made their first contribution in NVIDIA#57
  • @rjzamora made their first contribution in NVIDIA#60

Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.3.0

PyPI

https://pypi.org/project/nemo-curator/0.3.0/