Releases: NVIDIA/NeMo

NVIDIA Neural Modules 2.0.0

14 Nov 18:57
e938df3

Highlights

Large language models & Multimodal

  • Training
    • Long context recipe
    • PyTorch Native FSDP 1
  • Models
    • Llama 3
    • Mixtral
    • Nemotron
  • NeMo 1.0
    • SDXL (text-2-image)
    • Model Opt
      • Depth Pruning (docs)
      • Logit based Knowledge Distillation (docs)

Export

  • TensorRT-LLM v0.12 integration
  • LoRA support for vLLM
  • FP8 checkpoint

ASR

  • Parakeet large (ASR with PnC model)
  • Added Uzbek offline and Georgian streaming models
  • Optimization feature for efficient bucketing to improve batch size utilization on GPUs

Detailed Changelogs

ASR

Changelog

TTS

Changelog

NLP / NMT

Changelog

NVIDIA Neural Modules 2.0.0rc1

15 Aug 21:55
579983f

Highlights

Large language models

  • PEFT: QLoRA support, LoRA/QLoRA for Mixture-of-Experts (MoE) dense layer
  • State Space Models & Hybrid Architecture support (Mamba2 and NV-Mamba2-hybrid)
  • Support Nemotron, Minitron, Gemma2, Qwen, RAG
  • Custom Tokenizer training in NeMo
  • Update the Auto-Configurator for EP, CP and FSDP

Multimodal

  • NeVA: Add SOTA LLM backbone support (Mixtral/LLaMA3) and suite of model parallelism support (PP/EP)
  • Support Language Instructed Temporal-Localization Assistant (LITA) on top of video NeVA

ASR

  • SpeechLM and SALM
  • Adapters for Canary Customization
  • PyTorch allocator in PyTorch 2.2 improves training speed up to 30% for all ASR models
  • CUDA Graphs for Transducer Inference
  • Replaced webdataset with Lhotse - gives up to 2x speedup
  • Transcription Improvements - Speedup and QoL Changes
  • ASR Prompt Formatter for multimodal Canary

Export & Deploy

  • In framework PyTriton deployment with backends:
    • PyTorch
    • vLLM
    • TRT-LLM update to 0.10
  • TRT-LLM C++ runtime

Detailed Changelogs

ASR

Changelog

TTS

Changelog

LLM/Multimodal

Changelog

NVIDIA Neural Modules 2.0.0rc0

06 Jun 05:46

Highlights

LLM and MM

Models

  • Megatron Core RETRO

    • Pre-training
    • Zero-shot Evaluation
  • Pretraining, conversion, evaluation, SFT, and PEFT for:

    • Mixtral 8X22B
    • Llama 3
    • SpaceGemma
  • Embedding Models Fine Tuning

    • Mistral
    • BERT
  • BERT models

    • Context Parallel
    • Distributed checkpoint
  • Video capabilities with NeVA

Performance

  • Distributed Checkpointing

    • Torch native backend
    • Parallel read/write
    • Async write
  • Multimodal LLM (LLAVA/NeVA)

    • Pipeline Parallelism support
    • Sequence packing support

Export

  • Integration of Export & Deploy Modules into NeMo Framework container
    • Upgrade to TRT-LLM 0.9

Speech (ASR & TTS)

Models

  • AED Multi Task Models (Canary) - Multi-Task Multi-Lingual Speech Recognition / Speech Translation model
  • Multimodal Domain - Speech LLM supporting SALM Model
  • Parakeet-tdt_ctc-1.1b Model - RTFx of > 1500 (can transcribe 1500 seconds of audio in 1 second)
  • Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
    • mel_codec_22khz_medium
    • mel_codec_44khz_medium
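
The RTFx figure quoted for Parakeet-tdt_ctc-1.1b above is the inverse real-time factor: seconds of audio processed per second of compute. A minimal sketch of the arithmetic follows; the function names `rtfx` and `rtf` are illustrative and not part of NeMo.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def rtf(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: seconds of compute spent per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# The example from the release note: 1500 seconds of audio transcribed in 1 second.
print(rtfx(1500.0, 1.0))  # 1500.0
```

A higher RTFx is better, while a lower RTF is better; the two are reciprocals of each other.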

Perf Improvements

  • Transcribe() upgrade - Enables one line transcribe with files, tensors, data loaders
  • Frame looping algorithm for RNNT faster decoding - Improves Real Time Factor (RTF) by 2-3x
  • CUDA Graphs + Label-Looping algorithm for RNN-T and TDT Decoding - Transducer Greedy decoding at over 1500x RTFx, on par with CTC Non-Autoregressive models
  • Semi Sorted Batching support - External User contribution that speeds up training by 15-30%.
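
As a rough illustration of the idea behind semi-sorted batching, the sketch below perturbs each utterance length with a little random noise, sorts by the perturbed value, cuts the order into batches, and shuffles the batches. This is one common formulation and an assumption on our part; it is not taken from the NeMo implementation, and `semi_sorted_batches` and `randomization_factor` are hypothetical names.

```python
import random
from typing import List, Sequence


def semi_sorted_batches(
    lengths: Sequence[float],
    batch_size: int,
    randomization_factor: float = 0.1,
    seed: int = 0,
) -> List[List[int]]:
    """Group sample indices into batches of similar (but not strictly sorted) length."""
    if not lengths:
        return []
    rng = random.Random(seed)
    # Noise width is a fraction of the length range; 0 gives fully sorted batches.
    noise = (max(lengths) - min(lengths)) * randomization_factor
    keyed = [
        (length + rng.uniform(-noise / 2, noise / 2), index)
        for index, length in enumerate(lengths)
    ]
    order = [index for _, index in sorted(keyed)]
    batches = [order[k:k + batch_size] for k in range(0, len(order), batch_size)]
    # Shuffle whole batches so training does not see lengths in ascending order.
    rng.shuffle(batches)
    return batches


demo = semi_sorted_batches([5.0, 1.0, 4.0, 2.0, 3.0, 6.0], batch_size=2)
print(demo)
```

Grouping similar lengths reduces padding within each batch, which is where a training speedup would be expected to come from; the noise keeps batch composition from being identical every epoch.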

Customization

  • Context biasing for CTC word stamping - Improve accuracy for custom vocabulary and pronunciation
  • Longform Inference
    • Longform inference support for AED models
  • Transcription of multi-channel audio for AED models

Misc

  • Upgraded webdataset - Speech and LLM / Multimodal unified container

Detailed Changelogs

ASR

Changelog
  • Enable using hybrid asr models in CTC Segmentation tool by @erastorgueva-nv :: PR: #8828
  • TDT confidence fix by @GNroy :: PR: #8982
  • Fix union type annotations for autodoc+mock-import rendering by @pzelasko :: PR: #8956
  • NeMo dev doc restructure by @yaoyu-33 :: PR: #8896
  • Improved random seed configuration for Lhotse dataloaders with docs by @pzelasko :: PR: #9001
  • Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization by @galv :: PR: #8964
  • [ASR] Support for transcription of multi-channel audio for AED models by @anteju :: PR: #9007
  • Add ASR latest news by @titu1994 :: PR: #9073
  • Fix docs errors and most warnings by @erastorgueva-nv :: PR: #9006
  • PyTorch CUDA allocator optimization for dynamic batch shape dataloading in ASR by @pzelasko :: PR: #9061
  • RNN-T and TDT inference: use CUDA graphs by default by @artbataev :: PR: #8972
  • Fix #8891 by supported GPU-side batched CTC Greedy Decoding by @galv :: PR: #9100
  • Update branch for notebooks and ci in release by @ericharper :: PR: #9189
  • Enable CUDA graphs by default only for transcription by @artbataev :: PR: #9196
  • rename paths2audiofiles to audio by @nithinraok :: PR: #9209
  • Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @andrusenkoau :: PR: #9233
  • Cherrypick: Support dataloader as input to audio for transcription (#9201) by @titu1994 :: PR: #9235
  • Update Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9252
  • Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @galv :: PR: #9243
  • Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @galv :: PR: #9246
  • Fix loading github raw images on notebook by @nithinraok :: PR: #9282
  • typos by @nithinraok :: PR: #9314
  • Re-enable cuda graphs in training modes. by @galv :: PR: #9338
  • add large model stable training fix and contrastive loss update for variable seq by @nithinraok :: PR: #9259
  • Fix conv1d package in r2.0.0rc0 by @pablo-garay :: PR: #9369
  • Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @titu1994 :: PR: #9350
  • Make a backward compatibility for old MSDD configs in label models by @tango4j :: PR: #9377
  • Force diarizer to use CUDA if cuda is available and if device=None. by @tango4j :: PR: #9380

TTS

Changelog

LLM and MM

Changelog

Export

Changelog

General Improvements

Changelog

NVIDIA Neural Modules 1.23.0

28 Feb 06:18
d2283e3

Highlights

Models

NVIDIA StarCoder 2 - 15B

NeMo Canary

Announcement - https://nvidia.github.io/NeMo/blogs/2024/2024-02-canary/

NeMo LLM

  • Falcon
  • Code Llama
  • StarCoder
  • GPT perf improvements
  • Context parallelism
  • Mistral
  • Mixtral (without expert parallelism)
  • Mcore GPT Dataset integration

NeMo MM

  • CLIP
  • Stable Diffusion (supporting LoRA)
  • Imagen
  • ControlNet (for SD)
  • Instruct pix2pix (for SD)
  • LLAVA
  • NeVA
  • DreamFusion++
  • NSFW filtering

NeMo ASR

  • Lhotse Dataloading support #7880
  • Canary: Multi task multi lingual ASR #8242
  • LongForm Audio for Diarization #7737
  • Faster algorithm for RNN-T Greedy #7926
  • Cache-Aware streaming notebook #8296

NeMo TTS

NeMo Vision

Known Issues

ASR

RNNT WER calculation when fused batch size > 1 during validation / test step()

Previously, the RNNT metric was stateful while the CTC one was not (r1.22.0, r1.23.0)

Therefore this calculation in the RNNT joint for fused operation worked properly. However, with the unification of metrics in r1.23.0, a bug was introduced: only the last sub-batch contributes to the computed scores, and results from earlier sub-batches are not accumulated. This is patched via #8587 and will be fixed in the next release.

Workaround: Explicitly disable fused batch size during inference using the following snippet:

from omegaconf import open_dict

model = ...  # an already-loaded RNNT model
decoding_cfg = model.cfg.decoding
# open_dict temporarily lifts struct mode so the key can be set
with open_dict(decoding_cfg):
    decoding_cfg.fused_batch_size = -1
model.change_decoding_strategy(decoding_cfg)

Note: This bug does not affect scores calculated via model.transcribe() (since it does not calculate metrics during inference, just text), or using the transcribe_speech.py or speech_to_text_eval.py in examples/asr.
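
To see why a metric that keeps only the last sub-batch gives a misleading score, the toy sketch below compares a WER accumulated over all sub-batches with one computed from the final sub-batch alone. The numbers are made up and the helper names are hypothetical; this does not use or reproduce NeMo's metric classes.

```python
from typing import List, Tuple

# Each entry is (word_errors, reference_words) for one sub-batch.
SubBatchStats = Tuple[int, int]


def accumulated_wer(sub_batches: List[SubBatchStats]) -> float:
    """WER over all sub-batches: total errors divided by total reference words."""
    total_errors = sum(errors for errors, _ in sub_batches)
    total_words = sum(words for _, words in sub_batches)
    return total_errors / total_words


def last_sub_batch_wer(sub_batches: List[SubBatchStats]) -> float:
    """Mimics the described symptom: only the final sub-batch is scored."""
    errors, words = sub_batches[-1]
    return errors / words


stats = [(2, 10), (5, 10), (0, 10)]  # illustrative values only
print(accumulated_wer(stats))     # 7 errors over 30 words
print(last_sub_batch_wer(stats))  # 0.0, because the last sub-batch had no errors
```

With a fused batch size of -1 the whole batch is scored in one pass, so there are no sub-batches whose statistics could be dropped.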

Two unit tests fail because a lhotse version update changed the expected results.

Container

For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo

docker pull nvcr.io/nvidia/nemo:24.01.speech

Detailed Changelogs

ASR

Changelog

TTS

Changelog
  • [TTS] Scale sampler steps by number of devices by @rlangman :: PR: #7947
  • Add All Multimodal Source Code Part 2: Text to image, x to nerf by @yaoyu-33 :: PR: #7970
  • [TTS] Add period discriminator and feature matching loss to codec recipe by @rlangman :: PR: #7884
  • Added VectorQuantizer base class by @anteju :: PR: #8011

LLMs

Changelog
  • Add interface to set NCCL options of each process group by @erhoo82 :: PR: #7923
  • Support O2 training of PEFT and SFT by @cuichenx :: PR: #7971
  • [NLP] Access scaler only in FP16 case by @janekl :: PR: #7916
  • [NLP] Minor improvements in Llama conversion script by @janekl :: PR: #7978
  • [NLP] Use helpers from utils_funcs.py in Llama conversion by @janekl :: PR: #7979
  • [NLP] Remove replace_sampler_ddp (deprecated in Trainer) by @janekl :: PR: #7981
  • Reworked MegatronPretrainingRandomBatchSampler to correctly handle epochs > 1 by @trias702 :: PR: #7920
  • Remove deprecated arguments from TE's TransformerLayer by @jbaczek :: PR: #7917
  • Add All Multimodal Source Code by @yaoyu-33 :: PR: #7791
  • First draft of mcore bert model in NeMo by @shanmugamr1992 :: PR: #7814
  • Support Falcon Variants (7B/40B/180B) in Mcore NeMo by @xuanzic :: PR: #7666
  • FSDP + Tensor Parallelism by @erhoo82 :: PR: #7897
  • Packed Sequence by @cuichenx :: PR: #7945
  • Adding method back that was removed accidentally by @ericharper :: PR: #8038
  • [NLP] ArtifactItem with init=True to make it debuggable by @janekl :: PR: #7980
  • SFT patch: (1) enable sequence parallelism and (2) enable profile by @erhoo82 :: PR: #7963
  • migration to PTL 2.0 for spellmapper model by @bene-ges :: PR: #7924
  • Change the megatron config lr scheduler default and fix to change partitions script by @shan18 :: PR: #8094
  • (1) Add SHARP interface to M-CORE, (2) use send/recv to send train loss to the first rank instead of b-cast by @erhoo82 :: PR: #7793
  • Reconfigure limit_val_batches only for int by @athitten :: PR: #8099
  • Fixing wrapper and moving it to base class by @shanmugamr1992 :: PR: #8055
  • fix gated_linear_unit bug by @Agoniii :: PR: #8042
  • Fix Adapter for MCore models by @cuichenx :: PR: #8124
  • add war fix for sync issues by @gshennvm :: PR: #8130
  • Improve PEFT UX by @cuichenx :: PR: #8131
  • Enhance flexibility by passing callbacks as method argument by @michal2409 :: PR: #8015
  • context parallelism by @xrennvidia :: PR: #7739
  • Make pipelined TP comm overlap available with mcore by @erhoo82 :: PR: #8005
  • remove deprecated scripts by @arendu :: PR: #8138
  • adding OnlineSampleMapping by @arendu :: PR: #8137
  • Add distopt support for FP8 params and BF16 optimizer state by @timmoon10 :: PR: #7909
  • Revert adding OnlineSampleMapping by @pablo-garay :: PR: #8164
  • Token count and sequence length logging for MegatronGPTSFTModel by @vysarge :: PR: #8136
  • Use latest apex internal API by @jbaczek :: PR: #8129
  • tune specific params in the base model by @arendu :: PR: #7745
  • Virtual pipeline parallel support for MegatronGPTSFTModel by @vysarge :: PR: #7964
  • removed deprecated peft model by @arendu :: PR: #8183
  • remove more deprecated files by @arendu :: PR: #8169
  • Pre-generate cu_seqlens argmin and max_seqlen to remove host-to-device sync by @erhoo82 :: PR: #8108
  • Add the interface to use SHARP to FSDP strategy by @erhoo82 :: PR: #8202
  • Multimodal required NLP base model changes by @yaoyu-33 :: PR: #8188
  • [NLP] Improve and unify loading state_dict for community models by @janekl :: PR: #7977
  • Rename Finetuning Scripts by @cuichenx :: PR: #8201
  • Final multimodal PR with our recent developments on MM side by @yaoyu-33 :: PR: #8127
  • Add include_text parameter to SFT dataloaders by @Kipok :: PR: #8198
  • Add random_seed argument to generate by @Kipok :: PR: #8162
  • Added support for neptune logger by @harishankar-gopalan :: PR: #8210
  • Pre-compute max_seqlen and cu_seqlens_argmin in all model-parallel cases by @erhoo82 :: PR: #8222
  • Use PackedSeqParams in accordance with changes in Megatron-LM by @cuichenx :: PR: #8205
  • Fix to peft & virtual pipeline parallel unsupported check by @vysarge :: PR: #8216
  • Fixed the tp overlap switch by @sanandaraj5597 :: PR: #8195
  • add knobs for rope/swiglu fusion by @lhb8125 :: PR: #8184
  • Added sample cpu_offloading switch to YAML by @sanandaraj5597 :: PR: #8148
  • Syncing random seed between ranks in generate by @Kipok :: PR: #8230
  • add first_val_step to mcore scheduler by @JimmyZhang12 :: PR: #8150
  • Correct padding for SFT input data to account for sequence parallel + TE's fp8 op dimension requirements by @vysarge :: PR: #8240
  • Mistral 7b conversion script by @akoumpa :: PR: #8052
  • switch to mcore dataset [with FIM support] by @dimapihtar :: PR: #8149
  • Mixtral to NeMo conversion script. by @akoumpa :: PR: #8155
  • fixes to accomendate mcore changes by @HuiyingLi :: PR: #8261
  • Allow MegatronPretrainingRandomSample...

NVIDIA Neural Modules 1.22.0

11 Jan 02:04

Highlights

Models

NeMo Parakeet

Announcement - https://nvidia.github.io/NeMo/blogs/2024/2024-01-parakeet/

NeMo Parakeet-TDT

Announcement - https://nvidia.github.io/NeMo/blogs/2024/2024-01-parakeet-tdt/

ASR

NeMo ASR

  • Multi-lookahead cache-aware streaming Conformer #6711
  • Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) by @burchim #7330
  • Speech enhancement tutorial #6492
  • Support punctuation error rate #7538

Container

For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo

docker pull nvcr.io/nvidia/nemo:23.10

Detailed Changelogs

ASR

Changelog

TTS

Changelog

LLM

Changelog

General Improvements

Changelog

NVIDIA Neural Modules 1.21.0

25 Oct 23:27
c0022ae

Highlights

Models

NeMo ASR

  • Multi-lookahead cache-aware streaming
  • Speech enhancement tutorial #6492
  • Online code switching dataset #6579

NeMo TTS

  • AudioCodec: Training recipe for EnCodec #6852

NeMo Framework

  • GPT from Mcore #7093
  • GPT distributed checkpointing #7116
  • Hidden transformations #6332
  • Llama-2 #7299

NeMo Core

  • Update to PTL 2.0 #6433

NeMo Tools

  • Forced aligner tutorial #7210

Container

For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo

docker pull nvcr.io/nvidia/nemo:23.08

ASR

Changelog

TTS

Changelog

NLP / NMT

Changelog

NVIDIA Neural Modules 1.20.0

04 Aug 19:50
2baef81

Highlights

Models

NeMo ASR

  • Graph-RNN-T #6168
  • WildCard-RNN-T #6168
  • Confidence Ensembles for ASR
  • Token-and-Duration Transducer (TDT) #6536
  • Spellchecking ASR #6179
  • Numba FP16 RNNT Loss #6991

NeMo TTS

  • TTS Adapter Customization
  • TTS Dataloader Framework

NeMo Framework

  • LoRA for T5 and mT5 #6612
  • Flash Attention integration #6666
  • Mosaic 7B compatibility
  • Models with LongContext (32K) #6666, #6687, #6773

NeMo Tools

  • Speech Data Explorer: Utterance level ASR model comparison #6669
  • Speech Data Processor: Spanish P&C
  • NeMo Forced Aligner: Large sequence alignment + memory reduction #6695

Container

For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo

docker pull nvcr.io/nvidia/nemo:23.06

Detailed Changelogs

ASR

Changelog

TTS

Changelog
  • [TTS] Add callback for saving audio during FastPitch training by @rlangman :: PR: #6665
  • [TTS] Add script for text preprocessing by @rlangman :: PR: #6541
  • [TTS] Fix adapter duration issue by @hsiehjackson :: PR: #6697
  • [TTS] Filter out silent audio files during preprocessing by @rlangman :: PR: #6716
  • [TTS] fix inconsistent type hints for IpaG2p by @XuesongYang :: PR: #6733
  • [TTS] relax hardcoded prefix for phonemes and tones and infer phoneme set through dict by @XuesongYang :: PR: #6735
  • [TTS] corrected misleading deprecation warnings. by @XuesongYang :: PR: #6702
  • Fix TTS adapter tutorial by @hsiehjackson :: PR: #6741
  • [TTS][zh] refine hardcoded lowercase for ASCII letters. by @XuesongYang :: PR: #6781
  • [TTS] Append pretrained FastPitch & SpectrogamEnhancer pair to available models by @racoiaws :: PR: #7012

NLP / NMT

Changelog

NeMo Tools

Changelog

Bugfixes

Changelog

General Improvements

Changelog

NVIDIA Neural Modules 1.19.1

13 Jul 20:42

This release is a small patch to fix torchmetrics.

  • Remove deprecated arg compute_on_step. See #6979.

NVIDIA Neural Modules 1.19.0

15 Jun 23:46
2331b06

Highlights

NeMo ASR

  • Sharded Manifests for Tarred Datasets #6395
  • Frame-VAD model + datasets support #6441
  • Noise Norm Perturbation #6445
  • Code Switched Dataset with IID Sampling #6448

NeMo TTS

NeMo Megatron

  • Batch size rampup #6424
  • Unify dataset and model classes for all PEFT #6391
  • LoRA for GPT #6391
  • Convert interleaved pipeline model to non-interleaved #6498
  • Dialog Dataset for SFT #6654
  • Dynamic length batches for GPT SFT #6510
  • Merge LoRA weights into base model #6597

Container

For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo

docker pull nvcr.io/nvidia/nemo:23.04

Detailed Changelogs

ASR

Changelog

TTS

Changelog

NLP / NMT

Changelog

Bugfixes

Changelog

General Improvements

Changelog

NVIDIA Neural Modules 1.18.1

17 May 19:09

Highlights

For the complete release note, please see NeMo 1.18.0 Release Notes

Bugfix

This patch release fixes a major bug in ASR Bucketing datasets that was introduced in r1.17.0 in PR #6191. Due to this bug, while each bucket is randomly shuffled before selection on each rank, only a single bucket would loop infinitely - without continuing onto subsequent buckets.

Effect: Significantly worse WER would be obtained since not all buckets would be used.

This has been patched and should work correctly in 1.18.1 onwards.
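
The symptom described above can be pictured with a toy model: one stream cycles a single bucket forever, the other moves on to each bucket in turn. This is only a sketch of the behavior, assuming buckets are plain lists; `buggy_stream` and `fixed_stream` are hypothetical names and do not reproduce the NeMo dataset code or the actual patch.

```python
import itertools
from typing import Iterator, List


def buggy_stream(buckets: List[List[str]]) -> Iterator[str]:
    """Loops over the first bucket indefinitely and never reaches the others."""
    return itertools.cycle(buckets[0])


def fixed_stream(buckets: List[List[str]]) -> Iterator[str]:
    """Visits every bucket once, continuing to the next when one is exhausted."""
    for bucket in buckets:
        yield from bucket


buckets = [["a1", "a2"], ["b1", "b2"], ["c1"]]  # illustrative sample IDs

# Draw ten samples from the buggy stream: only bucket "a" is ever seen.
seen_buggy = set(itertools.islice(buggy_stream(buckets), 10))
seen_fixed = set(fixed_stream(buckets))
print(sorted(seen_buggy))
print(sorted(seen_fixed))
```

A model trained on the first stream never sees utterances from the other buckets, which is consistent with the degraded WER reported for the affected releases.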

Container

For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo

docker pull nvcr.io/nvidia/nemo:23.03