
NeMo 2.0 support #293

Open
TaekyungHeo wants to merge 1 commit into main from nemo2.0

Conversation

TaekyungHeo (Member) commented Oct 29, 2024

Summary

This PR introduces support for NeMo 2.0 in CloudAI. Initially, we planned to dump Fiddle configurations to a file and load them in NeMo-Run, but I changed the approach to execute the model through NeMo-Run directly: Marc Romejin informed me that we can run a task with a recipe without generating an sbatch script, using what NeMo-Run calls a "direct executor". To run NeMo 2.0, you can use the following command:

$ srun -t "60:00" --account=hw_nsw_misc --ntasks-per-node=8 --container-image=nvcr.io/nvidia/nemo:dev --pty nemo llm pretrain -y --factory llama3_8b trainer.max_steps=5 log.ckpt.save_on_train_epoch_end=False log.ckpt.save_last=False
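
For comparison, here is a minimal sketch of the same run through the NeMo-Run Python API rather than the nemo CLI. It follows NeMo-Run's documented recipe/executor pattern; it is illustrative, not code from this PR, and it assumes an environment (e.g. the nvcr.io/nvidia/nemo:dev container) where nemo_run and the NeMo 2.0 recipes are available:

    import nemo_run as run
    from nemo.collections import llm

    # Build the llama3_8b pretraining recipe and apply the same overrides
    # as the CLI command above (short run, no checkpoint at epoch end).
    recipe = llm.llama3_8b.pretrain_recipe(num_nodes=1, num_gpus_per_node=8)
    recipe.trainer.max_steps = 5
    recipe.log.ckpt.save_on_train_epoch_end = False
    recipe.log.ckpt.save_last = False

    # "Direct" execution: no sbatch script is generated; the task runs
    # in the current allocation via torchrun.
    executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
    run.run(recipe, executor=executor)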

Test Plan

  1. CI passes
  2. Ran on a server
$ cloudai run --system-config ~/cloudaix/conf/common/system/eos.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml                                 

/home/theo/scratch/miniconda3/envs/test4/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.20) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: EOS
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: nemo_run_llama3_8b
  Test Name: nemo_run_llama3_8b
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: nemo_run_llama3_8b
[INFO] Running test: nemo_run_llama3_8b
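
The scenario file passed on the command line above, conf/common/test_scenario/nemo_run_llama3_8b.toml, would look roughly like the sketch below. The field names follow my understanding of CloudAI's test scenario schema and are illustrative, not the actual file from this PR:

    name = "nemo_run_llama3_8b"

    [[Tests]]
    id = "Tests.1"
    test_name = "nemo_run_llama3_8b"
    time_limit = "01:00:00"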

$ cd results/nemo_run_llama3_8b_2024-11-15_10-16-03/nemo_run_llama3_8b/0
$ tail stdout.txt 
        module.decoder.layers.0.self_attention.linear_proj.weight
        module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
    Params for bucket 98 (206045184 elements):
        module.embedding.word_embeddings.weight
[NeMo I 2024-11-15 10:22:04 utils:259] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')
Training epoch 0, iteration 0/4 | lr: 1.499e-07 | consumed_samples: 512 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 61.94
Training epoch 0, iteration 1/4 | lr: 2.999e-07 | consumed_samples: 1024 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 53.67
Training epoch 0, iteration 2/4 | lr: 4.498e-07 | consumed_samples: 1536 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 52.45
Training epoch 0, iteration 3/4 | lr: 5.997e-07 | consumed_samples: 2048 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 52.54
Training epoch 0, iteration 4/4 | lr: 7.496e-07 | consumed_samples: 2560 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 53.16
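
As a rough sanity check on these numbers: each step consumes a global batch of 512 samples in about 53 s at steady state, i.e. roughly 512 / 53 ≈ 9.6 samples/s (the first step is slower due to warmup). A small hypothetical helper for pulling this out of stdout.txt, assuming the log format shown above:

    import re

    # Parse "global_batch_size: N | ... | train_step_timing in s: T" lines
    # and report mean per-step throughput in samples/s.
    pattern = re.compile(r"global_batch_size: (\d+).*train_step_timing in s: ([\d.]+)")
    rates = []
    with open("stdout.txt") as f:
        for line in f:
            m = pattern.search(line)
            if m:
                rates.append(int(m.group(1)) / float(m.group(2)))
    if rates:
        print(f"mean throughput: {sum(rates) / len(rates):.1f} samples/s")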

@TaekyungHeo TaekyungHeo added the feature and Jan25 (Jan'25 release) labels Oct 29, 2024
@TaekyungHeo TaekyungHeo changed the title NeMo 2.0 NeMo 2.0 Support Oct 29, 2024
@TaekyungHeo TaekyungHeo changed the title NeMo 2.0 Support NeMo 2.0 support Oct 29, 2024
@TaekyungHeo TaekyungHeo force-pushed the nemo2.0 branch 21 times, most recently from e3ca13b to 68025cc on November 15, 2024 19:37
@TaekyungHeo TaekyungHeo marked this pull request as ready for review November 15, 2024 21:20