
NeMo 2.0 support #293

Open
TaekyungHeo wants to merge 1 commit into main from nemo2.0

Conversation

TaekyungHeo (Member) commented Oct 29, 2024

Summary

This PR introduces support for NeMo 2.0 in CloudAI. Initially, we planned to dump Fiddle configurations to a file and load them in NeMo-Run, but I changed the approach to execute the model through NeMo-Run directly: Marc Romejin informed me that we can run a task with a recipe without generating an sbatch script, using what NeMo-Run calls a "direct executor". To run NeMo 2.0, you can use the following command:

$ srun -t "60:00" --account=hw_nsw_misc --ntasks-per-node=8 --container-image=nvcr.io/nvidia/nemo:dev --pty nemo llm pretrain -y --factory llama3_8b trainer.max_steps=5 log.ckpt.save_on_train_epoch_end=False log.ckpt.save_last=False
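
For comparison, here is a minimal sketch of the same run through the NeMo-Run Python API rather than the nemo CLI. It follows NeMo-Run's documented recipe/executor pattern; it is illustrative, not code from this PR, and it assumes an environment (e.g. the nvcr.io/nvidia/nemo:dev container) where nemo_run and the NeMo 2.0 recipes are available:

    import nemo_run as run
    from nemo.collections import llm

    # Build the llama3_8b pretraining recipe and apply the same overrides
    # as the CLI command above (short run, no checkpoint at epoch end).
    recipe = llm.llama3_8b.pretrain_recipe(num_nodes=1, num_gpus_per_node=8)
    recipe.trainer.max_steps = 5
    recipe.log.ckpt.save_on_train_epoch_end = False
    recipe.log.ckpt.save_last = False

    # "Direct" execution: no sbatch script is generated; the task runs
    # in the current allocation via torchrun.
    executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
    run.run(recipe, executor=executor)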

Test Plan

  1. CI passes
  2. Ran on a server
$ cloudai run --system-config ~/cloudaix/conf/common/system/eos.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml                                 

/home/theo/scratch/miniconda3/envs/test4/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.20) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: EOS
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: nemo_run_llama3_8b
  Test Name: nemo_run_llama3_8b
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: nemo_run_llama3_8b
[INFO] Running test: nemo_run_llama3_8b
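
The scenario file passed on the command line above, conf/common/test_scenario/nemo_run_llama3_8b.toml, would look roughly like the sketch below. The field names follow my understanding of CloudAI's test scenario schema and are illustrative, not the actual file from this PR:

    name = "nemo_run_llama3_8b"

    [[Tests]]
    id = "Tests.1"
    test_name = "nemo_run_llama3_8b"
    time_limit = "01:00:00"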

$ cd results/nemo_run_llama3_8b_2024-11-15_10-16-03/nemo_run_llama3_8b/0
$ tail stdout.txt 
        module.decoder.layers.0.self_attention.linear_proj.weight
        module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
    Params for bucket 98 (206045184 elements):
        module.embedding.word_embeddings.weight
[NeMo I 2024-11-15 10:22:04 utils:259] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')
Training epoch 0, iteration 0/4 | lr: 1.499e-07 | consumed_samples: 512 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 61.94
Training epoch 0, iteration 1/4 | lr: 2.999e-07 | consumed_samples: 1024 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 53.67
Training epoch 0, iteration 2/4 | lr: 4.498e-07 | consumed_samples: 1536 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 52.45
Training epoch 0, iteration 3/4 | lr: 5.997e-07 | consumed_samples: 2048 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 52.54
Training epoch 0, iteration 4/4 | lr: 7.496e-07 | consumed_samples: 2560 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 53.16
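
As a rough sanity check on these numbers: each step consumes a global batch of 512 samples in about 53 s at steady state, i.e. roughly 512 / 53 ≈ 9.6 samples/s (the first step is slower due to warmup). A small hypothetical helper for pulling this out of stdout.txt, assuming the log format shown above:

    import re

    # Parse "global_batch_size: N | ... | train_step_timing in s: T" lines
    # and report mean per-step throughput in samples/s.
    pattern = re.compile(r"global_batch_size: (\d+).*train_step_timing in s: ([\d.]+)")
    rates = []
    with open("stdout.txt") as f:
        for line in f:
            m = pattern.search(line)
            if m:
                rates.append(int(m.group(1)) / float(m.group(2)))
    if rates:
        print(f"mean throughput: {sum(rates) / len(rates):.1f} samples/s")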

@TaekyungHeo TaekyungHeo added the feature and Jan25 (Jan'25 release) labels Oct 29, 2024
@TaekyungHeo TaekyungHeo changed the title NeMo 2.0 NeMo 2.0 Support Oct 29, 2024
@TaekyungHeo TaekyungHeo changed the title NeMo 2.0 Support NeMo 2.0 support Oct 29, 2024
@TaekyungHeo TaekyungHeo force-pushed the nemo2.0 branch 21 times, most recently from e3ca13b to 68025cc on November 15, 2024 19:37
@TaekyungHeo TaekyungHeo marked this pull request as ready for review November 15, 2024 21:20