When I use the container to do SFT for any model, it fails with a "context not found" error #11825

munger1985 · 2025-01-11T15:05:52Z

Container: docker pull nvcr.io/nvidia/nemo:24.12

nemo llm finetune -f llama32_1b peft=lora # acceptable values are lora/dora/none

It fails with an error like: No such file: '/root/.cache/nemo/models/meta-llama/Llama-3.2-1B/context'
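
To confirm what the error is pointing at, here is a minimal sketch that checks whether that cache directory and its context entry exist inside the container. The helper name describe_checkpoint is hypothetical and not part of NeMo; the path is the one from the error message above.

```python
from pathlib import Path


def describe_checkpoint(path):
    """Report whether `path` and its `context` entry exist.

    Hypothetical helper for debugging; it only inspects the filesystem
    and does not call into NeMo.
    """
    root = Path(path).expanduser()
    return {
        "root_exists": root.exists(),
        "context_exists": (root / "context").exists(),
    }


if __name__ == "__main__":
    # Path copied from the error message; adjust for other models.
    print(describe_checkpoint("/root/.cache/nemo/models/meta-llama/Llama-3.2-1B"))
```

If both values come back False, nothing has been placed at that location yet, which would match the FileNotFoundError.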

The documentation I followed:
https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/peft_nemo2.html

dir="/checkpoints/llama3.2_1b", #

Here, what is dir? Isn't it the directory where the Hugging Face model was downloaded?

munger1985 · 2025-01-11T15:10:26Z

09-53
i.finetune/0 GPU available: True (cuda), used: True
i.finetune/0 TPU available: False, using: 0 TPU cores
i.finetune/0 HPU available: False, using: 0 HPUs
i.finetune/0 [NeMo W 2025-01-11 15:09:53 nemo_logger:173] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to tb_logs
i.finetune/0 [NeMo W 2025-01-11 15:09:53 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.
i.finetune/0 [NeMo W 2025-01-11 15:09:53 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
i.finetune/0 Traceback (most recent call last):
i.finetune/0 File "/opt/NeMo/nemo/lightning/io/api.py", line 57, in load_context
i.finetune/0 return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.finetune/0 File "/opt/NeMo/nemo/lightning/io/mixin.py", line 720, in load
i.finetune/0 raise FileNotFoundError(f"No such file: '{_path}'")
i.finetune/0 FileNotFoundError: No such file: '/root/.cache/nemo/models/baichuan-inc/Baichuan2-7B-Base'
i.finetune/0
i.finetune/0 During handling of the above exception, another exception occurred:
i.finetune/0
i.finetune/0 Traceback (most recent call last):
i.finetune/0 File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
i.finetune/0 return _run_code(code, main_globals, None,
i.finetune/0 File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
i.finetune/0 exec(code, run_globals)
i.finetune/0 File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
i.finetune/0 fdl_runner_app()
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 338, in __call__
i.finetune/0 raise e
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 321, in __call__
i.finetune/0 return get_command(self)(*args, **kwargs)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
i.finetune/0 return self.main(*args, **kwargs)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 665, in main
i.finetune/0 return _main(
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 197, in _main
i.finetune/0 rv = self.invoke(ctx)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
i.finetune/0 return ctx.invoke(self.callback, **ctx.params)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
i.finetune/0 return __callback(*args, **kwargs)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 703, in wrapper
i.finetune/0 return callback(**use_params)
i.finetune/0 File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 62, in fdl_direct_run
i.finetune/0 fdl_fn()
i.finetune/0 File "/opt/NeMo/nemo/collections/llm/api.py", line 201, in finetune
i.finetune/0 return train(
i.finetune/0 File "/opt/NeMo/nemo/collections/llm/api.py", line 96, in train
i.finetune/0 app_state = _setup(
i.finetune/0 File "/opt/NeMo/nemo/collections/llm/api.py", line 858, in _setup
i.finetune/0 resume.setup(trainer, model)
i.finetune/0 File "/opt/NeMo/nemo/lightning/resume.py", line 140, in setup
i.finetune/0 _try_restore_tokenizer(model, context_path)
i.finetune/0 File "/opt/NeMo/nemo/lightning/resume.py", line 44, in _try_restore_tokenizer
i.finetune/0 tokenizer = load_context(ckpt_path, "model.tokenizer")
i.finetune/0 File "/opt/NeMo/nemo/lightning/io/api.py", line 64, in load_context
i.finetune/0 return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.finetune/0 File "/opt/NeMo/nemo/lightning/io/mixin.py", line 720, in load
i.finetune/0 raise FileNotFoundError(f"No such file: '{_path}'")
i.finetune/0 FileNotFoundError: No such file: '/root/.cache/nemo/models/baichuan-inc/Baichuan2-7B-Base/context'
[15:09:55] INFO Job nemo.collections.llm.api.finetune-v1xxmknq0xfmmc finished: FAILED

munger1985 · 2025-01-11T15:10:53Z

Nothing I tried succeeded.

munger1985 added the bug (Something isn't working) label on Jan 11, 2025
