When I use the container to run SFT on any model, I get a "context not found" error #11825

Open
munger1985 opened this issue Jan 11, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@munger1985

Container: docker pull nvcr.io/nvidia/nemo:24.12

nemo llm finetune -f llama32_1b peft=lora  # acceptable values are lora/dora/none

It fails with an error like: No such file: '/root/.cache/nemo/models/meta-llama/Llama-3.2-1B/context'
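To see what the loader is complaining about, the cache directory named in the error can be inspected directly. The sketch below is a diagnostic only and not part of NeMo: `missing_checkpoint_parts` is a hypothetical helper, `context` is taken from the error message, and the expectation that a `weights` subdirectory sits next to it is an assumption.

```python
from pathlib import Path

# "context" comes from the error message above; "weights" is an assumption
# about what else a usable checkpoint directory would contain.
EXPECTED_SUBDIRS = ("context", "weights")


def missing_checkpoint_parts(ckpt_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectory names that are absent from ckpt_dir."""
    root = Path(ckpt_dir).expanduser()
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Path copied from the error message.
    ckpt = "/root/.cache/nemo/models/meta-llama/Llama-3.2-1B"
    missing = missing_checkpoint_parts(ckpt)
    if missing:
        print(f"{ckpt} is missing: {', '.join(missing)}")
    else:
        print(f"{ckpt} has all expected subdirectories")
```

If `context` is reported missing, the directory either does not exist or does not hold a checkpoint in the layout the loader expects.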

The documentation I followed:
https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/peft_nemo2.html

dir="/checkpoints/llama3.2_1b", #

What is this dir here? Isn't it the directory of the downloaded Hugging Face model?

@munger1985 munger1985 added the bug Something isn't working label Jan 11, 2025
@munger1985
Author

i.finetune/0 GPU available: True (cuda), used: True
i.finetune/0 TPU available: False, using: 0 TPU cores
i.finetune/0 HPU available: False, using: 0 HPUs
i.finetune/0 [NeMo W 2025-01-11 15:09:53 nemo_logger:173] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to tb_logs
i.finetune/0 [NeMo W 2025-01-11 15:09:53 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.
i.finetune/0 [NeMo W 2025-01-11 15:09:53 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
i.finetune/0 Traceback (most recent call last):
i.finetune/0 File "/opt/NeMo/nemo/lightning/io/api.py", line 57, in load_context
i.finetune/0 return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.finetune/0 File "/opt/NeMo/nemo/lightning/io/mixin.py", line 720, in load
i.finetune/0 raise FileNotFoundError(f"No such file: '{_path}'")
i.finetune/0 FileNotFoundError: No such file: '/root/.cache/nemo/models/baichuan-inc/Baichuan2-7B-Base'
i.finetune/0
i.finetune/0 During handling of the above exception, another exception occurred:
i.finetune/0
i.finetune/0 Traceback (most recent call last):
i.finetune/0 File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
i.finetune/0 return _run_code(code, main_globals, None,
i.finetune/0 File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
i.finetune/0 exec(code, run_globals)
i.finetune/0 File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
i.finetune/0 fdl_runner_app()
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 338, in __call__
i.finetune/0 raise e
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 321, in __call__
i.finetune/0 return get_command(self)(*args, **kwargs)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
i.finetune/0 return self.main(*args, **kwargs)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 665, in main
i.finetune/0 return _main(
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 197, in _main
i.finetune/0 rv = self.invoke(ctx)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
i.finetune/0 return ctx.invoke(self.callback, **ctx.params)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
i.finetune/0 return __callback(*args, **kwargs)
i.finetune/0 File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 703, in wrapper
i.finetune/0 return callback(**use_params)
i.finetune/0 File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 62, in fdl_direct_run
i.finetune/0 fdl_fn()
i.finetune/0 File "/opt/NeMo/nemo/collections/llm/api.py", line 201, in finetune
i.finetune/0 return train(
i.finetune/0 File "/opt/NeMo/nemo/collections/llm/api.py", line 96, in train
i.finetune/0 app_state = _setup(
i.finetune/0 File "/opt/NeMo/nemo/collections/llm/api.py", line 858, in _setup
i.finetune/0 resume.setup(trainer, model)
i.finetune/0 File "/opt/NeMo/nemo/lightning/resume.py", line 140, in setup
i.finetune/0 _try_restore_tokenizer(model, context_path)
i.finetune/0 File "/opt/NeMo/nemo/lightning/resume.py", line 44, in _try_restore_tokenizer
i.finetune/0 tokenizer = load_context(ckpt_path, "model.tokenizer")
i.finetune/0 File "/opt/NeMo/nemo/lightning/io/api.py", line 64, in load_context
i.finetune/0 return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.finetune/0 File "/opt/NeMo/nemo/lightning/io/mixin.py", line 720, in load
i.finetune/0 raise FileNotFoundError(f"No such file: '{_path}'")
i.finetune/0 FileNotFoundError: No such file: '/root/.cache/nemo/models/baichuan-inc/Baichuan2-7B-Base/context'
[15:09:55] INFO Job nemo.collections.llm.api.finetune-v1xxmknq0xfmmc finished: FAILED
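
The traceback is long, but only the `FileNotFoundError` lines say which paths were looked up. As a small sketch (the helper name `missing_paths` is hypothetical, and the pattern assumes the exact message format shown in this log), the paths can be pulled out of a saved log like this:

```python
import re

# Matches the final exception lines, e.g.
#   FileNotFoundError: No such file: '/some/path'
# It does not match the `raise FileNotFoundError(f"...")` source lines.
MISSING_FILE = re.compile(r"FileNotFoundError: No such file: '([^']+)'")


def missing_paths(log_text):
    """Return the distinct missing paths in the order they first appear."""
    seen = []
    for match in MISSING_FILE.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen
```

Applied to the log above, this yields the model cache directory first and then its `context` subdirectory, which suggests the loader tried both and found neither.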

@munger1985
Author

I have not managed to get anything to work.
