Resuming training fails #105
Comments
I've also observed this in the log:
Any updates on this? I'm facing the same problem.
Here's a temporary fix, following https://huggingface.co/docs/safetensors/torch_shared_tensors. Modify the checkpoint-loading code as follows:

```diff
-from safetensors.torch import load_file
+from safetensors.torch import load_model
 ...
         if input_model_file.exists():
-            state_dict = load_file(input_model_file, device=str(map_location))
+            load_model(models[i], input_model_file, device=str(map_location), **load_model_func_kwargs)
         else:
             # Load with torch
             input_model_file = input_dir.joinpath(f"{MODEL_NAME}{ending}.bin")
             state_dict = torch.load(input_model_file, map_location=map_location)
-        models[i].load_state_dict(state_dict, **load_model_func_kwargs)
+            models[i].load_state_dict(state_dict, **load_model_func_kwargs)
```

Note that `load_state_dict` moves inside the `else` branch: `load_model` already loads the weights into the model, so the `state_dict` path is only needed for the plain-torch checkpoint.
So I removed the `--overwrite_output_dir` flag to be able to resume training, and I'm getting the following error. At the same time, the evaluation script works just fine with the same checkpoint.

I'm using Ubuntu 22, RTX 3090 Ti.