While executing the script bash scripts/TrainStage1_7b.sh, I encountered an Out of Memory (OOM) error. The error message is as follows:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 23.59 GiB of which 104.75 MiB is free. Including non-PyTorch memory, this process has 23.42 GiB memory in use. Of the allocated memory, 23.17 GiB is allocated by PyTorch, and 2.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
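One cheap thing to try, per the hint at the end of that message, is setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation. A minimal sketch (the 128 MiB split size is an illustrative value, not one taken from this repo):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is
# initialized, so this must run before the first tensor reaches the GPU
# (e.g. at the very top of TrainStage1.py, before any model code runs).
# max_split_size_mb:128 is an illustrative value, not a recommendation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

Equivalently, the variable can be exported in scripts/TrainStage1_7b.sh before the python invocation. Note this only reduces fragmentation; with only 2.91 MiB reserved-but-unallocated in the message above, it is unlikely to recover much on its own.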
The error seems to occur at the following point in the code:

trainer = AlignLLMwithSDCLIPTrainer(model=model, tokenizer=llm_tokenizer, args=training_args, **data_module)

System Info:

Given these specs, I’m wondering whether it is feasible to train the model without encountering OOM errors, and whether there are any suggestions for resolving the memory issues.
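One generic mitigation, independent of DeepSpeed, is lowering the per-device batch size and compensating with gradient accumulation, which keeps the effective batch size constant. A sketch of the arithmetic (variable names are illustrative, not flags from TrainStage1_7b.sh):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Batch size seen by each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus

# Quartering the per-device batch while accumulating 4 micro-batches
# leaves the optimizer updates the same size:
assert effective_batch_size(8, 1, 1) == effective_batch_size(2, 4, 1) == 8
```

The caveat is that activation memory scales roughly linearly with the per-device batch, while parameter and optimizer-state memory do not, so for a 7B model on a 24 GiB card this alone may not be enough.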
Attempt with DeepSpeed:
To mitigate the OOM issue, I tried enabling DeepSpeed, but ran into compatibility issues. I am using DeepSpeed version 0.15.3, since version 0.7.3 did not work due to its dependency on the removed torch._six module. The following is the config file I used for DeepSpeed:

I also modified the TrainStage1.py file as follows to include DeepSpeed:

However, I encountered the following error message when trying to run the modified code:
AttributeError: 'TrainingArguments' object has no attribute 'hf_deepspeed_config'
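For reference, in Transformers of this era hf_deepspeed_config is created inside TrainingArguments.__post_init__ when the deepspeed argument is supplied at construction time, so a common cause of this AttributeError is assigning training_args.deepspeed = "..." after the object has been built. A toy sketch of that pattern (the Args class and field values are illustrative, not the real TrainingArguments):

```python
from dataclasses import dataclass
from typing import Optional

# Toy model of the TrainingArguments pattern: derived state is built in
# __post_init__, so it only exists if `deepspeed` is passed at construction.
@dataclass
class Args:
    deepspeed: Optional[str] = None

    def __post_init__(self):
        if self.deepspeed:
            self.hf_deepspeed_config = f"parsed:{self.deepspeed}"

late = Args()
late.deepspeed = "ds_config.json"          # set after construction: too late
early = Args(deepspeed="ds_config.json")   # set at construction: derived state exists

assert not hasattr(late, "hf_deepspeed_config")
assert hasattr(early, "hf_deepspeed_config")
```

If TrainStage1.py builds TrainingArguments first and attaches the config path afterwards, passing the path as the deepspeed argument at construction (or via a --deepspeed command-line flag, where the script parses one) would avoid the missing attribute.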
Even though I explicitly added the path to the DeepSpeed config file, this error persists. Could you provide any guidance on how to resolve the OOM issue using DeepSpeed, or suggest which version of DeepSpeed is compatible with Transformers 4.28.1? Any advice would be greatly appreciated.
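For the OOM itself, a common DeepSpeed starting point on a 24 GiB card is ZeRO stage 2 with optimizer offload to CPU. The sketch below is a generic config, not the one used in this repo; the "auto" entries rely on the Hugging Face Trainer integration to fill in values from TrainingArguments:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
```

Even then, full fine-tuning of a 7B model on a single 24 GiB GPU is tight, and ZeRO stage 3 with parameter offload may be needed at a substantial speed cost. On versions: DeepSpeed 0.15.3 postdates Transformers 4.28.1 by well over a year, so a DeepSpeed release from the same period (early-to-mid 2023) seems more likely to be compatible, though I have not verified a specific pin.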