Issues with running model2 #62

Open

s-kyungyong opened this issue Feb 24, 2023 · 2 comments
@s-kyungyong

Hi!

It looks like the run was killed due to GPU memory issues when model 2 is used. However, the same input sequence runs fine with model 1. Do you have any clues?


 nvidia-smi
Thu Feb 23 16:38:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    On   | 00000000:1A:00.0 Off |                  N/A |
| 41%   39C    P2    71W / 280W |   7148MiB / 24220MiB |      2%      Default |
|                               |                      |                  N/A |


omegafold --num_cycle 1 --model 1 gene_5088_NI907.fasta test4
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Finished prediction in 23.76 seconds.
INFO:root:Saving prediction to test4/gene_5088_NI907.pdb
INFO:root:Saved
INFO:root:Done!


omegafold --num_cycle 1 --model 2 gene_5088_NI907.fasta test5
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
Killed

Using a better GPU didn't help.


 nvidia-smi
Thu Feb 23 16:39:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:41:00.0 Off |                    0 |
|  0%   32C    P8    30W / 300W |     23MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |


omegafold --num_cycle 1 --model 1 gene_5088_NI907.fasta test10
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Finished prediction in 12.72 seconds.
INFO:root:Saving prediction to test10/gene_5088_NI907.pdb
INFO:root:Saved
INFO:root:Done!


omegafold --num_cycle 1 --model 2 gene_5088_NI907.fasta test11
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Failed to generate test11/gene_5088_NI907.pdb due to CUDA out of memory. Tried to allocate 10.67 GiB (GPU 0; 44.56 GiB total capacity; 32.65 GiB already allocated; 9.25 GiB free; 33.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
INFO:root:Done!

Using --subbatch_size also didn't help.

omegafold --subbatch_size 1 --num_cycle 1 --model 2 gene_5088_NI907.fasta test11
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Failed to generate test11/gene_5088_NI907.pdb due to CUDA out of memory. Tried to allocate 10.67 GiB (GPU 0; 44.56 GiB total capacity; 32.65 GiB already allocated; 9.25 GiB free; 33.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
INFO:root:Done!
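The failure message itself points at allocator fragmentation (max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF). As an untested sketch based only on that hint (this is a standard PyTorch environment variable, not an OmegaFold option, and the 128 value is just a starting guess), one more thing to try would be:

# hedged sketch: allocator setting suggested by the PyTorch error text; 128 MiB is an arbitrary starting value
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
omegafold --subbatch_size 1 --num_cycle 1 --model 2 gene_5088_NI907.fasta test11

This would only help if the OOM comes from fragmentation rather than model 2 genuinely needing more memory than the card has.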

Thanks!

@bzhousd commented Feb 26, 2023

I got the same error:

INFO:root:Failed to generate my_output4/ranked_0.pdb due to CUDA out of memory. Tried to allocate 7.80 GiB (GPU 0; 31.75 GiB total capacity; 24.66 GiB already allocated; 5.82 GiB free; 24.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My command is: omegafold a.fa my_output4 --model 2 --subbatch_size 1 --num_cycle 1

Model 1 works just fine, and the length of my sequence is 311. Any suggestions? Thanks!

Edit: the OOM message was raised from within RecycleEmbedder in my case; however, I couldn't find anywhere that this class uses subbatch_size.
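A hedged way to confirm that the spike really does happen in RecycleEmbedder (plain nvidia-smi polling, nothing OmegaFold-specific) is to log GPU memory from a second shell while the model 2 run is going, then line the log up against the INFO messages:

# hedged sketch: sample GPU memory once per second in the background during the run
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 1 > gpu_mem.log &
omegafold a.fa my_output4 --model 2 --subbatch_size 1 --num_cycle 1
kill %1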

@ZhihaoXie

I had the same problem running model 2; switching to model 1 worked fine. Here is the error message for model 2:
INFO:root:379 residues in this chain.
INFO:root:Failed to generate xxx/xxx.pdb due to CUDA out of memory. Tried to allocate 6.71 GiB (GPU 0; 23.70 GiB total capacity; 18.67 GiB already allocated; 3.43 GiB free; 19.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
