Issues with running model2 #62

Open

s-kyungyong opened this issue Feb 24, 2023 · 2 comments
@s-kyungyong

Hi!

It looks like the run was killed due to GPU memory issues when model 2 is used. However, the same input sequence runs fine with model 1. Do you have any clues?


 nvidia-smi
Thu Feb 23 16:38:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    On   | 00000000:1A:00.0 Off |                  N/A |
| 41%   39C    P2    71W / 280W |   7148MiB / 24220MiB |      2%      Default |
|                               |                      |                  N/A |


omegafold --num_cycle 1 --model 1 gene_5088_NI907.fasta test4
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Finished prediction in 23.76 seconds.
INFO:root:Saving prediction to test4/gene_5088_NI907.pdb
INFO:root:Saved
INFO:root:Done!


omegafold --num_cycle 1 --model 2 gene_5088_NI907.fasta test5
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
Killed

Using a better GPU didn't help.


 nvidia-smi
Thu Feb 23 16:39:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:41:00.0 Off |                    0 |
|  0%   32C    P8    30W / 300W |     23MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |


omegafold --num_cycle 1 --model 1 gene_5088_NI907.fasta test10
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Finished prediction in 12.72 seconds.
INFO:root:Saving prediction to test10/gene_5088_NI907.pdb
INFO:root:Saved
INFO:root:Done!


omegafold --num_cycle 1 --model 2 gene_5088_NI907.fasta test11
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Failed to generate test11/gene_5088_NI907.pdb due to CUDA out of memory. Tried to allocate 10.67 GiB (GPU 0; 44.56 GiB total capacity; 32.65 GiB already allocated; 9.25 GiB free; 33.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
INFO:root:Done!

Using --subbatch_size also didn't help.

omegafold --subbatch_size 1 --num_cycle 1 --model 2 gene_5088_NI907.fasta test11
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Failed to generate test11/gene_5088_NI907.pdb due to CUDA out of memory. Tried to allocate 10.67 GiB (GPU 0; 44.56 GiB total capacity; 32.65 GiB already allocated; 9.25 GiB free; 33.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
INFO:root:Done!
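The failure message itself points at allocator fragmentation (max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF). As an untested sketch based only on that hint (this is a standard PyTorch environment variable, not an OmegaFold option, and the 128 value is just a starting guess), one more thing to try would be:

# hedged sketch: allocator setting suggested by the PyTorch error text; 128 MiB is an arbitrary starting value
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
omegafold --subbatch_size 1 --num_cycle 1 --model 2 gene_5088_NI907.fasta test11

This would only help if the OOM comes from fragmentation rather than model 2 genuinely needing more memory than the card has.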

Thanks!

@bzhousd commented Feb 26, 2023

I got the same error:

INFO:root:Failed to generate my_output4/ranked_0.pdb due to CUDA out of memory. Tried to allocate 7.80 GiB (GPU 0; 31.75 GiB total capacity; 24.66 GiB already allocated; 5.82 GiB free; 24.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My command is: omegafold a.fa my_output4 --model 2 --subbatch_size 1 --num_cycle 1

Model 1 works just fine, and the length of my sequence is 311. Any suggestions? Thanks!

Edit: the OOM message was raised from within RecycleEmbedder in my case; however, I couldn't find anywhere that this class uses subbatch_size.
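A hedged way to confirm that the spike really does happen in RecycleEmbedder (plain nvidia-smi polling, nothing OmegaFold-specific) is to log GPU memory from a second shell while the model 2 run is going, then line the log up against the INFO messages:

# hedged sketch: sample GPU memory once per second in the background during the run
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 1 > gpu_mem.log &
omegafold a.fa my_output4 --model 2 --subbatch_size 1 --num_cycle 1
kill %1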

@ZhihaoXie

I had the same problem running model 2; switching to model 1 worked fine. Here is the error message for model 2:
INFO:root:379 residues in this chain.
INFO:root:Failed to generate xxx/xxx.pdb due to CUDA out of memory. Tried to allocate 6.71 GiB (GPU 0; 23.70 GiB total capacity; 18.67 GiB already allocated; 3.43 GiB free; 19.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
