CUDA out of Memory while training the RegressionModel #262
Replies: 9 comments
- Any updates on this issue, please? I am having the same problem.
- How many genes are you using? The default batch_size might be too large. If you want to experiment with smaller batch sizes, restart your kernel first to clear the GPU memory.
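  A minimal sketch of passing a smaller batch_size, assuming the tutorial-style cell2location workflow (`adata_ref` is a placeholder for your prepared reference AnnData, and the exact keywords depend on your cell2location/scvi-tools versions):

  ```python
  import torch
  from cell2location.models import RegressionModel

  # Release cached GPU memory from a previous run
  # (a full kernel restart is more reliable in notebooks).
  torch.cuda.empty_cache()

  # adata_ref: placeholder for the reference AnnData,
  # prepared beforehand with RegressionModel.setup_anndata(...).
  mod = RegressionModel(adata_ref)

  # A smaller batch_size lowers peak GPU memory per training step.
  # use_gpu=True is the older scvi-tools keyword; newer versions
  # take accelerator="gpu" instead.
  mod.train(max_epochs=250, batch_size=1024, use_gpu=True)
  ```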
- I have tried changing the batch size several times and have also tried clearing the GPU memory, but the snippet still throws the CUDA out of memory error.
- Are you able to share a reproducible example on Colab?
- I thought there was a problem with GPU memory allocation (I used an RTX 6000 earlier), but I have also tried on an HPC with an A100 that has 40 GB of memory and still get the same error. However, I can make it run if I change the batch size: a small batch size (5-10) leads to an estimated training time of 900 h, while increasing it to 20000 brings that down to 13 h (on the A100). I am not sure what the consequences of changing this parameter are. The single-cell RNA model was trained using 24k genes (the intersection with the ST data is 19k). Would you suggest training it with fewer genes? Thanks in advance.
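  For intuition on why the estimate moves so much: the number of optimizer steps per epoch is roughly `n_cells / batch_size`, so a much larger batch means far fewer (but more memory-hungry) steps. A back-of-the-envelope sketch with hypothetical numbers:

  ```python
  import math

  # Hypothetical dataset size, just to show the scaling.
  n_cells = 300_000
  max_epochs = 250

  for batch_size in (10, 2500, 20000):
      steps = max_epochs * math.ceil(n_cells / batch_size)
      print(f"batch_size={batch_size:>6}: {steps:>12,} optimizer steps")
  # Larger batches mean far fewer optimizer steps per epoch, which is
  # why batch_size trades wall time against peak GPU memory.
  ```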
- I finally got the model trained after 14 h, but now that I want to load the model I am hitting the same problem. Any help, please!
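  One workaround worth trying when loading hits the same limit is to load the trained model onto the CPU; a sketch assuming the scvi-tools `load()` interface, with a placeholder save directory:

  ```python
  from cell2location.models import RegressionModel

  # Load the saved model onto the CPU so the GPU does not need to hold
  # the parameters and the data at the same time. "reference_signatures"
  # is a placeholder for your save directory; on newer scvi-tools the
  # keyword is accelerator="cpu" rather than use_gpu=False.
  mod = RegressionModel.load("reference_signatures", adata=adata_ref, use_gpu=False)
  ```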
- I also have the same problem.
- If your problem is with loading or training the regression model, please check that you are using the latest cell2location. If your problem is with the main cell2location model, your dataset could be too large: it is realistic to use about 16k genes × 50k locations on an A100 80 GB GPU.
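  As a sketch of reducing the gene dimension, the permissive gene filter from the cell2location tutorial can be applied to the reference AnnData (here called `adata_ref`, a placeholder) before training:

  ```python
  from cell2location.utils.filtering import filter_genes

  # Keep genes detected in enough cells at a non-trivial mean level.
  # These cutoffs are the tutorial's suggestions, not hard rules;
  # tune them for your data.
  selected = filter_genes(
      adata_ref,
      cell_count_cutoff=5,
      cell_percentage_cutoff2=0.03,
      nonz_mean_cutoff=1.12,
  )
  adata_ref = adata_ref[:, selected].copy()
  ```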
- @vitkl Thank you for the explanation. I ran into the CUDA memory problem when running the main cell2location model. My dataset was indeed bigger (17k genes over 70k locations). After reducing the genes to 13k, CUDA is happy again.
- I have gone through the mouse brain cell2location tutorial and was planning to run the newer version of the cell2location model on this dataset. The data preprocessing part completed successfully, but I am having memory issues when training the RegressionModel.
Here is the block of code that produces the error:
Here I have tried different batch_size values (even 1), but it still produces the given error:
I have also stepped through the portion of the code which runs out of memory.
The corresponding file was /usr/local/lib/python3.7/dist-packages/scvi/model/base/_pyromixin.py while running the code in google colab.
After stepping through this TrainRunner, the CUDA out of memory error is raised; I used pdb to step through each line of this class.
Could you please suggest a reason for this memory issue, since I am not able to train the RegressionModel with the mouse brain data, or any ways to overcome it, given that decreasing the batch_size parameter doesn't seem to work in this case?
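For anyone debugging a similar OOM, a minimal sketch for checking what the GPU already holds before training, using standard PyTorch calls:

```python
import torch

if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    total = torch.cuda.get_device_properties(dev).total_memory
    allocated = torch.cuda.memory_allocated(dev)
    reserved = torch.cuda.memory_reserved(dev)
    print(f"total:     {total / 1024**3:.1f} GiB")
    print(f"allocated: {allocated / 1024**3:.1f} GiB")
    print(f"reserved:  {reserved / 1024**3:.1f} GiB")
    # A large 'reserved' value before training usually means leftover
    # tensors from a previous run; restarting the kernel releases them.
```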