High GPU memory consumption #6
Comments
Hello! It seems that this is a fundamental problem with TTLayer and with how optimization in autograd frameworks works. In addition to the memory footprint of the model weights, during optimization we also store activations on the GPU. For a single FC layer of size d^3 x d^3 and a batch of size B x d^3, the weight-storage footprint is d^6 and the activation footprint is B·d^3. For a TTLayer with 3 cores, the storage footprint is d^2(r^2 + 2r), but the activation footprint is B·d^3·(2r + 1). The batch size used in Transformers (number of tokens, B x L) is usually quite large, so the increase in activation memory (2Brd^3) outweighs the win in weight storage (d^6). To be precise, this happens when 2Br > d^3. Common values of d^3 are ~1000-4000, common batch sizes are around 5000 tokens, and common ranks are 8-32, so there is a big activation memory overhead. Most likely, TTLayers are not applicable to the FC layers used in Transformers.
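For concreteness, here is a quick back-of-the-envelope check of the numbers above (a sketch with assumed values d^3 = 4096, B = 5000 tokens, r = 16; not code from this repo):

```python
# Compare parameter memory vs. extra activation memory for a dense d^3 x d^3
# FC layer and a 3-core TT layer of rank r, with B tokens per step.
d, r, B = 16, 16, 5000                  # d^3 = 4096 features, TT-rank 16, ~5000 tokens

fc_params = d**6                        # dense weight: d^3 * d^3
tt_params = d**2 * (r**2 + 2 * r)       # cores: d^2*r + d^2*r^2 + d^2*r

fc_acts = B * d**3                      # only the output activation
tt_acts = B * d**3 * (2 * r + 1)        # two rank-r intermediates + output

print(f"parameter win  : {fc_params - tt_params:,} values")
print(f"activation loss: {tt_acts - fc_acts:,} values")   # 2*B*r*d^3
print("TT helps overall:", tt_acts - fc_acts < fc_params - tt_params)
```

With these values the activation overhead (~655M values) dwarfs the parameter saving (~16.7M values), matching the 2Br > d^3 condition.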
Your explanation about the activations makes sense; I went over the math and it's correct. However, the compressed model also consumes much more memory during inference, i.e. in eval mode: peak memory consumption was 3021 MB for the compressed model versus 2132 MB for the normal model. I also tried to write the forward method more efficiently (e.g. with bmm or einsum), but it didn't help either.
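For reference, a minimal sketch of what an einsum-based 3-core TT forward could look like (an illustration under assumed core shapes, not the repo's actual TTLayer). The two rank-r intermediates it creates are exactly the 2·B·r·d^3 activations discussed above, which is why rewriting the contraction alone doesn't reduce memory:

```python
import torch
import torch.nn as nn

class TT3Linear(nn.Module):
    """Sketch of a d^3 x d^3 linear map factored into 3 TT cores (boundary ranks = 1)."""
    def __init__(self, d, r):
        super().__init__()
        self.d = d
        self.g1 = nn.Parameter(torch.randn(d, d, r) * 0.1)     # (i1, j1, a1)
        self.g2 = nn.Parameter(torch.randn(r, d, d, r) * 0.1)  # (a1, i2, j2, a2)
        self.g3 = nn.Parameter(torch.randn(r, d, d) * 0.1)     # (a2, i3, j3)

    def forward(self, x):                                       # x: (B, d^3)
        d = self.d
        x = x.view(-1, d, d, d)                                 # (B, j1, j2, j3)
        t1 = torch.einsum('bjkl,mja->bklma', x, self.g1)        # (B, d, d, d, r)  ~ B*d^3*r
        t2 = torch.einsum('bklma,ankc->blmnc', t1, self.g2)     # (B, d, d, d, r)  ~ B*d^3*r
        y = torch.einsum('blmnc,cpl->bmnp', t2, self.g3)        # (B, d, d, d)
        return y.reshape(-1, d ** 3)

# usage: layer = TT3Linear(d=16, r=16); out = layer(torch.randn(32, 16 ** 3))
```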
Hey,
Excellent, cool! I wonder how the "native" solution would scale in terms of compute time and memory consumption. I can prepare code for d > 3; I made a working script for that yesterday for something else. So that's your main change: the einsum contraction.
I need to fully understand when einsum does a reshape and whether it broadcasts efficiently before scaling this up. There are several issues on the PyTorch repo about einsum; as I understand it, they are working on it.
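One way to compare compute time and peak GPU memory of the different formulations directly (a sketch assuming a CUDA device and a layer whose parameters require grad; dense_layer / tt_layer below are placeholder names, not objects from this repo):

```python
import time
import torch

def profile_step(layer, x):
    """Wall-clock time and peak GPU memory of one forward+backward pass."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    layer(x).sum().backward()            # backward keeps the activations, like training does
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 2 ** 20
    return elapsed, peak_mb

# e.g. print(profile_step(dense_layer.cuda(), x.cuda()))
#      print(profile_step(tt_layer.cuda(), x.cuda()))
```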
Hi,
I tried to integrate the TTLayer into Transformer-XL,
however, I found that it consumes much more memory than the original model.
Did you experience such problems? Do you know any way around this?
(BTW, I also applied a few fixes for multi-GPU training; e.g., the tensor-train objects are not moved to the GPU when you call model.to(device), which breaks the model in distributed training.)
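Something like the following is the kind of fix I mean (a minimal sketch, assuming the TT cores are currently held in a plain Python list, not the exact patch): registering the cores as an nn.ParameterList so that model.to(device) and DistributedDataParallel see and move them.

```python
import torch
import torch.nn as nn

class TTCores(nn.Module):
    """Hold TT cores as registered parameters instead of plain tensors in a list."""
    def __init__(self, cores):                      # cores: list of torch.Tensor
        super().__init__()
        self.cores = nn.ParameterList(nn.Parameter(c) for c in cores)

# model.to(device) now moves every core; with plain tensors in a list it silently would not.
```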