
Speed up model training by using mixed precision and tensor cores #18

Closed
BerndDoser opened this issue Sep 28, 2023 · 3 comments
Labels: enhancement (New feature or request)

@BerndDoser (Member) commented Sep 28, 2023

Approach

  • Check accessibility of tensor cores on the A40 (see the sketch after this list)
  • Set up the configuration for mixed precision
  • Optimize parameters for ideal usage of the tensor cores
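
For the first item, a minimal sketch (not from the original issue) of how tensor-core availability can be checked from PyTorch; device index 0 is assumed, and the A40 is Ampere (compute capability 8.6):

import torch

# Tensor cores are available on GPUs with compute capability >= 7.0 (Volta and newer).
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("FP16 tensor cores available:", (major, minor) >= (7, 0))

# On Ampere GPUs such as the A40, FP32 matmuls can also use tensor cores via TF32.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True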

Related links:

@BerndDoser BerndDoser added the enhancement New feature or request label Sep 28, 2023
@BerndDoser BerndDoser self-assigned this Sep 28, 2023
@BerndDoser (Member, Author)

The accessibility of the tensor cores can be checked with the NVIDIA Nsight Compute CLI (ncu).

Access to the GPU performance counters must first be enabled for non-admin users (link):

echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia.conf
sudo update-initramfs -u 
sudo reboot

Simple test:

import torch
import torch.nn

# Half-precision GEMM: with FP16 inputs, the linear layer's matmul should
# dispatch to a tensor-core kernel on supported GPUs.
bsz, inf, outf = 256, 1024, 2048
tensor = torch.randn(bsz, inf).cuda().half()
layer = torch.nn.Linear(inf, outf).cuda().half()
layer(tensor)

Profile output:

(spherinator) doserbd@rh04715:~/git/Spherinator$ ncu -o profile python devel/test-tensor-cores.py 
==PROF== Connected to process 5375 (/home/doserbd/anaconda3/envs/spherinator/bin/python3.10)
==PROF== Profiling "unrolled_elementwise_kernel" - 0: 0%....50%....100% - 8 passes
==PROF== Profiling "unrolled_elementwise_kernel" - 1: 0%....50%....100% - 8 passes
==PROF== Profiling "unrolled_elementwise_kernel" - 2: 0%....50%....100% - 8 passes
==PROF== Profiling "elementwise_kernel" - 3: 0%....50%....100% - 8 passes
==PROF== Profiling "turing_fp16_s1688gemm_fp16_12..." - 4: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 5375
==PROF== Report: /home/doserbd/git/Spherinator/profile.ncu-rep

Here, turing_fp16_s1688gemm is a CUDA tensor-core kernel, running on the RTX 8020 in my workstation.
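
For training (as opposed to the isolated half-precision GEMM above), mixed precision is usually driven by autocast together with a GradScaler. A minimal, self-contained sketch with placeholder model and shapes, not taken from the Spherinator code:

import torch

# Placeholder model, optimizer, and data for one mixed-precision training step.
model = torch.nn.Linear(1024, 2048).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(256, 1024, device="cuda")
target = torch.randn(256, 2048, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    # Matmuls inside autocast run in FP16 and can dispatch to tensor-core kernels.
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()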

@BerndDoser (Member, Author)

Using precision: 16-mixed in the trainer config results in the same issue described in #39 and #40.

 File "/local_data/doserbd/miniconda3/envs/spherinator-2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 166, in scale
105     assert outputs.is_cuda or outputs.device.type == 'xla'
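
For context, the 16-mixed setting in a YAML trainer config corresponds to the precision argument of the Lightning Trainer. A minimal sketch, assuming PyTorch Lightning 2.x and not the project's actual configuration:

import lightning.pytorch as pl

# "16-mixed" enables AMP (autocast + GradScaler) inside the Trainer;
# AMP uses the same GradScaler that appears in the traceback above.
trainer = pl.Trainer(accelerator="gpu", devices=1, precision="16-mixed")
# trainer.fit(model, datamodule=...)  # model and datamodule omitted here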

We close this issue and track it further in #39 and #40.

@andife (Collaborator) commented Nov 7, 2023

According to my TensorBoard visualization, we use tensor cores for about 6% of the total training time. (I'll have to see if I still have the screenshot for this.)

Unfortunately, the TensorBoard PyTorch profiler plugin is deprecated and no longer works as well with our PyTorch 2 setup as it used to. Currently, the tool https://hta.readthedocs.io/en/latest/index.html does not seem to cover the full functionality.
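
For what it's worth, traces for either tool can still be collected with torch.profiler directly. A minimal sketch with a placeholder model and step (output directory and shapes are assumptions):

import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

# Placeholder half-precision model and a dummy training step.
model = torch.nn.Linear(1024, 2048).cuda().half()

def train_step():
    x = torch.randn(256, 1024, device="cuda", dtype=torch.half)
    model(x)

# Collect a trace of a few steps; the resulting files can be opened in
# TensorBoard or analyzed with HTA to inspect kernel-level tensor-core usage.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for _ in range(5):
        train_step()
        prof.step()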
