
Speed up model training by using mixed precision and tensor cores #18

Closed
BerndDoser opened this issue Sep 28, 2023 · 3 comments
Labels: enhancement (New feature or request)

@BerndDoser (Member) commented Sep 28, 2023

Approach

  • Check accessibility of tensor cores on the A40 (see the sketch after this list)
  • Set up the configuration for mixed precision
  • Optimize parameters for ideal usage of the tensor cores
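
For the first item, a minimal sketch (not from the original issue) of how tensor-core availability can be checked from PyTorch; device index 0 is assumed, and the A40 is Ampere (compute capability 8.6):

import torch

# Tensor cores are available on GPUs with compute capability >= 7.0 (Volta and newer).
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("FP16 tensor cores available:", (major, minor) >= (7, 0))

# On Ampere GPUs such as the A40, FP32 matmuls can also use tensor cores via TF32.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True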

Related links:

@BerndDoser BerndDoser added the enhancement New feature or request label Sep 28, 2023
@BerndDoser BerndDoser self-assigned this Sep 28, 2023
@BerndDoser (Member, Author)

The accessibility of the tensor cores can be checked with the NVIDIA Nsight Compute CLI (ncu).

Access to the GPU performance counters must first be enabled for non-admin users (link):

echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia.conf
sudo update-initramfs -u 
sudo reboot

Simple test:

import torch
import torch.nn

# Half-precision GEMM: with FP16 inputs, the linear layer's matmul should
# dispatch to a tensor-core kernel on supported GPUs.
bsz, inf, outf = 256, 1024, 2048
tensor = torch.randn(bsz, inf).cuda().half()
layer = torch.nn.Linear(inf, outf).cuda().half()
layer(tensor)

Profile output:

(spherinator) doserbd@rh04715:~/git/Spherinator$ ncu -o profile python devel/test-tensor-cores.py 
==PROF== Connected to process 5375 (/home/doserbd/anaconda3/envs/spherinator/bin/python3.10)
==PROF== Profiling "unrolled_elementwise_kernel" - 0: 0%....50%....100% - 8 passes
==PROF== Profiling "unrolled_elementwise_kernel" - 1: 0%....50%....100% - 8 passes
==PROF== Profiling "unrolled_elementwise_kernel" - 2: 0%....50%....100% - 8 passes
==PROF== Profiling "elementwise_kernel" - 3: 0%....50%....100% - 8 passes
==PROF== Profiling "turing_fp16_s1688gemm_fp16_12..." - 4: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 5375
==PROF== Report: /home/doserbd/git/Spherinator/profile.ncu-rep

Here, turing_fp16_s1688gemm is a CUDA tensor-core kernel, running on the RTX 8020 in my workstation.
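
For training (as opposed to the isolated half-precision GEMM above), mixed precision is usually driven by autocast together with a GradScaler. A minimal, self-contained sketch with placeholder model and shapes, not taken from the Spherinator code:

import torch

# Placeholder model, optimizer, and data for one mixed-precision training step.
model = torch.nn.Linear(1024, 2048).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(256, 1024, device="cuda")
target = torch.randn(256, 2048, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    # Matmuls inside autocast run in FP16 and can dispatch to tensor-core kernels.
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()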

@BerndDoser (Member, Author)

Using precision: 16-mixed in the trainer config results in the same issue described in #39 and #40.

 File "/local_data/doserbd/miniconda3/envs/spherinator-2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 166, in scale
105     assert outputs.is_cuda or outputs.device.type == 'xla'
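
For context, the 16-mixed setting in a YAML trainer config corresponds to the precision argument of the Lightning Trainer. A minimal sketch, assuming PyTorch Lightning 2.x and not the project's actual configuration:

import lightning.pytorch as pl

# "16-mixed" enables AMP (autocast + GradScaler) inside the Trainer;
# AMP uses the same GradScaler that appears in the traceback above.
trainer = pl.Trainer(accelerator="gpu", devices=1, precision="16-mixed")
# trainer.fit(model, datamodule=...)  # model and datamodule omitted here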

We close this issue and track it further in #39 and #40.

@andife (Collaborator) commented Nov 7, 2023

According to my TensorBoard visualization, we use tensor cores for about 6% of the total training time. (I'll have to see if I still have the screenshot for this.)

Unfortunately, the TensorBoard PyTorch profiler plugin is deprecated and no longer works as well with our PyTorch 2 setup as it used to. Currently, the tool https://hta.readthedocs.io/en/latest/index.html does not seem to cover the full functionality.
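
For what it's worth, traces for either tool can still be collected with torch.profiler directly. A minimal sketch with a placeholder model and step (output directory and shapes are assumptions):

import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

# Placeholder half-precision model and a dummy training step.
model = torch.nn.Linear(1024, 2048).cuda().half()

def train_step():
    x = torch.randn(256, 1024, device="cuda", dtype=torch.half)
    model(x)

# Collect a trace of a few steps; the resulting files can be opened in
# TensorBoard or analyzed with HTA to inspect kernel-level tensor-core usage.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for _ in range(5):
        train_step()
        prof.step()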
