- PyTorch
- Table of Contents
- Cheat Sheets
- Tutorials
- General Functioning
- Tips/Tricks, code snippets
- Gradient accumulation
- Freeze seeds for reproducibility
- Freeze layers
- Save and load weights
- Change last layer
- Delete last layer
- Get number of parameters
- No grad and inference_mode decorators
- Gradient clipping
- Remove bias weight decay
- Test time augmentation (TTA)
- Intermediate gradients
- Get intermediate layers values
- Weight init
- Train/test/valid splits
- Common mistakes
- Maximizing performance
- Construct tensors directly on GPUs
- Avoid CPU to GPU transfers or vice-versa
- Transform your input tensors on the GPU
- Workers in dataloader
- cudnn.benchmark
- Use inplace operations
- Gradient checkpointing
- Pinned Memory
- Use DistributedDataParallel not DataParallel
- Profile your code
- Use auto mixed precision
- Static graphs
- Lightning Fabric
- More tips
- Torchmetrics
- Visualize Layer Activations
- MultiGPU
- PyTorch internals
- Debugging
- Getting Started Tutorials
- PyTorch Autograd Explained - In-depth Tutorial
- PyTorchZeroToAll
- Neural Network Programming - Deep Learning with PyTorch
- Understanding PyTorch with an example: a step-by-step tutorial
- https://towardsdatascience.com/how-to-use-pytorch-hooks-5041d777f904
- https://leimao.github.io/blog/PyTorch-Benchmark/
- https://www.learnpytorch.io/#does-this-course-cover-pytorch-20
- https://github.com/srush/Tensor-Puzzles
Modern libraries like PyTorch are designed around three components:
- a fast (C/CUDA) general Tensor library that implements basic mathematical operations over multi-dimensional tensors
- an autograd engine that tracks the forward compute graph and can generate operations for the backward pass
- a scriptable (Python) deep-learning-aware, high-level API of common deep learning operations, layers, architectures, optimizers, loss functions, etc.
Each tensor has a .grad_fn attribute that references the function that created it (tensors created directly by the user have grad_fn set to None).
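A quick illustration (names are arbitrary):

```python
import torch

x = torch.ones(2, requires_grad=True)  # leaf tensor created by the user
y = x * 2                              # tensor created by an operation

print(x.grad_fn)  # None
print(y.grad_fn)  # <MulBackward0 object at ...>
```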
The dynamic graph records the execution of the running program; the graph is generated on the fly.
Automatic differentiation is essentially a pre-implementation of the most common functions together with their local gradients.
Mathematically, if you have a vector-valued function $\vec{y} = f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is the Jacobian matrix $J$ with entries $J_{ij} = \partial y_i / \partial x_j$. Generally speaking, torch.autograd is an engine for computing vector-Jacobian products: given any vector $v$, it computes the product $J^T \cdot v$. If $v$ happens to be the gradient of a scalar function $l = g(\vec{y})$, that is, $v = (\partial l / \partial y_1, \ldots, \partial l / \partial y_m)^T$, then by the chain rule the vector-Jacobian product is the gradient of $l$ with respect to $\vec{x}$:

$$J^T \cdot v = \left(\frac{\partial l}{\partial x_1}, \ldots, \frac{\partial l}{\partial x_n}\right)^T$$

(Note that $v^T \cdot J$ gives a row vector, which can be treated as a column vector by taking $J^T \cdot v$.)
This characteristic of the vector-Jacobian product makes it very convenient to feed external gradients into a model that has non-scalar output.
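A minimal sketch of feeding an external gradient $v$ into backward() for a non-scalar output:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2                             # non-scalar output
v = torch.tensor([0.1, 1.0, 0.0001])  # external gradient dl/dy
y.backward(v)                         # computes J^T @ v
print(x.grad)                         # here J = 2I, so x.grad == 2 * v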
https://kozodoi.me/python/deep%20learning/pytorch/tutorial/2021/02/19/gradient-accumulation.html
optimizer.zero_grad()
scaled_loss = 0
for accumulated_step_i in range(N_STEPS):
    out = model(inputs)                 # inputs/criterion are placeholders
    loss = criterion(out, targets)      # optionally divide by N_STEPS so the update matches a full batch
    loss.backward()                     # gradients accumulate in param.grad across iterations
    scaled_loss += loss.item()
optimizer.step()                        # one update using the gradients summed over N_STEPS
actual_loss = scaled_loss / N_STEPS
import random
import numpy as np
import torch

seed = 1
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
for child in model.children():
    for param in child.parameters():
        param.requires_grad = False

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=...
)
torch.save(model.state_dict(), MODEL_PATH)
model.load_state_dict(torch.load(MODEL_PATH))
num_final_in = model.fc.in_features
model.fc = nn.Linear(num_final_in, NUM_CLASSES)
new_model = nn.Sequential(*list(model.children())[:-1])
num_params = sum(p.numel() for p in model.parameters())  # total parameters
num_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)  # trainable parameters
@torch.no_grad()
def evaluate(model, data):
    model.eval()

@torch.inference_mode()
def evaluate(model, data):
    model.eval()

torch.inference_mode() (PyTorch 1.9+) is the faster of the two, but tensors created inside it can never be used in computations recorded by autograd later.
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2)
def add_weight_decay(net, l2_value, skip_list=()):
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # frozen weights
        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": l2_value},
    ]

params = add_weight_decay(net, 2e-5)
sgd = torch.optim.SGD(params, lr=0.1)
data = torch.stack(list_of_tensors)  # shape: (batch_size, n_crops, c, h, w)
batch_size, n_crops, c, h, w = data.size()
data = data.view(-1, c, h, w)  # fold the crops into the batch dimension
output = model(data)
output = output.view(batch_size, n_crops, -1).mean(1)  # average predictions over crops
By default, PyTorch only stores the gradients of the leaf variables (e.g., the weights and biases) via their grad attribute, to save memory. So, if we are interested in intermediate results in the computational graph, we can use the retain_grad method to store gradients of non-leaf variables as follows:
x = torch.tensor(2.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

u = x * w          # intermediate (non-leaf) tensor
u.retain_grad()    # must be called before backward()
v = u + b
v.backward()
print(u.grad)      # tensor(1.), i.e. dv/du
https://pytorch.org/blog/FX-feature-extraction-torchvision/
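A minimal sketch using torchvision's FX-based feature extraction from the post above (the model and node names are just examples; get_graph_node_names(model) lists the valid names for a given model):

```python
import torch
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

model = torchvision.models.resnet50(weights=None)
# Map graph node names to output keys
extractor = create_feature_extractor(
    model, return_nodes={"layer4": "features", "avgpool": "pooled"}
)

out = extractor(torch.rand(1, 3, 224, 224))
print(out["features"].shape, out["pooled"].shape)
```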
def init_weights(net, init_type="normal", gain=0.02):
    def init_func(m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            if init_type == "normal":
                nn.init.normal_(m.weight.data, 0.0, gain)
            elif init_type == "xavier":
                nn.init.xavier_normal_(m.weight.data, gain=gain)
            elif init_type == "kaiming":
                nn.init.kaiming_normal_(m.weight.data, a=0, mode="fan_in")
            elif init_type == "orthogonal":
                nn.init.orthogonal_(m.weight.data, gain=gain)
            else:
                raise NotImplementedError(
                    "initialization method [%s] is not implemented" % init_type
                )
            if hasattr(m, "bias") and m.bias is not None:
                nn.init.constant_(m.bias.data, 0.0)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.normal_(m.weight.data, 1.0, gain)
            nn.init.constant_(m.bias.data, 0.0)

    print("initialize network with %s" % init_type)
    net.apply(init_func)
Use Subset from torch.utils.data to build train/validation/test splits.
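A minimal sketch, assuming dataset is any map-style Dataset (fractional random_split lengths need a reasonably recent PyTorch; older versions take absolute lengths):

```python
import torch
from torch.utils.data import Subset, random_split

# Option 1: random 80/10/10 split with a fixed seed for reproducibility
train_set, valid_set, test_set = random_split(
    dataset, [0.8, 0.1, 0.1], generator=torch.Generator().manual_seed(42)
)

# Option 2: explicit index lists (e.g. precomputed, stratified splits)
train_set = Subset(dataset, train_idx)
valid_set = Subset(dataset, valid_idx)
```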
losses.append(loss)         # bad: keeps the tensor and its whole autograd graph alive
losses.append(loss.item())  # good: stores a plain Python float
a = torch.tensor([1.0, 2.0, 3.0])
b = a          # WRONG: b is the same tensor; mutating b mutates a
b = a.clone()  # independent copy
t = torch.rand(2, 2).cuda()          # bad: allocates on the CPU, then copies to the GPU
t = torch.rand(2, 2, device="cuda")  # good: allocates directly on the GPU
Avoid .item(), .cpu(), and .numpy() calls inside the training loop. Each one transfers data from the GPU to the CPU (forcing a synchronization) and dramatically slows things down.
Usually, your images are loaded from your drives as uint8 arrays. After normalization, the data is converted to float32. This takes more space as each pixel is now represented by 32 bits instead of 8 bits.
It makes more sense to send the input batches in uint8 to the GPU, and then convert them to float32 on the GPU. This saves a lot of bandwidth between the CPU and the GPU.
# Bad: four bytes per pixel cross the CPU-to-GPU bus
x = torch.tensor(x, dtype=torch.float32).cuda()
# Good: one byte per pixel is transferred, then converted to float32 on the GPU
x = torch.tensor(x, dtype=torch.uint8).cuda().float()
PyTorch can load data on multiple processes simultaneously.
Common rules of thumb (see the sketch after this list):
- set the number of workers to the number of CPU cores, or
- num_workers = 4 * num_GPUs
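A minimal sketch applying the heuristics above (dataset is a placeholder):

```python
import os
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=os.cpu_count(),  # or 4 * num_GPUs, per the rule of thumb above
)
```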
Set torch.backends.cudnn.benchmark = True
Note that cudnn.benchmark will profile the kernels for each new input shape, so be careful if dynamic shapes are used.
For example, if we perform x.cos().cos(), we usually need 4 global memory reads and writes.
x1 = x.cos() # Read from x in global memory, write to x1
x2 = x1.cos() # Read from x1 in global memory, write to x2
But, with operator fusion, we only need 2 global memory reads and writes! So operator fusion will speed it up by 2x.
x2 = x.cos().cos() # Read from x in global memory, write to x2
Deep Learning Memory Usage and Pytorch Optimization Tricks
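A minimal gradient-checkpointing sketch with torch.utils.checkpoint (the two-block split is arbitrary): activations inside a checkpointed segment are not stored during the forward pass and are recomputed during backward, trading compute for memory.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
block2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())

x = torch.randn(32, 512, requires_grad=True)
h = checkpoint(block1, x, use_reentrant=False)  # activations recomputed in backward
out = block2(h)                                 # this block stores activations as usual
out.sum().backward()
```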
Set pin_memory=True in PyTorch's DataLoader class. Pinning the memory speeds up data transfer by minimizing the transfer cost between the CPU and the CUDA device; hence, enabling pin_memory=True should make model training faster by some small margin. A convenient default is pin_memory=torch.cuda.is_available().
When you enable pinned memory in a DataLoader it "automatically puts the fetched data Tensors in pinned memory, and enables faster data transfer to CUDA-enabled GPUs".
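A minimal sketch; non_blocking=True lets the host-to-device copy overlap with compute, and it only helps when the source tensor is in pinned memory:

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=64, pin_memory=torch.cuda.is_available())

for x, y in loader:
    x = x.to("cuda", non_blocking=True)  # async copy from pinned memory
    y = y.to("cuda", non_blocking=True)
    ...
```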
PyTorch has two main models for training on multiple GPUs. The first, DataParallel (DP), splits a batch across multiple GPUs. But this also means the model has to be copied to each GPU, and once gradients are calculated on GPU 0, they must be synced to the other GPUs.
That's a lot of expensive GPU transfers! Instead, DistributedDataParallel (DDP) creates a siloed copy of the model on each GPU (in its own process) and makes only a portion of the data available to that GPU. It's like having N independent models training, except that once each one calculates its gradients, they all sync gradients across models; this means we only transfer data across GPUs once per batch.
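A minimal DDP sketch, assuming a torchrun --nproc_per_node=N launch; MyModel, dataset, and num_epochs are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")  # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])

sampler = DistributedSampler(dataset)    # each rank gets its own shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)             # reshuffle shards every epoch
    for x, y in loader:
        ...                              # gradients sync automatically during backward
```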
PyTorch Lightning has a built-in profiler.
https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
https://zdevito.github.io/2022/12/09/memory-traces.html
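A minimal sketch following the profiler recipe linked above (model and inputs are placeholders):

```python
from torch.profiler import ProfilerActivity, profile

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
) as prof:
    model(inputs)  # the code you want to profile

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```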
Forward and backward passes run in 16-bit precision; the weight update is then carried out in 32-bit precision (a 32-bit master copy of the weights is kept).
This is another way to speed up training which we don’t see many people using. In 16-bit training parts of your model and your data go from 32-bit numbers to 16-bit numbers. This has a few advantages:
- You use half the memory (which means you can double batch size and cut training time in half).
- Certain GPUs (V100, 2080Ti) give you automatic speed-ups (3x-8x faster) because they are optimized for 16-bit computations.
Can make your code run three times faster.
https://pytorch.org/docs/stable/amp.html
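The canonical autocast + GradScaler pattern from the AMP docs (loader, model, criterion, and optimizer are placeholders):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, skips the step on inf/nan
    scaler.update()
```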
PyTorch 2.0 added torch.compile():
model = torch.compile(model) # NEW
The compile API converts your model into an intermediate computation graph (an FX graph), which it then compiles into low-level compute kernels in a manner that is optimal for the underlying training accelerator, using techniques such as kernel fusion and out-of-order execution.
You can also compile your losses for example:
criterion = torch.compile(torch.nn.CrossEntropyLoss().cuda(device))
Lightning Fabric is a lightweight PyTorch Lightning extension: https://lightning.ai/docs/fabric/stable/
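A minimal sketch based on the Fabric docs (model, optimizer, and dataloader are placeholders):

```python
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")
fabric.launch()

model, optimizer = fabric.setup(model, optimizer)  # moves model/optimizer to the device
dataloader = fabric.setup_dataloaders(dataloader)  # adds distributed sampling if needed

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)        # placeholder: compute your loss here
    fabric.backward(loss)      # replaces loss.backward()
    optimizer.step()
```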
- https://web.archive.org/web/20230127171726/https://efficientdl.com/faster-deep-learning-in-pytorch-a-guide/
- https://sebastianraschka.com/blog/2023/pytorch-memory-optimization.html
- PyTorch Model Performance Analysis and Optimization
- https://pytorch.org/blog/accelerating-generative-ai-2/
Use torchmetrics to remove the boilerplate of accumulating batch metrics over an epoch.
https://sebastianraschka.com/blog/2022/torchmetrics.html
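A minimal sketch, assuming a recent torchmetrics version (the task argument was introduced in v0.11) and placeholder model/loader:

```python
import torchmetrics

metric = torchmetrics.Accuracy(task="multiclass", num_classes=10)

for x, y in loader:
    preds = model(x)
    metric.update(preds, y)   # accumulates state batch by batch

epoch_acc = metric.compute()  # aggregate over the whole epoch
metric.reset()                # start fresh for the next epoch
```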
- https://web.archive.org/web/20240222184221/http://blog.ezyang.com/2019/05/pytorch-internals/
- http://blog.christianperone.com/2018/03/pytorch-internal-architecture-tour/
Debugging PyTorch memory use with snapshots: https://zdevito.github.io/2022/08/16/memory-snapshots.html
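A sketch of the snapshot workflow from that post; note these are semi-private, underscore-prefixed APIs whose names have changed across PyTorch versions:

```python
import torch

torch.cuda.memory._record_memory_history()  # start recording allocation stack traces

train_one_epoch()  # placeholder: the code whose memory you want to trace

torch.cuda.memory._dump_snapshot("snapshot.pickle")  # inspect with the _memory_viz tool
```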