This structured learning path guides you through the essential steps required to become proficient in CUDA programming, starting from foundational programming knowledge to advanced GPU computing concepts. The path emphasizes building a strong base in programming, understanding data structures, mastering C++, and diving into GPU architecture and CUDA-specific optimizations. Resources include both English and Polish materials, offering flexibility based on language preference.
C Programming:
Begin with C programming if you are unfamiliar with it. A solid understanding of C is mandatory before transitioning to C++ programming.
- 🇵🇱 Podstawy programowania. Język C
- 🇵🇱 Zaawansowane programowanie w języku C
- The C Programming Language (ANSI C) by Brian Kernighan and Dennis Ritchie
Data Structures:
Learn essential data structures and algorithms, a prerequisite for effective problem-solving and programming.
- 🎥 C++ Data Structures & Algorithms + LEETCODE Exercises
- Data Structures and Algorithms -> Leetcode
- 🇵🇱 📖 Algorytmy, struktury danych i techniki programowania by Paweł Wróblewski
- 🇵🇱 📖 C++. Algorytmy i struktury danych by Adam Drozdek
C++ Programming:
Master C++ programming; it serves as the foundation for CUDA development.
- 🎥 Beginning C++ Programming - From Beginner to Beyond
- 🇵🇱 🎥 C++ od Podstaw do Eksperta
- 🇵🇱 📖 Opus magnum C++11
- 🇵🇱 📖 Język C++ Kompendium Wiedzy by Bjarne Stroustrup
- 🎥 Back to Basics
- 📖 Modern C++ Tutorial: C++11/14/17/20 On the Fly
Parallel Computing:
Understand the basics of parallel computing and modern hardware architectures.
CUDA Programming:
Dive into CUDA, learning GPU programming techniques, optimizations, and advanced performance tuning.
- 🇵🇱 🎥 GPU Programming
- CUDA C++ Programming Guide
- 🇵🇱 CUDA - Tomasz Jasiukiewicz
- 🎥 CUDA Parallel Programming on NVIDIA GPUs - HW and SW
- CUDA Samples
- 🎥 CUDA Programming Course – High-Performance Computing with GPUs
- 📖 Programming Massively Parallel Processors by David B. Kirk, Wen-mei W. Hwu
- 🎥 CUDA training series
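
To make the material above concrete, here is a minimal CUDA C++ sketch of a vector-addition kernel, the kind of program most of the listed courses start with. The file name, kernel name, and array size are illustrative choices, not taken from any specific resource.

```cpp
// vector_add.cu - minimal sketch: add two float arrays on the GPU.
// Build (assuming a CUDA toolkit is installed): nvcc vector_add.cu -o vector_add
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of the output array.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard against the last, partially filled block
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;                  // 1M elements (arbitrary size)
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize host data.
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device buffers and copy the inputs to the GPU.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f (expected 3.0)\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The pattern here (compute a global thread index, bounds-check it, let cudaMemcpy handle host/device transfers) mirrors the introductory vector-add examples found in the CUDA C++ Programming Guide and CUDA Samples listed above.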
Triton:
Explore the Triton framework for writing efficient, high-performance GPU kernels.
GPU Architecture and Glossary:
Familiarize yourself with GPU architecture and terminology to deepen your understanding of hardware capabilities.
This comprehensive learning path equips you with the skills needed to progress from foundational programming to advanced CUDA development, paving the way for a career in GPU-accelerated computing.
This section focuses on understanding the fundamentals and optimization of matrix multiplication (Matmul), a cornerstone operation in CUDA programming and high-performance computing (HPC). The provided resources cover both CPU implementations and GPU optimizations, including the use of Tensor Cores on architectures like Ampere and Ada. These materials are essential for building a strong foundation in writing optimized CUDA code.
- Matmul on CPU: Analysis of efficient matrix multiplication implementations on CPUs, with detailed examples of optimizations:
- CUDA Matmul Optimizations:
- Theory and Basics:
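
Before diving into the optimized implementations linked above, it can help to see the baseline they improve on. Below is a sketch of a naive CUDA matmul kernel alongside a shared-memory tiled variant; the tile size, kernel names, and row-major N x N layout are illustrative assumptions, not code from the referenced articles.

```cpp
// matmul_tiled.cu - illustrative sketch: naive vs. shared-memory tiled matmul (C = A * B)
// for square N x N matrices in row-major layout. TILE and the names are arbitrary choices.
#define TILE 16

// Naive version: every thread reads its whole row of A and column of B from global memory.
__global__ void matmulNaive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Tiled version: each block stages TILE x TILE tiles of A and B in shared memory,
// so each global-memory element is loaded once per tile instead of once per thread.
__global__ void matmulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B (zero-pad out-of-range elements).
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Partial dot product over the staged tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

Both kernels would be launched with a dim3(TILE, TILE) block and a grid of ceil(N/TILE) x ceil(N/TILE) blocks; tiling is only the first of the optimization steps (register blocking, vectorized loads, Tensor Cores) that the resources above build toward.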
These resources provide a comprehensive theoretical and practical foundation in matrix multiplication, supporting your CUDA learning and deepening your understanding of algorithm optimization in GPU environments.
This section provides a curated collection of resources for learning, exploring, and mastering GPU programming. It covers various aspects of GPU development, including community engagement, architectural insights, tutorials, example implementations, benchmarking, and advanced tools. These resources cater to developers at different expertise levels, offering a pathway to build and optimize high-performance GPU applications.
Engage with fellow developers and experts in the field of GPU programming:
Understand the underlying architecture of GPUs to optimize code efficiently:
Learn the practical aspects of GPU programming with these tutorials:
Comprehensive courses to deepen your GPU programming skills:
Explore video tutorials and insights on GPU programming:
- Programming Massively Parallel Processors
- Simon Oz - GPU Programming
- CUDA Programming
- George Hotz Archive
Explore real-world examples and implementations:
- llm.c
- Fast LLM Inference From Scratch
- MNIST CUDA
- Softmax
- YALM: LLM Inference in C++/CUDA
- llm.cpp: Training and Inference
- CUTLASS Tutorial: Fast Matrix Multiplication with WGMMA on Hopper GPUs
Track performance and benchmarks of GPU kernels:
Compare GPU performance and analyze benchmarks:
- MI300X vs H100 vs H200 Training Benchmarks
- Forecasting GPU Performance
- Benchmarking Nvidia Hopper GPU Architecture
- Maximum Achievable Matmul FLOPS Finder
Understand key HPC algorithms like matrix multiplication:
Insights into GPU performance and its nuances:
- The GPU is Not Always Faster
- Series of articles explaining GPU programming -> Demystifying CPUs and GPUs: What You Need to Know, How the way a computer works, Terminology in parallel programming, Hello world Cuda-C, The operational mechanism of CPU-GPU, Memory Types in GPU, Using GPU memory, Synchronization and Asynchronization, Unified memory, Pinned memory, Streaming, Data Hazard, Warp Scheduler, Global Memory Coalescing, Atomic Function, Bandwidth — Throughput — Latency, Occupancy in GPU Part 1, Occupancy in GPU Part 2
Explore CUDA-based frameworks for specific use cases:
Explore state-of-the-art research in GPU programming:
Useful tools for tuning and analyzing GPU performance:
This resource list offers a comprehensive set of tools, tutorials, and materials to help developers advance their GPU programming expertise, from beginner to professional levels.