- Create github account (if not exists);
- Make sure SSH clone & commit is working (Connecting to GitHub with SSH);
- Fork this repo (just click Fork button on the top of the page, detailed instructions here)
- Clone your forked repo into your local machine, use your user instead of
username
:
git clone [email protected]:username/cuda-2024.git
cd cuda-2024
- Go to your group folder, e.g.:
cd 3821B1FI1
- Go to needed task folder, e.g.:
cd 1_gelu_omp
- Create new folder with your surname and name (make sure it's the same for all tasks), e.g.:
mkdir petrov_ivan
- Copy your task source/header files (including main program) into this folder (use
copy
instead ofcp
on Windows), e.g.:
cd petrov_ivan
cp /home/usr/lab/*.cpp .
cp /home/usr/lab/*.h .
- Push your sources to github repo, e.g.:
cd ..
git add .
git commit -m "1_gelu_omp task"
git push
- Go to your repo in browser, click Contribute button on the top of page, then Open pull request. Provide meaningfull request title and description, then Create pull request (see details here).
The following scheme is used to measure task execution time:
int main() {
// ...
// Warming-up
Task(input, size / 8);
// Performance Measuring
auto start = std::chrono::high_resolution_clock::now();
auto c = Task(input, size);
auto end = std::chrono::high_resolution_clock::now();
// ...
}
- CPU: Intel Core i5 12600K (4 cores, 4 threads)
- RAM: 16 GB
- GPU: NVIDIA RTX 4060 (8 GB)
- Host Compiler: GCC 11.4.0
- CUDA: 12.6
The Gaussian Error Linear Unit (GELU) is an activation function frequently used in Deep Neural Networks (DNNs) and can be thought of as a smoother ReLU.
To approximate GELU function, use the following formula:
GELU(x) =
Implement the function with the following interface in C++:
std::vector<float> GeluOMP(const std::vector<float>& input);
Size of result vector should be the same as for input
. Use OpenMP technology to make your function parallel & fast.
Two files are expected to be uploaded:
- gelu_omp.h
#ifndef __GELU_OMP_H
#define __GELU_OMP_H
#include <vector>
std::vector<float> GeluOMP(const std::vector<float>& input);
#endif // __GELU_OMP_H
- gelu_omp.cpp
#include "gelu_omp.h"
std::vector<float> GeluOMP(const std::vector<float>& input) {
// Place your implementation here
}
Implement the function with the following interface in CUDA C++ using the formula described above:
std::vector<float> GeluCUDA(const std::vector<float>& input);
Size of result vector should be the same as for input
. Use CUDA technology to make your function work on NVIDIA GPU. Try to make it fast.
Two files are expected to be uploaded:
- gelu_cuda.h
#ifndef __GELU_CUDA_H
#define __GELU_CUDA_H
#include <vector>
std::vector<float> GeluCUDA(const std::vector<float>& input);
#endif // __GELU_CUDA_H
- gelu_cuda.cu
#include "gelu_cuda.h"
std::vector<float> GeluCUDA(const std::vector<float>& input) {
// Place your implementation here
}
General matrix multiplication (GEMM) is a very basic and broadly used linear algebra operation applied in high performance computing (HPC), statistics, deep learning and other domains. There are a lot of GEMM algorithms with different mathematical complexity form
To start learning matrix multiplication smoother, let us start with naive approach here. To compute matrix multiplication result C for matricies A and B, where C = A * B and the size for all matricies are
To complete the task one should implement a function that multiplies two square matricies using OpenMP with the following interface:
std::vector<float> NaiveGemmOMP(const std::vector<float>& a,
const std::vector<float>& b,
int n);
Each matrix must be stored in a linear array by rows, so that a.size()==n*n
. Function takes two matricies and their size as inputs, and returns result matrix also stored by rows.
For simplicity, let's consider matrix size is always power of 2.
Two files are expected to be uploaded:
- naive_gemm_omp.h:
#ifndef __NAIVE_GEMM_OMP_H
#define __NAIVE_GEMM_OMP_H
#include <vector>
std::vector<float> NaiveGemmOMP(const std::vector<float>& a,
const std::vector<float>& b,
int n);
#endif // __NAIVE_GEMM_OMP_H
- naive_gemm_omp.cpp:
#include "naive_gemm_omp.h"
std::vector<float> NaiveGemmOMP(const std::vector<float>& a,
const std::vector<float>& b,
int n) {
// Place your implementation here
}
In this task one should implement naive approach for matrix multiplication in CUDA trying to make it fast enough (pay attention to global memory accesses in your code).
Each matrix must be stored in a linear array by rows, so that a.size()==n*n
. Function takes two matricies and their size as inputs, and returns result matrix also stored by rows.
For simplicity, let's consider matrix size is always power of 2.
Two files are expected to be uploaded:
- naive_gemm_cuda.h:
#ifndef __NAIVE_GEMM_CUDA_H
#define __NAIVE_GEMM_CUDA_H
#include <vector>
std::vector<float> NaiveGemmCUDA(const std::vector<float>& a,
const std::vector<float>& b,
int n);
#endif // __NAIVE_GEMM_CUDA_H
- naive_gemm_cuda.cu:
#include "naive_gemm_cuda.h"
std::vector<float> NaiveGemmCUDA(const std::vector<float>& a,
const std::vector<float>& b,
int n) {
// Place your implementation here
}
In real applications block-based approach for matrix multiplication can get multiple times faster execution comparing with naive version due to cache friendly approach. To prove this in practice, implement such a version in C++ using OpenMP.
In block version algorithm could be divided into three stages:
- Split matricies into blocks (block size normally affects performance significantly so choose it consciously);
- Multiply two blocks to get partial result;
- Replay step 2 for all row/column blocks accumulating values into a single result block.
From math perspective, block matrix multiplication could be described by the following formula, where
Each matrix must be stored in a linear array by rows, so that a.size()==n*n
. Function takes two matricies and their size as inputs, and returns result matrix also stored by rows.
For simplicity, let's consider matrix size is always power of 2.
Two files are expected to be uploaded:
- block_gemm_omp.h:
#ifndef __BLOCK_GEMM_OMP_H
#define __BLOCK_GEMM_OMP_H
#include <vector>
std::vector<float> BlockGemmOMP(const std::vector<float>& a,
const std::vector<float>& b,
int n);
#endif // __BLOCK_GEMM_OMP_H
- block_gemm_omp.cpp:
#include "block_gemm_omp.h"
std::vector<float> BlockGemmOMP(const std::vector<float>& a,
const std::vector<float>& b,
int n) {
// Place your implementation here
}
As in previous task, let us consider all matricies are square.
In CUDA C++ block-based approach looks similar. But to get better performance one should use CUDA shared memory to store each particular block while computations. With this consideration, algorithm will be the following:
- A single CUDA block should compute a single block of result matrix C, a single CUDA thread - a single matrix C element;
- For each A block in a row and B block in a column:
- Load A block into shared memory;
- Load B block into shared memory;
- Synchronize over all threads in block;
- Compute BlockA * BlockB and accumulate into C block in shared memory;
- Synchronize over all threads in block;
- Dump block C from shared to global memory.
Each matrix must be stored in a linear array by rows, so that a.size()==n*n
. Function takes two matricies and their size as inputs, and returns result matrix also stored by rows.
For simplicity, let's consider matrix size is always power of 2.
Two files are expected to be uploaded:
- block_gemm_cuda.h:
#ifndef __BLOCK_GEMM_CUDA_H
#define __BLOCK_GEMM_CUDA_H
#include <vector>
std::vector<float> BlockGemmCUDA(const std::vector<float>& a,
const std::vector<float>& b,
int n);
#endif // __BLOCK_GEMM_CUDA_H
- block_gemm_cuda.cu:
#include "block_gemm_cuda.h"
std::vector<float> BlockGemmCUDA(const std::vector<float>& a,
const std::vector<float>& b,
int n) {
// Place your implementation here
}
The most performant way to multiply two matrices on particular hardware is to use vendor-provided library for this purpose. In CUDA it's cuBLAS. Try to use cuBLAS API to implement general matrix multiplication in most performant way.
Each matrix must be stored in a linear array by rows, so that a.size()==n*n
. Function takes two matricies and their size as inputs, and returns result matrix also stored by rows.
For simplicity, let's consider matrix size is always power of 2.
Note, that in cuBLAS API matrix is expected to be stored by columns, so additional transpose may be required.
Two files are expected to be uploaded:
- gemm_cublas.h:
#ifndef __GEMM_CUBLAS_H
#define __GEMM_CUBLAS_H
#include <vector>
std::vector<float> GemmCUBLAS(const std::vector<float>& a,
const std::vector<float>& b,
int n);
#endif // __GEMM_CUBLAS_H
- gemm_cublas.cu:
#include "gemm_cublas.h"
std::vector<float> GemmCUBLAS(const std::vector<float>& a,
const std::vector<float>& b,
int n) {
// Place your implementation here
}
Another widely used operation in HPC & signal processing is discrete Fourier Transform. Naive approach (by definition) has
Due to its frequent use, FFT algorithm implementation is normally a part of vendor-optimized solutions for various hardware chips. For NVIDIA GPUs one should take cuFFT library.
To pass the task one should implement a funtion that takes
Required function should have the following prototype:
std::vector<float> FffCUFFT(const std::vector<float>& input, int batch);
Here
The function should perform the following actions:
- Compute forward Fourier transform for
$input$ ; - Compute inverse Fourier transform for the result of step 1;
- Normalize result of step 2 by
$n$ .
Returned array must store result of step 3 in the same format of
Note, that due to Fourier Transform math properties, result array will have the same values as input one. This specificity could be used for self-checking.
Two files are expected to be uploaded:
- fft_cufft.h:
#ifndef __FFT_CUFFT_H
#define __FFT_CUFFT_H
#include <vector>
std::vector<float> FffCUFFT(const std::vector<float>& input, int batch);
#endif // __FFT_CUFFT_H
- fft_cufft.cu:
#include "fft_cufft.h"
std::vector<float> FffCUFFT(const std::vector<float>& input, int batch) {
// Place your implementation here
}
Implement GELU function with the following interface in OpenCL using the formula described in task #1:
std::vector<float> GeluOCL(const std::vector<float>& input);
Size of result vector should be the same as for input
. Use OpenCL technology to make your function work on NVIDIA GPU. Try to make it fast.
Use CL_DEVICE_GPU
flag to choose GPU device. Use zero platform and zero device. Store your OpenCL kernel in a string constant.
Two files are expected to be uploaded:
- gelu_ocl.h
#ifndef __GELU_OCL_H
#define __GELU_OCL_H
#include <vector>
std::vector<float> GeluOCL(const std::vector<float>& input);
#endif // __GELU_OCL_H
- gelu_ocl.cpp
#include "gelu_ocl.h"
std::vector<float> GeluOCL(const std::vector<float>& input) {
// Place your implementation here
}
Group | Name | Result |
---|---|---|
3821B1FI3 | kuznetsov_artyom | 0.2679 |
REF | REF | 0.8126 |
3821B1FI3 | kulaev_zhenya | TOO SLOW |
Group | Name | Result |
---|---|---|
3821B1FI3 | kuznetsov_artyom | 0.2452 |
REF | REF | 0.2598 |
Group | Name | Result |
---|---|---|
REF | REF | 0.8379 |
3821B1FI3 | kuznetsov_artyom | 0.9010 |
Group | Name | Result |
---|---|---|
REF | REF | 0.1877 |
3821B1FI3 | kuznetsov_artyom | 0.2329 |
Group | Name | Result |
---|---|---|
REF | REF | 0.1980 |
3821B1FI3 | kuznetsov_artyom | 0.2230 |
Group | Name | Result |
---|---|---|
3821B1FI3 | kuznetsov_artyom | 0.1516 |
REF | REF | 0.1524 |
Group | Name | Result |
---|---|---|
REF | REF | 0.0601 |
3821B1FI3 | kuznetsov_artyom | BUILD FAILED |
Group | Name | Result |
---|---|---|
REF | REF | 0.2309 |
Group | Name | Result |
---|---|---|
REF | REF | 0.2621 |
Group | Name | Passed |
---|---|---|
3821B1FI3 | kulaev_zhenya | 0/9 |
3821B1FI3 | kuznetsov_artyom | 6/9 |