Build and Test License: MIT

one4all

A framework to streamline development for CUDA, ROCm and oneAPI at the same time. A recorded talk about it is available on the SHARCNET YouTube channel.

Updated slides of the above video with more accurate benchmark results are included in the doc folder.

Features

  • Supports four target APIs
    • CUDA
    • oneAPI
    • ROCm
    • STL Parallel Algorithms
  • All configuration is done automatically by CMake
  • Supports unit testing with Catch2
  • Supports benchmarking with Google Benchmark
  • Two sample algorithms (one kernel-based, one Thrust/oneDPL-based) are already included

Building from source

You need:

  • C++ compiler supporting the C++17 standard (e.g. gcc 9.3)
  • CMake version 3.21 or higher.

And, optionally, the following third-party libraries:

  • Catch2 (for unit tests)
  • Google Benchmark (for benchmarks)

If CMake cannot find these optional libraries, it tries to fetch and build them automatically. Nothing needs to be installed by hand when they are missing, but an internet connection is required for the download.

On the Alliance clusters, you can set up the above environment with the following module command:

module load cmake googlebenchmark catch2
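To confirm that the CMake on your path meets the 3.21 minimum listed above, a small version comparison can help. The sketch below is only an illustration: `version_ge` is a hypothetical helper name, not part of one4all, and it assumes GNU `sort` with the `-V` (version sort) option is available.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper; relies on `sort -V`, a GNU coreutils extension.
version_ge() {
  # After a version sort, the smaller version comes first.
  # A >= B holds exactly when that first line is B.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (not executed here):
#   cmake_version=$(cmake --version | awk 'NR==1 { print $3 }')
#   version_ge "$cmake_version" 3.21 || echo "CMake 3.21 or higher is required" >&2
```

The first line of `cmake --version` has the form `cmake version X.Y.Z`, which is why the example takes the third field.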

Building C++17 parallel algorithm version

Parallel STL requires a TBB version between 2018 and 2020 to work.

git clone https://github.com/arminms/one4all.git
cd one4all
cmake -S . -B build
cmake --build build -j

Building CUDA version

Requires CUDA version 11 or higher.

git clone https://github.com/arminms/one4all.git
cd one4all
cmake -S . -B build-cuda -DONE4ALL_TARGET_API=cuda
cmake --build build-cuda -j

Building ROCm version

Requires ROCm 5.4.3 or higher.

git clone https://github.com/arminms/one4all.git
cd one4all
CXX=hipcc cmake -S . -B build-rocm -DONE4ALL_TARGET_API=rocm
cmake --build build-rocm -j

Building oneAPI version

Requires oneAPI 2023.0.0 or higher.

Building for OpenCL targets

git clone https://github.com/arminms/one4all.git
cd one4all
CXX=icpx cmake -S . -B build-oneapi -DONE4ALL_TARGET_API=oneapi
cmake --build build-oneapi -j

Building for OpenCL, NVIDIA and/or AMD GPUs

Requires the Codeplay plugins for NVIDIA and/or AMD GPUs to be installed.

git clone https://github.com/arminms/one4all.git
cd one4all
CXX=clang++ cmake -S . -B build-oneapi -DONE4ALL_TARGET_API=oneapi
cmake --build build-oneapi -j
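The four build recipes above differ only in the build directory, the compiler and the ONE4ALL_TARGET_API value. As a rough sketch, the hypothetical helper below (`one4all_build_cmds` is not part of the repository) prints the configure and build commands for one target so they can be reviewed before being run. The label `stl` for the default build is also an assumption made here; the README passes no ONE4ALL_TARGET_API flag in that case, and neither does the helper.

```shell
#!/usr/bin/env bash
# one4all_build_cmds <stl|cuda|rocm|oneapi>
# Hypothetical helper: prints the two commands from this README for one target.
one4all_build_cmds() {
  local target="$1" dir cxx=""
  case "$target" in
    stl)    dir="build" ;;                          # default C++17 parallel algorithm build
    cuda)   dir="build-cuda" ;;
    rocm)   dir="build-rocm";   cxx="CXX=hipcc " ;;
    oneapi) dir="build-oneapi"; cxx="CXX=icpx " ;;  # use clang++ with the Codeplay plugins
    *) echo "unknown target: $target" >&2; return 1 ;;
  esac
  if [ "$target" = "stl" ]; then
    echo "cmake -S . -B $dir"
  else
    echo "${cxx}cmake -S . -B $dir -DONE4ALL_TARGET_API=$target"
  fi
  echo "cmake --build $dir -j"
}
```

From the repository root, `one4all_build_cmds rocm` prints the two ROCm commands, and `eval "$(one4all_build_cmds rocm)"` runs them.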

Running unit tests

cd build # or build-cuda / build-rocm / build-oneapi
ctest

To select the target device for the oneAPI version, set the ONEAPI_DEVICE_SELECTOR or SYCL_DEVICE_FILTER environment variable first:

# oneAPI 2023.1.0 or higher
ONEAPI_DEVICE_SELECTOR=[level_zero|opencl|cuda|hip|esimd_emulator|*][:cpu|gpu|fpga|*]

# older versions of oneAPI
SYCL_DEVICE_FILTER=[level_zero|opencl|cuda|hip|esimd_emulator|*][:cpu|gpu|acc|*]

The complete syntax is described in the DPC++ environment variables documentation. Here is an example that runs the oneAPI version on NVIDIA GPUs:

ONEAPI_DEVICE_SELECTOR=cuda build-oneapi/test/unit_tests
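When several backends are installed, it can be convenient to run the same test binary once per selector. The following is a minimal sketch of that idea; `run_tests_on` is a hypothetical helper, not part of one4all, and it assumes oneAPI 2023.1.0 or higher so that ONEAPI_DEVICE_SELECTOR is honoured.

```shell
#!/usr/bin/env bash
# run_tests_on <test-binary> <selector>...
# Hypothetical helper: runs the binary once per selector and stops at the first failure.
run_tests_on() {
  local test_bin="$1"; shift
  local selector
  for selector in "$@"; do
    echo "== ONEAPI_DEVICE_SELECTOR=$selector"
    # The assignment prefix sets the variable only for this one invocation.
    ONEAPI_DEVICE_SELECTOR="$selector" "$test_bin" || return 1
  done
}
```

For example, `run_tests_on build-oneapi/test/unit_tests opencl:cpu cuda:gpu` would run the unit tests first on an OpenCL CPU device and then on an NVIDIA GPU, provided both backends are present.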

Running benchmarks

cd build  # or build-cuda / build-rocm / build-oneapi
perf/benchmarks --benchmark_counters_tabular=true

Targets for the oneAPI version are selected in the same way as for the unit tests described above.

Benchmark results

Here are updated benchmark results from the Alliance's clusters. They are more accurate than the preliminary results shown in the YouTube video because timing now uses cudaEvent*() / hipEvent*() / SYCL's queue profiling. Output files are included in the perf/results folder.
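The tables in this section report performance relative to a baseline row fixed at 1.00. The README does not state the exact formula, so the sketch below assumes the simplest reading: the baseline's measured time divided by the candidate's measured time, so that a value above 1.00 means the candidate is faster. `relative_perf` is a hypothetical helper and the timings in the example are made up for illustration.

```shell
#!/usr/bin/env bash
# relative_perf <baseline_time> <measured_time>
# Hypothetical helper: prints baseline / measured with two decimals.
# Assumption: both times are in the same unit and the measured time is non-zero.
relative_perf() {
  awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

# Example with made-up timings (not results from this repository):
#   relative_perf 10 20   # candidate takes twice as long as the baseline
```

Under this assumption, a candidate that takes twice as long as the baseline scores 0.50, and one that finishes in two thirds of the baseline time scores 1.50.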

Parallel STL vs. oneAPI (higher is better)

Using AMD EPYC 7543 x2 2.8 GHz (64C / 128T) CPU:

| API – Algorithm | float | double |
|---|---|---|
| Parallel STL – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 0.46 | 0.27 |
| oneAPI – scale_table() | 1.61 | 1.36 |

CUDA vs. oneAPI (higher is better)

Using NVIDIA A100-SXM4-40GB GPU:

| API – Algorithm | float | double |
|---|---|---|
| CUDA – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 1.00 | 1.02 |
| oneAPI – scale_table() | 1.01 | 1.02 |

ROCm vs. oneAPI (higher is better)

Using AMD Instinct MI210 GPU:

| API – Algorithm | float | double |
|---|---|---|
| ROCm – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 1.04 | 1.08 |
| oneAPI – scale_table() | 0.91 | 0.79 |

NVIDIA A100-SXM4-40GB (108 SMs) vs. AMD Instinct MI210 (104 CUs) (higher is better)

| GPU – Algorithm | float | double |
|---|---|---|
| A100 – * | 1.00 | 1.00 |
| MI210 – generate_table() | 0.81 | 0.37 |
| MI210 – scale_table() | 0.68 | 0.87 |

Using one4all for new projects

Fork this repository using the Fork button at the top right of this page. You may choose a different name for your repository. In that case, find and replace one4all with <your-project> in all files (case-sensitive) and ONE4ALL_TARGET_API with <YOUR-PROJECT>_TARGET_API in all CMakeLists.txt files. Finally, rename the include/one4all folder to include/<your-project>.
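The renaming steps above can be scripted. The following is a minimal sketch, not a tool shipped with one4all: `rename_project` is a hypothetical function, it assumes GNU `grep` and GNU `sed` (for `--exclude-dir` and in-place `-i`), and it assumes both names are plain identifiers with no characters that are special to `sed` such as `/` or `&`.

```shell
#!/usr/bin/env bash
# rename_project <your-project> <YOUR-PROJECT>
# Hypothetical helper; run it from the root of your fork.
rename_project() {
  local name="$1" prefix="$2"

  # 1. Replace one4all (case-sensitive) in every file that mentions it, skipping .git.
  grep -rl --exclude-dir=.git 'one4all' . | while IFS= read -r file; do
    sed -i "s/one4all/${name}/g" "$file"
  done

  # 2. Replace ONE4ALL_TARGET_API in all CMakeLists.txt files.
  find . -name CMakeLists.txt -not -path './.git/*' -exec \
    sed -i "s/ONE4ALL_TARGET_API/${prefix}_TARGET_API/g" {} +

  # 3. Rename the header folder.
  if [ -d include/one4all ]; then
    mv include/one4all "include/${name}"
  fi
}
```

For a fork called `my_proj`, the call would be `rename_project my_proj MY_PROJ`. Reviewing the result with `git diff` before committing is a good idea, since the first step also rewrites URLs and documentation that mention one4all.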

Add your new algorithms to include/<your-project>/algorithm, and add their unit tests and benchmarks to the corresponding test/unit_test/unit_tests_*.cpp and perf/benchmark/benchmarks_*.cpp files, respectively.

Later, if you decide to add an executable, create a src folder and put the source code (e.g. my_prog_*.cpp) in it along with the following CMakeLists.txt:

## defining target for my_prog
#
add_executable(my_prog
  my_prog_${<YOUR-PROJECT>_TARGET_API}.$<IF:$<STREQUAL:${<YOUR-PROJECT>_TARGET_API},cuda>,cu,cpp>
)

## defining link libraries for my_prog
#
target_link_libraries(my_prog PRIVATE
  ${PROJECT_NAME}::${<YOUR-PROJECT>_TARGET_API}
)

## installing my_prog
#
install(TARGETS my_prog RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})

Don't forget to replace <YOUR-PROJECT> with the name of your project in the above file.

Finally, add add_subdirectory(src) at the end of the main CMakeLists.txt file.