Develop Stream 2024-10-31 #421

Open
wants to merge 33 commits into develop from develop_stream_2024_10_31
33 commits
fec62de
Resolve "hipCUB compilation fails with latest rocPRIM changes"
sbalint98 Oct 2, 2024
97660ce
Rework device_histogram test
Beanavil Oct 16, 2024
68a439f
Const-qualify half_t::operator+/*
Beanavil Oct 16, 2024
c58c2a3
Update CUB/Thrust/libcu++ to 2.5.0
Beanavil Oct 16, 2024
2281933
Add tests for large number of items for hipcub::DeviceSelect::If
Beanavil Oct 17, 2024
dd0b00c
Update example_device_radix_sort.cu
Beanavil Oct 17, 2024
03e836a
fix: reset error code in device_radix_sort test after out-of-memory e…
Beanavil Oct 21, 2024
f0bb576
Expose DeviceSelect::FlaggedIf
Beanavil Oct 25, 2024
01e1304
Add test for DeviceSelect::FlaggedIf
Beanavil Oct 25, 2024
fbe5761
Add benchmark for DeviceSelect::FlaggedIf
Beanavil Oct 25, 2024
001509e
Add hipGraph capture tests for device run length encode
Beanavil Oct 2, 2024
30e0957
Set c++ version to 17 and create warning
NB4444 Oct 7, 2024
88addb0
Fix ambiguous variable error
NB4444 Oct 7, 2024
48615a1
Fix nodiscard warnings
NB4444 Oct 7, 2024
1b72fbc
Set CI tests for both c++14 and 17
NB4444 Oct 7, 2024
f5ece38
Fix nodiscard warnings in example
NB4444 Oct 7, 2024
4ef0aed
Examples clang-format
NB4444 Oct 7, 2024
89875a4
Fixed clang format and dates
NB4444 Oct 7, 2024
1b65b61
temp fix: wrong error on cuda machines
NB4444 Oct 25, 2024
7492686
Ignore error from hipGetLastError to prevent warning
NB4444 Oct 25, 2024
da3b221
Deleted declaration of unnecessary hipError_t result
Saiyang-Zhang Oct 23, 2024
6e29b14
Deleted declaration of unnecessary hipError_t result
Saiyang-Zhang Oct 23, 2024
43186b3
Format amending
Saiyang-Zhang Oct 23, 2024
d560631
Format amending
Saiyang-Zhang Oct 23, 2024
419ac42
Format amending
Saiyang-Zhang Oct 23, 2024
b46c259
Cleanup
Saiyang-Zhang Oct 28, 2024
2093675
Formatting
Saiyang-Zhang Oct 28, 2024
83b5be0
Fix clang-format
NB4444 Nov 5, 2024
a71133e
Added wrapper for BindTexture
NB4444 Oct 25, 2024
575577f
Fixed some issues in the test
NB4444 Oct 25, 2024
032cb09
Removed unnecessary code
NB4444 Nov 1, 2024
f082baa
Merge branch 'develop' into develop_stream_2024_10_31
NB4444 Nov 13, 2024
b742900
Merge branch 'develop' into develop_stream_2024_10_31
NB4444 Nov 14, 2024
14 changes: 14 additions & 0 deletions .gitlab-ci.yml
@@ -88,6 +88,7 @@ copyright-date:
-D GPU_TARGETS="$GPU_TARGETS"
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD=14
-B $CI_PROJECT_DIR/rocPRIM/build
-S $CI_PROJECT_DIR/rocPRIM
- cd $CI_PROJECT_DIR/rocPRIM/build
@@ -116,6 +117,7 @@ build:rocm:
-D ROCM_SYMLINK_LIBS=OFF
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD="$BUILD_VERSION"
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
@@ -133,6 +135,9 @@ build:rocm:
- $CI_PROJECT_DIR/build/hipcub*.zip
- $CI_PROJECT_DIR/build/.ninja_log
expire_in: 2 weeks
parallel:
matrix:
- BUILD_VERSION: [14, 17]

build:rocm-benchmark:
extends:
@@ -153,6 +158,7 @@ build:rocm-benchmark:
-D GPU_TARGETS="$GPU_TARGETS"
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD=14
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
@@ -176,6 +182,7 @@ test:rocm:
- cd $CI_PROJECT_DIR/build
- cmake
-D CMAKE_PREFIX_PATH=/opt/rocm
-D CMAKE_CXX_STANDARD=14
-P $CI_PROJECT_DIR/cmake/GenerateResourceSpec.cmake
- cat ./resources.json
# Parallel execution (with other AMDGPU processes) can oversubscribe the SDMA queue.
@@ -231,6 +238,7 @@ benchmark:rocm:
-G Ninja
-D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror"
"$GPU_TARGETS_ARG"
-D CMAKE_CXX_STANDARD=14
-S $CI_PROJECT_DIR/test/extra
-B $CI_PROJECT_DIR/build/package_test
- cmake --build $CI_PROJECT_DIR/build/package_test
@@ -251,6 +259,7 @@ benchmark:rocm:
- cmake
-G Ninja
-D BUILD_TEST=OFF
-D CMAKE_CXX_STANDARD=14
-S $CI_PROJECT_DIR
-B $CI_PROJECT_DIR/build_only_install
# Preserve $PATH when sudoing
@@ -318,6 +327,7 @@ build:nvcc:
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CUDA_COMPILER_LAUNCHER=phc_sccache_cuda
-D CMAKE_CXX_STANDARD="$BUILD_VERSION"
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
@@ -335,6 +345,9 @@ build:nvcc:
- $CI_PROJECT_DIR/build/hipcub*.zip
- $CI_PROJECT_DIR/build/.ninja_log
expire_in: 2 weeks
parallel:
matrix:
- BUILD_VERSION: [14, 17]

build:nvcc-benchmark:
stage: build
@@ -354,6 +367,7 @@ build:nvcc-benchmark:
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CUDA_COMPILER_LAUNCHER=phc_sccache_cuda
-D CMAKE_CXX_STANDARD=14
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -10,10 +10,13 @@ Full documentation for hipCUB is available at [https://rocm.docs.amd.com/project
* Added the `hipcub::CubVector` type for CUB parity.
* Added `--emulation` option for `rtest.py`
* Unit tests can be run with `[--emulation|-e|--test|-t]=<test_name>`
* Added `DeviceSelect::FlaggedIf` and its in-place overload.

### Changed
* Changed the subset of tests that are run for smoke tests such that the smoke test will complete with faster run-time and to never exceed 2GB of vram usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests.
* The `rtest.py` options have changed. `rtest.py` is now run with at least either `--test|-t` or `--emulation|-e`, but not both options.
* The NVIDIA backend now requires CUB, Thrust, and libcu++ 2.5.0. If they are not found, they will be downloaded from the NVIDIA CCCL repository.
* Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release.

## hipCUB-3.3.0 for ROCm 6.3.0

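The changelog entry above adds `DeviceSelect::FlaggedIf`. A minimal usage sketch of the call pattern, mirroring the benchmark added later in this PR — the predicate, buffer names, and error handling below are illustrative, not part of the change; per the upstream CUB documentation the predicate is applied to each flag value to decide whether the corresponding input item is kept:

```cpp
#include <hipcub/hipcub.hpp>
#include <hip/hip_runtime.h>

// Illustrative predicate: an item is kept when its flag satisfies this test.
struct keep_odd_flags
{
    __host__ __device__ bool operator()(const char flag) const
    {
        return (flag % 2) == 1;
    }
};

// Two-phase temporary-storage pattern used by hipCUB device-wide algorithms:
// the first call only reports how much temporary storage is needed.
hipError_t flagged_if_sketch(const int*    d_in,
                             const char*   d_flags,
                             int*          d_out,
                             unsigned int* d_num_selected,
                             int           num_items,
                             hipStream_t   stream)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    hipError_t error = hipcub::DeviceSelect::FlaggedIf(d_temp_storage,
                                                       temp_storage_bytes,
                                                       d_in,
                                                       d_flags,
                                                       d_out,
                                                       d_num_selected,
                                                       num_items,
                                                       keep_odd_flags{},
                                                       stream);
    if(error != hipSuccess)
    {
        return error;
    }

    error = hipMalloc(&d_temp_storage, temp_storage_bytes);
    if(error != hipSuccess)
    {
        return error;
    }

    // Second call performs the selection into d_out and writes the number of
    // selected items to d_num_selected.
    error = hipcub::DeviceSelect::FlaggedIf(d_temp_storage,
                                            temp_storage_bytes,
                                            d_in,
                                            d_flags,
                                            d_out,
                                            d_num_selected,
                                            num_items,
                                            keep_odd_flags{},
                                            stream);

    const hipError_t free_error = hipFree(d_temp_storage);
    return error != hipSuccess ? error : free_error;
}
```

The in-place overload mentioned in the changelog follows the same pattern but, like the other in-place `DeviceSelect` overloads, takes a single data buffer instead of separate input and output pointers.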
13 changes: 11 additions & 2 deletions CMakeLists.txt
@@ -1,6 +1,6 @@
# MIT License
#
# Copyright (c) 2017-2023 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
@@ -61,10 +61,19 @@ endif()
set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE CACHE BOOL "Add paths to linker search and installed rpath")

# Set CXX flags
set(CMAKE_CXX_STANDARD 14)
if (NOT DEFINED CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 17)
endif()
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)


if (CMAKE_CXX_STANDARD EQUAL 14)
message(WARNING "C++14 will be deprecated in the next major release")
elseif(NOT CMAKE_CXX_STANDARD EQUAL 17)
message(FATAL_ERROR "Only C++14 and C++17 are supported")
endif()

# rocm-cmake has to be included early so that it's available to set GPU_TARGETS
# If hip is included prior to setting that then it defaults to building only for the current architecture
include(ROCmCMakeBuildToolsDependency)
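The block above now defaults `CMAKE_CXX_STANDARD` to 17, still accepts an explicit 14 with a deprecation warning, and rejects any other value. For reference, a hypothetical source-side guard (not part of this PR) that reproduces just the deprecation warning and spells out which `__cplusplus` values the two standards correspond to:

```cpp
// Hypothetical guard, not part of this PR. __cplusplus is 201402L for C++14
// and 201703L for C++17; the CMake logic above additionally rejects any
// standard other than these two at configure time.
#if __cplusplus < 201402L
    #error "hipCUB requires at least C++14"
#elif __cplusplus < 201703L
    // Still accepted, but flagged as deprecated in the changelog.
    #pragma message("C++14 support will be deprecated in the next major release")
#endif
```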
2 changes: 1 addition & 1 deletion README.md
@@ -48,7 +48,7 @@ python3 -m http.server
* Requires CMake 3.16.9 or later
* For NVIDIA GPUs:
* CUDA Toolkit
* CCCL library (>= 2.4.0)
* CCCL library (>= 2.5.0)
* Automatically downloaded and built by the CMake script
* Requires CMake 3.15.0 or later
* Python 3.6 or higher (for HIP on Windows only; this is only required for install scripts)
6 changes: 3 additions & 3 deletions benchmark/benchmark_device_adjacent_difference.cpp
@@ -178,12 +178,12 @@ void run_benchmark(benchmark::State& state, const std::size_t size, const hipStr
state.SetBytesProcessed(state.iterations() * batch_size * size * sizeof(T));
state.SetItemsProcessed(state.iterations() * batch_size * size);

hipFree(d_input);
HIP_CHECK(hipFree(d_input));
if(copy)
{
hipFree(d_output);
HIP_CHECK(hipFree(d_output));
}
hipFree(d_temp_storage);
HIP_CHECK(hipFree(d_temp_storage));
}

} // namespace
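This diff (and several of the ones below) wraps `hipFree` and hipCUB API calls in `HIP_CHECK` so that their `hipError_t` results are no longer silently discarded, which is also what triggered the nodiscard warnings fixed elsewhere in this PR. The macro itself lives in the shared benchmark/test utilities and is not shown here; a minimal sketch of the pattern such a macro follows (name and exact behavior are assumptions, not the project's actual definition):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Sketch only: evaluate the HIP call once and abort with a readable message
// if it did not return hipSuccess.
#define HIP_CHECK_SKETCH(condition)                                      \
    do                                                                   \
    {                                                                    \
        const hipError_t error_ = (condition);                           \
        if(error_ != hipSuccess)                                         \
        {                                                                \
            std::fprintf(stderr,                                         \
                         "HIP error '%s' at %s:%d\n",                    \
                         hipGetErrorString(error_),                      \
                         __FILE__,                                       \
                         __LINE__);                                      \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while(0)

// Usage mirrors the diff: HIP_CHECK_SKETCH(hipFree(d_input));
```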
32 changes: 16 additions & 16 deletions benchmark/benchmark_device_run_length_encode.cpp
@@ -111,14 +111,14 @@ void run_encode_benchmark(benchmark::State& state,

for(size_t i = 0; i < batch_size; i++)
{
hipcub::DeviceRunLengthEncode::Encode(d_temporary_storage,
temporary_storage_bytes,
d_input,
d_unique_output,
d_counts_output,
d_runs_count_output,
size,
stream);
HIP_CHECK(hipcub::DeviceRunLengthEncode::Encode(d_temporary_storage,
temporary_storage_bytes,
d_input,
d_unique_output,
d_counts_output,
d_runs_count_output,
size,
stream));
}
HIP_CHECK(hipStreamSynchronize(stream));

@@ -214,14 +214,14 @@ void run_non_trivial_runs_benchmark(benchmark::State& state,

for(size_t i = 0; i < batch_size; i++)
{
hipcub::DeviceRunLengthEncode::NonTrivialRuns(d_temporary_storage,
temporary_storage_bytes,
d_input,
d_offsets_output,
d_counts_output,
d_runs_count_output,
size,
stream);
HIP_CHECK(hipcub::DeviceRunLengthEncode::NonTrivialRuns(d_temporary_storage,
temporary_storage_bytes,
d_input,
d_offsets_output,
d_counts_output,
d_runs_count_output,
size,
stream));
}
HIP_CHECK(hipStreamSynchronize(stream));

151 changes: 136 additions & 15 deletions benchmark/benchmark_device_select.cpp
@@ -113,11 +113,11 @@ void run_flagged_benchmark(benchmark::State& state,
state.SetBytesProcessed(state.iterations() * batch_size * size * sizeof(T));
state.SetItemsProcessed(state.iterations() * batch_size * size);

hipFree(d_input);
hipFree(d_flags);
hipFree(d_output);
hipFree(d_selected_count_output);
hipFree(d_temp_storage);
HIP_CHECK(hipFree(d_input));
HIP_CHECK(hipFree(d_flags));
HIP_CHECK(hipFree(d_output));
HIP_CHECK(hipFree(d_selected_count_output));
HIP_CHECK(hipFree(d_temp_storage));
HIP_CHECK(hipDeviceSynchronize());
}

@@ -211,7 +211,104 @@ void run_selectop_benchmark(benchmark::State& state,
state.SetBytesProcessed(state.iterations() * batch_size * size * sizeof(T));
state.SetItemsProcessed(state.iterations() * batch_size * size);

HIP_CHECK(hipFree(d_input));
HIP_CHECK(hipFree(d_output));
HIP_CHECK(hipFree(d_selected_count_output));
HIP_CHECK(hipFree(d_temp_storage));
HIP_CHECK(hipDeviceSynchronize());
}

template<class T, class FlagType>
void run_flagged_if_benchmark(benchmark::State& state,
size_t size,
const hipStream_t stream,
float true_probability)
{
std::vector<T> input
= benchmark_utils::get_random_data<T>(size,
benchmark_utils::generate_limits<T>::min(),
benchmark_utils::generate_limits<T>::max());

std::vector<FlagType> flags
= benchmark_utils::get_random_data01<FlagType>(size, true_probability);

SelectOperator<T> select_flag_op(true_probability);

T* d_input;
FlagType* d_flags;
T* d_output;
unsigned int* d_selected_count_output;
HIP_CHECK(hipMalloc(&d_input, input.size() * sizeof(T)));
HIP_CHECK(hipMalloc(&d_flags, flags.size() * sizeof(FlagType)));
HIP_CHECK(hipMalloc(&d_output, input.size() * sizeof(T)));
HIP_CHECK(hipMalloc(&d_selected_count_output, sizeof(unsigned int)));
HIP_CHECK(hipMemcpy(d_input, input.data(), input.size() * sizeof(T), hipMemcpyHostToDevice));
HIP_CHECK(
hipMemcpy(d_flags, flags.data(), flags.size() * sizeof(FlagType), hipMemcpyHostToDevice));
HIP_CHECK(hipDeviceSynchronize());
// Allocate temporary storage memory
size_t temp_storage_size_bytes = 0;

// Get size of d_temp_storage
HIP_CHECK(hipcub::DeviceSelect::FlaggedIf(nullptr,
temp_storage_size_bytes,
d_input,
d_flags,
d_output,
d_selected_count_output,
input.size(),
select_flag_op,
stream));
HIP_CHECK(hipDeviceSynchronize());

// allocate temporary storage
void* d_temp_storage = nullptr;
HIP_CHECK(hipMalloc(&d_temp_storage, temp_storage_size_bytes));
HIP_CHECK(hipDeviceSynchronize());

// Warm-up
for(size_t i = 0; i < 10; i++)
{
HIP_CHECK(hipcub::DeviceSelect::FlaggedIf(d_temp_storage,
temp_storage_size_bytes,
d_input,
d_flags,
d_output,
d_selected_count_output,
input.size(),
select_flag_op,
stream));
}
HIP_CHECK(hipDeviceSynchronize());

const unsigned int batch_size = 10;
for(auto _ : state)
{
auto start = std::chrono::high_resolution_clock::now();
for(size_t i = 0; i < batch_size; i++)
{
HIP_CHECK(hipcub::DeviceSelect::FlaggedIf(d_temp_storage,
temp_storage_size_bytes,
d_input,
d_flags,
d_output,
d_selected_count_output,
input.size(),
select_flag_op,
stream));
}
HIP_CHECK(hipDeviceSynchronize());

auto end = std::chrono::high_resolution_clock::now();
auto elapsed_seconds
= std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
state.SetIterationTime(elapsed_seconds.count());
}
state.SetBytesProcessed(state.iterations() * batch_size * size * sizeof(T));
state.SetItemsProcessed(state.iterations() * batch_size * size);

hipFree(d_input);
hipFree(d_flags);
hipFree(d_output);
hipFree(d_selected_count_output);
hipFree(d_temp_storage);
@@ -301,10 +398,10 @@ void run_unique_benchmark(benchmark::State& state,
state.SetBytesProcessed(state.iterations() * batch_size * size * sizeof(T));
state.SetItemsProcessed(state.iterations() * batch_size * size);

hipFree(d_input);
hipFree(d_output);
hipFree(d_selected_count_output);
hipFree(d_temp_storage);
HIP_CHECK(hipFree(d_input));
HIP_CHECK(hipFree(d_output));
HIP_CHECK(hipFree(d_selected_count_output));
HIP_CHECK(hipFree(d_temp_storage));
}

template<class KeyT, class ValueT>
@@ -414,12 +511,12 @@ void run_unique_by_key_benchmark(benchmark::State& state,
* (sizeof(KeyT) + sizeof(ValueT)));
state.SetItemsProcessed(state.iterations() * batch_size * size);

hipFree(d_keys_input);
hipFree(d_values_input);
hipFree(d_keys_output);
hipFree(d_values_output);
hipFree(d_selected_count_output);
hipFree(d_temp_storage);
HIP_CHECK(hipFree(d_keys_input));
HIP_CHECK(hipFree(d_values_input));
HIP_CHECK(hipFree(d_keys_output));
HIP_CHECK(hipFree(d_values_output));
HIP_CHECK(hipFree(d_selected_count_output));
HIP_CHECK(hipFree(d_temp_storage));
}

#define CREATE_SELECT_FLAGGED_BENCHMARK(T, F, p) \
@@ -442,6 +539,17 @@ void run_unique_by_key_benchmark(benchmark::State& state,
stream, \
p)

#define CREATE_SELECT_FLAGGED_IF_BENCHMARK(T, F, p) \
benchmark::RegisterBenchmark( \
std::string("device_select_flagged_if<data_type:" #T ",flag_type:" #F \
",output_data_type:" #T \
",selected_output_data_type:unsigned int>.(probability:" #p ")") \
.c_str(), \
&run_flagged_if_benchmark<T, F>, \
size, \
stream, \
p)

#define CREATE_UNIQUE_BENCHMARK(T, p) \
benchmark::RegisterBenchmark( \
std::string("device_select_unique<data_type:" #T ",output_data_type:" #T \
@@ -472,6 +580,12 @@ void run_unique_by_key_benchmark(benchmark::State& state,
CREATE_SELECT_IF_BENCHMARK(type, 0.05f), CREATE_SELECT_IF_BENCHMARK(type, 0.25f), \
CREATE_SELECT_IF_BENCHMARK(type, 0.5f), CREATE_SELECT_IF_BENCHMARK(type, 0.75f)

#define BENCHMARK_FLAGGED_IF_TYPE(type, value) \
CREATE_SELECT_FLAGGED_IF_BENCHMARK(type, value, 0.05f), \
CREATE_SELECT_FLAGGED_IF_BENCHMARK(type, value, 0.25f), \
CREATE_SELECT_FLAGGED_IF_BENCHMARK(type, value, 0.5f), \
CREATE_SELECT_FLAGGED_IF_BENCHMARK(type, value, 0.75f)

#define BENCHMARK_UNIQUE_TYPE(type) \
CREATE_UNIQUE_BENCHMARK(type, 0.05f), CREATE_UNIQUE_BENCHMARK(type, 0.25f), \
CREATE_UNIQUE_BENCHMARK(type, 0.5f), CREATE_UNIQUE_BENCHMARK(type, 0.75f)
@@ -523,6 +637,13 @@ int main(int argc, char* argv[])
BENCHMARK_IF_TYPE(int8_t),
BENCHMARK_IF_TYPE(custom_int_double),

BENCHMARK_FLAGGED_IF_TYPE(int, unsigned char),
BENCHMARK_FLAGGED_IF_TYPE(float, unsigned char),
BENCHMARK_FLAGGED_IF_TYPE(double, unsigned char),
BENCHMARK_FLAGGED_IF_TYPE(uint8_t, uint8_t),
BENCHMARK_FLAGGED_IF_TYPE(int8_t, int8_t),
BENCHMARK_FLAGGED_IF_TYPE(custom_double2, unsigned char),

BENCHMARK_UNIQUE_TYPE(int),
BENCHMARK_UNIQUE_TYPE(float),
BENCHMARK_UNIQUE_TYPE(double),
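The new `run_flagged_if_benchmark` above passes a `SelectOperator<T>` constructed from the selection probability as the `select_op` argument; that functor is defined in the benchmark sources and is not part of this diff. Purely as an illustration of the kind of object `select_op` expects — a stateful predicate with a device-callable `operator()` returning `bool` — a sketch (the real `SelectOperator<T>` may be defined differently):

```cpp
#include <hip/hip_runtime.h>

// Illustrative only: a threshold predicate of the general shape the
// device_select benchmarks pass as select_op.
template<class T>
struct select_threshold_op
{
    T threshold; // values strictly below this are selected

    __host__ __device__ bool operator()(const T& value) const
    {
        return value < threshold;
    }
};
```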