-
Notifications
You must be signed in to change notification settings - Fork 148
/
Copy pathCHANGELOG
118 lines (88 loc) · 4.57 KB
/
CHANGELOG
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
CHANGELOG
===============================================================
v2.4.0 - 2023-10-06
---------------------------------------------------------------
Changes:
- Increased maximum CUDA version to 12.2, and now supporting HPC SDK 23.7
- Fixed issue preventing parameters from being updated after config initialisation
- Restructured all source files and removed plugin feature
- Replaced custom memory pool with cudaMallocAsync when defining USE_CUDAMALLOCASYNC
- Changed the cuSPARSE SpMV algorithm choice to CUSPARSE_CSRMV_ALG1, which should improve solve performance for recent versions of cuSPARSE
- Added single-kernel csrmv that is invoked when total number of rows in the local matrix falls below 3 times the number of SMs on the target GPUs
- Changes to thrust
-- Increased thrust version to 2.1.0
-- Added specific tested version of thrust as a submodule, please use git clone --recursive to pull AmgX from v2.4.0 onwards
-- Wrapped thrust in namespace to avoid shared library sharing issues referenced here https://github.com/NVIDIA/thrust/releases/tag/1.14.0
-- Removed many superfluous points of synchronisation introduced by thrust
- Improved performance of writing matrices to file
- Improved Clang compatibility
- Added a divergence check, providing new config parameter rel_div_tolerance
- Added compile-time definition to avoid exception handling, in order to improve experience when debugging (DISABLE_EXCEPTION_HANDLING)
- Fixed multiple synchronisation issues that can show up on newer GPU architectures (sm_70+)
- Fixed partition reordering for block_sizes > 1
- Fixed build issue that arose when AmgX is built as a subproject
- Fixed issue with OpenMP and NO_MPI linking
- Replaced some inline asm with intrinsics
- Fixed issue with exact_coarse_solve grid sizing
- Fixed issue with use_sum_stopping_criteria
- Fixed SIGFPE that could occur when the initial norm is 0
- Added a new API call AMGX_matrix_check_symmetry, that tests if a matrix is structurally or completely symmetric
Tested configurations:
Linux x86-64:
-- Ubuntu 20.04, Ubuntu 22.04
-- NVHPC 23.7, GCC 9.4.0, GCC 12.1
-- OpenMPI 4.0.x
-- CUDA 11.2, 11.8, 12.2
-- A100, H100
Note that while AMGX has support for building in Windows, testing on Windows is very limited.
===============================================================
v2.3.0 - 2022-06-30
---------------------------------------------------------------
Changes:
- Increased minimum CMake version to 3.18 and adapted to use CUDA as a language, making it possible to compile with HPC SDK
- Improved performance of compute_values_kernel by ~1.3x
- Optimised block tuning for aggressive coarsening
- Added an exact coarse solve, accessible via the default scope flag "exact_coarse_solve"
- Fixed issue where latency hiding could be enabled/disabled asymmetrically across available ranks
- Fixed bug with SpGEMM fallback that deleted cuSPARSE handle incorrectly
- Fixed bug with use of shared memory in estimate_c_hat_kernel
Tested configurations:
- Linux x86-64:
-- Ubuntu 20.04, Ubuntu 18.04
-- gcc 7.4.0, gcc 9.3.0
-- OpenMPI 4.0.x
-- CUDA 11.0, 11.2
- Windows 10 x86-64:
-- MS Visual Studio 2019 (msvc 19.28)
-- MS MPI v10.1.2
-- CUDA 11.0
Note that while AMGX has support for building in Windows, testing on Windows is very limited.
===============================================================
v2.2.0 - 2021-04-06
---------------------------------------------------------------
- Fixing GPU Direct support (now correct results and better perf)
- Fixing latency hiding (general perf and couple bugfixes in some specific cases)
- Tunings for Volta for agg and classical setup phase
- Gauss-siedel perf improvements on Volta+
- Ampere support
- Minor bugfixes and enhancements including reported/requested by community
Tested configurations:
- Linux x86-64:
-- Ubuntu 20.04, Ubuntu 18.04
-- gcc 7.4.0, gcc 9.3.0
-- OpenMPI 4.0.x
-- CUDA 10.2, 11.0, 11.2
- Windows 10 x86-64:
-- MS Visual Studio 2019 (msvc 19.28)
-- MS MPI v10.1.2
-- CUDA 10.2, 11.0
===============================================================
v2.1.0 - 2020-03-20
---------------------------------------------------------------
- Added new API that allows user to provide distributed matrix partitioning information in a new way - offset to the partition's first row in a matrix. Works only if partitions own continuous rows in matrix.
- Added example case for this new API (see examples/amgx_mpi_capi_cla.c)
- Distributed code improvements
===============================================================
v2.0.0 - 2017.10.17
---------------------------------------------------------------
Initial open source release