This release includes major overhauls to many of Celerity's core internals, improving performance and debuggability as well as laying the groundwork for future optimizations.
HIGHLIGHTS
- Celerity now supports SimSYCL, a SYCL implementation focused on debugging and verification (#238).
- Multiple devices can now be managed by a single Celerity process, which allows for more efficient device-to-device communication (#265).
- The Celerity runtime can now be configured to log detailed tracing events for the Tracy hybrid profiler (#267).
- Reductions are now supported across all SYCL implementations (#265).
- The new experimental::hints::oversubscribe hint can be used to improve computation-communication overlapping (#249).
- API documentation is now available, generated by 🥬doc.
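To illustrate what a finer split granularity means for the oversubscribe hint, here is a minimal, self-contained sketch. It is not Celerity's actual splitting logic: the `chunk` type, the `split_1d` helper and the `factor` parameter are hypothetical names introduced for illustration. The idea is that each device receives `factor` chunks instead of one, so data transfers for one chunk may overlap with computation on another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type, not part of the Celerity API: a contiguous 1D chunk of work.
struct chunk {
    std::size_t offset;
    std::size_t size;
};

// Hypothetical helper: splits `total` work items into `num_devices * factor`
// nearly equal chunks. factor == 1 yields one chunk per device; larger
// factors oversubscribe each device with several smaller chunks.
inline std::vector<chunk> split_1d(std::size_t total, std::size_t num_devices, std::size_t factor) {
    assert(num_devices > 0 && factor > 0);
    const std::size_t num_chunks = num_devices * factor;
    const std::size_t base = total / num_chunks;
    const std::size_t remainder = total % num_chunks;

    std::vector<chunk> chunks;
    chunks.reserve(num_chunks);
    std::size_t offset = 0;
    for(std::size_t i = 0; i < num_chunks; ++i) {
        // The first `remainder` chunks take one extra item so every item is covered.
        const std::size_t size = base + (i < remainder ? 1 : 0);
        chunks.push_back({offset, size});
        offset += size;
    }
    return chunks;
}
```

With 2 devices and a factor of 2, `split_1d(10, 2, 2)` produces four chunks of sizes 3, 3, 2 and 2.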
Changelog
This release includes changes that may require adjustments when upgrading:
- A single Celerity process can now manage multiple devices. This means that on a cluster with 4 GPUs per node, only a single MPI rank needs to be spawned per node.
- The previous behavior of having a separate process per device is still supported but discouraged, as it incurs additional overhead.
- It is no longer possible to assign a device to a Celerity process using the CELERITY_DEVICES environment variable. Please use vendor-specific mechanisms (such as CUDA_VISIBLE_DEVICES) for limiting the set of visible devices instead.
- We recommend performing a clean build when updating Celerity so that updated submodule dependencies are properly propagated.
We recommend using the following SYCL versions with this release:
- DPC++: 89327e0a or newer
- AdaptiveCpp (formerly hipSYCL): v24.06
- SimSYCL: master
See our platform support guide for a complete list of all officially supported configurations.
Added
- Add support for SimSYCL as a SYCL implementation (#238)
- Extend compiler support to GCC (optionally with sanitizers) and C++20 code bases (#238)
- celerity::hints::oversubscribe can be passed to a command group to increase split granularity and improve computation-communication overlap (#249)
- Reductions are now unconditionally supported on all SYCL implementations (#265)
- Add support for profiling with Tracy, via CELERITY_TRACY_SUPPORT and environment variable CELERITY_TRACY (#267)
- The active SYCL implementation can now be queried via CELERITY_SYCL_IS_* macros (#277)
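As a sketch of how the CELERITY_SYCL_IS_* macros might be used for conditional compilation, consider the snippet below. The concrete suffixes DPCPP, ACPP and SIMSYCL, and the assumption that each macro expands to 0 or 1, are not stated in this changelog and should be checked against version.h. The function name `sycl_implementation_name` is hypothetical.

```cpp
#include <string_view>

// In a real project, including the Celerity headers would define these macros.
// ASSUMPTION: the macro suffixes (DPCPP, ACPP, SIMSYCL) and their 0/1 values
// are illustrative; only the CELERITY_SYCL_IS_* pattern is given in the changelog.
// Undefined identifiers evaluate to 0 inside #if, so this also compiles standalone.
constexpr std::string_view sycl_implementation_name() {
#if CELERITY_SYCL_IS_DPCPP
    return "DPC++";
#elif CELERITY_SYCL_IS_ACPP
    return "AdaptiveCpp";
#elif CELERITY_SYCL_IS_SIMSYCL
    return "SimSYCL";
#else
    return "unknown";
#endif
}
```

Compiled on its own, without any of the macros defined, the function falls through to the "unknown" branch.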
Changed
- All low-level host / device operations such as memory allocations, copies, and kernel launches are now represented in the single Instruction Graph for improved asynchronicity (#249)
- Celerity can now maintain multiple disjoint backing allocations per buffer, so disjoint accesses to the same buffer do not trigger bounding-box allocations (#249)
- The previous implicit size limit of 128 GiB on buffer transfers is lifted (#249, #252)
- Celerity now manages multiple devices per node / MPI rank. This significantly reduces overhead in multi-GPU setups (#265)
- Runtime lifetime is extended until destruction of the last queue, buffer, or host object (#265)
- Host object instances are now destroyed from a runtime background thread instead of the application thread (#265)
- Collective host tasks in the same collective group continue to execute on the same communicator, but not necessarily on the same background thread anymore (#265)
- Updated the internal libenvpp dependency to 1.4.1 and use its new features (#271)
- Celerity's compile-time feature flags and options are now written to version.h instead of being passed on the command line (#277)
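The benefit of multiple disjoint backing allocations over a single bounding-box allocation can be shown with some simple arithmetic. The sketch below is purely illustrative and does not reflect Celerity's internal data structures: `range_1d`, `bounding_box_elements` and `disjoint_elements` are hypothetical names, and the example is restricted to one dimension.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical 1D half-open element range [begin, end); not a Celerity type.
struct range_1d {
    std::size_t begin;
    std::size_t end;
};

// Elements required when both accesses share one allocation spanning their bounding box.
constexpr std::size_t bounding_box_elements(range_1d a, range_1d b) {
    return std::max(a.end, b.end) - std::min(a.begin, b.begin);
}

// Elements required when each access gets its own backing allocation.
// Assumes that a and b do not overlap.
constexpr std::size_t disjoint_elements(range_1d a, range_1d b) {
    return (a.end - a.begin) + (b.end - b.begin);
}
```

For accesses to elements [0, 100) and [900, 1000) of the same buffer, a bounding-box allocation spans 1000 elements, whereas two disjoint allocations need only 200.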
Fixed
- Scheduler tracking structures are now garbage-collected after buffers and host objects go out of scope (#246)
- The previous requirement to order accessors by access mode is lifted (#265)
- SYCL reductions to which only some Celerity nodes contribute partial results no longer read uninitialized data (#265)
Removed
- Celerity no longer attempts to spill device allocations to the host if resizing buffers fails due to an out-of-memory condition (#265)
- The CELERITY_DEVICES environment variable is removed in favor of platform-specific visibility specifiers such as CUDA_VISIBLE_DEVICES (#265)
- The obsolete experimental::user_benchmarker infrastructure has been removed (#268)