This Celerity release introduces multiple improvements to runtime performance, developer experience, and compatibility. This version requires C++20, and upgrading may also require minor adjustments in buffer access handling and usage of deprecated features.
HIGHLIGHTS
- Celerity can now be built without MPI for single-node, multi-device setups.
  A single process can manage multiple devices without spawning extra MPI ranks.
- Tracy integration has been improved, providing clearer warnings for uninitialized reads and better executor starvation reporting.
- Substantial performance optimizations, including per-device submission threads, thread pinning, and reduced MPI transfer overhead.
- `celerity::distr_queue` has been replaced by `celerity::queue`.
  Multiple instances of `celerity::queue` are now supported, with behavior more closely aligned with SYCL.
- Runtime shutdown can now be explicitly controlled via `celerity::shutdown()`.
  This complements `celerity::init()` for finer control over the runtime lifecycle.
- Celerity now uses and requires C++20
Changelog
This release includes changes that may require adjustments when upgrading:
- Celerity now requires C++20
- `celerity::distr_queue` has been replaced by `celerity::queue`.
  Multiple instances of `celerity::queue` are now supported, with behavior more closely aligned with SYCL.
- Buffer access handling has been refactored: `celerity::access_mode` is now a dedicated enum.
  Using `sycl::access_mode` on Celerity buffers is no longer supported.
- Coordinate-list constructors of `access::neighborhood` have been deprecated in favor of the `range` overload.
- We recommend performing a clean build when updating Celerity to ensure all updated submodule dependencies are properly propagated.
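
The access-mode change is easiest to see in isolation. The following stand-alone sketch models it with made-up namespaces (`sycl_model`, `celerity_old_model`, `celerity_new_model`), a shortened enumerator list, and a hypothetical `is_producer` helper; none of these are the real SYCL or Celerity declarations. It only shows why code that passes the SYCL enum stops compiling once the alias becomes a distinct enum.

```cpp
#include <type_traits>

// Hypothetical stand-in for the SYCL enum; not the real SYCL header definition.
namespace sycl_model {
enum class access_mode { read, write, read_write };
}

// Before: the access mode was an alias, so both spellings named the same type.
namespace celerity_old_model {
using access_mode = sycl_model::access_mode;
}

// After: a dedicated enum is a distinct type without implicit conversions.
namespace celerity_new_model {
enum class access_mode { read, write, read_write };

// Hypothetical helper standing in for any API that accepts an access mode.
constexpr bool is_producer(access_mode m) { return m != access_mode::read; }
}

// With the alias, the SYCL enum and the library enum were interchangeable.
static_assert(std::is_same_v<celerity_old_model::access_mode, sycl_model::access_mode>);
// With the dedicated enum, they are unrelated types ...
static_assert(!std::is_same_v<celerity_new_model::access_mode, sycl_model::access_mode>);
// ... and a SYCL value can no longer be passed where the library enum is expected.
static_assert(!std::is_convertible_v<sycl_model::access_mode, celerity_new_model::access_mode>);
```

Under this model, the fix in application code is a rename of the qualifying namespace rather than a change in logic.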
We recommend using the following SYCL versions with this release:
- DPC++: ad494e9d or newer
- AdaptiveCpp (formerly hipSYCL): v24.06
- SimSYCL: master
See our platform support guide for a complete list of all officially supported configurations.
Added
- Support builds for single-node multi-device setups without MPI by specifying `-DCELERITY_ENABLE_MPI=0` in CMake (#282)
- Add `celerity::once` tag type for host tasks (equivalent to `range<0>{}`) as a replacement for `on_master_node` (#282)
- Replace `celerity::distr_queue` with `celerity::queue`, which permits multiple instances and aligns closer with SYCL (#283)
- The runtime can be explicitly shut down using `celerity::shutdown()`, complementing `celerity::init()` (#283)
- `handler::parallel_for(size_t, [size_t,] ...)` now acts as a shorthand for `parallel_for(range<1>, [id<1>,] ...)` (#288)
- Experimental support for the AdaptiveCpp generic single-pass compiler (#294)
- Constructor overloads to the `access::neighborhood` range mapper for reads in 3/5/7-point stencil codes (#292)
- The SYCL backend now uses per-device submission threads to dispatch commands for better performance.
  This new behaviour is enabled by default, and can be disabled via `CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS` (#303)
- Celerity now has a thread pinning mechanism to control how threads are pinned to CPU cores.
  This can be controlled via the `CELERITY_THREAD_PINNING` environment variable (#309)
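
For readers unfamiliar with neighborhood-style range mappers, the sketch below models the idea on plain index boxes: the chunk a kernel instance works on is grown by a border in each dimension and clamped to the buffer extent. The `box2` type and `expand_clamped` function are hypothetical names for illustration, and the bounding-box behaviour shown is our reading of the mapper, not a copy of Celerity's implementation.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 2D index box, half-open: [min, max) in each dimension.
// A stand-alone model, not Celerity's own box or subrange type.
struct box2 {
    std::array<std::size_t, 2> min;
    std::array<std::size_t, 2> max;
};

// Models a neighborhood-style mapping: grow the kernel chunk by `border`
// elements in every direction, then clamp the result to the buffer extent.
constexpr box2 expand_clamped(const box2& chunk, const std::array<std::size_t, 2>& border,
                              const std::array<std::size_t, 2>& buffer_extent) {
    box2 result = chunk;
    for (std::size_t d = 0; d < 2; ++d) {
        assert(chunk.min[d] <= chunk.max[d] && chunk.max[d] <= buffer_extent[d]);
        // Guard against unsigned underflow at the lower buffer boundary.
        result.min[d] = chunk.min[d] >= border[d] ? chunk.min[d] - border[d] : 0;
        result.max[d] = std::min(chunk.max[d] + border[d], buffer_extent[d]);
    }
    return result;
}
```

In this model, a chunk covering rows 4 to 7 of a 16x16 buffer with a border of one element reads rows 3 to 8, while the column range stays clamped to the full width of the buffer.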
Changed
- Update Tracy dependency to v0.11.1 (#281)
- Update libenvpp dependency to 1.5 (#312)
- Update fmt dependency to 11.1.2 (#328)
- Update spdlog dependency to HEAD > 1.15.0 (#328)
- Celerity now requires C++20 (#291)
- Automatic runtime shutdown, which was previously triggered by the last queue / buffer / host object going out of scope,
  is now postponed until process termination (atexit()). This allows multiple non-overlapping sections of Celerity code
  to execute in the same process (#283)
- Celerity warns on excessive calls to `queue::wait()` or `distr_queue::slow_full_sync()` in a long running program.
  This operation has a much more pronounced performance penalty than its SYCL counterpart (#283)
- On systems that do not support device-to-device copies, data is now staged in linearized buffers for better performance (#287)
- Removed the flush_async workaround for newer ACPP versions, keeping compatibility with older versions (#333)
- The `access::neighborhood` built-in range mapper now receives a `range` instead of a coordinate list (#292)
- Overhauled the installation and configuration documentation (#309)
- Celerity will now queue up several command groups in order to combine allocations and elide resize operations.
  This behavior can be influenced using the new `experimental::set_lookahead` and `experimental::flush` APIs (#298)
- Reduced small host-buffer allocations in MPI transfers by accumulating touched boxes during `anticipate()` (#313)
- Celerity internals are no longer exposed to users through installed headers (#308)
- Buffer `access_mode` is now a dedicated `celerity::access_mode` enum instead of an alias of `sycl::access_mode`, simplifying
  the include tree and removing namespace ambiguity. `sycl::access_mode` can no longer be used with Celerity buffers. (#315)
- Uninitialized read warnings now provide more helpful information (#321)
- Improved Tracy integration for executor starvation. Celerity now also prints a warning when execution time exceeds a
  given percentage threshold, indicating that the application might be scheduler-bound (#322)
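
To give an intuition for why queueing up command groups can elide resize operations, here is a deliberately simplified 1D model. It counts how often a backing allocation has to be created or grown when accesses are inspected one at a time versus in windows of several command groups. The `interval` type and `count_allocations` function are hypothetical; the real scheduler deals with multi-dimensional boxes, multiple memories, and heuristics that this sketch ignores.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 1D access of one command group, half-open: [begin, end).
struct interval {
    std::size_t begin;
    std::size_t end;
};

// Counts how often the backing allocation is created or resized when
// `lookahead` command groups are inspected together before allocating.
// lookahead == 1 models allocating eagerly for every command group.
template <std::size_t N>
constexpr std::size_t count_allocations(const std::array<interval, N>& accesses, std::size_t lookahead) {
    assert(lookahead > 0);
    interval allocated{0, 0}; // empty: nothing allocated yet
    std::size_t allocations = 0;
    for (std::size_t first = 0; first < N; first += lookahead) {
        const std::size_t last = std::min(first + lookahead, N);
        // Bounding interval of the current allocation and every access in the window.
        interval required = allocated;
        for (std::size_t i = first; i < last; ++i) {
            if (accesses[i].begin == accesses[i].end) continue; // skip empty accesses
            if (required.begin == required.end) {
                required = accesses[i];
            } else {
                required.begin = std::min(required.begin, accesses[i].begin);
                required.end = std::max(required.end, accesses[i].end);
            }
        }
        // Allocate or resize once per window, and only if the window needs more room.
        if (required.begin != allocated.begin || required.end != allocated.end) {
            allocated = required;
            ++allocations;
        }
    }
    return allocations;
}
```

In this model, three command groups that access growing ranges cause one allocation and two resizes when handled eagerly, but a single allocation when all three are inspected together. The trade-off is latency, since work is held back until the window fills or is flushed.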
Fixed
- Host-initialized buffers will not read from user-provided memory after the last reference to the buffer has been dropped (#283)
- Fix a build issue on macOS where moving a std::function did not clear the source, causing failing test cases (#285)
- Fix a path hint for finding AdaptiveCpp when using an installed Celerity (#286)
- Fix a race condition in unit tests by updating last_epoch_reached before signalling the epoch promise, ensuring proper synchronization (#307)
- Fix a build issue with (rare) configurations which enable both Tracy and OOB-checks (#331)
Deprecated
- `celerity::distr_queue` is deprecated in favor of `celerity::queue` (#283)
- The coordinate-list constructors of `access::neighborhood` are deprecated in favor of the `range` overload (#292)
Internal
- Command graphs generate a single "fat" push command instead of a separate push for each write and target node. (#290)
- Event polling now only happens for instructions that are actively executing (#293)
- Task management now uses epoch-based structures, removes the ring buffer size limit, and handles tasks via
  stable pointers, simplifying scheduler and application thread interactions (#295)
- Command graph now uses `command` instead of `abstract_command`, moves CDAG-related pruning to the scheduler,
  and maintains command pointers in the CDAG generator (#297)
- `buffer_access_map` now works in terms of consumed and produced regions instead of access modes.
  This includes various related improvements to task requirements, execution ranges, and graph printing (#300)
- Use `region_map::update_box` instead of `update_region` where applicable (#302)
- Improved "system" benchmarks to better capture effects that are highly significant in real-world workloads (#304)
- Unified thread code, with a single source of truth for thread names and Tracy thread ordering (#310)
- Optimize `perform_task_buffer_accesses` to skip redundant last-writers updates and transpose loops,
  yielding minor performance improvements in scheduler-bound workloads (#317)
- The SimSYCL workaround for thread safety has been removed (#318)
- Prevent unbounded growth in `receive_arbiter` by caching active transfers (#319)
- Centralize definition of Tracy colors (#320)
- Change split functions to work on box instead of chunk (#323)
- Align await-pushes with pushes by computing the union of regions for remote chunks executed on the same node (#324)
- Celerity now uses `SYCL_IS_*` macros instead of `defined(__SYCL_COMPILER_VERSION)` for checking the SYCL version (#329)
- Removed internal branches on `CELERITY_FEATURE_UNNAMED_KERNELS`, which now only exists for backwards compatibility in
  applications (#329)