Commit c2400a7

NEWS: Add v1.18.0 description to main branch
1 parent ad1fb2d commit c2400a7

1 file changed: +151 -0 lines changed

NEWS

@@ -11,6 +11,157 @@
### Features:
### Bugfixes:

## 1.18.0 (January 17, 2025)
### Features:
#### UCP
* Enabled using CUDA staging buffers for pipeline protocols by default
* Added endpoint reconfiguration support for non-reused p2p scenarios
* Enabled non-cacheable memory domains, activated for gdr_copy
* Added user_data parameter to ucp_ep_query
* Added support for host memory pipeline through CUDA buffers for rendezvous protocol
* Added global VA infrastructure and memory region in absence of error handling
* Made protocol performance node names more informative
* Enforced always running on the same thread in single thread mode
* Made multiple improvements in protocol selection infrastructure
* Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
* Allowed up to 64 endpoint lanes for systems with many transports or devices
* Added usage tracker to worker
* Improved various logging messages
#### RDMA CORE (IB, ROCE, etc.)
* Added environment variable to manage DC initiator capacity
* Added DC dcs_hybrid policy
* Reduced MLX5/DV stack size consumption
* Added ODP support for verbs and mlx5dv
* Added support of CUDA managed memory on IB when ODP is available
* Added support of Adaptive Routing on RoCE
* Enabled use of implicit ODP with relaxed ordering
* Improved GPU-Direct detection in IB transport
* Increased DC initiator default count to 32 for performance optimization
* Added ConnectX-8 device support with DDP
* Added support for subnet filter list for RoCE interfaces
* Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
* Added IB MLX5 as a separate UCX module with separate RPM sub-package
* Added initial support for GGA transport, for fast DPU memory access
* Set IB DevX atomic mode based on device capabilities
* Removed DC keepalive mechanism, since the keepalive is done on UCP layer
* Optimized cross-gVMI memory registration using indirect memory keys cache
* Improved various logging messages
#### CUDA
* Added multi-node NVLink support
* Added CUDA Fabric memory support with detection and allocation
* Improved gdr_copy latency estimations on AMD Milan systems
* Added check for gdr_copy runtime/build version mismatch
* Added handling of missing IPC capability when unpacking keys
* Added caching for CUDA IPC memory pool import operation
* Added gdr_copy variables to optimize performance on Grace Hopper systems
* Improved CUDA IPC concurrency for a larger count of reachable peers
#### UCS
* Added support for wildcards in configuration parameter names
* Added ASAN protection to several internal data structures
* Reduced stack usage in topology detection code
* Improved bitmaps configuration parsing with wider bitfield
* Added options to set topology distance between devices
* Optimized VFS unix socket watch by using user private folder
* Added general IP subnet matching infrastructure
* Extended array data structure to support user-provided array copy routine
* Improved time units description
#### UCM
* Extended CUDA memory hooks to include memory mapping APIs
#### Tools
* Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
* Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
* Improved ucx_perftest uni-directional test with added fence
* Detailed ucx_perftest batch section of command-line documentation
#### Documentation
* Added a section regarding adaptive routing on RoCE
#### Architecture
* Added CPU Model for MI300A
* Added Fujitsu ARM specific values to ucx.conf
* Added AMD Turin support
* Added an optimized non-temporal memory copy implementation for AMD CPU
#### Build
* Improved compiler error reporting with added flag
* Improved coverity script to allow faster turnaround time
* Improved Intel Compiler detection and support
#### GO
* Added multi-send flag and user memh support in request params
#### Packaging
* Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments
### Bugfixes:
#### UCP
* Fixed stack overflow in exported rkey unpack
* Removed extra remote-cpu overhead from protocol estimation for zcopy
* Fixed performance estimation for rndv pipeline protocols
* Fixed ATP sending by picking the correct lane
* Fixed missing reg_id on memh creation
* Fixed repeated invalidations by retaining existing access flags
* Fixed abort reason propagation for rendezvous RTR mtype
* Do not check transport availability if it is disabled by UCX_TLS environment variable
* Fixed wrong flag being used for checking BCOPY capability
* Fixed sending too many ATPs for small messages
* Enforced 16-bit size for Active Message identifiers
* Fixed unnecessary status check for emulated AMO
* Fixed sending of more than one fragment in rendezvous pipeline
* Fixed crash by using biggest max frag across all lanes
* Fixed missing memory handle flags by copying from parent to child
* Fixed worker interface activate count
* Fixed flush requests by replacing ATP/flush lane map with lane indexes
* Fixed lost uct_flags when merging memory regions
#### UCT
* Fixed memory domain UCT flags description
#### RDMA CORE (IB, ROCE, etc.)
* Fixed FETCH_ADD remote access error for ODP/KSM case
* Fixed missing conditional compilation checks for DM
* Fixed IB MD allocation naming typo
* Fixed invalid GIDs filter in IB
* Fixed flags usage in MLX5 zcopy_post
* Do not limit ODP registration retries
* Fixed JUCX failures by considering the number of supported completion vectors
#### CUDA
* Fixed async memory handling using CUDA memory type on Grace
* Added rcache overhead in performance estimation
* Fixed gdr_copy performance regression by providing maximum estimation between get and put
* Fixed CUDA IPC reachability check
* Fixed crash in MPI_Finalize when CUDA context is destroyed
* Always require rcache by default for gdr_copy
* Fixed crash in gdr_copy cleanup when registration cache is disabled
* Fixed CUDA copy memory domain allocations
* Fixed multiple tests for gdr_copy transport
* Fixed race condition in CUDA IPC peer accessible cache
#### UCS
* Fixed a crash by using heap allocation to process expired timers in batch
* Fixed allocation issue on memtrack dump
* Fixed deletion of the monitored folder in VFS
* Fixed unsafe resize for DC initiator array
* Fixed function macro invocation to match C standard
* Fixed calling async handler on already released resource
* Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
* Fixed undeclared value error in timer conversion routine
* Fixed uninitialized value access in registration cache
#### UCM
* Fixed race condition in parsing proc maps
* Fixed mremap failure while parsing /proc/self/maps
#### ROCM
* Fixed ROCM interface reachability test
* Fixed memory domain fork test
#### TCP
* Always bind endpoint to interface
#### Tools
* Fixed potential buffer size overflow in ucx_perftest
* Fixed missing address when packing memory keys in ucx_perftest
* Fixed memory leak for endpoint report in ucx_info
* Fixed build without openmp in ucx_perftest
* Fixed UCT device override on server side in ucx_perftest
#### Build
* Fixed ASAN version used for running tests
#### Configuration
* Used POSIX Bourne shell syntax to check equality
* Fixed build failure by using proper flags in compiler.m4
* Fixed perftest MAD support default guessing
#### GO
* Added serialized thread mode to avoid subtle races between threads
* Fixed make distcheck

## 1.17.0 (June 13, 2024)
### Features:
#### UCP
