# Changelog

## Thrust 2.1.0

### New Features

- NVIDIA/thrust#1805: Add default constructors to `transform_output_iterator`
  and `transform_input_output_iterator`. Thanks to Mark Harris (@harrism) for this
  contribution.
- NVIDIA/thrust#1836: Enable construction of vectors from `std::initializer_list`.
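The new `std::initializer_list` construction can be illustrated with a short sketch. This is a hypothetical usage example, not part of the release notes; it assumes a build against Thrust 2.1.0 or newer with the CUDA toolkit available.

```cpp
// Hypothetical illustration of NVIDIA/thrust#1836: brace-initialized vectors.
// Assumes Thrust >= 2.1.0; compile with nvcc.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

int main() {
  // Vectors can now be constructed directly from an initializer list,
  // instead of filling them element by element or copying from a std::vector.
  thrust::host_vector<int>   h{1, 2, 3, 4};
  thrust::device_vector<int> d{5, 6, 7, 8};
  return 0;
}
```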
### Bug Fixes

- NVIDIA/thrust#1768: Fix type conversion warning in the `thrust::complex`
  utilities. Thanks to Zishi Wu (@zishiwu123) for this contribution.
- NVIDIA/thrust#1809: Fix some warnings about usage of `__host__` functions in
  `__device__` code.
- NVIDIA/thrust#1825: Fix Thrust's CMake install rules. Thanks to Robert Maynard
  (@robertmaynard) for this contribution.
- NVIDIA/thrust#1827: Fix `thrust::reduce_by_key` when using
  non-default-initializable iterators.
- NVIDIA/thrust#1832: Fix bug in device-side CDP `thrust::reduce` when using a
  large number of inputs.

### Other Enhancements

- NVIDIA/thrust#1815: Update Thrust's libcu++ git submodule to version 1.8.1.
- NVIDIA/thrust#1841: Fix invalid code in execution policy documentation example.
  Thanks to Raphaël Frantz (@Eren121) for this contribution.
- NVIDIA/thrust#1848: Improve error messages when attempting to launch a kernel
  on a device that is not supported by compiled PTX versions. Thanks to Zahra
  Khatami (@zkhatami) for this contribution.
- NVIDIA/thrust#1855: Remove usage of deprecated CUDA error codes.

## Thrust 2.0.1

### Other Enhancements

- Disable CDP parallelization of device-side invocations of Thrust algorithms on
  SM90+. The removal of device-side synchronization support in recent
  architectures makes Thrust's fork-join model unimplementable on device, so a
  serial implementation will be used instead. Host-side invocations of Thrust
  algorithms are not affected.
## Thrust 2.0.0

### Summary
- `THRUST_INCLUDE_HOST_CODE`: Replace with `NV_IF_TARGET`.
- `THRUST_INCLUDE_DEVICE_CODE`: Replace with `NV_IF_TARGET`.
- `THRUST_DEVICE_CODE`: Replace with `NV_IF_TARGET`.
- NVIDIA/thrust#1661: Thrust's CUDA Runtime support macros have been updated to
  support `NV_IF_TARGET`. They are now defined consistently across all
  host/device compilation passes. This should not affect most usages of these
  macros, but may require changes for some edge cases.
- CMake builds that use the Thrust packages via CPM, `add_subdirectory`,
  or `find_package` are not affected.
- NVIDIA/thrust#1760: A compile-time error is now emitted when a `__device__`-only
  lambda's return type is queried from host code (requires libcu++ ≥ 1.9.0).
  - Due to limitations in the CUDA programming model, the result of this query
    is unreliable, and will silently return an incorrect result.
- NVIDIA/thrust#1722: Remove CUDA-specific error handler from code that may be
  executed on non-CUDA backends. Thanks to @dkolsen-pgi for this contribution.
- NVIDIA/thrust#1756: Fix `copy_if` for output iterators that don't support copy
  assignment. Thanks to @mfbalin for this contribution.

### Other Enhancements
#### New `thrust::cuda::par_nosync` Execution Policy

Most of Thrust's parallel algorithms are fully synchronous and will block the
calling CPU thread until all work is completed. This design avoids many pitfalls
associated with asynchronous GPU programming, resulting in simpler and less
error-prone usage for new CUDA developers.
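The policy's effect can be sketched with a hypothetical example. This is not from the release notes; it assumes a CUDA-capable build against a Thrust version that provides `thrust::cuda::par_nosync`.

```cpp
// Hypothetical sketch of thrust::cuda::par_nosync; compile with nvcc.
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
#include <cuda_runtime.h>

int main() {
  thrust::device_vector<float> x(1 << 20, 1.0f);
  thrust::device_vector<float> y(1 << 20, 2.0f);
  thrust::device_vector<float> z(1 << 20);

  // With the default policy this call would block until the kernel finishes.
  // par_nosync allows the call to return as soon as the work is launched.
  thrust::transform(thrust::cuda::par_nosync,
                    x.begin(), x.end(), y.begin(), z.begin(),
                    thrust::plus<float>{});

  // The caller is now responsible for synchronizing before using the results.
  cudaDeviceSynchronize();
  return 0;
}
```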
### Enhancements

- NVIDIA/thrust#1511: Use CUB's new `DeviceMergeSort` API and remove Thrust's
  internal implementation.
- NVIDIA/thrust#1566: Improved performance of `thrust::shuffle`. Thanks to
  @djns99 for this contribution.
- NVIDIA/thrust#1584: Support user-defined `CMAKE_INSTALL_INCLUDEDIR` values in
  Thrust's CMake install rules. Thanks to @robertmaynard for this contribution.

### Bug Fixes
- NVIDIA/thrust#1597: Fix some collisions with the `small` macro defined
  in `windows.h`.
- NVIDIA/thrust#1599, NVIDIA/thrust#1603: Fix some issues with version handling
  in Thrust's CMake packages.
- NVIDIA/thrust#1614: Clarify that scan algorithm results are non-deterministic
  for pseudo-associative operators (e.g. floating-point addition).
  This was necessary to enable usage of Thrust caching MR allocators with
  synchronous Thrust algorithms.
  This change has allowed NVC++'s C++17 Parallel Algorithms implementation to
  switch to use Thrust caching MR allocators for device temporary storage,
  which gives a 2x speedup on large multi-GPU systems such as V100 and A100
  DGX where `cudaMalloc` is very slow.