[DOC] Add temp_storage_bytes usage guide #6208
@@ -11,6 +11,47 @@ Device-Wide Primitives
../api/device

Determining Temporary Storage Requirements
++++++++++++++++++++++++++++++++++++++++++++++++++

**Two-Phase API** (Traditional)

Most CUB device-wide algorithms follow a two-phase usage pattern:

1. **Query Phase**: Call the algorithm with ``d_temp_storage = nullptr``. The call performs no work and writes the required temporary storage size to ``temp_storage_bytes``.
2. **Execution Phase**: Allocate at least ``temp_storage_bytes`` bytes of device-accessible storage and call the algorithm again to perform the actual operation.

**What arguments are needed during the query phase?**

* **Required**: Data types (via template parameters and iterator types) and problem size (``num_items``)
* **Can be null or uninitialized**: All input/output pointers (``d_in``, ``d_out``, etc.), provided they still have the correct types
* **Note**: The algorithm does not access input data during the query phase

Example pattern:

.. code-block:: c++

Review comment: we stopped using

   // Determine temporary storage requirements
   void* d_temp_storage = nullptr;
   size_t temp_storage_bytes = 0;

   // Input/output pointers can be null, but they must be typed: a bare nullptr
   // would not let the iterator types be deduced (int elements assumed here)
   cub::DeviceReduce::Sum(
     d_temp_storage, temp_storage_bytes,
     static_cast<const int*>(nullptr), static_cast<int*>(nullptr), num_items);

   // Allocate temporary storage
   cudaMalloc(&d_temp_storage, temp_storage_bytes);

   // Run the actual algorithm with real pointers
   cub::DeviceReduce::Sum(
     d_temp_storage, temp_storage_bytes,
     d_in, d_out, num_items);

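To make the query/execute contract concrete without a GPU, here is a minimal host-only sketch. ``two_phase_sum`` and ``demo`` are hypothetical names invented for illustration and are not part of CUB, and the scratch-size rule (a single accumulator slot) is likewise an assumption. The sketch only mirrors the shape of the pattern: a null ``temp_storage`` means "report the size and return", and the input/output pointers are never dereferenced in that call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical host-only stand-in for a two-phase algorithm (NOT a CUB API).
// Returns 0 on success and 1 if the supplied scratch buffer is too small.
template <typename T>
int two_phase_sum(void* temp_storage, std::size_t& temp_storage_bytes,
                  const T* in, T* out, std::size_t num_items)
{
  // Assumed scratch requirement for this sketch: one accumulator of type T.
  const std::size_t required = sizeof(T);

  if (temp_storage == nullptr) {
    // Query phase: report the size and return; in/out are never touched.
    temp_storage_bytes = required;
    return 0;
  }
  if (temp_storage_bytes < required) {
    return 1;
  }

  // Execution phase: accumulate in the scratch buffer, then write the result.
  T* acc = static_cast<T*>(temp_storage);
  *acc = T{};
  for (std::size_t i = 0; i < num_items; ++i) {
    *acc += in[i];
  }
  *out = *acc;
  return 0;
}

// Drives both phases and returns the computed sum (or -1 on allocation failure).
inline int demo()
{
  const int in[] = {1, 2, 3, 4};
  int out = 0;

  // Query phase: typed null pointers carry the element type; no data is read.
  void* temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;
  int status = two_phase_sum(temp_storage, temp_storage_bytes,
                             static_cast<const int*>(nullptr),
                             static_cast<int*>(nullptr), std::size_t{4});
  assert(status == 0);

  // Allocate exactly what the query reported.
  temp_storage = std::malloc(temp_storage_bytes);
  if (temp_storage == nullptr) {
    return -1;
  }

  // Execution phase: same call, now with real storage and real pointers.
  status = two_phase_sum(temp_storage, temp_storage_bytes, in, &out, std::size_t{4});
  assert(status == 0);

  std::free(temp_storage);
  return out;
}
```

The null pointers in the query call are cast to ``const int*`` and ``int*`` because the element type is still deduced from the arguments, which matches the "data types are required" point in the list above.
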
**Single-Phase API** (Environment-Based)

Some algorithms provide environment-based overloads that eliminate the two-phase call pattern.
These APIs accept an execution environment parameter. See the individual algorithm documentation for availability.

Review comment: There is more to be said, but those features are not complete yet, so we can avoid them for now.

CUB device-level single-problem parallel algorithms:

* :cpp:struct:`cub::DeviceAdjacentDifference` computes the difference between adjacent elements residing within device-accessible memory

Review comment: It would be nice to also mention the single-phase API.