Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 41 additions & 0 deletions docs/cub/api_docs/device_wide.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,47 @@ Device-Wide Primitives
../api/device


Determining Temporary Storage Requirements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to also mention the single-phase API

++++++++++++++++++++++++++++++++++++++++++++++++++

**Two-Phase API** (Traditional)

Most CUB device-wide algorithms follow a two-phase usage pattern:

1. **Query Phase**: Call the algorithm with ``d_temp_storage = nullptr`` to determine required temporary storage size
2. **Execution Phase**: Allocate storage and call the algorithm again to perform the actual operation

**What arguments are needed during the query phase?**

* **Required**: Data types (via template parameters and iterator types) and problem size (``num_items``)
* **Can be nullptr/uninitialized**: All input/output pointers (``d_in``, ``d_out``, etc.)
* **Note**: The algorithm does not access input data during the query phase

Example pattern:

.. code-block:: c++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we stopped using code-block and rather use literalincludes now linked to standalone .cu source examples.


// Determine temporary storage requirements
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;

cub::DeviceReduce::Sum(
d_temp_storage, temp_storage_bytes,
nullptr, nullptr, num_items); // Input/output pointers can be null

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run the actual algorithm with real pointers
cub::DeviceReduce::Sum(
d_temp_storage, temp_storage_bytes,
d_in, d_out, num_items);

**Single-Phase API** (Environment-Based)

Some algorithms provide environment-based overloads that eliminate the two-phase call pattern.
These APIs accept an execution environment parameter. See the individual algorithm documentation for availability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some algorithms: we are in the process of adding these APIs for all algorithms. Let's not make it sound like we selectively picked some algorithms to support this feature.

eliminate the two-phase call pattern: they eliminate the explicit allocation requirements which relives us from a) two phase call b) the two legacy storage arguments in the beginning of the parameter list of every device premitive.

These APIs accept an execution environment parameter.: I feel like there is much more to be said here. There is a blogpost coming out soon that can be used as inspo. Trying to sum it up briefly:

  1. the execution environment arg is at the end and is defaulted which means users do not have to pass something explicitly to it.

  2. the execution environment arg can be used to select on which memory pool the primitive will be executed:

    i) there are existing memory resources provided by CCCL or

    ii) the user can create their own custom memory resource

  3. in some algorithms the argument can be used to specify the deterministic requirements (expand on det reqs) of the algorithm

  4. the user can still pass the stream as before in the templated execution environment arg

  5. all of the above can be passed simultaneously in a centralized control argument manner (show how).

There are more to be said but are not feature complete so we can avoid them for now.


CUB device-level single-problem parallel algorithms:

* :cpp:struct:`cub::DeviceAdjacentDifference` computes the difference between adjacent elements residing within device-accessible memory
Expand Down