Add CUDA streams to cudf-polars #20291
Conversation
I haven't finished flagging the non-trivial changes, but I have to run. I'll finish that up later.
-    @functools.cached_property
-    def obj_scalar(self) -> plc.Scalar:
+    def obj_scalar(self, stream: Stream) -> plc.Scalar:
Previously, `Column.obj_scalar` used `@functools.cached_property`. Now we need it to be a method so that we can pass in a stream for the `plc.copying.get_element` call. To retain the caching behavior, we store the result on the instance at `self._obj_scalar`. Similar story for `Column.nan_count`.
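As a sketch, the change looks roughly like this (simplified; it assumes the `Column` constructor initializes `self._obj_scalar = None`, which is an assumption of this sketch rather than a quote from the PR):

def obj_scalar(self, stream: Stream) -> plc.Scalar:
    # A manual cache slot replaces @functools.cached_property so that
    # a stream can be threaded through to get_element.
    if self._obj_scalar is None:
        self._obj_scalar = plc.copying.get_element(self.obj, 0, stream=stream)
    return self._obj_scalar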
         # preprocessed into pylibcudf requests.
         child = self.children[0]
-        return self.op(child.evaluate(df, context=context))
+        return self.op(child.evaluate(df, context=context), stream=df.stream)
Note: all our reduction functions (`self.op`) now accept a `stream`. This is where we pass it in. The column is valid on `df.stream` since it's from `child.evaluate(df)`.
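As a sketch, one of these reduction ops now looks something like the following (a hypothetical op, assuming the stream-accepting pylibcudf signatures this PR relies on):

import pylibcudf as plc
from rmm.pylibrmm.stream import Stream

def _max(column: plc.Column, *, stream: Stream) -> plc.Scalar:
    # The caller guarantees `column` is valid on `stream` (df.stream
    # here), so the reduction is enqueued on the same stream with no
    # extra synchronization.
    return plc.reduce.reduce(
        column, plc.aggregation.max(), column.type(), stream=stream
    )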
-            py_val=False, dtype=self.dtype.plc_type
+            py_val=False, dtype=self.dtype.plc_type, stream=df.stream
         ),
+        stream=df.stream,
Note: here's one of the spots where we call `self._distinct`, which now accepts a `stream`. Note how `column`, `source_value`, and `target_value` are all valid on `df.stream`.
+        # The scalars in self.preceding and self.following are constructed on the
+        # stream dedicated to building offset/period scalars. We need to join
+        # our stream into its stream.
+        stream = get_joined_cuda_stream(
+            upstreams=(df.stream, get_stream_for_offset_windows())
+        )
This `get_stream_for_offset_windows` counts as non-trivial.

All the way down in `cudf_polars.utils.windows.duration_to_scalar`, we make some `plc.Scalar`s from Python scalars:

return plc.Scalar.from_py(
    value, plc.DataType(plc.TypeId.DURATION_NANOSECONDS), stream=stream
)

The callers of this function (`RollingWindow.__init__` -> `offsets_to_windows`) don't have a stream at hand; we aren't in the context of an `IR.do_evaluate` node, for example. We really don't want to use the default stream, because IIUC any usage of the default stream will force unnecessary synchronizations across the device.

However, we still need a stable, unique stream so that the users of the output (this block of code) have something to synchronize with.

My solution: a singleton stream dedicated to getting the offset windows. This provides a way for `duration_to_scalar` and the ultimate user of its output to synchronize without being directly tied to each other in a call stack.

I'm very open to other suggestions.

Similar story for the other `get_stream_for_*` functions.
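A minimal sketch of the singleton pattern (assuming rmm's Python `Stream`, whose no-argument constructor creates a new non-default stream):

import functools

from rmm.pylibrmm.stream import Stream

@functools.cache
def get_stream_for_offset_windows() -> Stream:
    # One process-wide stream reserved for building offset/period
    # scalars; producers and consumers synchronize through it without
    # sharing a call stack.
    return Stream()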
We could refactor so that `offsets_to_windows` is only called in `do_evaluate`. That's probably saner anyway.
I looked at both calling `offsets_to_windows` only in `do_evaluate` and just creating a stream and synchronizing.

`offsets_to_windows` is also called via `rewrite_rolling` via `_translate_ir` for `pl_ir.GroupBy`. Threading a stream all the way through there looks challenging.

So I'll plan to create a stream in `offsets_to_windows` and synchronize it after creating the scalars on that stream.
These stream singletons have all been removed in:
- af6b7e0 - Remove `get_stream_for_conditional_join_predicate`
- 61ad556 - Remove `get_stream_for_stats`
- 84d4fda - Remove stream singleton for `duration_to_scalar`

Instead of a stream singleton and joining streams, we just create a temporary stream and synchronize it before the function returns.
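A sketch of the replacement pattern in `duration_to_scalar` (simplified; assumes rmm's Python `Stream`):

import pylibcudf as plc
from rmm.pylibrmm.stream import Stream

def duration_to_scalar(value: int) -> plc.Scalar:
    # Build the scalar on a fresh, private stream...
    stream = Stream()
    scalar = plc.Scalar.from_py(
        value, plc.DataType(plc.TypeId.DURATION_NANOSECONDS), stream=stream
    )
    # ...and synchronize before returning, so callers can use the
    # result on any stream without knowing about this one.
    stream.synchronize()
    return scalar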
     ) -> DataFrame:
         """Evaluate and return a dataframe."""
         keys = broadcast(*(k.evaluate(df) for k in keys_in), target_length=df.num_rows)
+        # The scalars in preceding and following are constructed on the
See https://github.com/rapidsai/cudf/pull/20291/files#r2437563679 for context.
Let's similarly defer the construction of these scalars to `do_evaluate` if we can.
Or, as noted for the `ConditionalJoin` case, we can construct in `__init__` and sync.
     def __init__(self, predicate: expr.Expr):
         self.predicate = predicate
-        ast_result = to_ast(predicate)
+        ast_result = to_ast(
See https://github.com/rapidsai/cudf/pull/20291/files#r2437563679 for context, but this one is for `to_ast`. In this `Predicate.__init__` we again don't have a stream at hand that we can use.
For things we do in `__init__`, I think we're better off launching on some stream and then syncing that stream.
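A sketch of that approach for `Predicate.__init__` (simplified; `self.ast_predicate` is a hypothetical attribute name, and the `stream` keyword on `to_ast` is the one this PR adds):

def __init__(self, predicate: expr.Expr):
    self.predicate = predicate
    # Launch the AST/scalar construction on a private stream...
    stream = Stream()
    ast_result = to_ast(predicate, stream=stream)
    # ...then sync before __init__ returns, so the result is valid
    # everywhere without threading a stream through translation.
    stream.synchronize()
    self.ast_predicate = ast_result  # hypothetical attribute name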
         by = options["by"]
+        stream = get_joined_cuda_stream(upstreams=(df.stream, sort_boundaries.stream))
Note: `df` and `sort_boundaries` are potentially on different streams, so we join them here.
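For intuition, "joining" can be sketched with CUDA events (illustrative only, written against CuPy's stream/event API rather than the helper in this PR):

import cupy as cp

def get_joined_cuda_stream(upstreams) -> cp.cuda.Stream:
    joined = cp.cuda.Stream(non_blocking=True)
    for upstream in upstreams:
        event = cp.cuda.Event()
        event.record(upstream)    # capture the upstream's current frontier
        joined.wait_event(event)  # order joined work after that frontier
    return joined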
-            stream=DEFAULT_STREAM,
+            stream=stream,
         )
+        # TODO: figure out handoff with rapidsmpf
A bit more context here: we need to look into the pylibcudf methods called by rapidsmpf and ensure that they all accept a `stream` argument, and that we have a way to pass that through.
rapidsmpf's `BufferResource` has a `stream_pool` that would ideally be used (it wraps `rmm::cuda_stream_pool`). It might take a bit of plumbing to make that available everywhere that needs a new stream.
This adds CUDA streams to `cudf_polars.dsl.expressions.aggregation`. Streams are still missing from some `cudf_polars.containers.Column` calls in this file, but all the directly pylibcudf calls should be covered. Split off rapidsai#20291.
The remaining non-trivial changes.
Description
This adds CUDA streams to all pylibcudf calls in cudf-polars.
At the moment, we continue to use the default stream for all operations; the change is that we now pass it explicitly. A future PR will update things to use non-default streams.
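In other words, a call that previously relied on the default stream implicitly now names it (a sketch; `tables` stands in for a list of `plc.Table`s from earlier work, and the `stream=` keyword assumes the pylibcudf signatures this PR targets):

import pylibcudf as plc
from rmm.pylibrmm.stream import DEFAULT_STREAM

# Before: plc.concatenate.concatenate(tables)
# After: same work on the same stream, but stated explicitly so a
# later PR can substitute a non-default stream.
result = plc.concatenate.concatenate(tables, stream=DEFAULT_STREAM)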
As far as I can tell, this should get all the pylibcudf calls in cudf-polars. It's a lot of code to review. Unfortunately, it mixes many trivial changes (adding `stream=stream` in a bunch of spots) with a handful of non-trivial changes. I'll comment inline on all the non-trivial changes. I'm more than happy to break those changes out into their own PR (but it gets complicated: the changes to `Column.nan_count`, for example, force the change to `broadcast` and `aggregation.py`...).

Closes #20239

Part of #20228