[BUG]: CUDA stream scheduler in cudax execution library is broken #5564

@ericniebler

Description

Type of Bug

Runtime Error

Component

CUDA Experimental (cudax)

Describe the bug

due to design flaws in the sender algorithm customization scheme, the CUDA stream scheduler in cudax does not always orchestrate the transitions between the CPU and GPU correctly, leading to hangs or crashes. its tests have been disabled for some time while i come to grips with the issue. a fix to std::execution is needed quickly for C++26.

here are my current design thoughts/directions:

  • now that ensure_started and split have been removed, there isn't much argument anymore for early customization.
  • we have get_scheduler and get_completion_scheduler<SetTag>, and we have get_domain but no get_completion_domain<SetTag>. i think this is an oversight.
  • get_completion_[scheduler|domain]<SetTag> needs the receiver's environment in order to properly answer the query. just() can only know where it will complete when it knows where it is started.
  • although not strictly necessary, it would be helpful to adopt P3206, "A sender query for completion behaviour". if a sender is known to complete inline, then its completion scheduler/domain is the scheduler/domain on which it is started.
  • every sender has two domains: the one it is started on, and the one it completes on, and they could be different. in connect and get_completion_signatures we can know both. the question is how to use them.
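to make the third and fourth bullets concrete, here is a minimal compile-time sketch of a completion-domain query that takes the receiver's environment. it is a toy model, not the std::execution or cudax API: `cpu_domain`, `gpu_domain`, `receiver_env`, `just_like_sender`, `continues_on_like_sender`, and `completion_domain_t` are all hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <string_view>
#include <type_traits>

// Hypothetical domain tags (not real std::execution or cudax types).
struct cpu_domain { static constexpr std::string_view name = "cpu"; };
struct gpu_domain { static constexpr std::string_view name = "gpu"; };

// Toy stand-in for the receiver's environment: it records the domain
// on which the sender will be started.
template <class StartDomain>
struct receiver_env { using start_domain = StartDomain; };

// A just()-like sender completes inline, so its completion domain is
// whatever domain it is started on. It cannot answer without the env.
struct just_like_sender {
  template <class Env>
  using completion_domain = typename Env::start_domain;
};

// A continues_on-like sender completes on its target domain no matter
// where it is started.
template <class TargetDomain>
struct continues_on_like_sender {
  template <class Env>
  using completion_domain = TargetDomain;
};

// Hypothetical env-aware query, standing in for the missing
// get_completion_domain<SetTag>.
template <class Sender, class Env>
using completion_domain_t = typename Sender::template completion_domain<Env>;

template <class Sender, class Env>
constexpr std::string_view completes_on() {
  return completion_domain_t<Sender, Env>::name;
}

// The same just()-like sender completes on different domains depending
// on where it is started.
static_assert(std::is_same_v<
    completion_domain_t<just_like_sender, receiver_env<gpu_domain>>, gpu_domain>);
static_assert(std::is_same_v<
    completion_domain_t<just_like_sender, receiver_env<cpu_domain>>, cpu_domain>);
```

in this sketch the start domain and the completion domain are both visible once a sender type is paired with an environment, which is the situation described for connect and get_completion_signatures.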

the last bullet is the most interesting. i can imagine a scheduler wanting to do something special, say, whenever a foo sender is started, and i can imagine another scheduler wanting to do something special when a foo sender completes. this suggests to me that transform_sender might need to apply two transforms to a sender in connect, one for each domain (if they are different). Q: does it matter in which order they are applied?
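a rough sketch of the ordering question, under heavy assumptions: a sender is modeled as a description string, each domain has a single transform that wraps the sender, and `domain`, `transform_order`, and `transform_for_connect` are hypothetical names, not the real transform_sender machinery.

```cpp
#include <cassert>
#include <string>

// Toy model: a "sender" is only a description string, and a domain's
// transform wraps that description so the application order is visible.
struct domain {
  std::string name;
  std::string transform(const std::string& sndr) const {
    return name + "(" + sndr + ")";
  }
};

enum class transform_order { start_then_complete, complete_then_start };

// Hypothetical stand-in for what connect might do: apply the start
// domain's transform and the completion domain's transform, but only
// one of them when both domains are the same.
inline std::string transform_for_connect(const std::string& sndr,
                                         const domain& start_dom,
                                         const domain& complete_dom,
                                         transform_order order) {
  if (start_dom.name == complete_dom.name) {
    return start_dom.transform(sndr);
  }
  if (order == transform_order::start_then_complete) {
    return complete_dom.transform(start_dom.transform(sndr));
  }
  return start_dom.transform(complete_dom.transform(sndr));
}
```

with wrapping transforms like these, the two orders give visibly different results (`gpu(cpu(foo))` versus `cpu(gpu(foo))` for a sender started on a cpu domain and completing on a gpu domain). whether real domain transforms commute is exactly the open question; the sketch only shows that the order is observable when they do not.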

so a given domain might want to provide two different transforms for each sender: the "start" transform and the "complete" transform. if we had such a thing, we no longer need schedule_from to be a different algorithm from continues_on. domain A can provide a "start" continues_on transform for transfers off of context A, and domain B can provide a "complete" continues_on transform for transfers onto context B.
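the split into "start" and "complete" transforms could look roughly like this. again a toy model: `domain`, `start_continues_on`, `complete_continues_on`, and `lower_continues_on` are hypothetical names, and applying the start half before the complete half is an assumption of the sketch, not a settled design.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model: each domain describes its half of a transfer as a string.
struct domain {
  std::string name;

  // "start" transform for continues_on: how to transfer off of this context.
  std::string start_continues_on() const { return "leave " + name; }

  // "complete" transform for continues_on: how to transfer onto this context.
  std::string complete_continues_on() const { return "enter " + name; }
};

// Hypothetical lowering of a single continues_on from `from` to `to`.
// The source domain contributes its "start" transform and the destination
// domain contributes its "complete" transform, so no separate
// schedule_from-like algorithm appears in this model.
inline std::vector<std::string> lower_continues_on(const domain& from,
                                                   const domain& to) {
  return {from.start_continues_on(), to.complete_continues_on()};
}
```

for a transfer from domain A to domain B, `lower_continues_on(domain{"A"}, domain{"B"})` yields the steps `leave A` and `enter B`, with each context customizing only the half it owns.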

my intention is to implement this design in cudax and then update P3718 for (fingers crossed) inclusion in C++26.

How to Reproduce

n/a

Expected behavior

n/a

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

Metadata

Assignees

Labels

bug: Something isn't working right.

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests