[BUG]: CUDA stream scheduler in cudax execution library is broken #5564

### Is this a duplicate?

- [x] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cccl/issues) for this bug and that I agree to the [Code of Conduct](CODE_OF_CONDUCT.md)

### Type of Bug

Runtime Error

### Component

CUDA Experimental (cudax)

### Describe the bug

due to design flaws in the sender algorithm customization scheme, the CUDA stream scheduler in cudax does not always orchestrate the transitions between the CPU and GPU correctly, which leads to hangs or crashes. its tests have been disabled for some time while i come to grips with the issue. `std::execution` needs a fix quickly, in time for C++26.

here are my current design thoughts/directions:

* now that `ensure_started` and `split` have been removed, there isn't much argument anymore for early customization.
* we have `get_scheduler` and `get_completion_scheduler<SetTag>`, and we have `get_domain` but no `get_completion_domain<SetTag>`. i think this is an oversight.
* `get_completion_[scheduler|domain]<SetTag>` needs the receiver's environment in order to properly answer the query. `just()` can only know where it will complete when it knows where it is started.
* although not strictly necessary, it would be helpful to adopt [P3206](https://wg21.link/P3206), "A sender query for completion behaviour". if a sender is known to complete inline, then its completion scheduler/domain is the scheduler/domain on which it is started.
* every sender has _two_ domains: the one it is started on, and the one it completes on, and they could be different. in `connect` and `get_completion_signatures` we can know both. the question is how to use them.
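
to make the third and fourth bullets concrete, here is a compile-time toy model of an environment-aware completion-domain query. every name in it (`cpu_domain`, `gpu_domain`, `env`, `inline_sender`, `fixed_sender`, `completion_domain_of_t`) is hypothetical and exists only for this sketch; none of it is the cudax or `std::execution` API.

```cpp
#include <type_traits>

// hypothetical domain tags for illustration; not the cudax or std::execution types
struct cpu_domain {};
struct gpu_domain {};

// toy receiver environment: records the domain the operation is started on
template <class StartDomain>
struct env {
  using start_domain = StartDomain;
};

// toy sender that completes inline, in the spirit of `just()`
struct inline_sender {
  static constexpr bool completes_inline = true;
};

// toy sender that always completes on one fixed domain
template <class Domain>
struct fixed_sender {
  static constexpr bool completes_inline = false;
  using completion_domain = Domain;
};

// hypothetical query: an inline sender completes wherever it was started,
// so the answer has to come from the receiver's environment
template <class Sender, class Env, bool Inline = Sender::completes_inline>
struct completion_domain_of {
  using type = typename Env::start_domain;
};

// a non-inline sender reports its own fixed completion domain
template <class Sender, class Env>
struct completion_domain_of<Sender, Env, false> {
  using type = typename Sender::completion_domain;
};

template <class Sender, class Env>
using completion_domain_of_t = typename completion_domain_of<Sender, Env>::type;

// the same inline sender gets a different answer depending on where it starts
static_assert(std::is_same_v<completion_domain_of_t<inline_sender, env<gpu_domain>>, gpu_domain>);
static_assert(std::is_same_v<completion_domain_of_t<inline_sender, env<cpu_domain>>, cpu_domain>);
// a fixed sender ignores the start domain
static_assert(std::is_same_v<completion_domain_of_t<fixed_sender<gpu_domain>, env<cpu_domain>>, gpu_domain>);
```

the point of the sketch is only that the query takes two inputs, the sender and the environment; how the real query would be spelled is not decided here.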

the last bullet is the most interesting. i can imagine a scheduler wanting to do something special, say, whenever a `foo` sender is started, and i can imagine another scheduler wanting to do something special when a `foo` sender completes. this suggests to me that `transform_sender` might need to apply _two_ transforms to a sender in `connect`, one for each domain (if they are different). Q: does it matter in which order they are applied?
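
a rough compile-time sketch of that double application, under loud assumptions: `domain_a`, `domain_b`, `adapted_by_a`, `adapted_by_b`, and `transform_for_connect` are all made-up names, and the start-domain-first order is an arbitrary choice for the sketch, not an answer to the question above.

```cpp
#include <type_traits>

// toy sender standing in for `foo`
struct foo_sender {};

// what each hypothetical domain turns a sender into
template <class Sender> struct adapted_by_a { Sender inner; };
template <class Sender> struct adapted_by_b { Sender inner; };

// hypothetical domains, each with a single transform
struct domain_a {
  template <class Sender>
  static constexpr adapted_by_a<Sender> transform(Sender sndr) { return {sndr}; }
};
struct domain_b {
  template <class Sender>
  static constexpr adapted_by_b<Sender> transform(Sender sndr) { return {sndr}; }
};

// stand-in for what `connect` might do: transform once for the domain the
// sender starts on, then once more for the completion domain if it differs.
// assumption: start domain first; the opposite order would nest the wrappers
// the other way around.
template <class StartDomain, class CompleteDomain, class Sender>
constexpr auto transform_for_connect(Sender sndr) {
  auto after_start = StartDomain::transform(sndr);
  if constexpr (std::is_same_v<StartDomain, CompleteDomain>) {
    return after_start;  // same domain on both sides: one transform only
  } else {
    return CompleteDomain::transform(after_start);
  }
}

// started on A, completes on B: both domains get a say
static_assert(std::is_same_v<
    decltype(transform_for_connect<domain_a, domain_b>(foo_sender{})),
    adapted_by_b<adapted_by_a<foo_sender>>>);

// started and completed on A: transformed once
static_assert(std::is_same_v<
    decltype(transform_for_connect<domain_a, domain_a>(foo_sender{})),
    adapted_by_a<foo_sender>>);
```

in this toy the two orders produce different types (`adapted_by_b<adapted_by_a<...>>` versus `adapted_by_a<adapted_by_b<...>>`), which is one way the ordering question could become observable.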

so a given domain might want to provide two different transforms for each sender: the "start" transform and the "complete" transform. if we had such a thing, we would no longer need `schedule_from` to be a different algorithm from `continues_on`. domain `A` can provide a "start" `continues_on` transform for transfers _off of_ context `A`, and domain `B` can provide a "complete" `continues_on` transform for transfers _onto_ context `B`.
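
a toy model of the start/complete split for `continues_on`, again only a sketch: `continues_on_sender` here is a made-up four-parameter template (not the real sender type), and `generic_step`, `a_departure`, `b_arrival`, `start_transform`, `complete_transform`, and `transform_for_connect` are all hypothetical names.

```cpp
#include <type_traits>

struct generic_step {};  // default, domain-agnostic behaviour
struct a_departure {};   // hypothetical: how domain A hands work off
struct b_arrival {};     // hypothetical: how domain B picks work up

// toy model of a `continues_on` sender moving work from From to To;
// Depart and Arrive record which domain customized each half
template <class From, class To, class Depart = generic_step, class Arrive = generic_step>
struct continues_on_sender {};

// by default both transforms leave the sender untouched
struct default_domain {
  template <class Sender>
  static constexpr Sender start_transform(Sender sndr) { return sndr; }
  template <class Sender>
  static constexpr Sender complete_transform(Sender sndr) { return sndr; }
};

// domain A customizes only transfers *off of* A (the "start" side)
struct domain_a : default_domain {
  using default_domain::start_transform;
  template <class To, class Depart, class Arrive>
  static constexpr auto start_transform(continues_on_sender<domain_a, To, Depart, Arrive>) {
    return continues_on_sender<domain_a, To, a_departure, Arrive>{};
  }
};

// domain B customizes only transfers *onto* B (the "complete" side)
struct domain_b : default_domain {
  using default_domain::complete_transform;
  template <class From, class Depart, class Arrive>
  static constexpr auto complete_transform(continues_on_sender<From, domain_b, Depart, Arrive>) {
    return continues_on_sender<From, domain_b, Depart, b_arrival>{};
  }
};

// stand-in for `connect`: the start domain's "start" transform, then the
// completion domain's "complete" transform
template <class StartDomain, class CompleteDomain, class Sender>
constexpr auto transform_for_connect(Sender sndr) {
  return CompleteDomain::complete_transform(StartDomain::start_transform(sndr));
}

// A -> B: each domain customizes its own half of the one algorithm
static_assert(std::is_same_v<
    decltype(transform_for_connect<domain_a, domain_b>(continues_on_sender<domain_a, domain_b>{})),
    continues_on_sender<domain_a, domain_b, a_departure, b_arrival>>);
```

because each transform rewrites a different half of the sender in this model, no second algorithm is needed to tell the two customizations apart.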

my intention is to implement this design in cudax and then update [P3718](https://wg21.link/P3718) for (fingers crossed) inclusion in C++26.

### How to Reproduce

n/a

### Expected behavior

n/a

### Reproduction link

_No response_

### Operating System

_No response_

### nvidia-smi output

_No response_

### NVCC version

_No response_
