Attempt to fix a flaky coroutine-dump-verifying test #4589

dkhalanskyjb · 2025-12-11T13:05:45Z

Fixes #4418
(unless it keeps happening)

This problem couldn't be reproduced locally, so this fix is purely analytical.

The problematic test launches a coroutine and then waits until that coroutine suspends.
Before this change, it did so as follows:

- Hold a monitor and `wait` on the test body side;
- Acquire a monitor and `notify` on the coroutine side *right before* the suspension point;
- On the test body side, wait for the coroutine thread to enter the `TIMED_WAITING` state, indicating that its scheduler worker has finished its piece of work and now waits for new commands, which must mean the suspension point was reached.
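The steps above can be sketched roughly as follows. This is not the test's actual code: plain JDK threads stand in for the coroutine and its scheduler worker, `Thread.sleep` stands in for the worker parking with a timeout, and every name here (`oldStyleHandshake`, `hypothetical-worker`) is made up for illustration. `Thread.State.TIMED_WAITING` is the JDK name for the timed-wait state.

```kotlin
import kotlin.concurrent.thread

// Hypothetical stand-in for the old handshake; not the code from the test.
fun oldStyleHandshake(): Thread.State {
    val monitor = Object()
    var notified = false // only accessed while holding `monitor`

    val worker = thread(name = "hypothetical-worker") {
        synchronized(monitor) {
            notified = true
            monitor.notifyAll() // signalled right before the "suspension point"
        }
        // Stand-in for the worker parking with a timeout once the coroutine has suspended.
        try {
            Thread.sleep(60_000)
        } catch (e: InterruptedException) {
            // Interrupted by the caller so that the sketch terminates promptly.
        }
    }

    // Test body side: wait on the monitor until the worker has notified.
    synchronized(monitor) {
        while (!notified) monitor.wait()
    }

    // Test body side: poll the thread state. This read only observes the state;
    // it is not a synchronization action on anything the worker wrote after the notify.
    var state = worker.state
    while (state != Thread.State.TIMED_WAITING) {
        Thread.yield()
        state = worker.state
    }

    worker.interrupt()
    worker.join()
    return state
}
```

In this sketch the monitor only orders everything up to the `notifyAll` call; whatever the worker does between the notify and parking is covered by the state poll alone.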

The problem is that thread states are not synchronization primitives, and no happens-before is established between the code a thread executes before the state change and the code right after the state change is observed.

With this change, we establish a complete happens-before chain:

- The test body wakes up after it's `resume`d as a coroutine.
- `complete` on a latch happens-before the `resume`.
- The suspension happens-before the `complete`, as suspension and the `complete` are done in the same thread.
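The shape of this chain can be illustrated with plain JDK primitives. This is only a sketch under assumptions: `java.util.concurrent.CountDownLatch` stands in for the latch used in the PR (whose `complete` resumes a coroutine, which is not modelled here), a write to a plain field stands in for the suspension, and `SuspensionProbe` and `latchHandshake` are hypothetical names.

```kotlin
import java.util.concurrent.CountDownLatch
import kotlin.concurrent.thread

// Hypothetical holder; `reachedSuspension` is deliberately a plain, non-volatile field.
class SuspensionProbe {
    var reachedSuspension = false
}

fun latchHandshake(): Boolean {
    val probe = SuspensionProbe()
    val latch = CountDownLatch(1)

    val worker = thread(name = "hypothetical-worker") {
        probe.reachedSuspension = true // stand-in for "the coroutine has suspended"
        latch.countDown()              // stand-in for `complete`; same thread, so program order applies
    }

    latch.await()                      // stand-in for the test body being resumed
    // Visible here because: write -> countDown (program order),
    // countDown -> return from await (documented CountDownLatch guarantee),
    // return from await -> read (program order).
    val seen = probe.reachedSuspension
    worker.join()
    return seen
}
```

Unlike a thread-state poll, returning from `await` is documented to happen-after everything the counting-down thread did before `countDown`, so the read of the plain field is ordered after the write without any extra synchronization.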

With no way to verify the fix, it's unclear if that was the problem, so we can only hope the change helps.


dkhalanskyjb requested a review from murfel December 11, 2025 13:05

dkhalanskyjb mentioned this pull request Dec 11, 2025

CoroutinesDumpTest#testSuspendedCoroutine failed #4418

Open
