@dkhalanskyjb

Fixes #4418
(unless it keeps happening)

This problem couldn't be reproduced locally, so this fix is purely analytical.

The problematic test attempts to launch a coroutine and then wait until the coroutine suspends.
Before the change, it did that as follows:

- Hold a monitor and `wait` on the test body side;
- Acquire a monitor and `notify` on the coroutine side *right before* the suspension point;
- On the test body side, wait for the coroutine thread to enter the `TIMED_WAITING` state, indicating that its scheduler worker has finished its piece of work and now waits for new commands, which must mean the suspension point was reached.

The problem is that thread states are not synchronization primitives, and no happens-before is established between the code a thread executes before the state change and the code right after the state change is observed.
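To illustrate why a thread state is only a heuristic, here is a minimal plain-Java sketch. It is not the code of the test in question, and every name in it (`ThreadStateIsNotASignal`, `reachedSuspensionPoint`, `letWorkerProceed`) is hypothetical. It shows one way the inference can go wrong: the worker enters `TIMED_WAITING` for a reason unrelated to the point the observer cares about, so the observer draws the wrong conclusion. It does not try to reproduce the missing happens-before edge itself, which cannot be triggered deterministically.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ThreadStateIsNotASignal {
    // Returns the value of the flag at the moment the worker first looked idle.
    static boolean flagSeenWhenWorkerLookedIdle() {
        AtomicBoolean reachedSuspensionPoint = new AtomicBoolean(false);
        CountDownLatch letWorkerProceed = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            try {
                // A timed wait that happens BEFORE the interesting point.
                // While blocked here, the thread reports TIMED_WAITING.
                letWorkerProceed.await(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                return;
            }
            // Stand-in for "the coroutine reached its suspension point".
            reachedSuspensionPoint.set(true);
        });
        worker.start();

        // The fragile pattern: treat the thread state as a completion signal.
        while (worker.getState() != Thread.State.TIMED_WAITING) {
            Thread.yield();
        }
        // The worker looks idle, yet it has not reached the point of interest.
        boolean seen = reachedSuspensionPoint.get();

        letWorkerProceed.countDown();
        try {
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // prints "flag seen while TIMED_WAITING: false"
        System.out.println("flag seen while TIMED_WAITING: " + flagSeenWhenWorkerLookedIdle());
    }
}
```

The observer sees `TIMED_WAITING` and still reads `false`: the state says what the thread is doing right now, not which code it has already executed.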

With this change, we establish a complete happens-before chain:

- The test body wakes up after it's `resume`d as a coroutine.
- `complete` on a latch happens-before the `resume`.
- The suspension happens-before the `complete`, as suspension and the `complete` are done in the same thread.
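The same shape of chain can be sketched in plain Java, assuming a `java.util.concurrent.CountDownLatch` as a stand-in for the latch: `countDown()` plays the role of `complete`, and returning from `await()` plays the role of the test body being `resume`d. This is an analogue under those assumptions, not the test's actual code, and the names `LatchHandshake` and `stateAtSuspensionPoint` are hypothetical. `CountDownLatch` documents that actions before `countDown()` happen-before actions after a successful `await()` in another thread, which is what makes the plain field read below safe.

```java
import java.util.concurrent.CountDownLatch;

class LatchHandshake {
    // Deliberately a plain (non-volatile) field: its visibility to the
    // test body relies only on the happens-before edge from the latch.
    static int stateAtSuspensionPoint = 0;

    static int runHandshake() {
        stateAtSuspensionPoint = 0;
        CountDownLatch reachedSuspensionPoint = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            // Work up to the point of interest (program order in one thread)...
            stateAtSuspensionPoint = 42;
            // ...then signal; this is the analogue of `complete` on the latch.
            reachedSuspensionPoint.countDown();
        });
        worker.start();

        try {
            // Analogue of the test body being resumed: await() returns only
            // after countDown(), and everything before countDown() is visible.
            reachedSuspensionPoint.await();
            int seen = stateAtSuspensionPoint;
            worker.join();
            return seen;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints "state seen by test body: 42"
        System.out.println("state seen by test body: " + runHandshake());
    }
}
```

Unlike polling `Thread.getState()`, the waiting side here never has to guess: it proceeds only once the signal has actually been sent.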

With no way to verify the fix, it's unclear if that was the problem, so we can only hope the change helps.
