Scheduler: Use a "scheduler" task for thread sleep #57544

Merged: kpamnany merged 6 commits into master from kp-sched-task-alt2 on Mar 27, 2025

Conversation

kpamnany
Member

@kpamnany kpamnany commented Feb 26, 2025

A Julia thread runs Julia's scheduler in the context of the switching task. If no task is found to switch to, the thread will sleep while holding onto the (possibly completed) task, preventing the task from being garbage collected. This recent Discourse post (https://discourse.julialang.org/t/weird-behaviour-of-gc-with-multithreaded-array-access/125433) illustrates precisely this problem.

A solution to this would be for an idle Julia thread to switch to a "scheduler" task, thereby freeing the old task.

This PR uses OncePerThread to create a "scheduler" task (that does nothing but run wait() in a loop) and switches to that task when the thread finds itself idle.

Other approaches considered and discarded in favor of this one: #57465 and #57543.
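
For readers unfamiliar with the pattern, here is a minimal sketch of a lazily created per-thread task whose body only calls `wait()` in a loop. It is an illustration under stated assumptions, not the code from this PR: `idle_loop` and `get_idle_task` are hypothetical names, and the sketch only constructs the task without ever switching to it. It assumes Julia 1.12 or later, where `Base.OncePerThread` is available.

```julia
# Sketch only: requires Julia 1.12+ for Base.OncePerThread.
# `idle_loop` and `get_idle_task` are hypothetical names, not taken from the PR.

# Body of the per-thread task: do nothing but call wait() in a loop.
function idle_loop()
    while true
        wait()
    end
end

# Lazily create one task per thread; the initializer runs at most once per thread.
const get_idle_task = Base.OncePerThread{Task}() do
    Task(idle_loop)
end

# Fetch the task for the current thread. It is created here but never started.
t = get_idle_task()
println(t isa Task)          # true
println(istaskstarted(t))    # false
```

The part that matters for the fix, switching to such a task when the thread goes idle so the previous task is no longer referenced, depends on scheduler internals and is deliberately left out of this sketch.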

@kpamnany kpamnany force-pushed the kp-sched-task-alt2 branch 2 times, most recently from b345e87 to 4c08ecb Compare February 28, 2025 00:16
@kpamnany
Member Author

kpamnany commented Mar 5, 2025

This is currently blocked on what seems to be a bug in OncePerThread serialization/deserialization; found with @gbaraldi.

@kpamnany kpamnany marked this pull request as ready for review March 5, 2025 15:14
gbaraldi added a commit that referenced this pull request Mar 20, 2025
…simage (#57656)

This is quite tricky to test unfortunately, but
#57544 caught this and this fixes
that

---------

Co-authored-by: Jameson Nash <[email protected]>
@kpamnany kpamnany force-pushed the kp-sched-task-alt2 branch from f280243 to 684637a Compare March 20, 2025 18:56
KristofferC pushed a commit that referenced this pull request Mar 20, 2025
…simage (#57656)

This is quite tricky to test unfortunately, but
#57544 caught this and this fixes
that

---------

Co-authored-by: Jameson Nash <[email protected]>
(cherry picked from commit bf01638)
@kpamnany
Member Author

Unblocked... thanks @gbaraldi!

Now, some Channel tests and a Sockets test are failing. Looking into these failures.

Member

@vtjnash vtjnash left a comment

I think those tests are specifically testing race conditions in the code, so you might need to adjust them slightly to account for the change in process_events ordering and call count.

@JamesWrigley
Contributor

Is there any chance of getting this backported to 1.12? It's quite tricky to debug and would be nice if it was fixed in the next release.

@kpamnany kpamnany force-pushed the kp-sched-task-alt2 branch from 684637a to 3bcde1b Compare March 24, 2025 19:15
This small group of tests is written with assumptions about when
and how the libuv event loop is run. As this PR changes this
behavior, the tests needed adjusting.
@kpamnany
Member Author

The channels tests are fixed, but I don't see a way to fix the Sockets test that's failing. Any ideas @gbaraldi or @vtjnash?

Previously, this test depended on scheduler behavior, which is
slightly changed in this PR. Changed the test to connect to a
non-routable IP address so that it no longer depends on task
ordering.
@kpamnany
Member Author

Thanks @vtjnash for the idea on how to fix the Sockets test.

@kpamnany
Member Author

kpamnany commented Mar 26, 2025

The channels tests that are failing on FreeBSD are a bit mystifying. How come Workqueue is empty on Linux but not on FreeBSD? Do we have an extra sticky task?

@kpamnany kpamnany merged commit 0d4d6d9 into master Mar 27, 2025
5 of 7 checks passed
@kpamnany kpamnany deleted the kp-sched-task-alt2 branch March 27, 2025 15:10
# We may have already switched tasks (via the scheduler task), so
# only switch if we haven't.
if !have_result
@assert task isa Task
Contributor

@nsajko nsajko Apr 14, 2025

Any specific reason this doesn't just typeassert? Just curious.

Suggested change
- @assert task isa Task
+ task = task::Task

Member Author

No specific reason. Seems like it should be a typeassert, actually.
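
To make the difference between the two forms concrete, here is a small sketch. The helper names (`check_with_assert`, `check_with_typeassert`, `thrown_type`) are hypothetical and exist only for this illustration; only `@assert`, `isa`, and the `::` typeassert come from the review thread. Both forms pass a `Task` through unchanged. On a non-`Task` value the typeassert throws a `TypeError`, while `@assert` reports failures as an `AssertionError` and, per the Julia documentation, may be disabled at some optimization levels.

```julia
# Hypothetical helpers for illustration; not code from the PR.

# Check the argument with the @assert macro.
function check_with_assert(x)
    @assert x isa Task
    return x
end

# Check the argument with a typeassert, which also tells the compiler
# that `x` is a Task from this point on.
function check_with_typeassert(x)
    x = x::Task
    return x
end

# Return the type of the exception thrown by f(arg), or `nothing` if none is thrown.
function thrown_type(f, arg)
    try
        f(arg)
        return nothing
    catch err
        return typeof(err)
    end
end

t = Task(() -> nothing)
println(thrown_type(check_with_assert, t))       # nothing
println(thrown_type(check_with_typeassert, t))   # nothing
println(thrown_type(check_with_typeassert, 1))   # TypeError
```

Besides the different exception type, the typeassert form narrows the inferred type of the variable, which is one reason it is often preferred in internal code.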

@KristofferC KristofferC added the "backport 1.12" label (Change should be backported to release-1.12) Apr 16, 2025
KristofferC pushed a commit that referenced this pull request Apr 16, 2025
A Julia thread runs Julia's scheduler in the context of the switching
task. If no task is found to switch to, the thread will sleep while
holding onto the (possibly completed) task, preventing the task from
being garbage collected. This recent [Discourse
post](https://discourse.julialang.org/t/weird-behaviour-of-gc-with-multithreaded-array-access/125433)
illustrates precisely this problem.

A solution to this would be for an idle Julia thread to switch to a
"scheduler" task, thereby freeing the old task.

This PR uses `OncePerThread` to create a "scheduler" task (that does
nothing but run `wait()` in a loop) and switches to that task when the
thread finds itself idle.

Other approaches considered and discarded in favor of this one:
#57465 and
#57543.

(cherry picked from commit 0d4d6d9)
@KristofferC KristofferC removed the "backport 1.12" label (Change should be backported to release-1.12) Apr 25, 2025
nsajko added a commit to nsajko/julia that referenced this pull request May 31, 2025
A typeassert seems like better style. Given that `typeassert` is a
builtin, why not put it to use.

See:

* JuliaLang#57544 (comment)
@nsajko nsajko mentioned this pull request May 31, 2025
oscardssmith pushed a commit that referenced this pull request Jun 1, 2025
Follows up on this PR:

* #57544 (comment)
@Keno
Member

Keno commented Jun 16, 2025

This change interacts badly with ^C. Many interrupt exceptions will now get thrown to the scheduler task, which then dies and takes down the ability to ever schedule again in the future.

@kpamnany
Member Author

> This change interacts badly with ^C. Many interrupt exceptions will now get thrown to the scheduler task, which then dies and takes down the ability to ever schedule again in the future.

Ugh.

^C behavior wasn't especially good before, but this is clearly worse. Suggestions?

@Keno
Member

Keno commented Jun 17, 2025

I think we need to revert this for 1.12 and then work on implementing proper behavior for 1.13 (#52291).

@kpamnany
Member Author

We could also simply never direct the interrupt at scheduler tasks?

@Keno
Member

Keno commented Jun 17, 2025

That then needs an extra feature to find the correct task to send it to. This is too much for 1.12 at this point - we should reset to the old behavior for the release and we can work on something better for 1.13.

6 participants