RFC: Reactor - IO threads#7230

Open
nefelim4ag wants to merge 6 commits into
Klipper3d:masterfrom
nefelim4ag:reactor-io-threads


@nefelim4ag nefelim4ag commented Mar 19, 2026

Collaborator

There are several places where the reactor's greenlet can be used to do blocking file IO.
Unfortunately, it is not possible to do file IO in a non-blocking fashion.

In some places, it is handled with a daemon (adxl345.py), which is cumbersome.
In others, it is ignored (virtual_sdcard).
Elsewhere, threads are used (palette2.py).

This is my attempt to solve the possible IO blocks inside the virtual_sdcard code.
Because it is a somewhat generic problem, I've tried to make a generic solution to do so across the code.
Hence the weird way of wrapping the FileIO wrapper: it can wrap any wrapper that supports blocking read/write.
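
A minimal sketch of that wrapping idea, not the code from this PR. The class name `ThreadedFile` and the `BLOCKING` set are hypothetical, and `Future.result()` stands in for whatever mechanism pauses the greenlet in Klipper:

```python
import io
from concurrent.futures import ThreadPoolExecutor


class ThreadedFile:
    """Hypothetical wrapper: runs blocking file methods on a worker thread."""

    # Calls that may hit the disk and should leave the calling thread.
    BLOCKING = {"read", "readline", "write", "flush", "close"}

    def __init__(self, file_object, executor):
        self._file = file_object
        self._executor = executor

    def __getattr__(self, name):
        attr = getattr(self._file, name)
        if name not in self.BLOCKING:
            return attr  # cheap calls such as tell() pass straight through

        def call_in_thread(*args, **kwargs):
            future = self._executor.submit(attr, *args, **kwargs)
            # Klipper would pause the greenlet here instead; result() is
            # only a stand-in that blocks the caller until the call is done.
            return future.result()

        return call_in_thread


executor = ThreadPoolExecutor(max_workers=1)
wrapped = ThreadedFile(io.BytesIO(b"G1 X10\nG1 X20\n"), executor)
first_line = wrapped.readline()  # executed on the worker thread
position = wrapped.tell()        # executed directly on the caller
executor.shutdown()
```

Because the wrapper only relies on attribute lookup, it can sit on top of any object that exposes the same blocking methods, including another wrapper.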

I ran a basic test, where I triggered ENOSPC on write, and checked virtual_sdcard reading with a large file of dummy commands.

Thanks,
-Timofey


  • Fixed: I have to figure out how to read during the analyze shutdown event =\ where I should not pause.
  • Fixed: Hmmm, the test exits before the first read happens, because do_resume expects that the next timer is blocking.

@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch 2 times, most recently from b59ade2 to 7782ef9 Compare March 20, 2026 15:29
@KevinOConnor

Collaborator

Thanks for working on this.

I agree that using threads like this is likely a better solution than using "fadvise".

I have a handful of high-level comments. Don't take these comments too seriously. In no particular order:

  • I don't think reactor.py is a good place for this code to live. I suspect the code really wants a reference to the printer object so that it can be called on a "klippy:disconnect" event. It also seems like the code is more a user of the reactor than a core component of the reactor.
  • For what it is worth, I'm not sure a generic "threaded IO solution" will be sufficiently scalable. The adxl345.py code is trying to avoid CPU starvation during writes, while the virtual_sdcard.py code has some "fast path" performance requirements. It might be worthwhile to consider focusing on virtual_sdcard.py and its particular buffering requirements.
  • I didn't understand the references to palette2.py - unless I'm missing something, it doesn't read/write local files, nor use threads.
  • Note that the save_variables.py module has been reported as causing failures due to its local file IO. (Though it doesn't have the performance considerations of adxl345.py and virtual_sdcard.py.)
  • FWIW, I saw a couple of errors in the code. I suspect this PR was intended as a proof of concept, but just for completeness: the adxl345.py call to open(filename, "w") can also block; the virtual_sdcard.py code tries to call f.seek(), f.tell(), etc. on the ReactorFileAIO object.

Thanks again,
-Kevin

@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch 3 times, most recently from 09624a6 to dda9818 Compare March 21, 2026 19:12
@nefelim4ag

nefelim4ag commented Mar 21, 2026

Collaborator Author

Yeah, palette2 is my hallucination. I've seen several "extensions" with UART, and they just blended together in my head.
My initial intention was that there are "cheap" calls (tell, seek), which should not be blocked, so they were bypassed.

Anyway:

  • Now, AIO (I'm not sure about the naming) is a separate entity
  • There is a thread pool to optimize the case with many SAVE_VARIABLE calls in a row
  • adxl345 does everything in one call (we can simply drop that patch, if there are concerns).
  • save_variable should not cause any issues (I hope)

Where virtual_sdcard now has:

  • Fix for tests
  • Buffer for the data. I wanted to use mmap, alas, I think it will cause unexpected issues if one modifies the G-code file.
  • Using the buffer, it is possible to split data feeding/reading into separate timer, to increase the throughput.
  • That allows the use of aio, without affecting the shutdown hook.
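
The buffering idea in the list above can be sketched roughly as follows. This is an illustration only: a plain thread and a bounded `queue.Queue` stand in for the PR's separate reactor timers, and `CHUNK_SIZE`, `reader`, and `feed_lines` are hypothetical names.

```python
import io
import queue
import threading

CHUNK_SIZE = 8  # tiny on purpose; a real reader would use several KiB


def reader(file_object, chunks):
    """Producer: the only place that performs blocking reads."""
    while True:
        data = file_object.read(CHUNK_SIZE)
        chunks.put(data)  # blocks while the buffer is full
        if not data:
            return  # an empty chunk marks end of file


def feed_lines(chunks):
    """Consumer: turns buffered chunks into complete G-code lines."""
    partial = b""
    while True:
        data = chunks.get()
        if not data:
            break
        lines = (partial + data).split(b"\n")
        partial = lines.pop()  # the last piece may be an incomplete line
        yield from lines
    if partial:
        yield partial


source = io.BytesIO(b"G28\nG1 X10 F3000\nG1 Y10\nM84")
chunks = queue.Queue(maxsize=4)  # bounded buffer between the two sides
thread = threading.Thread(target=reader, args=(source, chunks), daemon=True)
thread.start()
commands = list(feed_lines(chunks))
thread.join()
```

The consumer never touches the file, so code that needs recent data (such as a shutdown hook) can work from memory instead of issuing its own reads.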

Fancy thread naming can be removed; it is mostly for monitoring what is going on. But I think it is suboptimal for everything except virtual_sdcard.

Thanks,
-Timofey


Hmmm,
I can probably refactor the "adxl345" internal client to basically write in the background.
If a write is requested, one can create an anonymous (create + unlink) file inside the /tmp.
Write to it each batch, and then, upon finishing, simply link() it back to the real name.

That way, there should be no IO or CPU spikes; it will be evenly spread out.
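
A rough sketch of the batch-then-publish idea. Note that it swaps in a different technique from the one described above: a named temporary file in the destination directory plus `os.replace()`, instead of an anonymous file in /tmp plus `link()`, because `link()` cannot cross filesystems and /tmp is often a separate mount. The function name and file names are hypothetical.

```python
import os
import tempfile


def write_in_batches(final_path, batches):
    """Write batches to a hidden temp file, then publish it as final_path."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temp file lives next to the destination so that the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".partial-")
    try:
        with os.fdopen(fd, "w") as tmp:
            for batch in batches:
                tmp.write(batch)  # small writes, spread out over time
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # do not leave a partial file behind
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "adxl345-data.csv")
    write_in_batches(target, ["#time,x,y,z\n", "0.001,1,2,3\n"])
    with open(target) as f:
        written = f.read()
    leftovers = [n for n in os.listdir(workdir) if n.startswith(".partial-")]
```

Readers of the final name see either no file or the complete file, never a half-written one.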

The only limitation that I can think of is correctly handling the request_end_time.


Hmmm, I think that if I'm offloading the computation and writing to the thread, the problem is, that now the command is blocking.
So, whatever G-code triggered the write, it will be paused.
Which is correct behaviour from my PoV, but it is a behaviour change for accelerometers.

@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch 3 times, most recently from cabe873 to 9544f4c Compare March 23, 2026 15:50
@nefelim4ag

nefelim4ag commented Mar 23, 2026

Collaborator Author

I've underestimated the amount of work that is necessary to rework the accelerometer data write-out, so it has been removed from here.


Indirect issue (Klipper Discord), which should be fixed by this. In sequential print mode, if one excludes the object, the G-code can spend quite a while spinning inside the exclude_object _ignore_move(). Since there will be a reactor.pause() upon every read, this change should avoid the issue.

Cannot reproduce; it seems that even in that case, the reactor is able to execute other timers.

@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch 3 times, most recently from 3ecdf6f to fcddb4c Compare March 30, 2026 17:51
@github-actions


Thank you for your contribution to Klipper. Unfortunately, a reviewer has not assigned themselves to this GitHub Pull Request. All Pull Requests are reviewed before merging, and a reviewer will need to volunteer. Further information is available at: https://www.klipper3d.org/CONTRIBUTING.html

There are some steps that you can take now:

  1. Perform a self-review of your Pull Request by following the steps at: https://www.klipper3d.org/CONTRIBUTING.html#what-to-expect-in-a-review
    If you have completed a self-review, be sure to state the results of that self-review explicitly in the Pull Request comments. A reviewer is more likely to participate if the bulk of a review has already been completed.
  2. Consider opening a topic on the Klipper Discourse server to discuss this work. The Discourse server is a good place to discuss development ideas and to engage users interested in testing. Reviewers are more likely to prioritize Pull Requests with an active community of users.
  3. Consider helping out reviewers by reviewing other Klipper Pull Requests. Taking the time to perform a careful and detailed review of others' work is appreciated. Regular contributors are more likely to prioritize the contributions of other regular contributors.

Unfortunately, if a reviewer does not assign themselves to this GitHub Pull Request then it will be automatically closed. If this happens, then it is a good idea to move further discussion to the Klipper Discourse server. Reviewers can reach out on that forum to let you know if they are interested and when they are available.

Best regards,
~ Your friendly GitIssueBot

PS: I'm just an automated script, not a human being.

@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch 4 times, most recently from db34d30 to 2444f37 Compare April 19, 2026 18:03
@nefelim4ag nefelim4ag changed the title PoC: Reactor - IO threads RFC: Reactor - IO threads Apr 19, 2026
@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch 2 times, most recently from 62eb78b to 4fef8ef Compare April 19, 2026 19:06
@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch from 4fef8ef to 4ed9f35 Compare May 1, 2026 23:14
@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch from 4ed9f35 to 03a0bda Compare May 18, 2026 14:55
@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch from 03a0bda to e67e341 Compare May 31, 2026 01:10

In a batch mode, GCodeIO can exit before virtual_sdcard
finishes execution of G-code file

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

Shutdown hook tries to reread the previous and next bytes
That will complicate the async file access loop
Simply avoid that by always working over memory representation

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

Allow to wrap the file wrapper with the AIO wrapper
Which will pause greenlet upon blocking calls like read/write

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
@nefelim4ag nefelim4ag force-pushed the reactor-io-threads branch from e67e341 to 0dc7f2b Compare May 31, 2026 12:45

@dewi-ny-je dewi-ny-je left a comment

For what it's worth, an automatic LLM review is better than nothing, hoping it will make the PR get further.

Significant concerns

  1. gcode.py now reaches into an extras module (the biggest red flag).

    # Workaround virtual_sdcard
    obj = self.printer.lookup_object("virtual_sdcard", None)
    wait_virtual_input = obj.get_status(eventtime)["is_active"] if obj else False

    Core gcode.py shouldn't know about virtual_sdcard, and the author's own "Workaround" comment admits it. This is a layering inversion that will block merge. The real issue is that async/buffered reads make the existing synchronous fileinput-EOF detection race; it deserves a cleaner abstraction (e.g. virtual_sdcard registering that input is pending) rather than a hardcoded lookup in the core.

  2. Shared FileAIO + Queue(1) + raise Full is fragile under concurrent access. Executor.submit raises Full if a call is already in flight ("single execution flow assumed"). current_file is a single shared object; the PR was careful to make _handle_analyze_shutdown read from the in-memory cache instead of touching the file precisely to avoid this — but it relies on no other code path ever calling a current_file method while the work-timer greenlet has a read outstanding. That invariant isn't enforced or documented, and any future caller of e.g. seek/tell from another greenlet would trip Full (or interleave). Worth at least asserting/guarding.

  3. temperature_host mixes threads on one handle. Reads run on a borrowed worker thread (_get_sample does file_handle.seek/read), but handle_disconnect calls self.file_handle.close() synchronously on the main greenlet. If a sample is in flight at disconnect, that's a close/read race on the same handle. Closing through the executor (as save_variables does its writes) would be more consistent.

  4. Pool grows but never shrinks, and join() vs. an in-flight item. Each new concurrent consumer spawns a permanent thread (no upper bound, no reclamation until disconnect). Also join() does put_nowait(sentinel) into a Queue(1); it's only safe because the worker get()s before processing (leaving the queue empty), but it's timing-dependent and undocumented. A blocking put for the sentinel, or a dedicated stop flag checked without the queue, would be more robust.
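
To make the "one in-flight call, enforced" idea concrete, here is a minimal sketch under stated assumptions. It is not the PR's executor: `SingleSlotWorker` and its methods are hypothetical, an explicit lock replaces the reliance on queue timing, and the stop sentinel uses a blocking `put()`.

```python
import queue
import threading

_STOP = object()  # sentinel asking the worker to exit


class SingleSlotWorker:
    """Hypothetical one-call-at-a-time executor with an explicit guard."""

    def __init__(self):
        self._requests = queue.Queue(maxsize=1)
        self._busy = threading.Lock()  # held while a call is in flight
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._requests.get()
            if item is _STOP:
                return
            func, args, reply = item
            try:
                reply.put((True, func(*args)))
            except Exception as exc:
                reply.put((False, exc))

    def submit(self, func, *args):
        # Fail loudly instead of depending on queue timing.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError("another call is already in flight")
        try:
            reply = queue.Queue(maxsize=1)
            self._requests.put((func, args, reply))
            ok, value = reply.get()  # Klipper would pause the greenlet here
        finally:
            self._busy.release()
        if not ok:
            raise value
        return value

    def join(self):
        self._requests.put(_STOP)  # blocking put, never raises queue.Full
        self._thread.join()


worker = SingleSlotWorker()
total = worker.submit(sum, [1, 2, 3])

gate, started = threading.Event(), threading.Event()


def slow_read():
    started.set()
    gate.wait()  # simulates a read that is still outstanding
    return "data"


background = threading.Thread(target=worker.submit, args=(slow_read,))
background.start()
started.wait()
try:
    worker.submit(len, "x")  # second caller while the first is in flight
    rejected = False
except RuntimeError:
    rejected = True
gate.set()
background.join()
worker.join()
```

With the guard in place, a stray caller from another greenlet or thread gets an immediate, descriptive error rather than an interleaved or timing-dependent failure.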

Minor / correctness nits

  • FilePager.read() changed semantics: it no longer returns a fixed 8 KiB but "from current offset to end of the current page," so the first read after a non-aligned seek returns a short chunk. The work loop tolerates variable-length reads, so this is fine functionally — but it's a behavioral change worth a comment.
  • Cache miss in the shutdown dump: _handle_analyze_shutdown does self.current_file.pages.get(page_num, ""). With CACHE_SIZE=3 the relevant page is normally resident, but if it was recycled the diagnostic dump is silently empty/truncated. Acceptable for a best-effort dump, but a regression vs. the old guaranteed seek+read.
  • FilePager.__getattr__ returns the attribute directly (no executor proxy). It's only correct because the wrapped file_object is itself a FileAIO that re-proxies — a two-layer indirection that's easy to break later. A one-line comment would help.
  • No timeout on submit (completion.wait() defaults to _NEVER). If the underlying I/O genuinely hangs (e.g. dead NFS), the greenlet parks forever with no cancellation. That's arguably the intended trade-off, but it's an unbounded wait worth acknowledging.
  • No docs/config reference or tests. New [aio_executor] object is auto-loaded, so no user config needed, but there's no documentation and no test coverage of the pager/executor.
  • Copyright year 2026 in the new file header — just flag for consistency.
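
The first two nits can be illustrated with a small stand-alone sketch. It is not the PR's FilePager: `PageCache` is a hypothetical class that only mirrors the described behavior (reads return "current offset to end of page", and at most `CACHE_SIZE` pages stay resident).

```python
import io
from collections import OrderedDict

PAGE_SIZE = 8192
CACHE_SIZE = 3


class PageCache:
    """Hypothetical page-aligned reader with a small LRU cache."""

    def __init__(self, file_object):
        self._file = file_object
        self._offset = 0
        self.pages = OrderedDict()  # page number -> bytes

    def _load(self, page_num):
        if page_num in self.pages:
            self.pages.move_to_end(page_num)
        else:
            self._file.seek(page_num * PAGE_SIZE)
            self.pages[page_num] = self._file.read(PAGE_SIZE)
            if len(self.pages) > CACHE_SIZE:
                self.pages.popitem(last=False)  # evict least recently used
        return self.pages[page_num]

    def seek(self, offset):
        self._offset = offset

    def read(self):
        """Return bytes from the current offset to the end of its page."""
        page_num, start = divmod(self._offset, PAGE_SIZE)
        data = self._load(page_num)[start:]
        self._offset += len(data)
        return data


pager = PageCache(io.BytesIO(bytes(5 * PAGE_SIZE)))
pager.seek(100)
first = pager.read()   # short read: offset 100 to the end of page 0
second = pager.read()  # full page 1
for _ in range(3):
    pager.read()       # pages 2, 3 and 4
evicted = pager.pages.get(0, b"")  # page 0 was recycled, so this is empty
```

The last line shows why a cache-only diagnostic dump can come back empty: once a page has been recycled, a `.get()` with a default silently returns nothing.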

Recommendation

Good direction and a genuinely useful capability; the executor primitive is well-built. But it's not mergeable as-is, and the title's RFC status is appropriate. The two things I'd push back on hardest:

  1. Remove the virtual_sdcard knowledge from gcode.py — find a generic hook so the core stays decoupled.
  2. Pin down the concurrency contract on the shared FileAIO/Queue(1) (one in-flight op per file, enforced) and make temperature_host's close go through the same thread.
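
One possible shape for such a generic hook, as a sketch only. None of these names exist in Klipper: `GCodeInput`, `register_pending_input`, and `may_exit_on_eof` are hypothetical, and `FakeSDCard` merely stands in for virtual_sdcard.

```python
class GCodeInput:
    """Stand-in for the core input handler; it knows nothing about extras."""

    def __init__(self):
        self._pending_checks = []

    def register_pending_input(self, callback):
        # Any module that can still produce commands registers a callback.
        self._pending_checks.append(callback)

    def may_exit_on_eof(self, eventtime):
        # Only exit once no registered source reports pending input.
        return not any(check(eventtime) for check in self._pending_checks)


class FakeSDCard:
    """Stand-in for virtual_sdcard; only the hook usage is shown."""

    def __init__(self, gcode_input):
        self.is_active = True
        gcode_input.register_pending_input(self.has_pending_input)

    def has_pending_input(self, eventtime):
        return self.is_active


gcode_input = GCodeInput()
sdcard = FakeSDCard(gcode_input)
exit_while_printing = gcode_input.may_exit_on_eof(0.0)
sdcard.is_active = False
exit_when_done = gcode_input.may_exit_on_eof(0.0)
```

The dependency points from the extras module to the core, so the core never has to look up a specific module by name.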

@nefelim4ag

Collaborator Author

@dewi-ny-je, I appreciate your intention, but the methods are questionable.

Generally, I think:
The point of review is that the reviewer acknowledges the changes and can basically share "accountability".
For example, explain the behavioral changes to anyone, or help triangulate and fix problems later.
Where help in finding "bugs" is important, but secondary.

If one cannot review and grasp the changes (for example, I'm too goofy to review some PRs, or I do not use the feature), but is interested in those changes or wants to help, one can test them.
Testing with some feedback (a review/comment, or a simple "it works for me" or "it does not work for me") is as valuable as a review, for similar reasons. And, well, the author can have limited resources for actual testing, taking into account the variety of setups out there.
As a bonus, it can indirectly indicate interest in those changes.

That being said, it is generally hard to make a decision, whether changes are worth it or not.
Where specific implementation can often contradict the general high-level view of how it should work.
So, specific implementation is not so important in practice if the feature is "worth it", it will be eventually implemented anyway. If not, that's sad, but it is also normal.

The above actionables can help in general.
One person spent time implementing, another one testing/reviewing, third maintaining it later.
All of them now know more and can do more.
I can also suggest reading this: https://kristoff.it/blog/contributor-poker-and-ai/

So, unless the LLM helps you personally understand the changes, or makes reviewing or testing easier for you,
I think it is better to refrain from using or sharing the "brilliant" LLM findings until you can prove them yourself (then it is "I have found, with help", instead of "LLM has found").

Also, I do generally like what is written here: https://mikemcquaid.com/open-source-maintainers-owe-you-nothing/

All the above is my personal opinion.
Regards,
-Timofey

@dewi-ny-je

dewi-ny-je commented Jun 20, 2026


@nefelim4ag
I understand your point and I'm aware of the issues of AI/generated PRs.

My motivation is obvious: I have seen tens of PRs that looked (to me) like really good ideas get dropped just because no reviewer picked them up. And there are over 200 PRs waiting for a reviewer.
Some of them are so useful that Kevin at a certain point does it himself, but it's a small fraction and a number just gets lost.
I thought that if I can help some of them to proceed faster it would be worth a shot, but sure this costs time to the author of the PR and I didn't think about it.

After all, I'm not blindly enthusiastic about LLMs: my attempt to help here is the result of an impressive success rate (feature addition, cleanup, bug fixes), which led me to think that the other way around could be just as good.
I was wrong at least for this PR and the comments only resulted in a waste of time. The other PR review about Stealthchop lag seems more solid, and in other ones it was proven to be correct.

Nevertheless I did it only for the few PRs which I found potentially useful for my use case to avoid spamming too much, I put a disclaimer and (unless I clicked wrong) I did not "request changes" but I only submitted comments.

If you and/or @KevinOConnor prefer not to have any automatic review, I will comply and I won't take it personally because I totally recognise that if this happens often, the time wasted could be more than the benefits.
I would need to filter much more strictly, without giving any benefit of the doubt to the LLM (I'm still filtering, I don't accept every comment), limiting myself only to what I can directly verify/understand.

3 participants