fix(ci): prevent kafka consumer teardown hang with xdist #111261
Open
mchen-sentry wants to merge 3 commits into master
Contributor
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Missing try/except lets one consumer failure skip others
- Added try/except block around signal_shutdown() call to ensure one consumer's failure doesn't prevent cleanup of remaining consumers in the loop.
Or push these changes by commenting:
@cursor push bf4c8e5e4f
Preview (bf4c8e5e4f)
diff --git a/src/sentry/testutils/pytest/kafka.py b/src/sentry/testutils/pytest/kafka.py
--- a/src/sentry/testutils/pytest/kafka.py
+++ b/src/sentry/testutils/pytest/kafka.py
@@ -81,7 +81,10 @@
     for consumer_name, consumer in all_consumers.items():
         if consumer is not None:
-            consumer.signal_shutdown()
+            try:
+                consumer.signal_shutdown()
+            except Exception:
+                _log.warning("Could not shutdown consumer %s", consumer_name)
 @pytest.fixture(scope="function")
This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.
609e0bd to ee19ad9
Contributor
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Run `consumer.run()` on a daemon thread with a 5s join timeout instead of calling it directly in the `scope_consumers` teardown.

The old teardown called `signal_shutdown()` then `run()` on the main thread. `run()` calls `_shutdown()` → `consumer.close()`, which sends a LeaveGroup to the broker and blocks in the rebalance callback. With xdist, multiple workers tear down simultaneously, the broker gets slow, and `close()` blocks indefinitely, hanging the CI shard at ~99%.

The daemon thread still attempts a graceful shutdown but bails out after 5 seconds. If the thread outlives the join, it dies on process exit and the broker detects the dead consumer via heartbeat timeout.
ee19ad9 to 30b77c3
Temporary instrumentation to understand why consumer.close() hangs during scope_consumers teardown with xdist. Logs member_id, time since last poll, and probes committed() before shutdown.
…y theory" This reverts commit c1958f5.


Observed CI shards hanging at ~99% completion after enabling xdist. Reproduced the hang on a test branch; a thread dump of the hung process showed:
Let's run `consumer.run()` on a daemon thread with a 5-second join timeout. This still attempts a graceful shutdown, but bails out if the broker is slow. If the thread outlives the join, it dies on process exit and the broker detects the dead consumer via heartbeat timeout.

Tested that hangs no longer happen with the fix.

We could also potentially just remove `consumer.run()` entirely and skip graceful shutdown, since the fixture is session-scoped and teardown runs right before process exit.
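
For reviewers, here is a minimal sketch of the daemon-thread pattern described above. It is not the actual fixture code: `run_with_timeout`, `shutdown_consumer`, and `FakeConsumer` are hypothetical names used only for illustration, and `FakeConsumer.run()` simulates a slow `close()` with a sleep instead of talking to a broker.

```python
import logging
import threading
import time

_log = logging.getLogger(__name__)


def run_with_timeout(target, timeout=5.0, name="consumer-shutdown"):
    """Run target() on a daemon thread; return True if it finished in time.

    Because the thread is a daemon, a target that is still blocked when the
    join times out will not keep the process alive at exit.
    """
    thread = threading.Thread(target=target, name=name, daemon=True)
    thread.start()
    thread.join(timeout=timeout)
    if thread.is_alive():
        _log.warning("%s did not finish within %.1fs; abandoning it", name, timeout)
        return False
    return True


class FakeConsumer:
    """Hypothetical stand-in for a consumer whose run() may block on close."""

    def __init__(self, close_delay):
        self.close_delay = close_delay
        self.shutdown_signalled = False

    def signal_shutdown(self):
        self.shutdown_signalled = True

    def run(self):
        # Simulates run() -> close() blocking on a slow broker.
        time.sleep(self.close_delay)


def shutdown_consumer(consumer, timeout=5.0):
    """Signal shutdown, then attempt the graceful path with a bounded wait."""
    consumer.signal_shutdown()
    return run_with_timeout(consumer.run, timeout=timeout)
```

With a fast consumer, `shutdown_consumer` returns `True` once `run()` completes. With a consumer that blocks longer than the timeout, it returns `False` after roughly `timeout` seconds and teardown continues.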