Fix _Once.ensure() to propagate handshake failure to concurrent waiters #3397
bysiber wants to merge 1 commit into python-trio:main
When two tasks use an SSLStream concurrently (one sending, one receiving), both call _Once.ensure() to trigger the lazy handshake. If the first task starts the handshake and it fails (e.g. certificate error, connection reset), the exception propagates to that task but _done is never set. The second task, already waiting on _done.wait(), blocks indefinitely — the Event is never signalled and started is permanently True, so there is no recovery path. Store the failure exception and set _done even on error, so that all waiters wake up and receive a BrokenResourceError chained from the original handshake exception.
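The hang described above can be reproduced in miniature. The following is a minimal sketch, not trio's actual code: it assumes a simplified `Once` class with the same shape as the unfixed `_Once`, and it uses the standard-library `asyncio` as a stand-in for trio so that it runs without extra dependencies. `failing_handshake` and `demo` are hypothetical names introduced only for illustration.

```python
import asyncio


class Once:
    """Simplified stand-in for the unfixed _Once (assumed shape, not trio's code)."""

    def __init__(self, afn):
        self._afn = afn
        self.started = False
        self._done = asyncio.Event()

    async def ensure(self):
        if not self.started:
            self.started = True
            await self._afn()  # if this raises, _done is never set
            self._done.set()
        else:
            await self._done.wait()


async def failing_handshake():
    # Yield once so the second task has time to start waiting on _done.
    await asyncio.sleep(0)
    raise ConnectionResetError("simulated handshake failure")


async def demo():
    once = Once(failing_handshake)
    first = asyncio.create_task(once.ensure())
    second = asyncio.create_task(once.ensure())
    results = {}
    try:
        await first
    except ConnectionResetError:
        results["first"] = "raised"
    try:
        # Without a timeout this await would never return.
        await asyncio.wait_for(second, timeout=0.1)
    except asyncio.TimeoutError:
        results["second"] = "still waiting"
    return results


result = asyncio.run(demo())
print(result)
```

The first task sees the simulated failure, while the second only stops waiting because the timeout cancels it.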
Codecov Report
❌ Your patch status has failed because the patch coverage (66.66667%) is below the target coverage (100.00000%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

    @@                 Coverage Diff                  @@
    ##                 main        #3397        +/-   ##
    ====================================================
    - Coverage   100.00000%    99.97942%   -0.02058%
    ====================================================
      Files             128          128
      Lines           19424        19434         +10
      Branches         1318         1320          +2
    ====================================================
    + Hits            19424        19430          +6
    - Misses              0            2          +2
    - Partials            0            2          +2
A5rocks left a comment
Could you create an issue before making a PR fixing a bug like this? I'd like to confirm that this is a real problem first!
    try:
        await self._afn(*self._args)
    except BaseException as exc:
        self._failure = exc
This is a bad idea, because a) the stack frames for exc will be mutated, so you will have weird stack traces for raise ... from self._failure and b) this will lead to a refcycle I believe.
Problem
_Once.ensure() in SSLStream can leave concurrent waiters hanging forever when the handshake fails.

When two tasks share an SSLStream (one sending, one receiving), both call ensure() to lazily perform the TLS handshake. The first task sets started = True and begins the handshake. The second task sees started = True, finds _done not yet set, and enters _done.wait().

If the handshake fails (certificate error, connection reset, etc.), the exception propagates to the first task, but _done.set() is never called. The second task is stuck forever in _done.wait(): the Event will never be signalled, and started is permanently True, so re-entry won't help either.

Reproduction scenario
1. send_all() → enters ensure(), starts handshake
2. receive_some() → enters ensure(), waits on _done
3. The handshake fails and the first task gets BrokenResourceError: correct

Fix
Store the exception on failure and still signal _done, so that concurrent waiters (and any future callers) wake up and receive a BrokenResourceError chained from the original handshake exception.
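The approach described in this fix can be sketched as follows. This is an illustration under stated assumptions rather than the patch itself: it uses the standard-library `asyncio` in place of trio, defines a local `BrokenResourceError` as a stand-in for `trio.BrokenResourceError`, and the names `Once`, `failing_handshake`, and `demo` are hypothetical. It stores the exception exactly as the reviewed diff does, so the review's concerns about traceback mutation and reference cycles apply to this sketch as well.

```python
import asyncio


class BrokenResourceError(Exception):
    """Local stand-in for trio.BrokenResourceError (assumption for this sketch)."""


class Once:
    """Simplified sketch of the described fix, not trio's actual implementation."""

    def __init__(self, afn):
        self._afn = afn
        self.started = False
        self._done = asyncio.Event()
        self._failure = None

    async def ensure(self):
        if not self.started:
            self.started = True
            try:
                await self._afn()
            except BaseException as exc:
                # Remember the failure so waiters can chain from it.
                self._failure = exc
                raise
            finally:
                # Signal waiters on success and on failure alike.
                self._done.set()
        else:
            await self._done.wait()
            if self._failure is not None:
                raise BrokenResourceError("handshake failed") from self._failure


async def failing_handshake():
    # Yield once so the second task has time to start waiting on _done.
    await asyncio.sleep(0)
    raise ConnectionResetError("simulated handshake failure")


async def demo():
    once = Once(failing_handshake)
    first = asyncio.create_task(once.ensure())
    second = asyncio.create_task(once.ensure())
    # Collect the exceptions instead of letting them propagate.
    return await asyncio.gather(first, second, return_exceptions=True)


first_exc, second_exc = asyncio.run(demo())
print(type(first_exc).__name__, type(second_exc).__name__)
```

In this sketch the task that ran the handshake sees the original exception, and the waiter sees a `BrokenResourceError` whose `__cause__` is that same exception object.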