GOVFOUN-408 | prevent PoolTimeout cascade caused by holding connection slots during retry sleep#911
Conversation
…tion pool When all worker threads hit PoolTimeout simultaneously, the httpcore connection pool enters a degraded state. Subsequent main-thread searches retry into the same broken pool, causing a ~15-20 minute retry loop. reset_http_session() closes and rebuilds the httpx.Client with a fresh pool so the next retry starts clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ool slot When a 429 (or other retryable status) triggers a retry sleep, the response stream was left open, holding the httpcore connection slot for the full sleep duration. With 4 threads all receiving 429 simultaneously, all 4 slots were held for 30s, causing PoolTimeout for any queued requests. Fix: call response.close() / response.aclose() before retry.sleep() so the connection returns to the pool immediately. Headers are already buffered in memory, so Retry-After parsing in retry.sleep() is unaffected. Root cause confirmed via httpcore DEBUG logging: 429 at 11:20:01 → 30s sleep → PoolTimeout at 11:20:31 (exactly pool=30.0s). No stale-socket events. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| """Close and rebuild the HTTP session to recover from a degraded connection pool.""" | ||
| try: | ||
| self._session.close() | ||
| except Exception: |
There was a problem hiding this comment.
Consider: The bare except Exception: pass silently swallows all errors during session close. While this is intentional (to ensure we always proceed to create a new session), logging the exception at DEBUG level would help with troubleshooting:
| except Exception: | |
| except Exception as e: | |
| LOGGER.debug("Failed to close old HTTP session: %s", e) |
This preserves the recovery-at-all-costs behavior while leaving a trace for debugging.
| event_hooks={"response": [log_response]}, | ||
| ) | ||
| self._401_has_retried.set(False) | ||
| LOGGER.warning("HTTP session reset: new connection pool created") |
There was a problem hiding this comment.
Nit: LOGGER.warning() for a normal recovery operation may be too noisy in production. Consider using LOGGER.info() since this is expected recovery behavior, not an unexpected condition.
| threading.Event.wait(timeout=None). With pool=30.0 the SDK raises an | ||
| exception quickly instead. | ||
| """ | ||
| original_transport = client._session._transport |
There was a problem hiding this comment.
Fragile test: Accessing client._session._transport._transport._pool relies on internal httpx/httpcore implementation details. If httpx changes its internal structure, this test will break.
Consider adding a comment documenting this coupling, or wrapping it in a try/except with pytest.skip() if the internal structure changes:
def _get_httpcore_pool(client: AtlanClient):
"""Access the underlying httpcore pool. Relies on httpx internals; may need updating if httpx changes."""
try:
return client._session._transport._transport._pool
except AttributeError:
pytest.skip("httpx internal structure changed; update test helper")
Aryamanz29
left a comment
There was a problem hiding this comment.
Review: GOVFOUN-408 — httpcore Pool Timeout Fix
Excellent PR. The RCA is one of the best I've seen — live surgery on a stuck pod with gdb/py-spy to prove the mechanism before shipping the fix. The changes are minimal, targeted, and well-tested.
Verdict: APPROVE
Findings
[QUAL-F1] All 5 changes are correct and necessary ✅
| Change | Assessment |
|---|---|
pool=30.0 timeout |
Correct. None caused infinite blocking. 30s is generous enough for normal operation, fast enough to fail-and-retry. |
httpx.Limits(max_connections=50, keepalive_expiry=30.0) |
Correct. keepalive_expiry=30.0 < nginx keepalive_timeout=75s prevents CLOSE_WAIT accumulation. Reducing from 100→50 is sensible for SDK clients. |
response.close() before retry.sleep() |
Critical fix. Holding the connection slot during Retry-After sleep is the secondary deadlock vector. The isinstance guard for httpx.HTTPError is correct. |
max_retries context manager gets same Limits |
Correct — without this, with client.max_retries(): would silently revert to unsafe defaults. |
reset_http_session() |
Good escape hatch for long-running workflows. Correctly preserves proxy, verify, headers, and resets _401_has_retried. |
[QUAL-F2] DRY concern — Limits repeated 4 times
Severity: Minor
httpx.Limits(max_connections=50, max_keepalive_connections=10, keepalive_expiry=30.0) appears in 4 places:
__init__(line 220)reset_http_session(line 304)max_retries(line 2068)- Comments reference it too
Consider extracting to a class-level constant:
_DEFAULT_POOL_LIMITS = httpx.Limits(
max_connections=50,
max_keepalive_connections=10,
keepalive_expiry=30.0,
)Not a blocker — just prevents future drift if values need tuning.
[TEST-F1] Comprehensive test coverage ✅
34 tests covering:
- Pool timeout value and propagation
- Transport limits (max_connections, keepalive_expiry, max_keepalive_connections)
- max_retries transport replacement and restoration
- reset_http_session (new session, limits, close, 401 flag, proxy)
- response.close() call ordering before retry.sleep()
- isinstance guard for exception retries
- Integration: concurrent requests, 429 retry without PoolTimeout
[SEC-F1] No security concerns ✅
No credentials in code/logs. reset_http_session correctly rebuilds auth headers from existing client state, not from new inputs.
Strengths
- Root cause analysis with live process inspection is gold-standard debugging
- Fix addresses both the symptom (no pool timeout) and the structural cause (CLOSE_WAIT accumulation)
response.close()beforeretry.sleep()is a subtle but critical fix — most people would miss this- Tests verify call ordering, not just outcomes
✨ Description
csa-metadata-completenesswas hanging indefinitely on production tenants. Root cause: the SDK's retry logic calledretry.sleep(response)while the HTTP response stream was still open, keeping the httpcore connection slot occupied for the full sleep duration. Under concurrent load with 429 rate-limiting, all pool slots filled with sleeping threads — no slot was available for retries, causing a PoolTimeout cascade that stalled workflows permanently.Complete details provided here - https://atlanhq.atlassian.net/wiki/spaces/dg/pages/1909096582/pyatlan+SDK+httpcore+Pool+Timeout+Fix
Changes
Core fix — release connection before sleeping (
pyatlan/client/transport.py)Added
response.close()/await response.aclose()beforeretry.sleep()/await retry.asleep()inPyatlanSyncTransport._retry_operationandPyatlanAsyncTransport._retry_operation_async. The response headers are already buffered in memory at this point, soRetry-Afterparsing is unaffected. Anisinstance(response, httpx.Response)guard protects the exception-retry path where the loop variable holds anhttpx.HTTPError(no.close()).Connection pool configuration (
pyatlan/client/atlan.py)httpx.Limits(max_connections=50, max_keepalive_connections=10, keepalive_expiry=30.0)— limits pool size and retires idle connections before nginx's75s keepalive FIN, preventing CLOSE_WAIT socket accumulation.
httpx.Timeout(pool=30.0)— threads now raisePoolTimeoutwithin 30s instead of blocking onthreading.Event.wait(timeout=None)forever.max_retriescontext manager, and the newreset_http_session().reset_http_session()— degraded pool recoveryNew public method that closes the current
httpx.Client, rebuilds it with a fresh connection pool (same limits/auth/proxy config), and resets_401_has_retried. Useful when callers detect a degraded pool and want to recover without reinitializing the full client.How has this been tested?
retry.sleep(response)holding connection slot → PoolTimeout cascade across all worker threads.test_response_closed_before_retry_sleep(sync + async) — assertsclose()is called beforesleep()using call-order trackingtest_no_close_on_exception_retry— verifiesisinstanceguard for the exception-retry pathTestResetHttpSession— covers session replacement, correct limits, old session teardown,_401_has_retriedreset, and proxy/verify forwardingmax_retrieslimits regression testsPoolTimeoutpropagation, concurrent 429 retry withoutPoolTimeout
Jira link: [https://linear.app/atlan-epd/issue/GOVFOUN-408/mastercardorreplace-custom-metadata-hangs-indefinitely-in-csametadata]
🧩 Type of change
Select all that apply:
✅ How has this been tested? (e.g. screenshots, logs, workflow links)
Describe how the change was tested. Include:
📋 Checklist