GOVFOUN-408 | prevent PoolTimeout cascade caused by holding connection slots during retry sleep by ankitpatnaik-atlan · Pull Request #911 · atlanhq/atlan-python

ankitpatnaik-atlan · 2026-04-24T12:56:56Z

✨ Description

csa-metadata-completeness was hanging indefinitely on production tenants. Root cause: the SDK's retry logic called retry.sleep(response) while the HTTP response stream was still open, keeping the httpcore connection slot occupied for the full sleep duration. Under concurrent load with 429 rate-limiting, all pool slots filled with sleeping threads — no slot was available for retries, causing a PoolTimeout cascade that stalled workflows permanently.

Complete details provided here - https://atlanhq.atlassian.net/wiki/spaces/dg/pages/1909096582/pyatlan+SDK+httpcore+Pool+Timeout+Fix

Changes

Core fix — release connection before sleeping (`pyatlan/client/transport.py`)

Added response.close() / await response.aclose() before retry.sleep() / await retry.asleep() in PyatlanSyncTransport._retry_operation and PyatlanAsyncTransport._retry_operation_async. The response headers are already buffered in memory at this point, so Retry-After parsing is unaffected. An isinstance(response, httpx.Response) guard protects the exception-retry path, where the loop variable holds an httpx.HTTPError, which has no .close() method.
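The ordering described above can be sketched with the standard library alone. `FakeResponse`, `TransportError`, and `retry_operation` are hypothetical stand-ins for `httpx.Response`, `httpx.HTTPError`, and the SDK's `_retry_operation`; this is not the pyatlan code, only an illustration that `close()` runs before the sleep and is skipped when the loop variable holds an exception.

```python
import time
from typing import Callable, Union


class FakeResponse:
    """Hypothetical stand-in for httpx.Response (headers already in memory)."""

    def __init__(self, status_code: int, retry_after: float = 0.0) -> None:
        self.status_code = status_code
        self.headers = {"Retry-After": str(retry_after)}
        self.closed = False

    def close(self) -> None:
        # In httpx, closing the response hands the connection back to the pool.
        self.closed = True


class TransportError(Exception):
    """Hypothetical stand-in for httpx.HTTPError; it has no close()."""


def retry_operation(
    send: Callable[[], FakeResponse],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> FakeResponse:
    """Retry on 429 or transport errors, releasing the slot before each sleep."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    outcome: Union[FakeResponse, TransportError]
    for attempt in range(max_attempts):
        try:
            outcome = send()
        except TransportError as exc:
            outcome = exc
        else:
            if outcome.status_code != 429:
                return outcome
        if attempt == max_attempts - 1:
            break
        delay = 0.0
        if isinstance(outcome, FakeResponse):
            # Release the slot first; the buffered headers stay readable.
            outcome.close()
            delay = float(outcome.headers.get("Retry-After", 0))
        sleep(delay)
    if isinstance(outcome, TransportError):
        raise outcome
    return outcome
```

Passing `sleep` as a parameter is a choice made for this sketch so that the ordering can be observed without real waiting.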

Connection pool configuration (`pyatlan/client/atlan.py`)

- httpx.Limits(max_connections=50, max_keepalive_connections=10, keepalive_expiry=30.0): limits pool size and retires idle connections before nginx's 75s keepalive FIN, preventing CLOSE_WAIT socket accumulation.
- httpx.Timeout(pool=30.0): threads now raise PoolTimeout within 30s instead of blocking on threading.Event.wait(timeout=None) forever.
- Same limits applied consistently across session init, max_retries context manager, and the new reset_http_session().
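To illustrate why a finite pool timeout matters, here is a toy pool built on a semaphore. `SlotPool` and `PoolTimeoutError` are hypothetical names for this sketch and do not reflect how httpcore implements its pool; the timeout is scaled down from the PR's 30.0 so the example finishes quickly.

```python
import threading
from typing import Optional


class PoolTimeoutError(Exception):
    """Hypothetical stand-in for httpx.PoolTimeout."""


class SlotPool:
    """Toy connection pool: a fixed number of slots guarded by a semaphore."""

    def __init__(self, max_connections: int, pool_timeout: Optional[float]) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)
        self._pool_timeout = pool_timeout

    def acquire(self) -> None:
        # timeout=None would block until a slot frees up, possibly forever;
        # a finite timeout turns a stuck wait into a visible error.
        if not self._slots.acquire(timeout=self._pool_timeout):
            raise PoolTimeoutError(f"no free slot within {self._pool_timeout}s")

    def release(self) -> None:
        self._slots.release()


pool = SlotPool(max_connections=1, pool_timeout=0.05)  # 30.0 in the PR
pool.acquire()  # the only slot is now taken
try:
    pool.acquire()  # a second caller fails fast instead of hanging
    timed_out = False
except PoolTimeoutError:
    timed_out = True
pool.release()
```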

`reset_http_session()` — degraded pool recovery

New public method that closes the current httpx.Client, rebuilds it with a fresh connection pool (same limits/auth/proxy config), and resets _401_has_retried. Useful when callers detect a degraded pool and want to recover without reinitializing the full client.
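A rough sketch of the close-and-rebuild pattern follows. `FakeSession`, `SketchClient`, the `session_factory` argument, and the plain boolean `_has_retried_401` are all assumptions made for illustration; the real method rebuilds an `httpx.Client` and resets `_401_has_retried`, whose exact type is not shown in this description.

```python
import logging
from typing import Callable

LOGGER = logging.getLogger(__name__)


class FakeSession:
    """Hypothetical stand-in for httpx.Client."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class SketchClient:
    """Hypothetical client that owns one session and can rebuild it."""

    def __init__(self, session_factory: Callable[[], FakeSession]) -> None:
        # The factory captures limits, auth, and proxy settings, so every
        # rebuilt session gets the same configuration as the first one.
        self._session_factory = session_factory
        self._session = session_factory()
        self._has_retried_401 = False

    def reset_http_session(self) -> None:
        """Close the current session and replace it with a fresh one."""
        try:
            self._session.close()
        except Exception as exc:
            # Recovery should proceed even if the old session fails to close.
            LOGGER.debug("Failed to close old HTTP session: %s", exc)
        self._session = self._session_factory()
        self._has_retried_401 = False
```

A caller that catches a pool timeout could call `reset_http_session()` once and then retry the request against the fresh pool.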

How has this been tested?

Confirmed root cause from Argo workflow logs: 429 → retry.sleep(response) holding connection slot → PoolTimeout cascade across all worker threads.
Validated fix on live CSA workflows: 15-asset run (manual score verification) and 10K-asset run both completed without errors.
Unit tests added:
- test_response_closed_before_retry_sleep (sync + async) — asserts close() is called before sleep() using call-order tracking
- test_no_close_on_exception_retry — verifies isinstance guard for the exception-retry path
- TestResetHttpSession — covers session replacement, correct limits, old session teardown, _401_has_retried reset, and proxy/verify forwarding
- Pool config and max_retries limits regression tests
Integration tests added: pool limits on live client, concurrent request deadlock check, PoolTimeout propagation, concurrent 429 retry without PoolTimeout
15 Asset run - https://gov-studio.atlan.com/workflows/profile/csa-metadata-completeness-1776843614/runs?name=csa-metadata-completeness-1776843614-pkd9h
10K asset run - https://gov-studio.atlan.com/workflows/profile/csa-metadata-completeness-1776843614/runs?name=csa-metadata-completeness-1776843614-px4zq
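The call-order tracking mentioned for test_response_closed_before_retry_sleep could look roughly like this with `unittest.mock`. `close_then_sleep` is a hypothetical function under test, not the SDK's transport method; the technique shown is attaching both collaborators to one parent mock so that their calls land in a single ordered list.

```python
from unittest.mock import Mock, call


def close_then_sleep(response, retry) -> None:
    """Hypothetical function under test: release the slot, then back off."""
    response.close()
    retry.sleep(response)


def test_close_called_before_sleep() -> None:
    # Children of one parent mock record their calls, in order, on the parent.
    parent = Mock()
    response, retry = parent.response, parent.retry
    close_then_sleep(response, retry)
    assert parent.mock_calls == [
        call.response.close(),
        call.retry.sleep(response),
    ]


test_close_called_before_sleep()
```

Asserting on the ordered call list catches a regression where `sleep()` is moved back ahead of `close()`, which a plain "was called" assertion would miss.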

Jira link: [https://linear.app/atlan-epd/issue/GOVFOUN-408/mastercardorreplace-custom-metadata-hangs-indefinitely-in-csametadata]

🧩 Type of change

Select all that apply:

🚀 New feature (non-breaking change that adds functionality)
🐛 Bug fix (non-breaking change that fixes an issue) — please include tests! Refer testing-toolkit 🧪
🔄 Refactor (code change that neither fixes a bug nor adds a feature)
🧹 Maintenance (chores, cleanup, minor improvements)
💥 Breaking change (fix or feature that may break existing functionality)
📦 Dependency upgrade/downgrade
📚 Documentation updates


📋 Checklist

My code follows the project’s style guidelines
I’ve performed a self-review of my code
I’ve added comments in tricky or complex areas
I’ve updated the documentation as needed
There are no new warnings from my changes
I’ve added tests to cover my changes
All new and existing tests pass locally

…tion pool

When all worker threads hit PoolTimeout simultaneously, the httpcore connection pool enters a degraded state. Subsequent main-thread searches retry into the same broken pool, causing a ~15-20 minute retry loop. reset_http_session() closes and rebuilds the httpx.Client with a fresh pool so the next retry starts clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ool slot

When a 429 (or other retryable status) triggers a retry sleep, the response stream was left open, holding the httpcore connection slot for the full sleep duration. With 4 threads all receiving 429 simultaneously, all 4 slots were held for 30s, causing PoolTimeout for any queued requests.

Fix: call response.close() / response.aclose() before retry.sleep() so the connection returns to the pool immediately. Headers are already buffered in memory, so Retry-After parsing in retry.sleep() is unaffected.

Root cause confirmed via httpcore DEBUG logging: 429 at 11:20:01 → 30s sleep → PoolTimeout at 11:20:31 (exactly pool=30.0s). No stale-socket events.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
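The mechanism in this commit message can be reproduced in miniature with a semaphore acting as the pool. This is a simulation under stated assumptions, not httpcore itself: 2 slots instead of 4, a 0.3s sleep instead of 30s, and a 0.05s acquire timeout in place of pool=30.0. The function name and its parameter are hypothetical.

```python
import threading
import time


def queued_request_succeeds(release_before_sleep: bool) -> bool:
    """Return True if a queued request gets a slot while workers back off."""
    slots = threading.BoundedSemaphore(2)  # toy pool with 2 slots
    backing_off = threading.Barrier(3)  # 2 workers plus the main thread

    def worker() -> None:
        slots.acquire()  # "send the request, receive a 429"
        if release_before_sleep:
            slots.release()  # the fix: close the response before sleeping
        backing_off.wait()  # every worker has now reached its back-off
        time.sleep(0.3)  # stands in for the Retry-After sleep
        if not release_before_sleep:
            slots.release()  # old behaviour: slot held for the whole sleep

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for thread in threads:
        thread.start()
    backing_off.wait()
    # The queued request waits at most 0.05s for a free slot.
    got_slot = slots.acquire(timeout=0.05)
    if got_slot:
        slots.release()
    for thread in threads:
        thread.join()
    return got_slot
```

With `release_before_sleep=False` the queued request times out because both slots are held by sleeping workers; with `True` it gets a slot immediately.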

claude · 2026-04-24T13:00:18Z

+        """Close and rebuild the HTTP session to recover from a degraded connection pool."""
+        try:
+            self._session.close()
+        except Exception:


Consider: The bare except Exception: pass silently swallows all errors during session close. While this is intentional (to ensure we always proceed to create a new session), logging the exception at DEBUG level would help with troubleshooting:

Suggested change

-        except Exception:
+        except Exception as e:
+            LOGGER.debug("Failed to close old HTTP session: %s", e)

This preserves the recovery-at-all-costs behavior while leaving a trace for debugging.

claude · 2026-04-24T13:00:26Z

+            event_hooks={"response": [log_response]},
+        )
+        self._401_has_retried.set(False)
+        LOGGER.warning("HTTP session reset: new connection pool created")


Nit: LOGGER.warning() for a normal recovery operation may be too noisy in production. Consider using LOGGER.info() since this is expected recovery behavior, not an unexpected condition.

claude · 2026-04-24T13:00:36Z

+    threading.Event.wait(timeout=None). With pool=30.0 the SDK raises an
+    exception quickly instead.
+    """
+    original_transport = client._session._transport


Fragile test: Accessing client._session._transport._transport._pool relies on internal httpx/httpcore implementation details. If httpx changes its internal structure, this test will break.

Consider adding a comment documenting this coupling, or wrapping it in a try/except with pytest.skip() if the internal structure changes:

def _get_httpcore_pool(client: AtlanClient):
    """Access the underlying httpcore pool.

    Relies on httpx internals; may need updating if httpx changes.
    """
    try:
        return client._session._transport._transport._pool
    except AttributeError:
        pytest.skip("httpx internal structure changed; update test helper")

Aryamanz29

Review: GOVFOUN-408 — httpcore Pool Timeout Fix

Excellent PR. The RCA is one of the best I've seen — live surgery on a stuck pod with gdb/py-spy to prove the mechanism before shipping the fix. The changes are minimal, targeted, and well-tested.

Verdict: APPROVE

Findings

[QUAL-F1] All 5 changes are correct and necessary ✅

| Change | Assessment |
| --- | --- |
| `pool=30.0` timeout | Correct. `None` caused infinite blocking. 30s is generous enough for normal operation, fast enough to fail-and-retry. |
| `httpx.Limits(max_connections=50, keepalive_expiry=30.0)` | Correct. `keepalive_expiry=30.0` < nginx `keepalive_timeout=75s` prevents CLOSE_WAIT accumulation. Reducing from 100→50 is sensible for SDK clients. |
| `response.close()` before `retry.sleep()` | Critical fix. Holding the connection slot during Retry-After sleep is the secondary deadlock vector. The `isinstance` guard for `httpx.HTTPError` is correct. |
| `max_retries` context manager gets same Limits | Correct: without this, `with client.max_retries():` would silently revert to unsafe defaults. |
| `reset_http_session()` | Good escape hatch for long-running workflows. Correctly preserves proxy, verify, headers, and resets `_401_has_retried`. |

[QUAL-F2] DRY concern — Limits repeated 4 times

Severity: Minor

httpx.Limits(max_connections=50, max_keepalive_connections=10, keepalive_expiry=30.0) appears in 4 places:

__init__ (line 220)
reset_http_session (line 304)
max_retries (line 2068)
Comments reference it too

Consider extracting to a class-level constant:

_DEFAULT_POOL_LIMITS = httpx.Limits(
    max_connections=50,
    max_keepalive_connections=10,
    keepalive_expiry=30.0,
)

Not a blocker — just prevents future drift if values need tuning.

[TEST-F1] Comprehensive test coverage ✅

34 tests covering:

Pool timeout value and propagation
Transport limits (max_connections, keepalive_expiry, max_keepalive_connections)
max_retries transport replacement and restoration
reset_http_session (new session, limits, close, 401 flag, proxy)
response.close() call ordering before retry.sleep()
isinstance guard for exception retries
Integration: concurrent requests, 429 retry without PoolTimeout

[SEC-F1] No security concerns ✅

No credentials in code/logs. reset_http_session correctly rebuilds auth headers from existing client state, not from new inputs.

Strengths

Root cause analysis with live process inspection is gold-standard debugging
Fix addresses both the symptom (no pool timeout) and the structural cause (CLOSE_WAIT accumulation)
response.close() before retry.sleep() is a subtle but critical fix — most people would miss this
Tests verify call ordering, not just outcomes

ankitpatnaik-atlan and others added 4 commits April 23, 2026 20:31

GOVFOUN-408: Fix connection pool issue

cca466b

GOVFOUN-408: Added tests

e4e9dfe

claude Bot reviewed Apr 24, 2026


ankitpatnaik-atlan changed the title ~~Govfoun 408~~ GOVFOUN-408 | prevent PoolTimeout cascade caused by holding connection slots during retry sleep Apr 24, 2026

Aryamanz29 approved these changes Apr 27, 2026


Aryamanz29 self-assigned this Apr 27, 2026

Aryamanz29 added the bugfix Bug fix pull request label Apr 27, 2026

ankitpatnaik-atlan merged commit 9334dfd into main Apr 27, 2026
23 of 33 checks passed

This was referenced Apr 27, 2026

Revert "GOVFOUN-408 | prevent PoolTimeout cascade caused by holding connection slots during retry sleep" #912

Closed

refactor: address review comments for GOVFOUN-408 pool timeout fix #913

Merged

[release] Bumped to release 9.6.0 #914

Merged

ankitpatnaik-atlan merged 4 commits into main from GOVFOUN-408
