Skip to content

GOVFOUN-408 | prevent PoolTimeout cascade caused by holding connection slots during retry sleep#911

Merged
ankitpatnaik-atlan merged 4 commits into
mainfrom
GOVFOUN-408
Apr 27, 2026
Merged

GOVFOUN-408 | prevent PoolTimeout cascade caused by holding connection slots during retry sleep#911
ankitpatnaik-atlan merged 4 commits into
mainfrom
GOVFOUN-408

Conversation

@ankitpatnaik-atlan
Copy link
Copy Markdown
Contributor

@ankitpatnaik-atlan ankitpatnaik-atlan commented Apr 24, 2026

✨ Description

csa-metadata-completeness was hanging indefinitely on production tenants. Root cause: the SDK's retry logic called retry.sleep(response) while the HTTP response stream was still open, keeping the httpcore connection slot occupied for the full sleep duration. Under concurrent load with 429 rate-limiting, all pool slots filled with sleeping threads — no slot was available for retries, causing a PoolTimeout cascade that stalled workflows permanently.

Complete details provided here - https://atlanhq.atlassian.net/wiki/spaces/dg/pages/1909096582/pyatlan+SDK+httpcore+Pool+Timeout+Fix

Changes

Core fix — release connection before sleeping (pyatlan/client/transport.py)

Added response.close() / await response.aclose() before retry.sleep() / await retry.asleep() in PyatlanSyncTransport._retry_operation andPyatlanAsyncTransport._retry_operation_async. The response headers are already buffered in memory at this point, so Retry-After parsing is unaffected. An isinstance(response, httpx.Response) guard protects the exception-retry path where the loop variable holds an httpx.HTTPError (no.close()).

Connection pool configuration (pyatlan/client/atlan.py)

  • httpx.Limits(max_connections=50, max_keepalive_connections=10, keepalive_expiry=30.0) — limits pool size and retires idle connections before nginx's
    75s keepalive FIN, preventing CLOSE_WAIT socket accumulation.
  • httpx.Timeout(pool=30.0) — threads now raise PoolTimeout within 30s instead of blocking on threading.Event.wait(timeout=None) forever.
  • Same limits applied consistently across session init, max_retries context manager, and the new reset_http_session().

reset_http_session() — degraded pool recovery

New public method that closes the current httpx.Client, rebuilds it with a fresh connection pool (same limits/auth/proxy config), and resets
_401_has_retried. Useful when callers detect a degraded pool and want to recover without reinitializing the full client.

How has this been tested?

Jira link: [https://linear.app/atlan-epd/issue/GOVFOUN-408/mastercardorreplace-custom-metadata-hangs-indefinitely-in-csametadata]


🧩 Type of change

Select all that apply:

  • 🚀 New feature (non-breaking change that adds functionality)
  • 🐛 Bug fix (non-breaking change that fixes an issue) — please include tests! Refer testing-toolkit 🧪
  • 🔄 Refactor (code change that neither fixes a bug nor adds a feature)
  • 🧹 Maintenance (chores, cleanup, minor improvements)
  • 💥 Breaking change (fix or feature that may break existing functionality)
  • 📦 Dependency upgrade/downgrade
  • 📚 Documentation updates

✅ How has this been tested? (e.g. screenshots, logs, workflow links)

Describe how the change was tested. Include:

  • Steps to reproduce
  • Any relevant screenshots, logs, or links to successful workflow runs
  • Details on environment/setup if applicable

📋 Checklist

  • My code follows the project’s style guidelines
  • I’ve performed a self-review of my code
  • I’ve added comments in tricky or complex areas
  • I’ve updated the documentation as needed
  • There are no new warnings from my changes
  • I’ve added tests to cover my changes
  • All new and existing tests pass locally

ankitpatnaik-atlan and others added 4 commits April 23, 2026 20:31
…tion pool

When all worker threads hit PoolTimeout simultaneously, the httpcore
connection pool enters a degraded state. Subsequent main-thread searches
retry into the same broken pool, causing a ~15-20 minute retry loop.

reset_http_session() closes and rebuilds the httpx.Client with a fresh
pool so the next retry starts clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ool slot

When a 429 (or other retryable status) triggers a retry sleep, the response
stream was left open, holding the httpcore connection slot for the full sleep
duration. With 4 threads all receiving 429 simultaneously, all 4 slots were
held for 30s, causing PoolTimeout for any queued requests.

Fix: call response.close() / response.aclose() before retry.sleep() so the
connection returns to the pool immediately. Headers are already buffered in
memory, so Retry-After parsing in retry.sleep() is unaffected.

Root cause confirmed via httpcore DEBUG logging: 429 at 11:20:01 → 30s sleep
→ PoolTimeout at 11:20:31 (exactly pool=30.0s). No stale-socket events.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread pyatlan/client/atlan.py
"""Close and rebuild the HTTP session to recover from a degraded connection pool."""
try:
self._session.close()
except Exception:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider: The bare except Exception: pass silently swallows all errors during session close. While this is intentional (to ensure we always proceed to create a new session), logging the exception at DEBUG level would help with troubleshooting:

Suggested change
except Exception:
except Exception as e:
LOGGER.debug("Failed to close old HTTP session: %s", e)

This preserves the recovery-at-all-costs behavior while leaving a trace for debugging.

Comment thread pyatlan/client/atlan.py
event_hooks={"response": [log_response]},
)
self._401_has_retried.set(False)
LOGGER.warning("HTTP session reset: new connection pool created")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: LOGGER.warning() for a normal recovery operation may be too noisy in production. Consider using LOGGER.info() since this is expected recovery behavior, not an unexpected condition.

threading.Event.wait(timeout=None). With pool=30.0 the SDK raises an
exception quickly instead.
"""
original_transport = client._session._transport
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile test: Accessing client._session._transport._transport._pool relies on internal httpx/httpcore implementation details. If httpx changes its internal structure, this test will break.

Consider adding a comment documenting this coupling, or wrapping it in a try/except with pytest.skip() if the internal structure changes:

def _get_httpcore_pool(client: AtlanClient):
    """Access the underlying httpcore pool. Relies on httpx internals; may need updating if httpx changes."""
    try:
        return client._session._transport._transport._pool
    except AttributeError:
        pytest.skip("httpx internal structure changed; update test helper")

@ankitpatnaik-atlan ankitpatnaik-atlan changed the title Govfoun 408 GOVFOUN-408 | prevent PoolTimeout cascade caused by holding connection slots during retry sleep Apr 24, 2026
Copy link
Copy Markdown
Member

@Aryamanz29 Aryamanz29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: GOVFOUN-408 — httpcore Pool Timeout Fix

Excellent PR. The RCA is one of the best I've seen — live surgery on a stuck pod with gdb/py-spy to prove the mechanism before shipping the fix. The changes are minimal, targeted, and well-tested.

Verdict: APPROVE

Findings

[QUAL-F1] All 5 changes are correct and necessary ✅

Change Assessment
pool=30.0 timeout Correct. None caused infinite blocking. 30s is generous enough for normal operation, fast enough to fail-and-retry.
httpx.Limits(max_connections=50, keepalive_expiry=30.0) Correct. keepalive_expiry=30.0 < nginx keepalive_timeout=75s prevents CLOSE_WAIT accumulation. Reducing from 100→50 is sensible for SDK clients.
response.close() before retry.sleep() Critical fix. Holding the connection slot during Retry-After sleep is the secondary deadlock vector. The isinstance guard for httpx.HTTPError is correct.
max_retries context manager gets same Limits Correct — without this, with client.max_retries(): would silently revert to unsafe defaults.
reset_http_session() Good escape hatch for long-running workflows. Correctly preserves proxy, verify, headers, and resets _401_has_retried.

[QUAL-F2] DRY concern — Limits repeated 4 times

Severity: Minor

httpx.Limits(max_connections=50, max_keepalive_connections=10, keepalive_expiry=30.0) appears in 4 places:

  • __init__ (line 220)
  • reset_http_session (line 304)
  • max_retries (line 2068)
  • Comments reference it too

Consider extracting to a class-level constant:

_DEFAULT_POOL_LIMITS = httpx.Limits(
    max_connections=50,
    max_keepalive_connections=10,
    keepalive_expiry=30.0,
)

Not a blocker — just prevents future drift if values need tuning.

[TEST-F1] Comprehensive test coverage ✅

34 tests covering:

  • Pool timeout value and propagation
  • Transport limits (max_connections, keepalive_expiry, max_keepalive_connections)
  • max_retries transport replacement and restoration
  • reset_http_session (new session, limits, close, 401 flag, proxy)
  • response.close() call ordering before retry.sleep()
  • isinstance guard for exception retries
  • Integration: concurrent requests, 429 retry without PoolTimeout

[SEC-F1] No security concerns ✅

No credentials in code/logs. reset_http_session correctly rebuilds auth headers from existing client state, not from new inputs.

Strengths

  • Root cause analysis with live process inspection is gold-standard debugging
  • Fix addresses both the symptom (no pool timeout) and the structural cause (CLOSE_WAIT accumulation)
  • response.close() before retry.sleep() is a subtle but critical fix — most people would miss this
  • Tests verify call ordering, not just outcomes

@Aryamanz29 Aryamanz29 self-assigned this Apr 27, 2026
@Aryamanz29 Aryamanz29 added the bugfix Bug fix pull request label Apr 27, 2026
@ankitpatnaik-atlan ankitpatnaik-atlan merged commit 9334dfd into main Apr 27, 2026
23 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Bug fix pull request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants