Fix telemetry HTTP client socket leak preventing CRaC checkpoint #1333
gopalldb wants to merge 1 commit
Conversation
… (databricks#1325)

Root cause: After `Connection.close()`, delayed telemetry flush tasks could re-create TELEMETRY HTTP clients and TelemetryClients that were never closed, leaking TCP sockets and preventing CRaC checkpoint.

Two race conditions fixed:

1. `TelemetryClientFactory`: `getTelemetryClient()` after `closeTelemetryClient()` re-created an orphaned `TelemetryClient`. Fixed by tracking closed connection UUIDs and returning `NoopTelemetryClient`, and by reordering `closeTelemetryClient()` to export pending collector events before closing the client.
2. `DatabricksHttpClientFactory`: `getClient(ctx, TELEMETRY)` after `removeClient(ctx)` re-created a leaked `DatabricksHttpClient` via `computeIfAbsent`. Fixed by adding `closeConnection()` which permanently marks a connection as closed, causing `getClient()` to return null.

Co-authored-by: Isaac
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
```java
LOGGER.debug(
    "Rejecting getClient() for closed connection {} with type {}",
    context.getConnectionUuid(),
    type);
```
[F1] TOCTOU race — the PR does not actually close the race it claims to fix (High)
The guard reads `closedConnections.contains(uuid)` then falls through to `instances.computeIfAbsent(...)`. These are two separate operations on two separate collections with no mutual exclusion. Interleaving:
- Thread A (delayed `TelemetryPushTask` on the 10-thread pool): evaluates `closedConnections.contains(uuid)` → false.
- Thread B (`DatabricksConnection.close()`): runs `closeConnection()` — adds UUID, wipes `instances`.
- Thread A resumes `computeIfAbsent` → creates a new `DatabricksHttpClient(TELEMETRY)` → leaks a TCP socket.
This is literally the race the RCA describes on the old code. The new code narrows but does not close it.
Fix — fuse guard and map op atomically:

```java
String uuid = context.getConnectionUuid();
return instances.compute(
    getClientKey(uuid, type),
    (k, existing) -> {
      if (existing != null) return existing;
      if (uuid != null && closedConnections.contains(uuid)) return null;
      return new DatabricksHttpClient(context, type);
    });
```

Apply the same pattern in `TelemetryClientFactory.getTelemetryClient()`.
Flagged independently by 4 reviewers (language, security, performance, devils-advocate).
```java
 * Tracks connection UUIDs for which removeClient() has been called. Prevents getClient() from
 * re-creating HTTP clients for closed connections via computeIfAbsent. Without this guard,
 * delayed TelemetryPushTask executions can create orphaned HTTP clients that leak TCP sockets.
 * See GitHub issue #1325.
```
[F2] Unbounded heap growth — PR trades a bounded socket leak for an unbounded heap leak (High)
`closedConnections` is a process-wide `ConcurrentHashMap.newKeySet()` that is append-only for the JVM lifetime — no TTL, no cap, no eviction, no removal path. `TelemetryClientFactory.closedConnectionUuids` has the same shape and clears only in `reset()` (test-only).

Per-entry cost ≈ 120 B × 2 sets ≈ ~240 B per closed connection. 100 closes/sec (realistic for HikariCP under load) → ~24 KB/s, i.e. roughly 2 GB/day. This hits exactly the long-lived CRaC workloads this PR targets.
Fix options (in preference order; a sketch of option 1 follows this list):

- Bound via a Caffeine `Cache` with `expireAfterWrite(5-10 minutes)` — the race window is seconds, not JVM lifetime.
- Flip to an allowlist: insert UUID in `getClient` on first use, remove in `closeConnection`. Bounded to live connections and solves F1's TOCTOU for free.
- Encode the tombstone in the existing `instances` map.
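A minimal sketch of the Caffeine-backed option, assuming the Caffeine dependency is available and keeping the field name from the diff; the `Boolean` value is a placeholder since Caffeine has no set type:

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Bounded tombstone set: entries expire once the delayed-flush race window has passed.
private final Cache<String, Boolean> closedConnections =
    Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(10))
        .maximumSize(10_000) // hard cap as a second line of defense
        .build();

// In closeConnection():  closedConnections.put(uuid, Boolean.TRUE);
// In the getClient() guard:  closedConnections.getIfPresent(uuid) != null
```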
Flagged by 6 reviewers (performance, maintainability, security, ops, devils-advocate, architecture).
```java
// getClient(ctx, TELEMETRY) after removeClient(ctx) has already run.
String connectionUuid = context.getConnectionUuid();
if (connectionUuid != null && closedConnections.contains(connectionUuid)) {
  LOGGER.debug(
```
[F3] getClient() silently widened to nullable return — 7 unverified callers (High)
Pre-PR getClient() was non-null. Post-PR it can return null. Only TelemetryPushClient.pushEvent() was updated. Verified call sites that store or dereference without a null check:
| File:line | Usage | Null-safe? |
|---|---|---|
| `DatabricksThriftAccessor.java:662` | passed to `DatabricksHttpTTransport` ctor | No |
| `DatabricksTokenFederationProvider.java:74` | stored in `this.hc` | No |
| `AzureMSICredentialProvider.java:43` | stored in `this.httpClient` | No |
| `PrivateKeyClientCredentialProvider.java:27` | stored | No |
| `DatabricksDriverFeatureFlagsContext.java:92` | runs on a background scheduler; `execute()` caught by `catch(Exception)` at TRACE → silent failure of FF refresh | No |
| `ArrowStreamResult.java:53, 153` | ctor param → NPE later in chunk download | No |
No @Nullable annotation despite the codebase using javax.annotation.Nullable elsewhere (SqlParameter.java:10, DatabricksSession.java:102). Contract change is invisible to callers.
Fix (preferred): return a rejecting sentinel `IDatabricksHttpClient` whose `execute()` throws `ConnectionClosedException` (symmetric to `NoopTelemetryClient`). All callers stay safe by default; a sketch follows.

Alternative: split into `getClient()` (throws) + `getClientOrNull()` (used only by `TelemetryPushClient`), or add `@Nullable` + Javadoc and null-check every caller.
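A minimal sketch of the sentinel, using a hypothetical single-method shape for `IDatabricksHttpClient` (the driver's real interface declares more methods; implement them all the same way):

```java
import org.apache.http.ConnectionClosedException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpUriRequest;

// Hypothetical single-method shape of the driver's interface, for illustration only.
interface IDatabricksHttpClient {
  CloseableHttpResponse execute(HttpUriRequest request) throws Exception;
}

/** Sentinel returned instead of null: callers fail loudly and uniformly. */
final class ClosedConnectionHttpClient implements IDatabricksHttpClient {
  static final IDatabricksHttpClient INSTANCE = new ClosedConnectionHttpClient();

  private ClosedConnectionHttpClient() {}

  @Override
  public CloseableHttpResponse execute(HttpUriRequest request) throws ConnectionClosedException {
    // Unlike NoopTelemetryClient, an HTTP call that silently does nothing would
    // be wrong for data-path callers, so reject instead of no-op.
    throw new ConnectionClosedException(
        "getClient() called after closeConnection(); this connection's HTTP clients are closed");
  }
}
```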
Flagged by 6 reviewers.
```java
/**
 * Permanently closes all HTTP clients for the given connection and prevents new ones from being
 * created. This should be called from DatabricksConnection.close() to prevent delayed
```
[F5] closeConnection() NPEs on null UUID — inconsistent with getClient()'s null guard (High)
`ConcurrentHashMap.newKeySet()` rejects null elements — `closedConnections.add(null)` throws NPE. Yet `getClient()` in the same file explicitly null-guards (`if (connectionUuid != null && closedConnections.contains(connectionUuid))`), implying nulls are considered possible. The two paths disagree on the invariant.

Same issue in `TelemetryClientFactory.closeTelemetryClient` at the `closedConnectionUuids.add(connectionUuid)` calls inside the `computeIfPresent` lambdas — no null-guard.
Fix: Pick one invariant. Either (a) assert non-null in `closeConnection`/`closeTelemetryClient` and document the constraint on `IDatabricksConnectionContext.getConnectionUuid()`, or (b) mirror the `getClient` guard:

```java
String uuid = context.getConnectionUuid();
if (uuid != null) closedConnections.add(uuid);
```

Flagged by 2 reviewers (architecture, language).
```java
 * flushed through the existing client. See GitHub issue #1325.
 */
public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
  String key = TelemetryHelper.keyOf(connectionContext);
```
[F6] Reordered close is not fail-safe — one exception skips ALL cleanup and re-opens the entire leak (High)
The new ordering runs exportAllPendingTelemetryDetails() BEFORE the two computeIfPresent blocks that (a) mark UUID closed, (b) close the client, (c) remove the holder.
If exportAllPendingTelemetryDetails() or the exportTelemetryLog it calls throws any unchecked exception, execution exits closeTelemetryClient and propagates up to DatabricksConnection.close(). Consequences:
- `TelemetryClient` stays in the holder map with its periodic flush scheduler alive
- `closedConnectionUuids` never updated → future `getTelemetryClient()` won't return Noop
- `DatabricksHttpClientFactory.closeConnection()` on the next line of `DatabricksConnection.close()` never runs
One export-time exception silently re-opens the entire leak the PR set out to fix. The old ordering was actually more robust (cleanup ran first, export was a trailer).
Fix (a sketch follows this list):

- Mark UUIDs closed unconditionally up-front (before export).
- Run holder cleanup.
- Run `exportAllPendingTelemetryDetails` last, wrapped in `try/catch` that logs-and-swallows.
- In `DatabricksConnection.close()`, wrap the cleanup chain in try/finally so `closeConnection(ctx)` always runs even if earlier steps throw.
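A minimal sketch of the fail-safe ordering, assuming a `telemetryClients` holder map, an `ITelemetryClient` with a `close()` method, and the method names visible in the diff (real signatures may differ):

```java
public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
  // 1. Tombstone first: nothing below can re-open the leak even if it throws.
  String uuid = connectionContext.getConnectionUuid();
  if (uuid != null) {
    closedConnectionUuids.add(uuid);
  }

  // 2. Holder cleanup: removal returns the client so the export below cannot
  //    route back through the factory (also addresses F9).
  ITelemetryClient client = telemetryClients.remove(TelemetryHelper.keyOf(connectionContext));

  // 3. Export is a best-effort trailer; a failure must not skip the close.
  try {
    exportAllPendingTelemetryDetails(connectionContext);
  } catch (RuntimeException e) {
    LOGGER.warn("Pending telemetry export failed during close; continuing cleanup", e);
  } finally {
    if (client != null) {
      client.close();
    }
  }
}
```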
Flagged by ops reviewer.
| "LEAK BUG (issue #1325): getTelemetryClient() after closeTelemetryClient() " | ||
| + "created a new TelemetryClient instead of returning NoopTelemetryClient. " | ||
| + "This orphaned client will never be closed, and its periodic flush creates " | ||
| + "TELEMETRY HTTP clients that leak TCP sockets."); |
[F4] Tests are single-threaded; do not reproduce the cross-thread race (High)
The RCA describes a race between main thread (DatabricksConnection.close()) and pool thread (delayed TelemetryPushTask.run() on the 10-thread telemetryExecutorService). All three tests run sequentially on the JUnit main thread — no CountDownLatch/CyclicBarrier/real executor/parallel stress.
- Test #1 (`testGetTelemetryClientAfterCloseReCreatesClient`) is tautological: `closedConnectionUuids.add(uuid)` runs inside `closeTelemetryClient` first, so a sequential `getTelemetryClient` call trivially returns `NoopTelemetryClient`. It cannot fail against a partial revert of the fix.
- Test #3 mocks `TelemetryHelper.exportTelemetryLog` with a lambda that calls `getTelemetryClient(ctx)`. But the real `exportTelemetryLog` reads from `DatabricksThreadContextHolder.getConnectionContext()` — a thread-local that `Connection.close()` clears. The real re-creation vector is `TelemetryPushClient.pushEvent()` capturing `ctx` at ctor time (verified in source at `TelemetryPushClient.java:47-49`) — which no test covers.
Fix — add a real cross-thread test:

```java
@Test
void testRaceBetweenCloseAndDelayedPushTask() throws Exception {
  CountDownLatch blockInPush = new CountDownLatch(1);
  CountDownLatch closeDone = new CountDownLatch(1);
  // Spy push client blocks inside pushEvent until main thread calls closeConnection
  // Submit flush task on real pool
  // Main thread: await blockInPush entered, then closeConnection(ctx)
  // Release latch; assert instances map has no TELEMETRY entry for uuid after join
}
```

Delete or rewrite Tests #1 and #3 so they would fail against a partial revert of the fix.
Flagged by 2 reviewers (test, devils-advocate).
```java
  }
}

/**
```
[F7] No @VisibleForTesting reset() on DatabricksHttpClientFactory — test pollution (Medium)
TelemetryClientFactory.reset() (in this PR at the hunk around line 243) clears closedConnectionUuids. DatabricksHttpClientFactory has no equivalent. It's a JVM singleton and closedConnections is an instance field.
Consequence: any test that calls closeConnection(ctx) permanently adds UUID to closedConnections for the rest of the test JVM. A later test reusing the same UUID gets null from getClient() with no symptom except "my test fails mysteriously when run after another test."
Fix: Add `@VisibleForTesting void reset()` on `DatabricksHttpClientFactory` that clears both `instances` and `closedConnections`. Call in `@BeforeEach`/`@AfterEach` of `TelemetryHttpClientLeakTest`. A sketch follows.
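A minimal sketch, assuming `instances` and `closedConnections` are the factory's two registries:

```java
@VisibleForTesting
void reset() {
  // Drop both registries so each test starts from a clean singleton state.
  // If DatabricksHttpClient holds live resources, close each instance first.
  instances.clear();
  closedConnections.clear();
}
```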
```java
        k -> new DatabricksHttpClient(context, type));
}

/**
```
[F8] removeClient(ctx) is a public footgun with zero production callers (Medium)
After this PR, the single-argument `removeClient(IDatabricksConnectionContext)` overload has exactly one caller in the repo: `DatabricksHttpClientTest.java:240` (a test). All production call sites were migrated to `closeConnection()` by this PR.
The RCA explicitly identifies removeClient's "allows re-creation via computeIfAbsent" behavior as the root cause of #1325. Shipping two nearly-identically-named public methods where one silently re-enables the bug this PR fixes is a future-maintainer trap — the name "removeClient" reads as the obvious choice on IDE autocomplete.
Fix: Delete the single-argument overload and migrate the test; or mark it `@VisibleForTesting` / `@Deprecated` with a pointer to `closeConnection`. At minimum, rename to signal intent (e.g., `evictClientsForReconnect`).
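If the overload is kept, a deprecation sketch (the Javadoc pointer is the important part; the body stays whatever the current overload does):

```java
/**
 * @deprecated Evicted clients can be silently re-created via computeIfAbsent,
 *     the root cause of issue #1325. Use {@link #closeConnection} when tearing
 *     down a connection permanently.
 */
@Deprecated
@VisibleForTesting
public void removeClient(IDatabricksConnectionContext context) {
  // existing eviction logic unchanged
}
```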
Flagged by 4 reviewers (maintainability, architecture, devils-advocate, agent-compat).
```java
 *
 * <p>The connection UUID is added to closedConnectionUuids FIRST to prevent getTelemetryClient()
 * from re-creating a TelemetryClient during or after the close sequence. Pending
 * TelemetryCollector events are exported BEFORE the TelemetryClient is closed, so they are
```
[F9] Reordered close has a concurrency regression under concurrent close of same UUID (Medium)
New ordering: (1) removeCollector, (2) exportAllPendingTelemetryDetails() which calls getTelemetryClient(ctx) through the factory, (3) computeIfPresent removes holder and marks UUID closed.
Single-threaded: fine. Under concurrent close of the same UUID (legal via the public API):

- Thread A starts step 2's export loop.
- Thread B completes step 3 first (removes holder, marks UUID closed).
- Thread A's next `getTelemetryClient(ctx)` now returns `NoopTelemetryClient` → events silently dropped.
- Or a concurrent re-open races in between, and step 2's `getTelemetryClient` creates a new `TelemetryClient` (the exact original bug).
Old ordering was order-insensitive because holder-remove+close was atomic under computeIfPresent.
Fix: Mark UUID closed BEFORE export, and export via a directly-held `TelemetryClient` reference (either look it up from the holder map first, or have the removal return the client). Don't route the export back through the factory; the sketch under F6 shows this shape.
Flagged by architecture reviewer.
```diff
 DatabricksClientConfiguratorManager.getInstance().removeInstance(connectionContext);
 DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext);
-DatabricksHttpClientFactory.getInstance().removeClient(connectionContext);
+DatabricksHttpClientFactory.getInstance().closeConnection(connectionContext);
```
[F10] No regression test for the Connection.close() post-condition (Medium)
This one-line change alters observable behavior: after connection.close(), DatabricksHttpClientFactory.getInstance().getClient(ctx, TELEMETRY) now returns null (previously always returned a new client). Telemetry pushes submitted during close are now silently dropped.
No existing test was modified; grep shows no test asserts the new post-close invariant. The "3085 tests pass" claim only proves nothing else broke, not that the new behavior is covered.
Fix: Add one focused assertion in the existing DatabricksConnection test class:

```java
@Test
void getClientReturnsNullAfterConnectionClose() throws Exception {
  DatabricksConnection conn = ...;
  conn.close();
  assertNull(
      DatabricksHttpClientFactory.getInstance()
          .getClient(conn.getConnectionContext(), HttpClientType.TELEMETRY));
}
```

Flagged by test reviewer.
```java
        .getClient(connectionContext, HttpClientType.TELEMETRY);
if (httpClient == null) {
  // Connection was closed — HTTP client factory rejected the request to prevent socket leaks.
  LOGGER.debug("Skipping telemetry push: connection has been closed");
```
[F12] Rejection log at DEBUG; no metric for dropped telemetry (Medium)
Two invisibility paths added by this PR:
- `DatabricksHttpClientFactory.getClient` — rejection logged at `LOGGER.debug("Rejecting getClient() for closed connection {} with type {}")`. Below default production log level. If a non-TELEMETRY caller hits this path due to a future bug or code change, operators never see it.
- `TelemetryPushClient.pushEvent` on null — `LOGGER.debug("Skipping telemetry push: connection has been closed")`. In aggressive close scenarios (shutdown, failover, pool eviction), every dropped push is invisible. No counter. Operators cannot answer "is telemetry working?"
Fix (a sketch follows this list):

- Raise the `getClient` rejection log to WARN for non-TELEMETRY types (preserves DEBUG for the expected path).
- Add an `AtomicLong droppedOnClosedCount` on the factory; emit a periodic aggregated WARN when > 0, or expose via JMX/existing metrics.
- Include `Thread.currentThread().getName()` in the rejection log for diagnosis.
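A sketch of the severity split plus counter, assuming the factory's `LOGGER` is an SLF4J-style parameterized logger; the helper method name is hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;

private final AtomicLong droppedOnClosedCount = new AtomicLong();

// Called from getClient() where the closedConnections guard currently logs at DEBUG.
private void logRejectedGetClient(IDatabricksConnectionContext context, HttpClientType type) {
  long dropped = droppedOnClosedCount.incrementAndGet();
  if (type == HttpClientType.TELEMETRY) {
    // Expected path: a delayed telemetry flush racing with connection close.
    LOGGER.debug(
        "Rejecting getClient() for closed connection {} with type {} (thread: {}, dropped so far: {})",
        context.getConnectionUuid(),
        type,
        Thread.currentThread().getName(),
        dropped);
  } else {
    // Unexpected: a non-telemetry caller hit a closed connection. Make it visible.
    LOGGER.warn(
        "Rejecting getClient() for closed connection {} with NON-TELEMETRY type {} (thread: {})",
        context.getConnectionUuid(),
        type,
        Thread.currentThread().getName());
  }
}
```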
Flagged by 3 reviewers (ops, agent-compat, maintainability).
```java
 * here so that subsequent getTelemetryClient() calls (e.g., from delayed flush tasks or collector
 * exports) return NoopTelemetryClient instead of re-creating an orphaned TelemetryClient. This
 * prevents the socket leak described in GitHub issue #1325.
 */
```
[F13] Duplicated registry pattern across two factories (Medium)
`closedConnectionUuids` here and `closedConnections` in `DatabricksHttpClientFactory` are the same concept implemented twice with different names. Both are `ConcurrentHashMap.newKeySet()` of connection UUIDs, both checked before `compute*` in the same way, both added on close. Neither is bounded.

Fix: Extract a `ClosedConnectionRegistry` helper in `com.databricks.jdbc.common` with `markClosed(uuid)` / `isClosed(uuid)` / `clear()`. Back it with a bounded cache (Caffeine, e.g. `expireAfterWrite(5 min)` or max ~10K entries LRU). Both factories hold a reference (or share a singleton). Solves the duplication AND F2's memory growth in one change. A sketch follows.
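A minimal sketch of the shared registry, assuming the Caffeine dependency; the class, package, and method names are the reviewer's proposal, not existing driver API:

```java
package com.databricks.jdbc.common;

import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

/** Shared, bounded registry of recently closed connection UUIDs. */
public final class ClosedConnectionRegistry {
  private static final ClosedConnectionRegistry INSTANCE = new ClosedConnectionRegistry();

  // Tombstones only need to outlive the delayed-flush race window (seconds),
  // so a short TTL plus a hard cap keeps heap usage bounded.
  private final Cache<String, Boolean> closed =
      Caffeine.newBuilder()
          .expireAfterWrite(Duration.ofMinutes(5))
          .maximumSize(10_000)
          .build();

  private ClosedConnectionRegistry() {}

  public static ClosedConnectionRegistry getInstance() {
    return INSTANCE;
  }

  public void markClosed(String uuid) {
    if (uuid != null) { // tolerate null UUIDs uniformly (see F5)
      closed.put(uuid, Boolean.TRUE);
    }
  }

  public boolean isClosed(String uuid) {
    return uuid != null && closed.getIfPresent(uuid) != null;
  }

  /** Test-only hook; mirrors TelemetryClientFactory.reset(). */
  public void clear() {
    closed.invalidateAll();
  }
}
```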
Flagged by 3 reviewers (maintainability, agent-compat, ops).
```
@@ -31,17 +40,42 @@ public IDatabricksHttpClient getClient(IDatabricksConnectionContext context) {

public IDatabricksHttpClient getClient(
```
[F16] @Nullable annotation + irreversibility are undocumented (Low/Medium)
The codebase already uses javax.annotation.Nullable (e.g., SqlParameter.java:10, DatabricksSession.java:102,110, IDatabricksSession.java:21,24). This PR silently widens getClient to a nullable return without annotation or Javadoc update. Agents and humans adding future callers will omit the null check — see F3 for the consequence.
Also document that closeConnection is irreversible — the UUID is blacklisted for the lifetime of the factory. The field comment says "permanently closed" but the public method Javadoc only says "permanently closes all HTTP clients" — reads as "closes them once," not "blacklists UUID forever."
Fix:

```java
/**
 * @return the HTTP client, or {@code null} if {@link #closeConnection} has been
 *     called for this context. Callers MUST null-check when the call can
 *     race with connection close (e.g., async tasks).
 */
@Nullable
public IDatabricksHttpClient getClient(
    IDatabricksConnectionContext context, HttpClientType type) { ... }
```

Add to `closeConnection` Javadoc: "This is irreversible — the UUID is blacklisted for the lifetime of this factory. Do not call until you're certain the context will never be used again." Same on `TelemetryClientFactory.closeTelemetryClient`.
```
@@ -0,0 +1,208 @@
# RCA: Leaked Socket Prevents CRaC Checkpointing (Issue #1325)
```
[F17] RCA doc in docs/ is unprecedented and point-in-time (Low)
docs/ currently contains only enduring reference material (LOGGING.md, TESTING.md, JDBC_METHOD_INVENTORY.md, JDBC_SPEC_COVERAGE_ANALYSIS.md, features/). This adds a 208-line root-cause analysis for a single bug.
Issues:

- Sets precedent without an established convention (no `docs/rca/README.md` policy, no numbering scheme).
- Hardcodes line numbers ("line 421", "line 172") that will rot on the next edit to `DatabricksConnection.close()`.
- Content is point-in-time and will never be updated.
- Duplicates what should live in the PR description / JIRA / issue #1325.
Fix: Delete this file and fold the content into the PR description + issue #1325 comments. If in-tree RCAs are desired, establish a convention first (e.g., docs/rca/NNN-title.md + a README.md policy).
Flagged by 3 reviewers (maintainability, agent-compat, language).
```java
 */
@ExtendWith(MockitoExtension.class)
@MockitoSettings(strictness = Strictness.LENIENT)
public class TelemetryHttpClientLeakTest {
```
[F20] Test hygiene: Strictness.LENIENT, unused stub, empty catch (Low)
- `@MockitoSettings(strictness = Strictness.LENIENT)` silently masks unused-stub bugs — combined with broad `any()` matchers, tests can pass while stubbing the wrong method. Remove `LENIENT` and fix any resulting unused-stub errors.
- `when(ctx.getHost()).thenReturn(host)` is unused; the production `keyOf` path uses `getHostForOAuth`. Would fail under strict Mockito. Remove.
- The empty catch around `when(ctx.getHostUrl()).thenReturn(...)` silently swallows checked exceptions during setup. Use `doReturn("https://" + host).when(ctx).getHostUrl()` to bypass the checked exception cleanly.

A sketch of the cleaned-up setup follows this list.
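A sketch of the strict-stubs setup; the host constant is hypothetical and the mock is the driver's `IDatabricksConnectionContext`:

```java
import static org.mockito.Mockito.doReturn;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;
import org.mockito.junit.jupiter.MockitoSettings;
import org.mockito.quality.Strictness;

@ExtendWith(MockitoExtension.class)
@MockitoSettings(strictness = Strictness.STRICT_STUBS) // fail loudly on unused stubs
public class TelemetryHttpClientLeakTest {

  private static final String HOST = "test.databricks.com"; // hypothetical value

  @Mock private IDatabricksConnectionContext ctx;

  // Declaring `throws Exception` keeps the checked exception on getHostUrl()
  // off the stubbing line; no empty catch needed.
  @BeforeEach
  void setUp() throws Exception {
    doReturn("https://" + HOST).when(ctx).getHostUrl();
    // No stub for getHost(): the production keyOf path uses getHostForOAuth,
    // and STRICT_STUBS would reject the unused stubbing anyway.
  }
}
```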
## Code Review Squad Report

**Merge Safety Score: 10/100 — CRITICAL RISK**

### Executive Summary

Targets a real cross-thread race from issue #1325, and the defense-in-depth intent is sound. But the implementation has fundamental issues that defeat the fix.

Recommend blocking merge until F1, F2, F3, F4, F5, F6 are addressed.

### Findings by Severity

**High (blocker)** — posted as inline comments

**Medium** — posted as inline comments

**Low** — summarized here (not posted inline to reduce noise)

### Verified but No Action

### Suggested Minimum Path to Merge
Review generated by Code Review Squad (9 parallel specialized reviewers). Feedback welcome in #code-review-squad-feedback. |
## Summary

Fixes #1325 (follow-up to #1233). After `Connection.close()`, delayed telemetry flush tasks could re-create TELEMETRY HTTP clients that were never closed, leaking TCP sockets and preventing CRaC checkpoint.

Two cross-thread race conditions fixed:

1. **TelemetryClient re-creation:** `getTelemetryClient()` after `closeTelemetryClient()` created an orphaned `TelemetryClient` with a periodic flush scheduler that nobody closes. Fixed by tracking closed connection UUIDs and returning `NoopTelemetryClient`, and by reordering `closeTelemetryClient()` to export pending collector events before closing the client.
2. **HTTP client re-creation:** `getClient(ctx, TELEMETRY)` after `removeClient(ctx)` re-created a `DatabricksHttpClient` via `computeIfAbsent` that nobody closes. Fixed by adding `closeConnection()` which permanently marks a connection as closed, causing `getClient()` to return `null`.

## Files changed

- `TelemetryClientFactory.java`: `closedConnectionUuids` guard, reordered close sequence
- `DatabricksHttpClientFactory.java`: `closedConnections` guard, new `closeConnection()` method
- `DatabricksConnection.java`: `closeConnection()` instead of `removeClient()`
- `TelemetryPushClient.java`: null-check on the `getClient()` return value
- `TelemetryHttpClientLeakTest.java`: new regression tests
- `RCA_SOCKET_LEAK_TELEMETRY_HTTP_CLIENT.md`: new RCA document

## Test results

3085 tests pass.

## Test plan

- `TelemetryHttpClientLeakTest#testGetTelemetryClientAfterCloseReCreatesClient` — verifies `NoopTelemetryClient` returned after close
- `TelemetryHttpClientLeakTest#testGetClientReturnsNullAfterCloseConnection` — verifies `null` returned from HTTP client factory after close
- `TelemetryHttpClientLeakTest#testCloseTelemetryClientWithPendingCollectorEventsReCreatesClient` — verifies pending collector events don't cause re-creation
- … (`Connection.close()`)

This pull request was AI-assisted by Isaac.