Fix telemetry HTTP client socket leak preventing CRaC checkpoint #1333
gopalldb wants to merge 1 commit
Conversation
… (databricks#1325)

Root cause: After `Connection.close()`, delayed telemetry flush tasks could re-create TELEMETRY HTTP clients and TelemetryClients that were never closed, leaking TCP sockets and preventing CRaC checkpoint.

Two race conditions fixed:

1. `TelemetryClientFactory`: `getTelemetryClient()` after `closeTelemetryClient()` re-created an orphaned `TelemetryClient`. Fixed by tracking closed connection UUIDs and returning `NoopTelemetryClient`, and by reordering `closeTelemetryClient()` to export pending collector events before closing the client.
2. `DatabricksHttpClientFactory`: `getClient(ctx, TELEMETRY)` after `removeClient(ctx)` re-created a leaked `DatabricksHttpClient` via `computeIfAbsent`. Fixed by adding `closeConnection()` which permanently marks a connection as closed, causing `getClient()` to return null.

Co-authored-by: Isaac
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
```java
LOGGER.debug(
    "Rejecting getClient() for closed connection {} with type {}",
    context.getConnectionUuid(),
    type);
```
[F1] TOCTOU race — the PR does not actually close the race it claims to fix (High)
The guard reads `closedConnections.contains(uuid)` then falls through to `instances.computeIfAbsent(...)`. These are two separate operations on two separate collections with no mutual exclusion. Interleaving:
- Thread A (delayed `TelemetryPushTask` on the 10-thread pool): evaluates `closedConnections.contains(uuid)` → false.
- Thread B (`DatabricksConnection.close()`): runs `closeConnection()` — adds UUID, wipes `instances`.
- Thread A resumes `computeIfAbsent` → creates a new `DatabricksHttpClient(TELEMETRY)` → leaks a TCP socket.
This is literally the race the RCA describes on the old code. The new code narrows but does not close it.
Fix — fuse guard and map op atomically:

```java
String uuid = context.getConnectionUuid();
return instances.compute(
    getClientKey(uuid, type),
    (k, existing) -> {
      if (existing != null) return existing;
      if (uuid != null && closedConnections.contains(uuid)) return null;
      return new DatabricksHttpClient(context, type);
    });
```

Apply the same pattern in `TelemetryClientFactory.getTelemetryClient()`.
Flagged independently by 4 reviewers (language, security, performance, devils-advocate).
```java
 * Tracks connection UUIDs for which removeClient() has been called. Prevents getClient() from
 * re-creating HTTP clients for closed connections via computeIfAbsent. Without this guard,
 * delayed TelemetryPushTask executions can create orphaned HTTP clients that leak TCP sockets.
 * See GitHub issue #1325.
```
[F2] Unbounded heap growth — PR trades a bounded socket leak for an unbounded heap leak (High)
`closedConnections` is a process-wide `ConcurrentHashMap.newKeySet()` that is append-only for the JVM lifetime — no TTL, no cap, no eviction, no removal path. `TelemetryClientFactory.closedConnectionUuids` has the same shape and clears only in `reset()` (test-only).

Per-entry cost ≈ 120 B × 2 sets ≈ ~240 B per closed connection. 100 closes/sec (realistic for HikariCP under load) → ~24 KB/s, i.e. roughly 2 GB/day. This hits exactly the long-lived CRaC workloads this PR targets.
Fix options (in preference order; a sketch of option 1 follows this list):

- Bound via a Caffeine `Cache` with `expireAfterWrite(5-10 minutes)` — the race window is seconds, not JVM lifetime.
- Flip to an allowlist: insert UUID in `getClient` on first use, remove in `closeConnection`. Bounded to live connections and solves F1's TOCTOU for free.
- Encode the tombstone in the existing `instances` map.
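A minimal sketch of the Caffeine-backed option, assuming the Caffeine dependency is available and keeping the field name from the diff; the `Boolean` value is a placeholder since Caffeine has no set type:

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Bounded tombstone set: entries expire once the delayed-flush race window has passed.
private final Cache<String, Boolean> closedConnections =
    Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(10))
        .maximumSize(10_000) // hard cap as a second line of defense
        .build();

// In closeConnection():  closedConnections.put(uuid, Boolean.TRUE);
// In the getClient() guard:  closedConnections.getIfPresent(uuid) != null
```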
Flagged by 6 reviewers (performance, maintainability, security, ops, devils-advocate, architecture).
```java
// getClient(ctx, TELEMETRY) after removeClient(ctx) has already run.
String connectionUuid = context.getConnectionUuid();
if (connectionUuid != null && closedConnections.contains(connectionUuid)) {
  LOGGER.debug(
```
[F3] getClient() silently widened to nullable return — 7 unverified callers (High)
Pre-PR getClient() was non-null. Post-PR it can return null. Only TelemetryPushClient.pushEvent() was updated. Verified call sites that store or dereference without a null check:
| File:line | Usage | Null-safe? |
|---|---|---|
| `DatabricksThriftAccessor.java:662` | passed to `DatabricksHttpTTransport` ctor | No |
| `DatabricksTokenFederationProvider.java:74` | stored in `this.hc` | No |
| `AzureMSICredentialProvider.java:43` | stored in `this.httpClient` | No |
| `PrivateKeyClientCredentialProvider.java:27` | stored | No |
| `DatabricksDriverFeatureFlagsContext.java:92` | runs on a background scheduler; `execute()` caught by `catch(Exception)` at TRACE → silent failure of FF refresh | No |
| `ArrowStreamResult.java:53, 153` | ctor param → NPE later in chunk download | No |
No @Nullable annotation despite the codebase using javax.annotation.Nullable elsewhere (SqlParameter.java:10, DatabricksSession.java:102). Contract change is invisible to callers.
Fix (preferred): return a rejecting sentinel `IDatabricksHttpClient` whose `execute()` throws `ConnectionClosedException` (symmetric to `NoopTelemetryClient`). All callers stay safe by default; a sketch follows.

Alternative: split into `getClient()` (throws) + `getClientOrNull()` (used only by `TelemetryPushClient`), or add `@Nullable` + Javadoc and null-check every caller.
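A minimal sketch of the sentinel, using a hypothetical single-method shape for `IDatabricksHttpClient` (the driver's real interface declares more methods; implement them all the same way):

```java
import org.apache.http.ConnectionClosedException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpUriRequest;

// Hypothetical single-method shape of the driver's interface, for illustration only.
interface IDatabricksHttpClient {
  CloseableHttpResponse execute(HttpUriRequest request) throws Exception;
}

/** Sentinel returned instead of null: callers fail loudly and uniformly. */
final class ClosedConnectionHttpClient implements IDatabricksHttpClient {
  static final IDatabricksHttpClient INSTANCE = new ClosedConnectionHttpClient();

  private ClosedConnectionHttpClient() {}

  @Override
  public CloseableHttpResponse execute(HttpUriRequest request) throws ConnectionClosedException {
    // Unlike NoopTelemetryClient, an HTTP call that silently does nothing would
    // be wrong for data-path callers, so reject instead of no-op.
    throw new ConnectionClosedException(
        "getClient() called after closeConnection(); this connection's HTTP clients are closed");
  }
}
```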
Flagged by 6 reviewers.
```java
/**
 * Permanently closes all HTTP clients for the given connection and prevents new ones from being
 * created. This should be called from DatabricksConnection.close() to prevent delayed
```
[F5] closeConnection() NPEs on null UUID — inconsistent with getClient()'s null guard (High)
`ConcurrentHashMap.newKeySet()` rejects null elements — `closedConnections.add(null)` throws NPE. Yet `getClient()` in the same file explicitly null-guards (`if (connectionUuid != null && closedConnections.contains(connectionUuid))`), implying nulls are considered possible. The two paths disagree on the invariant.

Same issue in `TelemetryClientFactory.closeTelemetryClient` at the `closedConnectionUuids.add(connectionUuid)` calls inside the `computeIfPresent` lambdas — no null-guard.
Fix: Pick one invariant. Either (a) assert non-null in `closeConnection`/`closeTelemetryClient` and document the constraint on `IDatabricksConnectionContext.getConnectionUuid()`, or (b) mirror the `getClient` guard:

```java
String uuid = context.getConnectionUuid();
if (uuid != null) closedConnections.add(uuid);
```

Flagged by 2 reviewers (architecture, language).
```java
 * flushed through the existing client. See GitHub issue #1325.
 */
public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
  String key = TelemetryHelper.keyOf(connectionContext);
```
[F6] Reordered close is not fail-safe — one exception skips ALL cleanup and re-opens the entire leak (High)
The new ordering runs exportAllPendingTelemetryDetails() BEFORE the two computeIfPresent blocks that (a) mark UUID closed, (b) close the client, (c) remove the holder.
If exportAllPendingTelemetryDetails() or the exportTelemetryLog it calls throws any unchecked exception, execution exits closeTelemetryClient and propagates up to DatabricksConnection.close(). Consequences:
- `TelemetryClient` stays in the holder map with its periodic flush scheduler alive
- `closedConnectionUuids` never updated → future `getTelemetryClient()` won't return Noop
- `DatabricksHttpClientFactory.closeConnection()` on the next line of `DatabricksConnection.close()` never runs
One export-time exception silently re-opens the entire leak the PR set out to fix. The old ordering was actually more robust (cleanup ran first, export was a trailer).
Fix (a sketch follows this list):

- Mark UUIDs closed unconditionally up-front (before export).
- Run holder cleanup.
- Run `exportAllPendingTelemetryDetails` last, wrapped in `try/catch` that logs-and-swallows.
- In `DatabricksConnection.close()`, wrap the cleanup chain in try/finally so `closeConnection(ctx)` always runs even if earlier steps throw.
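A minimal sketch of the fail-safe ordering, assuming a `telemetryClients` holder map, an `ITelemetryClient` with a `close()` method, and the method names visible in the diff (real signatures may differ):

```java
public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
  // 1. Tombstone first: nothing below can re-open the leak even if it throws.
  String uuid = connectionContext.getConnectionUuid();
  if (uuid != null) {
    closedConnectionUuids.add(uuid);
  }

  // 2. Holder cleanup: removal returns the client so the export below cannot
  //    route back through the factory (also addresses F9).
  ITelemetryClient client = telemetryClients.remove(TelemetryHelper.keyOf(connectionContext));

  // 3. Export is a best-effort trailer; a failure must not skip the close.
  try {
    exportAllPendingTelemetryDetails(connectionContext);
  } catch (RuntimeException e) {
    LOGGER.warn("Pending telemetry export failed during close; continuing cleanup", e);
  } finally {
    if (client != null) {
      client.close();
    }
  }
}
```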
Flagged by ops reviewer.
| "LEAK BUG (issue #1325): getTelemetryClient() after closeTelemetryClient() " | ||
| + "created a new TelemetryClient instead of returning NoopTelemetryClient. " | ||
| + "This orphaned client will never be closed, and its periodic flush creates " | ||
| + "TELEMETRY HTTP clients that leak TCP sockets."); |
[F4] Tests are single-threaded; do not reproduce the cross-thread race (High)
The RCA describes a race between main thread (DatabricksConnection.close()) and pool thread (delayed TelemetryPushTask.run() on the 10-thread telemetryExecutorService). All three tests run sequentially on the JUnit main thread — no CountDownLatch/CyclicBarrier/real executor/parallel stress.
- Test #1 (`testGetTelemetryClientAfterCloseReCreatesClient`) is tautological: `closedConnectionUuids.add(uuid)` runs inside `closeTelemetryClient` first, so a sequential `getTelemetryClient` call trivially returns `NoopTelemetryClient`. It cannot fail against a partial revert of the fix.
- Test #3 mocks `TelemetryHelper.exportTelemetryLog` with a lambda that calls `getTelemetryClient(ctx)`. But the real `exportTelemetryLog` reads from `DatabricksThreadContextHolder.getConnectionContext()` — a thread-local that `Connection.close()` clears. The real re-creation vector is `TelemetryPushClient.pushEvent()` capturing `ctx` at ctor time (verified in source at `TelemetryPushClient.java:47-49`) — which no test covers.
Fix — add a real cross-thread test:

```java
@Test
void testRaceBetweenCloseAndDelayedPushTask() throws Exception {
  CountDownLatch blockInPush = new CountDownLatch(1);
  CountDownLatch closeDone = new CountDownLatch(1);
  // Spy push client blocks inside pushEvent until main thread calls closeConnection
  // Submit flush task on real pool
  // Main thread: await blockInPush entered, then closeConnection(ctx)
  // Release latch; assert instances map has no TELEMETRY entry for uuid after join
}
```

Delete or rewrite Tests #1 and #3 so they would fail against a partial revert of the fix.
Flagged by 2 reviewers (test, devils-advocate).
```java
  }
}

/**
```
[F7] No @VisibleForTesting reset() on DatabricksHttpClientFactory — test pollution (Medium)
TelemetryClientFactory.reset() (in this PR at the hunk around line 243) clears closedConnectionUuids. DatabricksHttpClientFactory has no equivalent. It's a JVM singleton and closedConnections is an instance field.
Consequence: any test that calls closeConnection(ctx) permanently adds UUID to closedConnections for the rest of the test JVM. A later test reusing the same UUID gets null from getClient() with no symptom except "my test fails mysteriously when run after another test."
Fix: Add `@VisibleForTesting void reset()` on `DatabricksHttpClientFactory` that clears both `instances` and `closedConnections`. Call in `@BeforeEach`/`@AfterEach` of `TelemetryHttpClientLeakTest`. A sketch follows.
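A minimal sketch, assuming `instances` and `closedConnections` are the factory's two registries:

```java
@VisibleForTesting
void reset() {
  // Drop both registries so each test starts from a clean singleton state.
  // If DatabricksHttpClient holds live resources, close each instance first.
  instances.clear();
  closedConnections.clear();
}
```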
```java
        k -> new DatabricksHttpClient(context, type));
}

/**
```
[F8] removeClient(ctx) is a public footgun with zero production callers (Medium)
After this PR, the single-argument `removeClient(IDatabricksConnectionContext)` overload has exactly one caller in the repo: `DatabricksHttpClientTest.java:240` (a test). All production call sites were migrated to `closeConnection()` by this PR.
The RCA explicitly identifies removeClient's "allows re-creation via computeIfAbsent" behavior as the root cause of #1325. Shipping two nearly-identically-named public methods where one silently re-enables the bug this PR fixes is a future-maintainer trap — the name "removeClient" reads as the obvious choice on IDE autocomplete.
Fix: Delete the single-argument overload and migrate the test; or mark it `@VisibleForTesting` / `@Deprecated` with a pointer to `closeConnection`. At minimum, rename to signal intent (e.g., `evictClientsForReconnect`).
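If the overload is kept, a deprecation sketch (the Javadoc pointer is the important part; the body stays whatever the current overload does):

```java
/**
 * @deprecated Evicted clients can be silently re-created via computeIfAbsent,
 *     the root cause of issue #1325. Use {@link #closeConnection} when tearing
 *     down a connection permanently.
 */
@Deprecated
@VisibleForTesting
public void removeClient(IDatabricksConnectionContext context) {
  // existing eviction logic unchanged
}
```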
Flagged by 4 reviewers (maintainability, architecture, devils-advocate, agent-compat).
```java
 *
 * <p>The connection UUID is added to closedConnectionUuids FIRST to prevent getTelemetryClient()
 * from re-creating a TelemetryClient during or after the close sequence. Pending
 * TelemetryCollector events are exported BEFORE the TelemetryClient is closed, so they are
```
[F9] Reordered close has a concurrency regression under concurrent close of same UUID (Medium)
New ordering: (1) removeCollector, (2) exportAllPendingTelemetryDetails() which calls getTelemetryClient(ctx) through the factory, (3) computeIfPresent removes holder and marks UUID closed.
Single-threaded: fine. Under concurrent close of the same UUID (legal via the public API):

- Thread A starts step 2's export loop.
- Thread B completes step 3 first (removes holder, marks UUID closed).
- Thread A's next `getTelemetryClient(ctx)` now returns `NoopTelemetryClient` → events silently dropped.
- Or a concurrent re-open races in between, and step 2's `getTelemetryClient` creates a new `TelemetryClient` (the exact original bug).
Old ordering was order-insensitive because holder-remove+close was atomic under computeIfPresent.
Fix: Mark UUID closed BEFORE export, and export via a directly-held `TelemetryClient` reference (either look it up from the holder map first, or have the removal return the client). Don't route the export back through the factory; the sketch under F6 shows this shape.
Flagged by architecture reviewer.
```diff
 DatabricksClientConfiguratorManager.getInstance().removeInstance(connectionContext);
 DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext);
-DatabricksHttpClientFactory.getInstance().removeClient(connectionContext);
+DatabricksHttpClientFactory.getInstance().closeConnection(connectionContext);
```
[F10] No regression test for the Connection.close() post-condition (Medium)
This one-line change alters observable behavior: after connection.close(), DatabricksHttpClientFactory.getInstance().getClient(ctx, TELEMETRY) now returns null (previously always returned a new client). Telemetry pushes submitted during close are now silently dropped.
No existing test was modified; grep shows no test asserts the new post-close invariant. The "3085 tests pass" claim only proves nothing else broke, not that the new behavior is covered.
Fix: Add one focused assertion in the existing DatabricksConnection test class:

```java
@Test
void getClientReturnsNullAfterConnectionClose() throws Exception {
  DatabricksConnection conn = ...;
  conn.close();
  assertNull(
      DatabricksHttpClientFactory.getInstance()
          .getClient(conn.getConnectionContext(), HttpClientType.TELEMETRY));
}
```

Flagged by test reviewer.
```java
        .getClient(connectionContext, HttpClientType.TELEMETRY);
if (httpClient == null) {
  // Connection was closed — HTTP client factory rejected the request to prevent socket leaks.
  LOGGER.debug("Skipping telemetry push: connection has been closed");
```
[F12] Rejection log at DEBUG; no metric for dropped telemetry (Medium)
Two invisibility paths added by this PR:
- `DatabricksHttpClientFactory.getClient` — rejection logged at `LOGGER.debug("Rejecting getClient() for closed connection {} with type {}")`. Below default production log level. If a non-TELEMETRY caller hits this path due to a future bug or code change, operators never see it.
- `TelemetryPushClient.pushEvent` on null — `LOGGER.debug("Skipping telemetry push: connection has been closed")`. In aggressive close scenarios (shutdown, failover, pool eviction), every dropped push is invisible. No counter. Operators cannot answer "is telemetry working?"
Fix (a sketch follows this list):

- Raise the `getClient` rejection log to WARN for non-TELEMETRY types (preserves DEBUG for the expected path).
- Add an `AtomicLong droppedOnClosedCount` on the factory; emit a periodic aggregated WARN when > 0, or expose via JMX/existing metrics.
- Include `Thread.currentThread().getName()` in the rejection log for diagnosis.
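A sketch of the severity split plus counter, assuming the factory's `LOGGER` is an SLF4J-style parameterized logger; the helper method name is hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;

private final AtomicLong droppedOnClosedCount = new AtomicLong();

// Called from getClient() where the closedConnections guard currently logs at DEBUG.
private void logRejectedGetClient(IDatabricksConnectionContext context, HttpClientType type) {
  long dropped = droppedOnClosedCount.incrementAndGet();
  if (type == HttpClientType.TELEMETRY) {
    // Expected path: a delayed telemetry flush racing with connection close.
    LOGGER.debug(
        "Rejecting getClient() for closed connection {} with type {} (thread: {}, dropped so far: {})",
        context.getConnectionUuid(),
        type,
        Thread.currentThread().getName(),
        dropped);
  } else {
    // Unexpected: a non-telemetry caller hit a closed connection. Make it visible.
    LOGGER.warn(
        "Rejecting getClient() for closed connection {} with NON-TELEMETRY type {} (thread: {})",
        context.getConnectionUuid(),
        type,
        Thread.currentThread().getName());
  }
}
```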
Flagged by 3 reviewers (ops, agent-compat, maintainability).
```java
 * here so that subsequent getTelemetryClient() calls (e.g., from delayed flush tasks or collector
 * exports) return NoopTelemetryClient instead of re-creating an orphaned TelemetryClient. This
 * prevents the socket leak described in GitHub issue #1325.
 */
```
[F13] Duplicated registry pattern across two factories (Medium)
`closedConnectionUuids` here and `closedConnections` in `DatabricksHttpClientFactory` are the same concept implemented twice with different names. Both are `ConcurrentHashMap.newKeySet()` of connection UUIDs, both checked before `compute*` in the same way, both added on close. Neither is bounded.

Fix: Extract a `ClosedConnectionRegistry` helper in `com.databricks.jdbc.common` with `markClosed(uuid)` / `isClosed(uuid)` / `clear()`. Back it with a bounded cache (Caffeine, e.g. `expireAfterWrite(5 min)` or max ~10K entries LRU). Both factories hold a reference (or share a singleton). Solves the duplication AND F2's memory growth in one change. A sketch follows.
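A minimal sketch of the shared registry, assuming the Caffeine dependency; the class, package, and method names are the reviewer's proposal, not existing driver API:

```java
package com.databricks.jdbc.common;

import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

/** Shared, bounded registry of recently closed connection UUIDs. */
public final class ClosedConnectionRegistry {
  private static final ClosedConnectionRegistry INSTANCE = new ClosedConnectionRegistry();

  // Tombstones only need to outlive the delayed-flush race window (seconds),
  // so a short TTL plus a hard cap keeps heap usage bounded.
  private final Cache<String, Boolean> closed =
      Caffeine.newBuilder()
          .expireAfterWrite(Duration.ofMinutes(5))
          .maximumSize(10_000)
          .build();

  private ClosedConnectionRegistry() {}

  public static ClosedConnectionRegistry getInstance() {
    return INSTANCE;
  }

  public void markClosed(String uuid) {
    if (uuid != null) { // tolerate null UUIDs uniformly (see F5)
      closed.put(uuid, Boolean.TRUE);
    }
  }

  public boolean isClosed(String uuid) {
    return uuid != null && closed.getIfPresent(uuid) != null;
  }

  /** Test-only hook; mirrors TelemetryClientFactory.reset(). */
  public void clear() {
    closed.invalidateAll();
  }
}
```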
Flagged by 3 reviewers (maintainability, agent-compat, ops).
```
@@ -31,17 +40,42 @@ public IDatabricksHttpClient getClient(IDatabricksConnectionContext context) {

public IDatabricksHttpClient getClient(
```
[F16] @Nullable annotation + irreversibility are undocumented (Low/Medium)
The codebase already uses javax.annotation.Nullable (e.g., SqlParameter.java:10, DatabricksSession.java:102,110, IDatabricksSession.java:21,24). This PR silently widens getClient to a nullable return without annotation or Javadoc update. Agents and humans adding future callers will omit the null check — see F3 for the consequence.
Also document that closeConnection is irreversible — the UUID is blacklisted for the lifetime of the factory. The field comment says "permanently closed" but the public method Javadoc only says "permanently closes all HTTP clients" — reads as "closes them once," not "blacklists UUID forever."
Fix:

```java
/**
 * @return the HTTP client, or {@code null} if {@link #closeConnection} has been
 *     called for this context. Callers MUST null-check when the call can
 *     race with connection close (e.g., async tasks).
 */
@Nullable
public IDatabricksHttpClient getClient(
    IDatabricksConnectionContext context, HttpClientType type) { ... }
```

Add to `closeConnection` Javadoc: "This is irreversible — the UUID is blacklisted for the lifetime of this factory. Do not call until you're certain the context will never be used again." Same on `TelemetryClientFactory.closeTelemetryClient`.
```
@@ -0,0 +1,208 @@
# RCA: Leaked Socket Prevents CRaC Checkpointing (Issue #1325)
```
[F17] RCA doc in docs/ is unprecedented and point-in-time (Low)
docs/ currently contains only enduring reference material (LOGGING.md, TESTING.md, JDBC_METHOD_INVENTORY.md, JDBC_SPEC_COVERAGE_ANALYSIS.md, features/). This adds a 208-line root-cause analysis for a single bug.
Issues:

- Sets precedent without an established convention (no `docs/rca/README.md` policy, no numbering scheme).
- Hardcodes line numbers ("line 421", "line 172") that will rot on the next edit to `DatabricksConnection.close()`.
- Content is point-in-time and will never be updated.
- Duplicates what should live in the PR description / JIRA / issue #1325.
Fix: Delete this file and fold the content into the PR description + issue #1325 comments. If in-tree RCAs are desired, establish a convention first (e.g., docs/rca/NNN-title.md + a README.md policy).
Flagged by 3 reviewers (maintainability, agent-compat, language).
```java
 */
@ExtendWith(MockitoExtension.class)
@MockitoSettings(strictness = Strictness.LENIENT)
public class TelemetryHttpClientLeakTest {
```
[F20] Test hygiene: Strictness.LENIENT, unused stub, empty catch (Low)
- `@MockitoSettings(strictness = Strictness.LENIENT)` silently masks unused-stub bugs — combined with broad `any()` matchers, tests can pass while stubbing the wrong method. Remove `LENIENT` and fix any resulting unused-stub errors.
- `when(ctx.getHost()).thenReturn(host)` is unused; the production `keyOf` path uses `getHostForOAuth`. Would fail under strict Mockito. Remove.
- The empty catch around `when(ctx.getHostUrl()).thenReturn(...)` silently swallows checked exceptions during setup. Use `doReturn("https://" + host).when(ctx).getHostUrl()` to bypass the checked exception cleanly.

A sketch of the cleaned-up setup follows this list.
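A sketch of the strict-stubs setup; the host constant is hypothetical and the mock is the driver's `IDatabricksConnectionContext`:

```java
import static org.mockito.Mockito.doReturn;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;
import org.mockito.junit.jupiter.MockitoSettings;
import org.mockito.quality.Strictness;

@ExtendWith(MockitoExtension.class)
@MockitoSettings(strictness = Strictness.STRICT_STUBS) // fail loudly on unused stubs
public class TelemetryHttpClientLeakTest {

  private static final String HOST = "test.databricks.com"; // hypothetical value

  @Mock private IDatabricksConnectionContext ctx;

  // Declaring `throws Exception` keeps the checked exception on getHostUrl()
  // off the stubbing line; no empty catch needed.
  @BeforeEach
  void setUp() throws Exception {
    doReturn("https://" + HOST).when(ctx).getHostUrl();
    // No stub for getHost(): the production keyOf path uses getHostForOAuth,
    // and STRICT_STUBS would reject the unused stubbing anyway.
  }
}
```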
## Code Review Squad Report

**Merge Safety Score: 10/100 — CRITICAL RISK**

### Executive Summary

Targets a real cross-thread race from issue #1325, and the defense-in-depth intent is sound. But the implementation has fundamental issues that defeat the fix.

Recommend blocking merge until F1, F2, F3, F4, F5, F6 are addressed.

### Findings by Severity

**High (blocker)** — posted as inline comments

**Medium** — posted as inline comments

**Low** — summarized here (not posted inline to reduce noise)

### Verified but No Action

### Suggested Minimum Path to Merge
Review generated by Code Review Squad (9 parallel specialized reviewers). Feedback welcome in #code-review-squad-feedback. |
## Summary

Fixes #1325 (follow-up to #1233). After `Connection.close()`, delayed telemetry flush tasks could re-create TELEMETRY HTTP clients that were never closed, leaking TCP sockets and preventing CRaC checkpoint.

Two cross-thread race conditions fixed:

1. **TelemetryClient re-creation:** `getTelemetryClient()` after `closeTelemetryClient()` created an orphaned `TelemetryClient` with a periodic flush scheduler that nobody closes. Fixed by tracking closed connection UUIDs and returning `NoopTelemetryClient`, and by reordering `closeTelemetryClient()` to export pending collector events before closing the client.
2. **HTTP client re-creation:** `getClient(ctx, TELEMETRY)` after `removeClient(ctx)` re-created a `DatabricksHttpClient` via `computeIfAbsent` that nobody closes. Fixed by adding `closeConnection()` which permanently marks a connection as closed, causing `getClient()` to return `null`.

## Files changed

- `TelemetryClientFactory.java`: `closedConnectionUuids` guard, reordered close sequence
- `DatabricksHttpClientFactory.java`: `closedConnections` guard, new `closeConnection()` method
- `DatabricksConnection.java`: `closeConnection()` instead of `removeClient()`
- `TelemetryPushClient.java`: null-check on the `getClient()` return value
- `TelemetryHttpClientLeakTest.java`: new regression tests
- `RCA_SOCKET_LEAK_TELEMETRY_HTTP_CLIENT.md`: new RCA document

## Test results

3085 tests pass.

## Test plan

- `TelemetryHttpClientLeakTest#testGetTelemetryClientAfterCloseReCreatesClient` — verifies `NoopTelemetryClient` returned after close
- `TelemetryHttpClientLeakTest#testGetClientReturnsNullAfterCloseConnection` — verifies `null` returned from HTTP client factory after close
- `TelemetryHttpClientLeakTest#testCloseTelemetryClientWithPendingCollectorEventsReCreatesClient` — verifies pending collector events don't cause re-creation
- … (`Connection.close()`)

This pull request was AI-assisted by Isaac.