Skip to content

Fix ambiguous errors when statements are cancelled or connection closed#1291

Merged
gopalldb merged 4 commits into
databricks:mainfrom
gopalldb:fix/cancel-error-handling
Apr 27, 2026
Merged

Fix ambiguous errors when statements are cancelled or connection closed#1291
gopalldb merged 4 commits into
databricks:mainfrom
gopalldb:fix/cancel-error-handling

Conversation

@gopalldb
Copy link
Copy Markdown
Collaborator

Summary

  • When a running query is cancelled (via Statement.cancel() or Connection.close()), the driver returned ambiguous errors like "error: [null]" or generic "Statement execution failed" indistinguishable from real connection errors
  • Thrift: Server returns CANCELED_STATE with OK_STATUS and null errorMessage. The driver didn't detect CANCELED_STATE as a terminal state, causing downstream failures with null data
  • SEA: CANCELED state fell through to handleFailedExecution which threw EXECUTE_STATEMENT_FAILED — same error code as real failures
  • Connection.close() called statement.close() without cancelling in-flight queries first, causing running statements to fail with socket errors

Changes

  1. Thrift path (DatabricksThriftAccessor): Detect CANCELED_STATE in polling and throw DatabricksSQLException with message "Statement [id] was cancelled", SQL state HY008, error code EXECUTE_STATEMENT_CANCELLED
  2. SEA path (DatabricksSdkClient): Detect CANCELED state in handleFailedExecution and throw with same clear signal
  3. Connection close (DatabricksConnection): Best-effort cancel() on in-flight statements before closing
  4. Null error message (ExecutionStatus): Default to "Statement was cancelled" when state is ABORTED but error message is null
  5. SQL state constant: Add OPERATION_CANCELLED_SQLSTATE = "HY008" (standard SQL state for operation cancelled)

Customer impact

Customers can now programmatically distinguish cancellations from actual errors by checking:

  • SQL state: HY008 = cancellation
  • Error code: EXECUTE_STATEMENT_CANCELLED
  • Message: "Statement [id] was cancelled"

Test plan

  • Full jdbc-core test suite: 3162 tests pass, 0 failures
  • Manual verification: cancel a long-running query and verify the error message/SQL state

This pull request was AI-assisted by Isaac.

gopalldb and others added 3 commits March 19, 2026 23:24
…ion is closed

When a running query is cancelled (via Statement.cancel() or
Connection.close()), the driver returned ambiguous errors like
"error: [null]" or generic "Statement execution failed" that were
indistinguishable from real connection errors.

Root cause: The Thrift server returns CANCELED_STATE with OK_STATUS and
null errorMessage. The driver didn't detect CANCELED_STATE as a terminal
state, causing it to silently continue and fail downstream with null data.

Fixes:
- Detect CANCELED_STATE in Thrift polling path and throw with clear
  "Statement was cancelled" message, SQL state HY008, and
  EXECUTE_STATEMENT_CANCELLED error code
- Detect CANCELED state in SEA handleFailedExecution and throw with
  same clear signal instead of generic EXECUTE_STATEMENT_FAILED
- Cancel in-flight statements in Connection.close() before closing them,
  so running queries get a cancellation signal rather than socket errors
- Default to "Statement was cancelled" message in ExecutionStatus when
  state is ABORTED but error message is null

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
- Add Thrift test: CANCELED_STATE in getStatementResult throws HY008
- Add SEA tests: CANCELED state throws HY008, FAILED state does not
- Add ExecutionStatus tests: null error on cancelled state gets
  meaningful message, null error on other states stays null
- Add CANCELED_STATE detection in getStatementResult (not just polling)
- Revert best-effort cancel in connection.close (separate concern)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
…ction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
@github-actions
Copy link
Copy Markdown

This PR has been marked as Stale because it has been open for 30 days with no activity. If you would like the PR to remain open, please remove the stale label or comment on the PR.

@github-actions github-actions Bot added the Stale label Apr 20, 2026
@github-actions
Copy link
Copy Markdown

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions Bot closed this Apr 27, 2026
@gopalldb gopalldb reopened this Apr 27, 2026
Co-authored-by: Isaac
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
@gopalldb gopalldb enabled auto-merge (squash) April 27, 2026 10:29
@msrathore-db
Copy link
Copy Markdown
Collaborator

msrathore-db commented Apr 27, 2026

[F1] PR description and first commit promise a Connection.close() change that was reverted — High

The Summary (bullet 4) and Changes (item 3) say:

Connection.close() called statement.close() without cancelling in-flight queries first, causing running statements to fail with socket errors
Connection close (DatabricksConnection): Best-effort cancel() on in-flight statements before closing.

The first commit message (87749855) makes the same claim. But commit 5a5605fd ("Add tests for cancel detection, revert connection.close cancel") removed that change, and the diff has zero edits to DatabricksConnection.java (close() at line 415-427 still calls statement.close(false) without first cancelling).

Net: customers/reviewers reading the description will believe the socket-error-on-close case is fixed; it isn't. Operationally this is also the most expensive class of leak the PR was framed around — orphaned server-side queries continue executing on the warehouse until server-side timeout.

Suggested fix: either re-add the in-flight cancel in DatabricksConnection.close() (iterate statementSet, defensive cancel(), swallow exceptions, then close(false)), or update the PR title, summary, and first commit message to drop the close() claim and explicitly mark it deferred.

throw new DatabricksSQLException(
cancelMessage,
OPERATION_CANCELLED_SQLSTATE,
DatabricksDriverErrorCode.EXECUTE_STATEMENT_CANCELLED);
Copy link
Copy Markdown
Collaborator

@msrathore-db msrathore-db Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F2] EXECUTE_STATEMENT_CANCELLED is silently discarded — getErrorCode() returns 0 — High

The PR's customer-impact section promises:

Customers can now programmatically distinguish cancellations from actual errors by checking: ... Error code: EXECUTE_STATEMENT_CANCELLED

But the 3-arg DatabricksSQLException(String reason, String sqlState, DatabricksDriverErrorCode internalError) constructor (DatabricksSQLException.java:52-56) is:

public DatabricksSQLException(
    String reason, String sqlState, DatabricksDriverErrorCode internalError) {
  super(reason, sqlState);                       // ← internalError NOT passed to vendor code
  logTelemetryEvent(sqlState, reason, false);    // ← internalError NOT logged either
}

The internalError parameter is purely advisory — super(reason, sqlState) does not set SQLException's vendor code, and logTelemetryEvent only takes sqlState/reason. So e.getErrorCode() on a cancellation exception returns 0, and EXECUTE_STATEMENT_CANCELLED never appears anywhere observable to the customer. None of the new tests assert on getErrorCode(), hiding the gap.

Suggested fix: either (a) fix the constructor to use super(reason, sqlState, vendorCode) with a stable mapping for DatabricksDriverErrorCode, then add assertEquals(...) on getErrorCode() in the new tests; or (b) drop the error-code claim from the PR description and rely solely on HY008 + the message string.

if (statementState == StatementState.CANCELED) {
String cancelMessage = String.format("Statement [%s] was cancelled", statementId);
LOGGER.info(cancelMessage);
throw new DatabricksSQLException(
Copy link
Copy Markdown
Collaborator

@msrathore-db msrathore-db Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F3] Cancellation emits ERROR-level telemetry — defeats the PR's "distinguishability" goal — High

This throw (and the symmetric one in DatabricksThriftAccessor.java:859) goes through DatabricksSQLException(reason, sqlState, internalError), which calls:

// DatabricksSQLException.java:52-56
super(reason, sqlState);
logTelemetryEvent(sqlState, reason, false);   // silentExceptions = false (hard-coded)

logTelemetryEvent with silentExceptions=false unconditionally calls exportFailureLog(..., TelemetryLogLevel.ERROR) (DatabricksSQLException.java:73-79). So every user-initiated Statement.cancel() and every Ctrl-C from a BI tool emits an ERROR-level telemetry event with sqlState=HY008.

Cancellations are common in normal BI workloads (Tableau/Looker speculatively cancel; users close tabs); a customer batch-cancelling 10k queries will look like a 10k-error spike to driver-side monitoring. The PR's goal — making cancellations distinguishable from real failures — is undone by sending them through the same ERROR alert path as real failures.

Suggested fix: introduce a 3-arg silentExceptions=true overload (or directly downgrade to INFO/TRACE inside logTelemetryEvent when sqlState == OPERATION_CANCELLED_SQLSTATE), and use it for both cancellation throw sites. Bonus: this also helps F4-style idempotency (concurrent pollers each observing CANCELED would no longer multi-fire ERROR telemetry).

response = getOperationStatus(request, statementId);
TOperationState operationState = response.getOperationState();
if (operationState == TOperationState.CANCELED_STATE) {
throw cancelledStatementException(statementId.toSQLExecStatementId());
Copy link
Copy Markdown
Collaborator

@msrathore-db msrathore-db Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F4] Asymmetric fix — SEA DatabricksSdkClient.getStatementResult does NOT detect CANCELED — High

This Thrift fix is correct, but the SEA equivalent at DatabricksSdkClient.java:395-423 was not updated:

public DatabricksResultSet getStatementResult(...) throws SQLException {
  ...
  response = apiClient.execute(req, GetStatementResponse.class);
  ...
  return new DatabricksResultSet(
      response.getStatus(),
      typedStatementId,
      response.getResult(),   // ← null when state is CANCELED
      ...);
}

A user who calls executeStatementAsync (SEA), then Statement.cancel(), then polls via getStatementResult gets a DatabricksResultSet wrapping a CANCELED status with null result data — exactly the original bug, on the SEA async code path. The PR's "programmatically distinguish cancellations" promise fails for SEA async users.

Suggested fix: after the apiClient.execute(...) line, add a CANCELED check using the same HY008 / EXECUTE_STATEMENT_CANCELLED contract used in handleFailedExecution at line 698-705. Ideally, factor cancelledStatementException(statementId) into a shared helper (today the construction is duplicated inline in handleFailedExecution and as a private method in DatabricksThriftAccessor).


if (statusResp.isSetOperationState()
&& statusResp.getOperationState() == TOperationState.CANCELED_STATE) {
throw cancelledStatementException(statementId);
Copy link
Copy Markdown
Collaborator

@msrathore-db msrathore-db Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F5] New CANCELED branch in checkOperationStatusForErrors is not exercised by any test — High

This branch is on the synchronous polling path — it is reached from pollTillOperationFinished (this file, lines 295 and 315) and is the actual code path for the "real customer" cancellation scenario: query is RUNNING, user calls Statement.cancel(), the next poll returns CANCELED_STATE.

The PR's only Thrift test (testGetStatementResult_cancelled_throwsWithHY008) calls accessor.getStatementResult(...) and exercises the standalone branch at line 407-409 — not this branch. Result: the polling-path CANCELED detection has zero coverage, and a regression in this block (e.g., a future refactor reordering the checks against isErrorStatusCode at line 805) would ship undetected.

Suggested fix: add a test where thriftClient.GetOperationStatus(...) returns RUNNING_STATE then CANCELED_STATE on consecutive calls (precedent: testGetStatementResult_pending already uses multi-call mocks), driving executeStatement and asserting HY008. This single test covers both the polling-loop path and this new branch.

@gopalldb gopalldb merged commit 7592f65 into databricks:main Apr 27, 2026
25 checks passed
gopalldb added a commit that referenced this pull request Apr 27, 2026
…th (#1425)

## Summary
Follow-up to #1291 addressing review feedback:

- **F2**: `getErrorCode()` now returns
`DatabricksDriverErrorCode.EXECUTE_STATEMENT_CANCELLED.ordinal()`
instead of 0. The 3-arg `DatabricksSQLException` constructor was
silently discarding the error code.
- **F3**: Cancellation exceptions use `silentExceptions=true` so BI-tool
cancellations (Tableau, Looker) don't emit ERROR-level telemetry.
Cancellations are logged at TRACE level instead.
- **F4**: SEA `getStatementResult()` now detects CANCELED state and
throws `HY008` — previously the async cancel path returned a
`DatabricksResultSet` wrapping null result data.
- **F5**: Added polling-path test (RUNNING → CANCELED during
`pollTillOperationFinished`) for Thrift, which had zero coverage.

## Test plan
- [x]
`DatabricksSdkClientTest#testHandleFailedExecution_CancelledState_ThrowsWithHY008`
— asserts HY008, message, and error code
- [x]
`DatabricksSdkClientTest#testHandleFailedExecution_FailedState_ThrowsWithoutHY008`
— unchanged
- [x]
`DatabricksSdkClientTest#testGetStatementResult_CancelledState_ThrowsWithHY008`
— **new** (F4)
- [x]
`DatabricksThriftAccessorTest#testGetStatementResult_cancelled_throwsWithHY008`
— now asserts error code
- [x]
`DatabricksThriftAccessorTest#testPollingPath_cancelledDuringExecution_throwsWithHY008`
— **new** (F5)

NO_CHANGELOG=true

This pull request was AI-assisted by Isaac.

---------

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants