Fix ambiguous errors when statements are cancelled or connection closed#1291
Conversation
…ion is closed When a running query is cancelled (via Statement.cancel() or Connection.close()), the driver returned ambiguous errors like "error: [null]" or generic "Statement execution failed" that were indistinguishable from real connection errors. Root cause: The Thrift server returns CANCELED_STATE with OK_STATUS and null errorMessage. The driver didn't detect CANCELED_STATE as a terminal state, causing it to silently continue and fail downstream with null data. Fixes: - Detect CANCELED_STATE in Thrift polling path and throw with clear "Statement was cancelled" message, SQL state HY008, and EXECUTE_STATEMENT_CANCELLED error code - Detect CANCELED state in SEA handleFailedExecution and throw with same clear signal instead of generic EXECUTE_STATEMENT_FAILED - Cancel in-flight statements in Connection.close() before closing them, so running queries get a cancellation signal rather than socket errors - Default to "Statement was cancelled" message in ExecutionStatus when state is ABORTED but error message is null Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
- Add Thrift test: CANCELED_STATE in getStatementResult throws HY008 - Add SEA tests: CANCELED state throws HY008, FAILED state does not - Add ExecutionStatus tests: null error on cancelled state gets meaningful message, null error on other states stays null - Add CANCELED_STATE detection in getStatementResult (not just polling) - Revert best-effort cancel in connection.close (separate concern) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
…ction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
|
This PR has been marked as Stale because it has been open for 30 days with no activity. If you would like the PR to remain open, please remove the stale label or comment on the PR. |
|
This PR was closed because it has been inactive for 7 days since being marked as stale. |
Co-authored-by: Isaac Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
|
[F1] PR description and first commit promise a The Summary (bullet 4) and Changes (item 3) say:
The first commit message ( Net: customers/reviewers reading the description will believe the socket-error-on-close case is fixed; it isn't. Operationally this is also the most expensive class of leak the PR was framed around — orphaned server-side queries continue executing on the warehouse until server-side timeout. Suggested fix: either re-add the in-flight cancel in |
| throw new DatabricksSQLException( | ||
| cancelMessage, | ||
| OPERATION_CANCELLED_SQLSTATE, | ||
| DatabricksDriverErrorCode.EXECUTE_STATEMENT_CANCELLED); |
There was a problem hiding this comment.
[F2] EXECUTE_STATEMENT_CANCELLED is silently discarded — getErrorCode() returns 0 — High
The PR's customer-impact section promises:
Customers can now programmatically distinguish cancellations from actual errors by checking: ... Error code:
EXECUTE_STATEMENT_CANCELLED
But the 3-arg DatabricksSQLException(String reason, String sqlState, DatabricksDriverErrorCode internalError) constructor (DatabricksSQLException.java:52-56) is:
public DatabricksSQLException(
String reason, String sqlState, DatabricksDriverErrorCode internalError) {
super(reason, sqlState); // ← internalError NOT passed to vendor code
logTelemetryEvent(sqlState, reason, false); // ← internalError NOT logged either
}The internalError parameter is purely advisory — super(reason, sqlState) does not set SQLException's vendor code, and logTelemetryEvent only takes sqlState/reason. So e.getErrorCode() on a cancellation exception returns 0, and EXECUTE_STATEMENT_CANCELLED never appears anywhere observable to the customer. None of the new tests assert on getErrorCode(), hiding the gap.
Suggested fix: either (a) fix the constructor to use super(reason, sqlState, vendorCode) with a stable mapping for DatabricksDriverErrorCode, then add assertEquals(...) on getErrorCode() in the new tests; or (b) drop the error-code claim from the PR description and rely solely on HY008 + the message string.
| if (statementState == StatementState.CANCELED) { | ||
| String cancelMessage = String.format("Statement [%s] was cancelled", statementId); | ||
| LOGGER.info(cancelMessage); | ||
| throw new DatabricksSQLException( |
There was a problem hiding this comment.
[F3] Cancellation emits ERROR-level telemetry — defeats the PR's "distinguishability" goal — High
This throw (and the symmetric one in DatabricksThriftAccessor.java:859) goes through DatabricksSQLException(reason, sqlState, internalError), which calls:
// DatabricksSQLException.java:52-56
super(reason, sqlState);
logTelemetryEvent(sqlState, reason, false); // silentExceptions = false (hard-coded)logTelemetryEvent with silentExceptions=false unconditionally calls exportFailureLog(..., TelemetryLogLevel.ERROR) (DatabricksSQLException.java:73-79). So every user-initiated Statement.cancel() and every Ctrl-C from a BI tool emits an ERROR-level telemetry event with sqlState=HY008.
Cancellations are common in normal BI workloads (Tableau/Looker speculatively cancel; users close tabs); a customer batch-cancelling 10k queries will look like a 10k-error spike to driver-side monitoring. The PR's goal — making cancellations distinguishable from real failures — is undone by sending them through the same ERROR alert path as real failures.
Suggested fix: introduce a 3-arg silentExceptions=true overload (or directly downgrade to INFO/TRACE inside logTelemetryEvent when sqlState == OPERATION_CANCELLED_SQLSTATE), and use it for both cancellation throw sites. Bonus: this also helps F4-style idempotency (concurrent pollers each observing CANCELED would no longer multi-fire ERROR telemetry).
| response = getOperationStatus(request, statementId); | ||
| TOperationState operationState = response.getOperationState(); | ||
| if (operationState == TOperationState.CANCELED_STATE) { | ||
| throw cancelledStatementException(statementId.toSQLExecStatementId()); |
There was a problem hiding this comment.
[F4] Asymmetric fix — SEA DatabricksSdkClient.getStatementResult does NOT detect CANCELED — High
This Thrift fix is correct, but the SEA equivalent at DatabricksSdkClient.java:395-423 was not updated:
public DatabricksResultSet getStatementResult(...) throws SQLException {
...
response = apiClient.execute(req, GetStatementResponse.class);
...
return new DatabricksResultSet(
response.getStatus(),
typedStatementId,
response.getResult(), // ← null when state is CANCELED
...);
}A user who calls executeStatementAsync (SEA), then Statement.cancel(), then polls via getStatementResult gets a DatabricksResultSet wrapping a CANCELED status with null result data — exactly the original bug, on the SEA async code path. The PR's "programmatically distinguish cancellations" promise fails for SEA async users.
Suggested fix: after the apiClient.execute(...) line, add a CANCELED check using the same HY008 / EXECUTE_STATEMENT_CANCELLED contract used in handleFailedExecution at line 698-705. Ideally, factor cancelledStatementException(statementId) into a shared helper (today the construction is duplicated inline in handleFailedExecution and as a private method in DatabricksThriftAccessor).
|
|
||
| if (statusResp.isSetOperationState() | ||
| && statusResp.getOperationState() == TOperationState.CANCELED_STATE) { | ||
| throw cancelledStatementException(statementId); |
There was a problem hiding this comment.
[F5] New CANCELED branch in checkOperationStatusForErrors is not exercised by any test — High
This branch is on the synchronous polling path — it is reached from pollTillOperationFinished (this file, lines 295 and 315) and is the actual code path for the "real customer" cancellation scenario: query is RUNNING, user calls Statement.cancel(), the next poll returns CANCELED_STATE.
The PR's only Thrift test (testGetStatementResult_cancelled_throwsWithHY008) calls accessor.getStatementResult(...) and exercises the standalone branch at line 407-409 — not this branch. Result: the polling-path CANCELED detection has zero coverage, and a regression in this block (e.g., a future refactor reordering the checks against isErrorStatusCode at line 805) would ship undetected.
Suggested fix: add a test where thriftClient.GetOperationStatus(...) returns RUNNING_STATE then CANCELED_STATE on consecutive calls (precedent: testGetStatementResult_pending already uses multi-call mocks), driving executeStatement and asserting HY008. This single test covers both the polling-loop path and this new branch.
…th (#1425) ## Summary Follow-up to #1291 addressing review feedback: - **F2**: `getErrorCode()` now returns `DatabricksDriverErrorCode.EXECUTE_STATEMENT_CANCELLED.ordinal()` instead of 0. The 3-arg `DatabricksSQLException` constructor was silently discarding the error code. - **F3**: Cancellation exceptions use `silentExceptions=true` so BI-tool cancellations (Tableau, Looker) don't emit ERROR-level telemetry. Cancellations are logged at TRACE level instead. - **F4**: SEA `getStatementResult()` now detects CANCELED state and throws `HY008` — previously the async cancel path returned a `DatabricksResultSet` wrapping null result data. - **F5**: Added polling-path test (RUNNING → CANCELED during `pollTillOperationFinished`) for Thrift, which had zero coverage. ## Test plan - [x] `DatabricksSdkClientTest#testHandleFailedExecution_CancelledState_ThrowsWithHY008` — asserts HY008, message, and error code - [x] `DatabricksSdkClientTest#testHandleFailedExecution_FailedState_ThrowsWithoutHY008` — unchanged - [x] `DatabricksSdkClientTest#testGetStatementResult_CancelledState_ThrowsWithHY008` — **new** (F4) - [x] `DatabricksThriftAccessorTest#testGetStatementResult_cancelled_throwsWithHY008` — now asserts error code - [x] `DatabricksThriftAccessorTest#testPollingPath_cancelledDuringExecution_throwsWithHY008` — **new** (F5) NO_CHANGELOG=true This pull request was AI-assisted by Isaac. --------- Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Summary
Statement.cancel()orConnection.close()), the driver returned ambiguous errors like"error: [null]"or generic"Statement execution failed"indistinguishable from real connection errorsCANCELED_STATEwithOK_STATUSand nullerrorMessage. The driver didn't detectCANCELED_STATEas a terminal state, causing downstream failures with null dataCANCELEDstate fell through tohandleFailedExecutionwhich threwEXECUTE_STATEMENT_FAILED— same error code as real failuresConnection.close()calledstatement.close()without cancelling in-flight queries first, causing running statements to fail with socket errorsChanges
DatabricksThriftAccessor): DetectCANCELED_STATEin polling and throwDatabricksSQLExceptionwith message"Statement [id] was cancelled", SQL stateHY008, error codeEXECUTE_STATEMENT_CANCELLEDDatabricksSdkClient): DetectCANCELEDstate inhandleFailedExecutionand throw with same clear signalDatabricksConnection): Best-effortcancel()on in-flight statements before closingExecutionStatus): Default to"Statement was cancelled"when state is ABORTED but error message is nullOPERATION_CANCELLED_SQLSTATE = "HY008"(standard SQL state for operation cancelled)Customer impact
Customers can now programmatically distinguish cancellations from actual errors by checking:
HY008= cancellationEXECUTE_STATEMENT_CANCELLED"Statement [id] was cancelled"Test plan
This pull request was AI-assisted by Isaac.