Categorize generic INVALID_STATE telemetry into specific error codes#1447

Draft
samikshya-db wants to merge 1 commit into databricks:main from samikshya-db:samikshya/telemetry-error-categorization


Summary

  • Replace the generic INVALID_STATE driver error code at ~30 call sites with more specific DatabricksDriverErrorCode values so telemetry error-name buckets are actionable. INVALID_STATE remains as a fallback only for genuinely unknown states (~6 sites).
  • Add 8 new codes (1045–1052): CURSOR_INVALID_POSITION, COLUMN_INDEX_OUT_OF_BOUNDS, ROW_INDEX_OUT_OF_BOUNDS, THRIFT_RPC_ERROR, THRIFT_RESPONSE_MISMATCH, INVALID_RESPONSE_FORMAT, THREAD_POOL_EXECUTION_ERROR, STREAM_READ_ERROR.
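  • Reuse existing codes where they already fit: INPUT_VALIDATION_ERROR (1015) for null-arg/parameter-validation sites, VOLUME_OPERATION_INVALID_STATE (1028) for DBFSVolumeClient state errors, and NOT_IMPLEMENTED_OPERATION for the not-implemented branch of getResultChunksData in DatabricksThriftServiceClient.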
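
To illustrate the shape of the change at a single call site, here is a minimal, self-contained sketch. It is not the driver's code: `SketchErrorCode`, `SketchSqlException`, and `SketchCursor` are hypothetical stand-ins, and the sketch assumes the exception exposes the error-code name as its SQLState. Only the code names themselves come from this PR.

```java
import java.sql.SQLException;

/** Hypothetical stand-in for the driver's error-code enum; names are taken from this PR. */
enum SketchErrorCode {
  INVALID_STATE, // kept only as a fallback for genuinely unknown states
  CURSOR_INVALID_POSITION,
  COLUMN_INDEX_OUT_OF_BOUNDS
}

/** Hypothetical stand-in for the driver exception; assumes SQLState carries the code name. */
class SketchSqlException extends SQLException {
  SketchSqlException(String reason, SketchErrorCode code) {
    super(reason, code.name());
  }
}

/** Toy cursor showing one specific code per failure path instead of a shared INVALID_STATE. */
final class SketchCursor {
  private final String[][] rows;
  private int position = -1; // -1 means "before the first row"

  SketchCursor(String[][] rows) {
    this.rows = rows;
  }

  /** Advances the cursor; returns false once it has moved past the last row. */
  boolean next() {
    if (position < rows.length) {
      position++;
    }
    return position < rows.length;
  }

  /** Reads a 1-based column from the current row. */
  String getString(int columnIndex) throws SQLException {
    if (position < 0 || position >= rows.length) {
      // Previously this kind of site would have reported INVALID_STATE.
      throw new SketchSqlException(
          "Cursor is not positioned on a row", SketchErrorCode.CURSOR_INVALID_POSITION);
    }
    String[] row = rows[position];
    if (columnIndex < 1 || columnIndex > row.length) {
      throw new SketchSqlException(
          "Column index " + columnIndex + " exceeds column count " + row.length,
          SketchErrorCode.COLUMN_INDEX_OUT_OF_BOUNDS);
    }
    return row[columnIndex - 1];
  }
}
```

With distinct codes, a telemetry bucket keyed on the error name separates cursor misuse from a bad column index, which a single INVALID_STATE bucket could not do.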

Why

INVALID_STATE was being used as a catch-all across very different failure paths: result-set cursor navigation, Thrift RPC failures, null-arg validation, thread-pool failures, stream I/O, and response-format errors. Telemetry buckets keyed on the error name therefore mixed unrelated failures, making it hard to tell whether a spike represented an application bug (cursor misuse), a Thrift server issue, or stream-read trouble.

Behavior change

Applications that catch DatabricksSQLException and key on getSQLState() returning "INVALID_STATE" for result-set navigation, Thrift RPC, or stream-read failures will now see the new, more specific state names instead. The change is documented in NEXT_CHANGELOG.md.
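
For application code affected by this, a hedged migration sketch follows. It uses only `java.sql.SQLException` from the JDK; the class name `SqlStateMigrationSketch`, the `classify` helper, and the grouping of state names into categories are illustrative assumptions, not part of the driver. The state strings are the code names listed in this PR.

```java
import java.sql.SQLException;

/** Hypothetical application-side helper; the categories are an illustrative grouping. */
final class SqlStateMigrationSketch {

  /** Maps a SQLState string to a coarse failure category for handling or logging. */
  static String classify(SQLException e) {
    String state = e.getSQLState();
    if (state == null) {
      return "unknown";
    }
    switch (state) {
      case "CURSOR_INVALID_POSITION":
      case "COLUMN_INDEX_OUT_OF_BOUNDS":
      case "ROW_INDEX_OUT_OF_BOUNDS":
        return "result-set navigation";
      case "THRIFT_RPC_ERROR":
      case "THRIFT_RESPONSE_MISMATCH":
        return "thrift";
      case "STREAM_READ_ERROR":
        return "stream read";
      case "INVALID_STATE":
        // Still possible: older driver versions, or the remaining fallback sites.
        return "unspecified";
      default:
        return "other";
    }
  }

  public static void main(String[] args) {
    SQLException e = new SQLException("Cursor is before first row", "CURSOR_INVALID_POSITION");
    System.out.println(classify(e));
  }
}
```

Keeping the INVALID_STATE branch lets the same handler work against both older and newer driver versions. Note that the sketch compares strings through `switch` rather than `==`, since `==` on Java strings tests reference identity, not content.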

Test plan

  • mvn spotless:apply (done locally)
  • mvn test — particularly the result-set, Thrift, and prepared-statement suites; LazyThriftInlineArrowResultTest assertions were updated to the new codes
  • Verify telemetry payloads in a dev workspace show the new error names

This pull request and its description were written by Isaac.

Replace the generic INVALID_STATE driver error code at ~30 call sites with
more specific DatabricksDriverErrorCode values so telemetry buckets are
actionable. INVALID_STATE remains as a fallback for genuinely unknown states.

New codes (1045-1052):
- CURSOR_INVALID_POSITION       result-set cursor before-first / out-of-batch
- COLUMN_INDEX_OUT_OF_BOUNDS    column index exceeds column count
- ROW_INDEX_OUT_OF_BOUNDS       row index out of bounds (ColumnarRowView)
- THRIFT_RPC_ERROR              TException wrapping in DatabricksThriftAccessor
- THRIFT_RESPONSE_MISMATCH      chunk start row offset mismatch
- INVALID_RESPONSE_FORMAT       empty/unknown response format in manifest
- THREAD_POOL_EXECUTION_ERROR   ExecutionException in parallel work
- STREAM_READ_ERROR             IOException / byte-count mismatch on InputStream

Existing-code reuse:
- INPUT_VALIDATION_ERROR (1015) for null-arg / parameter-validation sites
  in DatabricksMetadataQueryClient, DatabricksDatabaseMetaData,
  DatabricksPreparedStatement, PreparedStatementBatchExecutor
- VOLUME_OPERATION_INVALID_STATE (1028) for DBFSVolumeClient state errors
- NOT_IMPLEMENTED_OPERATION for DatabricksThriftServiceClient
  getResultChunksData not-implemented branch

Updated LazyThriftInlineArrowResultTest assertions to match the new codes.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>