Skip to content

Conversation

@samikshya-db
Copy link
Collaborator

@samikshya-db samikshya-db commented Nov 12, 2025

Description

  • Circuit breaker did not come into play when 429s were coming back to us from the server. This PR fixes this.
  • Internal PostMortem doc : Link
  • why was it not working before? Because telemetry errors were all silenced, no errors were thrown hence CB was never triggered
  • Improve logLevels

Testing

  • Added unit tests to replicate the scenario

Additional Notes to the Reviewer

@samikshya-db samikshya-db changed the title [PECOBLR-1191] Fix circuit breaker for 429 errors [PECOBLR-1191] Fix broken circuit breaker Nov 12, 2025
@samikshya-db samikshya-db marked this pull request as ready for review November 20, 2025 11:27
if (connectionContext == null
|| telemetryDetails == null
|| logLevel.toInt() < connectionContext.getTelemetryLogLevel().toInt()) {
|| logLevel.toInt() > connectionContext.getTelemetryLogLevel().toInt()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

* Exception indicating telemetry export failures. This intentionally does NOT emit telemetry on
* construction to avoid recursive logging loops.
*/
public class DatabricksTelemetryException extends RuntimeException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this unchecked exception?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any random exception should be caught and thrown so that circuit breaker catches it.

assertEquals(CircuitBreaker.State.CLOSED, circuitBreakerClient.getCircuitBreakerState());
}

@Test
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to add integration tests? So, that we can also verify this in end to end scenario

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These would require response mocking (similar to the issue we are facing with retry tests) - we can take it up later if required. I can create a ticket for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, create a tracking ticket

LOGGER.trace(
"Failed to push telemetry logs with error response: {}", response.getStatusLine());
return;
if (connectionContext.isTelemetryCircuitBreakerEnabled()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we change log level for not able to push telemetry? Trace may be too low. We are anyways not blocking critical flow, but at least logging level can be higher than trace or debug

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We initially had error & debug, customer started complaining about verbose logs from telemetry : we changed it to trace : ideally we want to have telemetry as silent as possible. Let me know if that makes sense though.

// 2. When telemetry details are not set.
// 3. When the telemetry logLevel configured in the connection context is higher than the
// logLevel for the event.
// 3. When the event's log level is more verbose than the configured telemetry log level.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct 😅? usually more log levels have lower number in frameworks. For example, log4j

event log level: DEBUG (10000)
connectionContext.getTelemetryLogLevel().toInt(): INFO (20000)

As per comment, since event log level is more verbose, we want to return. But the check 10000 > 20000 won't be true so no return.

Copy link
Collaborator Author

@samikshya-db samikshya-db Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the numbering in this codebase is opposite to log4j's convention Actually it is the same convention, in case of telemetry, trace has a debug of 5 and info has 4 !😅

Even in log4j : it is opposite to what you have cited.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants