Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Internal] Circuit Breaker: Adds Code to Implement Per Partition Circuit Breaker #5023

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

kundadebdatta
Copy link
Member

@kundadebdatta kundadebdatta commented Feb 15, 2025

Pull Request Template

Description

The idea of having a per partition circuit breaker (aka PPCB) is to optimize a) read availability , in a single master account and b) read + write availability in a multi master account, during a time when a specific partition in one of the regions is experiencing an outage/ quorum loss. This feature is independent of the partition level failover triggered by the backend. The per partition circuit breaker is developed behind a feature flag AZURE_COSMOS_PARTITION_LEVEL_CIRCUIT_BREAKER_ENABLED.

However, when the partition level failover is enabled, we will enable the PPCB by default so that the reads can benefits from it.

Scope

  • For single master, only the read requests will use the circuit breaker to add the pk-range to region override mapping, and use this mapping as a source of truth to route the read requests.

  • For multi master, both the read and write requests will use the circuit breaker to add the pk-range to region override mapping and use this mapping as a source of truth to route both read and write requests.

Understanding the Configurations exposed by the environment variables:

  • AZURE_COSMOS_PARTITION_LEVEL_CIRCUIT_BREAKER_ENABLED: This environment variable is used to enable/ disable the partition level circuit breaker feature. The default value is false.

  • AZURE_COSMOS_PPCB_STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS: This environment variable is used to set the background periodic address refresh task interval. The default value for this interval is 60 seconds.

  • AZURE_COSMOS_PPCB_ALLOWED_PARTITION_UNAVAILABILITY_DURATION_IN_SECONDS: This environment variable is used to set the partition unavailability time duration in seconds. The unavailability time indicates how long a partition can remain unhealthy, before it can re-validate it's connection status. The default value for this property is 5 seconds.

  • AZURE_COSMOS_PPCB_CONSECUTIVE_FAILURE_COUNT: This environment variable is used to set the consecutive failure count for reads, before triggering per partition level circuit breaker flow. The default value for this flag is 10 consecutive requests within 1 min window.

Understanding the Working Principle:

On a high level, there are three parts of the circuit breaker implementation:

  • Short Circuit and Failover detection: The failover detection logic will reside in the SDK ClientRetryPolicy, just like we have for PPAF. Ideally the detection logic is based on the below two principles:

    • Status Codes: The status codes that are indicative of partition level circuit breaker would be the following: a) 503 Service Unavailable, b) 408 Request Timeout, c) cancellation token expired.

    • Threshold: Once the failover condition is met, the SDK will look for some consecutive failures, until it hits a particular threshold. Once this threshold is met, the SDK will fail over the read requests to the next preferred region for that offending partition. For example, if the threshold value for read requests is 10, then the SDK will look for 10 consecutive failures. If the threshold is met/ exceeded, the SDK will add the region failover information for that partition.

  • Failover a faulty partition to the next preferred region: Once the failover conditions are met, the ClientRetryPolicy will trigger a partition level override using GlobalPartitionEndpointManagerCore.TryMarkEndpointUnavailableForPartitionKeyRange to the next region in the preferred region list. This failover information will help the current, as well as the subsequent requests (reads in single master and both reads and writes in multi master) to route the request to the next region.

  • Failback the faulty partition to it's original first preferred region: With PPAF enabled, ideally the write requests will rely on 403.3 (Write Forbidden) signal to fail the partition back to the primary write region. However, this is not true for reads. That means SDK doesn’t have a definitive signal to identify when to initiate a failback for read requests.
    Hence, the idea is to create a background task during the time of read failover, which will keep track of the pk-range and region mapping. The task will periodically fetch the address from the gateway address cache for those pk ranges in the faulty region, and it will try to initiate Rntbd connection to all 4 replicas of that partition. The RNTBD open connection attempt will be made similar to that of the replica validation flow. The life cycle of the background task will get initiated during a failover and will remain until the SDK is disposed.
    If the attempt to make the connection to all 4 replicas is successful, then the task will remove/ override the entry with the primary region, resulting the SDK to failback the read requests.

Type of change

  • New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #4981

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

@kundadebdatta kundadebdatta changed the title [Internal] PPCB: Implement Per Partition Circuit Breaker [Internal] Circuit Breaker: Adds Code to Implement Per Partition Circuit Breaker Feb 17, 2025
@kundadebdatta kundadebdatta self-assigned this Feb 19, 2025
/// A <see cref="Lazy{T}"/> instance of <see cref="ConcurrentDictionary{K,V}"/> containing the partition key range to failover info mapping.
/// This mapping is primarily used for reads in a single master account, and both reads and writes in a multi master account.
/// </summary>
private readonly Lazy<ConcurrentDictionary<PartitionKeyRange, PartitionKeyRangeFailoverInfo>> PartitionKeyRangeToLocationForReadAndWrite = new (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a method to access information from these dictionaries could be beneficial for work done in the logs in the future. Not blocking for this PR just a note for the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Enables automation to merge PRs PerPartitionAutomaticFailover
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Per Partition Automatic Failover] - Add Circuit Breaker to Improve Read Availability.
3 participants