ZkInterruptedException Causes Inconsistent Segment State in ZooKeeper for segment uploads #16866

@AlexanderKM

Problem Summary

During segment uploads, we are hitting a race condition that leaves a segment in an inconsistent state across different ZooKeeper locations. The segment is uploaded and added to the IdealState, but a ZkInterruptedException from the Helix client occurs during processing. The exception handling then deletes the segment metadata from the PROPERTYSTORE location while leaving the segment entry in the IdealState.

This results in segments that appear assigned to servers but have no metadata in ZooKeeper, which makes them unavailable for queries and ultimately leads to query failures.
To remediate this, we have had to manually delete these segments (to remove them from the IdealState) and re-ingest them later.
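
A minimal sketch of how the affected segments could be identified, assuming the segment names from the IdealState and from the PROPERTYSTORE have already been read into two sets. The class and method names are hypothetical and are not part of Pinot or Helix; this only illustrates the set difference that defines the inconsistent state.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not a Pinot or Helix API.
class OrphanSegmentSketch {

  // Returns segments that are present in the IdealState but have no
  // metadata entry in the PROPERTYSTORE (sorted for stable output).
  static Set<String> findSegmentsMissingMetadata(Set<String> idealStateSegments,
      Set<String> propertyStoreSegments) {
    Set<String> missing = new TreeSet<>(idealStateSegments);
    missing.removeAll(propertyStoreSegments);
    return missing;
  }

  public static void main(String[] args) {
    // Assumed inputs: names already fetched from ZooKeeper by some other means.
    Set<String> idealState = new HashSet<>(Arrays.asList("seg_0", "seg_1", "seg_2"));
    Set<String> propertyStore = new HashSet<>(Arrays.asList("seg_0", "seg_1"));
    System.out.println(findSegmentsMissingMetadata(idealState, propertyStore));
  }
}
```

Any segment this returns is one that would need the manual delete and re-ingest described above.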

Example Failure Timeline

  1. Segment upload succeeds - Segment is uploaded to deep store
  2. Segment assignment proceeds - Segment is assigned to server instances
  3. IdealState update begins - Segment is added to the IdealState in ZooKeeper
  4. ZkInterruptedException occurs - ZooKeeper client interruption happens during processing
  5. Metadata cleanup triggered - Controller deletes segment metadata from PROPERTYSTORE
  6. Inconsistent state remains - Segment exists in IdealState but metadata is missing from PROPERTYSTORE

Root Cause

The issue occurs in the segment upload flow when a ZkInterruptedException is thrown during IdealState updates. The exception handling logic deletes the segment metadata from the PROPERTYSTORE as a cleanup measure, but the segment entry may already have been committed to the IdealState. Because the ZooKeeper client was interrupted, we do not know whether the segment was actually written to the IdealState or not.
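
The failure mode can be reproduced in a small in-memory model. This is a sketch under assumptions, not Pinot code: the set and map stand in for the IdealState and the PROPERTYSTORE, `InterruptedWriteException` stands in for ZkInterruptedException, and the ordering (metadata written first, then the IdealState update, then cleanup on any failure) is inferred from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model; none of these names come from Pinot or Helix.
class UploadRaceSketch {

  // Stand-in for the Helix ZkInterruptedException.
  static class InterruptedWriteException extends RuntimeException {
    InterruptedWriteException(String message) {
      super(message);
    }
  }

  final Set<String> idealState = new HashSet<>();
  final Map<String, String> propertyStore = new HashMap<>();

  // Simulates a write that is committed, after which the client is
  // interrupted before it can observe the acknowledgement.
  void updateIdealState(String segment, boolean interruptAfterCommit) {
    idealState.add(segment);
    if (interruptAfterCommit) {
      throw new InterruptedWriteException("interrupted while waiting for ack");
    }
  }

  // Assumed flow: write metadata, update IdealState, and delete the
  // metadata again on any failure, without checking the IdealState.
  boolean upload(String segment, boolean interruptAfterCommit) {
    propertyStore.put(segment, "metadata");
    try {
      updateIdealState(segment, interruptAfterCommit);
      return true;
    } catch (RuntimeException e) {
      propertyStore.remove(segment);
      return false;
    }
  }

  public static void main(String[] args) {
    UploadRaceSketch sketch = new UploadRaceSketch();
    boolean ok = sketch.upload("seg_0", true);
    System.out.println("upload succeeded: " + ok);
    System.out.println("in IdealState: " + sketch.idealState.contains("seg_0"));
    System.out.println("in property store: " + sketch.propertyStore.containsKey("seg_0"));
  }
}
```

In this model the upload reports failure, yet the segment remains in the IdealState with no metadata, which matches the inconsistent state in the timeline.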

Key code locations:

Notes:

  • This scenario falls outside the standard BaseRetryPolicy and its associated AttemptsExceededException (used inside the IdealStateGroupCommit code). When that exception is hit, everything fails cleanly: the AttemptsExceededException signifies that the IdealState truly could not be updated, so the standard metadata cleanup is sufficient, and the error is propagated back to the caller who initiated the segment upload. The error scenario described above does not go through this flow.

Proposed Fix

We are using a patch internally that essentially catches the ZkInterruptedException specifically, and in turn deletes the segment by calling pinotHelixResourceManager.deleteSegment(tableNameWithType, segmentName) in that scenario, in this area: https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/api/upload/ZKOperator.java#L449-L453

Essentially doing something like:

    try {
      _pinotHelixResourceManager.assignSegment(tableConfig, segmentZKMetadata);
    } catch (Exception e) {
      // assignTableSegment removes the zk entry.
      // Call deleteSegment to remove the segment from permanent location if needed.
      if (containsException(e, ZkInterruptedException.class)) {
        LOGGER.warn(
            "Caught ZkInterruptedException while assigning segment: {} to table: {}. "
                + "Deleting segment to prevent inconsistent state.",
            segmentName, tableNameWithType);

        PinotResourceManagerResponse response =
            _pinotHelixResourceManager.deleteSegment(tableNameWithType, segmentName);
  ...

The error is then propagated back to the caller via the controller, signaling that the segment upload failed and that an attempt was made to delete the segment. This approach is working well for us in production today, and has saved us from having to manually delete segments whose metadata is missing from the property store.
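
The snippet above relies on a `containsException` helper that is not shown in this issue. One plausible shape for it, as a sketch only, is a walk over the exception's cause chain. The class name and the `StandInZkInterruptedException` type are hypothetical stand-ins so that the example compiles without Helix on the classpath.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical sketch; the helper used in the internal patch is not shown in the issue.
class CauseChainSketch {

  // Stand-in for the Helix ZkInterruptedException.
  static class StandInZkInterruptedException extends RuntimeException {
    StandInZkInterruptedException(String message) {
      super(message);
    }
  }

  // Returns true if the throwable, or any throwable in its cause chain,
  // is an instance of the given type.
  static boolean containsException(Throwable t, Class<? extends Throwable> type) {
    // Track visited throwables by identity so a cyclic cause chain cannot loop forever.
    Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
      if (type.isInstance(cur)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Throwable wrapped = new RuntimeException("assign failed",
        new IllegalStateException("retry failed",
            new StandInZkInterruptedException("interrupted")));
    System.out.println(containsException(wrapped, StandInZkInterruptedException.class));
    System.out.println(containsException(new RuntimeException("other"),
        StandInZkInterruptedException.class));
  }
}
```

Checking the whole chain matters here because the interruption may be wrapped by one or more outer exceptions before it reaches the catch block.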
