Description
Problem Summary
During segment upload operations, we're encountering a race condition that leaves segments in an inconsistent state across different ZooKeeper locations. A segment is successfully uploaded and added to the IdealState, but a ZkInterruptedException from the Helix client occurs during processing, which causes the segment metadata to be deleted from the PROPERTYSTORE location while leaving the segment entry in the IdealState.
This results in segments that appear assigned to servers but are missing metadata in ZooKeeper, making them unavailable for queries and ultimately leading to query failures.
To remediate this issue, we have had to manually delete these segments (to remove them from the IdealState) and re-ingest them later.
Example Failure Timeline
1. **Segment upload succeeds**: Segment is uploaded to deep store
2. **Segment assignment proceeds**: Segment is assigned to server instances
3. **IdealState update begins**: Segment is added to the IdealState in ZooKeeper
4. **ZkInterruptedException occurs**: ZooKeeper client interruption happens during processing
5. **Metadata cleanup triggered**: Controller deletes segment metadata from PROPERTYSTORE
6. **Inconsistent state remains**: Segment exists in IdealState but metadata is missing from PROPERTYSTORE
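The end state of the timeline above can be illustrated with a toy model (plain Java collections, not the real Helix/Pinot API): any segment present in the IdealState but absent from the PROPERTYSTORE is an orphan that will fail queries.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model of the inconsistent state described above. In the real system,
// the two inputs would come from the Helix IdealState and the ZooKeeper
// PROPERTYSTORE; here they are plain sets of segment names.
public class OrphanCheck {
  // Returns segments that appear in the IdealState but have no metadata entry.
  public static Set<String> findOrphanedSegments(Set<String> idealStateSegments,
                                                 Set<String> propertyStoreSegments) {
    Set<String> orphaned = new TreeSet<>(idealStateSegments);
    orphaned.removeAll(propertyStoreSegments);
    return orphaned;
  }
}
```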
Root Cause
The issue occurs in the segment upload flow when a ZkInterruptedException is thrown during IdealState updates. The exception handling logic deletes the segment metadata from the PROPERTYSTORE as a cleanup measure, but the segment entry may have already been committed to the IdealState. Because the ZooKeeper client disconnected mid-operation, we do not know whether the segment was successfully written to the IdealState or not.
Key code locations:
- Exception caught in PinotHelixResourceManager, segment metadata removed and exception re-thrown: https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java#L2588-L2598
- Cleanup logic in ZKOperator: https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/api/upload/ZKOperator.java#L447-L454
- Segment deletion is skipped because there is no longer any metadata record: https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/api/upload/ZKOperator.java#L601-L607
Notes:
- This scenario falls outside the standard BaseRetryPolicy and its associated AttemptsExceededException (used inside the IdealStateGroupCommit code). When an AttemptsExceededException is thrown, everything fails cleanly: it signifies that the IdealState truly could not be updated, so the standard metadata cleanup is sufficient, and the error is propagated back to the caller who initiated the segment upload. The error scenario described above falls outside this flow.
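The distinction in this note can be sketched as a small classifier (hypothetical code, not from Pinot; the nested `ZkInterruptedException` is a local stand-in so the sketch compiles without the Helix dependency):

```java
// Hypothetical sketch of the two failure modes described above.
// An AttemptsExceededException from the retry policy means the IdealState
// update definitively failed, so removing the PROPERTYSTORE metadata is a
// safe cleanup. A ZkInterruptedException means the client disconnected and
// the write outcome is unknown, so the segment may still sit in the
// IdealState and needs a full delete, not just metadata cleanup.
public class FailureMode {
  public enum Cleanup { DELETE_METADATA_ONLY, DELETE_SEGMENT_FULLY }

  // Local stand-in for Helix's ZkInterruptedException, used only so this
  // sketch is self-contained.
  public static class ZkInterruptedException extends RuntimeException {}

  public static Cleanup classify(Throwable t) {
    // Walk the cause chain: the interruption is often wrapped.
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      if (cur instanceof ZkInterruptedException) {
        return Cleanup.DELETE_SEGMENT_FULLY; // write outcome unknown
      }
    }
    // e.g. AttemptsExceededException: update definitively did not happen
    return Cleanup.DELETE_METADATA_ONLY;
  }
}
```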
Proposed Fix
We are using a patch internally that catches the ZkInterruptedException specifically and, in that scenario, deletes the segment by calling pinotHelixResourceManager.deleteSegment(tableNameWithType, segmentName), in this area: https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/api/upload/ZKOperator.java#L449-L453
Essentially doing something like:

```java
try {
  _pinotHelixResourceManager.assignSegment(tableConfig, segmentZKMetadata);
} catch (Exception e) {
  // assignTableSegment removes the zk entry.
  // Call deleteSegment to remove the segment from the permanent location if needed.
  if (containsException(e, ZkInterruptedException.class)) {
    LOGGER.warn("Caught ZkInterruptedException while assigning segment: {} to table: {}. "
        + "Deleting segment to prevent inconsistent state.", segmentName, tableNameWithType);
    PinotResourceManagerResponse response =
        _pinotHelixResourceManager.deleteSegment(tableNameWithType, segmentName);
    // ...
  }
}
```

The error is then propagated back to the caller via the controller, signaling that the segment upload failed and that an attempt was made to delete the segment. This approach is working well for us in production today, and has saved us from manual intervention to delete segments whose metadata is missing from the property store.