
Conversation

AlexanderKM
Contributor

For #16866

This PR adds a segment deletion step to the cleanup process when a new segment is uploaded and we encounter a ZkInterruptedException while updating the IdealState. Because of this exception, we cannot be certain whether the IdealState was updated successfully, and since the code already deletes the segment metadata from the property store, it is a good idea to delete it from the IdealState as well.

We have been using this patch in our deployment for some months and it has been helpful in preemptively catching this scenario so we don't need to go in and manually delete the segment ourselves.
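The cleanup described above can be sketched in plain Java, using maps as stand-ins for the property store and the IdealState (class and method names here are illustrative, not Pinot's real API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the cleanup idea: when the IdealState update fails with
// an ambiguous exception, remove the segment from both stores so neither is
// left referencing a half-added segment. All names are hypothetical.
public class SegmentCleanupSketch {
  static Map<String, String> propertyStore = new HashMap<>();
  static Map<String, String> idealState = new HashMap<>();

  static void addSegment(String segmentName, boolean simulateZkInterrupt) {
    propertyStore.put(segmentName, "metadata");
    try {
      if (simulateZkInterrupt) {
        // Stand-in for ZkInterruptedException: the write's outcome is unknown.
        throw new IllegalStateException("ZK update interrupted");
      }
      idealState.put(segmentName, "ONLINE");
    } catch (IllegalStateException e) {
      // The existing code already deletes the metadata; this patch also
      // deletes the segment from the IdealState, then rethrows to the caller.
      propertyStore.remove(segmentName);
      idealState.remove(segmentName);
      throw e;
    }
  }
}
```

The point of the pattern is that the caller sees either a fully added segment or no trace of it, never a segment present in one store but not the other.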

@codecov-commenter

codecov-commenter commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 82.45614% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.50%. Comparing base (d89dc7c) to head (8b980a6).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...ntroller/helix/core/PinotHelixResourceManager.java 81.48% 6 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16867      +/-   ##
============================================
- Coverage     63.52%   63.50%   -0.02%     
  Complexity     1410     1410              
============================================
  Files          3068     3069       +1     
  Lines        180161   180191      +30     
  Branches      27556    27561       +5     
============================================
- Hits         114441   114439       -2     
- Misses        56917    56946      +29     
- Partials       8803     8806       +3     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.48% <82.45%> (+0.02%) ⬆️
java-21 63.48% <82.45%> (-0.01%) ⬇️
temurin 63.50% <82.45%> (-0.02%) ⬇️
unittests 63.50% <82.45%> (-0.02%) ⬇️
unittests1 56.40% <0.00%> (-0.04%) ⬇️
unittests2 33.57% <82.45%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Thanks for contributing the fix!

Is there a way to identify whether an IS update is persisted? Currently we assume the update is not persisted when an exception is thrown, which is what caused the issue. That is the root problem to fix.


public class SegmentIngestionFailureException extends Exception {

  public SegmentIngestionFailureException(String message) {
Contributor

(format) Please apply Pinot Style

// Call deleteSegment to remove the segment from permanent location if needed.
LOGGER.error("Caught exception while calling assignTableSegment for adding segment: {} to table: {}", segmentName,
    tableNameWithType, e);
if (containsException(e, ZkInterruptedException.class)) {
Contributor

This should be handled within _pinotHelixResourceManager.assignSegment(), where it has the most context. Basically we should follow the contract that assignSegment() either finishes successfully, or reverts its changes and throws an exception.

@AlexanderKM
Contributor Author

I have moved the segment deletion into the PinotHelixResourceManager 👍

@AlexanderKM
Contributor Author

Is there a way to identify whether an IS update is persisted? Currently we assume the update is not persisted when an exception is thrown, which is what caused the issue. That is the root problem to fix.

When I first encountered this problem and added the fix, I was hesitant to go down this path; given this exception, I felt it was better to just clean up completely, since the code is already making the decision to delete the metadata.
But I suppose it could be nice to do something like:

catch (Exception e) {
  if (idealState contains segment) {
    exit successfully;
  } else {
    cleanup metadata;
  }
  ...
}

Let me know if you would like that in this PR or not @Jackie-Jiang
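A hedged Java rendering of that check-then-cleanup idea could look like the following (the segment set stands in for re-reading the IdealState from ZooKeeper; all names are illustrative):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: after an ambiguous update exception, decide between "the update
// actually landed" (treat as success) and "it did not" (clean up metadata).
public class PersistCheckSketch {
  // Stand-in for re-reading the IdealState's segment assignment from ZK.
  static Set<String> idealStateSegments = new HashSet<>();

  // Returns true if the segment made it into the IdealState despite the
  // exception; otherwise removes the metadata and returns false.
  static boolean handleUpdateException(String segmentName, Map<String, String> propertyStore) {
    if (idealStateSegments.contains(segmentName)) {
      return true;
    }
    propertyStore.remove(segmentName);
    return false;
  }
}
```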

@Jackie-Jiang
Contributor

We should also consider retrying for such temporary exceptions.
Clean up is fine, but what if the clean up also fails?

…with ZkInterruptedException

Signed-off-by: Alex Maniates <[email protected]>
@AlexanderKM AlexanderKM force-pushed the segment-upload-failure-cleanup branch from ff68e26 to 68d156d on September 24, 2025 at 20:09
@AlexanderKM AlexanderKM force-pushed the segment-upload-failure-cleanup branch from 68d156d to 8b980a6 on September 24, 2025 at 20:13
@AlexanderKM
Contributor Author

I took your advice and added a retry in the segment assignment flow, but only when a ZkInterruptedException is encountered. It probably doesn't make sense to retry in other scenarios, since the RetryPolicy is already attempting 20 times in "normal" cases, so this is scoped to only ZK exceptions for now. This did require a bit of refactoring, but I think it looks better in the end.

To support this, I added some unit tests for the following:

  • Successful add new segment
  • Recovering from ZkInterruptedException leads to a successful add new segment
  • Complete failure of adding new segment in case of multiple ZK exceptions or other normal exceptions like AttemptsExceeded
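A retry scoped to a single exception type, as described above, can be sketched like this (InterruptedException stands in for ZkInterruptedException, and the loop bound is illustrative, not the PR's actual value):

```java
// Sketch of a retry loop that only retries one exception type; any other
// exception propagates immediately because it is never caught here.
public class ScopedRetrySketch {
  interface Action {
    void run() throws Exception;
  }

  static void runWithRetry(Action action, int maxAttempts) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        action.run();
        return;
      } catch (InterruptedException e) {  // stand-in for ZkInterruptedException
        if (attempt >= maxAttempts) {
          // Give up: rethrow so the caller can clean up and report the failure.
          throw e;
        }
      }
    }
  }
}
```

A side benefit, noted later in the thread, is that re-running the update closure naturally re-checks whether the segment is already present in the IdealState.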

Clean up is fine, but what if the clean up also fails?

I think there will inevitably be some scenarios we cannot completely recover from, but we will at least let the caller know that an error occurred, so they can handle cleanup manually as needed. This fix also removes the segment from both the PropertyStore and the IdealState, which leaves the ZooKeeper state in a better place 🤞

Let me know your thoughts when you have time @Jackie-Jiang

HelixHelper.updateIdealState(_helixZkManager, tableNameWithType, idealState -> {
  assert idealState != null;
  Map<String, Map<String, String>> currentAssignment = idealState.getRecord().getMapFields();
  if (currentAssignment.containsKey(segmentName)) {
Contributor Author

Also note: by retrying, we also automatically check whether the segment already exists here, which is nice.

@Jackie-Jiang
Contributor

Digging more into the code, I'm still trying to figure out when it will get a ZkInterruptedException with the IS persisted. So far I have found that Helix ZkClient.writeDataReturnStat() -> ZkClient.retryUntilConnected() could throw ZkInterruptedException, but it shouldn't persist the IS.
The fix in this PR is not general enough, because the core problem is that the IS update result is no longer reliable, which could impact a lot of logic in addition to segment upload. We need to fix the IS update logic to ensure it either successfully persists the result, or throws an exception.
@xiangfu0 Can you help take a look at this? Can we handle this in IdealStateGroupCommit?

@AlexanderKM
Contributor Author

AlexanderKM commented Sep 24, 2025

That is fair. From the stacktrace I have in our logs when this happens, it is indeed coming from the retryUntilConnected method. I will dig in further as well.

Edit: looks like this relevant issue was closed without a fix? apache/helix#2685

Running into an issue with https://github.com/apache/pinot using helix version 1.3.1, wherein there seems to be an inconsistency b/w the returned status code of the ZkBaseDataAccessor.set vs the expected outcome. The set call to update the idealState here, fails with retCode != OK, but the IdealState ZNode does get updated successfully. This is causing unexpected behaviour further downstream in the system, which relies on the return code of the set call.
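Given the behavior described in that Helix issue (a failed return code even though the ZNode was updated), one defensive workaround is to re-read the ZNode before concluding the write failed. A minimal sketch, where the map is a stand-in for the ZK store and all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: trust the actual ZNode content over an ambiguous return code.
public class VerifyWriteSketch {
  static Map<String, String> zkStoreStub = new HashMap<>();

  // If the write reported success, believe it; otherwise re-read the node
  // and treat the write as successful only if the content actually matches.
  static boolean setAndVerify(String path, String content, boolean reportedOk) {
    if (reportedOk) {
      return true;
    }
    return content.equals(zkStoreStub.get(path));
  }
}
```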

@Jackie-Jiang
Contributor

@AlexanderKM Great finding of the old Helix issue. I cannot re-open it, but opened a new one here: apache/helix#3072

@AlexanderKM
Contributor Author

I am not sure how quickly we would be able to update Helix (or if it is possible at all), so here is another possible approach we can take to fix the problem at the root of the IS update: #16900

