
Conversation

@AlexanderKM (Contributor) commented Sep 25, 2025

Potential solution for #16866

When a ZkInterruptedException occurs during IdealState updates, we now verify whether the write actually succeeded by:

  1. Reading back the current IdealState from ZooKeeper
  2. Checking version advancement: Ensures the version increased from our pre-write baseline
  3. Verifying content equality: Confirms our specific changes are present in the written state

Why both checks are necessary

  • A version-only check is insufficient: version advancement only tells us that some write occurred, not necessarily ours
  • The equality check ensures specificity: it verifies that our exact changes are present and were not overwritten by concurrent updates

There are some tradeoffs here: we accept some false negatives on the equality check, since the IdealState could be updated outside of this code block (i.e. if some other thread or process updates the IdealState, the equality check may fail and we will retry). I think this is an acceptable tradeoff, especially since this exception shouldn't occur very often. It ensures we never incorrectly report success when our specific write didn't persist in the final state.
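
For illustration, here is a minimal sketch of the two checks, assuming the ZNRecord version is captured before the write is attempted (the class and method names below are illustrative, not the actual patch):

import org.apache.helix.HelixDataAccessor;
import org.apache.helix.PropertyKey;
import org.apache.helix.model.IdealState;

public final class IdealStateWriteVerifier {
  private IdealStateWriteVerifier() {
  }

  // Best-effort check of whether an IdealState write survived a ZkInterruptedException.
  // Returns true only when the ZNRecord version advanced past the pre-write baseline AND
  // the content read back equals what we attempted to write. Concurrent writers can still
  // cause false negatives, in which case the caller should simply retry.
  public static boolean verifyWriteAfterInterrupt(HelixDataAccessor dataAccessor, PropertyKey idealStateKey,
      int preWriteVersion, IdealState attemptedIdealState) {
    IdealState writtenIdealState = dataAccessor.getProperty(idealStateKey);
    if (writtenIdealState == null) {
      return false;
    }
    // Check 1: some write landed after our baseline (necessary, but not sufficient on its own)
    boolean versionAdvanced = writtenIdealState.getRecord().getVersion() > preWriteVersion;
    // Check 2: the persisted content matches what we tried to write (rules out concurrent writers)
    boolean contentMatches = writtenIdealState.getRecord().equals(attemptedIdealState.getRecord());
    return versionAdvanced && contentMatches;
  }
}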

@codecov-commenter commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.51%. Comparing base (d39f93d) to head (a8d8068).

Files with missing lines Patch % Lines
...inot/common/utils/helix/IdealStateGroupCommit.java 0.00% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #16900   +/-   ##
=========================================
  Coverage     63.50%   63.51%           
  Complexity     1412     1412           
=========================================
  Files          3068     3068           
  Lines        180255   180269   +14     
  Branches      27583    27586    +3     
=========================================
+ Hits         114479   114494   +15     
- Misses        56956    56961    +5     
+ Partials       8820     8814    -6     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.47% <0.00%> (-0.01%) ⬇️
java-21 63.48% <0.00%> (+<0.01%) ⬆️
temurin 63.51% <0.00%> (+<0.01%) ⬆️
unittests 63.50% <0.00%> (+<0.01%) ⬆️
unittests1 56.43% <0.00%> (+0.02%) ⬆️
unittests2 33.59% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.


Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds improved error handling for ZooKeeper interruptions during IdealState updates in the Pinot controller. The solution implements a write verification mechanism to handle race conditions where a ZkInterruptedException occurs but the write may have actually succeeded.

  • Adds handling for ZkInterruptedException with verification logic
  • Implements dual verification: version advancement check and content equality check
  • Provides graceful recovery from transient ZooKeeper interruptions

return false;
} else {
LOGGER.info("IdealState was written successfully after interrupt for resource: {}", resourceName);
idealStateWrapper._idealState = updatedIdealState;
Copilot AI commented Sep 25, 2025

The wrapper should be updated with the writtenIdealState that was successfully read back from ZooKeeper, not the updatedIdealState that we attempted to write. This ensures the wrapper reflects the actual persisted state and its version number.

Suggested change
idealStateWrapper._idealState = updatedIdealState;
idealStateWrapper._idealState = writtenIdealState;

@Jackie-Jiang (Contributor) left a comment

Is the interruption caused by the controller disconnecting from the cluster (during shutdown)? If so, do we still have access to ZK after disconnection?

cc @xiangfu0 to also review

LOGGER.warn("Version changed while updating ideal state for resource: {}", resourceName);
return false;
} catch (ZkInterruptedException e) {
LOGGER.warn("Caught ZkInterruptedException while updating resource: {}, verifying...",

Please add a comment explaining why we are doing this.

} catch (ZkInterruptedException e) {
LOGGER.warn("Caught ZkInterruptedException while updating resource: {}, verifying...",
resourceName);
IdealState writtenIdealState = dataAccessor.getProperty(idealStateKey);

Should we wrap this handling in another try-catch?
I feel the interruption is mostly triggered when the controller disconnects from the cluster, in which case we can probably no longer access ZK. Can you verify that?
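
If the verification path is kept, one way to guard it, purely as a sketch under the assumption that ZK may already be unreachable when the interruption comes from the controller disconnecting, would be to wrap the read-back itself (names such as dataAccessor, idealStateKey, LOGGER and resourceName are taken from the surrounding diff):

IdealState writtenIdealState;
try {
  writtenIdealState = dataAccessor.getProperty(idealStateKey);
} catch (Exception readException) {
  // If ZK is already gone (e.g. the controller is shutting down), we cannot verify; treat as failure.
  LOGGER.warn("Unable to verify IdealState write for resource: {} after interrupt", resourceName, readException);
  return false;
}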

@AlexanderKM (Contributor, Author) commented

@Jackie-Jiang you are totally right. Upon further investigation, it looks like this is happening during controller shutdown, and the ZkInterruptedException is quite intentional on the client side when the Helix manager is called to disconnect: https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/BaseControllerStarter.java#L1035

ZkHelixManager source
ZkClient source

I need to go back and think about how to solve this from that point of view, since we won't be able to read from ZK (or delete, for that matter): the client is disconnected and won't reconnect because the controller is in shutdown mode. The solution may need a more graceful shutdown to handle this timeline:

  1. A request comes into the controller to process a new segment upload
  2. The controller begins to shut down, and the Helix manager is disconnected while the IdealState update is still in flight
  3. Inside the IdealState update retry loop, we hit ZkInterruptedException (we cannot recover here because the client is disconnected)

Perhaps we can attempt to let remaining updates complete within some fixed timeout before disconnecting.
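
Something along these lines, as a rough sketch only (the executor and timeout are hypothetical and not part of this PR): drain in-flight IdealState update tasks for a bounded time before the Helix manager is disconnected.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class GracefulIdealStateShutdown {
  // Hypothetical executor that runs the pending IdealState update tasks.
  private final ExecutorService _idealStateUpdateExecutor;

  public GracefulIdealStateShutdown(ExecutorService idealStateUpdateExecutor) {
    _idealStateUpdateExecutor = idealStateUpdateExecutor;
  }

  // Stop accepting new updates, wait up to drainTimeoutMs for in-flight ones to finish,
  // and only then let the caller disconnect the Helix manager (which is what triggers
  // ZkInterruptedException in any ZK call still in progress).
  public void drainBeforeDisconnect(long drainTimeoutMs) throws InterruptedException {
    _idealStateUpdateExecutor.shutdown();
    if (!_idealStateUpdateExecutor.awaitTermination(drainTimeoutMs, TimeUnit.MILLISECONDS)) {
      _idealStateUpdateExecutor.shutdownNow();
    }
    // Caller proceeds with helixManager.disconnect() only after this returns.
  }
}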

@Jackie-Jiang (Contributor) commented

I still feel the root cause is Helix not giving a clear abstraction for the ZK update call. It should either succeed, or fail without changing the record.
