
Conversation

@AlexanderKM (Contributor) commented Sep 25, 2025

Potential solution for #16866

When a ZkInterruptedException occurs during IdealState updates, we now verify whether the write actually succeeded by:

  1. Reading back the current IdealState from ZooKeeper
  2. Checking version advancement: Ensures the version increased from our pre-write baseline
  3. Verifying content equality: Confirms our specific changes are present in the written state

Why both checks are necessary

  • A version-only check is insufficient: version advancement only tells us that some write occurred, not necessarily ours
  • The equality check ensures specificity: it verifies that our exact changes are present and were not overwritten by concurrent updates

There are some tradeoffs here: we accept some false negatives on the equality check, since the IdealState could be updated outside of this code block (i.e. if some other thread or process updates the IdealState, the equality check may fail and we will retry). I think this is an acceptable tradeoff, especially since this exception shouldn't occur very often. It ensures we never incorrectly report success when our specific write didn't persist in the final state.
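
For illustration, here is a minimal sketch of the two checks, assuming the ZNRecord version is captured before the write is attempted (the class and method names below are illustrative, not the actual patch):

import org.apache.helix.HelixDataAccessor;
import org.apache.helix.PropertyKey;
import org.apache.helix.model.IdealState;

public final class IdealStateWriteVerifier {
  private IdealStateWriteVerifier() {
  }

  // Best-effort check of whether an IdealState write survived a ZkInterruptedException.
  // Returns true only when the ZNRecord version advanced past the pre-write baseline AND
  // the content read back equals what we attempted to write. Concurrent writers can still
  // cause false negatives, in which case the caller should simply retry.
  public static boolean verifyWriteAfterInterrupt(HelixDataAccessor dataAccessor, PropertyKey idealStateKey,
      int preWriteVersion, IdealState attemptedIdealState) {
    IdealState writtenIdealState = dataAccessor.getProperty(idealStateKey);
    if (writtenIdealState == null) {
      return false;
    }
    // Check 1: some write landed after our baseline (necessary, but not sufficient on its own)
    boolean versionAdvanced = writtenIdealState.getRecord().getVersion() > preWriteVersion;
    // Check 2: the persisted content matches what we tried to write (rules out concurrent writers)
    boolean contentMatches = writtenIdealState.getRecord().equals(attemptedIdealState.getRecord());
    return versionAdvanced && contentMatches;
  }
}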

@codecov-commenter commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.51%. Comparing base (d39f93d) to head (a8d8068).

Files with missing lines Patch % Lines
...inot/common/utils/helix/IdealStateGroupCommit.java 0.00% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #16900   +/-   ##
=========================================
  Coverage     63.50%   63.51%           
  Complexity     1412     1412           
=========================================
  Files          3068     3068           
  Lines        180255   180269   +14     
  Branches      27583    27586    +3     
=========================================
+ Hits         114479   114494   +15     
- Misses        56956    56961    +5     
+ Partials       8820     8814    -6     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.47% <0.00%> (-0.01%) ⬇️
java-21 63.48% <0.00%> (+<0.01%) ⬆️
temurin 63.51% <0.00%> (+<0.01%) ⬆️
unittests 63.50% <0.00%> (+<0.01%) ⬆️
unittests1 56.43% <0.00%> (+0.02%) ⬆️
unittests2 33.59% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.


Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds improved error handling for ZooKeeper interruptions during IdealState updates in the Pinot controller. The solution implements a write verification mechanism to handle race conditions where a ZkInterruptedException occurs but the write may have actually succeeded.

  • Adds handling for ZkInterruptedException with verification logic
  • Implements dual verification: version advancement check and content equality check
  • Provides graceful recovery from transient ZooKeeper interruptions

return false;
} else {
LOGGER.info("IdealState was written successfully after interrupt for resource: {}", resourceName);
idealStateWrapper._idealState = updatedIdealState;
Copilot AI commented Sep 25, 2025

The wrapper should be updated with the writtenIdealState that was successfully read back from ZooKeeper, not the updatedIdealState that we attempted to write. This ensures the wrapper reflects the actual persisted state and its version number.

Suggested change
idealStateWrapper._idealState = updatedIdealState;
idealStateWrapper._idealState = writtenIdealState;

@Jackie-Jiang (Contributor) left a comment

Is the interruption caused by the controller disconnecting from the cluster (during shutdown)? If so, do we still have access to ZK after disconnection?

cc @xiangfu0 to also review

LOGGER.warn("Version changed while updating ideal state for resource: {}", resourceName);
return false;
} catch (ZkInterruptedException e) {
LOGGER.warn("Caught ZkInterruptedException while updating resource: {}, verifying...",

Please add a comment explaining why we are doing this.

} catch (ZkInterruptedException e) {
LOGGER.warn("Caught ZkInterruptedException while updating resource: {}, verifying...",
resourceName);
IdealState writtenIdealState = dataAccessor.getProperty(idealStateKey);

Should we wrap this handling in another try-catch?
I feel the interruption is mostly triggered when the controller disconnects from the cluster, in which case we can probably no longer access ZK. Can you verify that?
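
If the verification path is kept, one way to guard it, purely as a sketch under the assumption that ZK may already be unreachable when the interruption comes from the controller disconnecting, would be to wrap the read-back itself (names such as dataAccessor, idealStateKey, LOGGER and resourceName are taken from the surrounding diff):

IdealState writtenIdealState;
try {
  writtenIdealState = dataAccessor.getProperty(idealStateKey);
} catch (Exception readException) {
  // If ZK is already gone (e.g. the controller is shutting down), we cannot verify; treat as failure.
  LOGGER.warn("Unable to verify IdealState write for resource: {} after interrupt", resourceName, readException);
  return false;
}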

@AlexanderKM (Contributor, Author) commented

@Jackie-Jiang you are totally right. Upon further investigation, it looks like this is happening during controller shutdown, and the ZkInterruptedException is quite intentional on the client side when the Helix manager is called to disconnect: https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/BaseControllerStarter.java#L1035

ZkHelixManager source
ZkClient source

I need to go back and think about how to solve this from that point of view, since we won't be able to read from ZK (or delete, for that matter): the client is disconnected and won't reconnect because the controller is in shutdown mode. The solution may need a more graceful shutdown to handle this timeline:

  1. A request comes into the controller to process a new segment upload
  2. The controller begins to shut down, and the Helix manager is disconnected while the IdealState update is still in flight
  3. Inside the IdealState update retry loop, we hit ZkInterruptedException (we cannot recover here because the client is disconnected)

Perhaps we can attempt to let remaining updates complete within some fixed timeout before disconnecting.
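
Something along these lines, as a rough sketch only (the executor and timeout are hypothetical and not part of this PR): drain in-flight IdealState update tasks for a bounded time before the Helix manager is disconnected.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class GracefulIdealStateShutdown {
  // Hypothetical executor that runs the pending IdealState update tasks.
  private final ExecutorService _idealStateUpdateExecutor;

  public GracefulIdealStateShutdown(ExecutorService idealStateUpdateExecutor) {
    _idealStateUpdateExecutor = idealStateUpdateExecutor;
  }

  // Stop accepting new updates, wait up to drainTimeoutMs for in-flight ones to finish,
  // and only then let the caller disconnect the Helix manager (which is what triggers
  // ZkInterruptedException in any ZK call still in progress).
  public void drainBeforeDisconnect(long drainTimeoutMs) throws InterruptedException {
    _idealStateUpdateExecutor.shutdown();
    if (!_idealStateUpdateExecutor.awaitTermination(drainTimeoutMs, TimeUnit.MILLISECONDS)) {
      _idealStateUpdateExecutor.shutdownNow();
    }
    // Caller proceeds with helixManager.disconnect() only after this returns.
  }
}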

@Jackie-Jiang (Contributor) commented

I still feel the root cause is Helix not giving a clear abstraction for the ZK update call. It should either succeed, or fail without changing the record.
