
Conversation

AlexanderKM
Contributor

For #16866

This PR adds a segment deletion step to the cleanup process when a new segment is uploaded and we encounter a ZkInterruptedException while updating the IdealState. Because of this exception, we cannot be certain whether the IdealState was updated successfully, and since the code already deletes the segment metadata from the property store, it is a good idea to delete it from the IdealState as well.

We have been using this patch in our deployment for some months and it has been helpful in preemptively catching this scenario so we don't need to go in and manually delete the segment ourselves.
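The cleanup described above can be sketched in plain Java, using maps as stand-ins for the property store and the IdealState (class and method names here are illustrative, not Pinot's real API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the cleanup idea: when the IdealState update fails with
// an ambiguous exception, remove the segment from both stores so neither is
// left referencing a half-added segment. All names are hypothetical.
public class SegmentCleanupSketch {
  static Map<String, String> propertyStore = new HashMap<>();
  static Map<String, String> idealState = new HashMap<>();

  static void addSegment(String segmentName, boolean simulateZkInterrupt) {
    propertyStore.put(segmentName, "metadata");
    try {
      if (simulateZkInterrupt) {
        // Stand-in for ZkInterruptedException: the write's outcome is unknown.
        throw new IllegalStateException("ZK update interrupted");
      }
      idealState.put(segmentName, "ONLINE");
    } catch (IllegalStateException e) {
      // The existing code already deletes the metadata; this patch also
      // deletes the segment from the IdealState, then rethrows to the caller.
      propertyStore.remove(segmentName);
      idealState.remove(segmentName);
      throw e;
    }
  }
}
```

The point of the pattern is that the caller sees either a fully added segment or no trace of it, never a segment present in one store but not the other.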

@codecov-commenter

codecov-commenter commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 82.45614% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.50%. Comparing base (d89dc7c) to head (8b980a6).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...ntroller/helix/core/PinotHelixResourceManager.java 81.48% 6 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16867      +/-   ##
============================================
- Coverage     63.52%   63.50%   -0.02%     
  Complexity     1410     1410              
============================================
  Files          3068     3069       +1     
  Lines        180161   180191      +30     
  Branches      27556    27561       +5     
============================================
- Hits         114441   114439       -2     
- Misses        56917    56946      +29     
- Partials       8803     8806       +3     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.48% <82.45%> (+0.02%) ⬆️
java-21 63.48% <82.45%> (-0.01%) ⬇️
temurin 63.50% <82.45%> (-0.02%) ⬇️
unittests 63.50% <82.45%> (-0.02%) ⬇️
unittests1 56.40% <0.00%> (-0.04%) ⬇️
unittests2 33.57% <82.45%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Thanks for contributing the fix!

Is there a way to identify whether an IS update is persisted? Currently we assume the update is not persisted when an exception is thrown, which is what caused the issue. That is the root problem to fix.


public class SegmentIngestionFailureException extends Exception {

  public SegmentIngestionFailureException(String message) {
Contributor

(format) Please apply Pinot Style

// Call deleteSegment to remove the segment from permanent location if needed.
LOGGER.error("Caught exception while calling assignTableSegment for adding segment: {} to table: {}", segmentName,
    tableNameWithType, e);
if (containsException(e, ZkInterruptedException.class)) {
Contributor

This should be handled within _pinotHelixResourceManager.assignSegment(), where it has the most context. Basically we should follow the contract that assignSegment() either finishes successfully, or reverts its changes and throws an exception.

@AlexanderKM
Contributor Author

I have moved the segment deletion into the PinotHelixResourceManager 👍

@AlexanderKM
Contributor Author

Is there a way to identify whether an IS update is persisted? Currently we assume the update is not persisted when an exception is thrown, which is what caused the issue. That is the root problem to fix.

When I first encountered this problem and added the fix, I was hesitant to go down this path; given this exception, I felt it was better to just clean up completely, since the code is already making the decision to delete the metadata.
But I suppose it could be nice to do something like:

catch (Exception e) {
  if (idealState contains segment) {
    exit successfully;
  } else {
    cleanup metadata;
  }
  ...
}

Let me know if you would like that in this PR or not @Jackie-Jiang
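A hedged Java rendering of that check-then-cleanup idea could look like the following (the segment set stands in for re-reading the IdealState from ZooKeeper; all names are illustrative):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: after an ambiguous update exception, decide between "the update
// actually landed" (treat as success) and "it did not" (clean up metadata).
public class PersistCheckSketch {
  // Stand-in for re-reading the IdealState's segment assignment from ZK.
  static Set<String> idealStateSegments = new HashSet<>();

  // Returns true if the segment made it into the IdealState despite the
  // exception; otherwise removes the metadata and returns false.
  static boolean handleUpdateException(String segmentName, Map<String, String> propertyStore) {
    if (idealStateSegments.contains(segmentName)) {
      return true;
    }
    propertyStore.remove(segmentName);
    return false;
  }
}
```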

@Jackie-Jiang
Contributor

We should also consider retrying for such temporary exceptions.
Clean up is fine, but what if the clean up also fails?

…with ZkInterruptedException

Signed-off-by: Alex Maniates <[email protected]>
@AlexanderKM AlexanderKM force-pushed the segment-upload-failure-cleanup branch from ff68e26 to 68d156d on September 24, 2025 at 20:09
@AlexanderKM AlexanderKM force-pushed the segment-upload-failure-cleanup branch from 68d156d to 8b980a6 on September 24, 2025 at 20:13
@AlexanderKM
Contributor Author

I took your advice and added a retry in the segment assignment flow, but only when a ZkInterruptedException is encountered. It probably doesn't make sense to retry in other scenarios, since the RetryPolicy is already attempting 20 times in "normal" cases, so this is scoped to only ZK exceptions for now. This did require a bit of refactoring, but I think it looks better in the end.

To support this, I added some unit tests for the following:

  • Successful add new segment
  • Recovering from ZkInterruptedException leads to a successful add new segment
  • Complete failure of adding new segment in case of multiple ZK exceptions or other normal exceptions like AttemptsExceeded
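A retry scoped to a single exception type, as described above, can be sketched like this (InterruptedException stands in for ZkInterruptedException, and the loop bound is illustrative, not the PR's actual value):

```java
// Sketch of a retry loop that only retries one exception type; any other
// exception propagates immediately because it is never caught here.
public class ScopedRetrySketch {
  interface Action {
    void run() throws Exception;
  }

  static void runWithRetry(Action action, int maxAttempts) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        action.run();
        return;
      } catch (InterruptedException e) {  // stand-in for ZkInterruptedException
        if (attempt >= maxAttempts) {
          // Give up: rethrow so the caller can clean up and report the failure.
          throw e;
        }
      }
    }
  }
}
```

A side benefit, noted later in the thread, is that re-running the update closure naturally re-checks whether the segment is already present in the IdealState.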

Clean up is fine, but what if the clean up also fails?

I think there will inevitably be some scenarios we cannot completely recover from, but we will at least let the caller know that an error occurred, so they can handle cleanup manually as needed. This fix also removes the segment from both the PropertyStore and the IdealState, which leaves the ZooKeeper state in a better place 🤞

Let me know your thoughts when you have time @Jackie-Jiang

HelixHelper.updateIdealState(_helixZkManager, tableNameWithType, idealState -> {
  assert idealState != null;
  Map<String, Map<String, String>> currentAssignment = idealState.getRecord().getMapFields();
  if (currentAssignment.containsKey(segmentName)) {
Contributor Author

Also note: by retrying, we also automatically check whether the segment already exists here, which is nice.

@Jackie-Jiang
Contributor

Digging more into the code, I'm still trying to figure out when it will get a ZkInterruptedException with the IS persisted. So far I have found that Helix ZkClient.writeDataReturnStat() -> ZkClient.retryUntilConnected() could throw ZkInterruptedException, but it shouldn't persist the IS.
The fix in this PR is not general enough, because the core problem is that the IS update result is no longer reliable, which could impact a lot of logic in addition to segment upload. We need to fix the IS update logic to ensure it either successfully persists the result, or throws an exception.
@xiangfu0 Can you help take a look at this? Can we handle this in IdealStateGroupCommit?

@AlexanderKM
Contributor Author

AlexanderKM commented Sep 24, 2025

That is fair. From the stacktrace I have in our logs when this happens, it is indeed coming from the retryUntilConnected method. I will dig in further as well.

Edit: looks like this relevant issue was closed without a fix? apache/helix#2685

Running into an issue with https://github.com/apache/pinot using helix version 1.3.1, wherein there seems to be an inconsistency b/w the returned status code of the ZkBaseDataAccessor.set vs the expected outcome. The set call to update the idealState here, fails with retCode != OK, but the IdealState ZNode does get updated successfully. This is causing unexpected behaviour further downstream in the system, which relies on the return code of the set call.
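Given the behavior described in that Helix issue (a failed return code even though the ZNode was updated), one defensive workaround is to re-read the ZNode before concluding the write failed. A minimal sketch, where the map is a stand-in for the ZK store and all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: trust the actual ZNode content over an ambiguous return code.
public class VerifyWriteSketch {
  static Map<String, String> zkStoreStub = new HashMap<>();

  // If the write reported success, believe it; otherwise re-read the node
  // and treat the write as successful only if the content actually matches.
  static boolean setAndVerify(String path, String content, boolean reportedOk) {
    if (reportedOk) {
      return true;
    }
    return content.equals(zkStoreStub.get(path));
  }
}
```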

@Jackie-Jiang
Contributor

@AlexanderKM Great finding of the old Helix issue. I cannot re-open it, but opened a new one here: apache/helix#3072

@AlexanderKM
Contributor Author

I am not sure how quickly we would be able to update Helix (or if it is possible at all), so here is another possible approach we can take to fix the problem at the root of the IS update: #16900

