Delete segment completely when ZkOperator fails to update IdealState with ZkInterruptedException #16867
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #16867 +/- ##
============================================
- Coverage 63.52% 63.50% -0.02%
Complexity 1410 1410
============================================
Files 3068 3069 +1
Lines 180161 180191 +30
Branches 27556 27561 +5
============================================
- Hits 114441 114439 -2
- Misses 56917 56946 +29
- Partials 8803 8806 +3
Thanks for contributing the fix!
Is there a way to identify whether an IdealState (IS) update was persisted? Currently we assume the update was not persisted whenever an exception is thrown, which is what caused the issue. That is the root problem to fix.
public class SegmentIngestionFailureException extends Exception {

  public SegmentIngestionFailureException(String message) {
(format) Please apply Pinot Style
// Call deleteSegment to remove the segment from permanent location if needed.
LOGGER.error("Caught exception while calling assignTableSegment for adding segment: {} to table: {}", segmentName,
    tableNameWithType, e);
if (containsException(e, ZkInterruptedException.class)) {
This should be handled within _pinotHelixResourceManager.assignSegment(), which has the most context. Basically, we should follow the contract that assignSegment() either finishes successfully, or reverts its changes and throws an exception.
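To illustrate the "finish successfully, or revert and throw" contract, here is a minimal sketch. It is not Pinot's implementation: `SegmentAssigner`, its two in-memory stores, and the `failIdealStateUpdate` switch are hypothetical stand-ins for the property store, the IdealState, and an interrupted ZooKeeper write. It also assumes a brand-new segment; reverting a refresh of an existing segment would need to restore the previous entry instead of removing it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of segment bookkeeping; the names here are illustrative, not Pinot's API. */
class SegmentAssigner {
  // Hypothetical in-memory substitutes for the property store and the IdealState.
  final Set<String> metadataStore = new HashSet<>();
  final Map<String, String> idealState = new HashMap<>();
  // Test switch: when true, the IdealState write throws after it has already been applied.
  boolean failIdealStateUpdate = false;

  /** On return, both stores contain the segment. On exception, neither does. */
  void assignSegment(String segmentName, String instance) {
    metadataStore.add(segmentName);
    try {
      updateIdealState(segmentName, instance);
    } catch (RuntimeException e) {
      // The outcome of the write is unknown, so undo both sides before propagating.
      idealState.remove(segmentName);
      metadataStore.remove(segmentName);
      throw new IllegalStateException("Failed to assign segment: " + segmentName, e);
    }
  }

  private void updateIdealState(String segmentName, String instance) {
    idealState.put(segmentName, instance);
    if (failIdealStateUpdate) {
      // Simulates an interruption that is reported after the write has landed.
      throw new RuntimeException("interrupted while updating IdealState");
    }
  }
}
```

Because both removals are safe to run whether or not the write landed, the caller never has to guess which store was left holding the segment.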
I have moved the segment deletion into the
When I first encountered this problem and added the fix, I was hesitant to go down this path: with this exception, I felt it was safer to clean up completely, since the code already decides to delete the metadata.
Let me know if you would like that in this PR or not @Jackie-Jiang
We should also consider retrying on such transient exceptions.
…with ZkInterruptedException Signed-off-by: Alex Maniates <[email protected]>
…ncountering ZkInterruptedException Signed-off-by: Alex Maniates <[email protected]>
I took your advice and added a retry in the segment assignment flow! It applies only when a ZkInterruptedException is encountered. It probably doesn't make sense to retry in other scenarios, since the RetryPolicy already attempts 20 times in "normal" cases, so I am keeping this scoped to ZK exceptions for now. This did require a bit of refactoring, but I think it looks better in the end. To support this, I added some unit tests for the following:
I think there will inevitably be some scenarios that we cannot completely recover from, but we will at least let the caller know that an error occurred so they can handle cleanup manually as needed. This also hopefully removes the segment from both the PropertyStore AND the IdealState, which leaves the ZooKeeper state in a better place 🤞 Let me know your thoughts when you have time @Jackie-Jiang
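A retry scoped to one exception type could look roughly like the sketch below. This is an assumption-laden illustration, not the code in this PR: `TransientStoreException` is a hypothetical stand-in for ZkInterruptedException, `ScopedRetry` and `callWithRetry` are made-up names, and the cause-chain walk is only one plausible way a helper like `containsException` might work.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for ZkInterruptedException so the sketch has no dependencies. */
class TransientStoreException extends RuntimeException {
  TransientStoreException(String message) {
    super(message);
  }
}

final class ScopedRetry {
  private ScopedRetry() {
  }

  /** Returns true if t, or any exception in its cause chain, is an instance of type. */
  static boolean containsException(Throwable t, Class<? extends Throwable> type) {
    for (Throwable current = t; current != null; current = current.getCause()) {
      if (type.isInstance(current)) {
        return true;
      }
    }
    return false;
  }

  /** Retries only when the failure contains retryOn; any other failure is rethrown immediately. */
  static <T> T callWithRetry(Supplier<T> operation, Class<? extends Throwable> retryOn, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return operation.get();
      } catch (RuntimeException e) {
        if (!containsException(e, retryOn)) {
          throw e;
        }
        last = e;
      }
    }
    // All attempts failed with the retryable exception; surface the last one to the caller.
    throw last;
  }
}
```

Retrying is only safe here if the retried operation is idempotent, since an interrupted attempt may already have been applied.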
HelixHelper.updateIdealState(_helixZkManager, tableNameWithType, idealState -> {
  assert idealState != null;
  Map<String, Map<String, String>> currentAssignment = idealState.getRecord().getMapFields();
  if (currentAssignment.containsKey(segmentName)) {
Also note: because we retry, this check for whether the segment already exists runs automatically on each attempt, which is nice.
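The reason the existence check matters for retries can be shown with a small sketch. It assumes the assignment is a plain map from segment name to instance states, shaped like `getMapFields()` in the excerpt above; `IdempotentAssignment` and `addSegmentIfAbsent` are hypothetical names, not part of Pinot or Helix.

```java
import java.util.HashMap;
import java.util.Map;

final class IdempotentAssignment {
  private IdempotentAssignment() {
  }

  /**
   * Adds the segment unless it is already assigned. Returns true if the map changed,
   * false if an earlier (possibly interrupted) attempt had already added the segment.
   */
  static boolean addSegmentIfAbsent(Map<String, Map<String, String>> assignment, String segmentName,
      Map<String, String> instanceStates) {
    if (assignment.containsKey(segmentName)) {
      return false;
    }
    // Copy the states so later changes by the caller do not leak into the assignment.
    assignment.put(segmentName, new HashMap<>(instanceStates));
    return true;
  }
}
```

If the first attempt actually persisted before the exception surfaced, the second attempt becomes a no-op instead of a duplicate or conflicting write.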
Digging more into the code, I'm still trying to figure out when will it get
That is fair. From the stack trace I have in our logs when this happens, it is indeed coming from the
Edit: it looks like this relevant issue was closed without a fix? apache/helix#2685
@AlexanderKM Great find on the old Helix issue. I cannot re-open it, but I opened a new one here: apache/helix#3072
I am not sure how quickly we would be able to update Helix (or whether it is possible at all), so here is another possible approach to fix the problem at its root, in the IS update: #16900
For #16866
This PR adds a segment deletion step to the cleanup process for the case where a new segment is uploaded and we encounter a ZkInterruptedException while updating the IdealState. Because of this exception, we cannot be certain whether the IdealState was updated successfully. Since the code already deletes the segment metadata from the PropertyStore, it makes sense to delete the segment from the IdealState as well.
We have been using this patch in our deployment for some months, and it has been helpful in catching this scenario preemptively so we don't need to go in and manually delete the segment ourselves.