HBASE-29265: Batch calls to overloaded cluster can cause meta hotspotting #6961
base: branch-2
@@ -783,8 +787,7 @@ private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int
   // any of the regions in the MultiAction and do not update cache if exception is
   // from failing to submit action to thread pool
   if (clearServerCache) {
-    updateCachedLocations(server, regionName, row,
-      ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
+    updateCachedLocations(server, regionName, row, t);
This also solves the frustration of seeing "UnknownException" when inspecting meta cache clear exception metrics. This has made it quite difficult to track down what triggered the meta cache clear.
I think it's always better to provide more context than less. Even if an exception is meta cache clearing (though it will be now), I'd still prefer to know the exact exception type that cleared the meta cache.
+1, it would be good to preserve the exception for updateCachedLocations.
Since we currently pass null to updateCachedLocations if we have a meta cache clearing exception, does that mean that we never update the cache clearing exception metric properly for cache clears coming from receiveGlobalFailure?
What we do is basically "mask" the cache clearing exception by reporting an UnknownException. The code for that lives in the metrics class. It's annoying because that, coupled with the lack of any logging in this code path, makes it really difficult to determine what caused these meta cache clears.
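To make the masking concrete, here is a minimal, self-contained sketch. It is not the HBase metrics code: the class CacheClearMetricsSketch, its recordCacheClear method, and the counter keys are hypothetical stand-ins. The only behavior taken from the discussion is that a null exception ends up counted as UnknownException.

```java
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a client metrics class; all names are illustrative.
final class CacheClearMetricsSketch {
  private final Map<String, Long> counters = new HashMap<>();

  // Counts one meta cache clear, keyed by the exception's simple class name.
  // A null cause is recorded as "UnknownException", which hides the real trigger.
  void recordCacheClear(Throwable cause) {
    String name = cause == null ? "UnknownException" : cause.getClass().getSimpleName();
    counters.merge(name, 1L, Long::sum);
  }

  // Returns how many cache clears were recorded under the given name.
  long count(String exceptionName) {
    return counters.getOrDefault(exceptionName, 0L);
  }

  public static void main(String[] args) {
    CacheClearMetricsSketch metrics = new CacheClearMetricsSketch();
    Throwable t = new ConnectException("connection refused");
    metrics.recordCacheClear(null); // caller replaced t with null: the type is lost
    metrics.recordCacheClear(t);    // caller passed t through: the type is kept
    System.out.println("UnknownException: " + metrics.count("UnknownException"));
    System.out.println("ConnectException: " + metrics.count("ConnectException"));
  }
}
```

Under these assumptions, the same failure shows up either under an uninformative bucket or under its real type, depending only on what the caller passes in.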
errorsByServer.reportServerError(server);
Retry canRetry = errorsByServer.canTryMore(numAttempt) ? Retry.YES : Retry.NO_RETRIES_EXHAUSTED;
boolean clearServerCache = false;

if (!(t instanceof RejectedExecutionException)) {
Enforces the constraints added in https://issues.apache.org/jira/browse/HBASE-27491
I think it would be better if you instead push RejectedExecutionException down into ClientExceptionsUtil.isMetaClearingException.
How about adding another collection of execution-exceptions for the family of various ExecutorService interaction errors, like is done with networking/connection exceptions?
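A rough sketch of what that suggestion could look like, written as a standalone class so it runs without HBase on the classpath. The class name MetaClearingSketch, the EXECUTION_EXCEPTIONS set, and the simplified isMetaClearingException body are assumptions for illustration; the real ClientExceptionsUtil checks more exception families than this.

```java
import java.util.Set;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical, simplified stand-in for ClientExceptionsUtil.
final class MetaClearingSketch {
  // Assumed collection of ExecutorService interaction errors. Only
  // RejectedExecutionException comes from the discussion; others could be added.
  private static final Set<Class<? extends Throwable>> EXECUTION_EXCEPTIONS =
    Set.of(RejectedExecutionException.class);

  // True if t, or any exception in its cause chain, is an execution exception.
  static boolean isExecutionException(Throwable t) {
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      for (Class<? extends Throwable> type : EXECUTION_EXCEPTIONS) {
        if (type.isInstance(cur)) {
          return true;
        }
      }
    }
    return false;
  }

  // Simplification: everything except execution exceptions is treated as
  // meta clearing here. The real method excludes other families as well.
  static boolean isMetaClearingException(Throwable t) {
    return !isExecutionException(t);
  }

  public static void main(String[] args) {
    Throwable rejected = new RuntimeException(new RejectedExecutionException("pool full"));
    System.out.println(isMetaClearingException(rejected));
    System.out.println(isMetaClearingException(new java.io.IOException("region moved")));
  }
}
```

With the check pushed down like this, callers such as receiveGlobalFailure would not need their own instanceof test for thread pool failures.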
Noted, adding that
@@ -783,8 +787,7 @@ private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int
   // any of the regions in the MultiAction and do not update cache if exception is
   // from failing to submit action to thread pool
   if (clearServerCache) {
-    updateCachedLocations(server, regionName, row,
Hey @hgromer. From what I understand, the ClientExceptionsUtil.isMetaClearingException(t) ? null : t expression is what prevents the meta cache clear from happening deeper in updateCachedLocations for exceptions that should not clear the meta cache. If it's a meta clearing exception, we pass null, bypass that check, and clear the meta cache. If it's not a meta cache clearing exception, we pass the exception through, and the check that skips the meta cache clear happens inside updateCachedLocations.
I may have misunderstood or missed something. Could you add a test showing that, on a batch operation, a non meta cache clearing exception causes a meta cache clear through receiveGlobalFailure? There are some existing meta cache clear behavior tests you could reference.
Ah you're right. This code path is really tricky to reason about. Between the UnknownExceptions in the meta cache clear metrics, and the lack of logging, it's really difficult to identify what is causing these meta cache clears. I think I'm going to re-purpose this PR to simply add some logging + avoid passing in null as the meta exception in this code path. That should hopefully help us shed some light on the high number of meta cache clears we're seeing at my company, and will illuminate a path forward.
@droudnitsky does that make sense to you?
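The conclusion of this exchange can be checked with a small sketch. It assumes, per the description above, that updateCachedLocations clears the cache when the cause is null or is a meta clearing exception. The class GlobalFailureSketch, its method names, and the predicate used in main are hypothetical and stand in for the real HBase code.

```java
import java.util.function.Predicate;

// Hypothetical model of the argument passed to updateCachedLocations.
final class GlobalFailureSketch {
  // Assumed check: a null cause or a meta clearing cause clears the cache.
  static boolean clearsMetaCache(Throwable cause, Predicate<Throwable> isMetaClearing) {
    return cause == null || isMetaClearing.test(cause);
  }

  // Old call site: meta clearing exceptions are replaced with null.
  static Throwable oldArgument(Throwable t, Predicate<Throwable> isMetaClearing) {
    return isMetaClearing.test(t) ? null : t;
  }

  // New call site: the exception is always passed through.
  static Throwable newArgument(Throwable t) {
    return t;
  }

  public static void main(String[] args) {
    // Illustrative predicate only: treat IllegalStateException as non-clearing.
    Predicate<Throwable> isMetaClearing = t -> !(t instanceof IllegalStateException);
    Throwable[] samples = { new java.io.IOException("a"), new IllegalStateException("b") };
    for (Throwable t : samples) {
      boolean before = clearsMetaCache(oldArgument(t, isMetaClearing), isMetaClearing);
      boolean after = clearsMetaCache(newArgument(t), isMetaClearing);
      System.out.println(t.getClass().getSimpleName() + ": " + before + " -> " + after);
    }
  }
}
```

Under these assumptions the clear decision is the same before and after the change; the only difference is that the callee now sees the real exception, which is what the logging and metrics need.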
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
// Do not update cache if exception is from failing to submit action to thread pool
if (clearServerCache) {
cleanServerCache(server, t);

if (LOG.isTraceEnabled()) {
Should we tick a client-side metric here too?