HBASE-29265: Batch calls to overloaded cluster can cause meta hotspotting #6961

Open
wants to merge 1 commit into
base: branch-2
@@ -219,13 +219,14 @@ public void run() {
 } catch (IOException e) {
 // The service itself failed . It may be an error coming from the communication
 // layer, but, as well, a functional error raised by the server.
-receiveGlobalFailure(multiAction, server, numAttempt, e, true);
+
+receiveGlobalFailure(multiAction, server, numAttempt, e);
 return;
 } catch (Throwable t) {
 // This should not happen. Let's log & retry anyway.
 LOG.error("id=" + asyncProcess.id + ", caught throwable. Unexpected."
 + " Retrying. Server=" + server + ", tableName=" + tableName, t);
-receiveGlobalFailure(multiAction, server, numAttempt, t, true);
+receiveGlobalFailure(multiAction, server, numAttempt, t);
 return;
 }
 if (res.type() == AbstractResponse.ResponseType.MULTI) {
@@ -570,7 +571,6 @@ private RegionLocations findAllLocationsOrFail(Action action, boolean useCache)
 */
 void sendMultiAction(Map<ServerName, MultiAction> actionsByServer, int numAttempt,
 List<Action> actionsForReplicaThread, boolean reuseThread) {
-boolean clearServerCache = true;
 // Run the last item on the same thread if we are already on a send thread.
 // We hope most of the time it will be the only item, so we can cut down on threads.
 int actionsRemaining = actionsByServer.size();
@@ -606,15 +606,14 @@ void sendMultiAction(Map<ServerName, MultiAction> actionsByServer, int numAttemp
 LOG.warn("id=" + asyncProcess.id + ", task rejected by pool. Unexpected." + " Server="
 + server.getServerName(), t);
 // Do not update cache if exception is from failing to submit action to thread pool
-clearServerCache = false;
 } else {
 // see #HBASE-14359 for more details
 LOG.warn("Caught unexpected exception/error: ", t);
 }
 asyncProcess.decTaskCounters(multiAction.getRegions(), server);
 // We're likely to fail again, but this will increment the attempt counter,
 // so it will finish.
-receiveGlobalFailure(multiAction, server, numAttempt, t, clearServerCache);
+receiveGlobalFailure(multiAction, server, numAttempt, t);
 }
 }
 }
@@ -764,13 +763,18 @@ private void failAll(MultiAction actions, ServerName server, int numAttempt,
 * @param t the throwable (if any) that caused the resubmit
 */
 private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int numAttempt,
-Throwable t, boolean clearServerCache) {
+Throwable t) {
 errorsByServer.reportServerError(server);
 Retry canRetry = errorsByServer.canTryMore(numAttempt) ? Retry.YES : Retry.NO_RETRIES_EXHAUSTED;
+boolean clearServerCache = ClientExceptionsUtil.isMetaClearingException(t);

 // Do not update cache if exception is from failing to submit action to thread pool
 if (clearServerCache) {
 cleanServerCache(server, t);
+
+if (LOG.isTraceEnabled()) {
Member

Should we tick a client-side metric here too?

Contributor

Hernan looks to have that covered through updateCachedLocations - https://github.com/apache/hbase/pull/6961/files#r2096239861

+LOG.trace("Cleared meta cache for server {} due to global failure {}", server, t);
+}
 }

 int failed = 0;
@@ -779,12 +783,8 @@ private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int
 for (Map.Entry<byte[], List<Action>> e : rsActions.actions.entrySet()) {
 byte[] regionName = e.getKey();
 byte[] row = e.getValue().get(0).getAction().getRow();
-// Do not use the exception for updating cache because it might be coming from
-// any of the regions in the MultiAction and do not update cache if exception is
-// from failing to submit action to thread pool
 if (clearServerCache) {
-updateCachedLocations(server, regionName, row,
Contributor

Hey @hgromer. From what I understand, the ClientExceptionsUtil.isMetaClearingException(t) ? null : t argument is what prevents the meta cache clear from happening deeper in updateCachedLocations for exceptions that should not clear the meta cache. If it's a meta cache clearing exception, we pass null, bypass that check, and clear the meta cache. If it's not, we pass the exception itself, and the check inside updateCachedLocations skips the clear.

I may have misunderstood or missed something. Could you add a test showing that, on a batch operation, a non meta cache clearing exception causes a meta cache clear through receiveGlobalFailure? There are some existing meta cache clear behavior tests you could reference.

Contributor Author

Ah you're right. This code path is really tricky to reason about. Between the UnknownExceptions in the meta cache clear metrics, and the lack of logging, it's really difficult to identify what is causing these meta cache clears. I think I'm going to re-purpose this PR to simply add some logging + avoid passing in null as the meta exception in this code path. That should hopefully help us shed some light on the high number of meta cache clears we're seeing at my company, and will illuminate a path forward.

@droudnitsky does that make sense to you?

Contributor

Yeah, AsyncRequestFutureImpl is tricky. I have spent more time than I care to admit trying to understand it better. Passing the exception to updateCachedLocations so it gets accounted for in the cache clearing exception metric sounds good. That is a nice improvement and makes sense to me 👍
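
The equivalence discussed in this thread can be illustrated with a toy model. This is only a sketch under assumptions, not HBase code: CacheClearSketch, FakeMetaCache, the static updateCachedLocations stand-in, oldCallShape, newCallShape and the IS_META_CLEARING predicate are all hypothetical, and the null handling follows the description in the comments above rather than the real method.

```java
import java.io.IOException;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Predicate;

// Toy model of the thread above. Every name here is a hypothetical stand-in, not HBase code.
class CacheClearSketch {
  // Assumed predicate: only thread pool rejections preserve the cache in this model.
  static final Predicate<Throwable> IS_META_CLEARING =
    t -> !(t instanceof RejectedExecutionException);

  // Counts clears instead of holding real region locations.
  static final class FakeMetaCache {
    int clears = 0;

    void clear() {
      clears++;
    }
  }

  // Stand-in for the check described above: null skips the check and clears,
  // otherwise the predicate decides whether the cache entry is dropped.
  static void updateCachedLocations(FakeMetaCache cache, Throwable t) {
    if (t == null || IS_META_CLEARING.test(t)) {
      cache.clear();
    }
  }

  // Old call shape: a meta clearing exception is replaced by null before the call.
  static int oldCallShape(Throwable t) {
    FakeMetaCache cache = new FakeMetaCache();
    updateCachedLocations(cache, IS_META_CLEARING.test(t) ? null : t);
    return cache.clears;
  }

  // New call shape: the exception is always passed through unchanged.
  static int newCallShape(Throwable t) {
    FakeMetaCache cache = new FakeMetaCache();
    updateCachedLocations(cache, t);
    return cache.clears;
  }

  public static void main(String[] args) {
    Throwable[] samples = {
      new IOException("rpc failed"), new RejectedExecutionException("pool full") };
    for (Throwable t : samples) {
      System.out.println(t.getClass().getSimpleName() + ": old=" + oldCallShape(t)
        + " new=" + newCallShape(t));
    }
  }
}
```

In this model both call shapes clear the cache for exactly the same inputs. The only difference is that the new shape lets the callee see the exception type, which is what matters for the metrics.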

-ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
+updateCachedLocations(server, regionName, row, t);
Contributor Author

@hgromer, May 6, 2025

This also solves the frustration of seeing "UnknownException" when inspecting meta cache clear exception metrics. This has made it quite difficult to track down what triggered the meta cache clear.

I think it's always better to provide more context than less. Even if an exception is meta cache clearing (though it will be now), I'd still prefer to know the exact exception type that cleared the meta cache.

Contributor

+1 would be good to preserve the exception for updateCachedLocations

Since we currently pass null to updateCachedLocations if we have a meta cache clearing exception, does that mean that we never update the cache clearing exception metric properly for cache clears coming from receiveGlobalFailure?

Contributor Author

What we'll do is basically "mask" the cache clearing exception by reporting an UnknownException. The code for that lives in the metrics class. It's annoying because that, coupled with the lack of any logging in this code path, makes it really difficult to determine what caused these meta cache clears.
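
A minimal sketch of how a null exception can end up counted as an UnknownException in per-exception counters. CacheDropMetricsSketch, metricKey, recordCacheDrop and count are hypothetical names; the real logic lives in the metrics class mentioned above and may differ in detail.

```java
import java.net.ConnectException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-exception counters for cache drops; not the HBase metrics class.
class CacheDropMetricsSketch {
  static final String UNKNOWN = "UnknownException";

  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  // A null exception carries no type information, so it falls back to one shared bucket.
  static String metricKey(Throwable t) {
    return t == null ? UNKNOWN : t.getClass().getSimpleName();
  }

  // Increment the counter for the bucket this exception maps to.
  void recordCacheDrop(Throwable t) {
    counters.computeIfAbsent(metricKey(t), k -> new LongAdder()).increment();
  }

  // Read a counter; buckets that were never incremented report zero.
  long count(String key) {
    LongAdder adder = counters.get(key);
    return adder == null ? 0L : adder.sum();
  }

  public static void main(String[] args) {
    CacheDropMetricsSketch metrics = new CacheDropMetricsSketch();
    metrics.recordCacheDrop(null); // caller replaced the real exception with null
    metrics.recordCacheDrop(new ConnectException("refused")); // caller passed it through
    System.out.println(UNKNOWN + "=" + metrics.count(UNKNOWN));
    System.out.println("ConnectException=" + metrics.count("ConnectException"));
  }
}
```

Passing the original exception through keeps the bucket name informative, which is the motivation given in this thread.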

 }
 for (Action action : e.getValue()) {
 Retry retry =

@@ -26,6 +26,7 @@
 import java.net.SocketTimeoutException;
 import java.nio.channels.ClosedChannelException;
 import java.util.Set;
+import java.util.concurrent.RejectedExecutionException;
 import java.util.concurrent.TimeoutException;
 import org.apache.hadoop.hbase.CallDroppedException;
 import org.apache.hadoop.hbase.CallQueueTooBigException;
@@ -56,8 +57,8 @@ public static boolean isMetaClearingException(Throwable cur) {
 if (cur == null) {
 return true;
 }
-return !isSpecialException(cur) || (cur instanceof RegionMovedException)
-|| cur instanceof NotServingRegionException;
+return (!isExecutorException(cur) && !isSpecialException(cur))
+|| (cur instanceof RegionMovedException) || cur instanceof NotServingRegionException;
 }

 public static boolean isSpecialException(Throwable cur) {
@@ -177,4 +178,8 @@ public static Throwable translatePFFE(Throwable t) throws IOException {
 }
 return t;
 }
+
+private static boolean isExecutorException(Throwable t) {
+return RejectedExecutionException.class.isAssignableFrom(t.getClass());
+}
 }
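
For reference, on a non-null argument RejectedExecutionException.class.isAssignableFrom(t.getClass()) gives the same answer as t instanceof RejectedExecutionException, subclasses included. A minimal sketch, where ExecutorCheckSketch, viaIsAssignableFrom, viaInstanceof and PoolFullException are hypothetical names used only for illustration:

```java
import java.io.IOException;
import java.util.concurrent.RejectedExecutionException;

// Compares the two ways of writing the executor check. Names are illustrative only.
class ExecutorCheckSketch {
  // Hypothetical subclass, used to show that both forms also match subtypes.
  static class PoolFullException extends RejectedExecutionException {
    PoolFullException(String message) {
      super(message);
    }
  }

  // Same expression as the helper in the diff above. Throws NullPointerException on null.
  static boolean viaIsAssignableFrom(Throwable t) {
    return RejectedExecutionException.class.isAssignableFrom(t.getClass());
  }

  // Equivalent check for non-null input. Returns false on null instead of throwing.
  static boolean viaInstanceof(Throwable t) {
    return t instanceof RejectedExecutionException;
  }

  public static void main(String[] args) {
    Throwable[] samples = {
      new RejectedExecutionException("rejected"),
      new PoolFullException("pool full"),
      new IOException("rpc failed") };
    for (Throwable t : samples) {
      System.out.println(t.getClass().getSimpleName() + ": isAssignableFrom="
        + viaIsAssignableFrom(t) + " instanceof=" + viaInstanceof(t));
    }
  }
}
```

The two forms differ only on null, and in the diff above isMetaClearingException returns early for a null argument before the helper is reached.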

@@ -27,6 +27,7 @@
 import java.util.Arrays;
 import java.util.List;
 import java.util.concurrent.ExecutionException;
+import java.util.concurrent.RejectedExecutionException;
 import java.util.concurrent.TimeUnit;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
@@ -397,7 +398,7 @@ public static List<Throwable> metaCachePreservingExceptions() {
 return Arrays.asList(new RegionOpeningException(" "),
 new RegionTooBusyException("Some old message"), new RpcThrottlingException(" "),
 new MultiActionResultTooLarge(" "), new RetryImmediatelyException(" "),
-new CallQueueTooBigException());
+new CallQueueTooBigException(), new RejectedExecutionException(" "));
 }

 public static class RegionServerWithFakeRpcServices extends HRegionServer {