HBASE-29265: Batch calls to overloaded cluster can cause meta hotspotting #6961

Open: wants to merge 1 commit into base branch-2

Contributor

@hgromer hgromer commented May 5, 2025

No description provided.

@hgromer hgromer marked this pull request as draft May 5, 2025 17:53
@hgromer hgromer changed the title from "HBASE-29265: Operation timeouts can create a pathological feedback loop with multigets" to "HBASE-29265: Batch calls to overloaded cluster can cause meta hotspotting" May 5, 2025
@hgromer hgromer marked this pull request as ready for review May 6, 2025 14:11
@@ -783,8 +787,7 @@ private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int
   // any of the regions in the MultiAction and do not update cache if exception is
   // from failing to submit action to thread pool
   if (clearServerCache) {
-    updateCachedLocations(server, regionName, row,
-      ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
+    updateCachedLocations(server, regionName, row, t);
Contributor Author

@hgromer hgromer May 6, 2025

This also solves the frustration of seeing "UnknownException" when inspecting meta cache clear exception metrics. This has made it quite difficult to track down what triggered the meta cache clear.

I think it's always better to provide more context than less. Even if an exception is meta cache clearing (though it will be now), I'd still prefer to know the exact exception type that cleared the meta cache.

Contributor

+1, it would be good to preserve the exception for updateCachedLocations.

Since we currently pass null to updateCachedLocations if we have a meta cache clearing exception, does that mean we never update the cache clearing exception metric properly for cache clears coming from receiveGlobalFailure?

Contributor Author

What we end up doing is basically "masking" the cache clearing exception by reporting an UnknownException. The code for that lives in the metrics class. It's annoying because that, coupled with the lack of any logging in this code path, makes it really difficult to determine what caused these meta cache clears.
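
To make the masking concrete, here is a minimal standalone sketch of a counter keyed by exception type. It is a simplified model rather than the HBase metrics class: CacheClearCounters, record, and count are hypothetical names, and the only behavior taken from the discussion is that a null exception ends up reported as UnknownException.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-exception counter for meta cache clears; not HBase code. */
final class CacheClearCounters {
  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  /** Records one cache clear; a null cause is bucketed under "UnknownException". */
  void record(Throwable cause) {
    String key = (cause == null) ? "UnknownException" : cause.getClass().getSimpleName();
    counters.computeIfAbsent(key, k -> new LongAdder()).increment();
  }

  /** Returns how many clears were recorded under the given key. */
  long count(String key) {
    LongAdder adder = counters.get(key);
    return adder == null ? 0L : adder.sum();
  }

  public static void main(String[] args) {
    CacheClearCounters metrics = new CacheClearCounters();
    metrics.record(null);                                     // caller substituted null
    metrics.record(new java.net.ConnectException("refused")); // caller passed the exception
    System.out.println(metrics.count("UnknownException"));    // prints 1
    System.out.println(metrics.count("ConnectException"));    // prints 1
  }
}
```

In this model, the first call loses the exception type entirely, which is the diagnostic gap the comment describes.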

Contributor Author

hgromer commented May 6, 2025

cc @ndimiduk @rmdmattingly @krconv

errorsByServer.reportServerError(server);
Retry canRetry = errorsByServer.canTryMore(numAttempt) ? Retry.YES : Retry.NO_RETRIES_EXHAUSTED;
boolean clearServerCache = false;

if (!(t instanceof RejectedExecutionException)) {
Contributor Author

Enforces the constraints added in https://issues.apache.org/jira/browse/HBASE-27491

Member

I think it would be better if you instead push RejectedExecutionException down into ClientExceptionsUtil.isMetaClearingException.

How about adding another collection of execution-exceptions for the family of various ExecutorService interaction errors, like is done with networking/connection exceptions?
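
One way to read this suggestion is to keep a small collection of executor-related exception types and check a throwable, together with its cause chain, against it, so that a failed thread-pool submission is classified in one place. The sketch below is a standalone illustration only: the class ExecutorExceptionsSketch, the method isExecutionException, and the set contents are assumptions, not the actual ClientExceptionsUtil code. RejectedExecutionException is the only member because it is the only type named in the review.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.RejectedExecutionException;

/** Hypothetical helper illustrating the suggestion; not part of HBase. */
final class ExecutorExceptionsSketch {
  // Assumed membership: only the exception type mentioned in the review.
  private static final Set<Class<? extends Throwable>> EXECUTION_EXCEPTION_TYPES = new HashSet<>();
  static {
    EXECUTION_EXCEPTION_TYPES.add(RejectedExecutionException.class);
  }

  private ExecutorExceptionsSketch() {
  }

  /** Returns true if t, or any exception in its cause chain, is an executor-interaction error. */
  static boolean isExecutionException(Throwable t) {
    // Bound the walk so a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
      for (Class<? extends Throwable> type : EXECUTION_EXCEPTION_TYPES) {
        if (type.isInstance(t)) {
          return true;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Throwable wrapped =
      new RuntimeException("submit failed", new RejectedExecutionException("pool full"));
    System.out.println(isExecutionException(wrapped));                         // prints true
    System.out.println(isExecutionException(new java.io.IOException("reset"))); // prints false
  }
}
```

A caller could then consult such a predicate instead of an inline instanceof check, which keeps the classification next to the other exception families.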

Contributor Author

Noted, adding that

@@ -783,8 +787,7 @@ private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int
   // any of the regions in the MultiAction and do not update cache if exception is
   // from failing to submit action to thread pool
   if (clearServerCache) {
-    updateCachedLocations(server, regionName, row,
Contributor

Hey @hgromer. From what I understand, the ClientExceptionsUtil.isMetaClearingException(t) ? null : t expression is what prevents the meta cache clear from happening deeper in updateCachedLocations for exceptions that should not clear the meta cache. If it's a meta clearing exception, we pass null, bypass that check, and clear the meta cache. If it's not a meta clearing exception, we pass the exception, and the check in updateCachedLocations skips the meta cache clear.

I may have misunderstood or missed something. Could you add a test showing that, on a batch operation, a non meta cache clearing exception causes a meta cache clear through receiveGlobalFailure? There are some existing meta cache clear behavior tests you could reference.
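
The flow described above can be modeled in isolation. The sketch below is a simplified stand-in, not the HBase implementation: MetaCacheModel and its methods are hypothetical, and the isMetaClearing predicate is reduced to a single placeholder rule (IllegalStateException is treated as non-clearing) purely for illustration. Under those assumptions, it compares the call shape before the patch with the call shape after it.

```java
import java.io.IOException;

/** Simplified model of the cache-update decision; every name here is hypothetical. */
final class MetaCacheModel {
  private boolean cleared;

  /** Placeholder rule: IllegalStateException stands in for a non-clearing exception. */
  static boolean isMetaClearing(Throwable t) {
    return !(t instanceof IllegalStateException); // null counts as clearing
  }

  /** A null or meta-clearing exception drops the cached location; anything else keeps it. */
  void updateCachedLocations(Throwable t) {
    if (t != null && !isMetaClearing(t)) {
      return;
    }
    cleared = true;
  }

  boolean wasCleared() {
    return cleared;
  }

  /** Call shape before the patch: null is substituted for meta-clearing exceptions. */
  static boolean clearedWithOldCall(Throwable t) {
    MetaCacheModel cache = new MetaCacheModel();
    cache.updateCachedLocations(isMetaClearing(t) ? null : t);
    return cache.wasCleared();
  }

  /** Call shape after the patch: the exception is always passed through. */
  static boolean clearedWithNewCall(Throwable t) {
    MetaCacheModel cache = new MetaCacheModel();
    cache.updateCachedLocations(t);
    return cache.wasCleared();
  }

  public static void main(String[] args) {
    Throwable[] samples = { new IOException("connection reset"), new IllegalStateException("busy") };
    for (Throwable t : samples) {
      System.out.println(t.getClass().getSimpleName() + ": old=" + clearedWithOldCall(t)
        + " new=" + clearedWithNewCall(t));
    }
  }
}
```

In this model both call shapes reach the same clear-or-keep decision for every input, which matches the reviewer's reading that the ternary does not change whether the cache is cleared.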

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right. This code path is really tricky to reason about. Between the UnknownExceptions in the meta cache clear metrics, and the lack of logging, it's really difficult to identify what is causing these meta cache clears. I think I'm going to re-purpose this PR to simply add some logging + avoid passing in null as the meta exception in this code path. That should hopefully help us shed some light on the high number of meta cache clears we're seeing at my company, and will illuminate a path forward.

@droudnitsky does that make sense to you?

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ branch-2 Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for branch
+1 💚 mvninstall 3m 17s branch-2 passed
+1 💚 compile 3m 46s branch-2 passed
+1 💚 checkstyle 0m 57s branch-2 passed
+1 💚 spotbugs 2m 22s branch-2 passed
+1 💚 spotless 0m 48s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 16s Maven dependency ordering for patch
+1 💚 mvninstall 3m 3s the patch passed
+1 💚 compile 3m 48s the patch passed
+1 💚 javac 3m 48s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 55s the patch passed
+1 💚 spotbugs 2m 38s the patch passed
+1 💚 hadoopcheck 17m 54s Patch does not cause any errors with Hadoop 2.10.2 or 3.3.6 3.4.0.
+1 💚 spotless 0m 46s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 18s The patch does not generate ASF License warnings.
44m 9s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/3/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #6961
JIRA Issue HBASE-29265
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux 41409215adff 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 9c0220b
Default Java Eclipse Adoptium-11.0.23+9
Max. process+thread count 79 (vs. ulimit of 30000)
modules C: hbase-client hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/3/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 43s Docker mode activated.
-0 ⚠️ yetus 0m 6s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+0 🆗 mvndep 0m 20s Maven dependency ordering for branch
+1 💚 mvninstall 2m 32s branch-2 passed
+1 💚 compile 1m 0s branch-2 passed
+1 💚 javadoc 0m 41s branch-2 passed
+1 💚 shadedjars 5m 18s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for patch
+1 💚 mvninstall 2m 28s the patch passed
+1 💚 compile 1m 2s the patch passed
+1 💚 javac 1m 2s the patch passed
+1 💚 javadoc 0m 42s the patch passed
+1 💚 shadedjars 5m 18s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 8m 10s hbase-client in the patch passed.
-1 ❌ unit 19m 54s /patch-unit-hbase-server.txt hbase-server in the patch failed.
50m 37s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/3/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #6961
JIRA Issue HBASE-29265
Optional Tests javac javadoc unit compile shadedjars
uname Linux 97346deb8191 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 9c0220b
Default Java Temurin-1.8.0_412-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/3/testReport/
Max. process+thread count 1769 (vs. ulimit of 30000)
modules C: hbase-client hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/3/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 48s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+0 🆗 mvndep 0m 14s Maven dependency ordering for branch
+1 💚 mvninstall 3m 14s branch-2 passed
+1 💚 compile 1m 10s branch-2 passed
+1 💚 javadoc 0m 44s branch-2 passed
+1 💚 shadedjars 6m 25s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 17s Maven dependency ordering for patch
+1 💚 mvninstall 3m 14s the patch passed
+1 💚 compile 1m 11s the patch passed
+1 💚 javac 1m 11s the patch passed
+1 💚 javadoc 0m 42s the patch passed
+1 💚 shadedjars 6m 23s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 8m 22s hbase-client in the patch passed.
+1 💚 unit 218m 32s hbase-server in the patch passed.
256m 22s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/3/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #6961
JIRA Issue HBASE-29265
Optional Tests javac javadoc unit compile shadedjars
uname Linux d4cca9a572fa 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 9c0220b
Default Java Eclipse Adoptium-11.0.23+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/3/testReport/
Max. process+thread count 4464 (vs. ulimit of 30000)
modules C: hbase-client hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/3/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.


// Do not update cache if exception is from failing to submit action to thread pool
if (clearServerCache) {
cleanServerCache(server, t);

if (LOG.isTraceEnabled()) {
Member

Should we tick a client-side metric here too?
