HBASE-29216 Recovered replication stuck , when enabled hbase.separate… #6856

Open: Yiran-wu wants to merge 1 commit into base: branch-2 from HBASE_29216_branch2

@Yiran-wu (Contributor) commented Mar 25, 2025

Recovered replication gets stuck when “hbase.separate.oldlogdir.by.regionserver” is enabled

Once this configuration is enabled, the WAL location can no longer be found.

The sequence of events is:

  1. Set “hbase.separate.oldlogdir.by.regionserver” to true.
  2. Restart the RegionServer. The write-ahead log (WAL) is moved from /hbase/WALs/servername/{wal-filename} to /hbase/oldWALs/servername/{wal-filename}.
  3. WALEntryStream tries to locate the archived log using AbstractFSWALProvider.findArchivedLog, which does not find it.

To solve this problem, we can improve the findArchivedLog method.
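
The relocation in step 2 can be sketched as follows. This is a minimal illustration that uses plain strings instead of HBase's own classes; the class name `ArchiveLocationSketch` and the method `archivedLocation` are hypothetical, and the directory names are the ones used in the description above.

```java
/** Hypothetical sketch: where a WAL file ends up once it has been archived. */
final class ArchiveLocationSketch {

  /**
   * Returns the archive location of a WAL file.
   *
   * @param walRoot          WAL root directory, e.g. "/hbase"
   * @param serverName       name of the RegionServer directory under WALs
   * @param walFileName      file name of the WAL
   * @param separateByServer models hbase.separate.oldlogdir.by.regionserver
   */
  static String archivedLocation(String walRoot, String serverName, String walFileName,
      boolean separateByServer) {
    String oldLogDir = walRoot + "/oldWALs";
    if (separateByServer) {
      // With the flag on, each server gets its own subdirectory under oldWALs.
      oldLogDir = oldLogDir + "/" + serverName;
    }
    return oldLogDir + "/" + walFileName;
  }

  public static void main(String[] args) {
    // Flag off: the file goes directly under oldWALs.
    System.out.println(archivedLocation("/hbase", "servername", "wal-filename", false));
    // Flag on: the file goes under oldWALs/servername.
    System.out.println(archivedLocation("/hbase", "servername", "wal-filename", true));
  }
}
```

Any lookup that only probes the flat oldWALs directory would miss the second location, which is the situation this PR is about.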

@@ -448,10 +497,6 @@ public static boolean isArchivedLogFile(Path p) {
* @throws IOException exception
*/
public static Path findArchivedLog(Path path, Configuration conf) throws IOException {
// If the path contains oldWALs keyword then exit early.
Contributor

What is the problem with this if condition?

I think the intention here is that if the Path is already under the oldWALs directory, we do not need to look for it again. In your description, the path should be under the WALs directory, not the oldWALs directory. So what is the real problem here?

Contributor Author

Thanks @Apache9 for the review.

The main reason is that when a RegionServer shuts down, its WALs are moved to oldWALs. When “hbase.separate.oldlogdir.by.regionserver” is enabled, they are moved to the "oldWALs/servername" directory instead, and the recovering node cannot find them there.

On RegionServer shutdown, the function that moves the WALs is "HRegionServer::shutdownWAL()".

The current findArchivedLog does not check "oldWALs/servername", so the WAL file is not found.

This problem can be reproduced with an existing unit test: add "conf.setBoolean(SEPARATE_OLDLOGDIR, true);" to the UT "TestReplicationSource::testServerShutdownRecoveredQueue".

Contributor

Isn't the code that checks the separated old WAL directory right after this check?

You did not get my point: the condition here avoids redundant checking for old WAL files, so why do you need to remove this check?

Contributor Author

If we keep this check, the correct location of the WAL file will not be found when the separated oldWALs feature is turned on.

While recovering a WAL, the executing node attempts to open the WAL file, but the file has already been moved to "/hbase/oldWALs/crash_servername_dir/". "AbstractProtobufWALReader.open" therefore calls "openArchivedWAL", which in turn calls "findArchivedLog".

If the incoming path contains "oldWALs", null is returned immediately, even though the file is actually located in "/hbase/oldWALs/crash_servername_dir/".

If we remove this check, we can take advantage of the existing old-directory lookup logic that starts at line 472:

// ... preceding code omitted ...
oldLogDir = new Path(walRootDir, new StringBuilder(HConstants.HREGION_OLDLOGDIR_NAME)
  .append(Path.SEPARATOR).append(serverName.getServerName()).toString());
archivedLogLocation = new Path(oldLogDir, path.getName());
if (fs.exists(archivedLogLocation)) {
  LOG.info("Log " + path + " was moved to " + archivedLogLocation);
  return archivedLogLocation;
}
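
To make the effect of the early exit concrete, here is a simplified model of the lookup. It is a sketch only: the "file system" is an in-memory set of path strings, and `ArchivedLogLookupSketch`, `findArchived`, and the `earlyExitOnOldWALs` switch are hypothetical names, not the real `AbstractFSWALProvider` code.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical model of an archived-WAL lookup with an optional early exit. */
final class ArchivedLogLookupSketch {

  /**
   * Returns the archived location of {@code path}, or null if none is found.
   * {@code existing} stands in for the file system.
   */
  static String findArchived(String path, String serverName, Set<String> existing,
      boolean earlyExitOnOldWALs) {
    if (earlyExitOnOldWALs && path.contains("/oldWALs/")) {
      // Models the check under discussion: paths already under oldWALs are not searched.
      return null;
    }
    String fileName = path.substring(path.lastIndexOf('/') + 1);
    // Probe the flat archive directory first, then the per-server subdirectory.
    String flat = "/hbase/oldWALs/" + fileName;
    if (existing.contains(flat)) {
      return flat;
    }
    String separated = "/hbase/oldWALs/" + serverName + "/" + fileName;
    if (existing.contains(separated)) {
      return separated;
    }
    return null;
  }

  public static void main(String[] args) {
    // The only copy of the WAL lives in the per-server archive directory.
    Set<String> fs = Collections.singleton("/hbase/oldWALs/rs1/wal.1");
    System.out.println(findArchived("/hbase/oldWALs/wal.1", "rs1", fs, true));
    System.out.println(findArchived("/hbase/oldWALs/wal.1", "rs1", fs, false));
    System.out.println(findArchived("/hbase/WALs/rs1/wal.1", "rs1", fs, true));
  }
}
```

In this model the early exit only hurts callers that pass a path already under oldWALs; a path under WALs is resolved either way, which is why the review asks where such a path comes from.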

Contributor

{code}
  protected final Pair<FSDataInputStream, FileStatus> open() throws IOException {
    try {
      return Pair.newPair(fs.open(path), fs.getFileStatus(path));
    } catch (FileNotFoundException e) {
      Pair<FSDataInputStream, FileStatus> pair = openArchivedWAL();
      if (pair != null) {
        return pair;
      } else {
        throw e;
      }
    } catch (RemoteException re) {
      IOException ioe = re.unwrapRemoteException(FileNotFoundException.class);
      if (!(ioe instanceof FileNotFoundException)) {
        throw ioe;
      }
      Pair<FSDataInputStream, FileStatus> pair = openArchivedWAL();
      if (pair != null) {
        return pair;
      } else {
        throw ioe;
      }
    }
  }
{code}

This is the only method where we call openArchivedWAL. The design is that the path is under the normal WAL directory, and if we cannot find it there, we look in the archived WAL directory, i.e. the oldWALs directory. So we should not pass a Path that is already under oldWALs to the findArchivedLog method. This is my point.

So under which condition could the path here already be under the oldWALs directory? Maybe we need to fix the problem there.
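
The "try the live location, then fall back to the archive" design described here can be illustrated with a small stand-alone sketch. It uses `java.nio.file` on the local file system rather than Hadoop's `FileSystem`, and `FallbackOpenSketch` and `read` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Hypothetical sketch of opening a WAL with a fallback to the archive directory. */
final class FallbackOpenSketch {

  /** Reads the file at its live location, falling back to {@code archiveDir}. */
  static String read(Path livePath, Path archiveDir) throws IOException {
    try {
      return new String(Files.readAllBytes(livePath), StandardCharsets.UTF_8);
    } catch (NoSuchFileException e) {
      // The caller always passes the live path; only here do we look in the archive.
      Path archived = archiveDir.resolve(livePath.getFileName());
      if (Files.exists(archived)) {
        return new String(Files.readAllBytes(archived), StandardCharsets.UTF_8);
      }
      // Not in the archive either: surface the original failure.
      throw e;
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("wal-sketch");
    Path walDir = Files.createDirectories(root.resolve("WALs").resolve("rs1"));
    Path oldDir = Files.createDirectories(root.resolve("oldWALs"));
    Files.write(oldDir.resolve("wal.1"), "entries".getBytes(StandardCharsets.UTF_8));
    // The live file is gone, so the content is read from the archive.
    System.out.println(read(walDir.resolve("wal.1"), oldDir));
  }
}
```

Under this design the lookup never needs to handle an input that is already an archive path, which is the invariant the reviewer wants to preserve.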

Contributor Author

Thanks for the review and suggestion. We really should unify the logic and design for finding WAL paths. Let me look into why the path is already under the oldWALs directory.

@Yiran-wu Yiran-wu force-pushed the HBASE_29216_branch2 branch from 76b511b to f0f9ae2 Compare April 2, 2025 10:03

@@ -463,6 +509,12 @@ public static Path findArchivedLog(Path path, Configuration conf) throws IOExcep
}

ServerName serverName = getServerNameFromWALDirectoryName(path);
if (serverName == null) {
Contributor

And why do we need to add this logic? I think the design of this method is that we only pass a WAL path that is not under the old WAL directory, but your fix seems to be for passing a WAL path that is already under the old WAL directory. So under which condition could we pass a WAL path that is already under the old WAL directory when looking for the old WAL file?

Contributor Author

In the original implementation of this function, the serverName was passed in directly from outside, so the separated oldWALs directory could be located easily. In the current implementation, we parse the serverName from the directory name or the file name.

There are currently two types of incoming paths:
1) For WALs
/hbase/WALs/server-name/some...wals
/hbase/WALs/server-name-splitting/some...wals
2) For oldWALs
/hbase/oldWALs/regionserver-130%2C16020%2C1742659271913.regionserver-130%2C16020%2C1742659271913.regiongroup-0.1742659287672

For an oldWALs path, we can resolve the serverName from the file name.

And of course we can also discuss whether it would be better to fix this by passing the serverName in, as the original implementation did.
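
As an illustration of resolving the serverName from a file name like the one above, here is a minimal sketch. It assumes the file name starts with the URL-encoded server name in host,port,startcode form, as in the example path; `ServerNameFromWalFileSketch` and `parse` are hypothetical names, and this is not the parsing code from the patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract "host,port,startcode" from a WAL file name. */
final class ServerNameFromWalFileSketch {

  // Assumed layout: host,port,startcode at the start of the decoded file name.
  private static final Pattern SERVER_NAME = Pattern.compile("^([^,/]+),(\\d+),(\\d+)");

  /** Returns the server name prefix, or null if the file name does not match. */
  static String parse(String walFileName) {
    String decoded;
    try {
      // %2C in the file name decodes to a comma.
      decoded = URLDecoder.decode(walFileName, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
    Matcher m = SERVER_NAME.matcher(decoded);
    return m.find() ? m.group() : null;
  }

  public static void main(String[] args) {
    String name = "regionserver-130%2C16020%2C1742659271913."
        + "regionserver-130%2C16020%2C1742659271913.regiongroup-0.1742659287672";
    System.out.println(parse(name));
  }
}
```

Deriving the server name from the file name keeps the method signature unchanged, at the cost of depending on the file naming scheme; passing the serverName in explicitly avoids that dependency.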

@Yiran-wu Yiran-wu force-pushed the HBASE_29216_branch2 branch from f0f9ae2 to c168593 Compare April 3, 2025 11:03

@Yiran-wu Yiran-wu force-pushed the HBASE_29216_branch2 branch from c168593 to fee94ff Compare April 8, 2025 06:47

@Yiran-wu Yiran-wu force-pushed the HBASE_29216_branch2 branch from fee94ff to dd64dfc Compare April 9, 2025 06:26
@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 4m 7s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ branch-2 Compile Tests _
+1 💚 mvninstall 5m 20s branch-2 passed
+1 💚 compile 4m 12s branch-2 passed
+1 💚 checkstyle 0m 54s branch-2 passed
+1 💚 spotbugs 2m 30s branch-2 passed
+1 💚 spotless 1m 8s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+1 💚 mvninstall 4m 44s the patch passed
+1 💚 compile 3m 58s the patch passed
+1 💚 javac 3m 58s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 56s the patch passed
+1 💚 spotbugs 2m 24s the patch passed
+1 💚 hadoopcheck 22m 21s Patch does not cause any errors with Hadoop 2.10.2 or 3.3.6 3.4.0.
+1 💚 spotless 1m 11s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 15s The patch does not generate ASF License warnings.
56m 41s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #6856
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux cc7660a313d8 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / dd64dfc
Default Java Eclipse Adoptium-11.0.23+9
Max. process+thread count 79 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 34m 18s Docker mode activated.
-0 ⚠️ yetus 0m 6s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 27m 26s branch-2 passed
+1 💚 compile 1m 4s branch-2 passed
+1 💚 javadoc 1m 6s branch-2 passed
+1 💚 shadedjars 8m 16s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 8m 1s the patch passed
+1 💚 compile 1m 7s the patch passed
+1 💚 javac 1m 7s the patch passed
+1 💚 javadoc 0m 33s the patch passed
+1 💚 shadedjars 7m 34s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
-1 ❌ unit 30m 31s /patch-unit-hbase-server.txt hbase-server in the patch failed.
123m 17s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #6856
Optional Tests javac javadoc unit compile shadedjars
uname Linux 04fddeed6ae7 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / dd64dfc
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/testReport/
Max. process+thread count 1790 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 9m 45s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 4m 39s branch-2 passed
+1 💚 compile 1m 6s branch-2 passed
+1 💚 javadoc 0m 38s branch-2 passed
+1 💚 shadedjars 7m 42s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 4m 32s the patch passed
+1 💚 compile 1m 20s the patch passed
+1 💚 javac 1m 20s the patch passed
+1 💚 javadoc 0m 36s the patch passed
+1 💚 shadedjars 8m 34s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
-1 ❌ unit 252m 34s /patch-unit-hbase-server.txt hbase-server in the patch failed.
297m 13s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #6856
Optional Tests javac javadoc unit compile shadedjars
uname Linux cc4f91183892 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / dd64dfc
Default Java Eclipse Adoptium-11.0.23+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/testReport/
Max. process+thread count 4049 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 47m 14s Docker mode activated.
-0 ⚠️ yetus 0m 6s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 26m 46s branch-2 passed
+1 💚 compile 0m 42s branch-2 passed
+1 💚 javadoc 0m 56s branch-2 passed
+1 💚 shadedjars 7m 56s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 7m 11s the patch passed
+1 💚 compile 0m 46s the patch passed
+1 💚 javac 0m 46s the patch passed
+1 💚 javadoc 0m 27s the patch passed
+1 💚 shadedjars 6m 37s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
-1 ❌ unit 381m 49s /patch-unit-hbase-server.txt hbase-server in the patch failed.
491m 0s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #6856
Optional Tests javac javadoc unit compile shadedjars
uname Linux 644c561261f9 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / dd64dfc
Default Java Temurin-1.8.0_412-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/testReport/
Max. process+thread count 4381 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6856/5/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Yiran-wu
Contributor Author

Jenkins retest please

@@ -104,7 +106,30 @@ public void locateRecoveredPaths(String walGroupId) throws IOException {
// didn't find a new location
LOG.error(
String.format("WAL Path %s doesn't exist and couldn't find its new location", path));
newPaths.add(path);
Path walPath = path;
if (
Contributor

So for RecoveredReplicationSource, we could get a Path under the oldWALs directory? Is this by design?

3 participants