-
Notifications
You must be signed in to change notification settings - Fork 3.3k
HBASE-29216 Recovered replication stuck , when enabled hbase.separate… #6856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: branch-2
Are you sure you want to change the base?
Conversation
@@ -448,10 +497,6 @@ public static boolean isArchivedLogFile(Path p) { | |||
* @throws IOException exception | |||
*/ | |||
public static Path findArchivedLog(Path path, Configuration conf) throws IOException { | |||
// If the path contains oldWALs keyword then exit early. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the problem of this if condition?
I think the intention here is that, if the Path is already under the oldWALs directory, we do not need to find it again. In your description, the path should be under the WALs directory, not oldWALs directory? So what is the real problem here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinks @Apache9 for Review.
The main reason is that when we close regionserver, WALs will be moved to oldWALs, at the same time when enabled “hbase.separate.oldlogdir.by.regionserver” , It is moved to the "oldWALs/servername" directory. The RecoverdNode cannot be found it.
Closed regionserver, move wal funcation is "HRegionServer::shutdownWAL()"
The current findArchiveLog does not check for "oldWALs/servername", The WAL file not being found.
This problem can be replicated in existing unit tests. Add "conf.setBoolean(SEPARATE_OLDLOGDIR, true);" to the UT "TestReplicationSource::testServerShutdownRecoveredQueue"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code for checking seperated old wal is right after the check here?
You do not get my point, the condition here is to avoid redundant checking for old wal files, so why you need to remove this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add this check, the correct location of the WAL file will not be found when the separated oldWALs feature is turned on。
In the process of restoring a WAL, execution node will attempt to open a WAL file, but because a WAL file has been moved to "/hbase/oldWALs/crash_servername_dir/“, In the "AbstractProtobufWALReader.open", will execute the "openArchiveWAL" calls, and into the "findArchivedLong"
If the incoming path contains "oldWALs", null is returned directly. The actual file location in the "/hbase/oldWALs/crash_servername_dir/”
When we cancel this check, we will be able to take advantage of the old directory lookup logic that starts at line 472
// some codes
oldLogDir = new Path(walRootDir, new StringBuilder(HConstants.HREGION_OLDLOGDIR_NAME)
.append(Path.SEPARATOR).append(serverName.getServerName()).toString());
archivedLogLocation = new Path(oldLogDir, path.getName());
if (fs.exists(archivedLogLocation)) {
LOG.info("Log " + path + " was moved to " + archivedLogLocation);
return archivedLogLocation;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{code}
protected final Pair<FSDataInputStream, FileStatus> open() throws IOException {
try {
return Pair.newPair(fs.open(path), fs.getFileStatus(path));
} catch (FileNotFoundException e) {
Pair<FSDataInputStream, FileStatus> pair = openArchivedWAL();
if (pair != null) {
return pair;
} else {
throw e;
}
} catch (RemoteException re) {
IOException ioe = re.unwrapRemoteException(FileNotFoundException.class);
if (!(ioe instanceof FileNotFoundException)) {
throw ioe;
}
Pair<FSDataInputStream, FileStatus> pair = openArchivedWAL();
if (pair != null) {
return pair;
} else {
throw ioe;
}
}
}
{code}
This is the only method where we call openArchivedWAL, and the design here is that, the path
is under the normal WAL directory, and if we can not find it, we will go into the archived wal directory, i.e, oldWALs directory to find it. So we should not pass a Path which is already under the oldWALs to findArchivedLog method. This is my point.
So under which condition, the path
here could be a Path which is already under the oldWALs directory? Maybe we need to fix the problem there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the Review and suggestion, We really should unify the logic and design of finding WAL paths. Let me look at why the path which is already under the oldWALs dircetory
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
76b511b
to
f0f9ae2
Compare
This comment has been minimized.
This comment has been minimized.
@@ -463,6 +509,12 @@ public static Path findArchivedLog(Path path, Configuration conf) throws IOExcep | |||
} | |||
|
|||
ServerName serverName = getServerNameFromWALDirectoryName(path); | |||
if (serverName == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And why we need to add this logic? I think the design for this method is that, we will only pass a WAL path which is not under the old wal directory, but seems your fix is for passing a WAL path which is already under the old wal directory. So under which condition we could pass a WAL path which is ready under the old wal directory for finding the old wal file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original implementation of this function, the serverName was passed in directly from the outside and could easily locate the separated oldWALs directory. In the current version of the implementation, we can parse the serverName by the directory name or file name.
There are currently two types of incoming paths
1)For Wals
/hbase/WALs/server-name/some...wals
/hbase/WALs/server-name-splitting/some...wals
2) For oldWALs
/hbase/oldWALs/regionserver-130%2C16020%2C1742659271913.regionserver-130%2C16020%2C1742659271913.regiongroup-0.1742659287672
In the oldWALs path, we can resolve serverName by file name .
And of course we can also talk about whether it's necessary to fix it the way the serverName was passed in.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
f0f9ae2
to
c168593
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
c168593
to
fee94ff
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…oldlogdir.by.regionserver
fee94ff
to
dd64dfc
Compare
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Jenkins retest please |
@@ -104,7 +106,30 @@ public void locateRecoveredPaths(String walGroupId) throws IOException { | |||
// didn't find a new location | |||
LOG.error( | |||
String.format("WAL Path %s doesn't exist and couldn't find its new location", path)); | |||
newPaths.add(path); | |||
Path walPath = path; | |||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for RecoveredReplicationSource, we could get a Path under the oldWALs directory? This is by design?
Recovered replication stuck , when enabled “hbase.separate.oldlogdir.by.regionserver”
The WAL location cannot be found after the configuration is enabled.
The execution logic looks like this
{wal-filename} moves to /hbase/oldWALs/servername/{wal-filename}
To solve this problem, we can try to improve the findArchiveLog method