HBASE-28951 rename the wal before retrying the wal-split with another worker #6534

Umeshkumar9414 · 2024-12-12T18:55:48Z

Added a worker-change counter so that the WAL gets a new name on each retry; this is needed in case the supposedly dead RS starts processing the WAL again after some time. I checked the WAL name pattern that we use for validating WAL names, (.+)\.(\d+)(\.[0-9A-Za-z]+)? , and this change fits it.
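
As a quick sanity check of the claim above, here is a minimal sketch that compiles the pattern exactly as quoted in the comment and tests a plain WAL name and one with a retry suffix. The class name, the helper names, and the sample WAL names are made up for illustration; only the regex comes from the comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WalNamePatternCheck {
  // Pattern copied from the comment above; everything else here is illustrative.
  static final Pattern WAL_NAME = Pattern.compile("(.+)\\.(\\d+)(\\.[0-9A-Za-z]+)?");

  /** Returns true if the whole name matches the quoted pattern. */
  static boolean isValid(String name) {
    return WAL_NAME.matcher(name).matches();
  }

  /** Returns the optional trailing suffix (third group), or null if absent or no match. */
  static String suffixOf(String name) {
    Matcher m = WAL_NAME.matcher(name);
    return m.matches() ? m.group(3) : null;
  }

  public static void main(String[] args) {
    // Hypothetical WAL names, not taken from a real cluster.
    String plain = "rs1%2C16020%2C1700000000000.1700000001234";
    String retried = plain + ".retrying001";
    System.out.println(plain + " valid=" + isValid(plain));
    System.out.println(retried + " valid=" + isValid(retried) + " suffix=" + suffixOf(retried));
  }
}
```

The retry suffix lands in the optional third group, which is why a name such as `<wal>.retrying001` still passes the check.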

Umeshkumar9414 · 2024-12-12T19:16:10Z

While going through the code I saw some comments and code that do not align. As per this comment, WALSplitter.splitLogFile should not be used in procedure-based splitting, but we are using that same method. We call SplitLogWorker.splitLog (which is deprecated) in SplitWALCallable, and inside this method here we call WALSplitter.splitLogFile.
Can someone help me understand this? Is this a miss, or is it known? cc @Apache9 , @apurtell

mnpoonia · 2024-12-13T05:10:24Z

@Umeshkumar9414 So the idea here is to have a retry counter attached to the WAL name, and whenever the WAL split fails and another worker picks up the same WAL, it increments the counter!!
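
The naming scheme described here can be sketched roughly as follows. This is not the code from the patch: `nextRetryName` is a hypothetical helper, and the three-digit counter width is an assumption inferred from the `- 3` in the diff quoted later in this review. Only `RETRYING_EXT` and its value come from the patch.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WalRetryName {
  // Value taken from the patch (AbstractFSWALProvider.RETRYING_EXT).
  static final String RETRYING_EXT = ".retrying";

  // Assumption: the counter is always exactly three digits, e.g. ".retrying001".
  private static final Pattern RETRY_SUFFIX =
    Pattern.compile("(.+)" + Pattern.quote(RETRYING_EXT) + "(\\d{3})");

  /** Hypothetical helper: returns the WAL name to use for the next worker. */
  static String nextRetryName(String walName) {
    Matcher m = RETRY_SUFFIX.matcher(walName);
    if (!m.matches()) {
      // No retry suffix yet: this is the first time the worker changes.
      return walName + RETRYING_EXT + "001";
    }
    int next = Integer.parseInt(m.group(2)) + 1;
    if (next > 999) {
      // A fixed-width counter cannot represent this; the sketch refuses instead of guessing.
      throw new IllegalStateException("retry counter overflow for " + walName);
    }
    return m.group(1) + RETRYING_EXT + String.format(Locale.ROOT, "%03d", next);
  }

  public static void main(String[] args) {
    String name = "rs1.1700000001234"; // hypothetical WAL name
    for (int i = 0; i < 3; i++) {
      name = nextRetryName(name);
      System.out.println(name);
    }
  }
}
```

Keeping the counter in the file name means a restarted master, or a second SCP that lists the WAL directory, can recover the retry count from the name alone.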

mnpoonia · 2024-12-16T05:51:00Z

@Umeshkumar9414 If the splitwal proc fails and the root procedure also fails, then how is that handled?

Apache9

In general I think the approach is OK, renaming is a typical way for fencing. But I suggest we keep the old behavior when there is no retry, so we can get better compatibility.

hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/SplitWALProcedure.java

hbase-server/src/main/java/org/apache/hadoop/hbase/wal/AbstractFSWALProvider.java

Apache9 · 2024-12-16T06:34:21Z

hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java

+    } else {
+      originalWALPath = walPath.substring(0, walPath.length() - RETRYING_EXT.length() - 3);
+    }
+    String walNewName =


So when the retry number is 0, we also have the '.retrying' suffix? Will this cause trouble when upgrading?

No, when the retry number (workerChangeCount) is 0 we don't have any suffix, so this should not cause any trouble when upgrading. As @mnpoonia pointed out, I do need to handle one case: when the SCP is rolled back and a second SCP creates another SplitWALProcedure, the name will already contain the retrying suffix.

I do not think the current code reflects your explanation here...

If you do not want to change the wal name when retry count == 0, you should just return at the first if condition?

We call this method in the RELEASE_SPLIT_WORKER state. At this point the first attempt at the WAL split is already complete; we only reach here if the first try was not able to split the WAL.

Then please add this comment as a javadoc of this method?

Umeshkumar9414 · 2024-12-16T08:57:18Z

In general I think the approach is OK, renaming is a typical way for fencing. But I suggest we keep the old behavior when there is no retry, so we can get better compatibility.

Yeah, I didn't make any changes for the case when there is no retry; I kept that as it was.

Umeshkumar9414 · 2024-12-16T08:59:25Z

@Umeshkumar9414 If the the splitwal p

Thanks @mnpoonia for pointing this out. I need to handle the case when the parent SCP fails and, let's say, we have created another SCP. It will just list all the files in the WAL directory and create a SplitWALProcedure for each of them, but yes, I need to handle the first retry with retryCount 0.

Apache9 · 2024-12-17T03:20:28Z

What about adding a new step after acquiring the worker to rename the WAL file, where we just append the worker's name to the WAL file name as a suffix?

And we need to be very careful when dealing with retrying...

There are several problems currently:

- A rename failure does not always mean the old file is still there. The rename may have completed without us getting the result because of a network issue; in the current code we would then consider the splitting finished, which causes data loss.
- We do the renaming after releasing the worker. If the renaming fails, then on retry we will release the worker again, which will cause trouble...
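
To illustrate the first problem, here is a minimal sketch of a rename step that can be retried safely: when the rename call fails, it inspects the source and destination before deciding what happened, instead of treating a missing source as "split finished". It uses `java.nio.file` on the local filesystem purely as a stand-in for the Hadoop `FileSystem` used in the patch, and `renameForRetry` and the `State` enum are hypothetical names, not part of the PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class RenameCheck {
  enum State { RENAMED_NOW, ALREADY_RENAMED }

  /**
   * Hypothetical helper: renames src to dst, tolerating the case where an
   * earlier attempt already completed the rename but its result was lost.
   */
  static State renameForRetry(Path src, Path dst) throws IOException {
    try {
      Files.move(src, dst);
      return State.RENAMED_NOW;
    } catch (IOException e) {
      // The failure alone is ambiguous, so look at what is actually on disk.
      if (Files.exists(dst) && !Files.exists(src)) {
        // A previous attempt finished the rename; the WAL still needs splitting.
        return State.ALREADY_RENAMED;
      }
      // Source still present, or neither file exists: surface the error to the caller.
      throw e;
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("wal-rename-demo");
    Path src = dir.resolve("rs1.1700000001234"); // hypothetical WAL name
    Path dst = dir.resolve("rs1.1700000001234.retrying001");
    Files.write(src, new byte[] { 1 });
    System.out.println(renameForRetry(src, dst)); // first attempt
    System.out.println(renameForRetry(src, dst)); // simulated retry after a lost response
  }
}
```

In a procedure, a check of this kind would sit in its own state so that a retry of the rename does not re-run the worker release, which is the second problem listed above.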

Umeshkumar9414 · 2024-12-20T11:34:28Z

While going through the code I saw some comments and code that do not align. As per this comment, WALSplitter.splitLogFile should not be used in procedure-based splitting, but we are using that same method. We call SplitLogWorker.splitLog (which is deprecated) in SplitWALCallable, and inside this method here we call WALSplitter.splitLogFile. Can someone help me understand this? Is this a miss, or is it known? cc @Apache9 , @apurtell

@Apache9 , @apurtell

Umeshkumar9414 · 2024-12-20T20:16:52Z

hbase-server/src/main/java/org/apache/hadoop/hbase/wal/AbstractFSWALProvider.java

@@ -237,6 +237,9 @@ static void requestLogRoll(final WAL wal) {
  /** File Extension used while splitting an WAL into regions (HBASE-2312) */
  public static final String SPLITTING_EXT = "-splitting";

+  // Extension for the WAL where the split failed on one worker and is being retried on another.
+  public static final String RETRYING_EXT = ".retrying";
+
  /**
   * Pattern used to validate a WAL file name see {@link #validateWALFilename(String)} for
   * description.


While splitting the WAL for the meta table, the WAL name can be rs.XXX.meta.retrying001. Do you think we should update WAL_FILE_NAME_PATTERN? Although, in splitting we don't check for a valid WAL name.

@Apache9 what do you think?
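
To make the question concrete, the sketch below runs a meta WAL name, with and without the retry suffix, through the pattern quoted at the top of this conversation. The sample names are hypothetical, and the regex is taken from that comment rather than re-checked against the current source.

```java
import java.util.regex.Pattern;

class MetaWalNameCheck {
  // Pattern as quoted earlier in this conversation.
  static final Pattern WAL_NAME = Pattern.compile("(.+)\\.(\\d+)(\\.[0-9A-Za-z]+)?");

  static boolean isValid(String name) {
    return WAL_NAME.matcher(name).matches();
  }

  public static void main(String[] args) {
    // Hypothetical meta WAL names.
    String meta = "rs1.1700000001234.meta";
    String metaRetried = meta + ".retrying001";
    System.out.println(meta + " valid=" + isValid(meta));
    System.out.println(metaRetried + " valid=" + isValid(metaRetried));
  }
}
```

The pattern allows only one optional suffix after the numeric part, so `.meta` followed by `.retrying001` does not match. That only matters on code paths that actually validate the name.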

jojochuang · 2024-12-31T19:45:37Z

hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java

+  public boolean ifExistRenameWALForRetry(String walPath, String postRenameWalPath)
+    throws IOException {
+    if (fs.exists(new Path(rootDir, walPath))) {
+      if (!fs.rename(new Path(rootDir, walPath), new Path(rootDir, postRenameWalPath))) {


Hi, I'm not terribly familiar with wal split.
Does the WAL file get closed by this time? I'm asking because Ozone doesn't yet support renaming open files. And supporting that is quite a big project itself.

Even though that's not yet a huge problem for HBase, since HBase doesn't run on Ozone by default, it would be great if we didn't attempt to rename open files.

Thanks!

At this point there are cases where the WAL file will still be open. In the current code we call recoverLease at the RS once the master assigns the splitting to the worker RS. @Apache9 do you think we should move the recoverLease to the master?

Before this rename we also rename the WAL directory for the RS. @jojochuang is renaming a directory different from renaming a file? If a directory rename is also not supported when some file inside the directory is open, then we need changes in the current code as well.

I think this should have been done before we rename the wal directory?

Thanks guys.

I think what's more relevant for HBase is that it used to cause race conditions if the WAL files are kept open while being renamed. HBASE-27732 fixed one such bug -- because HDFS allows renaming open files, it doesn't fail immediately but it causes NPE later. Ozone fails right away with that bug. Took us a few days to find out.

(I need to check but I think directory rename is fine for Ozone in this case)

recoverLease doesn't support a directory path (https://github.com/apache/hadoop/blob/fb1bb6429dfb4e45687e0bc507c5a2ed26bd0bb0/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/leaserecoverable.md), so we need to call recoverLease for each file. We also have to call recoverLease before renaming the WAL file.

Btw, at least for Hadoop, I think both orders (recoverLease after the rename or before it) are fine, as renaming is a metadata operation and the data is linked to INodes.

Apache-HBase · 2024-12-31T20:16:22Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 35s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	buf	0m 0s		buf was not available.
+0 🆗	buf	0m 0s		buf was not available.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	detsecrets	0m 0s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	hbaseanti	0m 0s		Patch does not have any anti-patterns.
			_ master Compile Tests _
+0 🆗	mvndep	0m 11s		Maven dependency ordering for branch
+1 💚	mvninstall	2m 58s		master passed
+1 💚	compile	3m 27s		master passed
+1 💚	checkstyle	0m 41s		master passed
+1 💚	spotbugs	3m 40s		master passed
+1 💚	spotless	0m 42s		branch has no errors when running spotless:check.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 11s		Maven dependency ordering for patch
+1 💚	mvninstall	2m 48s		the patch passed
+1 💚	compile	3m 28s		the patch passed
+1 💚	cc	3m 28s		the patch passed
+1 💚	javac	3m 28s		the patch passed
+1 💚	blanks	0m 1s		The patch has no blanks issues.
-0 ⚠️	checkstyle	0m 36s	/results-checkstyle-hbase-server.txt	hbase-server: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
+1 💚	spotbugs	3m 53s		the patch passed
+1 💚	hadoopcheck	10m 44s		Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚	hbaseprotoc	1m 13s		the patch passed
+1 💚	spotless	0m 40s		patch has no errors when running spotless:check.
			_ Other Tests _
+1 💚	asflicense	0m 17s		The patch does not generate ASF License warnings.
		43m 4s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6534/5/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#6534
Optional Tests	dupname asflicense cc buflint bufcompat codespell detsecrets hbaseprotoc spotless javac spotbugs checkstyle compile hadoopcheck hbaseanti
uname	Linux dd88e2ccf603 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `dd576cd`
Default Java	Eclipse Adoptium-17.0.11+9
Max. process+thread count	86 (vs. ulimit of 30000)
modules	C: hbase-protocol-shaded hbase-server U: .
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6534/5/console
versions	git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2024-12-31T23:34:26Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 37s		Docker mode activated.
-0 ⚠️	yetus	0m 2s		Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
			_ Prechecks _
			_ master Compile Tests _
+0 🆗	mvndep	0m 35s		Maven dependency ordering for branch
+1 💚	mvninstall	3m 26s		master passed
+1 💚	compile	1m 31s		master passed
+1 💚	javadoc	0m 38s		master passed
+1 💚	shadedjars	5m 51s		branch has no errors when building our shaded downstream artifacts.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 14s		Maven dependency ordering for patch
+1 💚	mvninstall	3m 3s		the patch passed
+1 💚	compile	1m 30s		the patch passed
+1 💚	javac	1m 30s		the patch passed
+1 💚	javadoc	0m 35s		the patch passed
+1 💚	shadedjars	5m 47s		patch has no errors when building our shaded downstream artifacts.
			_ Other Tests _
+1 💚	unit	0m 32s		hbase-protocol-shaded in the patch passed.
-1 ❌	unit	211m 51s	/patch-unit-hbase-server.txt	hbase-server in the patch failed.
		241m 9s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6534/5/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR	#6534
Optional Tests	unit javac javadoc compile shadedjars
uname	Linux 153b6b6e2528 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `dd576cd`
Default Java	Eclipse Adoptium-17.0.11+9
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6534/5/testReport/
Max. process+thread count	5332 (vs. ulimit of 30000)
modules	C: hbase-protocol-shaded hbase-server U: .
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6534/5/console
versions	git=2.34.1 maven=3.9.8
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

virajjasani self-requested a review December 12, 2024 19:01

Apache9 reviewed Dec 16, 2024

Umeshkumar9414 force-pushed the HBASE-28951 branch from 1b4fef4 to 10e9b7c Compare December 16, 2024 17:19

Umeshkumar9414 force-pushed the HBASE-28951 branch from cd0af96 to a2f732d Compare December 20, 2024 19:31

Umeshkumar9414 commented Dec 20, 2024

ukumawat added 6 commits January 1, 2025 00:48

HBASE-28951 rename the wal before retrying the wal-split with another worker (71d43fb)
HBASE-28951 add a counter for worker change to keep different name each time (56423a7)
HBASE-28951 handle first retry of splitWal for a wal with .retrying suffix (cb375de)
HBASE-28951 remove the counter in procedure and sue wal name only to count retry (e5f51ea)
HBASE-28951 introduce and new state to rename the wal (398bc1f)
HBASE-28951 added some comments (dd576cd)

Umeshkumar9414 force-pushed the HBASE-28951 branch from a2f732d to dd576cd Compare December 31, 2024 19:29

jojochuang reviewed Dec 31, 2024

