YARN-11835: DockerContainerDeletionTask never calls deletionTaskFinished(), causing NM recovery store accumulation#8405

Open
Khrol wants to merge 1 commit into apache:trunk from Khrol:YARN-11835-docker-deletion-task-finished

@Khrol Khrol commented Apr 3, 2026

YARN-11835

Problem

DockerContainerDeletionTask.run() never calls deletionTaskFinished(), so every Docker container deletion task written to the NM recovery store (RocksDB/LevelDB) accumulates indefinitely and is replayed on every NodeManager restart.

FileDeletionTask.run() has always had this call; it was simply missing from DockerContainerDeletionTask.

Fix

Added the missing deletionTaskFinished() call at the end of DockerContainerDeletionTask.run():

public void run() {
    LinuxContainerExecutor exec = ((LinuxContainerExecutor)
        getDeletionService().getContainerExecutor());
    exec.removeDockerContainer(containerId);
    deletionTaskFinished(); // was missing — tasks accumulated in recovery store forever
}
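To make the accumulation concrete, here is a toy model of the behaviour described above. It is not Hadoop code: `ToyStateStore`, `ToyDeletionTask`, and `RecoveryStoreDemo` are hypothetical names, and an in-memory map stands in for the NM recovery store. The only thing it assumes from this PR is that finishing a task removes its entry from the store.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the NM recovery store. */
class ToyStateStore {
    private final Map<Integer, String> tasks = new LinkedHashMap<>();

    void storeDeletionTask(int taskId, String description) {
        tasks.put(taskId, description);
    }

    void removeDeletionTask(int taskId) {
        tasks.remove(taskId);
    }

    int pendingTasks() {
        return tasks.size();
    }
}

/** Hypothetical task; callsFinished switches between old and patched behaviour. */
class ToyDeletionTask implements Runnable {
    private final int taskId;
    private final ToyStateStore store;
    private final boolean callsFinished;

    ToyDeletionTask(int taskId, ToyStateStore store, boolean callsFinished) {
        this.taskId = taskId;
        this.store = store;
        this.callsFinished = callsFinished;
    }

    @Override
    public void run() {
        // The container removal itself would happen here.
        if (callsFinished) {
            // Stand-in for deletionTaskFinished(): drop the persisted entry.
            store.removeDeletionTask(taskId);
        }
    }
}

class RecoveryStoreDemo {
    /** Persists and runs n tasks, then reports how many entries remain. */
    static int pendingAfter(int n, boolean callsFinished) {
        ToyStateStore store = new ToyStateStore();
        for (int id = 0; id < n; id++) {
            store.storeDeletionTask(id, "docker-container-" + id);
            new ToyDeletionTask(id, store, callsFinished).run();
        }
        return store.pendingTasks();
    }

    public static void main(String[] args) {
        System.out.println("without the call: " + pendingAfter(1000, false));
        System.out.println("with the call:    " + pendingAfter(1000, true));
    }
}
```

In this model every entry left behind is what a restart would have to replay, which is why the store only grows when the final call is missing.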

Note on error handling

A reviewer on a related patch asked whether removeDockerContainer should be wrapped in try/catch, calling setSuccess(false) on failure, to mirror the pattern in FileDeletionTask.run().

FileDeletionTask uses that pattern because deleteAsUser throws checked exceptions (IOException | InterruptedException) that must be handled. removeDockerContainer, however, already catches ContainerExecutionException internally and never propagates any exception, so a try/catch here would be dead code that can never trigger. The simpler form is therefore correct.
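The contrast can be sketched with two toy methods. This is an illustration under stated assumptions, not the NodeManager source: `ToyExecutor` and `ErrorHandlingDemo` are hypothetical, and `IllegalStateException` merely stands in for the exception that removeDockerContainer is described as catching internally.

```java
import java.io.IOException;

/** Hypothetical executor that mimics the two error-handling styles. */
class ToyExecutor {
    // Declares a checked exception, so every caller must handle it.
    void deleteAsUser(String path) throws IOException {
        if (path.isEmpty()) {
            throw new IOException("empty path");
        }
    }

    // Handles its own failure; nothing reaches the caller.
    void removeDockerContainer(String containerId) {
        try {
            if (containerId.isEmpty()) {
                // Stand-in for the internally caught exception.
                throw new IllegalStateException("no such container");
            }
        } catch (IllegalStateException e) {
            System.err.println("removal failed, logged only: " + e.getMessage());
        }
    }
}

class ErrorHandlingDemo {
    /** File-style task: the try/catch is required and can flip the result. */
    static boolean runFileTask(ToyExecutor exec, String path) {
        boolean success = true;
        try {
            exec.deleteAsUser(path);
        } catch (IOException e) {
            success = false;
        }
        return success;
    }

    /** Docker-style task: no exception can arrive, so there is nothing to catch. */
    static boolean runDockerTask(ToyExecutor exec, String containerId) {
        exec.removeDockerContainer(containerId);
        return true;
    }

    public static void main(String[] args) {
        ToyExecutor exec = new ToyExecutor();
        System.out.println("file task, bad path:   " + runFileTask(exec, ""));
        System.out.println("docker task, bad id:   " + runDockerTask(exec, ""));
    }
}
```

One consequence visible in the sketch is that the Docker-style task cannot report a failed removal to its caller; whether that matters for the real task depends on code not shown in this PR description.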

Testing

Added testRunCallsDeletionTaskFinished() to TestDockerContainerDeletionTask:

  • Verifies removeDockerContainer(containerId) is called.
  • Verifies stateStore.removeDeletionTask(taskId) is called — the observable effect of deletionTaskFinished() — confirming the task removes itself from the recovery store.
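The two checks above can be sketched without any test framework by using recording fakes. The actual test body is not shown in this description, so everything below is hypothetical (`RecordingExecutor`, `RecordingStore`, `SketchDockerDeletionTask`, `DeletionTaskSketchTest`); it only mirrors the shape of the assertions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fake that records which containers were removed. */
class RecordingExecutor {
    final List<String> removed = new ArrayList<>();

    void removeDockerContainer(String containerId) {
        removed.add(containerId);
    }
}

/** Hypothetical fake that records which task ids left the store. */
class RecordingStore {
    final List<Integer> removedTaskIds = new ArrayList<>();

    void removeDeletionTask(int taskId) {
        removedTaskIds.add(taskId);
    }
}

/** Simplified stand-in for the patched task. */
class SketchDockerDeletionTask implements Runnable {
    private final int taskId;
    private final String containerId;
    private final RecordingExecutor exec;
    private final RecordingStore store;

    SketchDockerDeletionTask(int taskId, String containerId,
                             RecordingExecutor exec, RecordingStore store) {
        this.taskId = taskId;
        this.containerId = containerId;
        this.exec = exec;
        this.store = store;
    }

    @Override
    public void run() {
        exec.removeDockerContainer(containerId);
        // Stand-in for deletionTaskFinished().
        store.removeDeletionTask(taskId);
    }
}

class DeletionTaskSketchTest {
    /** Runs the task once and checks both recorded effects. */
    static boolean runAndCheck(int taskId, String containerId) {
        RecordingExecutor exec = new RecordingExecutor();
        RecordingStore store = new RecordingStore();
        new SketchDockerDeletionTask(taskId, containerId, exec, store).run();
        return exec.removed.equals(List.of(containerId))
            && store.removedTaskIds.equals(List.of(taskId));
    }

    public static void main(String[] args) {
        System.out.println("both effects observed: " + runAndCheck(42, "container_1"));
    }
}
```

Checking the store-side effect rather than the method call itself keeps the test tied to what operators care about: the entry is gone after the task runs.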

Also tested in our production environment.

…hed(), causing NM recovery store accumulation. Contributed by Igor Khrol.
@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 1m 2s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 1s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 41m 8s | | trunk passed |
| +1 💚 | compile | 1m 37s | | trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | compile | 1m 38s | | trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | checkstyle | 1m 12s | | trunk passed |
| +1 💚 | mvnsite | 1m 13s | | trunk passed |
| +1 💚 | javadoc | 1m 9s | | trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javadoc | 1m 5s | | trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | spotbugs | 1m 58s | | trunk passed |
| +1 💚 | shadedclient | 29m 12s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 1m 9s | | the patch passed |
| +1 💚 | compile | 1m 5s | | the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javac | 1m 5s | | the patch passed |
| +1 💚 | compile | 1m 6s | | the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | javac | 1m 6s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 35s | | the patch passed |
| +1 💚 | mvnsite | 0m 42s | | the patch passed |
| +1 💚 | javadoc | 0m 36s | | the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javadoc | 0m 35s | | the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | spotbugs | 1m 37s | | the patch passed |
| +1 💚 | shadedclient | 27m 56s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 26m 39s | | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 💚 | asflicense | 0m 36s | | The patch does not generate ASF License warnings. |
| | | 145m 0s | | |

| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8405/1/artifact/out/Dockerfile |
| GITHUB PR | #8405 |
| JIRA Issue | YARN-11835 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 63702da65b61 5.15.0-173-generic #183-Ubuntu SMP Fri Mar 6 13:29:34 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / df00b7b |
| Default Java | Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| Multi-JDK versions | /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8405/1/testReport/ |
| Max. process+thread count | 611 (vs. ulimit of 10000) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8405/1/console |
| versions | git=2.43.0 maven=3.9.11 spotbugs=4.9.7 |
| Powered by | Apache Yetus 0.14.1 https://yetus.apache.org |

This message was automatically generated.
