
HDFS-17890. Avoid slow disks datanode when reading data#8338

Open
junjie1233 wants to merge 3 commits into apache:trunk from junjie1233:HDFS-17890

Conversation

@junjie1233

Summary

When a client requests to read a block, the NameNode returns a list of DataNodes holding the replicas of that block.

The current logic sorts these DataNodes based on network topology (rack awareness, distance), without considering the performance of the underlying storage (disk/volume).

If a block replica resides on a slow or overloaded disk (for example, a hot disk with high latency), its DataNode may still be placed at the top of the sorted list and selected first by the client.


Example

Suppose a block has three replicas located on:

dn1: storage1 (slow disk)

dn2: storage1 (normal disk)

dn3: storage1 (normal disk)

Even though storage1 on dn1 is known to be slow, the client still reads replicas in the order returned by the NameNode, so it may contact dn1 first.


Fix

This PR adds disk-level slow storage tracking and deprioritization during block location sorting.

1. Disk-Level Tracking (SlowDiskTracker.java)

  • Track slow storage at StorageID granularity (not DataNode level).
  • Copy-on-Write cache cachedSlowDisksForRead to avoid lock contention on read path.
  • Cache key: IP:PORT:StorageID (e.g., 127.0.0.1:50010:DS-xxx).
  • Dual key modes:
    • CACHE_KEY: IP:PORT:StorageID for read deprioritization.
    • LEGACY_KEY: IP:PORT:volumeName for backward compatibility with Top-N reports.
  • Background cache rebuild with configurable interval (default 30s).
  • Automatic expiration of stale entries (reportValidityMs).
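The dual key scheme above boils down to plain string construction. The sketch below is illustrative; the class and method names are hypothetical, and only the key formats come from this PR:

```java
// Sketch of the dual cache keys described above. Class and method names
// are hypothetical; only the key formats are taken from this PR.
final class SlowDiskKeys {

  // CACHE_KEY: IP:PORT:StorageID, matched against replicas on the read path.
  static String cacheKey(String ip, int port, String storageId) {
    return ip + ":" + port + ":" + storageId;
  }

  // LEGACY_KEY: IP:PORT:volumeName, kept for backward-compatible Top-N reports.
  static String legacyKey(String ip, int port, String volumeName) {
    return ip + ":" + port + ":" + volumeName;
  }

  public static void main(String[] args) {
    System.out.println(cacheKey("127.0.0.1", 50010, "DS-xxx"));
    // prints 127.0.0.1:50010:DS-xxx
  }
}
```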

2. Block Location Sorting (FSNamesystem.java)

  • New method sortLocatedBlocksBySlowDisk() reorders replicas after topology-based sorting.
  • Pre-compute slow keys before sorting to avoid string concatenation in comparator hot path.
  • Stable sort: preserves network topology order for non-slow replicas.
  • Slow replicas sorted by latency (higher latency → lower priority).
  • Controlled by config:
    dfs.namenode.deprioritize.slow.disk.datanode.for.read (default: false).
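The stable-sort behavior above can be illustrated with a pre-computed latency key: non-slow replicas compare equal and keep their topology order, while slow replicas sink to the tail in ascending latency order. This is a simplified sketch, not the actual sortLocatedBlocksBySlowDisk() implementation, and the Replica record is hypothetical:

```java
import java.util.*;

// Simplified sketch of the stable reordering described above: replicas on
// slow disks move to the tail (higher latency = lower priority) while the
// topology order of non-slow replicas is preserved. Names are hypothetical.
final class SlowDiskSortSketch {

  // slowDiskLatencyMs <= 0 means the replica's disk was not reported slow.
  record Replica(String dn, double slowDiskLatencyMs) {}

  static List<Replica> deprioritizeSlow(List<Replica> topologySorted) {
    List<Replica> out = new ArrayList<>(topologySorted);
    // List.sort is guaranteed stable: equal keys (all non-slow replicas,
    // key 0) retain their relative, topology-based order.
    out.sort(Comparator.comparingDouble(r -> Math.max(r.slowDiskLatencyMs(), 0)));
    return out;
  }

  public static void main(String[] args) {
    List<Replica> sorted = deprioritizeSlow(List.of(
        new Replica("dn1", 120.0),  // slow disk
        new Replica("dn2", 0),
        new Replica("dn3", 0)));
    sorted.forEach(r -> System.out.println(r.dn()));
    // prints dn2, dn3 (topology order preserved), then dn1 last
  }
}
```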

3. Configuration (DFSConfigKeys.java)

  • dfs.namenode.slow.disk.cache.rebuild.interval (default 30s).
  • Decouples cache rebuild frequency from Top-N report generation interval.
  • Allows independent tuning for large clusters.
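Enabling the feature would then look roughly like the hdfs-site.xml fragment below; the two property names come from this PR, but the exact accepted value syntax (e.g. "30s" vs. milliseconds) is an assumption:

```xml
<!-- Hypothetical hdfs-site.xml fragment; value formats are assumed. -->
<property>
  <name>dfs.namenode.deprioritize.slow.disk.datanode.for.read</name>
  <value>true</value> <!-- default: false -->
</property>
<property>
  <name>dfs.namenode.slow.disk.cache.rebuild.interval</name>
  <value>30s</value> <!-- default: 30s; independent of Top-N report interval -->
</property>
```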

4. Disk Key Format (DataNodeDiskMetrics.java)

  • Use volumeName|storageID format for slow disk reports.
  • Enables SlowDiskTracker to extract both legacy key (WebUI) and cache key (read path).
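Splitting the volumeName|storageID report format can be sketched as below; the class and method names are hypothetical, and only the separator format comes from this PR:

```java
// Hypothetical sketch of splitting the volumeName|storageID report format so
// both the legacy WebUI key and the read-path cache key can be derived.
final class DiskKeyParse {

  // Returns { volumeName, storageID }; storageID is null for old-format
  // reports that contain only the volume name.
  static String[] split(String reportedDisk) {
    int sep = reportedDisk.indexOf('|');
    if (sep < 0) {
      return new String[] { reportedDisk, null };
    }
    return new String[] { reportedDisk.substring(0, sep),     // volumeName
                          reportedDisk.substring(sep + 1) };  // storageID
  }

  public static void main(String[] args) {
    String[] parts = split("/data/1|DS-abc");
    System.out.println(parts[0] + " / " + parts[1]);
    // prints /data/1 / DS-abc
  }
}
```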

Test

Test class: TestSlowDiskBlockLocations.java

Test Coverage

  • testDeprioritizeSlowDiskDatanodeForReadEnabled
Verifies that slow-disk replicas are moved to the end of the location list.
    Checks block read path integration.

  • testSlowDiskCacheRebuild
    Tests cache population after DataNode reports slow disk.
    Verifies cache refresh mechanism.

  • testSlowDiskExpiration
    Validates expiration of stale slow disk entries.
    Confirms cache is cleaned after disk recovery.

  • testCacheIntegrationWithReadPath
    End-to-end test: slow disk report → cache update → block location sorting.
    Verifies clients avoid slow replicas.

  • testIndependentCacheRebuildInterval
    Tests independent cache rebuild interval configuration.
    Verifies decoupling from Top-N report generation.

  • testMultipleSlowDisks
    Multiple slow disks across different DataNodes.
    Validates sorting by latency when all replicas are slow.

  • testNoSlowDiskReports
    Baseline test: no sorting when no slow disks reported.
    Ensures feature is non-intrusive when disabled.


Test Configuration

  • DFS_HEARTBEAT_INTERVAL: 1s (fast heartbeat for testing)
  • DFS_NAMENODE_SLOW_DISK_CACHE_REBUILD_INTERVAL: 1s (quick cache rebuild)
  • OUTLIERS_REPORT_INTERVAL: 1s (rapid slow disk detection)
  • Uses GenericTestUtils.waitFor() for async operations.
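The polling pattern that GenericTestUtils.waitFor() provides for these async checks can be approximated with a generic sketch like the one below; this is not Hadoop's actual implementation:

```java
import java.util.function.BooleanSupplier;

// Minimal stand-in for the polling pattern GenericTestUtils.waitFor()
// provides: re-check an asynchronous condition until it holds or a
// timeout expires. Generic sketch, not Hadoop's implementation.
final class WaitFor {

  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("Timed out waiting for condition");
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }
    }
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    // Condition becomes true after ~20 ms; polled every 5 ms.
    waitFor(() -> System.currentTimeMillis() - start >= 20, 5, 1000);
    System.out.println("condition met");
  }
}
```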

@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:---------:|:-------:|:-------:|:--------|
| +0 🆗 | reexec | 7m 23s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | xmllint | 0m 0s | | xmllint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 27m 36s | | trunk passed |
| +1 💚 | compile | 1m 3s | | trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | compile | 1m 4s | | trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | checkstyle | 1m 4s | | trunk passed |
| +1 💚 | mvnsite | 1m 6s | | trunk passed |
| +1 💚 | javadoc | 0m 56s | | trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javadoc | 0m 57s | | trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | spotbugs | 2m 13s | | trunk passed |
| +1 💚 | shadedclient | 17m 5s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 44s | | the patch passed |
| +1 💚 | compile | 0m 40s | | the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javac | 0m 40s | | the patch passed |
| +1 💚 | compile | 0m 44s | | the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | javac | 0m 44s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 0m 42s | /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | hadoop-hdfs-project/hadoop-hdfs: The patch generated 26 new + 324 unchanged - 0 fixed = 350 total (was 324) |
| +1 💚 | mvnsite | 0m 47s | | the patch passed |
| +1 💚 | javadoc | 0m 35s | | the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javadoc | 0m 36s | | the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | spotbugs | 2m 0s | | the patch passed |
| +1 💚 | shadedclient | 16m 28s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 181m 37s | | hadoop-hdfs in the patch passed. |
| +1 💚 | asflicense | 0m 27s | | The patch does not generate ASF License warnings. |
| | | 265m 11s | | |
| Subsystem | Report/Notes |
|:---------:|:-------------|
| Docker | ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8338/3/artifact/out/Dockerfile |
| GITHUB PR | #8338 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux 27173926dac5 5.15.0-171-generic #181-Ubuntu SMP Fri Feb 6 22:44:50 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / f0abd71 |
| Default Java | Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| Multi-JDK versions | /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8338/3/testReport/ |
| Max. process+thread count | 4053 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8338/3/console |
| versions | git=2.43.0 maven=3.9.11 spotbugs=4.9.7 |
| Powered by | Apache Yetus 0.14.1 https://yetus.apache.org |

This message was automatically generated.
