HDFS-17890. Avoid slow disks datanode when reading data #8338
Open
junjie1233 wants to merge 3 commits into apache:trunk
🎊 +1 overall
This message was automatically generated.
Summary
When a client requests to read a block, the NameNode returns a list of DataNodes holding the replicas of that block.
The current logic sorts these DataNodes based on network topology (rack awareness, distance), without considering the performance of the underlying storage (disk/volume).
If a block replica resides on a slow or overloaded disk (for example, a hot disk with high latency), its DataNode may still be placed at the top of the sorted list and selected first by the client.
Example
Suppose a block has three replicas located on:
dn1: storage1 (slow disk)
dn2: storage1 (normal disk)
dn3: storage1 (normal disk)
Even if storage1 on dn1 is known to be slow, the client still reads replicas in the order returned by the NameNode, so dn1 may be tried first.
Fix
This PR adds disk-level slow storage tracking and deprioritization during block location sorting.
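The deprioritization step can be pictured as a stable partition applied after topology sorting. The following is a minimal sketch, not the PR's code: the `Replica` class, the `deprioritizeSlow` method, and the `10.0.0.x` addresses are hypothetical stand-ins, and only the `IP:PORT:StorageID` key shape comes from this description. It does not order slow replicas by latency, which the actual change may do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SlowDiskSortSketch {

  /** Hypothetical stand-in for one replica location (DataNode address plus storage ID). */
  static final class Replica {
    final String ipPort;
    final String storageId;

    Replica(String ipPort, String storageId) {
      this.ipPort = ipPort;
      this.storageId = storageId;
    }

    /** Builds a key in the IP:PORT:StorageID shape described in this PR. */
    String cacheKey() {
      return ipPort + ":" + storageId;
    }
  }

  /**
   * Stable partition: replicas on healthy disks keep their topology order,
   * replicas whose key is in slowKeys are moved to the end in their original order.
   */
  static List<Replica> deprioritizeSlow(List<Replica> topologySorted, Set<String> slowKeys) {
    List<Replica> healthy = new ArrayList<>();
    List<Replica> slow = new ArrayList<>();
    for (Replica r : topologySorted) {
      (slowKeys.contains(r.cacheKey()) ? slow : healthy).add(r);
    }
    healthy.addAll(slow);
    return healthy;
  }

  public static void main(String[] args) {
    // dn1 is closest by topology but its storage is reported slow.
    List<Replica> sorted = Arrays.asList(
        new Replica("10.0.0.1:50010", "DS-1"),
        new Replica("10.0.0.2:50010", "DS-2"),
        new Replica("10.0.0.3:50010", "DS-3"));
    Set<String> slowKeys = new HashSet<>(Arrays.asList("10.0.0.1:50010:DS-1"));

    for (Replica r : deprioritizeSlow(sorted, slowKeys)) {
      System.out.println(r.cacheKey());
    }
  }
}
```

In this sketch dn1 ends up last while dn2 and dn3 keep their relative topology order, so a client reading in list order tries the healthy replicas first.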
1. Disk-Level Tracking (SlowDiskTracker.java)
cachedSlowDisksForRead to avoid lock contention on read path.
IP:PORT:StorageID (e.g., 127.0.0.1:50010:DS-xxx).
IP:PORT:StorageID for read deprioritization.
IP:PORT:volumeName for backward compatibility with Top-N reports.
(reportValidityMs).
2. Block Location Sorting (FSNamesystem.java)
sortLocatedBlocksBySlowDisk() reorders replicas after topology-based sorting.
dfs.namenode.deprioritize.slow.disk.datanode.for.read (default: false).
3. Configuration (DFSConfigKeys.java)
dfs.namenode.slow.disk.cache.rebuild.interval (default 30s).
4. Disk Key Format (DataNodeDiskMetrics.java)
volumeName|storageID format for slow disk reports.
SlowDiskTracker to extract both legacy key (WebUI) and cache key (read path).
Test
Test class: TestSlowDiskBlockLocations.java
Test Coverage
✅ testDeprioritizeSlowDiskDatanodeForReadEnabled
Verifies that slow disk replicas are moved to the end of location list.
Checks block read path integration.
✅ testSlowDiskCacheRebuild
Tests cache population after DataNode reports slow disk.
Verifies cache refresh mechanism.
✅ testSlowDiskExpiration
Validates expiration of stale slow disk entries.
Confirms cache is cleaned after disk recovery.
✅ testCacheIntegrationWithReadPath
End-to-end test: slow disk report → cache update → block location sorting.
Verifies clients avoid slow replicas.
✅ testIndependentCacheRebuildInterval
Tests independent cache rebuild interval configuration.
Verifies decoupling from Top-N report generation.
✅ testMultipleSlowDisks
Multiple slow disks across different DataNodes.
Validates sorting by latency when all replicas are slow.
✅ testNoSlowDiskReports
Baseline test: no sorting when no slow disks reported.
Ensures feature is non-intrusive when disabled.
Test Configuration
DFS_HEARTBEAT_INTERVAL: 1s (fast heartbeat for testing)
DFS_NAMENODE_SLOW_DISK_CACHE_REBUILD_INTERVAL: 1s (quick cache rebuild)
OUTLIERS_REPORT_INTERVAL: 1s (rapid slow disk detection)
GenericTestUtils.waitFor() for async operations.
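The short intervals matter because a slow-disk entry only affects reads after a cache rebuild, and only stops affecting them after it expires and the cache is rebuilt again. The sketch below illustrates that lifecycle with explicit timestamps. It is an assumption-laden illustration rather than the PR's SlowDiskTracker: the class `SlowDiskCacheSketch` and its methods are hypothetical, and only the `volumeName|storageID` report format, the `IP:PORT:StorageID` and `IP:PORT:volumeName` key shapes, and the names `cachedSlowDisksForRead` and `reportValidityMs` are taken from this description.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class SlowDiskCacheSketch {
  private final long reportValidityMs;
  // Cache key (IP:PORT:StorageID) -> time the latest slow-disk report arrived.
  private final Map<String, Long> lastReportMs = new ConcurrentHashMap<>();
  // Immutable snapshot swapped in on rebuild; readers do one volatile read and take no lock.
  private volatile Set<String> cachedSlowDisksForRead = Collections.emptySet();

  SlowDiskCacheSketch(long reportValidityMs) {
    this.reportValidityMs = reportValidityMs;
  }

  /** Turns a "volumeName|storageID" report key into the read-path key IP:PORT:StorageID. */
  static String cacheKey(String ipPort, String diskKey) {
    int sep = diskKey.indexOf('|');
    if (sep < 0) {
      throw new IllegalArgumentException("no storage ID in report key: " + diskKey);
    }
    return ipPort + ":" + diskKey.substring(sep + 1);
  }

  /** Turns the same report key into the legacy key IP:PORT:volumeName. */
  static String legacyKey(String ipPort, String diskKey) {
    int sep = diskKey.indexOf('|');
    return ipPort + ":" + (sep < 0 ? diskKey : diskKey.substring(0, sep));
  }

  /** Records a slow-disk report; the read path does not see it until rebuild() runs. */
  void addReport(String ipPort, String diskKey, long nowMs) {
    lastReportMs.put(cacheKey(ipPort, diskKey), nowMs);
  }

  /** Drops reports older than reportValidityMs, then publishes a fresh snapshot. */
  void rebuild(long nowMs) {
    lastReportMs.values().removeIf(t -> nowMs - t > reportValidityMs);
    cachedSlowDisksForRead =
        Collections.unmodifiableSet(new HashSet<>(lastReportMs.keySet()));
  }

  boolean isSlowForRead(String key) {
    return cachedSlowDisksForRead.contains(key);
  }

  public static void main(String[] args) {
    SlowDiskCacheSketch cache = new SlowDiskCacheSketch(1000L);
    String key = cacheKey("127.0.0.1:50010", "/data1|DS-xxx");
    cache.addReport("127.0.0.1:50010", "/data1|DS-xxx", 0L);
    System.out.println("before rebuild: " + cache.isSlowForRead(key));
    cache.rebuild(500L);
    System.out.println("after rebuild: " + cache.isSlowForRead(key));
    cache.rebuild(2000L);
    System.out.println("after expiry: " + cache.isSlowForRead(key));
  }
}
```

Passing timestamps in explicitly keeps the sketch deterministic. A real test against a MiniDFSCluster would instead poll for the same state transitions, which is what the 1s intervals and GenericTestUtils.waitFor() make practical.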