Firefly Clock Synchronization Added as ChronoSync #95
Open
aimran215 wants to merge 3 commits into ROCm:master
Motivation
Distributed computing workloads across multiple AMD nodes require precise clock synchronization to enable accurate performance tracing, event correlation, and distributed debugging. Without synchronized clocks, timestamps from different nodes cannot be meaningfully compared, making it difficult to analyze multi-node workload behavior or identify performance bottlenecks. This PR introduces ChronoSync, a distributed clock synchronization system based on the Firefly algorithm, providing microsecond-level time coordination across nodes for distributed AI training, HPC applications, and other multi-node AMD compute scenarios.
Technical Details
Core Components
ChronoSyncDataSource: Singleton data source that manages worker threads, parses the node configuration from the file named by the RPDT_CLOCKSYNC_IP environment variable, and coordinates synchronization across the cluster.
Firefly Protocol: Implements a UDP-based bidirectional timestamp exchange between node pairs. In each pair, the lower-ranked node acts as the server (Role A) and the higher-ranked node as the client (Role B). Each probe captures four timestamps, from which the round-trip time and clock offset are calculated.
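To make the four-timestamp calculation concrete, here is a minimal Python sketch. It assumes the conventional NTP-style formulas and assumes the client sends the probe first; the PR description does not spell out either detail, and the names `Probe`, `round_trip_time`, and `clock_offset` are hypothetical, not taken from the ChronoSync code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One probe exchange. Field names are illustrative assumptions."""
    t1: float  # client send time, read from the client clock
    t2: float  # server receive time, read from the server clock
    t3: float  # server reply time, read from the server clock
    t4: float  # client receive time, read from the client clock


def round_trip_time(p: Probe) -> float:
    """Time spent on the wire: total elapsed time minus server processing time."""
    return (p.t4 - p.t1) - (p.t3 - p.t2)


def clock_offset(p: Probe) -> float:
    """Estimated (server clock - client clock), assuming symmetric path delay."""
    return ((p.t2 - p.t1) + (p.t3 - p.t4)) / 2.0


# Example: the client clock is 50 us behind the server, the one-way delay
# is 10 us in each direction, and the server takes 5 us to reply.
probe = Probe(t1=1000.0, t2=1060.0, t3=1065.0, t4=1025.0)
print(round_trip_time(probe))  # 20.0
print(clock_offset(probe))     # 50.0
```

The offset estimate is exact only when the forward and return delays are equal; asymmetric paths bias it by half the difference.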
Synchronization Algorithm: Uses linear regression over the collected measurements to calculate the drift rate, then updates the global atomic variables (chrono_sync::offset, chrono_sync::drift) so that timestamps can be corrected without taking a lock.
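The regression step can be sketched as an ordinary least-squares fit of measured offset against local time. This is an illustration under assumptions, not the PR's C++ implementation: the function names, the choice of the first sample as the reference time, and the correction formula are all hypothetical.

```python
def fit_offset_and_drift(samples):
    """Least-squares line through (local_time, measured_offset) samples.

    Returns (t0, offset_at_t0, drift), where t0 is the first sample's
    local time and drift is the change in offset per unit of local time.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate drift")
    t0 = samples[0][0]
    xs = [t - t0 for t, _ in samples]  # shift times to keep the sums small
    ys = [off for _, off in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all samples share the same local time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    drift = sxy / sxx
    offset_at_t0 = mean_y - drift * mean_x
    return t0, offset_at_t0, drift


def correct_timestamp(local_time, t0, offset_at_t0, drift):
    """Map a local timestamp onto the reference clock (assumed formula)."""
    return local_time + offset_at_t0 + drift * (local_time - t0)


# Example: the offset grows by 2 units per unit of local time.
t0, offset, drift = fit_offset_and_drift([(0.0, 10.0), (1.0, 12.0), (2.0, 14.0)])
print(offset, drift)                               # 10.0 2.0
print(correct_timestamp(3.0, t0, offset, drift))   # 19.0
```

Because a correction like this reads only two scalar values, it is consistent with the description's approach of publishing offset and drift through atomic variables rather than a lock-protected structure.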
Configuration:
RPDT_CLOCKSYNC_IP: Path to a config file whose entries have the format IP_ADDRESS,rank=RANK_NUMBER
RPDT_CLOCKSYNC_RANK: Current node's rank
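As an illustration of the config format, the following sketch parses entries of the form IP_ADDRESS,rank=RANK_NUMBER into a rank-to-IP mapping. It assumes one entry per line and unique ranks, neither of which the description states; `parse_node_config` is a hypothetical helper, not part of the PR.

```python
import ipaddress


def parse_node_config(text):
    """Parse 'IP_ADDRESS,rank=RANK_NUMBER' entries into {rank: ip}.

    Assumes one entry per line; blank lines are skipped.
    """
    nodes = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        ip_part, comma, rank_part = line.partition(",")
        key, equals, value = rank_part.partition("=")
        if not comma or not equals or key.strip() != "rank":
            raise ValueError(f"line {lineno}: expected IP_ADDRESS,rank=RANK_NUMBER")
        # ip_address() raises ValueError for malformed addresses.
        ip = str(ipaddress.ip_address(ip_part.strip()))
        rank = int(value)
        if rank in nodes:
            raise ValueError(f"line {lineno}: duplicate rank {rank}")
        nodes[rank] = ip
    return nodes


example = "10.0.0.1,rank=0\n10.0.0.2,rank=1\n"
print(parse_node_config(example))  # {0: '10.0.0.1', 1: '10.0.0.2'}
```

In a real run, the text would come from the file named by RPDT_CLOCKSYNC_IP, and the local node would look itself up using RPDT_CLOCKSYNC_RANK.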
Key Features
File-based singleton locking prevents duplicate instances
Shared memory IPC between probing processes and main worker thread
Semaphore-protected measurement logging
Weighted updates using CONSENSUS_ALPHA for gradual convergence
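The weighted update can be pictured as an exponential blend of the current estimate with each new measurement. The sketch below is an assumption about what "weighted updates using CONSENSUS_ALPHA" means; the value 0.25 and the function name `blend` are placeholders chosen for illustration, since the description gives neither the constant's value nor the exact formula.

```python
CONSENSUS_ALPHA = 0.25  # placeholder value; the PR description does not state it


def blend(current, measured, alpha=CONSENSUS_ALPHA):
    """Move the current estimate a fraction alpha of the way toward a new measurement."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) * current + alpha * measured


# Repeated updates toward a stable measurement converge gradually.
estimate = 0.0
for _ in range(3):
    estimate = blend(estimate, 100.0)
    print(estimate)  # 25.0, then 43.75, then 57.8125
```

A small alpha damps the effect of a single noisy probe at the cost of slower convergence; a large alpha tracks changes quickly but passes more measurement noise through.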
Test Plan
Two-Node Testing: Verify singleton lock, worker thread startup, UDP probing, measurement collection, and offset/drift calculations.
Collective Check: Verify that all nodes can exchange timestamps, and confirm synchronization by checking the corrected timestamps for causality violations.
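One way to read the causality check: after correction, no message should appear to arrive before it was sent. The sketch below illustrates that idea only; the `Message` record and `find_causality_violations` are hypothetical, and the actual collective check in the PR may be structured differently.

```python
from typing import NamedTuple


class Message(NamedTuple):
    """A message observed on two nodes, with timestamps already corrected."""
    sender_rank: int
    send_time: float   # corrected timestamp taken on the sending node
    receiver_rank: int
    recv_time: float   # corrected timestamp taken on the receiving node


def find_causality_violations(messages):
    """Return the messages whose receive time precedes their send time."""
    return [m for m in messages if m.recv_time < m.send_time]


trace = [
    Message(0, 100.0, 1, 112.0),  # consistent: received after it was sent
    Message(1, 120.0, 0, 118.5),  # violation: appears to arrive before sending
]
violations = find_causality_violations(trace)
print(len(violations))  # 1
```

An empty result does not prove the clocks agree exactly; it shows only that the residual error is smaller than the shortest observed message latency.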
Test Result
Two-Node Synchronization: Passes the collective check for causality violations.