Firefly Clock Synchronization Added as ChronoSync#95

Open
aimran215 wants to merge 3 commits into ROCm:master from aimran215:master

Motivation

Distributed computing workloads across multiple AMD nodes require precise clock synchronization to enable accurate performance tracing, event correlation, and distributed debugging. Without synchronized clocks, timestamps from different nodes cannot be meaningfully compared, making it difficult to analyze multi-node workload behavior or identify performance bottlenecks. This PR introduces ChronoSync, a distributed clock synchronization system based on the Firefly algorithm, providing microsecond-level time coordination across nodes for distributed AI training, HPC applications, and other multi-node AMD compute scenarios.

Technical Details

Core Components

ChronoSyncDataSource: Singleton data source that manages worker threads, reads the node configuration from the file named by the RPDT_CLOCKSYNC_IP environment variable, and coordinates synchronization across the cluster.

Firefly Protocol: Implements a UDP-based bidirectional timestamp exchange between node pairs. In each pair, the lower-rank node acts as the server (Role A) and the higher-rank node as the client (Role B). Each probe captures four timestamps, from which round-trip time and clock offset are calculated.
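
The four-timestamp calculation can be sketched as follows. This is a minimal illustration, not the code in this PR: it assumes the usual NTP-style layout (client send, server receive, server send, client receive) and a symmetric network path, and the names `Probe`, `role_for`, `round_trip_time`, and `clock_offset` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One probe exchange (assumed NTP-style timestamp layout)."""
    t1: float  # client send time, read from the client clock
    t2: float  # server receive time, read from the server clock
    t3: float  # server send time, read from the server clock
    t4: float  # client receive time, read from the client clock


def role_for(my_rank: int, peer_rank: int) -> str:
    """Lower rank acts as server (Role A), higher rank as client (Role B)."""
    return "A" if my_rank < peer_rank else "B"


def round_trip_time(p: Probe) -> float:
    """Time spent on the wire: total elapsed time minus server processing time."""
    return (p.t4 - p.t1) - (p.t3 - p.t2)


def clock_offset(p: Probe) -> float:
    """Estimated (server clock - client clock), assuming symmetric path delay."""
    return ((p.t2 - p.t1) + (p.t3 - p.t4)) / 2.0
```

For example, if the server clock runs 50 units ahead and each direction takes 10 units, a probe with timestamps (100, 160, 165, 125) yields a round-trip time of 20 and an offset of 50.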

Synchronization Algorithm: Applies linear regression to the collected measurements to estimate the drift rate, then updates the global atomic variables chrono_sync::offset and chrono_sync::drift so that timestamps can be corrected without locks.
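
As a rough illustration of the regression step, the sketch below fits a line to (local time, measured offset) pairs with ordinary least squares; the slope serves as the drift estimate and the intercept as the offset at local time zero. The function names and the exact correction formula are assumptions for illustration; the PR itself stores the results in chrono_sync::offset and chrono_sync::drift.

```python
from typing import Sequence, Tuple


def fit_offset_and_drift(
    times: Sequence[float], offsets: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of offsets against times.

    Returns (intercept, slope): the offset at time 0 and the drift rate.
    """
    n = len(times)
    if n < 2 or n != len(offsets):
        raise ValueError("need at least two (time, offset) pairs of equal length")
    mean_t = sum(times) / n
    mean_o = sum(offsets) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all measurements share the same time; slope is undefined")
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(times, offsets))
    drift = cov / var_t
    intercept = mean_o - drift * mean_t
    return intercept, drift


def correct_timestamp(local_ts: float, offset: float, drift: float) -> float:
    """Map a local timestamp onto the reference timeline (assumed linear model)."""
    return local_ts + offset + drift * local_ts
```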

Configuration:

RPDT_CLOCKSYNC_IP: Path to config file with format IP_ADDRESS,rank=RANK_NUMBER
RPDT_CLOCKSYNC_RANK: Current node's rank
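
A sketch of how such a configuration might be read is shown below. It assumes one `IP_ADDRESS,rank=RANK_NUMBER` entry per line, which the description does not state explicitly, and the helper names `parse_node_config` and `load_from_environment` are hypothetical.

```python
import os
from typing import Dict, Mapping, Tuple


def parse_node_config(text: str) -> Dict[int, str]:
    """Parse lines of the form IP_ADDRESS,rank=RANK_NUMBER into {rank: ip}."""
    nodes: Dict[int, str] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines (an assumption)
        ip, comma, rank_field = line.partition(",")
        key, equals, value = rank_field.partition("=")
        if not comma or not equals or key.strip() != "rank":
            raise ValueError(f"line {lineno}: expected IP_ADDRESS,rank=RANK_NUMBER")
        rank = int(value)
        if rank in nodes:
            raise ValueError(f"line {lineno}: duplicate rank {rank}")
        nodes[rank] = ip.strip()
    return nodes


def load_from_environment(
    environ: Mapping[str, str] = os.environ,
) -> Tuple[int, Dict[int, str]]:
    """Read this node's rank and the node table from the two variables above."""
    path = environ["RPDT_CLOCKSYNC_IP"]
    my_rank = int(environ["RPDT_CLOCKSYNC_RANK"])
    with open(path, encoding="utf-8") as handle:
        return my_rank, parse_node_config(handle.read())
```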

Key Features

File-based singleton locking prevents duplicate instances
Shared memory IPC between probing processes and main worker thread
Semaphore-protected measurement logging
Weighted updates using CONSENSUS_ALPHA for gradual convergence
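
One common way to realize a weighted update like the one listed above is an exponential blend of the current estimate with each new measurement. The sketch below assumes that form; the value given to `CONSENSUS_ALPHA` here is a placeholder, and the function name `blend` is hypothetical.

```python
CONSENSUS_ALPHA = 0.2  # placeholder value; the PR's actual constant is not stated


def blend(current: float, measured: float, alpha: float = CONSENSUS_ALPHA) -> float:
    """Move the current estimate a fraction alpha of the way toward the measurement."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * current + alpha * measured


def converge(start: float, measured: float, steps: int) -> float:
    """Apply the blend repeatedly against a fixed measurement."""
    estimate = start
    for _ in range(steps):
        estimate = blend(estimate, measured)
    return estimate
```

A smaller alpha damps the effect of a single noisy probe at the cost of slower convergence; a larger alpha tracks new measurements faster but passes more noise through.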

Test Plan

Two-Node Testing: Verify singleton lock, worker thread startup, UDP probing, measurement collection, and offset/drift calculations.
Collective Check: Verify that all nodes can exchange timestamps and reach synchronization, confirmed by checking the corrected timestamps for causality violations.
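
A causality check of this kind can be sketched as follows, assuming a violation means a message whose corrected receive timestamp is earlier than its corrected send timestamp. The description does not define the check, so the `Message` type and `causality_violations` function are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Message(NamedTuple):
    sender: int         # rank of the sending node
    receiver: int       # rank of the receiving node
    sent_at: float      # corrected timestamp taken on the sender
    received_at: float  # corrected timestamp taken on the receiver


def causality_violations(messages: Iterable[Message]) -> List[Message]:
    """Return the messages that appear to arrive before they were sent."""
    return [m for m in messages if m.received_at < m.sent_at]
```

Under this assumption, a run passes when the returned list is empty for every node pair.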

Test Result

Two-Node Synchronization: Successfully passes the collective checks for causality violations.
