Conversation
Pull Request Overview
This PR adds experimental InfiniBand RDMA (Remote Direct Memory Access) support to Iris for multi-node GPU communication. The implementation provides a symmetric heap model with RDMA operations (put/get/atomics) accessible from Triton kernels, using PyTorch distributed for bootstrapping and InfiniBand for high-performance inter-node communication.
Key changes:
- RDMA backend with InfiniBand support (optional build via CMake)
- CPU-GPU work queue for asynchronous RDMA operations (a sketch of this pattern follows this list)
- Triton device APIs for RDMA put/get/atomic operations with symmetric heap addressing
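For orientation, here is a minimal sketch of the CPU-GPU work-queue pattern described above: producers enqueue operation descriptors into a ring buffer that a host proxy thread drains and posts as RDMA verbs. All type names and fields are illustrative assumptions, not the PR's actual implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative work descriptor: one RDMA operation requested by a GPU thread.
struct WorkItem {
  uint32_t opcode;       // e.g. PUT, GET, ATOMIC (hypothetical encoding)
  uint64_t local_addr;   // offset into the local symmetric heap
  uint64_t remote_addr;  // offset into the peer's symmetric heap
  uint32_t size;         // payload size in bytes
  uint32_t peer;         // destination rank
};

// Single-producer/single-consumer ring buffer. In the real design the
// producer side would live in GPU-visible memory and be written from
// Triton kernels; both sides are host code here for simplicity.
template <size_t N>
struct WorkQueue {
  WorkItem items[N];
  std::atomic<uint64_t> head{0};  // next slot to fill (producer)
  std::atomic<uint64_t> tail{0};  // next slot to drain (proxy thread)

  bool push(const WorkItem& w) {
    uint64_t h = head.load(std::memory_order_relaxed);
    if (h - tail.load(std::memory_order_acquire) == N) return false;  // full
    items[h % N] = w;
    head.store(h + 1, std::memory_order_release);
    return true;
  }

  // Called by the proxy thread, which would turn drained items into
  // ibverbs work requests (ibv_post_send and friends).
  bool pop(WorkItem& out) {
    uint64_t t = tail.load(std::memory_order_relaxed);
    if (t == head.load(std::memory_order_acquire)) return false;  // empty
    out = items[t % N];
    tail.store(t + 1, std::memory_order_release);
    return true;
  }
};
```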
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| setup.py | Adds CMake build system for optional RDMA C++ extension with InfiniBand detection |
| iris/experimental/iris_rdma.py | Main Python API providing RDMA context, symmetric heap, and Triton device APIs |
| iris/experimental/iris_rdma/python/bindings.cpp | PyBind11 bindings exposing C++ RDMA backend to Python |
| iris/experimental/iris_rdma/src/*.hpp | C++ implementation: network backend, queue pairs, work queue, proxy thread, logging |
| iris/experimental/__init__.py | Exports iris_rdma module with optional import handling |
| examples/22-24_rdma_* | Example programs demonstrating producer-consumer, GET, and atomic operations |
| docker/* | Updated Dockerfile and scripts with InfiniBand device support |
| run.sh, rebuild.sh | Helper scripts for running and rebuilding |
Comments suppressed due to low confidence (1)
iris/experimental/iris_rdma/src/iris_manager.hpp:1
- Corrected spelling of 'its' to 'it's' in comment.
```cpp
// SPDX-License-Identifier: MIT
```
```cpp
void dump_cq_info() const {
  LOG_DEBUG("cq: %p", cq_);
  LOG_DEBUG("handle: %u", cq_->channel);
  LOG_DEBUG("cq_context: %p", cq_->cq_context);
  LOG_DEBUG("context: %p", cq_->context);
  LOG_DEBUG("cqe: %u", cq_->cqe);
  LOG_DEBUG("comp_events_completed: %u", cq_->comp_events_completed);
  LOG_DEBUG("async_events_completed: %u", cq_->async_events_completed);
}
```
The dump_cq_info() function appears to be a debugging utility that is called in production code (see network_backend.hpp line 489). Consider removing this call from the hot path (poll_cq) or guarding it behind a debug flag to avoid performance overhead in production.
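A guarded call could look like this minimal sketch, assuming a compile-time IRIS_DEBUG flag (hypothetical name; a runtime log-level check would work equally well):

```cpp
// Hypothetical guard: emit CQ diagnostics only in debug builds, keeping
// dump_cq_info() out of the poll_cq() hot path in release builds.
#ifdef IRIS_DEBUG
  dump_cq_info();
#endif
```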
```cpp
int sq_length = 64; // Send queue length // TODO: FIX THAT
```
The TODO comment "FIX THAT" is vague and doesn't explain what needs to be fixed. Consider clarifying what specific issue needs to be addressed (e.g., "TODO: Make queue length configurable" or "TODO: Calculate optimal queue length based on workload").
Suggested change:
```cpp
int sq_length = 64; // Send queue length
// TODO: Make send queue length (sq_length) configurable or calculate based on workload/device capabilities
```
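One possible shape for the fix, as a sketch: read the length from an IRIS_RDMA_SQ_LENGTH environment variable (hypothetical name), keeping the current value as the fallback.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: read the send queue length from the environment,
// keeping the current hard-coded 64 as the default.
static int send_queue_length() {
  if (const char* env = std::getenv("IRIS_RDMA_SQ_LENGTH")) {
    try {
      return std::stoi(env);
    } catch (...) {
      // Malformed value: fall through to the default.
    }
  }
  return 64;
}
```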
```cpp
  std::this_thread::sleep_for(std::chrono::microseconds(10));
}
if (n <= 0) {
  LOG_DEBUG("Warning: PUT completion not polled (may be OK if async)");
```
Using LOG_DEBUG for a warning message is inconsistent. Consider using LOG_WARN for warning messages to maintain proper log level semantics.
| LOG_DEBUG("Warning: PUT completion not polled (may be OK if async)"); | |
| LOG_WARN("Warning: PUT completion not polled (may be OK if async)"); |
```cpp
  std::this_thread::sleep_for(std::chrono::microseconds(10));
}
if (n <= 0) {
  LOG_DEBUG("Warning: GET completion not polled (may be OK if async)");
```
Similar to Comment 4, these warning messages use LOG_DEBUG instead of LOG_WARN. For consistency and proper log level semantics, warnings should use LOG_WARN.
| LOG_DEBUG("Warning: GET completion not polled (may be OK if async)"); | |
| LOG_WARN("Warning: GET completion not polled (may be OK if async)"); |
```cpp
  std::this_thread::sleep_for(std::chrono::microseconds(10));
}
if (n <= 0) {
  LOG_DEBUG("Warning: ATOMIC_EXCH completion not polled (may be OK if async)");
```
Similar to Comment 4, these warning messages use LOG_DEBUG instead of LOG_WARN. For consistency and proper log level semantics, warnings should use LOG_WARN.
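For context, a minimal sketch of the threshold semantics these macros usually imply (the logger internals here are illustrative, not the PR's actual logging header):

```cpp
#include <cstdio>

// Illustrative threshold-based logging: messages below the configured level
// are filtered out, so a warning logged via LOG_DEBUG is silently dropped
// in a typical production configuration where the threshold is Warn.
enum class LogLevel { Debug = 0, Info = 1, Warn = 2, Error = 3 };

static LogLevel g_log_level = LogLevel::Warn;

#define LOG_AT(level, ...)                      \
  do {                                          \
    if ((level) >= g_log_level) {               \
      std::fprintf(stderr, __VA_ARGS__);        \
      std::fprintf(stderr, "\n");               \
    }                                           \
  } while (0)

#define LOG_DEBUG(...) LOG_AT(LogLevel::Debug, __VA_ARGS__)
#define LOG_WARN(...)  LOG_AT(LogLevel::Warn, __VA_ARGS__)
```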
```python
# Extract source address (min of pointer block where data is stored)
src_ptr_u64 = src_ptr.to(tl.uint64)
src_ptr_val = tl.min(src_ptr_u64, axis=0)
max_src_ptr = tl.max(src_ptr_u64, axis=0)
```
Variable max_src_ptr is not used.
Suggested change: delete the unused `max_src_ptr = tl.max(src_ptr_u64, axis=0)` line.
```python
def build_extension(self, ext):
    if not isinstance(ext, CMakeExtension):
        return super().build_extension(ext)
```
Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
Suggested change:
```python
super().build_extension(ext)
```
```python
import triton
import triton.language as tl
import numpy as np
import sys
```
Import of 'sys' is not used.
Suggested change: remove `import sys`.
```python
import torch.distributed as dist
import triton
import triton.language as tl
import time
```
Import of 'time' is not used.
Suggested change: remove `import time`.
```python
import torch.distributed as dist
import triton
import triton.language as tl
import time
```
Import of 'time' is not used.
Suggested change: remove `import time`.
Motivation
Add an RDMA + proxy-thread backend to Iris.
Technical Details
It is not yet clear how to merge this backend into the Iris RMA backend, but the goal is a single backend for both.
Test Plan
Test Result
Submission Checklist