Improving Relay Selection Strategy Beyond Round-Robin #1043
Replies: 2 comments 1 reply
-
Relay Selection Strategy Review & FeedbackThis review evaluates the proposed relay selection strategy improvements for the py-libp2p CircuitV2 relay system. The current PR #972 implements a round-robin load balancing approach, while Discussion #1043 proposes advancing to a performance-based selection strategy. This document provides technical analysis, identifies strengths and concerns, and offers recommendations for the path forward. Key Findings:
BackgroundOriginal Issue (#735)The original relay selection logic returned the first available relay, leading to:
Current Implementation (PR #972)Status: Open, 6 commits, +798/-4 lines
Implementation Details: # From libp2p/relay/circuit_v2/transport.py
self.relay_counter = 0 # for round robin load balancing
self._relay_counter_lock = trio.Lock() # Thread safety
async with self._relay_counter_lock:
index = self.relay_counter % len(candidate_list)
relay_peer_id = candidate_list[index]
self.relay_counter += 1Proposed Enhancement (Discussion #1043)Proposes moving beyond round-robin to performance-aware relay selection with three strategic options. Discussion Post Analysis (#1043)Strengths1. Well-Structured Problem Definition ✅
2. Thoughtful Strategy Options 📊The three proposed approaches are well-researched and industry-standard: Option 1: Least Response Time
Option 2: Least Connections
Option 3: Hybrid Approach ⭐ (Recommended in discussion)
3. Backward Compatibility Consideration ✅The discussion explicitly mentions maintaining fallback behavior when metrics are unavailable—critical for production stability. 4. Clear Communication 📢
Areas for Discussion1. Implementation Complexity vs. Immediate Value ⚖️Question: Is the performance-based approach necessary now, or should it be a future enhancement? Considerations:
Analysis: The current round-robin approach represents a 80/20 solution—it solves 80% of the problem with 20% of the complexity. The performance-based approach targets the remaining 20% but requires 80% more engineering effort. 2. Missing: Metrics Collection Strategy 🔍The discussion proposes what to measure but not how:
Recommendation: A metrics collection design document should precede implementation. 3. Missing: Performance Benchmarks 📈The discussion lacks:
Recommendation: Establish benchmarks to validate that the additional complexity yields measurable improvements. 4. Risk: Over-Engineering
|
| Aspect | Round-Robin (PR #972) | Performance-Based (Discussion #1043) |
|---|---|---|
| Complexity | Low - Simple modulo arithmetic | High - Metrics collection, scoring, weighting |
| Implementation Time | ✅ Already complete | Estimated 2-4 weeks |
| Maintenance Overhead | Low - Minimal state | Medium-High - Metrics storage, expiry, edge cases |
| Performance Benefit | Evenly distributes load | Optimizes for quality AND distribution |
| User Experience | Good - No overloaded relays | Better - Routes to best-performing relays |
| Failure Handling | Neutral - Doesn't avoid bad relays | Self-healing - Automatically avoids slow relays |
| Resource Overhead | Minimal - Just a counter | Notable - Timing, storage, computation |
| Risk | Low - Battle-tested algorithm | Medium - New system, potential bugs |
| Backward Compatibility | ✅ No breaking changes | ✅ Can maintain fallback (if designed well) |
Recommendations
Phase 1: Merge Current PR #972
-
Rationale:
- Solves the immediate problem (issue Sophisticated Relay Selection in circuitv2 transport #735)
- Well-tested, low-risk implementation
- Significant improvement over "first available" strategy
- Provides foundation for future enhancements
-
Action Items:
- ✅ Address the minor performance concern (consider caching relay categorization)
- ✅ Ensure newsfragment mentions "round-robin" explicitly (already done)
- Merge to production
Impact: Immediate value delivery with minimal risk.
Phase 2: Research & Design
-
Rationale: Performance-based selection requires careful design
-
Action Items:
- Establish performance benchmarks with current round-robin implementation
- Design metrics collection architecture:
- What metrics to collect (latency, connection success rate, active circuits)
- How to store metrics (in-memory with TTL? time-series DB?)
- How to handle cold start and missing data
- Define success criteria (e.g., "30% reduction in average connection time")
- Create detailed design document for maintainer review
- Prototype in a feature branch
Impact: De-risks the implementation, ensures buy-in from maintainers.
Phase 3: Implement Performance-Based Selection
-
Rationale: Build on proven foundation with clear design
-
Implementation Strategy:
- Start with Option 2: Least Connections (simplest performance-based approach)
- Validate improvement with benchmarks
- If successful, iterate to Option 3: Hybrid Approach
-
Feature Flag:
class RelayConfig: selection_strategy: Literal["round_robin", "least_connections", "hybrid"] = "round_robin"
- Allows gradual rollout
- Easy rollback if issues arise
- A/B testing capability
Impact: Incremental improvement with controlled risk.
Recommendation 2: Start with Least Connections
Rationale:
- Easier to implement than hybrid approach (no complex weighting)
- Provides measurable benefit over round-robin
- Stepping stone to hybrid approach
- Lower risk than jumping directly to hybrid
Implementation Sketch:
async def _select_relay_least_connections(self, peer_info: PeerInfo) -> ID | None:
"""Select relay with fewest active circuits."""
relays = self.discovery.get_relays()
if not relays:
return None
# Prioritize relays with reservations
relays_with_reservations = [
r for r in relays
if self.discovery.get_relay_info(r) and
self.discovery.get_relay_info(r).has_reservation
]
candidate_list = relays_with_reservations if relays_with_reservations else relays
# Select relay with minimum active connections
# Tie-breaking with round-robin for equal connection counts
relay_connections = {
relay_id: self._get_active_circuit_count(relay_id)
for relay_id in candidate_list
}
min_connections = min(relay_connections.values())
min_relays = [r for r, c in relay_connections.items() if c == min_connections]
# Round-robin among relays with minimum connections
async with self._relay_counter_lock:
index = self.relay_counter % len(min_relays)
selected = min_relays[index]
self.relay_counter += 1
return selectedRecommendation 3: Metrics to Track
If pursuing performance-based selection, track these metrics per relay:
@dataclass
class RelayPerformanceMetrics:
relay_id: ID
# Connection metrics
total_connection_attempts: int
successful_connections: int
failed_connections: int
# Timing metrics (exponential moving average)
avg_connection_time_ms: float # EMA of connection establishment time
# Load metrics
active_circuits: int
# Reliability metrics
last_successful_connection: float # timestamp
last_failed_connection: float # timestamp
# Computed properties
@property
def success_rate(self) -> float:
if self.total_connection_attempts == 0:
return 0.0
return self.successful_connections / self.total_connection_attempts
@property
def is_healthy(self) -> bool:
"""Consider relay healthy if success rate > 80% and recent success."""
return (
self.success_rate > 0.8 and
time.time() - self.last_successful_connection < 300 # 5 minutes
)Metrics Retention:
- In-memory storage with TTL (e.g., 1 hour)
- Exponential moving average for timing metrics (smooth out spikes)
- Periodic cleanup of stale metrics
Recommendation 4: Acceptance Criteria
Before implementing performance-based selection, define clear success criteria:
| Metric | Current (Round-Robin) | Target (Performance-Based) |
|---|---|---|
| Average connection establishment time | Baseline TBD | 20-30% reduction |
| Failed connection rate | Baseline TBD | 20% reduction |
| Load distribution variance | Low (round-robin is uniform) | Medium (optimizes for quality) |
| 99th percentile connection time | Baseline TBD | 30% reduction |
| Self-healing (avoiding failed relays) | None | Automatic within 2 attempts |
How to measure: Implement telemetry in Phase 1 (round-robin) to establish baselines.
Risk Analysis
Risks of Current PR #972
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Counter overflow after 2^64 selections | Very Low | Low (wraps around) | No action needed (Python int is arbitrary precision) |
| Lock contention under high concurrency | Low | Low (lock held briefly) | Monitor in production; optimize if needed |
| Relay list changes during selection | Low | Low (handled gracefully) | Existing retry logic handles this |
Overall Risk: LOW ✅ - Safe to merge
Risks of Performance-Based Selection
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Metrics collection overhead | Medium | Medium | Use sampling, EMA smoothing |
| Cold start problem (new relay, no metrics) | High | Medium | Default to round-robin for unmeasured relays |
| Metrics storage memory overhead | Medium | Low | Implement TTL, periodic cleanup |
| Scoring algorithm tuning difficulty | High | Medium | Feature flag, A/B testing |
| Avoiding relays that are temporarily slow | Medium | Low | Time-based metrics expiry |
| Complexity increases debugging difficulty | High | Medium | Comprehensive logging, observability |
Overall Risk: MEDIUM
Comparison with Industry Standards
The proposed strategies align well with industry practices:
Round-Robin (Current PR)
- Used by: Nginx, HAProxy (default), AWS ELB (basic)
- Pros: Simple, predictable, widely understood
- Cons: Doesn't adapt to performance differences
Least Connections
- Used by: Nginx, HAProxy, AWS ALB
- Pros: Better than round-robin for long-lived connections
- Cons: Requires connection tracking
Weighted Response Time / Hybrid
- Used by: AWS ALB (with target health), Envoy, Istio
- Pros: Production-grade, self-optimizing
- Cons: Complex, requires tuning
libp2p Context:
- go-libp2p uses a similar relay selection approach but with additional DHT-based relay discovery
- rust-libp2p implements reservation prioritization similar to this PR
- py-libp2p would be on par with other implementations with the current PR, and ahead with performance-based selection
Alternative Approaches (Not in Discussion)
For completeness, here are alternative strategies not mentioned in the discussion:
1. Weighted Random Selection
- Assign weights to relays based on reservation status
- Randomly select with probability proportional to weight
- Pro: Simple, no state needed
- Con: Less predictable than round-robin
2. Power of Two Choices
- Randomly pick 2 relays, select the one with fewer connections
- Pro: Simple, good load distribution in large pools
- Con: Requires randomness, less predictable
3. Consistent Hashing
- Hash the target peer ID to select relay
- Pro: Same target always uses same relay (caching benefits)
- Con: Uneven distribution if target distribution is skewed
Recommendation: Stick with the proposed approaches; these alternatives don't offer significant advantages for this use case.
Detailed Feedback on Discussion Post #1043
What's Excellent ✅
- Contextualizes the change: Links to existing PR, acknowledges current progress
- Problem-first thinking: Clearly articulates why round-robin isn't sufficient
- Multiple options: Presents trade-offs, not just one solution
- Forward-thinking: Considers self-healing, adaptability
- Collaborative tone: CCs maintainers, invites feedback
- Structured format: Easy to read and reference
Suggestions for Improvement 📝
-
Add "Why Now?" Section
- Is this blocking any use case?
- Are users experiencing pain with round-robin?
- What's the opportunity cost of implementing this vs. other features?
-
Include Implementation Estimate
- Development time
- Testing time
- Migration/rollout plan
-
Add Rollback Strategy
- How to detect if performance-based selection is performing worse?
- How to quickly roll back to round-robin if needed?
-
Reference Benchmarks or Studies
- Any data showing performance-based selection benefits in similar systems?
- Links to research or case studies?
-
Address Operational Concerns
- How will this be monitored in production?
- What new metrics/alerts are needed?
- Debugging complexity increase?
-
Consider Simplification
- Could "Least Connections" alone solve 90% of the problem?
- Is the hybrid approach necessary for v1?
Conclusion & Final Recommendations
For PR #972 ✅
Recommendation: APPROVE & MERGE (with minor optional improvements)
Strengths:
- ✅ Solves the stated problem (issue Sophisticated Relay Selection in circuitv2 transport #735)
- ✅ Clean, maintainable code
- ✅ Thread-safe implementation
- ✅ Excellent test coverage
- ✅ Low risk, high value
Optional improvements (non-blocking):
- Consider caching relay categorization to reduce
get_relay_info()calls - Add stress test for concurrent selections (nice-to-have)
Next Steps:
- Address any maintainer feedback
- Merge to main
- Monitor in production
For Discussion #1043 📊
Recommendation: PROCEED WITH PHASED APPROACH
Phase 1: Merge PR #972 ← Do this first
- Delivers immediate value
- Establishes foundation
Phase 2: Research & Design ← Do this next
- Establish benchmarks with current implementation
- Design metrics collection architecture
- Get maintainer buy-in on approach
Phase 3: Implement Least Connections ← Then do this
- Start with simpler performance-based approach
- Validate improvements with benchmarks
Summary Recommendation Matrix
| Recommended Approach | Rationale |
|---|---|
| Merge PR #972 immediately | Solves problem, low risk, ready now |
| Implement telemetry & benchmarks | Establishes data for future decisions |
| Consider Least Connections if data supports | Incremental improvement |
| Consider Hybrid if scale demands | Production-grade for large scale |
References
- PR Improve relay selection to prioritize active reservations #972: Improve relay selection to prioritize active reservations #972
- Discussion Improving Relay Selection Strategy Beyond Round-Robin #1043: Improving Relay Selection Strategy Beyond Round-Robin #1043
- Issue Sophisticated Relay Selection in circuitv2 transport #735: (Referenced in PR)
- libp2p Circuit Relay v2 Spec: https://github.com/libp2p/specs/blob/master/relay/circuit-v2.md
Beta Was this translation helpful? Give feedback.
-
|
@yashksaini-coder : Great, appreciate the feedback. Very comprehensive indeed. Would recommend you to discuss with @parth-soni07 and arrive at a good conclusion on the PR, together. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
🧩 Context
⚙️ Problem Summary
Current behavior
Relays are selected using a simple round-robin rotation.
This spreads traffic evenly but does not account for:
Impact
Round-robin can unintentionally degrade performance because all relays are treated equally even when some are significantly slower or unstable.
This affects:
🔍 Analysis
A selection strategy that is aware of relay performance metrics will naturally self-optimize:
This is a common improvement seen across distributed systems, load balancers, and peer-to-peer networks.
🚀 Proposed Solutions
Below are three strategies worth evaluating:
Option 1: Least Response Time
Track the time required to successfully open a connection through each relay.
Option 2: Least Connections
Maintain a count of active circuits or sessions per relay.
Option 3: Hybrid Approach
Combine both:
Weight relays based on:
This gives a balanced, adaptive, self-healing selection algorithm.
🧠 Recommended Approach
**🔗 Related References **
CCing @paschal533 @seetadev @pacrob @acul71 for feedback & ideas 💡
Beta Was this translation helpful? Give feedback.
All reactions