Improving Relay Selection Strategy Beyond Round-Robin #1043

parth-soni07 · 2025-11-14T12:45:40Z

parth-soni07
Nov 14, 2025

🧩 Context

Recent work on relay selection improvements has introduced a round-robin strategy to replace the earlier “first available relay” method. This change represents a meaningful step toward more predictable and evenly distributed relay usage.
Given that this component is already undergoing revision, it may be an opportune moment to consider a more advanced approach. Specifically, a performance-aware relay selection mechanism could offer significant benefits in terms of user experience, connection reliability, and system adaptability under varying network conditions.
Such a strategy would enable the system to better account for differences in relay performance, helping it dynamically adjust to factors such as latency, load, and relay availability.

⚙️ Problem Summary

Current behavior

Relays are selected using a simple round-robin rotation.
This spreads traffic evenly but does not account for:
- Network latency
- Relay capacity
- Overloaded or unreachable relays
- Geographic distance

Impact

Round-robin can unintentionally degrade performance because all relays are treated equally even when some are significantly slower or unstable.

This affects:

Connection setup time
Circuit reliability
User experience during congestion
Automatic adaptation when a relay becomes unhealthy

🔍 Analysis

A selection strategy that is aware of relay performance metrics will naturally self-optimize:

Fast relays get selected more often.
Slow or struggling relays get deprioritized without manual intervention.
Reserved relays can still be favored when their performance is comparable.

This is a common improvement seen across distributed systems, load balancers, and peer-to-peer networks.

🚀 Proposed Solutions

Below are three strategies worth evaluating:

Option 1: Least Response Time

Track the time required to successfully open a connection through each relay.

Quick feedback loop
Directly reflects network quality
Ideal for latency-sensitive traffic

Option 2: Least Connections

Maintain a count of active circuits or sessions per relay.

Helps with load distribution
Prevents overloading any relay
Similar to classic load-balancer strategies

Option 3: Hybrid Approach

Combine both:

Weight relays based on:
- Response time
- Active connection count
- Reservation priority (reserved relays get bonus weight)

This gives a balanced, adaptive, self-healing selection algorithm.

🧠 Recommended Approach

Evaluate implementing a hybrid weighted selection that considers performance metrics.
Maintain backward compatibility with simple fallback behavior if metrics are unavailable.
Keep reserved relays prioritized, unless their performance is significantly worse.

🔗 Related References

Current PR Improve relay selection to prioritize active reservations #972

CCing @paschal533 @seetadev @pacrob @acul71 for feedback & ideas 💡

yashksaini-coder · 2025-11-15T06:20:16Z

yashksaini-coder
Nov 15, 2025

Relay Selection Strategy Review & Feedback

This review evaluates the proposed relay selection strategy improvements for the py-libp2p CircuitV2 relay system. The current PR #972 implements a round-robin load balancing approach, while Discussion #1043 proposes advancing to a performance-based selection strategy. This document provides technical analysis, identifies strengths and concerns, and offers recommendations for the path forward.

Key Findings:

✅ Current round-robin implementation is solid and addresses the immediate issue
⚠️ Performance-based selection offers significant advantages but adds complexity
📊 Recommend phased approach: merge current work, then iterate on performance-based selection

Background

Original Issue (#735)

The original relay selection logic returned the first available relay, leading to:

Uneven load distribution across relays
Potential overloading of specific relays
No consideration for relay reservations

Current Implementation (PR #972)

Status: Open, 6 commits, +798/-4 lines
Key Changes:

Implemented round-robin selection with relay counter
Added thread-safety using trio.Lock() for concurrent access protection
Prioritizes relays with active reservations
Comprehensive test coverage (3 new test cases)

Implementation Details:

# From libp2p/relay/circuit_v2/transport.py
self.relay_counter = 0  # for round robin load balancing
self._relay_counter_lock = trio.Lock()  # Thread safety

async with self._relay_counter_lock:
    index = self.relay_counter % len(candidate_list)
    relay_peer_id = candidate_list[index]
    self.relay_counter += 1

Proposed Enhancement (Discussion #1043)

Proposes moving beyond round-robin to performance-aware relay selection with three strategic options.

Discussion Post Analysis (#1043)

Strengths

1. Well-Structured Problem Definition ✅

Clear articulation of current limitations
Identifies specific gaps (latency, capacity, geographic distance)
Quantifiable impact areas (connection time, reliability, UX)

2. Thoughtful Strategy Options 📊

The three proposed approaches are well-researched and industry-standard:

Option 1: Least Response Time

Pros: Direct reflection of network quality, quick feedback loop, latency-sensitive
Cons: Requires timing infrastructure, vulnerable to temporary spikes
Use Case: Best for real-time applications

Option 2: Least Connections

Pros: Simple to implement, classic load balancer strategy, prevents overloading
Cons: Doesn't account for connection quality or relay capacity variance
Use Case: Best for basic load distribution

Option 3: Hybrid Approach ⭐ (Recommended in discussion)

Pros: Balanced, adaptive, self-healing, maintains reservation priority
Cons: Most complex, requires tuning weights
Use Case: Production-grade distributed systems

3. Backward Compatibility Consideration ✅

The discussion explicitly mentions maintaining fallback behavior when metrics are unavailable—critical for production stability.

4. Clear Communication 📢

Uses visual markers (emojis) for section organization
References related PR for context
Tags relevant maintainers for feedback

Areas for Discussion

1. Implementation Complexity vs. Immediate Value ⚖️

Question: Is the performance-based approach necessary now, or should it be a future enhancement?

Considerations:

Round-robin already solves the immediate load distribution problem
Performance-based selection adds significant complexity:
- Metrics collection infrastructure
- Storage and expiry of performance data
- Weighted scoring algorithms
- Edge case handling (all relays performing poorly, no metrics available)

Analysis: The current round-robin approach represents a 80/20 solution—it solves 80% of the problem with 20% of the complexity. The performance-based approach targets the remaining 20% but requires 80% more engineering effort.

2. Missing: Metrics Collection Strategy 🔍

The discussion proposes what to measure but not how:

How will response times be tracked? (per connection attempt? moving average?)
Where will metrics be stored? (in-memory? persistent?)
What's the metrics retention/expiry policy?
How to handle cold start (new relay with no metrics)?
What happens when all relays have poor metrics?

Recommendation: A metrics collection design document should precede implementation.

3. Missing: Performance Benchmarks 📈

The discussion lacks:

Current baseline performance metrics
Expected performance improvements with each strategy
Resource overhead of metrics collection

Recommendation: Establish benchmarks to validate that the additional complexity yields measurable improvements.

4. Risk: Over-Engineering ⚠️

The hybrid approach, while sophisticated, may be over-engineering for the current use case:

Is the user base large enough to benefit from this optimization?
Are there real-world scenarios where round-robin fails significantly?
What's the cost/benefit ratio of implementation vs. maintenance?

PR #972 Technical Review

Code Quality Assessment

✅ Strengths:

Thread-Safety Implementation

async with self._relay_counter_lock:
    index = self.relay_counter % len(candidate_list)
    relay_peer_id = candidate_list[index]
    self.relay_counter += 1

Properly addresses concurrent access concerns
Uses appropriate trio.Lock() for async context

Clear Logic Flow

# Prioritize relays with active reservations
relays_with_reservations = []
other_relays = []

for relay_id in relays:
    relay_info = self.discovery.get_relay_info(relay_id)
    if relay_info and relay_info.has_reservation:
        relays_with_reservations.append(relay_id)
    else:
        other_relays.append(relay_id)

candidate_list = (
    relays_with_reservations if relays_with_reservations else other_relays
)

Readable separation of concerns
Reservation prioritization is explicit

Comprehensive Test Coverage
- test_circuit_v2_transport_relay_selection_round_robin_no_reservation(): Tests basic round-robin cycling
- test_circuit_v2_transport_relay_selection_prioritizes_reservations(): Validates reservation priority
- test_circuit_v2_transport_relay_selection_multiple_reservations(): Tests round-robin among reserved relays

⚠️ Potential Improvements:

Performance Concern: Repeated get_relay_info() Calls
```
for relay_id in relays:
    relay_info = self.discovery.get_relay_info(relay_id)  # Called in loop
```
- In each _select_relay() call, this iterates through all relays
- If there are many relays (10+), this could be expensive
- Suggestion: Consider caching the categorization of relays or maintaining a separate list of relays with reservations
Counter Increment Placement
The counter increments inside the lock, which is correct. However, consider whether you want to increment on:
- Every selection attempt (current)
- Only successful returns (alternative)
Current implementation is fine for uniform distribution.
Newsfragment Clarity (Already addressed in commit afc287e)
The newsfragment should explicitly mention "round-robin" for changelog clarity. ✅ This was already fixed.

Test Coverage Analysis

The test suite is excellent and demonstrates thorough thinking:

# Test 1: Basic round-robin (no reservations)
selected1 -> relay1
selected2 -> relay2  
selected3 -> relay3
selected4 -> relay1  # Cycles back

# Test 2: Prioritizes reservations
relay2.has_reservation = True
selected1 -> relay2  # Always picks relay2
selected2 -> relay2

# Test 3: Round-robin among reserved relays
relay1.has_reservation = True
relay2.has_reservation = True
selected1 -> relay1  # Round-robin only among reserved
selected2 -> relay2
selected3 -> relay1

Edge cases covered:

No reservations: Pure round-robin
Single reservation: Always selected
Multiple reservations: Round-robin among reserved only

Missing test cases (nice-to-have, not blockers):

Concurrent relay selection from multiple coroutines (stress test for lock)
Relay list changes during selection (add/remove relay)
Counter overflow behavior (after 2^64 selections)

Comparative Analysis: Round-Robin vs. Performance-Based

Aspect	Round-Robin (PR #972)	Performance-Based (Discussion #1043)
Complexity	Low - Simple modulo arithmetic	High - Metrics collection, scoring, weighting
Implementation Time	✅ Already complete	Estimated 2-4 weeks
Maintenance Overhead	Low - Minimal state	Medium-High - Metrics storage, expiry, edge cases
Performance Benefit	Evenly distributes load	Optimizes for quality AND distribution
User Experience	Good - No overloaded relays	Better - Routes to best-performing relays
Failure Handling	Neutral - Doesn't avoid bad relays	Self-healing - Automatically avoids slow relays
Resource Overhead	Minimal - Just a counter	Notable - Timing, storage, computation
Risk	Low - Battle-tested algorithm	Medium - New system, potential bugs
Backward Compatibility	✅ No breaking changes	✅ Can maintain fallback (if designed well)

Recommendations

Phase 1: Merge Current PR #972

Rationale:
- Solves the immediate problem (issue Sophisticated Relay Selection in circuitv2 transport #735)
- Well-tested, low-risk implementation
- Significant improvement over "first available" strategy
- Provides foundation for future enhancements
Action Items:
1. ✅ Address the minor performance concern (consider caching relay categorization)
2. ✅ Ensure newsfragment mentions "round-robin" explicitly (already done)
3. Merge to production

Impact: Immediate value delivery with minimal risk.

Phase 2: Research & Design

Rationale: Performance-based selection requires careful design
Action Items:
1. Establish performance benchmarks with current round-robin implementation
2. Design metrics collection architecture:
  - What metrics to collect (latency, connection success rate, active circuits)
  - How to store metrics (in-memory with TTL? time-series DB?)
  - How to handle cold start and missing data
3. Define success criteria (e.g., "30% reduction in average connection time")
4. Create detailed design document for maintainer review
5. Prototype in a feature branch

Impact: De-risks the implementation, ensures buy-in from maintainers.

Phase 3: Implement Performance-Based Selection

Rationale: Build on proven foundation with clear design
Implementation Strategy:
- Start with Option 2: Least Connections (simplest performance-based approach)
- Validate improvement with benchmarks
- If successful, iterate to Option 3: Hybrid Approach

Feature Flag:

class RelayConfig:
    selection_strategy: Literal["round_robin", "least_connections", "hybrid"] = "round_robin"

Allows gradual rollout
Easy rollback if issues arise
A/B testing capability

Impact: Incremental improvement with controlled risk.

Recommendation 2: Start with Least Connections

Rationale:

Easier to implement than hybrid approach (no complex weighting)
Provides measurable benefit over round-robin
Stepping stone to hybrid approach
Lower risk than jumping directly to hybrid

Implementation Sketch:

async def _select_relay_least_connections(self, peer_info: PeerInfo) -> ID | None:
    """Select relay with fewest active circuits."""
    relays = self.discovery.get_relays()
    if not relays:
        return None
    
    # Prioritize relays with reservations
    relays_with_reservations = [
        r for r in relays 
        if self.discovery.get_relay_info(r) and 
           self.discovery.get_relay_info(r).has_reservation
    ]
    
    candidate_list = relays_with_reservations if relays_with_reservations else relays
    
    # Select relay with minimum active connections
    # Tie-breaking with round-robin for equal connection counts
    relay_connections = {
        relay_id: self._get_active_circuit_count(relay_id)
        for relay_id in candidate_list
    }
    
    min_connections = min(relay_connections.values())
    min_relays = [r for r, c in relay_connections.items() if c == min_connections]
    
    # Round-robin among relays with minimum connections
    async with self._relay_counter_lock:
        index = self.relay_counter % len(min_relays)
        selected = min_relays[index]
        self.relay_counter += 1
    
    return selected

Recommendation 3: Metrics to Track

If pursuing performance-based selection, track these metrics per relay:

@dataclass
class RelayPerformanceMetrics:
    relay_id: ID
    
    # Connection metrics
    total_connection_attempts: int
    successful_connections: int
    failed_connections: int
    
    # Timing metrics (exponential moving average)
    avg_connection_time_ms: float  # EMA of connection establishment time
    
    # Load metrics
    active_circuits: int
    
    # Reliability metrics
    last_successful_connection: float  # timestamp
    last_failed_connection: float  # timestamp
    
    # Computed properties
    @property
    def success_rate(self) -> float:
        if self.total_connection_attempts == 0:
            return 0.0
        return self.successful_connections / self.total_connection_attempts
    
    @property
    def is_healthy(self) -> bool:
        """Consider relay healthy if success rate > 80% and recent success."""
        return (
            self.success_rate > 0.8 and
            time.time() - self.last_successful_connection < 300  # 5 minutes
        )

Metrics Retention:

In-memory storage with TTL (e.g., 1 hour)
Exponential moving average for timing metrics (smooth out spikes)
Periodic cleanup of stale metrics

Recommendation 4: Acceptance Criteria

Before implementing performance-based selection, define clear success criteria:

Metric	Current (Round-Robin)	Target (Performance-Based)
Average connection establishment time	Baseline TBD	20-30% reduction
Failed connection rate	Baseline TBD	20% reduction
Load distribution variance	Low (round-robin is uniform)	Medium (optimizes for quality)
99th percentile connection time	Baseline TBD	30% reduction
Self-healing (avoiding failed relays)	None	Automatic within 2 attempts

How to measure: Implement telemetry in Phase 1 (round-robin) to establish baselines.

Risk Analysis

Risks of Current PR #972

Risk	Likelihood	Impact	Mitigation
Counter overflow after 2^64 selections	Very Low	Low (wraps around)	No action needed (Python int is arbitrary precision)
Lock contention under high concurrency	Low	Low (lock held briefly)	Monitor in production; optimize if needed
Relay list changes during selection	Low	Low (handled gracefully)	Existing retry logic handles this

Overall Risk: LOW ✅ - Safe to merge

Risks of Performance-Based Selection

Risk	Likelihood	Impact	Mitigation
Metrics collection overhead	Medium	Medium	Use sampling, EMA smoothing
Cold start problem (new relay, no metrics)	High	Medium	Default to round-robin for unmeasured relays
Metrics storage memory overhead	Medium	Low	Implement TTL, periodic cleanup
Scoring algorithm tuning difficulty	High	Medium	Feature flag, A/B testing
Avoiding relays that are temporarily slow	Medium	Low	Time-based metrics expiry
Complexity increases debugging difficulty	High	Medium	Comprehensive logging, observability

Overall Risk: MEDIUM ⚠️ - Requires careful implementation and monitoring

Comparison with Industry Standards

The proposed strategies align well with industry practices:

Round-Robin (Current PR)

Used by: Nginx, HAProxy (default), AWS ELB (basic)
Pros: Simple, predictable, widely understood
Cons: Doesn't adapt to performance differences

Least Connections

Used by: Nginx, HAProxy, AWS ALB
Pros: Better than round-robin for long-lived connections
Cons: Requires connection tracking

Weighted Response Time / Hybrid

Used by: AWS ALB (with target health), Envoy, Istio
Pros: Production-grade, self-optimizing
Cons: Complex, requires tuning

libp2p Context:

go-libp2p uses a similar relay selection approach but with additional DHT-based relay discovery
rust-libp2p implements reservation prioritization similar to this PR
py-libp2p would be on par with other implementations with the current PR, and ahead with performance-based selection

Alternative Approaches (Not in Discussion)

For completeness, here are alternative strategies not mentioned in the discussion:

1. Weighted Random Selection

Assign weights to relays based on reservation status
Randomly select with probability proportional to weight
Pro: Simple, no state needed
Con: Less predictable than round-robin

2. Power of Two Choices

Randomly pick 2 relays, select the one with fewer connections
Pro: Simple, good load distribution in large pools
Con: Requires randomness, less predictable

3. Consistent Hashing

Hash the target peer ID to select relay
Pro: Same target always uses same relay (caching benefits)
Con: Uneven distribution if target distribution is skewed

Recommendation: Stick with the proposed approaches; these alternatives don't offer significant advantages for this use case.

Detailed Feedback on Discussion Post #1043

What's Excellent ✅

Contextualizes the change: Links to existing PR, acknowledges current progress
Problem-first thinking: Clearly articulates why round-robin isn't sufficient
Multiple options: Presents trade-offs, not just one solution
Forward-thinking: Considers self-healing, adaptability
Collaborative tone: CCs maintainers, invites feedback
Structured format: Easy to read and reference

Suggestions for Improvement 📝

Add "Why Now?" Section
- Is this blocking any use case?
- Are users experiencing pain with round-robin?
- What's the opportunity cost of implementing this vs. other features?
Include Implementation Estimate
- Development time
- Testing time
- Migration/rollout plan
Add Rollback Strategy
- How to detect if performance-based selection is performing worse?
- How to quickly roll back to round-robin if needed?
Reference Benchmarks or Studies
- Any data showing performance-based selection benefits in similar systems?
- Links to research or case studies?
Address Operational Concerns
- How will this be monitored in production?
- What new metrics/alerts are needed?
- Debugging complexity increase?
Consider Simplification
- Could "Least Connections" alone solve 90% of the problem?
- Is the hybrid approach necessary for v1?

Conclusion & Final Recommendations

For PR #972 ✅

Recommendation: APPROVE & MERGE (with minor optional improvements)

Strengths:

✅ Solves the stated problem (issue Sophisticated Relay Selection in circuitv2 transport #735)
✅ Clean, maintainable code
✅ Thread-safe implementation
✅ Excellent test coverage
✅ Low risk, high value

Optional improvements (non-blocking):

Consider caching relay categorization to reduce get_relay_info() calls
Add stress test for concurrent selections (nice-to-have)

Next Steps:

Address any maintainer feedback
Merge to main
Monitor in production

For Discussion #1043 📊

Recommendation: PROCEED WITH PHASED APPROACH

Phase 1: Merge PR #972 ← Do this first

Delivers immediate value
Establishes foundation

Phase 2: Research & Design ← Do this next

Establish benchmarks with current implementation
Design metrics collection architecture
Get maintainer buy-in on approach

Phase 3: Implement Least Connections ← Then do this

Start with simpler performance-based approach
Validate improvements with benchmarks

Summary Recommendation Matrix

Recommended Approach	Rationale
Merge PR #972 immediately	Solves problem, low risk, ready now
Implement telemetry & benchmarks	Establishes data for future decisions
Consider Least Connections if data supports	Incremental improvement
Consider Hybrid if scale demands	Production-grade for large scale

References

PR Improve relay selection to prioritize active reservations #972: Improve relay selection to prioritize active reservations #972
Discussion Improving Relay Selection Strategy Beyond Round-Robin #1043: Improving Relay Selection Strategy Beyond Round-Robin #1043
Issue Sophisticated Relay Selection in circuitv2 transport #735: (Referenced in PR)
libp2p Circuit Relay v2 Spec: https://github.com/libp2p/specs/blob/master/relay/circuit-v2.md

@parth-soni07 @seetadev

0 replies

seetadev · 2025-11-16T08:16:44Z

seetadev
Nov 16, 2025
Maintainer

@yashksaini-coder : Great, appreciate the feedback. Very comprehensive indeed.

Would recommend you to discuss with @parth-soni07 and arrive at a good conclusion on the PR, together.

1 reply

yashksaini-coder Nov 16, 2025

Understood sir, as of now the PR is good for us. I have already reviewed waiting for Pacrob too since he requested changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improving Relay Selection Strategy Beyond Round-Robin #1043

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Improving Relay Selection Strategy Beyond Round-Robin #1043

Uh oh!

parth-soni07 Nov 14, 2025

🧩 Context

⚙️ Problem Summary

Current behavior

Impact

🔍 Analysis

🚀 Proposed Solutions

Option 1: Least Response Time

Option 2: Least Connections

Option 3: Hybrid Approach

🧠 Recommended Approach

**🔗 Related References **

CCing @paschal533 @seetadev @pacrob @acul71 for feedback & ideas 💡

Replies: 2 comments · 1 reply

Uh oh!

yashksaini-coder Nov 15, 2025

Relay Selection Strategy Review & Feedback

Background

Original Issue (#735)

Current Implementation (PR #972)

Proposed Enhancement (Discussion #1043)

Discussion Post Analysis (#1043)

Strengths

1. Well-Structured Problem Definition ✅

2. Thoughtful Strategy Options 📊

3. Backward Compatibility Consideration ✅

4. Clear Communication 📢

Areas for Discussion

1. Implementation Complexity vs. Immediate Value ⚖️

2. Missing: Metrics Collection Strategy 🔍

3. Missing: Performance Benchmarks 📈

4. Risk: Over-Engineering ⚠️

PR #972 Technical Review

Code Quality Assessment

✅ Strengths:

⚠️ Potential Improvements:

Test Coverage Analysis

Comparative Analysis: Round-Robin vs. Performance-Based

Recommendations

Phase 1: Merge Current PR #972

Phase 2: Research & Design

Phase 3: Implement Performance-Based Selection

Recommendation 2: Start with Least Connections

Recommendation 3: Metrics to Track

Recommendation 4: Acceptance Criteria

Risk Analysis

Risks of Current PR #972

Risks of Performance-Based Selection

Comparison with Industry Standards

Round-Robin (Current PR)

Least Connections

Weighted Response Time / Hybrid

Alternative Approaches (Not in Discussion)

1. Weighted Random Selection

2. Power of Two Choices

3. Consistent Hashing

Detailed Feedback on Discussion Post #1043

What's Excellent ✅

Suggestions for Improvement 📝

Conclusion & Final Recommendations

For PR #972 ✅

For Discussion #1043 📊

Summary Recommendation Matrix

References

Uh oh!

seetadev Nov 16, 2025 Maintainer

Uh oh!

yashksaini-coder Nov 16, 2025

parth-soni07
Nov 14, 2025

🔗 Related References

Replies: 2 comments 1 reply

yashksaini-coder
Nov 15, 2025

seetadev
Nov 16, 2025
Maintainer