Releases: threefoldtech/zos_rmb
v1.5.1
What's Changed
- Fix: gracefully close websocket writer on cancellation to prevent half-closed sockets by @sameh-farouk in #234
- ci: add override flag to rust toolchain installation in release workflow by @0oM4R in #236
New Contributors
Full Changelog: v1.5.0...v1.5.1
v1.5.0
Release Notes
Version 1.5.0 represents a major architectural overhaul of the RMB Relay, focused on delivering production-grade reliability, significant performance enhancements, and deep observability. The entire federation system and Substrate client have been re-engineered to be self-healing and highly resilient, while the core networking stack has been modernized for future performance gains.
Architectural & Reliability Overhaul
- Federation on Redis Streams: The inter-relay federation mechanism has been completely rebuilt on Redis Streams and consumer groups. This provides at-least-once delivery semantics and automatic recovery of failed messages, ensuring no messages are lost during transit.
- Resilient Substrate Client: The client interacting with the TFChain is now highly resilient to network failures. It uses `arc-swap` for lock-free reconnects and a "single-flight" pattern to prevent "thundering herd" scenarios on cache misses or reconnect attempts.
- Self-Healing Event Listener: The `EventListener` is now a robust, self-healing service. It includes a watchdog for stall detection, an intelligent catch-up mechanism to recover from disconnects without full resets, and an indefinite retry loop for maximum uptime.
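To illustrate the single-flight idea mentioned above, here is a minimal std-only sketch. It is not the relay's implementation: `SingleFlight`, `Call`, and the `u32` key are hypothetical names chosen for this example, and the real client is async rather than thread-based. Concurrent callers for the same key share one execution of the loader instead of each hitting the chain.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// One in-flight load, shared by the caller that started it (the leader)
/// and every caller that arrived while it was running (the followers).
struct Call<V> {
    result: Mutex<Option<V>>,
    done: Condvar,
}

/// Hypothetical single-flight group keyed by twin ID (illustrative only).
pub struct SingleFlight<V> {
    calls: Mutex<HashMap<u32, Arc<Call<V>>>>,
}

impl<V: Clone> SingleFlight<V> {
    pub fn new() -> Self {
        SingleFlight { calls: Mutex::new(HashMap::new()) }
    }

    /// Runs `load` at most once per key at a time. Concurrent callers for
    /// the same key block and receive a clone of the leader's result.
    /// Caveat of this sketch: if `load` panics, followers wait forever.
    pub fn run<F: FnOnce() -> V>(&self, key: u32, load: F) -> V {
        // Decide, under the map lock, whether we lead or follow.
        let (call, leader) = {
            let mut calls = self.calls.lock().unwrap();
            let existing = calls.get(&key).cloned();
            match existing {
                Some(call) => (call, false),
                None => {
                    let call = Arc::new(Call {
                        result: Mutex::new(None),
                        done: Condvar::new(),
                    });
                    calls.insert(key, Arc::clone(&call));
                    (call, true)
                }
            }
        };

        if leader {
            // The map lock is not held while the (slow) load runs.
            let value = load();
            *call.result.lock().unwrap() = Some(value.clone());
            // Later callers start a fresh load instead of reusing this one.
            self.calls.lock().unwrap().remove(&key);
            call.done.notify_all();
            value
        } else {
            // Wait until the leader has published its result.
            let mut guard = call.result.lock().unwrap();
            while guard.is_none() {
                guard = call.done.wait(guard).unwrap();
            }
            let value = (*guard).clone().unwrap();
            value
        }
    }
}
```

A caller would wrap its expensive lookup, for example `flights.run(twin_id, || fetch_twin(twin_id))`, where `fetch_twin` stands in for whatever chain query is being de-duplicated.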
Performance & Efficiency
- Prioritized Local Routing: Messages are now routed to locally connected peers before attempting federation. This significantly reduces latency for local traffic and lowers the load on external services.
- Batched Message ACKs: WebSocket message acknowledgments are now batched, drastically reducing Redis round-trips under heavy load and improving overall throughput.
- Fair Worker Load Balancing: New connections and registrations are now distributed fairly across all available workers, preventing a single worker from being overloaded during connection bursts.
- Intelligent Redis Pool Sizing: The Redis connection pool is now sized dynamically based on consumer count plus a dedicated headroom, preventing pool exhaustion and ensuring stability under load.
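The pool sizing rule described above can be read as simple arithmetic. The sketch below is an assumption about its shape, not the relay's actual code: `redis_pool_size` and its parameters are hypothetical, and the real headroom value is not stated in these notes.

```rust
/// Hypothetical sizing rule: one connection per stream consumer, plus a
/// fixed headroom reserved for operational commands. The floor of 1 is an
/// extra guard added for this sketch.
pub fn redis_pool_size(consumers: u32, headroom: u32) -> u32 {
    consumers.saturating_add(headroom).max(1)
}
```

With 16 consumers and a headroom of 8, this yields a pool of 24 connections, so blocked consumers cannot starve short-lived commands of a connection.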
New Features & API Changes
- Fast-Fail Mode: A new `fail-fast` option allows clients to receive an immediate error response if a message's destination is known to be offline, perfect for time-sensitive operations that shouldn't wait for a timeout.
- Deprecation of `relays` Field: The client-supplied `relays` field in the message envelope is now deprecated. Routing is now determined by authoritative chain data. While the field is ignored, backward compatibility is maintained to ensure older clients do not break.
Networking & Modernization
- Hyper 1.0 Upgrade: The entire HTTP stack has been upgraded from Hyper 0.14 to the modern Hyper 1.0 API and its ecosystem of crates (`reqwest`, `hyper-util`). This brings significant performance improvements and aligns the project with the latest industry standards.
Observability
- Expanded Prometheus Metrics: Comprehensive Prometheus metrics have been added to provide deep insight into the relay's health, including detailed metrics for the twin cache (hits, misses, flushes) and the Event Listener's operational status.
Full Changelog: v1.3.4...v1.5.0
v1.5-pre4
Full Changelog: v1.5-pre2...v1.5-pre4
- Registrations are now distributed fairly across workers
v1.5-pre2
This release focuses on hardening the relay's core components, introducing a highly resilient Substrate client, improving federation performance, and modernizing the entire HTTP stack by upgrading to Hyper 1.0.
Highlights
- Resilient Substrate Client: The client for interacting with the TFChain has been completely rewritten to be highly resilient against network failures.
  - It now uses `arc-swap` for lock-free, atomic client replacement, allowing for seamless reconnects without blocking other tasks.
  - A "singleflight" pattern has been implemented to prevent multiple concurrent reconnect attempts during a network outage.
  - Another singleflight pattern de-duplicates concurrent cache misses for the same twin, reducing load on the chain.
- Robust Event Listener: The `EventListener` is now a fully self-healing service. It features a watchdog to detect stalled connections, a robust catch-up mechanism, and an indefinite retry loop to ensure it can recover from prolonged disconnections.
Performance & Efficiency
- Federation on Redis Streams: The inter-relay federation mechanism has been rebuilt on top of Redis Streams and consumer groups. This replaces the old list-based queue, providing a more reliable and persistent message queue with automatic recovery of failed messages.
- Smarter Redis Pool Sizing: The Redis connection pool size is now calculated more intelligently based on the number of consumers plus a dedicated headroom for operational commands, preventing pool exhaustion under heavy load.
- Batched ACKs: WebSocket message acknowledgments are now batched, significantly reducing the number of round-trips to Redis under load.
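As a rough illustration of ACK batching, the sketch below buffers message IDs and hands them to the backend in one call per batch. `AckBatcher`, `max_batch`, and the `flush_fn` callback are hypothetical names for this example. The relay's real batching policy (size, timing, and the Redis command used) is not described in these notes.

```rust
/// Hypothetical batcher: collects acknowledged message IDs and passes them
/// to `flush_fn` in one call, standing in for a single backend round-trip.
pub struct AckBatcher<F: FnMut(&[u64])> {
    pending: Vec<u64>,
    max_batch: usize,
    flush_fn: F,
}

impl<F: FnMut(&[u64])> AckBatcher<F> {
    pub fn new(max_batch: usize, flush_fn: F) -> Self {
        let max_batch = max_batch.max(1); // a batch holds at least one ID
        AckBatcher {
            pending: Vec::with_capacity(max_batch),
            max_batch,
            flush_fn,
        }
    }

    /// Records one ACK and flushes automatically once the batch is full.
    pub fn ack(&mut self, id: u64) {
        self.pending.push(id);
        if self.pending.len() >= self.max_batch {
            self.flush();
        }
    }

    /// Sends whatever is buffered. Call this on a timer or at shutdown so a
    /// partially filled batch is not held back indefinitely.
    pub fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        (self.flush_fn)(&self.pending);
        self.pending.clear();
    }
}
```

With `max_batch` set to 3, seven ACKs followed by a final `flush()` produce three backend calls (3, 3, and 1 IDs) instead of seven.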
Dependencies
- Hyper 1.0 Upgrade: The entire HTTP stack has been migrated from `hyper 0.14` to the modern Hyper 1.0 API, along with key ecosystem crates like `reqwest` and `hyper-tungstenite`. This brings significant performance improvements and aligns the project with the latest standards.
- General Updates: Dozens of other dependencies have been updated to their latest versions, incorporating numerous bug fixes, security patches, and performance enhancements from across the Rust ecosystem.
✨ Code Quality & Refinements
- The Protobuf schema has been updated to mark legacy fields as deprecated.
- Comprehensive Prometheus metrics have been added for the twin cache (hits, misses, flushes, entries) to improve observability.
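For readers unfamiliar with how such counters surface, here is a std-only sketch of hit/miss counting rendered in the Prometheus text exposition format. The struct and the metric names (`twin_cache_hits_total`, `twin_cache_misses_total`) are assumptions for illustration. The relay's actual metric names and the library it uses are not given in these notes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical twin-cache counters; the metric names are illustrative.
#[derive(Default)]
pub struct TwinCacheMetrics {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl TwinCacheMetrics {
    /// Counts one cache lookup as either a hit or a miss.
    pub fn record_lookup(&self, hit: bool) {
        let counter = if hit { &self.hits } else { &self.misses };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Renders both counters in the Prometheus text exposition format.
    pub fn render(&self) -> String {
        format!(
            "# TYPE twin_cache_hits_total counter\ntwin_cache_hits_total {}\n\
             # TYPE twin_cache_misses_total counter\ntwin_cache_misses_total {}\n",
            self.hits.load(Ordering::Relaxed),
            self.misses.load(Ordering::Relaxed),
        )
    }
}
```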
Full Changelog: v1.5-pre1...v1.5-pre2
v1.5-pre1
rmb-rs v1.5-pre1 Release Notes
Compare: v1.4-pre3...v1.5-pre1
Overview
This release delivers faster local message delivery, a more reliable and parallel federation pipeline, and major improvements to event listener latency, reconnection robustness, and observability. A previously used field is deprecated but remains backward compatible.
Highlights
- Feature: Prefer local message routing before federation.
- Feature: Fast-fail mode for time-sensitive requests.
- Feature: Federation rewrite with at-least-once delivery and retries.
- Fix: More reliable relay cache with reduced RPC dependency.
- Observability: New metrics, stall detection, clearer separation of concerns.
- Deprecation: Client-supplied relays field no longer influences routing.
Fast Message Routing
- Local sessions are now prioritized for delivery before falling back to federation.
- Improves performance and resilience (especially when the relay is reachable across multiple domains).
- Reduces reliance on external services for routine message delivery.
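The routing order described above can be sketched as a single lookup against the set of locally connected twins. `Route`, `route`, and the `HashSet<u32>` session set are hypothetical stand-ins for this example. The relay's real session table and federation path are more involved.

```rust
use std::collections::HashSet;

/// Where a message goes next (illustrative names).
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    /// The destination twin has a session on this relay.
    Local,
    /// No local session: hand the message to federation.
    Federate,
}

/// Checks local sessions first and falls back to federation only when the
/// destination is not connected to this relay.
pub fn route(dst_twin: u32, local_sessions: &HashSet<u32>) -> Route {
    if local_sessions.contains(&dst_twin) {
        Route::Local
    } else {
        Route::Federate
    }
}
```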
Event Listener Improvements and Fixes
- Twin update latency reduced from ~12 seconds to almost real-time (a few milliseconds/network latency).
- Faster recovery after outages with reduced dependence on external RPC.
- Avoids silent hangs through stall detection.
- Cleaner separation of responsibilities for easier maintenance and troubleshooting.
- New metrics expand operational visibility.
Federation Rewrite
- At-least-once delivery: messages are acknowledged explicitly to prevent loss.
- Automatic retries: unacknowledged messages remain pending and are reprocessed until they expire.
- High parallelism: workers fetch messages directly and in batches, eliminating bottlenecks.
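The delivery semantics above can be modeled in memory. The sketch below is not Redis code and not the relay's implementation. `Queue`, `Msg`, and the integer clock are hypothetical. It only shows the life cycle: a message read by a worker stays pending until it is explicitly acknowledged, and periodic reprocessing requeues unacknowledged messages until they expire.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Illustrative message: an ID and an absolute expiry time (any time unit).
pub struct Msg {
    pub id: u64,
    pub expires_at: u64,
}

/// In-memory model of a queue with a pending set.
pub struct Queue {
    ready: VecDeque<Msg>,
    pending: BTreeMap<u64, Msg>,
}

impl Queue {
    pub fn new() -> Self {
        Queue { ready: VecDeque::new(), pending: BTreeMap::new() }
    }

    pub fn push(&mut self, msg: Msg) {
        self.ready.push_back(msg);
    }

    /// A worker fetches up to `n` messages in one batch. They move to the
    /// pending set and stay there until acknowledged.
    pub fn read_batch(&mut self, n: usize) -> Vec<u64> {
        let mut ids = Vec::new();
        while ids.len() < n {
            match self.ready.pop_front() {
                Some(msg) => {
                    ids.push(msg.id);
                    self.pending.insert(msg.id, msg);
                }
                None => break,
            }
        }
        ids
    }

    /// Explicit acknowledgment: only now is the message forgotten.
    pub fn ack(&mut self, id: u64) -> bool {
        self.pending.remove(&id).is_some()
    }

    /// Periodic reprocessing: unacknowledged messages are requeued unless
    /// they have expired. Returns how many were requeued.
    pub fn reclaim(&mut self, now: u64) -> usize {
        let pending = std::mem::take(&mut self.pending);
        let mut requeued = 0;
        for (_, msg) in pending {
            if msg.expires_at > now {
                self.ready.push_back(msg);
                requeued += 1;
            }
        }
        requeued
    }
}
```

Because a message can be redelivered after a worker crash or a missed ACK, receivers in an at-least-once system should tolerate duplicates.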
Fast-Fail Mode
- Clients can opt into immediate failure for time-critical operations when the destination is offline.
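A minimal sketch of the decision this mode adds, under the assumption that without the option a message for an offline destination is held until it expires. `Dispatch` and `dispatch` are hypothetical names, and how the relay decides that a destination is offline is not covered here.

```rust
/// Possible outcomes for an incoming message (illustrative names).
#[derive(Debug, PartialEq, Eq)]
pub enum Dispatch {
    /// Destination is online: deliver as usual.
    Deliver,
    /// Destination is offline and the client did not opt in: keep the
    /// message until it expires (assumed default behavior).
    QueueUntilExpiry,
    /// Destination is offline and the client opted into fast-fail:
    /// answer with an error immediately.
    RejectOffline,
}

pub fn dispatch(destination_online: bool, fail_fast: bool) -> Dispatch {
    match (destination_online, fail_fast) {
        (true, _) => Dispatch::Deliver,
        (false, true) => Dispatch::RejectOffline,
        (false, false) => Dispatch::QueueUntilExpiry,
    }
}
```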
Deprecations
- The “relays” field is deprecated:
  - Discovery is determined by authoritative sources rather than client hints.
  - Reduces the risk of clients steering caching behavior.
  - Parsing/serialization remains backward compatible; SDKs warn on use.
Operational Notes
- Expect improved local delivery performance and lower load on external services.
- Federation is more resilient with explicit acknowledgments and periodic reprocessing.
- Additional metrics and stall detection simplify troubleshooting and performance tuning.
v1.4-pre3
Better internal message switching (#213)
- Run cargo update and fix compiler warnings
- Minor improvements:
  - Add todos that need to be processed
  - Rename some structs
  - Handle callback errors (drop the connection from worker connections subset)
- Add TwinID type
- Remove some unnecessary async_trait usage
- Worker rework:
  - Separate worker into its own structure
  - Use cancellation token to track if either of reader or writer routines exited and make sure connection is removed from global sessions set
- Increase connection channel size
- Use semaphore to track capacity; also run cargo clippy to clean up all warnings
- WIP: Better worker loop
- Implement back pressure for slow clients
- Add some tests
- Fix formatting
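One common way to apply back pressure to slow clients is a bounded per-connection channel. The sketch below uses the standard library's `sync_channel` as a stand-in. The relay itself is async and, per the notes above, tracks capacity with a semaphore, so this is an analogy rather than its code. `connection`, `offer`, and `Offer` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Result of offering one frame to a client's outbound queue.
#[derive(Debug, PartialEq, Eq)]
pub enum Offer {
    /// The frame was queued for the client's writer.
    Accepted,
    /// The queue is full: the client is reading too slowly.
    Busy,
    /// The writer side is gone: the connection should be dropped.
    Gone,
}

/// Creates a bounded outbound queue holding at most `capacity` frames.
pub fn connection(capacity: usize) -> (SyncSender<Vec<u8>>, Receiver<Vec<u8>>) {
    sync_channel(capacity)
}

/// Tries to queue a frame without blocking the worker that routes messages.
pub fn offer(tx: &SyncSender<Vec<u8>>, frame: Vec<u8>) -> Offer {
    match tx.try_send(frame) {
        Ok(()) => Offer::Accepted,
        Err(TrySendError::Full(_)) => Offer::Busy,
        Err(TrySendError::Disconnected(_)) => Offer::Gone,
    }
}
```

What to do on `Busy` (retry later, drop the frame, or disconnect the client) is a policy choice that this sketch leaves open.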
v1.3.4
update ubuntu image for the release workflow (#218)
v1.4-pre1
Increase connection channel size
v1.3.3
What's Changed
- fix rmb calls timeout by @Nabil-Salah in #202
New Contributors
- @Nabil-Salah made their first contribution in #202
Full Changelog: v1.3.2...v1.3.3