| 1 | +# DoubleZero Client Route Liveness Probing |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +This proposal introduces **Route Liveness Probing** to the `doublezerod` client daemon. |
| 6 | + |
| 7 | +The goal is to enable active data-plane validation of BGP-learned routes from DoubleZero Devices (DZDs), ensuring that only reachable routes are installed in the local kernel routing table. |
| 8 | + |
| 9 | +Each route is periodically probed via ICMP echo requests and transitions between `Up` and `Down` states according to a hysteresis-based policy. Routes marked `Up` are installed in the kernel routing table; routes marked `Down` are removed from it until they recover. |
| 10 | + |
| 11 | +The feature will initially be available only for the IBRL service type (unicast without allocated IP), where fallback reachability over the public internet path is available. |
| 12 | + |
| 13 | +## Motivation |
| 14 | + |
| 15 | +Currently, routes learned from DZDs over BGP are installed unconditionally. If a DZD or its tunnel fails while the BGP session remains established, these routes can remain in the kernel routing table even when traffic is no longer deliverable — leading to silent blackholing until standard BGP timers expire or manual intervention occurs. |
| 16 | + |
| 17 | +Introducing route liveness probing provides an independent, data-plane-based signal of reachability. This allows `doublezerod` to locally suppress failed routes without disturbing control-plane stability. |
| 18 | + |
| 19 | +Probing improves operational reliability, reduces convergence time after partial failures, and aligns with the goal of making the DoubleZero client resilient to asymmetric or silent path failures. |
| 20 | + |
| 21 | +## New Terminology |
| 22 | + |
| 23 | +- **Route Liveness Probe** — A periodic ICMP echo request sent by the client to verify that traffic can reach a given BGP-learned destination. |
| 24 | +- **Liveness State** — The local classification of a route as `Unknown`, `Up`, or `Down`, based on recent probe outcomes. |
| 25 | +- **Liveness Policy** — The decision logic (hysteresis-based) that determines when to transition between states, using configurable thresholds for consecutive successes or failures. |
| 26 | +- **Probing Worker** — The component that executes probes on a fixed interval and reports results to the policy tracker. |
| 27 | +- **User-Space ICMP Listener** — A lightweight responder on `doublezero0` that sends echo replies via `doublezero0` even when the route isn’t in the kernel table (where the kernel ICMP stack would otherwise return the reply over the public internet). |
| 28 | +- **Probing Subsystem** — The overall module within `doublezerod` that coordinates probing, evaluation, and route installation/withdrawal. |
| 29 | + |
| 30 | +## Alternatives Considered |
| 31 | + |
| 32 | +### Passive Monitoring (existing `doublezero-monitor-tool`) |
| 33 | + |
| 34 | +A passive approach could infer route health from forwarding statistics such as `nftables` or kernel FIB counters. However, it cannot distinguish between an idle route and an unreachable one and provides no proactive assurance of data-plane reachability. Detection is reactive and only occurs once user traffic has already been impacted. |
| 35 | + |
| 36 | +### BGP-Only (current in-client behavior) |
| 37 | + |
| 38 | +Relying solely on BGP session state and withdrawals, as done today, limits detection to control-plane failures. It cannot detect partial or asymmetric data-plane failures where the session remains established but forwarding has stopped, leading to silent blackholing until standard hold timers expire. |
| 39 | + |
| 40 | +### Active Liveness Probing via TWAMP |
| 41 | + |
| 42 | +TWAMP would provide a standards-based active probing mechanism, but it requires reflector support on the remote side and coordinated upgrades across all participating devices. Because existing clients already answer ICMP echo requests through the kernel’s ICMP stack, ICMP-based probing can instead be deployed incrementally without disrupting reachability between mixed-version peers. |
| 43 | + |
| 44 | +### Active Liveness Probing via ICMP (selected) |
| 45 | + |
| 46 | +ICMP echo probing was selected for its simplicity, universality, and backward-compatible deployment. It leverages existing ICMP handling paths, requires no additional coordination between clients, and provides a reliable binary reachability signal suitable for gating route installation. |
| 47 | + |
| 48 | +## Detailed Design |
| 49 | + |
| 50 | +### Integration Context |
| 51 | + |
| 52 | +The probing subsystem integrates with the existing **BGP plugin** in `doublezerod`. Each service type (IBRL, IBRL with allocated IP, multicast) can declare whether route probing is active. In this proposal, probing is **enabled only for IBRL (without allocated IP)** mode. |
| 53 | + |
| 54 | +<details> |
| 55 | + |
| 56 | +<summary>System context diagram</summary> |
| 57 | + |
| 58 | +```mermaid |
| 59 | +graph TB |
| 60 | + DZD[Connected DZD Peer] |
| 61 | + DESTS[Destinations in Advertised Prefixes] |
| 62 | + INTERNET[Public Internet Path] |
| 63 | + |
| 64 | + subgraph CLIENT[Client Host] |
| 65 | + DZIF[doublezero0 Interface] |
| 66 | + |
| 67 | + subgraph DZD_PROC[doublezerod Process] |
| 68 | + BGP[BGP Plugin] |
| 69 | + RM[Route Manager] |
| 70 | + PW[Probing Worker] |
| 71 | + LT[Liveness Tracker] |
| 72 | + UL[User-Space ICMP Listener] |
| 73 | + end |
| 74 | + |
| 75 | + NL[Netlink API] |
| 76 | + KRT[Kernel Routing Table] |
| 77 | + end |
| 78 | + |
| 79 | + %% Control Plane |
| 80 | + DZD -->|BGP updates: advertise / withdraw| BGP |
| 81 | + BGP -->|Learned routes| RM |
| 82 | + |
| 83 | + %% Probing Workflow |
| 84 | + RM --> PW |
| 85 | + PW -->|ICMP echo via doublezero0| DESTS |
| 86 | + DESTS -->|ICMP reply| UL |
| 87 | + UL --> PW |
| 88 | + PW -->|Probe results| LT |
| 89 | + LT -->|State: Up / Down| RM |
| 90 | + |
| 91 | + %% Routing Integration |
| 92 | + RM -->|Add / Delete route| NL |
| 93 | + NL --> KRT |
| 94 | + DZIF --- KRT |
| 95 | + |
| 96 | + %% Fallback Path |
| 97 | + DESTS -. "When route is down, kernel replies may return via" .-> INTERNET |
| 98 | +``` |
| 99 | + |
| 100 | +</details> |
| 101 | + |
| 102 | +<details> |
| 103 | + |
| 104 | +<summary>Workflow sequence diagram</summary> |
| 105 | + |
| 106 | +```mermaid |
| 107 | +sequenceDiagram |
| 108 | + autonumber |
| 109 | + participant DZD as DZD Peer |
| 110 | + participant BGP as BGP Plugin |
| 111 | + participant RM as Route Manager |
| 112 | + participant PW as Probing Worker |
| 113 | + participant UL as User-space ICMP Listener |
| 114 | + participant LT as Liveness Tracker |
| 115 | + participant NL as Netlink |
| 116 | + participant KRT as Kernel Routing Table |
| 117 | + participant DST as Destination Host |
| 118 | + |
| 119 | + DZD->>BGP: BGP UPDATE (new/changed route) |
| 120 | + BGP->>RM: Learned route notification |
| 121 | + RM->>PW: Register route for probing |
| 122 | + RM->>LT: Initialize liveness (Unknown) |
| 123 | + |
| 124 | + loop every probe interval |
| 125 | + PW->>DST: ICMP Echo via doublezero0 |
| 126 | + alt echo reply received |
| 127 | + DST-->>UL: ICMP Echo Reply on doublezero0 |
| 128 | + UL->>PW: Deliver reply |
| 129 | + PW->>LT: Record success |
| 130 | + else timeout or error |
| 131 | + PW->>LT: Record failure |
| 132 | + end |
| 133 | + |
| 134 | + alt transition to UP |
| 135 | + LT-->>RM: State = UP |
| 136 | + RM->>NL: Install route |
| 137 | + NL->>KRT: Add route entry |
| 138 | + else transition to DOWN |
| 139 | + LT-->>RM: State = DOWN |
| 140 | + RM->>NL: Withdraw route |
| 141 | + NL->>KRT: Delete route entry |
| 142 | + else no change |
| 143 | + LT-->>RM: No state change |
| 144 | + end |
| 145 | + end |
| 146 | + |
| 147 | + Note over DST,UL: If route is DOWN, host may reply via public internet path instead of doublezero0 |
| 148 | +``` |
| 149 | + |
| 150 | +</details> |
| 151 | + |
| 152 | +### Workflow |
| 153 | + |
| 154 | +1. **Route Announcement** |
| 155 | + |
| 156 | + When a new route is learned via BGP, it is registered with the route manager, which initializes its liveness state to `Unknown`. |
| 157 | + |
| 158 | +2. **Probing** |
| 159 | + |
| 160 | + The probing worker periodically sends ICMP echo requests toward each destination. |
| 161 | + |
| 162 | + - Echo replies are handled by the **user-space ICMP listener** bound to `doublezero0`. |
| 163 | + - This listener ensures replies return over the overlay interface, since the kernel’s ICMP stack would otherwise send them over the public internet when the route isn’t installed. |
| 164 | + |
| 165 | +3. **Liveness Evaluation** |
| 166 | + |
| 167 | + Results are fed into the liveness policy tracker: |
| 168 | + |
| 169 | + - When consecutive successes reach the up threshold, the route transitions to `Up`. |
| 170 | + - When consecutive failures reach the down threshold, the route transitions to `Down`. |
| 171 | + - Intermediate results cause no state change. |
| 172 | +4. **Routing Synchronization** |
| 173 | + |
| 174 | + The route manager reflects state changes into the kernel routing table: |
| 175 | + |
| 176 | + - Routes marked `Up` are installed. |
| 177 | + - Routes marked `Down` are withdrawn. |
| 178 | + - BGP session state is unaffected. |
| 179 | + |
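The liveness evaluation step can be sketched as a small per-route state machine. This is a minimal illustration, not the `doublezerod` implementation: the class and method names (`HysteresisTracker`, `record`) are hypothetical, and it assumes a route changes state exactly when the run of consecutive results reaches the configured threshold.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    UNKNOWN = "Unknown"
    UP = "Up"
    DOWN = "Down"


@dataclass
class HysteresisTracker:
    """Tracks one route. Names are illustrative, not doublezerod's API."""

    up_threshold: int = 3    # mirrors --route-probing-up-threshold
    down_threshold: int = 3  # mirrors --route-probing-down-threshold
    state: State = State.UNKNOWN
    successes: int = 0       # current run of consecutive successes
    failures: int = 0        # current run of consecutive failures

    def record(self, ok: bool) -> Optional[State]:
        """Feed one probe result; return the new state on a transition, else None."""
        if ok:
            self.successes += 1
            self.failures = 0
            if self.state is not State.UP and self.successes >= self.up_threshold:
                self.state = State.UP
                return State.UP
        else:
            self.failures += 1
            self.successes = 0
            if self.state is not State.DOWN and self.failures >= self.down_threshold:
                self.state = State.DOWN
                return State.DOWN
        return None
```

In this sketch the route manager would install the route when `record` returns `State.UP`, withdraw it on `State.DOWN`, and do nothing on `None`. A single failed probe between successes resets the failure run but leaves an `Up` route installed.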
| 180 | +### Configuration Parameters |
| 181 | + |
| 182 | +| Parameter | Description | Default | |
| 183 | +| --- | --- | --- | |
| 184 | +| `--route-probing-enable` | Enables the probing subsystem | disabled | |
| 185 | +| `--route-probing-interval` | Probe interval per route | 1s | |
| 186 | +| `--route-probing-timeout` | Timeout per probe | 1s | |
| 187 | +| `--route-probing-up-threshold` | Consecutive successes to mark route `Up` | 3 | |
| 188 | +| `--route-probing-down-threshold` | Consecutive failures to mark route `Down` | 3 | |
| 189 | + |
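To see how these parameters interact, the following sketch estimates how long a route stays installed after its path fails. It rests on assumptions the proposal does not state: probes are sent on a fixed schedule regardless of outstanding probes, and a failure is recorded at the moment a probe's timeout elapses. The helper name `down_detection_bounds` is hypothetical.

```python
def down_detection_bounds(interval_s: float, timeout_s: float, down_threshold: int):
    """Return (best, worst) seconds between failure onset and the Down transition.

    Assumes fixed-interval sends and that a failure is recorded when the
    probe timeout elapses; both are assumptions of this sketch.
    """
    # Time from sending the first failing probe until the last required
    # probe has timed out.
    base = (down_threshold - 1) * interval_s + timeout_s
    # The path may fail just after a probe was sent, so up to one full
    # interval can pass before the first failing probe goes out.
    return base, base + interval_s


best, worst = down_detection_bounds(interval_s=1.0, timeout_s=1.0, down_threshold=3)
```

With the defaults in the table (1s interval, 1s timeout, down threshold of 3), this gives roughly 3 to 4 seconds under the stated assumptions.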
| 190 | +### Policy Design |
| 191 | + |
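| 192 | +The initial liveness policy is **hysteresis-based**: a route changes state only after a configurable number of consecutive probe successes or failures, trading some responsiveness for stability. |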
| 193 | + |
| 194 | +The policy layer is designed to be pluggable, enabling future replacement with alternative evaluation strategies such as EWMA-based smoothing, weighted failure scoring, or adaptive thresholds that respond to observed probe variance. |
| 195 | + |
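One way to picture the pluggable policy layer is a narrow interface that any evaluation strategy can satisfy. The sketch below is an assumption about shape only: `LivenessPolicy`, `EwmaPolicy`, and the `alpha`, `up_above`, and `down_below` parameters are hypothetical names and values, and the EWMA variant is shown purely as an example of an alternative strategy.

```python
from typing import Optional, Protocol


class LivenessPolicy(Protocol):
    """Hypothetical policy interface: one probe result in, optional transition out."""

    def record(self, ok: bool) -> Optional[bool]:
        """Return True to go Up, False to go Down, None for no change."""
        ...


class EwmaPolicy:
    """Exponentially weighted moving average of probe success (1.0) and failure (0.0)."""

    def __init__(self, alpha: float = 0.3, up_above: float = 0.8, down_below: float = 0.2):
        self.alpha = alpha            # weight given to the newest sample
        self.up_above = up_above      # score at or above this marks the route Up
        self.down_below = down_below  # score at or below this marks the route Down
        self.score = 0.5              # neutral starting point while state is Unknown
        self.up: Optional[bool] = None

    def record(self, ok: bool) -> Optional[bool]:
        sample = 1.0 if ok else 0.0
        self.score = self.alpha * sample + (1 - self.alpha) * self.score
        if self.up is not True and self.score >= self.up_above:
            self.up = True
            return True
        if self.up is not False and self.score <= self.down_below:
            self.up = False
            return False
        return None
```

The gap between `up_above` and `down_below` plays the role that the consecutive-count thresholds play in the hysteresis policy: scores inside the gap cause no transition.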
| 196 | +## Failure Scenarios |
| 197 | + |
| 198 | +### Probing Subsystem Failure |
| 199 | + |
| 200 | +If the probing subsystem crashes, deadlocks, or encounters runtime errors (e.g., socket exhaustion), route liveness state stops updating. Each route remains in its last known state, `Up` or `Down`, until the subsystem recovers. Stale routes may therefore stay installed or withdrawn for a time, but forwarding continuity is preserved. |
| 201 | + |
| 202 | +### ICMP Unavailability on Destination Clients |
| 203 | + |
| 204 | +If a destination DoubleZero client disables ICMP handling or filters echo replies, its peers will mark the associated routes as `Down` and withdraw them from their local routing tables. Traffic to that destination will then be sent via the public internet path instead of the `doublezero0` interface. This behavior preserves reachability but bypasses the DoubleZero overlay until the client resumes responding to ICMP. |
| 205 | + |
| 206 | +### False Negatives and Transient Misclassification |
| 207 | + |
| 208 | +ICMP rate limiting, temporary congestion, or asymmetric paths can cause sporadic probe failures and transient misclassification of route state. The hysteresis policy mitigates short-lived noise by requiring consecutive failures or recoveries before transition, but overly aggressive thresholds could still cause unnecessary route churn. |
| 209 | + |
| 210 | +### Resource Exhaustion |
| 211 | + |
| 212 | +In deployments with many routes, the probing loop may keep a large number of ICMP probes in flight at once or consume excessive file descriptors. Concurrency limits and probe scheduling mitigate this risk, but misconfiguration or extreme route churn could still degrade performance. |
| 213 | + |
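A concurrency limit of the kind mentioned above can be sketched with a semaphore that caps the number of probes in flight. This is an illustration only: `probe_all`, the `probe` callable, and the `max_in_flight` default are hypothetical, and the proposal does not specify how `doublezerod` bounds concurrency.

```python
import asyncio


async def probe_all(destinations, probe, max_in_flight: int = 64):
    """Run probe(dest) for every destination, with at most max_in_flight at once.

    `probe` is any coroutine function returning True on reply and False on
    timeout; it stands in for the real ICMP probe, which is not shown here.
    """
    limiter = asyncio.Semaphore(max_in_flight)

    async def guarded(dest):
        # Wait for a free slot before sending, so sockets and file
        # descriptors in use never exceed max_in_flight.
        async with limiter:
            return dest, await probe(dest)

    pairs = await asyncio.gather(*(guarded(d) for d in destinations))
    return dict(pairs)
```

A caller would pass the set of registered routes and feed each result into the liveness tracker; a global rate cap, if needed, would be a separate mechanism layered on top.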
| 214 | +## Impact |
| 215 | + |
| 216 | +### Operational Reliability |
| 217 | + |
| 218 | +Ensures that only verifiably reachable routes remain active, preventing blackholes caused by stale BGP state. |
| 219 | + |
| 220 | +### Convergence |
| 221 | + |
| 222 | +Enables faster local convergence following data-plane failures, without affecting BGP session timers or advertisements. |
| 223 | + |
| 224 | +### Resource Usage |
| 225 | + |
| 226 | +Adds lightweight background ICMP traffic and minimal CPU overhead; concurrency and rate limits ensure scalability with large route tables. |
| 227 | + |
| 228 | +### Observability |
| 229 | + |
| 230 | +Exposes route state transitions via logs and metrics, providing operators with visibility into data-plane reachability. |
| 231 | + |
| 232 | +## Security Considerations |
| 233 | + |
| 234 | +The route liveness probing subsystem does not materially alter DoubleZero’s trust or threat model. It operates entirely within the client’s existing control and data plane, using ICMP echo requests to destinations learned through the trusted DZD control plane. |
| 235 | + |
| 236 | +Probes are sent only toward prefixes advertised by connected DZDs, so there is no risk of arbitrary or unscoped network scanning. Probe frequency and concurrency are bounded to prevent overload or amplification. Responses are handled either by the `doublezerod` process (when the user-space ICMP listener is running) or by the kernel’s ICMP stack on remote peers running earlier versions. |
| 237 | + |
| 238 | +The feature introduces no new externally reachable services or credentials, and ICMP payloads contain no sensitive information. The primary operational consideration is that ICMP must be permitted between peers for liveness detection to function accurately. |
| 239 | + |
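For reference, an ICMPv4 echo request consists of a fixed 8-byte header (type 8, code 0, checksum, identifier, sequence number) followed by an optional payload. The sketch below builds one using only the standard library. The function names are hypothetical, and the proposal does not specify what payload, if any, `doublezerod` sends.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 one's-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes = b"") -> bytes:
    """Build an ICMPv4 echo request: type 8, code 0, checksum, id, sequence."""
    # The checksum is computed with the checksum field set to zero.
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, ident, seq) + payload
```

The identifier and sequence number are what let a prober match replies to outstanding requests, so the payload itself can stay empty or carry only non-sensitive data such as a send timestamp.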
| 240 | +## Backward Compatibility |
| 241 | + |
| 242 | +Route liveness probing is designed to be **interoperable across mixed client versions**, ensuring that enabling it does not break communication between upgraded and non-upgraded peers. |
| 243 | + |
| 244 | +### Compatibility Matrix |
| 245 | + |
| 246 | +- **Probing enabled on source only:** |
| 247 | + |
| 248 | + The source client can still perform reachability checks, since destinations without probing respond using their kernel-space ICMP stack over the public internet path. Replies are routed normally, so liveness detection continues to function even if the remote side has not yet upgraded. |
| 249 | + |
| 250 | +- **Probing enabled on both source and destination:** |
| 251 | + |
| 252 | + Both clients use the DoubleZero user-space ICMP listener to exchange echo replies over the `doublezero0` interface, even when the route is not installed in the kernel table. This ensures accurate overlay-level reachability and preserves end-to-end validation within the DoubleZero fabric. |
| 253 | + |
| 254 | +- **Probing disabled on both sides:** |
| 255 | + |
| 256 | + Behavior remains unchanged from current deployments—routes are installed and withdrawn solely based on BGP control-plane updates. |
| 257 | + |
| 258 | + |
| 259 | +### Deployment Considerations |
| 260 | + |
| 261 | +Initial testing indicates that **approximately 7% of existing clients do not currently respond to ICMP probes**. |
| 262 | + |
| 263 | +These clients will appear unreachable to peers performing liveness probing, even though their BGP sessions are healthy and traffic to them may still be forwarded correctly. |
| 264 | + |
| 265 | +To ensure consistent behavior, the **first phase of rollout** should focus on enabling ICMP responsiveness across all clients, regardless of whether route probing itself is enabled. |
| 266 | + |
| 267 | +Once universal ICMP handling is confirmed, **subsequent upgrades** can enable route probing selectively or by default. |
| 268 | + |
| 269 | +During this transition: |
| 270 | + |
| 271 | +- Mixed environments remain compatible, as non-upgraded peers still respond via the kernel-space ICMP path. |
| 272 | +- Probing-capable clients automatically fall back to the public-internet ICMP path when remote overlay ICMP is unavailable. |
| 273 | +- Full overlay-level reachability validation over `doublezero0` becomes reliable once all clients are ICMP-responsive. |
| 274 | + |
| 275 | +## Open Questions |
| 276 | + |
| 277 | +- **Liveness Policy** — Is the current hysteresis approach good enough, or do we need something smoother like an EWMA or loss-weighted model to better handle intermittent loss and jitter? |
| 278 | +- **Thresholds & Convergence** — What probe interval and success/failure counts give us fast enough convergence without spamming probes or creating churn? |
| 279 | +- **Route Weighting** — Should all routes count the same, or should liveness results be weighted by stake or reputation (like `doublezero-monitor-tool`)? |
| 280 | +- **Probe Concurrency** — With lots of routes, how many probes can safely run at once, and do we need a global rate cap? |
| 281 | +- **Visibility & Monitoring** — How do we detect and debug flapping or systemic probe loss across clients? Should we collect telemetry or metrics from all clients to build an aggregate view of reachability and probe health? |
| 282 | +- **ICMP Reachability Rollout** — About 7% of clients don’t currently answer ICMP. What’s “good enough” coverage before we can safely make probing default? |