Commit d83b080

rfc: client route liveness / bfd / metrics
1 parent b18f4da commit d83b080

1 file changed: +82 -0 lines changed

rfcs/rfc7-client-route-liveness-probing.md

Lines changed: 82 additions & 0 deletions
@@ -252,6 +252,88 @@ After `Down`, TX uses exponential backoff with randomized jitter to avoid synchr

Sessions are created/removed by dynamic route events; restoration is immediate on `Up`, bounded by detection timing—not BGP hold timers.

## Observability

### Metrics

The client daemon MUST expose metrics at the endpoint `/metrics` in Prometheus text format.

The following metrics SHOULD be present at minimum:

| Name | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `doublezero_liveness_sessions` | gauge | `service_type`, `iface`, `src`, `state` | Current number of sessions by FSM state (`admin_down`, `down`, `init`, `up`). |
| `doublezero_liveness_session_transitions_total` | counter | `service_type`, `iface`, `src`, `from`, `to`, `reason` | Count of session state transitions by `from` state, `to` state, and `reason` (`detect_timeout`, `rx_down`, `admin_down`). |
| `doublezero_liveness_routes_installed` | gauge | `service_type`, `iface`, `src` | Number of routes currently installed by the liveness process. |
| `doublezero_liveness_route_installs_total` | counter | `service_type`, `iface`, `src` | Total route add operations performed in the kernel. |
| `doublezero_liveness_route_withdraws_total` | counter | `service_type`, `iface`, `src` | Total route delete operations performed in the kernel. |
| `doublezero_liveness_convergence_to_up_seconds` | histogram | `service_type`, `iface`, `src` | Time from the first successful control message while `down` until transition to `up` (includes detect threshold, scheduler delay, and kernel install). |
| `doublezero_liveness_convergence_to_down_seconds` | histogram | `service_type`, `iface`, `src` | Time from the first failed or missing control message while `up` until transition to `down` (includes detect expiry, scheduler delay, and kernel delete). |

The following metrics SHOULD be exposed, but only as opt-in due to their high cardinality:

| Name | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `doublezero_liveness_peer_sessions` | gauge | `service_type`, `iface`, `src`, `dst`, `state` | Current number of sessions by peer and FSM state (`admin_down`, `down`, `init`, `up`). |
| `doublezero_liveness_peer_session_detect_time_seconds` | gauge | `service_type`, `iface`, `src`, `dst` | Current detect time by session (after clamping with peer value). |

The following metrics MAY be exposed:

| Name | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `doublezero_liveness_scheduler_queue_len` | gauge | `service_type`, `iface`, `src` | Current number of pending events in the scheduler queue. |
| `doublezero_liveness_handle_rx_duration_seconds` | histogram | `service_type`, `iface`, `src` | Distribution of time to handle a valid received packet. |
| `doublezero_liveness_control_packets_tx_total` | counter | `service_type`, `iface`, `src` | Total control packets sent. |
| `doublezero_liveness_control_packets_rx_total` | counter | `service_type`, `iface`, `src` | Total control packets received. |
| `doublezero_liveness_control_packets_rx_invalid_total` | counter | `service_type`, `iface`, `src`, `reason` | Invalid control packets received (e.g. `short`, `bad_version`, `bad_len`, `parse_error`, `not_ipv4`, `reserved_nonzero`). |
| `doublezero_liveness_unknown_peer_packets_total` | counter | `service_type`, `iface`, `src` | Packets received that didn’t match any known session. |
| `doublezero_liveness_io_errors_total` | counter | `service_type`, `iface`, `src`, `op` | Count of non-timeout I/O errors (`read`, `write`, `set_deadline`). |
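
The `doublezero_liveness_convergence_to_down_seconds` histogram measures the interval from the first missed control message while `up` to the transition to `down`. The sketch below shows one way that interval could be tracked and bucketed with cumulative `le` buckets, as Prometheus histograms use. The class names, the bucket bounds, and the choice to reset the window when a valid packet arrives before detect expiry are all assumptions, not part of the RFC.

```python
import bisect


class Histogram:
    """Minimal cumulative-bucket histogram (bucket bounds are illustrative)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0
        self.n = 0

    def observe(self, seconds):
        # bisect_left finds the first bound >= seconds, i.e. the "le" bucket.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds
        self.n += 1

    def cumulative(self):
        out, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out


class DownConvergenceTimer:
    """Tracks the window from the first miss while up until the session goes down."""

    def __init__(self, hist):
        self.hist = hist
        self.first_miss = None

    def on_miss(self, now):
        # Only the first miss opens the window; later misses do not move it.
        if self.first_miss is None:
            self.first_miss = now

    def on_rx_ok(self):
        # Assumption: a valid packet before detect expiry cancels the window.
        self.first_miss = None

    def on_down(self, now):
        if self.first_miss is not None:
            self.hist.observe(now - self.first_miss)
            self.first_miss = None


hist = Histogram([0.1, 0.3, 1.0, 3.0])
timer = DownConvergenceTimer(hist)
timer.on_miss(now=10.0)   # in practice, pass time.monotonic()
timer.on_miss(now=10.3)
timer.on_down(now=10.9)   # route delete completed; about 0.9 s after first miss
print(hist.cumulative())
```

Timestamps are passed in explicitly so the logic can be tested without sleeping; a monotonic clock is the natural source, since wall-clock adjustments would distort the measured interval.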
### API

The client daemon MUST expose an API endpoint `/status/routes` as follows:

```
$ curl --unix-socket /var/run/doublezerod/doublezerod.sock http://localhost/status/routes

[
  {
    "service_type": "IBRL",
    "timestamp": "2025-11-08T12:34:56Z",
    "tunnel_src": "10.0.0.1",
    "destination": "203.0.113.42/32",
    "status": "DOWN",
    "network": "devnet"
  },
  {
    "service_type": "IBRL",
    "timestamp": "2025-11-08T12:34:56Z",
    "tunnel_src": "10.0.0.1",
    "destination": "192.0.2.5/32",
    "status": "UP",
    "network": "devnet"
  }
]
```
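
For programmatic consumers, the same request can be made without `curl`. The following is a rough Python sketch using only the standard library, assuming the socket path and response shape shown in the example above. `UnixHTTPConnection`, `fetch_routes`, and `count_by_status` are hypothetical helper names, not part of any DoubleZero client library.

```python
import http.client
import json
import socket
from collections import Counter


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that dials a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=2.0):
        # The host is only used for the Host header; no TCP connection is made.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def fetch_routes(socket_path="/var/run/doublezerod/doublezerod.sock"):
    """GET /status/routes from the daemon and return the decoded JSON list."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/status/routes")
        resp = conn.getresponse()
        if resp.status != 200:
            raise RuntimeError(f"unexpected HTTP status {resp.status}")
        return json.loads(resp.read())
    finally:
        conn.close()


def count_by_status(routes):
    """Summarize routes by their status field, e.g. how many are UP or DOWN."""
    return Counter(route["status"] for route in routes)
```

With a running daemon, `count_by_status(fetch_routes())` would give a quick health summary; for the two-route response above it would report one `DOWN` and one `UP` route.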
### CLI

The client CLI MUST expose per-route liveness status using the daemon API:

```
$ doublezero status --routes

Service Type   Tunnel Src     Destination       Status  Network  Timestamp
-------------- -------------- ----------------- ------- -------- -------------------
IBRL           10.0.0.1       203.0.113.42/32   DOWN    devnet   2025-11-08T12:00:00Z
IBRL           10.0.0.1       198.51.100.14/32  DOWN    devnet   2025-11-08T12:00:00Z
IBRL           10.0.0.1       192.0.2.18/32     UP      devnet   2025-11-08T12:00:00Z
IBRL           10.0.0.1       198.51.100.8/32   UP      devnet   2025-11-08T12:00:00Z
IBRL           10.0.0.1       203.0.113.7/32    UP      devnet   2025-11-08T12:00:00Z
IBRL           10.0.0.1       198.51.100.2/32   UP      devnet   2025-11-08T12:00:00Z
IBRL           10.0.0.1       192.0.2.5/32      UP      devnet   2025-11-08T12:00:00Z
```
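
The mapping from the `/status/routes` JSON to this table is a plain projection of each object's fields onto columns. As an illustration only, the sketch below renders such a table in Python; the `COLUMNS` mapping and `render_routes` function are hypothetical, and the column widths and spacing are not meant to match the real CLI's output byte for byte.

```python
# Column title -> JSON field, in the order shown by the CLI example.
COLUMNS = [
    ("Service Type", "service_type"),
    ("Tunnel Src", "tunnel_src"),
    ("Destination", "destination"),
    ("Status", "status"),
    ("Network", "network"),
    ("Timestamp", "timestamp"),
]


def render_routes(routes):
    """Render route objects as a left-aligned text table, keeping input order."""
    rows = [[str(route.get(field, "")) for _, field in COLUMNS] for route in routes]
    # Each column is as wide as its title or its longest cell, whichever is larger.
    widths = [
        max([len(title)] + [len(row[i]) for row in rows])
        for i, (title, _) in enumerate(COLUMNS)
    ]

    def line(cells):
        return "  ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    out = [line([title for title, _ in COLUMNS]), line(["-" * w for w in widths])]
    out += [line(row) for row in rows]
    return "\n".join(out)


routes = [
    {"service_type": "IBRL", "timestamp": "2025-11-08T12:34:56Z",
     "tunnel_src": "10.0.0.1", "destination": "203.0.113.42/32",
     "status": "DOWN", "network": "devnet"},
    {"service_type": "IBRL", "timestamp": "2025-11-08T12:34:56Z",
     "tunnel_src": "10.0.0.1", "destination": "192.0.2.5/32",
     "status": "UP", "network": "devnet"},
]
print(render_routes(routes))
```

Because the CLI only reformats what the daemon returns, any field added to the API response needs a matching column entry before it shows up in the table.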
## Impact

- **Control-plane load**
