rfcs/rfc7-client-route-liveness-probing.md (82 additions, 0 deletions)
@@ -252,6 +252,88 @@ After `Down`, TX uses exponential backoff with randomized jitter to avoid synchr
Sessions are created/removed by dynamic route events; restoration is immediate on `Up`, bounded by detection timing—not BGP hold timers.
## Observability
### Metrics
The client daemon MUST expose metrics at the endpoint `/metrics` in Prometheus text format.
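As a rough illustration of what a `/metrics` response body could look like, the sketch below renders one metric family in the Prometheus text exposition format using only the Python standard library. The helper names (`escape_label_value`, `render_metric`) and the label values (`unicast`, `eth0`, `192.0.2.1`) are assumptions made up for this example, not part of the RFC.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family; samples is a list of (labels dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        # Sort label names so the output is deterministic.
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{pairs}}} {value}" if pairs else f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical label values, for illustration only.
body = render_metric(
    "doublezero_liveness_sessions",
    "gauge",
    "Current number of sessions by FSM state.",
    [
        (
            {"service_type": "unicast", "iface": "eth0", "src": "192.0.2.1", "state": "up"},
            3,
        )
    ],
)
print(body, end="")
```

A real daemon would more likely use an existing Prometheus client library than hand-render the format; the sketch only shows the shape of the output.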
The following metrics SHOULD be present at minimum:

| Name | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `doublezero_liveness_sessions` | gauge | `service_type`, `iface`, `src`, `state` | Current number of sessions by FSM state (`admin_down`, `down`, `init`, `up`). |
| `doublezero_liveness_session_transitions_total` | counter | `service_type`, `iface`, `src`, `from`, `to`, `reason` | Count of session state transitions by from (state), to (state), and reason (`detect_timeout`, `rx_down`, `admin_down`). |
| `doublezero_liveness_routes_installed` | gauge | `service_type`, `iface`, `src` | Number of routes currently installed by the liveness process. |
| `doublezero_liveness_route_installs_total` | counter | `service_type`, `iface`, `src` | Total route add operations performed in the kernel. |
| `doublezero_liveness_route_withdraws_total` | counter | `service_type`, `iface`, `src` | Total route delete operations performed in the kernel. |
| `doublezero_liveness_convergence_to_up_seconds` | histogram | `service_type`, `iface`, `src` | Time from the first successful control message while `down` until transition to `up` (includes detect threshold, scheduler delay, and kernel install). |
| `doublezero_liveness_convergence_to_down_seconds` | histogram | `service_type`, `iface`, `src` | Time from the first failed or missing control message while `up` until transition to `down` (includes detect expiry, scheduler delay, and kernel delete). |
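To make the measurement window of `doublezero_liveness_convergence_to_up_seconds` concrete, here is a minimal sketch of one way to time it: start the clock at the first successful control message seen while `down`, and stop it at the transition to `up`. The class name, method names, and bucket bounds are hypothetical. Because the table says the interval includes kernel install, the sketch assumes the caller invokes `on_up` only after the route install has completed.

```python
class ConvergenceToUp:
    """Tracks time from first successful RX while down until the session is up."""

    def __init__(self, buckets):
        self.buckets = sorted(buckets)          # upper bounds ("le"), in seconds
        self.counts = [0] * len(self.buckets)   # cumulative count per bound
        self.count = 0                          # total observations (the +Inf bucket)
        self.total = 0.0                        # sum of observed values
        self._started = {}                      # session id -> start timestamp

    def on_rx_ok_while_down(self, session, now):
        # Only the first successful message starts the clock; later ones are ignored.
        self._started.setdefault(session, now)

    def on_reset(self, session):
        # The session fell back before reaching up; discard the pending start.
        self._started.pop(session, None)

    def on_up(self, session, now):
        start = self._started.pop(session, None)
        if start is None:
            return None
        elapsed = now - start
        self._observe(elapsed)
        return elapsed

    def _observe(self, value):
        # Prometheus histogram buckets are cumulative: bump every bound >= value.
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
        self.count += 1
        self.total += value


# Hypothetical bucket bounds and timestamps (seconds).
hist = ConvergenceToUp(buckets=(0.1, 0.5, 1.0))
hist.on_rx_ok_while_down("session-a", now=100.0)
hist.on_rx_ok_while_down("session-a", now=100.125)  # ignored, clock already running
elapsed = hist.on_up("session-a", now=100.25)
```

The same structure would apply to `doublezero_liveness_convergence_to_down_seconds`, with the clock starting at the first failed or missing control message while `up`.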
The following metrics SHOULD be exposed, but as opt-in due to high cardinality:

| Name | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `doublezero_liveness_peer_sessions` | gauge | `service_type`, `iface`, `src`, `dst`, `state` | Current number of sessions by peer and FSM state (`admin_down`, `down`, `init`, `up`). |
| `doublezero_liveness_peer_session_detect_time_seconds` | gauge | `service_type`, `iface`, `src`, `dst` | Current detect time by session (after clamping with peer value). |
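The cardinality concern can be sketched as follows: the aggregate gauge yields one series per `(service_type, iface, src, state)`, while the per-peer gauge adds `dst` and so grows with the number of peers. The example gates the per-peer series behind a flag. The class name, the `per_peer_enabled` flag, and all label values are assumptions for illustration; the RFC does not specify how opt-in is configured.

```python
from collections import Counter


class LivenessGauges:
    def __init__(self, per_peer_enabled=False):
        self.per_peer_enabled = per_peer_enabled  # hypothetical opt-in switch
        self.sessions = Counter()       # (service_type, iface, src, state) -> count
        self.peer_sessions = Counter()  # (service_type, iface, src, dst, state) -> count

    def set_state(self, service_type, iface, src, dst, old_state, new_state):
        """Move one session from old_state (None if new) to new_state."""
        if old_state is not None:
            self.sessions[(service_type, iface, src, old_state)] -= 1
        self.sessions[(service_type, iface, src, new_state)] += 1
        if self.per_peer_enabled:
            # One extra series per peer: this is the high-cardinality part.
            if old_state is not None:
                self.peer_sessions[(service_type, iface, src, dst, old_state)] -= 1
            self.peer_sessions[(service_type, iface, src, dst, new_state)] += 1


def active_series(counter):
    """Number of label combinations with a non-zero value."""
    return sum(1 for value in counter.values() if value != 0)


# Two peers behind the same source address (made-up values).
default = LivenessGauges()
opted_in = LivenessGauges(per_peer_enabled=True)
for gauges in (default, opted_in):
    for dst in ("198.51.100.1", "198.51.100.2"):
        gauges.set_state("unicast", "eth0", "192.0.2.1", dst, None, "down")
        gauges.set_state("unicast", "eth0", "192.0.2.1", dst, "down", "up")
```

With two peers, the aggregate gauge holds a single active series with value 2, while the opted-in per-peer gauge holds two series, one per `dst`.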
The following metrics MAY be exposed:

| Name | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `doublezero_liveness_scheduler_queue_len` | gauge | `service_type`, `iface`, `src` | Current number of pending events in the scheduler queue. |
| `doublezero_liveness_handle_rx_duration_seconds` | histogram | `service_type`, `iface`, `src` | Distribution of time to handle a valid received packet. |
| `doublezero_liveness_control_packets_tx_total` | counter | `service_type`, `iface`, `src` | Total control packets sent. |
| `doublezero_liveness_control_packets_rx_total` | counter | `service_type`, `iface`, `src` | Total control packets received. |
| `doublezero_liveness_control_packets_rx_invalid_total` | counter | `service_type`, `iface`, `src`, `reason` | Invalid control packets received (e.g. `short`, `bad_version`, `bad_len`, `parse_error`, `not_ipv4`, `reserved_nonzero`). |
| `doublezero_liveness_unknown_peer_packets_total` | counter | `service_type`, `iface`, `src` | Packets received that didn’t match any known session. |