rfc: per-link metrics #898
Conversation
6c941dc to ff853d8
bgm-malbeclabs left a comment
Seems like option 1 would be the preferred approach. Regarding the decoupled nature of the network and on-chain data, while we might not need the solution in this RFC, we should adopt a strategy for getting to where we'd like to go.
> This option is similar to #1 but instead of storing aggregated telemetry data in a separate table within the telemetry program, the aggregation function could store them within the links table of the serviceability program itself. There is already a delay field (`delay_ms`) and jitter (`jitter_ms`) available in each row that could be reused, or new fields could be created. This allows the controller to poll a single table for both link and aggregated telemetry information.
>
> One downside of this is ownership of these records, as contributors have access to edit their links and are currently able to populate delay/jitter values. These could be removed from the doublezero CLI, but more nefarious users could bypass this.
> more nefarious users could bypass this.
Wouldn't we be able to see who makes updates to on-chain data or would it just show that an update had been made?
I believe each link has an owner pubkey associated with it and can only be modified by said pubkey (or the foundation, I think). The issue isn't detecting who changed it but rather that it can be changed at will, since the modification could immediately influence packet forwarding (e.g. setting a metric value to 1 from 100,000).
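For illustration, a minimal sketch of the ownership concern in plain Rust (not the actual serviceability program; every name besides `delay_ms`/`jitter_ms` is an assumption): the link record carries an owner pubkey, and updates to the contributor-settable delay/jitter fields are gated only on that owner's signature, so the program can verify who changed the values but cannot stop the owner from changing them at will.

```rust
// Illustrative sketch only: plain Rust, not the real serviceability program.
// `delay_ms` / `jitter_ms` come from the RFC text; everything else is assumed.

#[derive(Debug, Clone, PartialEq)]
struct Pubkey(String); // stand-in for a real on-chain public key type

#[derive(Debug)]
struct Link {
    owner: Pubkey,  // contributor that registered the link
    delay_ms: u32,  // contributor-populated today; could be reused for aggregated p99
    jitter_ms: u32, // contributor-populated today
}

#[derive(Debug)]
enum UpdateError {
    Unauthorized,
}

impl Link {
    // The only gate is "signer == owner": we can tell *who* changed the values,
    // but cannot stop the owner from setting delay_ms to 1 and immediately
    // influencing shortest-path selection.
    fn set_delay_jitter(
        &mut self,
        signer: &Pubkey,
        delay_ms: u32,
        jitter_ms: u32,
    ) -> Result<(), UpdateError> {
        if signer != &self.owner {
            return Err(UpdateError::Unauthorized);
        }
        self.delay_ms = delay_ms;
        self.jitter_ms = jitter_ms;
        Ok(())
    }
}

fn main() {
    let contributor = Pubkey("contributor-key".into());
    let mut link = Link { owner: contributor.clone(), delay_ms: 100_000, jitter_ms: 500 };

    // A legitimate correction and a gaming update look identical to the program.
    link.set_delay_jitter(&contributor, 1, 0).unwrap();
    println!("{link:?}");
}
```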
> The per-link metric calculation can be kept simple to start. We can directly use the p99 latency in microseconds over the previous epoch as the metric value. For example, a p99 latency of 400us would equate to a metric value of 400 on a link, whereas a p99 latency of 220ms would equate to a metric of 220,000. One could argue for a composite metric such as p99 latency (us) + p99 jitter (us), which would select for circuits with min(latency + jitter), and they would not be wrong, but we should bias towards a crawl/walk/run approach, especially until we have more confidence in telemetry sampling.
>
> In the event a link has no telemetry data from the previous epoch, the link could either be assigned a maximum metric of 16,777,215 or not placed into routing protocols at all.
> assigned a maximum metric of 16,777,215 or not placed into routing protocols at all.
this seems like a pretty good deterrent for a contributor to try to pull their links in and out of service to gain some advantage or make it seem like the link is used when it's not
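To make the proposed mapping concrete, a small sketch of the metric derivation described above (the function name is an assumption; 16,777,215 is the maximum metric named in the RFC, i.e. the 24-bit ceiling):

```rust
// Sketch: the metric is simply the p99 latency in microseconds from the
// previous epoch, with the 24-bit maximum as the fallback for links that
// produced no telemetry. The RFC's alternative is to withhold such links
// from the routing protocol entirely.

const MAX_METRIC: u32 = 16_777_215; // 2^24 - 1

fn metric_from_p99(p99_latency_us: Option<u32>) -> u32 {
    match p99_latency_us {
        Some(us) => us.min(MAX_METRIC),
        None => MAX_METRIC, // no samples last epoch
    }
}

fn main() {
    assert_eq!(metric_from_p99(Some(400)), 400);          // 400 us  -> 400
    assert_eq!(metric_from_p99(Some(220_000)), 220_000);  // 220 ms  -> 220,000
    assert_eq!(metric_from_p99(None), MAX_METRIC);        // no data -> 16,777,215
}
```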
> ## Open Questions
>
> *Items that still need resolution.*
> List outstanding issues, research tasks, or decisions deferred to later milestones. This section helps reviewers focus feedback and signals areas where contributions are welcomed.
✂️
> *Items that still need resolution.*
> List outstanding issues, research tasks, or decisions deferred to later milestones. This section helps reviewers focus feedback and signals areas where contributions are welcomed.
>
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
Definitely seems like we should have an override field and as long as there's some sort of audit trail to show why it was overridden, it should be transparent
> Should we include an override field in the link record?
Definitely, otherwise we'll be left unable to control routing in case of unexpected issues. But I wouldn't add it to the link record; I'd add it to a new `link_telemetry` record.
I think we should as well, but I'm not quite sure where it can go safely or who can actually apply an override.
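To ground the override discussion, a hypothetical sketch of an override that lives on a separate per-link telemetry record rather than on the contributor-owned link record; the `LinkTelemetry` name, the `override_metric` field, and the authority check are all assumptions, not anything decided in this RFC:

```rust
// Hypothetical sketch: a manual traffic-engineering override kept on a
// per-link telemetry record, settable only by a designated authority
// (not the bandwidth contributor), and preferred over the derived metric.

#[derive(Debug, Clone, PartialEq)]
struct Pubkey(String);

#[derive(Debug)]
struct LinkTelemetry {
    link: Pubkey,                 // 1:1 reference to the serviceability link
    computed_metric: u32,         // derived from aggregated telemetry each epoch
    override_metric: Option<u32>, // manual override, if any
    override_authority: Pubkey,   // who is allowed to set or clear the override
}

impl LinkTelemetry {
    fn set_override(&mut self, signer: &Pubkey, metric: Option<u32>) -> Result<(), &'static str> {
        if signer != &self.override_authority {
            return Err("only the override authority may set an override");
        }
        self.override_metric = metric;
        Ok(())
    }

    /// The controller would prefer the override when present.
    fn effective_metric(&self) -> u32 {
        self.override_metric.unwrap_or(self.computed_metric)
    }
}

fn main() {
    let authority = Pubkey("foundation".into());
    let mut lt = LinkTelemetry {
        link: Pubkey("link-123".into()),
        computed_metric: 400,
        override_metric: None,
        override_authority: authority.clone(),
    };
    lt.set_override(&authority, Some(50_000)).unwrap();
    assert_eq!(lt.effective_metric(), 50_000);
}
```

Keeping the override off the link record sidesteps the contributor-ownership problem and gives a natural place for an audit trail, per the comments above.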
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
>
> - I don't love the circular dependencies we're building with the network decoupled from on-chain data. If a device loses management connectivity, we're at risk of not being able to update metric values in the event of an operational issue. We specifically didn't want to run validators directly on the DZDs since a network issue could also cause us to never reach consensus.
Agreed although I don't have a solution off-hand
> This option is similar to #1 but instead of storing aggregated telemetry data in a separate table within the telemetry program, the aggregation function could store them within the links table of the serviceability program itself. There is already a delay field (`delay_ms`) and jitter (`jitter_ms`) available in each row that could be reused, or new fields could be created. This allows the controller to poll a single table for both link and aggregated telemetry information.
>
> One downside of this is ownership of these records, as contributors have access to edit their links and are currently able to populate delay/jitter values. These could be removed from the doublezero CLI, but more nefarious users could bypass this.
It would also add complexity to the access scheme of that serviceability data. The offchain aggregation process would now need write-access to links.
I'm just gonna throw this here since it popped into my head. I wonder if we should have two telemetry programs: the existing one already captures device-to-device telemetry, and we could introduce a LinkTelemetry program. However, I don't know whether that would require more work vs. changing either the serviceability or existing telemetry program.
I defer to you and Steven for guidance on where this should land, whether it be the existing telemetry program vs. elsewhere.
> ### Option #1: Store Aggregated Telemetry In Telemetry Program
>
> 1. Telemetry samples are written on-chain by the telemetry agent running on each DZD.
> 2. At the end of each epoch, an aggregation function runs to calculate p99 latency and jitter values. Since this is required for rewards calculation, it seems this could be reused here. In order to make these queryable, the output can be written to an on-chain record as part of the telemetry program.
Assuming this is done by some resource provider process that we initially run, similar to (or maybe the same as) the rewards processor. Realize we don't have to definitively answer that in this RFC, just noting it for clarity.
Piggy-backing on the rewards processor was my thought as well for both simplicity (we don't need to build something new) and transparency, in that both metrics and rewards are derived from ideally the same aggregated telemetry output.
Yes, this would remove one major step that the off-chain rewards calculator currently has to do, which is deriving telemetry data from raw samples. There would be some changes on that side as well to read the aggregated data, but that will hopefully be an easy lift.
Can we reuse the existing logic you wrote to aggregate telemetry? I'm wondering if we can break that out and make it responsible for writing the result to the ledger.
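A rough sketch of the Option #1 aggregation step, assuming it piggy-backs on the rewards processor's existing aggregation as discussed above; all names and the write path here are illustrative, not the real telemetry program interface:

```rust
// Sketch of the Option #1 flow: at epoch end, fold raw per-link samples into
// p99 latency/jitter aggregates that would then be written back on-chain so
// the controller (and the rewards calculator) read the aggregate instead of
// re-deriving it from raw samples.

use std::collections::HashMap;

#[derive(Debug)]
struct LinkAggregate {
    epoch: u64,
    p99_latency_us: u32,
    p99_jitter_us: u32,
}

// Nearest-rank p99 over a set of samples.
fn p99(samples: &mut Vec<u32>) -> u32 {
    samples.sort_unstable();
    let rank = ((samples.len() as f64) * 0.99).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

/// `samples`: per-link (latency_us, jitter_us) pairs collected over the epoch.
/// Returns one aggregate record per link that produced samples.
fn aggregate_epoch(
    epoch: u64,
    samples: &HashMap<String, Vec<(u32, u32)>>,
) -> HashMap<String, LinkAggregate> {
    samples
        .iter()
        .filter(|(_, s)| !s.is_empty())
        .map(|(link, s)| {
            let mut lat: Vec<u32> = s.iter().map(|(l, _)| *l).collect();
            let mut jit: Vec<u32> = s.iter().map(|(_, j)| *j).collect();
            (
                link.clone(),
                LinkAggregate {
                    epoch,
                    p99_latency_us: p99(&mut lat),
                    p99_jitter_us: p99(&mut jit),
                },
            )
        })
        .collect()
}

fn main() {
    let mut samples = HashMap::new();
    samples.insert("lax-nyc".to_string(), vec![(400, 20), (410, 25), (395, 15)]);
    for (link, agg) in aggregate_epoch(42, &samples) {
        // In the real pipeline this would be a write against the telemetry program.
        println!("{link}: {agg:?}");
    }
}
```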
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
>
> - I don't love the circular dependencies we're building with the network decoupled from on-chain data. If a device loses management connectivity, we're at risk of not being able to update metric values in the event of an operational issue. We specifically didn't want to run validators directly on the DZDs since a network issue could also cause us to never reach consensus.
The scenario here is that device telemetry metrics aren't being written onchain, resulting in out-of-date aggregate metrics used by the controller to render configs? Is the answer there to degrade by assuming max metric value and render the config based on that? We'd also still have metrics from other devices to that device in that situation.
The general problem in my mind is that making any operational change on the network relies on a DoubleZero device having connectivity to a controller instance or, assuming that process moves onto each device, the ability to read on-chain data via the public internet. If that network connectivity is severed (e.g. due to an outage, misconfiguration, etc.) and requires reconfiguration to rectify, you're left with manual configuration through bandwidth contributors. In the past we've talked about running the side chain directly on the DZDs, but that also has a circular dependency in that network partitions could prevent consensus from being reached.
I don't think we have a great solution this early in the project but will keep calling it out.
We could have a more robust on-disk caching system (at least for the raw samples) which could live on-device for, say, 24 hours.
On the same note, I wonder if we could introduce a built-in penalty system for unreachable links, for example (see the sketch after this list):
- Fresh (0-1 epochs old): ideal, uses the p99 metric calculations
- Stale (2-3 epochs old): could use the last calculated metrics + 10% penalty
- Dead (4-6 epochs old): could use the historical avg + 25% penalty
- Unknown (> 6 epochs old): some default trash value
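A minimal sketch of how those tiers could map to metric values. The tier boundaries and penalty percentages come from the list above; the function shape, the historical-average input, and using the 24-bit maximum as the "trash" value are assumptions:

```rust
// Sketch of a staleness penalty: the metric used by the controller degrades
// as the last on-chain aggregate for a link gets older.

const MAX_METRIC: u32 = 16_777_215; // assumed "default trash value" for unknown links

fn penalized_metric(
    epochs_old: u64,     // current_epoch - epoch of the last aggregate
    last_metric: u32,    // metric from the most recent aggregate
    historical_avg: u32, // longer-term average metric for the link
) -> u32 {
    // Apply a percentage penalty, capped at the maximum metric.
    let scaled = |base: u32, pct: u32| -> u32 {
        ((base as u64) * (100 + pct as u64) / 100).min(MAX_METRIC as u64) as u32
    };
    match epochs_old {
        0..=1 => last_metric,                // Fresh: use the p99-derived metric as-is
        2..=3 => scaled(last_metric, 10),    // Stale: last metric + 10% penalty
        4..=6 => scaled(historical_avg, 25), // Dead: historical average + 25% penalty
        _ => MAX_METRIC,                     // Unknown: fall back to the max metric
    }
}

fn main() {
    assert_eq!(penalized_metric(1, 400, 450), 400);
    assert_eq!(penalized_metric(3, 400, 450), 440);
    assert_eq!(penalized_metric(5, 400, 450), 562);
    assert_eq!(penalized_metric(9, 400, 450), MAX_METRIC);
}
```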
> This option is similar to #1 but instead of storing aggregated telemetry data in a separate table within the telemetry program, the aggregation function could store them within the links table of the serviceability program itself. There is already a delay field (`delay_ms`) and jitter (`jitter_ms`) available in each row that could be reused, or new fields could be created. This allows the controller to poll a single table for both link and aggregated telemetry information.
>
> One downside of this is ownership of these records, as contributors have access to edit their links and are currently able to populate delay/jitter values. These could be removed from the doublezero CLI, but more nefarious users could bypass this.
Can we really re-use `delay_ms` and `jitter_ms`? I thought they were there so contributors can inform DZ of their expected latency and jitter. And even if we can re-use them, I think contributors' ability to override the telemetry-derived metrics is a blocker for this approach.
But serviceability feels like the right place for this data. How about adding it as a separate serviceability table with a 1:1 mapping to the link table?
> Can we really re-use `delay_ms` and `jitter_ms`? I thought they were there so contributors can inform DZ of their expected latency and jitter. And even if we can re-use them, I think contributors' ability to override the telemetry-derived metrics is a blocker for this approach.
Who is being informed? I'm not aware of these fields being consumed at the moment. I do agree we can't have these directly overridden by a contributor.
> But serviceability feels like the right place for this data. How about adding it as a separate serviceability table with a 1:1 mapping to the link table?
That's possible, but at that point I think we should just go with option #1, where the data lives in the telemetry program and is linked from there.
> *Items that still need resolution.*
> List outstanding issues, research tasks, or decisions deferred to later milestones. This section helps reviewers focus feedback and signals areas where contributions are welcomed.
>
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
> Should we include an override field in the link record?
Definitely, otherwise we'll be left unable to control routing in case of unexpected issues. But I wouldn't add it to the link record; I'd add it to a new `link_telemetry` record.
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
>
> - I don't love the circular dependencies we're building with the network decoupled from on-chain data. If a device loses management connectivity, we're at risk of not being able to update metric values in the event of an operational issue. We specifically didn't want to run validators directly on the DZDs since a network issue could also cause us to never reach consensus.
Probably a bigger topic but we could treat a loss of management connectivity as an outage and isolate the offending device. But then we'd need to improve controller availability. The circular dependency issue is important but it feels like we need to punt it to post-mainnet-beta.
It seems risky to completely isolate a device if management is down. How do we determine if management is down? If it's connectivity to a controller or on-chain, we're at risk of a cascading failure where all or a portion of the network drains itself if aws/triton/tbd is having a bad day. Definitely a post-mainnet-beta problem to solve.
Yup, much prefer Option #1. Arguably the serviceability account data will stabilize over time and presumably would be much less frequently updated than Telemetry accounts.
> The DoubleZero controller could consume the most recent telemetry samples written on-chain within an epoch and immediately apply updated per-link metrics. While this is fairly straightforward to implement, without a smoothing function applied, sample-to-sample outliers would directly impact packet forwarding within DoubleZero as jitter would potentially cause shortest-path recalculations on devices.
>
> The controller could use a windowing function internally to smooth latency outliers, but unless the window was done on a strict boundary, it may be difficult to understand which samples contributed to a particular metric value. Additionally, the smoothed latency value and metric would need to be written back on-chain, both for transparency and to provide durability across controller restarts, as the architecture is currently stateless and solely based on on-chain data.
We could have some sort of built-in hysteresis to somewhat address smoothing (i.e. preventing transient latency spikes). So something like: new_metric = 0.8 * current_metric + 0.2 * calculated_metric (when within some hysteresis bounds).
So we only update aggregated metrics if the delta is greater than 5% or something.
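A small sketch of that hysteresis idea. The 0.8/0.2 weights and the 5% threshold come from the comment above; everything else, including operating on integer microsecond-based metrics, is an assumption:

```rust
// Sketch of the suggested smoothing: blend the current metric with the newly
// calculated one, and only publish an update when the change exceeds a
// hysteresis threshold, so transient latency spikes don't trigger constant
// shortest-path recalculations.

fn smoothed_metric(current: u32, calculated: u32) -> u32 {
    // new_metric = 0.8 * current_metric + 0.2 * calculated_metric
    ((current as f64) * 0.8 + (calculated as f64) * 0.2).round() as u32
}

/// Returns Some(new_metric) only when the smoothed value moves more than
/// `threshold_pct` percent away from the current metric.
fn maybe_update(current: u32, calculated: u32, threshold_pct: f64) -> Option<u32> {
    let candidate = smoothed_metric(current, calculated);
    if current == 0 {
        return Some(candidate); // avoid division by zero on a first-ever metric
    }
    let delta_pct = ((candidate as f64) - (current as f64)).abs() / (current as f64) * 100.0;
    (delta_pct > threshold_pct).then_some(candidate)
}

fn main() {
    // A brief spike from 400 us to 480 us smooths to 416 (+4%), below a 5%
    // threshold, so no update is published.
    assert_eq!(maybe_update(400, 480, 5.0), None);

    // A sustained jump to 1,000 us smooths to 520 (+30%) and is published.
    assert_eq!(maybe_update(400, 1_000, 5.0), Some(520));
}
```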
Summary of Changes
This RFC discusses a proposal for setting initial per-link metrics across DoubleZero based on the telemetry pipeline.