rfc: per-link metrics #898
Conversation
6c941dc to ff853d8
bgm-malbeclabs left a comment
Seems like option 1 would be the preferred approach. Regarding the decoupled nature of the network and on-chain data, while we might not need the solution in this RFC, we should adopt a strategy for getting to where we'd like to go.
> This option is similar to #1 but instead of storing aggregated telemetry data in a separate table within the telemetry program, the aggregation function could store them within the links table of the serviceability program itself. There is already a delay field (`delay_ms`) and jitter (`jitter_ms`) available in each row that could be reused, or new fields could be created. This allows the controller to poll a single table for both link and aggregated telemetry information.
>
> One downside of this is ownership of these records, as contributors have access to edit their links and are currently able to populate delay/jitter values. These could be removed from the doublezero CLI, but more nefarious users could bypass this.
> more nefarious users could bypass this.
Wouldn't we be able to see who makes updates to on-chain data or would it just show that an update had been made?
I believe each link has an owner pubkey associated with it and can only be modified by said pubkey (or the foundation, I think). The issue isn't detecting who changed it but rather that it can be changed at will, since the modification could immediately influence packet forwarding (e.g. setting a metric value to 1 from 100,000).
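For illustration, a minimal sketch of the ownership concern in plain Rust (not the actual serviceability program; every name besides `delay_ms`/`jitter_ms` is an assumption): the link record carries an owner pubkey, and updates to the contributor-settable delay/jitter fields are gated only on that owner's signature, so the program can verify who changed the values but cannot stop the owner from changing them at will.

```rust
// Illustrative sketch only: plain Rust, not the real serviceability program.
// `delay_ms` / `jitter_ms` come from the RFC text; everything else is assumed.

#[derive(Debug, Clone, PartialEq)]
struct Pubkey(String); // stand-in for a real on-chain public key type

#[derive(Debug)]
struct Link {
    owner: Pubkey,  // contributor that registered the link
    delay_ms: u32,  // contributor-populated today; could be reused for aggregated p99
    jitter_ms: u32, // contributor-populated today
}

#[derive(Debug)]
enum UpdateError {
    Unauthorized,
}

impl Link {
    // The only gate is "signer == owner": we can tell *who* changed the values,
    // but cannot stop the owner from setting delay_ms to 1 and immediately
    // influencing shortest-path selection.
    fn set_delay_jitter(
        &mut self,
        signer: &Pubkey,
        delay_ms: u32,
        jitter_ms: u32,
    ) -> Result<(), UpdateError> {
        if signer != &self.owner {
            return Err(UpdateError::Unauthorized);
        }
        self.delay_ms = delay_ms;
        self.jitter_ms = jitter_ms;
        Ok(())
    }
}

fn main() {
    let contributor = Pubkey("contributor-key".into());
    let mut link = Link { owner: contributor.clone(), delay_ms: 100_000, jitter_ms: 500 };

    // A legitimate correction and a gaming update look identical to the program.
    link.set_delay_jitter(&contributor, 1, 0).unwrap();
    println!("{link:?}");
}
```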
> The per-link metric calculation can be kept simple to start. We can directly use the p99 latency in microseconds over the previous epoch as the metric value. For example, a p99 latency of 400us would equate to a metric value of 400 on a link, whereas a p99 latency of 220ms would equate to a metric of 220,000. One could argue for a composite metric such as p99 latency (us) + p99 jitter (us), which would select for circuits with min(latency + jitter), and they would not be wrong, but we should bias towards a crawl/walk/run approach, especially until we have more confidence in telemetry sampling.
>
> In the event a link has no telemetry data from the previous epoch, the link could either be assigned a maximum metric of 16,777,215 or not placed into routing protocols at all.
> assigned a maximum metric of 16,777,215 or not placed into routing protocols at all.
this seems like a pretty good deterrent for a contributor to try to pull their links in and out of service to gain some advantage or make it seem like the link is used when it's not
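To make the proposed mapping concrete, a small sketch of the metric derivation described above (the function name is an assumption; 16,777,215 is the maximum metric named in the RFC, i.e. the 24-bit ceiling):

```rust
// Sketch: the metric is simply the p99 latency in microseconds from the
// previous epoch, with the 24-bit maximum as the fallback for links that
// produced no telemetry. The RFC's alternative is to withhold such links
// from the routing protocol entirely.

const MAX_METRIC: u32 = 16_777_215; // 2^24 - 1

fn metric_from_p99(p99_latency_us: Option<u32>) -> u32 {
    match p99_latency_us {
        Some(us) => us.min(MAX_METRIC),
        None => MAX_METRIC, // no samples last epoch
    }
}

fn main() {
    assert_eq!(metric_from_p99(Some(400)), 400);          // 400 us  -> 400
    assert_eq!(metric_from_p99(Some(220_000)), 220_000);  // 220 ms  -> 220,000
    assert_eq!(metric_from_p99(None), MAX_METRIC);        // no data -> 16,777,215
}
```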
> ## Open Questions
>
> *Items that still need resolution.*
> List outstanding issues, research tasks, or decisions deferred to later milestones. This section helps reviewers focus feedback and signals areas where contributions are welcomed.
✂️
> *Items that still need resolution.*
> List outstanding issues, research tasks, or decisions deferred to later milestones. This section helps reviewers focus feedback and signals areas where contributions are welcomed.
>
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
Definitely seems like we should have an override field and as long as there's some sort of audit trail to show why it was overridden, it should be transparent
> Should we include an override field in the link record?
Definitely, otherwise we'll be left unable to control routing in case of unexpected issues. But I wouldn't add it to the link record; I'd add it to a new `link_telemetry` record.
I think we should as well, but I'm not quite sure where it can go safely or who can actually apply an override.
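To ground the override discussion, a hypothetical sketch of an override that lives on a separate per-link telemetry record rather than on the contributor-owned link record; the `LinkTelemetry` name, the `override_metric` field, and the authority check are all assumptions, not anything decided in this RFC:

```rust
// Hypothetical sketch: a manual traffic-engineering override kept on a
// per-link telemetry record, settable only by a designated authority
// (not the bandwidth contributor), and preferred over the derived metric.

#[derive(Debug, Clone, PartialEq)]
struct Pubkey(String);

#[derive(Debug)]
struct LinkTelemetry {
    link: Pubkey,                 // 1:1 reference to the serviceability link
    computed_metric: u32,         // derived from aggregated telemetry each epoch
    override_metric: Option<u32>, // manual override, if any
    override_authority: Pubkey,   // who is allowed to set or clear the override
}

impl LinkTelemetry {
    fn set_override(&mut self, signer: &Pubkey, metric: Option<u32>) -> Result<(), &'static str> {
        if signer != &self.override_authority {
            return Err("only the override authority may set an override");
        }
        self.override_metric = metric;
        Ok(())
    }

    /// The controller would prefer the override when present.
    fn effective_metric(&self) -> u32 {
        self.override_metric.unwrap_or(self.computed_metric)
    }
}

fn main() {
    let authority = Pubkey("foundation".into());
    let mut lt = LinkTelemetry {
        link: Pubkey("link-123".into()),
        computed_metric: 400,
        override_metric: None,
        override_authority: authority.clone(),
    };
    lt.set_override(&authority, Some(50_000)).unwrap();
    assert_eq!(lt.effective_metric(), 50_000);
}
```

Keeping the override off the link record sidesteps the contributor-ownership problem and gives a natural place for an audit trail, per the comments above.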
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
>
> - I don't love the circular dependencies we're building with the network decoupled from on-chain data. If a device loses management connectivity, we're at risk of not being able to update metric values in the event of an operational issue. We specifically didn't want to run validators directly on the DZDs since a network issue could also cause us to never reach consensus.
Agreed although I don't have a solution off-hand
> This option is similar to #1 but instead of storing aggregated telemetry data in a separate table within the telemetry program, the aggregation function could store them within the links table of the serviceability program itself. There is already a delay field (`delay_ms`) and jitter (`jitter_ms`) available in each row that could be reused, or new fields could be created. This allows the controller to poll a single table for both link and aggregated telemetry information.
>
> One downside of this is ownership of these records, as contributors have access to edit their links and are currently able to populate delay/jitter values. These could be removed from the doublezero CLI, but more nefarious users could bypass this.
It would also add complexity to the access scheme of that serviceability data. The offchain aggregation process would now need write-access to links.
I'm just gonna throw this here since it popped into my head. I wonder if we should have two telemetry programs: the existing one already captures device-to-device telemetry, and we could introduce a LinkTelemetry program. However, I don't know whether that would require more work vs. changing either the serviceability or existing telemetry program.
I defer to you and Steven for guidance on where this should land, whether it be the existing telemetry program vs. elsewhere.
> ### Option #1: Store Aggregated Telemetry In Telemetry Program
>
> 1. Telemetry samples are written on-chain by the telemetry agent running on each DZD.
> 2. At the end of each epoch, an aggregation function runs to calculate p99 latency and jitter values. Since this is required for rewards calculation, it seems this could be reused here. In order to make these queryable, the output can be written to an on-chain record as part of the telemetry program.
Assuming this is done by some resource provider process that we initially run, similar to (or maybe the same as) the rewards processor. Realize we don't have to definitively answer that in this RFC, just noting it for clarity.
Piggy-backing on the rewards processor was my thought as well for both simplicity (we don't need to build something new) and transparency, in that both metrics and rewards are derived from ideally the same aggregated telemetry output.
Yes, this would remove one major step that the off-chain rewards calculator currently has to do, which is deriving telemetry data from raw samples. There would be some changes on that side as well to read the aggregated data, but that will hopefully be an easy lift.
Can we reuse the existing logic you wrote to aggregate telemetry? I'm wondering if we can break that out and make it responsible for writing the result to the ledger.
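A rough sketch of the Option #1 aggregation step, assuming it piggy-backs on the rewards processor's existing aggregation as discussed above; all names and the write path here are illustrative, not the real telemetry program interface:

```rust
// Sketch of the Option #1 flow: at epoch end, fold raw per-link samples into
// p99 latency/jitter aggregates that would then be written back on-chain so
// the controller (and the rewards calculator) read the aggregate instead of
// re-deriving it from raw samples.

use std::collections::HashMap;

#[derive(Debug)]
struct LinkAggregate {
    epoch: u64,
    p99_latency_us: u32,
    p99_jitter_us: u32,
}

// Nearest-rank p99 over a set of samples.
fn p99(samples: &mut Vec<u32>) -> u32 {
    samples.sort_unstable();
    let rank = ((samples.len() as f64) * 0.99).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

/// `samples`: per-link (latency_us, jitter_us) pairs collected over the epoch.
/// Returns one aggregate record per link that produced samples.
fn aggregate_epoch(
    epoch: u64,
    samples: &HashMap<String, Vec<(u32, u32)>>,
) -> HashMap<String, LinkAggregate> {
    samples
        .iter()
        .filter(|(_, s)| !s.is_empty())
        .map(|(link, s)| {
            let mut lat: Vec<u32> = s.iter().map(|(l, _)| *l).collect();
            let mut jit: Vec<u32> = s.iter().map(|(_, j)| *j).collect();
            (
                link.clone(),
                LinkAggregate {
                    epoch,
                    p99_latency_us: p99(&mut lat),
                    p99_jitter_us: p99(&mut jit),
                },
            )
        })
        .collect()
}

fn main() {
    let mut samples = HashMap::new();
    samples.insert("lax-nyc".to_string(), vec![(400, 20), (410, 25), (395, 15)]);
    for (link, agg) in aggregate_epoch(42, &samples) {
        // In the real pipeline this would be a write against the telemetry program.
        println!("{link}: {agg:?}");
    }
}
```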
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
>
> - I don't love the circular dependencies we're building with the network decoupled from on-chain data. If a device loses management connectivity, we're at risk of not being able to update metric values in the event of an operational issue. We specifically didn't want to run validators directly on the DZDs since a network issue could also cause us to never reach consensus.
The scenario here is that device telemetry metrics aren't being written onchain, resulting in out-of-date aggregate metrics used by the controller to render configs? Is the answer there to degrade by assuming max metric value and render the config based on that? We'd also still have metrics from other devices to that device in that situation.
The general problem in my mind is that making any operational change on the network relies on a DoubleZero device having connectivity to a controller instance or, assuming that process moves onto each device, the ability to read on-chain data via the public internet. If that network connectivity is severed (e.g. due to an outage, misconfiguration, etc.) and requires reconfiguration to rectify, you're left with manual configuration through bandwidth contributors. In the past we've talked about running the side chain directly on the DZDs, but that also has a circular dependency in that network partitions could prevent consensus from being reached.
I don't think we have a great solution this early in the project but will keep calling it out.
We could have a more robust on-disk caching system (at least for the raw samples) which could live on-device for, say, 24 hours.
On the same note, I wonder if we could introduce a built-in penalty system for unreachable links, for example (see the sketch after this list):
- Fresh (0-1 epochs old): ideal, uses the p99 metric calculations
- Stale (2-3 epochs old): could use the last calculated metrics + 10% penalty
- Dead (4-6 epochs old): could use the historical avg + 25% penalty
- Unknown (> 6 epochs old): some default trash value
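A minimal sketch of how those tiers could map to metric values. The tier boundaries and penalty percentages come from the list above; the function shape, the historical-average input, and using the 24-bit maximum as the "trash" value are assumptions:

```rust
// Sketch of a staleness penalty: the metric used by the controller degrades
// as the last on-chain aggregate for a link gets older.

const MAX_METRIC: u32 = 16_777_215; // assumed "default trash value" for unknown links

fn penalized_metric(
    epochs_old: u64,     // current_epoch - epoch of the last aggregate
    last_metric: u32,    // metric from the most recent aggregate
    historical_avg: u32, // longer-term average metric for the link
) -> u32 {
    // Apply a percentage penalty, capped at the maximum metric.
    let scaled = |base: u32, pct: u32| -> u32 {
        ((base as u64) * (100 + pct as u64) / 100).min(MAX_METRIC as u64) as u32
    };
    match epochs_old {
        0..=1 => last_metric,                // Fresh: use the p99-derived metric as-is
        2..=3 => scaled(last_metric, 10),    // Stale: last metric + 10% penalty
        4..=6 => scaled(historical_avg, 25), // Dead: historical average + 25% penalty
        _ => MAX_METRIC,                     // Unknown: fall back to the max metric
    }
}

fn main() {
    assert_eq!(penalized_metric(1, 400, 450), 400);
    assert_eq!(penalized_metric(3, 400, 450), 440);
    assert_eq!(penalized_metric(5, 400, 450), 562);
    assert_eq!(penalized_metric(9, 400, 450), MAX_METRIC);
}
```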
> This option is similar to #1 but instead of storing aggregated telemetry data in a separate table within the telemetry program, the aggregation function could store them within the links table of the serviceability program itself. There is already a delay field (`delay_ms`) and jitter (`jitter_ms`) available in each row that could be reused, or new fields could be created. This allows the controller to poll a single table for both link and aggregated telemetry information.
>
> One downside of this is ownership of these records, as contributors have access to edit their links and are currently able to populate delay/jitter values. These could be removed from the doublezero CLI, but more nefarious users could bypass this.
Can we really re-use `delay_ms` and `jitter_ms`? I thought they were there so contributors can inform DZ of their expected latency and jitter. And even if we can re-use them, I think contributors' ability to override the telemetry-derived metrics is a blocker for this approach.
But serviceability feels like the right place for this data. How about adding it as a separate serviceability table with a 1:1 mapping to the link table?
> Can we really re-use `delay_ms` and `jitter_ms`? I thought they were there so contributors can inform DZ of their expected latency and jitter. And even if we can re-use them, I think contributors' ability to override the telemetry-derived metrics is a blocker for this approach.
Who is being informed? I'm not aware of these fields being consumed at the moment. I do agree we can't have these directly overridden by a contributor.
> But serviceability feels like the right place for this data. How about adding it as a separate serviceability table with a 1:1 mapping to the link table?
That's possible, but at that point I think we should just go with option #1, where the data lives in the telemetry program and is linked from there.
> *Items that still need resolution.*
> List outstanding issues, research tasks, or decisions deferred to later milestones. This section helps reviewers focus feedback and signals areas where contributions are welcomed.
>
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
> Should we include an override field in the link record?
Definitely, otherwise we'll be left unable to control routing in case of unexpected issues. But I wouldn't add it to the link record; I'd add it to a new `link_telemetry` record.
> - Should we include an override field in the link record? We're solely dependent on telemetry behaving correctly, and even if it is, there are times when manual traffic engineering will be necessary. This is, however, dangerous if it can be set by the bandwidth contributor, so it may belong somewhere else.
>
> - I don't love the circular dependencies we're building with the network decoupled from on-chain data. If a device loses management connectivity, we're at risk of not being able to update metric values in the event of an operational issue. We specifically didn't want to run validators directly on the DZDs since a network issue could also cause us to never reach consensus.
Probably a bigger topic but we could treat a loss of management connectivity as an outage and isolate the offending device. But then we'd need to improve controller availability. The circular dependency issue is important but it feels like we need to punt it to post-mainnet-beta.
It seems risky to completely isolate a device if management is down. How do we determine if management is down? If it's connectivity to a controller or on-chain, we're at risk of a cascading failure where all or a portion of the network drains itself if aws/triton/tbd is having a bad day. Definitely a post-mainnet-beta problem to solve.
Yup, much prefer Option #1. Arguably the serviceability account data will stabilize over time and presumably would be much less frequently updated than Telemetry accounts.
> The DoubleZero controller could consume the most recent telemetry samples written on-chain within an epoch and immediately apply updated per-link metrics. While this is fairly straightforward to implement, without a smoothing function applied, sample-to-sample outliers would directly impact packet forwarding within DoubleZero as jitter would potentially cause shortest-path recalculations on devices.
>
> The controller could use a windowing function internally to smooth latency outliers, but unless the window was done on a strict boundary, it may be difficult to understand which samples contributed to a particular metric value. Additionally, the smoothed latency value and metric would need to be written back on-chain, both for transparency and to provide durability across controller restarts, as the architecture is currently stateless and solely based on on-chain data.
We could have some sort of built-in hysteresis to somewhat address smoothing (i.e. preventing transient latency spikes). So something like: new_metric = 0.8 * current_metric + 0.2 * calculated_metric (when within some hysteresis bounds).
So we only update aggregated metrics if the delta is greater than 5% or something.
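A small sketch of that hysteresis idea. The 0.8/0.2 weights and the 5% threshold come from the comment above; everything else, including operating on integer microsecond-based metrics, is an assumption:

```rust
// Sketch of the suggested smoothing: blend the current metric with the newly
// calculated one, and only publish an update when the change exceeds a
// hysteresis threshold, so transient latency spikes don't trigger constant
// shortest-path recalculations.

fn smoothed_metric(current: u32, calculated: u32) -> u32 {
    // new_metric = 0.8 * current_metric + 0.2 * calculated_metric
    ((current as f64) * 0.8 + (calculated as f64) * 0.2).round() as u32
}

/// Returns Some(new_metric) only when the smoothed value moves more than
/// `threshold_pct` percent away from the current metric.
fn maybe_update(current: u32, calculated: u32, threshold_pct: f64) -> Option<u32> {
    let candidate = smoothed_metric(current, calculated);
    if current == 0 {
        return Some(candidate); // avoid division by zero on a first-ever metric
    }
    let delta_pct = ((candidate as f64) - (current as f64)).abs() / (current as f64) * 100.0;
    (delta_pct > threshold_pct).then_some(candidate)
}

fn main() {
    // A brief spike from 400 us to 480 us smooths to 416 (+4%), below a 5%
    // threshold, so no update is published.
    assert_eq!(maybe_update(400, 480, 5.0), None);

    // A sustained jump to 1,000 us smooths to 520 (+30%) and is published.
    assert_eq!(maybe_update(400, 1_000, 5.0), Some(520));
}
```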
Summary of Changes
This RFC discusses a proposal for setting initial per-link metrics across DoubleZero based on the telemetry pipeline.