-
Notifications
You must be signed in to change notification settings - Fork 4.5k
xds: add MetricsReporter for generic xds client #8274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds: add MetricsReporter for generic xds client #8274
Conversation
@dfawley could you take a look at the MetricsReporter interface proposed here? I have tried to keep it generic enough so that we don't need to add new methods to top level interfaces when adding new metric types and metric handles. Once we have an agreement on the interface, I can write the tests. |
0279ea0
to
bd3ecc8
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #8274 +/- ##
==========================================
+ Coverage 82.22% 82.27% +0.05%
==========================================
Files 419 419
Lines 41954 41974 +20
==========================================
+ Hits 34497 34535 +38
+ Misses 5995 5981 -14
+ Partials 1462 1458 -4
🚀 New features to boost your workflow:
|
bd3ecc8
to
6fa2987
Compare
@@ -121,6 +122,7 @@ type authorityBuildOptions struct { | |||
getChannelForADS xdsChannelForADS // Function to acquire a reference to an xdsChannel | |||
logPrefix string // Prefix for logging | |||
target string // Target for the gRPC Channel that owns xDS Client/Authority | |||
metricsRecorder MetricsRecorder // Metrics recorder for metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try to avoid comments that add no value. Similar for "Prefix for logging" when the name of the variable already tells you what it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed some unnecessary comments
@@ -363,6 +366,9 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re | |||
// On error, keep previous version of the resource. But update status | |||
// and error. | |||
if uErr.Err != nil { | |||
if a.metricsRecorder != nil { | |||
a.metricsRecorder.Record(xdsClientResourceUpdatesInvalidMetric, int64(1), a.target, serverConfig.ServerIdentifier.ServerURI, rType.TypeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cast should not be needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -378,6 +384,10 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re | |||
continue | |||
} | |||
|
|||
if a.metricsRecorder != nil { | |||
a.metricsRecorder.Record(xdsClientResourceUpdatesValidMetric, int64(1), a.target, serverConfig.ServerIdentifier.ServerURI, rType.TypeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// MetricsReporter provides a way for XDSClient to register MetricHandle(s) | ||
// and obtain a MetricsRecorder to record metrics. | ||
type MetricsReporter interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is far more complex than I believe we need for this package. Most of these things here are concerns of OpenTelemetry, not this package. This package can just say what happened. The application can determine based on the outputs of this how to send it to whatever metric system it is using.
E.g.
https://go.dev/play/p/Tk6LKooRIMC
This is if we need to be able to carry information along with each event. If we don't need information in the events, we could just use an enumeration for the events that can occur instead (type Metric int
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we should maybe share one Metrics definition between all the clients?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have simplified to only having MetricsReport that records metrics as suggested above. I have removed the registration part which means the MetricsReporter implementation needs to dynamically consume the reported metric and make sense of it through type casting etc. So, its going to be same for all clients. I added a method to Metric interface to return target because we can't have empty interfaces and target should be common for all types of metrics. Right now for generic client its hardcoded to "xds-client".
I have created a test implementation which is similar to internal/testutils/stats/test_metrics_recorder.go
with the difference being the MetricsReporter implementation adding the lables, description, name etc. We can do the same with internal metrics recorder that is used for grpc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have copied the metrics test and modified to use this new MetricsReporter implementation.
49961d8
to
7114106
Compare
// Metric is type of metric to be reported by XDSClient. | ||
type Metric interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this definition and the MetricsReporter
to clients
instead? Individual metrics could be declared in each of the xds/lrs clients, but the base definition could be shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// Metric is type of metric to be reported by XDSClient. | ||
type Metric interface { | ||
Target() string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the source of this information? Isn't it passed to the xdsclient when it's created? If so, I don't think there's any need to give it back to the application now. The MetricsRecorder can be instrumented with the same knowledge.
Actually maybe we can just remove this type, even, and make ReportMetric
take an any
instead. That's probably sufficient. It can help with documentation to have things like this have their own type, but maybe that's not a concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i was also contemplating making it any because vet was complaining of non proto file having interface without methods. Yeah target right now is being hardcoded to "xds-client" by us. I think once we have an interface for logger, we can accept user provided target there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I thought "target" would be the channel's target string. It seems to have no value, then, if it's hard-coded.
What does logging have to do with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah for xds resolver it is a target string but for xds-enabled server it is a constant "#server" as per grfc A71. In internal client, we pass it as a param to the client and mainly use it for refcounting client instances. In external client, we don't have any use case for it except may be wrap in the log messages. So, we can still accept it as param for client via client config may be or just as a prefix in logger? We can have a prefix method in the logger interface.
// the xDS management server for a given resource type. | ||
type MetricResourceUpdateValid struct { | ||
ServerURI string // ServerURI of the xDS management server. | ||
Incr int64 // Count to be incremented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we ever see this being anything but "1"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now its only being incr by 1 but having it has a parameter makes it explicit to user what to do with it. Otherwise we can document it that way that it reports a single new update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just delete it.
xds/internal/clients/config.go
Outdated
// MetricsReporter is used by the XDSClient to report metrics. Metric can be of | ||
// any type. | ||
// | ||
// For example: see xdsclient/internal/test_metrics_reporter.go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation shouldn't link to internal test code.
If an example is needed, we can add a clients_examples_test.go
and put an example there. Otherwise, just text documentation should normally be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have updated the documentation to close what you suggested below. I was thinking of an example which type cast and print the metric but was not very convinced of adding it in the docstring. I think with docstring sentences now user should be able to figure our how to use
xds/internal/clients/config.go
Outdated
// MetricsReporter is used by the XDSClient to report metrics. Metric can be of | ||
// any type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metric can be of any type.
Not really. The metric will be one of a predefined set of types depending on the client. This should explain that each client will produce different metrics and to see that client's documentation for a list of possible metrics events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks.
@@ -59,6 +59,17 @@ func (c *Channel) Replace(value any) { | |||
} | |||
} | |||
|
|||
// ReceiveOrFail returns the value on the underlying channel and true, or nil | |||
// and false if the channel was empty. | |||
func (c *Channel) ReceiveOrFail() (any, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this for? It's called once, and its return value isn't used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is draining the existing value from channel if present before we send the new value. Similar to what is being done in test metrics recorder in stats package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just inconvenience i believe. For tests, we usually want to verify the metric that was reported recently, even though the multiple metrics are reported throughout tests. So, draining the channel before reporting a new value makes it easier by only keeping the metric of immediate interest.
For example, if we are looking for a metric that should not be reported, we only have to check the first value instead of all values in the channel.
I have made the channel size to be 1 to make it more obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I agree with this testing philosophy. We shouldn't blindly throw away outputs at any step. If the test needs to say "ignore everything else, I need this metric" then it should do:
loop: for {
select {
case got := <-metricsChan:
if got == want {
break loop // success
}
case <-ctx.Done():
t.Fatalf("timed out waiting for metric %v", want)
}
}
And if we need we can put that in a helper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, i have just removed ReceiveOrFail
. For the tests we have right now, we are anyway checking for one metric at a time after its reported.
We should probably see if we can do something better in stats one.
@@ -363,6 +367,11 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re | |||
// On error, keep previous version of the resource. But update status | |||
// and error. | |||
if uErr.Err != nil { | |||
if a.metricsReporter != nil { | |||
a.metricsReporter.ReportMetric(MetricResourceUpdateInvalid{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's pass all of these by pointer instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// MetricResourceUpdateInvalid is a Metric to report invalid resource updates | ||
// from the xDS management server for a given resource type. | ||
type MetricResourceUpdateInvalid struct { | ||
Target string // Target of the metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed Target and Incr
// recording events on channels and provides helpers to check if certain events | ||
// have taken place. It also persists metrics data keyed on the metrics | ||
// type. | ||
type TestMetricsReporter struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not export anything defined in test files. Nothing can import them anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a test file? This is in internal package under xdsclient. But yeah this is only being used for e2e tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyways, I have moved the test metric recorder within e2e tests so that we don't have to export.
@@ -59,6 +59,17 @@ func (c *Channel) Replace(value any) { | |||
} | |||
} | |||
|
|||
// ReceiveOrFail returns the value on the underlying channel and true, or nil | |||
// and false if the channel was empty. | |||
func (c *Channel) ReceiveOrFail() (any, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to do that?
|
||
package xdsclient | ||
|
||
// MetricResourceUpdateValid is a Metric to report a valid resource update from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
// MetricResourceUpdateValid is a Metric to report a valid resource update from | |
// MetricResourceUpdateValid is a metric to report a valid resource update from |
(And below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
r.mu.Lock() | ||
defer r.mu.Unlock() | ||
r.data[metricName] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be ++
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// MetricResourceUpdateValid is a Metric to report a valid resource update from | ||
// the xDS management server for a given resource type. | ||
type MetricResourceUpdateValid struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move metrics into a child package (xdsclient/metrics
)? Or is that too many packages? It would help keep them nicely organized, though:
// Package metrics defines all metrics that can be produced by an xDS client. All calls to
// the MetricsRecorder by the xDS client will contain a struct from this package passed by
// pointer.
package metrics
type ResourceUpdateValid struct {}
type ResourceUpdateInvalid struct {}
type ServerFailure struct {}
Thoughts? @easwars do you have an opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to child package metrics. Seems fine to me.
Or is that too many packages
this was the only reason i initially didn't do it because this was brought up in during design in general.
7afcefa
to
71c41ed
Compare
71c41ed
to
c59d9ac
Compare
// from this package passed by pointer. | ||
package metrics | ||
|
||
// MetricResourceUpdateValid is a metric to report a valid resource update from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can drop the Metric
prefix from these types since they end up being used currently as metrics.MetricResourceUpdateValid
at callsites, which is harded to read compared to metrics.ResourceUpdateValid
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
ba610ed
to
37ef8c4
Compare
37ef8c4
to
839a683
Compare
lis.Stop() | ||
if ctx.Err() != nil { | ||
t.Fatalf("Timeout when waiting for ADS stream to close") | ||
} | ||
// Restart to prevent the attempt to create a new ADS stream after back off. | ||
lis.Restart() | ||
|
||
// Server failure should still have no recording point. | ||
// Server failure should not have emitted. | ||
if err := tmr.waitForMetric(ctx, &metrics.ServerFailure{ServerURI: mgmtServer.Address}); err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to make the test wait for 10 seconds before it succeeds?
Maybe ctx, cancel :=context.WithTimeout(ctx, defaultTestShortTimeout)
first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
e43b227
to
43f2eb8
Compare
RELEASE NOTES: None