xds: add MetricsReporter for generic xds client #8274

Merged

Conversation

purnesh42H
Contributor

@purnesh42H purnesh42H commented Apr 24, 2025

  • Add interfaces for xDS client to report different types of metrics.
  • Use provided metrics reporter to register and record metrics for valid, invalid resource updates and xDS server failures.

RELEASE NOTES: None

@purnesh42H purnesh42H requested a review from dfawley April 24, 2025 19:48
@purnesh42H purnesh42H added the Area: xDS and Type: Feature labels Apr 24, 2025
@purnesh42H purnesh42H added this to the 1.73 Release milestone Apr 24, 2025
@purnesh42H
Contributor Author

@dfawley could you take a look at the MetricsReporter interface proposed here? I have tried to keep it generic enough so that we don't need to add new methods to top level interfaces when adding new metric types and metric handles.

Once we have an agreement on the interface, I can write the tests.

@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 0279ea0 to bd3ecc8 Compare April 24, 2025 19:59
codecov bot commented Apr 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.27%. Comparing base (d00f4ac) to head (43f2eb8).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8274      +/-   ##
==========================================
+ Coverage   82.22%   82.27%   +0.05%     
==========================================
  Files         419      419              
  Lines       41954    41974      +20     
==========================================
+ Hits        34497    34535      +38     
+ Misses       5995     5981      -14     
+ Partials     1462     1458       -4     
Files with missing lines Coverage Δ
xds/internal/clients/xdsclient/authority.go 71.21% <100.00%> (+0.89%) ⬆️
xds/internal/clients/xdsclient/xdsclient.go 80.45% <100.00%> (+2.25%) ⬆️
xds/internal/clients/xdsclient/xdsconfig.go 100.00% <ø> (ø)

... and 23 files with indirect coverage changes


@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from bd3ecc8 to 6fa2987 Compare April 24, 2025 20:56
@@ -121,6 +122,7 @@ type authorityBuildOptions struct {
getChannelForADS xdsChannelForADS // Function to acquire a reference to an xdsChannel
logPrefix string // Prefix for logging
target string // Target for the gRPC Channel that owns xDS Client/Authority
metricsRecorder MetricsRecorder // Metrics recorder for metrics.
Member

Please try to avoid comments that add no value. Similar for "Prefix for logging" when the name of the variable already tells you what it is.

Contributor Author

Removed some unnecessary comments

@@ -363,6 +366,9 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re
// On error, keep previous version of the resource. But update status
// and error.
if uErr.Err != nil {
if a.metricsRecorder != nil {
a.metricsRecorder.Record(xdsClientResourceUpdatesInvalidMetric, int64(1), a.target, serverConfig.ServerIdentifier.ServerURI, rType.TypeName)
Member

This cast should not be needed here.

Contributor Author

Done

@@ -378,6 +384,10 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re
continue
}

if a.metricsRecorder != nil {
a.metricsRecorder.Record(xdsClientResourceUpdatesValidMetric, int64(1), a.target, serverConfig.ServerIdentifier.ServerURI, rType.TypeName)
Member

Same

Contributor Author

Done


// MetricsReporter provides a way for XDSClient to register MetricHandle(s)
// and obtain a MetricsRecorder to record metrics.
type MetricsReporter interface {
Member

This is far more complex than I believe we need for this package. Most of these things here are concerns of OpenTelemetry, not this package. This package can just say what happened. The application can determine based on the outputs of this how to send it to whatever metric system it is using.

E.g.
https://go.dev/play/p/Tk6LKooRIMC

This is if we need to be able to carry information along with each event. If we don't need information in the events, we could just use an enumeration for the events that can occur instead (type Metric int)
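For illustration, the enumeration alternative mentioned at the end of this comment could look roughly like the sketch below. All names here (`Metric`, `countingReporter`, `ReportMetric`) are hypothetical; this is not the content of the playground link and not the code that was merged.

```go
package main

import "fmt"

// Metric is a hypothetical enumeration of events an xDS client could report
// when no extra data needs to travel with each event.
type Metric int

const (
	ResourceUpdateValid Metric = iota
	ResourceUpdateInvalid
	ServerFailure
)

// String makes the events readable in logs and test failures.
func (m Metric) String() string {
	switch m {
	case ResourceUpdateValid:
		return "ResourceUpdateValid"
	case ResourceUpdateInvalid:
		return "ResourceUpdateInvalid"
	case ServerFailure:
		return "ServerFailure"
	default:
		return fmt.Sprintf("Metric(%d)", int(m))
	}
}

// countingReporter is a hypothetical application-side sink. The client only
// says what happened; the application decides how to export it.
// It is not safe for concurrent use; a real one would need a mutex.
type countingReporter struct {
	counts map[Metric]int
}

// ReportMetric counts one occurrence of the given event.
func (r *countingReporter) ReportMetric(m Metric) { r.counts[m]++ }

func main() {
	r := &countingReporter{counts: make(map[Metric]int)}
	r.ReportMetric(ResourceUpdateValid)
	r.ReportMetric(ResourceUpdateValid)
	r.ReportMetric(ServerFailure)
	for m, n := range r.counts {
		fmt.Printf("%v: %d\n", m, n)
	}
}
```

An enumeration like this cannot carry labels such as the server URI, which is why the struct-per-event shape is the one to use when events need data.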

Member

Also, we should maybe share one Metrics definition between all the clients?

Contributor Author

@purnesh42H purnesh42H Apr 25, 2025

I have simplified this to a single MetricsReporter that records metrics, as suggested above. I have removed the registration part, which means the MetricsReporter implementation needs to dynamically consume the reported metric and make sense of it through type assertions, etc. So it's going to be the same for all clients. I added a method to the Metric interface to return the target, because we can't have empty interfaces and the target should be common to all types of metrics. Right now, for the generic client, it's hardcoded to "xds-client".

I have created a test implementation which is similar to internal/testutils/stats/test_metrics_recorder.go, with the difference being that the MetricsReporter implementation adds the labels, description, name, etc. We can do the same with the internal metrics recorder that is used for grpc.

Contributor Author

I have copied the metrics test and modified it to use this new MetricsReporter implementation.

@dfawley dfawley assigned purnesh42H and unassigned dfawley Apr 24, 2025
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 49961d8 to 7114106 Compare April 25, 2025 18:23
@purnesh42H purnesh42H requested a review from dfawley April 25, 2025 18:24
@purnesh42H purnesh42H assigned dfawley and unassigned purnesh42H Apr 25, 2025
Comment on lines 21 to 22
// Metric is type of metric to be reported by XDSClient.
type Metric interface {
Member

Can we move this definition and the MetricsReporter to clients instead? Individual metrics could be declared in each of the xds/lrs clients, but the base definition could be shared.

Contributor Author

Done


// Metric is type of metric to be reported by XDSClient.
type Metric interface {
Target() string
Member

What is the source of this information? Isn't it passed to the xdsclient when it's created? If so, I don't think there's any need to give it back to the application now. The MetricsRecorder can be instrumented with the same knowledge.

Actually maybe we can just remove this type, even, and make ReportMetric take an any instead. That's probably sufficient. It can help with documentation to have things like this have their own type, but maybe that's not a concern.

Contributor Author

Yeah, I was also contemplating making it any, because vet was complaining about a non-proto file having an interface without methods. And yes, target is currently hardcoded to "xds-client" by us. I think once we have an interface for the logger, we can accept a user-provided target there?

Member

Oh I thought "target" would be the channel's target string. It seems to have no value, then, if it's hard-coded.

What does logging have to do with this?

Contributor Author

@purnesh42H purnesh42H Apr 29, 2025

Yeah, for the xds resolver it is the target string, but for an xds-enabled server it is the constant "#server" as per gRFC A71. In the internal client, we pass it as a param to the client and mainly use it for refcounting client instances. In the external client, we don't have any use case for it except maybe wrapping it in the log messages. So we can still accept it as a param for the client via the client config, or maybe just as a prefix in the logger? We can have a prefix method in the logger interface.

// the xDS management server for a given resource type.
type MetricResourceUpdateValid struct {
ServerURI string // ServerURI of the xDS management server.
Incr int64 // Count to be incremented.
Member

Do we ever see this being anything but "1"?

Contributor Author

@purnesh42H purnesh42H Apr 28, 2025

Right now it's only being incremented by 1, but having it as a parameter makes it explicit to the user what to do with it. Otherwise, we can document that it reports a single new update.

Member

I think we can just delete it.

@purnesh42H purnesh42H requested a review from dfawley April 28, 2025 10:21
// MetricsReporter is used by the XDSClient to report metrics. Metric can be of
// any type.
//
// For example: see xdsclient/internal/test_metrics_reporter.go
Member

Documentation shouldn't link to internal test code.

If an example is needed, we can add a clients_examples_test.go and put an example there. Otherwise, just text documentation should normally be fine.

Contributor Author

I have updated the documentation to be close to what you suggested below. I was thinking of an example that type-asserts and prints the metric, but was not very convinced about adding it in the docstring. I think with the docstring sentences as they are now, users should be able to figure out how to use it.

Comment on lines 104 to 105
// MetricsReporter is used by the XDSClient to report metrics. Metric can be of
// any type.
Member

Metric can be of any type.

Not really. The metric will be one of a predefined set of types depending on the client. This should explain that each client will produce different metrics and to see that client's documentation for a list of possible metrics events.

Contributor Author

Updated. Thanks.

@@ -59,6 +59,17 @@ func (c *Channel) Replace(value any) {
}
}

// ReceiveOrFail returns the value on the underlying channel and true, or nil
// and false if the channel was empty.
func (c *Channel) ReceiveOrFail() (any, bool) {
Member

What is this for? It's called once, and its return value isn't used.

Contributor Author

@purnesh42H purnesh42H Apr 29, 2025

This drains the existing value from the channel, if present, before we send the new value. It is similar to what is being done in the test metrics recorder in the stats package.

Member

Why do we want to do that?

Contributor Author

Just convenience, I believe. For tests, we usually want to verify the metric that was reported most recently, even though multiple metrics are reported throughout a test. So draining the channel before reporting a new value makes it easier by keeping only the metric of immediate interest.

For example, if we are looking for a metric that should not be reported, we only have to check the first value instead of all values in the channel.

I have set the channel size to 1 to make this more obvious.

Member

I don't think I agree with this testing philosophy. We shouldn't blindly throw away outputs at any step. If the test needs to say "ignore everything else, I need this metric" then it should do:

loop: for {
	select {
	case got := <-metricsChan:
		if got == want {
			break loop // success
		}
	case <-ctx.Done():
		t.Fatalf("timed out waiting for metric %v", want)
	}
}

And if we need we can put that in a helper function.

Contributor Author

@purnesh42H purnesh42H Apr 30, 2025

Actually, I have just removed ReceiveOrFail. For the tests we have right now, we are anyway checking for one metric at a time after it's reported.

We should probably see if we can do something better in the stats one.

@@ -363,6 +367,11 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re
// On error, keep previous version of the resource. But update status
// and error.
if uErr.Err != nil {
if a.metricsReporter != nil {
a.metricsReporter.ReportMetric(MetricResourceUpdateInvalid{
Member

Let's pass all of these by pointer instead.
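Passing events by pointer means that a reporter's type switch has to match pointer types. The sketch below shows what an application-side reporter might look like under that assumption. The struct types are local stand-ins declared only for this example, the `ResourceType` field and the counter names are hypothetical, and `countingReporter` is not part of the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// Local stand-ins for the event structs discussed in this review.
type ResourceUpdateValid struct{ ServerURI, ResourceType string }
type ResourceUpdateInvalid struct{ ServerURI, ResourceType string }
type ServerFailure struct{ ServerURI string }

// countingReporter is a hypothetical reporter that maps each event type to a
// named counter, the way an application might before exporting to its own
// metrics system.
type countingReporter struct {
	mu     sync.Mutex
	counts map[string]int64
}

func newCountingReporter() *countingReporter {
	return &countingReporter{counts: make(map[string]int64)}
}

// ReportMetric accepts any value and counts the events it recognizes.
// Events arrive as pointers, so the cases are pointer types.
func (r *countingReporter) ReportMetric(m any) {
	var name string
	switch m.(type) {
	case *ResourceUpdateValid:
		name = "resource_updates_valid"
	case *ResourceUpdateInvalid:
		name = "resource_updates_invalid"
	case *ServerFailure:
		name = "server_failure"
	default:
		return // Unknown event types are ignored in this sketch.
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[name]++
}

// count returns the current value of the named counter.
func (r *countingReporter) count(name string) int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.counts[name]
}

func main() {
	r := newCountingReporter()
	r.ReportMetric(&ResourceUpdateValid{ServerURI: "xds.example.com:443", ResourceType: "listener"})
	r.ReportMetric(&ServerFailure{ServerURI: "xds.example.com:443"})
	r.ReportMetric(ServerFailure{}) // A value, not a pointer: falls through to default.
	fmt.Println(r.count("resource_updates_valid"), r.count("server_failure"))
}
```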

Contributor Author

Done

// MetricResourceUpdateInvalid is a Metric to report invalid resource updates
// from the xDS management server for a given resource type.
type MetricResourceUpdateInvalid struct {
Target string // Target of the metric.
Member

I don't think we need this then?

Contributor Author

I have removed Target and Incr

// recording events on channels and provides helpers to check if certain events
// have taken place. It also persists metrics data keyed on the metrics
// type.
type TestMetricsReporter struct {
Member

We should not export anything defined in test files. Nothing can import them anyway.

Contributor Author

@purnesh42H purnesh42H Apr 29, 2025

This is not a test file? It is in an internal package under xdsclient. But yeah, it is only being used for e2e tests.

Contributor Author

Anyway, I have moved the test metrics recorder into the e2e tests so that we don't have to export it.

@purnesh42H purnesh42H requested a review from dfawley April 29, 2025 06:43

package xdsclient

// MetricResourceUpdateValid is a Metric to report a valid resource update from
Member

Nit:

Suggested change
// MetricResourceUpdateValid is a Metric to report a valid resource update from
// MetricResourceUpdateValid is a metric to report a valid resource update from

(And below)

Contributor Author

Done


r.mu.Lock()
defer r.mu.Unlock()
r.data[metricName] = 1
Member

Shouldn't this be ++?

Contributor Author

Done


// MetricResourceUpdateValid is a Metric to report a valid resource update from
// the xDS management server for a given resource type.
type MetricResourceUpdateValid struct {
Member

Maybe we should move metrics into a child package (xdsclient/metrics)? Or is that too many packages? It would help keep them nicely organized, though:

// Package metrics defines all metrics that can be produced by an xDS client.  All calls to
// the MetricsRecorder by the xDS client will contain a struct from this package passed by
// pointer.
package metrics

type ResourceUpdateValid struct {}
type ResourceUpdateInvalid struct {}
type ServerFailure struct {}

Thoughts? @easwars do you have an opinion?

Contributor Author

@purnesh42H purnesh42H Apr 30, 2025

Moved to child package metrics. Seems fine to me.

Or is that too many packages

This was the only reason I initially didn't do it, because this was brought up during design in general.

@dfawley dfawley assigned purnesh42H and unassigned dfawley Apr 29, 2025
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 7afcefa to 71c41ed Compare April 30, 2025 13:44
@purnesh42H purnesh42H assigned dfawley and unassigned purnesh42H Apr 30, 2025
@purnesh42H purnesh42H requested a review from dfawley April 30, 2025 13:46
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 71c41ed to c59d9ac Compare April 30, 2025 13:58
// from this package passed by pointer.
package metrics

// MetricResourceUpdateValid is a metric to report a valid resource update from
Contributor

I think we can drop the Metric prefix from these types, since they currently end up being used as metrics.MetricResourceUpdateValid at call sites, which is harder to read than metrics.ResourceUpdateValid.

Contributor Author

Done.

@dfawley dfawley assigned purnesh42H and unassigned easwars and dfawley May 7, 2025
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from ba610ed to 37ef8c4 Compare May 7, 2025 16:10
@purnesh42H purnesh42H requested a review from dfawley May 7, 2025 16:13
@purnesh42H purnesh42H assigned dfawley and unassigned purnesh42H May 7, 2025
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 37ef8c4 to 839a683 Compare May 7, 2025 18:25
lis.Stop()
if ctx.Err() != nil {
t.Fatalf("Timeout when waiting for ADS stream to close")
}
// Restart to prevent the attempt to create a new ADS stream after back off.
lis.Restart()

// Server failure should still have no recording point.
// Server failure should not have emitted.
if err := tmr.waitForMetric(ctx, &metrics.ServerFailure{ServerURI: mgmtServer.Address}); err == nil {
Member

This is going to make the test wait for 10 seconds before it succeeds?

Maybe ctx, cancel := context.WithTimeout(ctx, defaultTestShortTimeout) first?

Contributor Author

Done

@purnesh42H purnesh42H requested a review from dfawley May 8, 2025 07:00
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from e43b227 to 43f2eb8 Compare May 8, 2025 10:13
@purnesh42H purnesh42H merged commit b3d63b1 into grpc:master May 8, 2025
23 of 24 checks passed
vinothkumarr227 pushed a commit to vinothkumarr227/grpc-go that referenced this pull request May 26, 2025