OCPBUGS-59763: enforce client side auth requirement for metrics endpoint #684

ankitathomas · 2025-11-07T21:04:20Z

Description of the change:
Add client cert authentication for the metrics endpoint, as per https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#roll-your-own-https-server

This PR secures the Prometheus metrics HTTPS endpoint on port 8081 by requiring and verifying client-side certificates. Client certificates also reduce the load on the Kubernetes apiserver compared with bearer-token-based auth. The client CA bundle is stored under the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap on OpenShift clusters. The PR augments the existing server-side TLS verification with the client cert requirement, so client requests must be made with a certificate signed by the expected client CA.

The metrics server now loads the client CA bundle and requires and verifies client certs. All requests without valid client certs are rejected with `tls: certificate required`.

The configmap controller will monitor the kube-system/extension-apiserver-authentication ConfigMap to rotate (hot-reload) the client CAs on the metrics server's CA certPool for any change to the CAs. The change includes the appropriate rolebinding to allow the marketplace-operator to reconcile this configmap.

Tests verify unauthenticated rejection, authenticated success, CA rotation.
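The unauthenticated-rejection case can be exercised without a cluster. The sketch below is not one of the PR's tests; the function name `unauthenticatedScrape` is hypothetical, and it uses `net/http/httptest` with an empty client CA pool purely to show that a client presenting no certificate gets an error back.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
	"net/http/httptest"
)

// unauthenticatedScrape starts a throwaway TLS server that requires client
// certificates and returns the error seen by a client that presents none.
func unauthenticatedScrape() error {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusOK)
		}))
	// StartTLS keeps these settings and adds the test server certificate.
	srv.TLS = &tls.Config{
		ClientAuth: tls.RequireAndVerifyClientCert,
		ClientCAs:  x509.NewCertPool(),
	}
	srv.StartTLS()
	defer srv.Close()

	// srv.Client() trusts the test server's certificate but sends no
	// client certificate of its own.
	resp, err := srv.Client().Get(srv.URL)
	if err == nil {
		resp.Body.Close()
	}
	return err
}
```

A test would assert that the returned error is non-nil. The authenticated-success and CA-rotation cases would additionally need a generated CA and client certificate, which this sketch leaves out.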

Motivation for the change:
Avoid potential information leaks through scraping by unauthenticated users

Reviewer Checklist

Implementation matches the proposed design, or proposal is updated to match implementation
Sufficient unit test coverage
Sufficient end-to-end test coverage
Docs updated or added to /docs
Commit messages sensible and descriptive

Signed-off-by: Ankita Thomas <[email protected]>

ankitathomas · 2025-11-10T13:43:00Z

/retest

ankitathomas · 2025-11-11T15:12:24Z

pkg/metrics/metrics.go

+			} else {
+				logrus.Warnf("No client CA configured, continuing without client cert verification")
+			}
 			err := httpsServer.ListenAndServeTLS("", "")


While the catalogsource, configmap and operatorhub reconcilers are managed by controller-runtime, the metrics server is a standalone HTTPS server on port 8081, exposed via the marketplace-operator-metrics service. This is the minimum set of changes required to secure this server.

Thanks for the additional context @ankitathomas.

Wdyt about also cleaning up the deployment in this PR to document these ports (and help with service discovery/NetworkPolicy etc)

ports:
- containerPort: 8081
  name: https-metrics
- containerPort: 8383
  name: metrics
- containerPort: 8080
  name: healthz

Right now I see ports that are unused and misleading

ref: https://github.com/operator-framework/operator-marketplace/blob/master/manifests/09_operator.yaml#L52-L56

anik120 · 2025-11-11T16:16:58Z

pkg/controller/configmap/configmap_controller.go

pkg/certificateauthority/clientcaconfigmap.go

anik120 · 2025-11-11T18:30:04Z

pkg/metrics/metrics.go

+				httpsServer.TLSConfig.ClientCAs = clientCAStore.GetCA()
+				httpsServer.TLSConfig.ClientAuth = tls.RequireAndVerifyClientCert
+			} else {
+				logrus.Warnf("No client CA configured, continuing without client cert verification")


This line is never reached, right?

In main.go, I see clientCAStore := ca.NewClientCAStore(x509.NewCertPool()), so clientCAStore is always initialized - it's never nil (so this is dead code)

I see a potentially bigger problem, though. Since the clientCAStore is initialized with an empty pool (which gets filled in once the ConfigMap controller reconciles successfully for the first time), Prometheus scrapes during startup will fail with "unknown certificate authority" until the ConfigMap has had time to reconcile, firing Prometheus alerts.

That isn't a big issue during startup, because eventually things will reconcile and the alerts will die down.

The problem is that every time the pod restarts, the Prometheus alert will fire and will need to be ignored by the admin (so documented by us as "maybe don't worry about it for 5/10 mins and then start worrying", which isn't great UX to begin with).

Wdyt about synchronous initialization of the cert pool to begin with, and then let the ConfigMap reconciler do its job of keeping it updated?

So something like

// main.go - before starting metrics
clientCAStore := ca.NewClientCAStore(x509.NewCertPool())

// Read the CA ConfigMap synchronously before starting server
clientCAConfigMap := &corev1.ConfigMap{}
if err := mgr.GetClient().Get(ctx, types.NamespacedName{
	Name:      configmap.ClientCAConfigMap,
	Namespace: configmap.ClientCANamespace,
}, clientCAConfigMap); err == nil {
	if caPEM, ok := clientCAConfigMap.Data[configmap.ClientCAKey]; ok {
		clientCAStore.Update([]byte(caPEM))
	}
}

// Now start metrics with populated CA store
if err := metrics.ServePrometheus(tlsCertPath, tlsKeyPath, clientCAStore); err != nil {
	logger.Fatalf("failed to serve prometheus metrics: %s", err)
}

The reconcile should trigger immediately when the controller is first created. It can still emit unknown certificate authority issues when running with a cluster with no secret, or in between the metrics server starting and the controllers starting.

The earliest we can populate the certpool would be immediately after the manager gets created, so this will require moving some initializations around.

Provided we're ok with moving the metrics server start to later in the startup, this should be doable.

I don't see a reason we can't move things around if we need to. As long as they don't break existing functionalities we're always free to move things around

This else block is dead code. More importantly, if it ever does reach this else block, then everything this PR is adding is bypassed, rendering this PR's code useless. We don't want this to be nil ever, and if for whatever reason it ever is, we want to error out.

Probably best to return an error from here instead of logging it and continuing.

anik120

Also here's a suggestion for the PR description:

This PR adds mutual TLS (mTLS) client certificate authentication to the metrics HTTPS endpoint. Previously, the metrics endpoint at port 8081 only used server-side TLS (encrypted connection), but didn't verify who was connecting. With this PR, clients are required to present a valid certificate signed by a trusted CA.

Key point: The ServiceMonitor was already configured to send client certificates, but the server wasn't enforcing verification. This PR makes the server actually check those certificates.

Solution overview

Client CA Discovery

OpenShift stores the cluster's client CA bundle in the kube-system/extension-apiserver-authentication ConfigMap
This is the same CA that Prometheus uses to authenticate to other cluster services
The PR adds a new RoleBinding so marketplace-operator can read this ConfigMap

Dynamic CA Management

Creates a ClientCAStore to hold the CA certificate pool in memory
A ConfigMap controller watches kube-system/extension-apiserver-authentication
When the ConfigMap changes, it automatically updates the ClientCAStore
This allows CA rotation without restarting the operator

Metrics Server Enforcement

The HTTPS metrics server now configures:

TLSConfig.ClientCAs = clientCAStore.GetCA()
TLSConfig.ClientAuth = tls.RequireAndVerifyClientCert

Requests without valid client certs are rejected with tls: certificate required

Test Coverage

Tests added to test unauthenticated rejection, authenticated success, and CA rotation

ankitathomas · 2025-11-11T21:02:35Z

#684 (comment)

Considering this is already present on the service manifests, it seems unnecessary to include in this PR.

ankitathomas · 2025-11-11T21:25:53Z

Updated the description.

Signed-off-by: Ankita Thomas <[email protected]>

jianzhangbjz · 2025-11-12T00:54:15Z

/assign @Xia-Zhao-rh

jianzhangbjz · 2025-11-12T01:09:51Z

/retitle OCPBUGS-59763: enforce client side auth requirement for metrics endpoint

openshift-ci-robot · 2025-11-12T01:10:01Z

@ankitathomas: This pull request references Jira Issue OCPBUGS-59763, which is invalid:

expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

anik120 · 2025-11-12T14:24:08Z

#684 (comment)

Considering this is already present on the service manifests, it seems unnecessary to include in this PR.

That's totally fair.....it's just reaaally cathartic to leave a space cleaner than you found it, wouldn't you agree? 😁 🙏🏽

anik120

Looking great so far @ankitathomas, I've left some additional comments

ankitathomas · 2025-11-12T21:52:54Z

#684 (comment)
Considering this is already present on the service manifests, it seems unnecessary to include in this PR.

That's totally fair.....it's just reaaally cathartic to leave a space cleaner than you found it, wouldn't you agree? 😁 🙏🏽

#685

Created an issue so we don't lose track of it

cmd/manager/main.go

anik120 · 2025-11-13T14:01:22Z

pkg/certificateauthority/clientcaconfigmap.go

pkg/controller/configmap/configmap_controller.go

ankitathomas · 2025-11-13T15:36:16Z

#684 (comment)

The unreachable else block was mostly meant for tests, though considering we don't actually have any tests for the metrics server on its own, it's more out of habit than anything specific to the implementation. I can update it to error on a nil CAStore with a comment.

Signed-off-by: Ankita Thomas <[email protected]>

anik120

/lgtm
/approve

openshift-ci · 2025-11-13T16:39:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anik120

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [anik120]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Xia-Zhao-rh · 2025-11-14T08:41:40Z

/verified by @Xia-Zhao-rh

openshift-ci-robot · 2025-11-14T08:41:44Z

@Xia-Zhao-rh: This PR has been marked as verified by @Xia-Zhao-rh.

In response to this:

/verified by @Xia-Zhao-rh

Xia-Zhao-rh · 2025-11-14T08:41:47Z

/retest

anik120 · 2025-11-14T19:07:22Z

/test e2e-gcp

ankitathomas assigned anik120 Nov 7, 2025

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2025

openshift-ci bot requested review from anik120 and kevinrizza November 7, 2025 21:04

enforce client side auth requirement for metrics endpoint

7d48ac7

Signed-off-by: Ankita Thomas <[email protected]>

ankitathomas force-pushed the clientCA branch from 4358598 to 7d48ac7 Compare November 10, 2025 07:45

ankitathomas changed the title ~~WIP: enforce client side auth requirement for metrics endpoint~~ enforce client side auth requirement for metrics endpoint Nov 10, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 10, 2025

vendor update

941656e

Signed-off-by: Ankita Thomas <[email protected]>

ankitathomas requested review from grokspawn and perdasilva November 10, 2025 08:46

ankitathomas assigned ankitathomas and unassigned anik120 Nov 10, 2025

ankitathomas commented Nov 11, 2025

anik120 requested changes Nov 11, 2025

openshift-ci bot assigned anik120 Nov 11, 2025

anik120 requested changes Nov 11, 2025

anik120 reviewed Nov 11, 2025

certpool mutex, variable rename

e76f4f0

Signed-off-by: Ankita Thomas <[email protected]>

ankitathomas requested a review from anik120 November 11, 2025 21:31

openshift-ci bot assigned Xia-Zhao-rh Nov 12, 2025

openshift-ci bot changed the title ~~enforce client side auth requirement for metrics endpoint~~ OCPBUGS-59763: enforce client side auth requirement for metrics endpoint Nov 12, 2025

anik120 reviewed Nov 12, 2025

ankitathomas requested a review from anik120 November 12, 2025 21:53

anik120 reviewed Nov 13, 2025

initialize client ca certpool

3872e87

Signed-off-by: Ankita Thomas <[email protected]>

anik120 approved these changes Nov 13, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 13, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2025

openshift-ci-robot added the verified label Nov 14, 2025
