@ankitathomas
Contributor

@ankitathomas ankitathomas commented Nov 7, 2025

Description of the change:
Add client cert authentication for the metrics endpoint, as per https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#roll-your-own-https-server

This PR secures the Prometheus metrics HTTPS endpoint on port 8081 by requiring and verifying client certificates. Compared with bearer-token auth, client certificates also reduce load on the Kubernetes apiserver. On OpenShift clusters, the client CA bundle is available under the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap. The PR adds the client cert requirement on top of the existing server-side TLS, so every client request must present a certificate signed by the expected client CA.

The metrics server will now load the client CA bundle and require and verify client certs. All requests without valid client certs will be rejected with tls: certificate required.

The configmap controller will monitor the kube-system/extension-apiserver-authentication ConfigMap to rotate (hot-reload) the client CAs on the metrics server's CA certPool for any change to the CAs. The change includes the appropriate rolebinding to allow the marketplace-operator to reconcile this configmap.

Tests verify unauthenticated rejection, authenticated success, CA rotation.
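
A minimal sketch of the unauthenticated-rejection case, assuming a throwaway httptest server rather than the operator's real metrics server: the server requires client certificates and the client presents none. The names requireClientCerts and scrapeWithoutClientCert are hypothetical, and the empty pool only stands in for the cluster's client CA bundle.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireClientCerts (hypothetical helper) turns on mutual TLS for a
// server-side tls.Config.
func requireClientCerts(cfg *tls.Config, clientCAs *x509.CertPool) {
	cfg.ClientCAs = clientCAs
	cfg.ClientAuth = tls.RequireAndVerifyClientCert
}

// scrapeWithoutClientCert starts a test HTTPS server that requires client
// certificates, issues a GET that presents none, and returns the client error.
func scrapeWithoutClientCert() error {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "metrics")
		}))
	srv.TLS = &tls.Config{}
	// Assumption: an empty pool stands in for the real client CA bundle.
	requireClientCerts(srv.TLS, x509.NewCertPool())
	srv.StartTLS()
	defer srv.Close()

	// srv.Client() trusts the test server's certificate but presents no
	// client certificate of its own.
	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}
```

Calling scrapeWithoutClientCert() should return a non-nil error, because the TLS handshake is refused before the handler ever runs.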

Motivation for the change:
Avoid potential information leaks through scraping by unauthenticated users

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2025
@openshift-ci openshift-ci bot requested review from anik120 and kevinrizza November 7, 2025 21:04
@ankitathomas ankitathomas changed the title WIP: enforce client side auth requirement for metrics endpoint enforce client side auth requirement for metrics endpoint Nov 10, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 10, 2025
Signed-off-by: Ankita Thomas <[email protected]>
@ankitathomas
Contributor Author

/retest

} else {
logrus.Warnf("No client CA configured, continuing without client cert verification")
}
err := httpsServer.ListenAndServeTLS("", "")
Contributor Author

While the catalogsource, configmap and operatorhub reconcilers are managed by controller-runtime, the metrics server is a standalone HTTPS server on port 8081, exposed via the marketplace-operator-metrics service. This is the minimum set of changes required to secure this server.

Member

Thanks for the additional context @ankitathomas.

Wdyt about also cleaning up the deployment in this PR to document these ports (which would also help with service discovery, NetworkPolicy, etc.)?

ports:
  - containerPort: 8081
    name: https-metrics
  - containerPort: 8383
    name: metrics
  - containerPort: 8080
    name: healthz

Right now I see ports that are unused and misleading

httpsServer.TLSConfig.ClientCAs = clientCAStore.GetCA()
httpsServer.TLSConfig.ClientAuth = tls.RequireAndVerifyClientCert
} else {
logrus.Warnf("No client CA configured, continuing without client cert verification")
Member

This line is never reached, right?

In main.go, I see clientCAStore := ca.NewClientCAStore(x509.NewCertPool()), so clientCAStore is always initialized. It's never nil, so this is dead code.

I see a potentially bigger problem though. Since the clientCAStore starts out with an empty pool (which only gets filled in once the ConfigMap controller reconciles successfully for the first time), Prometheus scrapes during startup will fail with "unknown certificate authority" until the ConfigMap has had time to reconcile, firing Prometheus alerts.

That isn't a big issue during startup, because eventually things will reconcile and the alerts will die down.

The problem is that every time the pod restarts, the Prometheus alert will fire and will need to be ignored by the admin (so we'd have to document it as "maybe don't worry about it for 5/10 mins and then start worrying", which isn't great UX to begin with).

Wdyt about initializing the cert pool synchronously to begin with, and then letting the ConfigMap reconciler do its job of keeping it updated?

So something like:

// main.go - before starting metrics
clientCAStore := ca.NewClientCAStore(x509.NewCertPool())

// Read the CA ConfigMap synchronously before starting the server.
// Note: the manager's default client reads from a cache that is only
// populated after mgr.Start(), so mgr.GetAPIReader() may be needed here.
clientCAConfigMap := &corev1.ConfigMap{}
if err := mgr.GetClient().Get(ctx, types.NamespacedName{
    Name:      configmap.ClientCAConfigMap,
    Namespace: configmap.ClientCANamespace,
}, clientCAConfigMap); err != nil {
    // Don't fail silently: the reconciler is expected to fill the pool later.
    logger.Warnf("failed to read client CA ConfigMap: %s", err)
} else if caPEM, ok := clientCAConfigMap.Data[configmap.ClientCAKey]; ok {
    clientCAStore.Update([]byte(caPEM))
}

// Now start metrics with the populated CA store
if err := metrics.ServePrometheus(tlsCertPath, tlsKeyPath, clientCAStore); err != nil {
    logger.Fatalf("failed to serve prometheus metrics: %s", err)
}

Contributor Author

The reconcile should trigger immediately when the controller is first created. It can still emit unknown certificate authority errors when running against a cluster with no secret, or in the window between the metrics server starting and the controllers starting.

The earliest we can populate the certpool would be immediately after the manager gets created, so this will require moving some initializations around.

Provided we're ok with moving the metrics server start to later in the startup, this should be doable.

Member

I don't see a reason we can't move things around if we need to. As long as we don't break existing functionality, we're always free to move things around.

Member

@anik120 anik120 left a comment

Also here's a suggestion for the PR description:

This PR adds mutual TLS (mTLS) client certificate authentication to the metrics HTTPS endpoint. Previously, the metrics endpoint at port 8081 only used server-side TLS (encrypted connection), but didn't verify who was connecting. With this PR, clients are required to present a valid certificate signed by a trusted CA.

Key point: The ServiceMonitor was already configured to send client certificates, but the server wasn't enforcing verification. This PR makes the server actually check those certificates.

Solution overview

  1. Client CA Discovery
  • OpenShift stores the cluster's client CA bundle in the kube-system/extension-apiserver-authentication ConfigMap
  • This is the same CA that Prometheus uses to authenticate to other cluster services
  • The PR adds a new RoleBinding so marketplace-operator can read this ConfigMap
  2. Dynamic CA Management
  • Creates a ClientCAStore to hold the CA certificate pool in memory
  • A ConfigMap controller watches kube-system/extension-apiserver-authentication
  • When the ConfigMap changes, it automatically updates the ClientCAStore
  • This allows CA rotation without restarting the operator
  3. Metrics Server Enforcement
  • The HTTPS metrics server now configures:
      TLSConfig.ClientCAs = clientCAStore.GetCA()
      TLSConfig.ClientAuth = tls.RequireAndVerifyClientCert
  • Requests without valid client certs are rejected with tls: certificate required
  4. Test Coverage
  • Tests cover unauthenticated rejection, authenticated success, and CA rotation
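
As an illustration of the client CA discovery step, here is a rough sketch of turning the ConfigMap's data into a certificate pool. It assumes the data arrives as a plain map[string]string (the shape of a ConfigMap's Data field); poolFromConfigMapData is a hypothetical helper, not code from this PR, and the key name is taken from the PR description.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
)

// clientCAKey is the ConfigMap key that the PR description says holds the
// client CA bundle.
const clientCAKey = "client-ca-file"

// poolFromConfigMapData (hypothetical) builds a certificate pool from the
// PEM bundle stored under clientCAKey. It returns an error instead of an
// empty pool so callers cannot silently end up trusting nothing.
func poolFromConfigMapData(data map[string]string) (*x509.CertPool, error) {
	pemBundle, ok := data[clientCAKey]
	if !ok || pemBundle == "" {
		return nil, fmt.Errorf("key %q is missing or empty", clientCAKey)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM([]byte(pemBundle)) {
		return nil, errors.New("no certificates could be parsed from the client CA bundle")
	}
	return pool, nil
}
```

A reconciler could call this on every ConfigMap change and hand the resulting pool to whatever store the metrics server reads from.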

@ankitathomas
Contributor Author

#684 (comment)

Considering this is already present on the service manifests, it seems unnecessary to include in this PR.

@ankitathomas
Contributor Author

Also here's a suggestion for the PR description:

Updated the description.

Signed-off-by: Ankita Thomas <[email protected]>
@ankitathomas ankitathomas requested a review from anik120 November 11, 2025 21:31
@jianzhangbjz
Contributor

/assign @Xia-Zhao-rh

@jianzhangbjz
Contributor

/retitle OCPBUGS-59763: enforce client side auth requirement for metrics endpoint

@openshift-ci openshift-ci bot changed the title enforce client side auth requirement for metrics endpoint OCPBUGS-59763: enforce client side auth requirement for metrics endpoint Nov 12, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 12, 2025
@openshift-ci-robot
Contributor

@ankitathomas: This pull request references Jira Issue OCPBUGS-59763, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@anik120
Member

anik120 commented Nov 12, 2025

#684 (comment)

Considering this is already present on the service manifests, it seems unnecessary to include in this PR.

That's totally fair.....it's just reaaally cathartic to leave a space cleaner than you found it, wouldn't you agree? 😁 🙏🏽

Member

@anik120 anik120 left a comment

Looking great so far @ankitathomas, I've left some additional comments

@ankitathomas
Contributor Author

#684 (comment)
Considering this is already present on the service manifests, it seems unnecessary to include in this PR.

That's totally fair.....it's just reaaally cathartic to leave a space cleaner than you found it, wouldn't you agree? 😁 🙏🏽

#685

Created an issue so we don't lose track of it

@ankitathomas ankitathomas requested a review from anik120 November 12, 2025 21:53
httpsServer.TLSConfig.ClientCAs = clientCAStore.GetCA()
httpsServer.TLSConfig.ClientAuth = tls.RequireAndVerifyClientCert
} else {
logrus.Warnf("No client CA configured, continuing without client cert verification")
Member

This else block is dead code. More importantly, if execution ever does reach this else block, everything this PR adds is bypassed, rendering the PR's code useless. We never want this to be nil, and if for whatever reason it ever is, we want to error out.

Probably best to return an error from here instead of logging and continuing.

@ankitathomas
Contributor Author

#684 (comment)

The unreachable else block was mostly meant for tests, though considering we don't actually have any tests for the metrics server on its own, it's more out of habit than anything specific to the implementation. I can update it to return an error on a nil CAStore, with a comment.

Signed-off-by: Ankita Thomas <[email protected]>
Member

@anik120 anik120 left a comment

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 13, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anik120

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2025
@Xia-Zhao-rh

/verified by @Xia-Zhao-rh

@openshift-ci-robot
Contributor

@Xia-Zhao-rh: This PR has been marked as verified by @Xia-Zhao-rh.

@Xia-Zhao-rh

/retest

@anik120
Member

anik120 commented Nov 14, 2025

/test e2e-gcp

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified

5 participants