-
Notifications
You must be signed in to change notification settings - Fork 119
OCPBUGS-59763: enforce client side auth requirement for metrics endpoint #684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Ankita Thomas <[email protected]>
4358598 to
7d48ac7
Compare
Signed-off-by: Ankita Thomas <[email protected]>
|
/retest |
| } else { | ||
| logrus.Warnf("No client CA configured, continuing without client cert verification") | ||
| } | ||
| err := httpsServer.ListenAndServeTLS("", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the catalogsource, configmap and operatorhub reconcilers are managed by controller-runtime, the metrics server is a standalone HTTPS server on port 8081 exposed via the marketplace-operator-metrics service. This is the minimum set of changes required to secure this server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional context @ankitathomas.
Wdyt about also cleaning up the deployment in this PR to document these ports (and help with service discovery/NetworkPolicy etc)
ports:
- containerPort: 8081
name: https-metrics
- containerPort: 8383
name: metrics
- containerPort: 8080
name: healthz
Right now I see ports that are unused and misleading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } else { | ||
| logrus.Warnf("No client CA configured, continuing without client cert verification") | ||
| } | ||
| err := httpsServer.ListenAndServeTLS("", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional context @ankitathomas.
Wdyt about also cleaning up the deployment in this PR to document these ports (and help with service discovery/NetworkPolicy etc)
ports:
- containerPort: 8081
name: https-metrics
- containerPort: 8383
name: metrics
- containerPort: 8080
name: healthz
Right now I see ports that are unused and misleading
pkg/metrics/metrics.go
Outdated
| httpsServer.TLSConfig.ClientCAs = clientCAStore.GetCA() | ||
| httpsServer.TLSConfig.ClientAuth = tls.RequireAndVerifyClientCert | ||
| } else { | ||
| logrus.Warnf("No client CA configured, continuing without client cert verification") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is never reached right?
In main.go, I see clientCAStore := ca.NewClientCAStore(x509.NewCertPool()), so clientCAStore is always initialized - it's never nil (so this is dead code)
I see a potential bigger problem though. Since the clientCAStore is initialized with empty pool initially (that gets filled in once the ConfigMap controller reconciles successfully the first time), the prometheus scrapes during startup will fail with "unknown certificate authority" until the ConfigMap has had the time to reconcile. (firing prometheus alerts)
Which isn't a big issue during startup, coz eventually things will reconcile and alerts will die down.
Problem is every time pod restarts, the prometheus alert will fire, and will need to be ignored by the admin (so documented by us as "maybe don't worry about it for 5/10 mins and then start worrying" - which isn't great UX to begin with).
Wdyt about synchronous initialization of the cert pool to begin with, and then let the ConfigMap reconciler do its job of keeping it updated?
So something like
// main.go - before starting metrics
clientCAStore := ca.NewClientCAStore(x509.NewCertPool())
// Read the CA ConfigMap synchronously before starting server
clientCAConfigMap := &corev1.ConfigMap{}
if err := mgr.GetClient().Get(ctx, types.NamespacedName{
Name: configmap.ClientCAConfigMap,
Namespace: configmap.ClientCANamespace,
}, clientCAConfigMap); err == nil {
if caPEM, ok := clientCAConfigMap.Data[configmap.ClientCAKey]; ok {
clientCAStore.Update([]byte(caPEM))
}
}
// Now start metrics with populated CA store
if err := metrics.ServePrometheus(tlsCertPath, tlsKeyPath, clientCAStore); err != nil {
logger.Fatalf("failed to serve prometheus metrics: %s", err)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reconcile should trigger immediately when the controller is first created. It can still emit unknown certificate authority issues when running with a cluster with no secret, or in between the metrics server starting and the controllers starting.
The earliest we can populate the certpool would be immediately after the manager gets created, so this will require moving some initializations around.
Provided we're ok with moving the metrics server start to later in the startup, this should be doable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a reason we can't move things around if we need to. As long as they don't break existing functionalities we're always free to move things around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else block is dead code (but more importantly, if it ever does reach this else block, then everything this PR is adding is bypassed, rendering this PR's code not useful. We don't want this to be nil ever, and if for whatever reason if it ever is, we want to error out.
Probably best to return and error from here instead of logging it and continuing
anik120
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here's a suggestion for the PR description:
This PR adds mutual TLS (mTLS) client certificate authentication to the metrics HTTPS endpoint. Previously, the metrics endpoint at port 8081 only used server-side TLS (encrypted connection), but didn't verify who was connecting. With this PR, clients are required to present a valid certificate signed by a trusted CA.
Key point: The ServiceMonitor was already configured to send client certificates, but the server wasn't enforcing verification. This PR makes the server actually check those certificates.
Solution overview
- Client CA Discovery
- Openshift stores the cluster's client CA bundle in
kube-system/extension-apiserver-authenticationConfigMap - This is the same CA that Prometheus uses to authenticate to other cluster services
- The PR adds a new RoleBinding so
marketplace-operatorcan read this ConfigMap
- Dynamic CA Management
- Creates a ClientCAStore to hold the CA certificate pool in memory
- A ConfigMap controller watches kube-system/extension-apiserver-authentication
- When the ConfigMap changes, it automatically updates the ClientCAStore
- This allows CA rotation without restarting the operator
- Metrics Server Enforcement
- The HTTPS metrics server now configures:
TLSConfig.ClientCAs = clientCAStore.GetCA()
TLSConfig.ClientAuth = tls.RequireAndVerifyClientCert
- Requests without valid client certs are rejected with
tls: certificate required
- Test Coverage
- Tests added to test unauthenticated rejection, authenticated success, and CA rotation
|
Considering this is already present on the service manifests, it seems unnecessary to include in this PR. |
Updated the description. |
Signed-off-by: Ankita Thomas <[email protected]>
|
/assign @Xia-Zhao-rh |
|
/retitle OCPBUGS-59763: enforce client side auth requirement for metrics endpoint |
|
@ankitathomas: This pull request references Jira Issue OCPBUGS-59763, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
That's totally fair.....it's just reaaally cathartic to leave a space cleaner than you found it, wouldn't you agree? 😁 🙏🏽 |
anik120
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great so far @ankitathomas, I've left some additional comments
Created an issue so we don't lose track of it |
pkg/metrics/metrics.go
Outdated
| httpsServer.TLSConfig.ClientCAs = clientCAStore.GetCA() | ||
| httpsServer.TLSConfig.ClientAuth = tls.RequireAndVerifyClientCert | ||
| } else { | ||
| logrus.Warnf("No client CA configured, continuing without client cert verification") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else block is dead code (but more importantly, if it ever does reach this else block, then everything this PR is adding is bypassed, rendering this PR's code not useful. We don't want this to be nil ever, and if for whatever reason if it ever is, we want to error out.
Probably best to return and error from here instead of logging it and continuing
|
The unreachable else block was mostly meant for tests, though considering we don't actually have any tests for the metrics server on its own, its more out of habit than anything specific to the implementation. I can update it to error on a nil CAStore with a comment. |
Signed-off-by: Ankita Thomas <[email protected]>
anik120
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: anik120 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified by @Xia-Zhao-rh |
|
@Xia-Zhao-rh: This PR has been marked as verified by In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
|
/test e2e-gcp |
Description of the change:
Add client cert authentication for the metrics endpoint, as per https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#roll-your-own-https-server
This PR secures the prometheus metrics HTTPS endpoint at port 8081 by authenticating the requestor through requiring and verifying client side certificates. Client side certificates also reduce the load on the kubernetes apiserver when compared to using bearer token based auth. The client CA bundle is present at the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap on OpenShift clusters. The PR augments the existing server-side TLS verification with the client cert requirement, enforcing the client requests to be made with a certificate signed by the expected client CA.
The metrics server will now include the client CA bundle and require and verify client certs. All requests without valid client certs will be rejected with
tls: certificate requiredThe configmap controller will monitor the kube-system/extension-apiserver-authentication ConfigMap to rotate (hot-reload) the client CAs on the metrics server's CA certPool for any change to the CAs. The change includes the appropriate rolebinding to allow the marketplace-operator to reconcile this configmap.
Tests verify unauthenticated rejection, authenticated success, CA rotation.
Motivation for the change:
Avoid potential information leaks through scraping by unauthenticated users
Reviewer Checklist
/docs