Draft: Add check for watch_namespace before mutating Pod #2666

Closed

Conversation

janario
Contributor

@janario janario commented Feb 26, 2024

Description:

We have a scenario where, in some environments (e.g. dev), the same helm chart is deployed across many different personal namespaces. While we keep the exact same helm chart config everywhere, we were using the WATCH_NAMESPACE env var to restrict the operator to the few namespaces that were relevant.

However, we noticed that the operator was still trying to get resources from other namespaces and failing, because those namespaces were not in the watched list (cache key). This left the pod with instrumentation that seemed incomplete.

Stack message:

{"level":"error","ts":"2024-02-23T07:19:40Z","msg":"failed to get replicaset","replicaset":"...","namespace":"..","error":"unable to get:.../... because of unknown namespace for the cache","stacktrace":"github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*sdkInjector).addParentResourceLabels
pkg/instrumentation/sdk.go:481
github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*sdkInjector).createResourceMap
pkg/instrumentation/sdk.go:448
github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*sdkInjector).injectCommonSDKConfig
pkg/instrumentation/sdk.go:255
github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*sdkInjector).inject
pkg/instrumentation/sdk.go:74
github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*instPodMutator).Mutate
pkg/instrumentation/podmutator.go:360
github.com/open-telemetry/opentelemetry-operator/internal/webhook/podmutation.(*podMutationWebhook).Handle
internal/webhook/podmutation/webhookhandler.go:92
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/webhook/admission/webhook.go:169
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/webhook/admission/http.go:119
sigs.k8s.io/controller-runtime/pkg/webhook/internal/metrics.InstrumentedHook.InstrumentHandlerInFlight.func1
/home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:60
net/http.HandlerFunc.ServeHTTP
/opt/hostedtoolcache/go/1.21.6/x64/src/net/http/server.go:2136
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1
/home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:147
net/http.HandlerFunc.ServeHTTP
/opt/hostedtoolcache/go/1.21.6/x64/src/net/http/server.go:2136
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2
/home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:109
net/http.HandlerFunc.ServeHTTP
/opt/hostedtoolcache/go/1.21.6/x64/src/net/http/server.go:2136
net/http.(*ServeMux).ServeHTTP
/opt/hostedtoolcache/go/1.21.6/x64/src/net/http/server.go:2514
net/http.serverHandler.ServeHTTP
/opt/hostedtoolcache/go/1.21.6/x64/src/net/http/server.go:2938
net/http.(*conn).serve
/opt/hostedtoolcache/go/1.21.6/x64/src/net/http/server.go:2009"}

This PR adds a check against the watched namespaces before trying to mutate pods.
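As a rough illustration of the kind of guard described above (not the code in this PR), the check could look something like the sketch below. The helper names `parseWatchNamespaces` and `shouldMutate` are hypothetical, and the sketch assumes the watch list is a comma-separated string where an empty value means "all namespaces".

```go
package main

import (
	"fmt"
	"strings"
)

// parseWatchNamespaces splits a comma-separated WATCH_NAMESPACE-style value
// into a set. An empty value yields an empty set, which this sketch treats
// as "watch every namespace" (an assumption).
func parseWatchNamespaces(raw string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, ns := range strings.Split(raw, ",") {
		if ns = strings.TrimSpace(ns); ns != "" {
			set[ns] = struct{}{}
		}
	}
	return set
}

// shouldMutate reports whether a pod in the given namespace should be
// handed to the mutators. Hypothetical guard, not the operator's code.
func shouldMutate(watched map[string]struct{}, namespace string) bool {
	if len(watched) == 0 {
		return true // no restriction configured
	}
	_, ok := watched[namespace]
	return ok
}

func main() {
	watched := parseWatchNamespaces("watch-ns, team-a")
	for _, ns := range []string{"watch-ns", "not-watch-ns"} {
		fmt.Printf("%s: mutate=%v\n", ns, shouldMutate(watched, ns))
	}
}
```

With a guard like this in front of the mutators, a pod created in a namespace outside the list would be returned unchanged instead of being partially instrumented.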

Link to tracking Issue(s):

Testing:

Documentation:

@janario janario requested a review from a team February 26, 2024 08:50
@janario
Contributor Author

janario commented Feb 26, 2024

btw I'm not a specialist in golang, so any suggestions/comments are super welcome ;-)

@yuriolisa
Contributor

@janario, thank you for raising this topic, but could you please open an issue with the manifest you used to deploy the operator? So, we could do a triage and then proceed with the PR review.

@janario
Contributor Author

janario commented Feb 26, 2024

> @janario, thank you for raising this topic, but could you please open an issue with the manifest you used to deploy the operator? So, we could do a triage and then proceed with the PR review.

Perfect, issue created #2668

@yuriolisa
Contributor

Please check the failing CI jobs.

@janario janario changed the title Add check for watch_namespace before mutating Pod Draft: Add check for watch_namespace before mutating Pod Mar 6, 2024
@janario
Contributor Author

janario commented Mar 6, 2024

Marking this as a draft. I plan to get back to this and a few other issues soon, after some internal migration.

@janario janario force-pushed the feature/watch-namespace-mutator branch from 255ea3d to 440ca38 Compare March 26, 2024 06:33
@janario janario force-pushed the feature/watch-namespace-mutator branch from 440ca38 to 87dba94 Compare April 4, 2024 20:47
@janario
Contributor Author

janario commented Apr 5, 2024

Added an e2e case. It's not perfect, but it better illustrates the issue we have.

Scenario:


Case 1: Deploy in watch-ns, which is watched by the operator
It mutates the pod as expected ✅


Case 2: Deploy in not-watch-ns, which is not watched by the operator
It still tries to mutate the pod but ends up with something "incomplete" 🙅

When I filter the mutators list, it no longer tries to change the pod.

Not sure if this is the best approach, since such pods were already expected to be filtered out.

Test log:

=== NAME  chainsaw/not-watch-ns
    | 12:44:38 | not-watch-ns | step-01  | ASSERT    | ERROR | v1/Pod @ not-watch-ns/*
        === ERROR
        --------------------------------------------
        v1/Pod/not-watch-ns/my-deploy-ccf8d9b9-c6qt2
        --------------------------------------------
        * spec.(initContainers == null): Invalid value: false: Expected value: true
        * spec.containers[0].(env == null): Invalid value: false: Expected value: true

        --- expected
        +++ actual
        @@ -5,11 +5,56 @@
             instrumentation.opentelemetry.io/inject-java: opentelemetry-operator-system/deployment
           labels:
             app: my-deploy
        +  name: my-deploy-ccf8d9b9-c6qt2
           namespace: not-watch-ns
        +  ownerReferences:
        +  - apiVersion: apps/v1
        +    blockOwnerDeletion: true
        +    controller: true
        +    kind: ReplicaSet
        +    name: my-deploy-ccf8d9b9
        +    uid: d2b331c0-a830-4095-a8e9-5ac9f260ec76
         spec:
        -  (initContainers == null): true
        -  (length(containers)): 1
           containers:
        -  - (env == null): true
        +  - env:
        +    - name: OTEL_NODE_IP
        +      valueFrom:
        +        fieldRef:
        +          apiVersion: v1
        +          fieldPath: status.hostIP
        +    - name: OTEL_POD_IP
        +      valueFrom:
        +        fieldRef:
        +          apiVersion: v1
        +          fieldPath: status.podIP
        +    - name: JAVA_TOOL_OPTIONS
        +      value: ' -javaagent:/otel-auto-instrumentation-java/javaagent.jar'
        +    - name: OTEL_SERVICE_NAME
        +      value: my-deploy-ccf8d9b9
        +    - name: OTEL_EXPORTER_OTLP_ENDPOINT
        +      value: http://deployment-collector.opentelemetry-operator-system:4317
        +    - name: OTEL_RESOURCE_ATTRIBUTES_POD_NAME
        +      valueFrom:
        +        fieldRef:
        +          apiVersion: v1
        +          fieldPath: metadata.name
        +    - name: OTEL_RESOURCE_ATTRIBUTES_NODE_NAME
        +      valueFrom:
        +        fieldRef:
        +          apiVersion: v1
        +          fieldPath: spec.nodeName
        +    - name: OTEL_RESOURCE_ATTRIBUTES
        +      value: k8s.container.name=myapp,k8s.namespace.name=not-watch-ns,k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),k8s.replicaset.name=my-deploy-ccf8d9b9,service.version=main
        +    image: ghcr.io/open-telemetry/opentelemetry-operator/e2e-test-app-java:main
        +    imagePullPolicy: IfNotPresent
             name: myapp
        +    resources: {}
        +    terminationMessagePath: /dev/termination-log
        +    terminationMessagePolicy: File
        +    volumeMounts:
        +    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        +      name: kube-api-access-hwzvw
        +      readOnly: true
        +    - mountPath: /otel-auto-instrumentation-java
        +      name: opentelemetry-auto-instrumentation-java

Operator error log:

{"level":"error","ts":"2024-04-05T10:43:38.448544859Z","msg":"failed to get replicaset","replicaset":"my-deploy-ccf8d9b9","namespace":"not-watch-ns","error":"unable to get: not-watch-ns/my-deploy-ccf8d9b9 because of unknown namespace for the cache","stacktrace":"github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*sdkInjector).addParentResourceLabels
	/Users/joliveira/Projects/open/github/open-telemetry/opentelemetry-operator/pkg/instrumentation/sdk.go:510
github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*sdkInjector).createResourceMap
	/Users/joliveira/Projects/open/github/open-telemetry/opentelemetry-operator/pkg/instrumentation/sdk.go:477
github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*sdkInjector).injectCommonSDKConfig
	/Users/joliveira/Projects/open/github/open-telemetry/opentelemetry-operator/pkg/instrumentation/sdk.go:281
github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*sdkInjector).inject
	/Users/joliveira/Projects/open/github/open-telemetry/opentelemetry-operator/pkg/instrumentation/sdk.go:75
github.com/open-telemetry/opentelemetry-operator/pkg/instrumentation.(*instPodMutator).Mutate
	/Users/joliveira/Projects/open/github/open-telemetry/opentelemetry-operator/pkg/instrumentation/podmutator.go:363
github.com/open-telemetry/opentelemetry-operator/internal/webhook/podmutation.(*podMutationWebhook).Handle
	/Users/joliveira/Projects/open/github/open-telemetry/opentelemetry-operator/internal/webhook/podmutation/webhookhandler.go:96
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle
	/Users/joliveira/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/webhook/admission/webhook.go:169
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP
	/Users/joliveira/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/webhook/admission/http.go:119
sigs.k8s.io/controller-runtime/pkg/webhook/internal/metrics.InstrumentedHook.InstrumentHandlerInFlight.func1
	/Users/joliveira/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:60
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2136
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1
	/Users/joliveira/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:147
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2136
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2
	/Users/joliveira/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:109
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2136
net/http.(*ServeMux).ServeHTTP
	/usr/local/go/src/net/http/server.go:2514
net/http.serverHandler.ServeHTTP
	/usr/local/go/src/net/http/server.go:2938
net/http.(*conn).serve
	/usr/local/go/src/net/http/server.go:2009"}

Executed with:

make prepare-e2e-watch-namespace
make e2e-watch-namespace

janario added 3 commits April 5, 2024 13:52
Signed-off-by: Janario Oliveira <[email protected]>
Signed-off-by: Janario Oliveira <[email protected]>
@janario janario force-pushed the feature/watch-namespace-mutator branch from e475580 to 8a1cca1 Compare April 5, 2024 11:54
@janario
Contributor Author

janario commented Apr 6, 2024

I dug further into this.

I've added some log lines to the other CRDs like OpenTelemetryCollector.

  • The webhook is always triggered for all namespaces, independent of the watch namespace list
  • The controller (opentelemetrycollector_controller.go) is not, since the namespace was not listed

So I tried to understand how other operators do this kind of filtering.

It turns out that the ones I've checked use the namespaceSelector in the MutatingWebhookConfiguration.

https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-namespaceselector

So when I try:

  namespaceSelector:
    matchLabels:
      otel-injection: enabled

It works as I expected: the webhook is only called for namespaces with the label.
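To make the matching rule concrete, here is a simplified model of how a `matchLabels`-only selector like the one above decides whether a namespace is in scope. This is a sketch for illustration, not the API server's implementation: `matchesSelector` is a hypothetical helper, `matchExpressions` are not modeled, and the namespace labels are made up.

```go
package main

import "fmt"

// matchesSelector models a namespaceSelector that only uses matchLabels:
// every selector key must be present on the namespace with exactly the
// same value. An empty selector matches every namespace.
func matchesSelector(matchLabels, nsLabels map[string]string) bool {
	for key, want := range matchLabels {
		if got, ok := nsLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// namespace is a minimal stand-in for a Kubernetes Namespace object.
type namespace struct {
	name   string
	labels map[string]string
}

func main() {
	selector := map[string]string{"otel-injection": "enabled"}

	// Example namespaces; the labels are assumptions for this sketch.
	namespaces := []namespace{
		{name: "watch-ns", labels: map[string]string{"otel-injection": "enabled"}},
		{name: "not-watch-ns", labels: map[string]string{}},
	}

	for _, ns := range namespaces {
		fmt.Printf("%s: webhook called=%v\n", ns.name, matchesSelector(selector, ns.labels))
	}
}
```

The point of the sketch is that the decision is made per namespace from its labels, before the operator's webhook is ever called.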


I wasn't that familiar with the MutatingWebhookConfiguration and how to limit its webhook scope.

I can achieve that in the helm chart, so I'm fine with it.
But it raises some questions:

Do we need the WATCH_NAMESPACE env var?

Shouldn't we point users to customize and use the namespaceSelector instead? 🤔

@swiatekm
Contributor

swiatekm commented Apr 8, 2024

WATCH_NAMESPACE isn't about the webhook (which, as you observed, is really an API Server mechanism and configured via the appropriate resource), but about reconciliation. Sometimes, you want the operator to only reconcile resources in some subset of namespaces, and that's what the variable is for.
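The split described above can be pictured with a toy model: the webhook handler runs for any namespace the API server sends it, while lookups go through a cache that only knows the watched namespaces. Everything below is a hypothetical sketch (the `scopedCache` type and its methods are not controller-runtime's API); only the error wording is taken from the operator log earlier in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknownNamespace mirrors the wording of the error in the operator log.
var errUnknownNamespace = errors.New("unknown namespace for the cache")

// scopedCache is a toy stand-in for a cache restricted to watched namespaces.
type scopedCache struct {
	objects map[string]map[string]string // namespace -> object name -> kind
}

// newScopedCache creates a cache that only knows the given namespaces.
func newScopedCache(namespaces ...string) *scopedCache {
	c := &scopedCache{objects: make(map[string]map[string]string)}
	for _, ns := range namespaces {
		c.objects[ns] = make(map[string]string)
	}
	return c
}

// Get fails for any namespace the cache was not configured with.
func (c *scopedCache) Get(namespace, name string) (string, error) {
	byName, ok := c.objects[namespace]
	if !ok {
		return "", fmt.Errorf("unable to get: %s/%s because of %w", namespace, name, errUnknownNamespace)
	}
	kind, ok := byName[name]
	if !ok {
		return "", fmt.Errorf("%s/%s not found", namespace, name)
	}
	return kind, nil
}

// isUnknownNamespace reports whether err came from an unwatched namespace.
func isUnknownNamespace(err error) bool {
	return errors.Is(err, errUnknownNamespace)
}

func main() {
	cache := newScopedCache("watch-ns")
	cache.objects["watch-ns"]["my-deploy-ccf8d9b9"] = "ReplicaSet"

	// The webhook is invoked for both namespaces; only the lookup differs.
	for _, ns := range []string{"watch-ns", "not-watch-ns"} {
		kind, err := cache.Get(ns, "my-deploy-ccf8d9b9")
		fmt.Printf("%s: kind=%q err=%v\n", ns, kind, err)
	}
}
```

In this model the second lookup fails even though the handler itself ran, which is the shape of the "failed to get replicaset" error reported above.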

@janario
Contributor Author

janario commented Apr 11, 2024

> WATCH_NAMESPACE isn't about the webhook (which, as you observed, is really an API Server mechanism and configured via the appropriate resource), but about reconciliation. Sometimes, you want the operator to only reconcile resources in some subset of namespaces, and that's what the variable is for.

I got it.

I just think they are two things that ideally should be configured together, but I see they are not strongly coupled.

So, thinking about what drove me here, I'm thinking of improving this log line:
https://github.com/open-telemetry/opentelemetry-operator/blob/main/main.go#L248

so that whoever reads it has a better idea of how to limit the operator to specific namespaces.

wdyt? I can add more details to the log line in another PR.

@swiatekm
Contributor

@janario that sounds good to me.

@@ -24,6 +24,8 @@ import (
"strings"
"time"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
Member


should be merged with the group below

Contributor Author


#2666 (comment)

I'll work on another PR with log improvements, since this is more or less expected behavior.

@janario janario closed this Apr 11, 2024
@janario janario deleted the feature/watch-namespace-mutator branch May 9, 2024 09:18
Development

Successfully merging this pull request may close these issues.

Watch namespace is still trying to change Pod from other namespaces
4 participants