
feat(controllers): optionally do not cache resources created without CommonLabels #1818


Merged
10 commits merged into grafana:master on Mar 11, 2025

Conversation

Baarsgaard
Collaborator

@Baarsgaard Baarsgaard commented Jan 11, 2025

I read a blog post on operator memory pitfalls that mentions Owns() being a footgun; Owns() is used in the grafana_reconciler SetupWithManager.

TLDR: By declaring Owns() or using Get/List, you tell the controller-runtime to watch and cache all instances of the client.Object, which on large clusters could mean a lot of ConfigMaps, Secrets, and Deployments in the Grafana-Operator's case.
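For context, a minimal sketch of the Owns() pattern in question, assuming controller-runtime's builder API (illustrative, not the operator's exact SetupWithManager; grafanav1beta1 is assumed to alias the operator's API package):

// Each Owns() registers a watch, and by default controller-runtime backs that
// watch with an informer that lists and caches every object of the type cluster-wide.
func (r *GrafanaReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&grafanav1beta1.Grafana{}).
		Owns(&appsv1.Deployment{}). // caches all Deployments in the cluster
		Owns(&corev1.ConfigMap{}).  // all ConfigMaps
		Owns(&corev1.Secret{}).     // all Secrets
		Complete(r)
}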

I suspected this was the problem behind the pprof profiles uploaded in #1622, and verified it by following the steps outlined below.

The post linked to an Operator SDK trick for configuring the client.Object cache with labels.

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Cache: cache.Options{
		ByObject: map[client.Object]cache.ByObject{
			&corev1.Secret{}: {
				Label: labels.SelectorFromSet(labels.Set{"app": "app-name"}),
			},
		},
	},
})
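With a ByObject label selector like this, the informer's list and watch requests are filtered server-side, so non-matching Secrets never enter the cache at all rather than being filtered after the fact.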

I remembered that #1661 added common labels to all resources created by the operator, which makes it possible to scope the cache like this and reduce memory consumption.

Verifying cache issues:

  1. Start a local kind cluster with some default resources (oneliner):

     make start-kind && \
     kind export kubeconfig --name kind-grafana && \
     make ko-build-kind && \
     IMG=ko.local/grafana/grafana-operator make deploy && \
     kubectl patch deploy -n grafana-operator-system grafana-operator-controller-manager-v5 --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/imagePullPolicy", "value":"IfNotPresent"}]'

  2. Get a baseline heap reading:

     kubectl port-forward -n grafana-operator-system deploy/grafana-operator-controller-manager-v5 8888 &
     go tool pprof -top -nodecount 20 http://localhost:8888/debug/pprof/heap

  3. Create a 384 KiB test file: fallocate -l 393216 large_file

  4. Create a couple hundred ConfigMaps:

     for i in {0..200}; do kubectl create cm test-cm-$i --from-file=./large_file; done

  5. Get an updated heap reading:

     go tool pprof -top -nodecount 20 http://localhost:8888/debug/pprof/heap
# Output on master branch
File: v5
Type: inuse_space
Time: Jan 11, 2025 at 8:34pm (CET)
Showing nodes accounting for 54.72MB, 100% of 54.72MB total
Showing top 20 nodes out of 107
      flat  flat%   sum%        cum   cum%
   46.91MB 85.72% 85.72%    46.91MB 85.72%  k8s.io/api/core/v1.(*ConfigMap).Unmarshal # <--- this one
       2MB  3.66% 89.38%        2MB  3.66%  runtime.malg
       1MB  1.83% 91.20%        1MB  1.83%  encoding/json.typeFields
    0.75MB  1.37% 92.58%     0.75MB  1.37%  go.uber.org/zap/zapcore.newCounters
    0.54MB  0.99% 93.56%     0.54MB  0.99%  github.com/gogo/protobuf/proto.RegisterType
    0.52MB  0.94% 94.51%     0.52MB  0.94%  k8s.io/apimachinery/pkg/watch.(*Broadcaster).Watch.func1
    0.50MB  0.92% 95.43%     0.50MB  0.92%  unicode.map.init.1
    0.50MB  0.92% 96.34%     0.50MB  0.92%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
    0.50MB  0.91% 97.26%     0.50MB  0.91%  github.com/go-openapi/swag.(*indexOfInitialisms).sorted.func1
    0.50MB  0.91% 98.17%     0.50MB  0.91%  go.mongodb.org/mongo-driver/bson/bsoncodec.(*kindDecoderCache).Clone
....

Current progress

Watching and caching is now limited to operator-controlled resources of the following Kinds:

  • Deployment
  • Ingress
  • Service
  • ServiceAccount
  • PersistentVolumeClaim
  • Route if IsOpenShift

This is done with the existing CommonLabels selector introduced in #1661:
app.kubernetes.io/managed-by: "grafana-operator"
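
A minimal sketch of that scoping with controller-runtime's cache options (illustrative; the PR's actual wiring differs in detail):

sel := labels.SelectorFromSet(labels.Set{
	"app.kubernetes.io/managed-by": "grafana-operator",
})
byLabel := cache.ByObject{Label: sel}

cacheOptions := cache.Options{
	ByObject: map[client.Object]cache.ByObject{
		&appsv1.Deployment{}:            byLabel,
		&networkingv1.Ingress{}:         byLabel,
		&corev1.Service{}:               byLabel,
		&corev1.ServiceAccount{}:        byLabel,
		&corev1.PersistentVolumeClaim{}: byLabel,
	},
}
// Route is added to the map only when running on OpenShift.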

Memory consumption in an empty kind cluster after ~1 minute [1]:

Change                                | Heap (kb) | % reduction [2] | Note
------------------------------------- | --------- | --------------- | -----------
None (master)                         | 8579.22   | 0%              |
Limited to resources listed above     | 5743.40   | -33%            |
ConfigMap and Secret caching disabled | 3585.48   | -58%            | New default

An option to cache ConfigMaps and Secrets has been added.

Footnotes

  [1]: Heap will increase over time as the operator stabilizes.

  [2]: The reduction is by no means representative of real deployments.
       For clusters mixing the Grafana-Operator with other workloads in cluster-scoped mode, the reduction is likely significantly higher.
       Even if the Grafana-Operator were the only Deployment in a cluster, this would still reduce memory, as it won't cache itself 😉

@Baarsgaard Baarsgaard changed the title feat(internal): Ignore deployments/Configmaps missing CommonLabels WIP: Ignore deployments/Configmaps missing CommonLabels Jan 11, 2025
@Baarsgaard Baarsgaard force-pushed the reduce_cache_size branch 2 times, most recently from 389e8d6 to e4ed220 Compare January 11, 2025 23:56
@Baarsgaard Baarsgaard changed the title WIP: Ignore deployments/Configmaps missing CommonLabels Fix: Do not cache native resources created without CommonLabels Jan 12, 2025
@Baarsgaard Baarsgaard force-pushed the reduce_cache_size branch 2 times, most recently from 4848451 to 9a9bb49 Compare January 20, 2025 16:47
@Baarsgaard Baarsgaard force-pushed the reduce_cache_size branch 2 times, most recently from d1f1f0b to 78841fb Compare January 21, 2025 19:14
@Baarsgaard Baarsgaard marked this pull request as ready for review January 24, 2025 12:46
@Baarsgaard
Collaborator Author

I marked this ready, but forgot that it is blocked by #1833.

@Baarsgaard Baarsgaard force-pushed the reduce_cache_size branch 2 times, most recently from 81e6afc to fc47228 Compare January 29, 2025 16:39
@theSuess theSuess added this to the v5.17.0 milestone Feb 4, 2025
@Baarsgaard Baarsgaard force-pushed the reduce_cache_size branch 2 times, most recently from 903567b to 43b2a3e Compare February 22, 2025 21:39
@Baarsgaard
Collaborator Author

Baarsgaard commented Feb 22, 2025

Rebased on master and worked it into the additions from #1832.
Should be ready for review now!

@Baarsgaard
Collaborator Author

Baarsgaard commented Feb 24, 2025

I've added an experimental feature toggle in the form of the EXPERIMENTAL_ENABLE_CACHE_LABEL_LIMITS environment variable.

This ended up a bit complex to validate, but I think I managed to cover most of it.
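
For reference before the walkthrough, the opt-in amounts to gating the cache options on that variable; a simplified sketch, not the exact main.go code:

// Hedged sketch: when the experimental env var is set, the manager uses
// the label-scoped cache options sketched earlier; otherwise everything
// is cached as before.
if os.Getenv("EXPERIMENTAL_ENABLE_CACHE_LABEL_LIMITS") != "" {
	controllerOptions.Cache = cacheOptions
}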

Ensure that the PR works and can be enabled

  1. Get memory baseline

    kubectl port-forward -n grafana-operator-system deploy/grafana-operator-controller-manager-v5 8888 &
    go tool pprof -top -nodecount 20 http://localhost:8888/debug/pprof/heap
  2. Provoke memory climb

    fallocate -l 393216 /tmp/large_file # Create a large file for testing
    for i in {0..200}; do kubectl create cm test-cm-$i -n test --from-file=/tmp/large_file; done
    go tool pprof -top -nodecount 20 http://localhost:8888/debug/pprof/heap
  3. Enable cache limits and see memory decrease

    kubectl patch deploy -n grafana-operator-system grafana-operator-controller-manager-v5  --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/env/-", "value": {"name": "EXPERIMENTAL_ENABLE_CACHE_LABEL_LIMITS", "value": "1"}}]'
    go tool pprof -top -nodecount 20 http://localhost:8888/debug/pprof/heap

Continue by testing that it does not break the watch label selector (sharding)

  1. Enable sharding, memory should reset as ConfigMaps are not labeled

    kubectl patch deploy -n grafana-operator-system grafana-operator-controller-manager-v5  --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/env/-", "value": {"name": "WATCH_LABEL_SELECTORS", "value": "manual=test"}}]'
    go tool pprof -top -nodecount 20 http://localhost:8888/debug/pprof/heap
  2. Label ConfigMaps and see memory jump

    for i in {0..200}; do kubectl label cm -n test test-cm-$i manual=test & done
    go tool pprof -top -nodecount 20 http://localhost:8888/debug/pprof/heap
  3. Create Sharded and unsharded grafana instances

    ---
    apiVersion: grafana.integreatly.org/v1beta1
    kind: Grafana
    metadata:
      name: grafana-normal
    spec:
      config:
        log:
          mode: "console"
        auth:
          disable_login_form: "false"
        security:
          admin_user: root
          admin_password: secret
    ---
    apiVersion: grafana.integreatly.org/v1beta1
    kind: Grafana
    metadata:
      name: grafana-shard
      labels:
        manual: test
    spec:
      config:
        log:
          mode: "console"
        auth:
          disable_login_form: "false"
        security:
          admin_user: root
          admin_password: secret
  4. Verify that only one instance has a reconciled status


    kubectl get grafanas -A -o yaml
  5. Manually remove only the EXPERIMENTAL_ENABLE_CACHE_LABEL_LIMITS env var on the Deployment and see memory increase while it's sharded

    kubectl edit deploy -n grafana-operator-system grafana-operator-controller-manager-v5
    go tool pprof -top -nodecount 20 http://localhost:8888/debug/pprof/heap

Every edit of the Deployment requires stopping and re-opening the port-forward before running pprof; this is left out for brevity.

@Baarsgaard Baarsgaard requested a review from weisdd March 6, 2025 18:40
main.go Outdated
setupLog.Error(err, fmt.Sprintf("unable to parse %s", watchLabelSelectorsEnvVar))
os.Exit(1) //nolint
// Allow users to enable the above cache limits before a full rollout
if enableCacheLabelLimits == "" {
@weisdd weisdd (Collaborator) commented Mar 9, 2025

  1. The name and idea behind this variable hint that it must be a boolean-like value, so we should not do empty-string comparisons.
  2. The code here states that it would enable caching limits, whereas it actually lifts those limits, meaning everything will be cached.
  3. I think a simpler way to implement all of this would be something like this:
	cacheOptions := cache.Options{
		ByObject: map[client.Object]cache.ByObject{
			&v1.Deployment{}:                cacheByObject,
			&corev1.Service{}:               cacheByObject,
			&corev1.ServiceAccount{}:        cacheByObject,
			&networkingv1.Ingress{}:         cacheByObject,
			&corev1.PersistentVolumeClaim{}: cacheByObject,
			&corev1.ConfigMap{}:             cacheByObject,
			&corev1.Secret{}:                cacheByObject,
		},
	}

	// TODO: Curious what would happen in vanilla k8s if we don't have this check for OpenShift
	if isOpenShift {
		cacheOptions.ByObject[&routev1.Route{}] = cacheByObject
	}

	// I like this name more, it's self-explanatory. By the way, I don't think we have to prefix
	// environment variables with EXPERIMENTAL_, we just need to clarify that in docs / helm / code comments
	if cacheOnlyLabeledResources == "true" {
		controllerOptions.Cache = cacheOptions
	}

@weisdd weisdd (Collaborator) added:

One more point:
Should Secret / ConfigMap caching be enabled, do we want it to apply only to labeled resources? Should we have a separate configuration option that tweaks that behaviour, or is it better to not use cacheByObject for them at all? Or should we even go as far as prometheus-operator, which filters by Secret type instead? I'm not sure what's the best way to go, just highlighting a concern.
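
For illustration, a hedged sketch of that prometheus-operator-style alternative (the concrete selector string and error handling are assumptions for the example, not code from either project):

// Hypothetical: restrict which Secrets are cached via a field selector on
// the Secret type, e.g. excluding Helm release secrets.
fieldSel, err := fields.ParseSelector("type!=helm.sh/release.v1")
if err != nil {
	setupLog.Error(err, "unable to parse Secret field selector")
	os.Exit(1)
}
cacheOptions.ByObject[&corev1.Secret{}] = cache.ByObject{Field: fieldSel}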

@weisdd
Collaborator

weisdd commented Mar 9, 2025

@Baarsgaard I've just modified some of the comments that were added a few minutes ago, so please refer to the latest versions. Thx!

@theSuess theSuess force-pushed the reduce_cache_size branch from 2df48b8 to 330af1f Compare March 11, 2025 11:39
@theSuess
Collaborator

Refactored this to use an env var with different levels. Also moved the code around a bit to make the opt-in nature more apparent and easier to review.

@weisdd @Baarsgaard if you have a minute, I'd appreciate a re-review as I'm now obviously biased that this is good to merge 😅

@theSuess theSuess requested a review from weisdd March 11, 2025 11:40
@theSuess theSuess force-pushed the reduce_cache_size branch from 330af1f to 06da208 Compare March 11, 2025 11:41
@weisdd weisdd (Collaborator) left a comment

Naming and code structure are clear now, everything looks good to me :)

@theSuess theSuess added this pull request to the merge queue Mar 11, 2025
Merged via the queue into grafana:master with commit 06be4b3 Mar 11, 2025
15 checks passed
@weisdd weisdd changed the title Fix: Do not cache native resources created without CommonLabels feat(controllers): optionally do not cache resources created without CommonLabels Mar 11, 2025
@weisdd weisdd added the feature this PR introduces a new feature label Mar 11, 2025
@Baarsgaard Baarsgaard deleted the reduce_cache_size branch March 11, 2025 18:38
Labels
feature this PR introduces a new feature
3 participants