Default operator (controller) memory resources are too low #1622

Closed

bukovjanmic opened this issue Jul 29, 2024 · 25 comments · Fixed by #1929
Labels: triage/accepted

Comments

@bukovjanmic

The operator (as distributed via OLM) has a default memory limit of 550MiB and a memory request of 20MiB.

Our monitoring shows that actual memory usage spikes higher than that: on our relatively small cluster (4 worker nodes, tens of Grafana CRs) it oscillates between 540Mi and 683Mi, so without any intervention the operator gets continuously OOMKilled.

On our largest cluster, memory consumption oscillates around 2-2.1 GB.

While we know how to adjust the values for individual operator instances, the defaults should probably be changed to something more reasonable; I suggest 600MiB for memory.request and 2.5GiB for memory.limit.

The default memory request (20 MiB) seems to be way off; the operator never runs at such low consumption.
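
For reference, this is roughly how we override the defaults per instance today; a minimal sketch assuming installation through an OLM Subscription (the channel, catalog source, namespaces, and values below are placeholders, not recommendations):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator            # placeholder
  namespace: operators              # placeholder
spec:
  channel: v5                       # placeholder channel
  name: grafana-operator
  source: operatorhubio-catalog     # placeholder catalog source
  sourceNamespace: olm              # placeholder
  config:
    resources:                      # per-instance override of the operator container's resources
      requests:
        memory: 600Mi
      limits:
        memory: 2560Mi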

@bukovjanmic added the bug and needs triage labels Jul 29, 2024
@theSuess
Member

@NissesSenap I know you're running a larger deployment of the operator - can you share your resource consumption to see if we should change the defaults here?

@NissesSenap
Collaborator

I'm currently on PTO and don't have access to our metrics, but I can take a look at the end of the week when I'm back home.

@theSuess added the triage/needs-information label and removed the needs triage label Aug 5, 2024
@weisdd
Collaborator

weisdd commented Aug 7, 2024

@bukovjanmic we agree that the default limits/requests are unlikely to be optimal, and we can certainly adjust the values. At the same time, the production memory consumption you highlighted seems quite high, so we'd like to treat this as an opportunity to assess whether 2Gi is reasonable or whether we can optimize it. @theSuess added a pprof endpoint as part of the latest release, so the best way forward would be to get both insights on your environment (how many of each CR you run, their resync period, etc.) and pprof traces. The second-best option is to share just the above-mentioned insights, so that we can at least try to reproduce the behaviour. We understand that you might not be willing to share any of this publicly, so we can also get in touch on Slack. Please let us know how you'd like to proceed :)

@bukovjanmic
Author

Hi, no problem sharing stats (green is memory request, yellow memory limit, blue actual consumption).

Here is our test cluster's consumption over 2 days (operator version 5.11.0):

image

This cluster-wide instance is reconciling 5 Grafana CRs, 48 GrafanaDashboards, 20 GrafanaDatasources, and 1 GrafanaFolder.

Here is one of our production clusters over 2 days (operator version 5.9.2):

image

This cluster-wide instance is reconciling 33 Grafana CRs, 187 GrafanaDashboards, 240 GrafanaDatasources, and 2 GrafanaFolders.

All operators are installed via OLM with defaults (except for resources); the resync period is left up to the operator.

@bukovjanmic
Author

Also, I put version 5.12.0 on our test cluster and can see the pprof endpoint. Any pointers on what to do with it?

@weisdd
Collaborator

weisdd commented Aug 9, 2024

@bukovjanmic

  1. Install golang 1.22+ if needed;
  2. Run kubectl port-forward pods/<pod-name> 8888:8888
  3. go tool pprof -pdf http://localhost:8888/debug/pprof/heap > heap.pdf
  4. go tool pprof -pdf http://localhost:8888/debug/pprof/goroutine > goroutine.pdf
  5. Ideally, a graph with memory consumption where we can see both overall usage and resident set size (RSS);
  6. Also, if you collect metrics from the operator itself, a graph with go_goroutines would be handy;
  7. Insights on the number and type of CRs that were deployed at the time you captured the data. Extra information like whether they are in-cluster or external Grafana instances, the source for dashboards (e.g. whether they're fetched from URLs), etc. is appreciated.

Thanks!

@bukovjanmic
Author

Here are pprof dumps from our test cluster. This instance of the Grafana operator has 5 Grafana CRs, 48 GrafanaDashboards, 20 GrafanaDatasources, and 1 GrafanaFolder.
grafana-operator-pprof.zip

@weisdd
Collaborator

weisdd commented Aug 12, 2024

@bukovjanmic in this profiling set, it seems like most of the memory is used for caching ConfigMaps/Secrets. It would be interesting to see how memory is used in production.

@bukovjanmic
Author

Here are pprof files for our production cluster - 33 Grafana CRs, 187 GrafanaDashboards, 240 GrafanaDatasources, 2 GrafanaFolders.

grafana-operator-pprof-prod.zip

@theSuess
Member

theSuess commented Sep 3, 2024

Looks like the RabbitMQ operator had a similar issue in the past. I'll try to adapt their solution to our operator:

rabbitmq/cluster-operator#1549

@weisdd
Collaborator

weisdd commented Sep 3, 2024

The good thing is that it's not a memory leak, just part of "normal" operations. It would definitely be nice to tune the caching.


github-actions bot commented Oct 4, 2024

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

@github-actions bot added the stale label Oct 4, 2024
@theSuess removed the stale label Oct 4, 2024

github-actions bot commented Nov 4, 2024

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

@github-actions bot added the stale label Nov 4, 2024
@theSuess added the triage/accepted label and removed the stale label Nov 4, 2024
@theSuess removed the triage/needs-information label Nov 11, 2024
@Rohlik

Rohlik commented Dec 13, 2024

We have an EKS cluster with just 35 GrafanaDashboards and 1 local Grafana configured as external within grafana-operator. No other CRs are in use. It is worth mentioning that the operator is set to look for CRs across all namespaces (116).
The memory consumption, especially at the beginning of the pod lifecycle, is pretty high - 1.68 GiB 🧐. That's quite a lot just for syncing 35 objects 🙄.
Attaching pprof output pprof-prod.zip.

Image

@weisdd
Collaborator

weisdd commented Feb 22, 2025

@bukovjanmic @Rohlik In the upcoming release, we'll have support for automatic configuration of GOMEMLIMIT, which makes GC more aggressive when memory usage is nearing configured memory limits. It might help (or at least make things a bit better) until selective caching is implemented (#1818). Meanwhile, could you try to either set GOMEMLIMIT statically or derive it from the example below?

env:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
        # NOTE: if you have multiple containers, you might need to add "containerName: <name>" as well
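
For the static variant mentioned above, a hedged sketch (the 450MiB value is purely illustrative; pick something comfortably below the container's actual memory limit):

env:
  - name: GOMEMLIMIT
    value: "450MiB" # Go accepts B/KiB/MiB/GiB suffixes; keep this below the container's memory limit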

@weisdd
Collaborator

weisdd commented Mar 24, 2025

@bukovjanmic @Rohlik it would be nice if you could experiment with the latest release (https://github.com/grafana/grafana-operator/releases/tag/v5.17.0):

  • without extra settings (but make sure memory limits are set);
  • with ENFORCE_CACHE_LABELS set to safe and to all (you can find more details in the release description).
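
For reference, a minimal sketch of setting the variable on a plain Deployment-style install (the container name below is a placeholder; OLM installs would set it through the Subscription's spec.config.env instead):

containers:
  - name: grafana-operator-controller-manager # placeholder container name
    env:
      - name: ENFORCE_CACHE_LABELS
        value: "safe" # or "all"; see the v5.17.0 release notes for details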

@Rohlik

Rohlik commented Mar 25, 2025

Hi @weisdd, thanks for letting me know. I will test it, but just to be sure I'm getting it right:
In the case of ENFORCE_CACHE_LABELS set to all, does the label app.kubernetes.io/managed-by: grafana-operator also need to be added manually to the Secret that is referenced within the Grafana CR for an external Grafana 🤔?
That label is not required for GrafanaDashboard CRs, right?

@Rohlik

Rohlik commented Mar 25, 2025

I successfully tested ENFORCE_CACHE_LABELS set to safe with a positive result; see the figure below from one of our DEV environments.
In production, I believe the savings will be much more significant.

Image

@weisdd
Collaborator

weisdd commented Mar 25, 2025

@Rohlik in the case of all, you'll need to add the label only to ConfigMaps and Secrets. We don't tune the cache for the operator's CRDs.
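
For illustration, a minimal sketch of a labeled Secret (the name, namespace, and keys are placeholders; only the label matters here):

apiVersion: v1
kind: Secret
metadata:
  name: external-grafana-credentials # placeholder
  namespace: monitoring # placeholder
  labels:
    app.kubernetes.io/managed-by: grafana-operator # makes the Secret visible to the operator when ENFORCE_CACHE_LABELS=all
type: Opaque
stringData:
  ADMIN_USER: admin # placeholder data
  ADMIN_PASSWORD: secret # placeholder data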

@bukovjanmic
Author

I tested the "safe" option and memory consumption went from approx. 0.7GB to 0.066GB (roughly a tenth of the previous value). So far everything seems to be working correctly.

With the "all" option I am not sure - does it require adding the label to all ConfigMaps and Secrets that are referenced by any GrafanaDatasource or GrafanaDashboard? What happens if the label is not there - will the data source or dashboard fail to be created, or just not be cached?

@weisdd
Collaborator

weisdd commented Mar 27, 2025

@bukovjanmic that's great, thanks for the update. I think we'll use "safe" by default in the next release.

As for your question, non-labeled ConfigMaps and Secrets will not be visible to the operator. It can get a bit cumbersome depending on which resources you have, but it is what it is.
controller-runtime does not cover cache-miss cases out of the box; we might implement that later. Another option would be to not cache those resources at all, but that would mean higher k8s API usage, which is unlikely to scale well.

@bukovjanmic
Author

OK, maybe we found a bug. We have the "safe" option set up.

We were fiddling with plugins and added a plugin definition to an existing data source CR.

It seems that the operator does not reconcile the GF_INSTALL_PLUGINS env variable in the Deployment, but this may be a separate bug.

To move forward, we deleted the Deployment object (owned by the Grafana CR), expecting the operator to recreate it. That did not happen, though, and I think it worked with previous versions, possibly because of caching.

After restarting the operator, the Deployment object was created and GF_INSTALL_PLUGINS was reconciled.

@weisdd
Collaborator

weisdd commented Mar 27, 2025

@bukovjanmic I think it's rather a bug in the reconciler(s); could you open an issue for that?

@bukovjanmic
Author

You are right, I see the same behavior on another cluster with 5.16 installed. I will open another issue.

@Rohlik

Rohlik commented Mar 31, 2025

Just so you know, here's the result from our production. The drop is significant.

Image
