Ingester zonal disruptions #9908

Open
nullren opened this issue Nov 14, 2024 · 2 comments

nullren commented Nov 14, 2024

Is your feature request related to a problem? Please describe.

When deploying Mimir to Kubernetes, Pod Disruption Budgets (PDBs) are created for some pod types (distributors, ingesters, etc.). However, they tend to be too restrictive: I think they allow something like only 1 disruption at a time.

Because metrics are replicated across zones, it should be safe to disrupt more pods at once as long as they are all in the same zone, but there isn't a clear way to define a PDB that expresses this.

Describe the solution you'd like

It would be nice if there were some way to have a "high level PDB" under which whole zones can be disrupted. A zone would be "healthy" or "up" if all pods in that zone are healthy/up, so a disrupted zone would be one where at least 1 pod is unhealthy.

That might enable something like a "ZDB" (a zone-level analogue of a PDB) with a rule that a majority of zones must be available/undisrupted. This would allow you to disrupt a single zone (e.g., all pods in that zone). It would speed up draining k8s nodes, since you could safely disrupt 1/3 of the total pods at once, which is really important/helpful when running many pods.
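To make the rule concrete, here is a minimal sketch of the decision such a controller might make before allowing an eviction. This is only an illustration of the idea, not an existing Kubernetes, Mimir, or rollout-operator API: the `Pod` type, `disrupted_zones`, `may_disrupt`, and the pod and zone names are all hypothetical.

```python
from typing import Iterable, NamedTuple, Set


class Pod(NamedTuple):
    # Hypothetical, simplified view of a pod: just what the rule needs.
    name: str
    zone: str
    healthy: bool


def disrupted_zones(pods: Iterable[Pod]) -> Set[str]:
    """A zone counts as disrupted if at least one of its pods is unhealthy."""
    return {p.zone for p in pods if not p.healthy}


def may_disrupt(pods: Iterable[Pod], target_zone: str) -> bool:
    """Allow a disruption in target_zone only if a majority of zones
    would still be undisrupted afterwards."""
    pods = list(pods)
    all_zones = {p.zone for p in pods}
    # Zones that would be disrupted if the eviction went ahead.
    after = disrupted_zones(pods) | {target_zone}
    undisrupted_after = len(all_zones) - len(after)
    return undisrupted_after > len(all_zones) / 2


pods = [
    Pod("ingester-zone-a-0", "zone-a", True),
    Pod("ingester-zone-a-1", "zone-a", False),  # zone-a is already disrupted
    Pod("ingester-zone-b-0", "zone-b", True),
    Pod("ingester-zone-c-0", "zone-c", True),
]

print(may_disrupt(pods, "zone-a"))  # True: zone-a is already the disrupted zone
print(may_disrupt(pods, "zone-b"))  # False: only 1 of 3 zones would stay healthy
```

With three zones, any number of pods in the one disrupted zone can be evicted, while evictions in the other two zones are refused until that zone is healthy again.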

This might be accomplished via some sort of controller/operator.

For example, we have a cluster with 420 ingester pods. With a PDB that allows only 1 pod to be disrupted, we can drain at most 1 k8s node at a time, when this could be done much more quickly (and safely).
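As a rough back-of-the-envelope check on the 420-ingester example, the sketch below compares the two budgets. It assumes three equally sized zones (implied by the "1/3 of the total pods" figure, but not stated outright) and a per-pod PDB that allows exactly 1 disruption; the function name is made up for illustration.

```python
def max_concurrent_disruptions(total_pods: int, zones: int) -> int:
    """Pods that may be down at once if one whole zone may be disrupted.

    Assumes pods are spread evenly across zones.
    """
    return total_pods // zones


total_ingesters = 420  # from the example in this issue
zones = 3              # assumption: three equally sized zones
pdb_budget = 1         # assumption: the current PDB allows a single disruption

zone_budget = max_concurrent_disruptions(total_ingesters, zones)
print(zone_budget)                # 140 pods may be disrupted together
print(zone_budget // pdb_budget)  # 140 times the current concurrency
```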

Describe alternatives you've considered

This might be something we'll have to create ourselves because (ironically) it's very disruptive.

nullren commented Nov 14, 2024

conceptually this could definitely be something that exists in Kubernetes directly, because the pattern of "allowing zonal disruptions" is not unique to Mimir. e.g., an Elasticsearch cluster that has documents replicated across "zones" would benefit from this same controller...

nullren commented Nov 14, 2024

perhaps this is something the https://github.com/grafana/rollout-operator could manage?
