
Conversation

@NegatioN

Added a modified healthcheck.sh as readiness.sh that the operator can use to determine whether snapshotting has completed.
Confirmed that helm package $repo -> helm install dragonfly-operator -> kubectl apply -f config/samples/v1alpha1_dragonfly.yaml (with an updated Docker image) leads to a state where a ConfigMap hosts our readiness.sh script and it is accessible in the pod.
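
(For illustration only, not the PR's actual code: a minimal Go sketch of how such a mounted script could be wired into an exec readiness probe. The mount path /scripts/readiness.sh and the probe timings are assumptions.)

package example

import corev1 "k8s.io/api/core/v1"

// buildReadinessProbe is a hypothetical helper: it assumes readiness.sh is
// mounted at /scripts/readiness.sh from a ConfigMap and runs it as an exec
// readiness probe.
func buildReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"/bin/sh", "/scripts/readiness.sh"},
			},
		},
		InitialDelaySeconds: 5,
		PeriodSeconds:       10,
		FailureThreshold:    3,
	}
}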

Not done: Bumping charts and preparing for release of this repo. Maybe it's better if a core dev does that (?) 🤷

Related: dragonflydb/dragonfly#5921 and #397

Did not run the pre-commit hook for this, because I would have to re-clone the repo and use Git. I will do it when/if the PR gets approved.

@NegatioN
Author

@romange Tagging you here just in case this slips under the radar 😄 Not sure who else should be the contact point for this.
Was this something like what you had in mind?

@romange
Contributor

romange commented Oct 20, 2025

@NegatioN does this issue affect your deployment as well?

It is possible to inject the readiness script by mounting an external read-only volume that contains only that script. I think that extending the helm charts to mount that script, instead of changing the container itself, is a viable option.
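
(Illustrative sketch of that kind of mount; the volume name, ConfigMap name, and mount path below are assumptions, not the operator's or chart's actual values.)

package example

import corev1 "k8s.io/api/core/v1"

// readinessScriptVolume builds a read-only ConfigMap-backed volume and the
// matching mount that would expose a readiness script inside the container.
func readinessScriptVolume() (corev1.Volume, corev1.VolumeMount) {
	mode := int32(0o555) // world-readable and executable
	vol := corev1.Volume{
		Name: "readiness-script",
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: "dragonfly-readiness"},
				DefaultMode:          &mode,
			},
		},
	}
	mount := corev1.VolumeMount{
		Name:      "readiness-script",
		MountPath: "/scripts",
		ReadOnly:  true,
	}
	return vol, mount
}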

@NegatioN
Author

NegatioN commented Oct 20, 2025

@romange I'm not sure I understood everything you said, so feel free to point out any misunderstandings.

does this issue affect your deployment as well?

It doesn't affect the dragonfly-operator or any of our other deployments, but it does affect the pods/StatefulSet of Dragonfly managed by the operator: they report being healthy and ready for traffic while in reality they are still loading the snapshot, which leads the operator to route traffic to them before they're ready.

It is possible to inject the readiness script by mounting an external read-only volume that contains only that script. I think that extending the helm charts to mount that script, instead of changing the container itself, is a viable option.

I was under the impression that, to interact with the actual Dragonfly pods that the operator manages, I would have to modify the Go code here. Are you saying there's another YAML definition that would be better for me to modify?

@romange
Contributor

romange commented Oct 20, 2025

No, I am not saying that. Some folks use helm charts to deploy, and for them my suggestion could be more relevant, I guess.

I am not familiar with the inner workings of the DF operator; I will check with the team and reply.

@romange
Contributor

romange commented Oct 20, 2025

I am sorry, I was totally confused. I missed entirely that this PR is in the dragonfly-operator repo :)

Ignore my comment. I will ask someone to review.

@moredure
Contributor

moredure commented Oct 20, 2025

It's backward incompatible, I believe. Have you tried updating an existing datastore with a newer version of the operator? Will it result in an update of the existing StatefulSet?

@NegatioN
Author

NegatioN commented Oct 20, 2025

It's backward incompatible, I believe. Have you tried updating an existing datastore with a newer version of the operator? Will it result in an update of the existing StatefulSet?

I have not tried that. (And when you say "datastore" here, I'm unsure what you mean; this is the first time I've written anything for a Helm app in years and years.)

I'm pretty sure it's not backwards compatible, though. The pods spawned by this operator depend on a ConfigMap that did not exist before, and the StatefulSet would also need to be updated with the volume that points to this ConfigMap.

Edit: Also it seems like the tests are failing because the ConfigMap is not loaded into the test environment running with kind. Any hints on how to include it?

@NegatioN
Author

Decided on adding a ConfigMap just for the tests, in code. The kustomize part of the config seems to target the operator and its namespace very directly, while this needs to live in the same namespace as our pods, for example.
Figured the simplest thing would be to just keep the behavior as before in tests, until we decide there's a need to explicitly e2e test this code.

@NegatioN
Author

What would be the next steps on this @moredure @vyavdoshenko ?

@vyavdoshenko requested a review from moredure on October 21, 2025 at 14:07
@vyavdoshenko

What would be the next steps on this @moredure @vyavdoshenko ?

LGTM

@moredure
Contributor

moredure commented Oct 21, 2025

What would be the next steps on this @moredure @vyavdoshenko ?

Code looks fine, but I am concerned that updating the operator version over an existing CRD will result in issuing a StatefulSet update, which some users might not expect.

@Abhra303 did we have any conventions around backward compatibility in these terms?

Probably worth flagging this change as optional, or skipping it when comparing inside resourceSpecsEqual, so the new structure is issued only once.
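
(Illustrative only: one way the comparison could ignore the new volume, assuming a readiness-script volume name; the real resourceSpecsEqual may be structured very differently.)

package example

import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
)

// specsEqualIgnoringReadinessVolume drops the readiness-script volume from
// both specs before comparing, so adding it does not force a rollout of every
// existing StatefulSet. Names and structure here are hypothetical.
func specsEqualIgnoringReadinessVolume(a, b appsv1.StatefulSetSpec) bool {
	return reflect.DeepEqual(stripReadinessVolume(a), stripReadinessVolume(b))
}

func stripReadinessVolume(s appsv1.StatefulSetSpec) appsv1.StatefulSetSpec {
	out := *s.DeepCopy()
	vols := out.Template.Spec.Volumes[:0]
	for _, v := range out.Template.Spec.Volumes {
		if v.Name != "readiness-script" { // assumed volume name
			vols = append(vols, v)
		}
	}
	out.Template.Spec.Volumes = vols
	return out
}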

@Abhra303
Contributor

Abhra303 commented Oct 23, 2025

@NegatioN sorry for replying late; the script itself looks good, but it's not going to work in the DF operator. Your code creates a ConfigMap only when you're using the helm chart. This approach has some critical downsides:

  1. It doesn't work if installed directly (without the helm chart). The e2e test is failing because it doesn't have the ConfigMap we are trying to mount.
  2. It breaks compatibility. Existing pods won't be able to find the ConfigMap.

You can either create a ConfigMap when the operator creates the Dragonfly resources, or add a new optional field to the CRD, "customHealthCheckConfigMap" (or something similar), to let users configure the ConfigMap.

E.g. the steps to have a custom health check script:

  1. kubectl create configmap custom-script --from-file=your-script-file
  2. configure CRD to use this configmap.
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: dragonfly-operator
  name: dragonfly-sample
spec:
  customHealthCheckConfigMap:
     name: custom-script
     ns: prod-ns
...

If the field is nil, go with the default health script.
This seems to be the simplest approach, as it neither adds a ConfigMap dependency to the operator nor breaks compatibility.
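
(A minimal Go sketch of what such an optional field could look like, mirroring the YAML example above. Field and type names are assumptions; in the real code the field would be added to the existing DragonflySpec.)

package v1alpha1

// CustomHealthCheckConfigMap points at a user-managed ConfigMap that holds
// the custom readiness script. Purely illustrative.
type CustomHealthCheckConfigMap struct {
	// Name of the ConfigMap holding the custom readiness script.
	Name string `json:"name"`
	// Namespace the ConfigMap lives in ("ns" in the YAML example above).
	Namespace string `json:"ns,omitempty"`
}

// SpecFragment stands in for the existing Dragonfly spec: the new field is a
// pointer, so it is nil by default and existing resources are unaffected.
type SpecFragment struct {
	// +optional
	CustomHealthCheckConfigMap *CustomHealthCheckConfigMap `json:"customHealthCheckConfigMap,omitempty"`
}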

@Abhra303
Contributor

So the existing code looks good. You just need to add an optional struct field to the operator CRD (see other merged PRs for examples) which is nil by default.
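
(And the nil-is-default behavior, sketched with assumed script paths; the operator's real probe construction will differ.)

package example

// readinessCommand is a hypothetical helper: with no custom ConfigMap
// configured it falls back to the image's built-in healthcheck.sh, otherwise
// it runs the user-provided script at an assumed mount path.
func readinessCommand(customConfigMapName *string) []string {
	if customConfigMapName == nil {
		return []string{"/bin/sh", "/usr/local/bin/healthcheck.sh"} // assumed default path
	}
	return []string{"/bin/sh", "/scripts/readiness.sh"} // assumed mount path for the custom script
}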


It("Should create configmap successfully", func() {
err := k8sClient.Create(ctx, &corev1.ConfigMap{
TypeMeta: metav1.TypeMeta{

You don't need to set TypeMeta explicitly, as it is already a ConfigMap struct.
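
(For illustration, the quoted test could then read roughly like this; a sketch reusing the suite's k8sClient and ctx as in the snippet above, with a placeholder ConfigMap name, namespace, and script body.)

It("Should create configmap successfully", func() {
	err := k8sClient.Create(ctx, &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "dragonfly-readiness", // placeholder name
			Namespace: "default",             // placeholder namespace
		},
		Data: map[string]string{
			"readiness.sh": "#!/bin/sh\nexit 0\n", // placeholder script body
		},
	})
	Expect(err).To(BeNil())
})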

@NegatioN
Author

NegatioN commented Oct 24, 2025

Hi @Abhra303, thanks for the great feedback. I can definitely try to modify the PR to fit one of your suggestions.
However, while looking around, I was also wondering whether this sort of check of the snapshot-loading status could belong in this part of the code, which deals with determining whether a replica is stable or not (or some very similar stage, specifically for snapshotting).
Would this be a better or worse solution?

If that were desirable, we wouldn't need to manage any new resources, and there wouldn't be any breaking changes. We would, however, lose the configurability of the "provide your own readiness check" solution.

@NegatioN
Author

Actually, upon further investigation, the code path I alluded to earlier (which my coworker made me aware of) already does wait for snapshots:
_, err := redisClient.Ping(ctx).Result() will, while a snapshot is being loaded, return an explicit error with the message Dragonfly is loading the dataset in memory.
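
(A small sketch of that distinction, assuming the go-redis v9 client; the operator may pin a different client version, and the error-message match is based on the text quoted above.)

package example

import (
	"context"
	"strings"

	"github.com/redis/go-redis/v9"
)

// isLoadingSnapshot treats the "loading the dataset in memory" error as
// "up but not ready yet", and anything else as a real failure.
func isLoadingSnapshot(ctx context.Context, rdb *redis.Client) (bool, error) {
	_, err := rdb.Ping(ctx).Result()
	if err == nil {
		return false, nil // reachable and done loading
	}
	if strings.Contains(err.Error(), "loading the dataset in memory") {
		return true, nil // Dragonfly is up but still restoring the snapshot
	}
	return false, err // some other failure (network, auth, ...)
}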

I think this entire PR and the related GitHub issues are based on a misunderstanding that there is downtime, when there is not (or there might have been in older versions of the operator?).

As a consequence, I feel like closing this PR is the right move. If the end goal is to make a configurable readiness check, maybe that should be done in another PR. Possible use cases I see for that are saving large snapshots, which might make a node unresponsive, and redirecting traffic based on that 🤷

Sorry for wasting the time of people who reviewed this PR.

@NegatioN closed this on Oct 24, 2025