
Conversation

@NegatioN

Added a modified healthcheck.sh as readiness.sh that the operator can use to determine whether snapshotting has completed.
Confirmed that helm package $repo -> helm install dragonfly-operator -> kubectl apply -f config/samples/v1alpha1_dragonfly.yaml (with an updated Docker image) leads to a state where a ConfigMap hosts our readiness.sh script and it is accessible in the pod.
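
(For illustration only, not the PR's actual code: a minimal Go sketch of how such a mounted script could be wired into an exec readiness probe. The mount path /scripts/readiness.sh and the probe timings are assumptions.)

package example

import corev1 "k8s.io/api/core/v1"

// buildReadinessProbe is a hypothetical helper: it assumes readiness.sh is
// mounted at /scripts/readiness.sh from a ConfigMap and runs it as an exec
// readiness probe.
func buildReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"/bin/sh", "/scripts/readiness.sh"},
			},
		},
		InitialDelaySeconds: 5,
		PeriodSeconds:       10,
		FailureThreshold:    3,
	}
}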

Not done: Bumping charts and preparing for release of this repo. Maybe it's better if a core dev does that (?) 🤷

Related: dragonflydb/dragonfly#5921 and #397

Did not run the pre-commit hook for this, because I would have to re-clone the repo and use Git. I will do it when/if the PR gets approved.

@NegatioN
Author

@romange Tagging you here just in case this slips under the radar 😄 Not sure who else should be the contact point for this.
Was this something like what you had in mind?

@romange
Contributor

romange commented Oct 20, 2025

@NegatioN does this issue affect your deployment as well?

It is possible to inject the readiness script by mounting an external read-only volume that contains only that script. I think that extending the helm charts to mount that script, instead of changing the container itself, is a viable option.
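
(Illustrative sketch of that kind of mount; the volume name, ConfigMap name, and mount path below are assumptions, not the operator's or chart's actual values.)

package example

import corev1 "k8s.io/api/core/v1"

// readinessScriptVolume builds a read-only ConfigMap-backed volume and the
// matching mount that would expose a readiness script inside the container.
func readinessScriptVolume() (corev1.Volume, corev1.VolumeMount) {
	mode := int32(0o555) // world-readable and executable
	vol := corev1.Volume{
		Name: "readiness-script",
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: "dragonfly-readiness"},
				DefaultMode:          &mode,
			},
		},
	}
	mount := corev1.VolumeMount{
		Name:      "readiness-script",
		MountPath: "/scripts",
		ReadOnly:  true,
	}
	return vol, mount
}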

@NegatioN
Author

NegatioN commented Oct 20, 2025

@romange I'm not sure I understood everything you said, so feel free to point out any misunderstandings.

does this issue affect your deployment as well?

It doesn't affect the dragonfly-operator or any of our other deployments, but it does affect the pods/StatefulSet of Dragonfly managed by the operator: they report being healthy and ready for traffic while in reality they are still loading the snapshot, which leads the operator to route traffic to them before they're ready.

It is possible to inject the readiness script by mounting an external read-only volume that contains only that script. I think that extending the helm charts to mount that script, instead of changing the container itself, is a viable option.

I was under the impression that, to interact with the actual Dragonfly pods that the operator manages, I would have to modify the Go code here. Are you saying there's another YAML definition that would be better for me to modify?

@romange
Contributor

romange commented Oct 20, 2025

No, I am not saying that. Some folks use helm charts to deploy, and for them my suggestion could be more relevant, I guess.

I am not familiar with the inner workings of the DF operator; I will check with the team and reply.

@romange
Contributor

romange commented Oct 20, 2025

I am sorry, I was totally confused. I missed entirely that this PR is in the dragonfly-operator repo :)

Ignore my comment. I will ask someone to review.

@moredure
Contributor

moredure commented Oct 20, 2025

It's backward incompatible, I believe. Have you tried updating an existing datastore with a newer version of the operator? Will it result in an update of the existing StatefulSet?

@NegatioN
Author

NegatioN commented Oct 20, 2025

It's backward incompatible, I believe. Have you tried updating an existing datastore with a newer version of the operator? Will it result in an update of the existing StatefulSet?

I have not tried that. (And when you say "datastore" here, I'm unsure what you mean; this is the first time I've written anything for a Helm app in years and years.)

I'm pretty sure it's not backwards compatible, though. The pods spawned by this operator depend on a ConfigMap that did not exist before, and the StatefulSet would also need to be updated with the volume that points to this ConfigMap.

Edit: Also it seems like the tests are failing because the ConfigMap is not loaded into the test environment running with kind. Any hints on how to include it?

@NegatioN
Author

Decided on adding a ConfigMap just for the tests, in code. The kustomize part of the config seems to target the operator and its namespace very directly, while this needs to live in the same namespace as our pods, for example.
Figured the simplest thing would be to just keep the behavior as before in tests, until we decide there's a need to explicitly e2e test this code.

@NegatioN
Author

What would be the next steps on this @moredure @vyavdoshenko ?

@vyavdoshenko requested a review from moredure on October 21, 2025 at 14:07
@vyavdoshenko

What would be the next steps on this @moredure @vyavdoshenko ?

LGTM

@moredure
Contributor

moredure commented Oct 21, 2025

What would be the next steps on this @moredure @vyavdoshenko ?

Code looks fine, but I am concerned that updating the operator version over an existing CRD will result in issuing a StatefulSet update, which some users might not expect.

@Abhra303 did we have any conventions around backward compatibility in these terms?

Probably worth flagging this change as optional, or skipping it when comparing inside resourceSpecsEqual, so the new structure is issued only once.
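
(Illustrative only: one way the comparison could ignore the new volume, assuming a readiness-script volume name; the real resourceSpecsEqual may be structured very differently.)

package example

import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
)

// specsEqualIgnoringReadinessVolume drops the readiness-script volume from
// both specs before comparing, so adding it does not force a rollout of every
// existing StatefulSet. Names and structure here are hypothetical.
func specsEqualIgnoringReadinessVolume(a, b appsv1.StatefulSetSpec) bool {
	return reflect.DeepEqual(stripReadinessVolume(a), stripReadinessVolume(b))
}

func stripReadinessVolume(s appsv1.StatefulSetSpec) appsv1.StatefulSetSpec {
	out := *s.DeepCopy()
	vols := out.Template.Spec.Volumes[:0]
	for _, v := range out.Template.Spec.Volumes {
		if v.Name != "readiness-script" { // assumed volume name
			vols = append(vols, v)
		}
	}
	out.Template.Spec.Volumes = vols
	return out
}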

@Abhra303
Contributor

Abhra303 commented Oct 23, 2025

@NegatioN sorry for replying late; the script itself looks good, but it's not going to work in the DF operator. Your code creates a ConfigMap only when you're using the helm chart. This approach has some critical downsides:

  1. It doesn't work if installed directly (without the helm chart). The e2e test is failing because it doesn't have the ConfigMap we are trying to mount.
  2. It breaks compatibility. Existing pods won't be able to find the ConfigMap.

You can either create a ConfigMap when the operator creates the Dragonfly resources, or add a new optional field to the CRD, "customHealthCheckConfigMap" (or something similar), to let users configure the ConfigMap.

E.g. the steps to have a custom health check script:

  1. kubectl create configmap custom-script --from-file=your-script-file
  2. configure CRD to use this configmap.
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: dragonfly-operator
  name: dragonfly-sample
spec:
  customHealthCheckConfigMap:
     name: custom-script
     ns: prod-ns
...

If the field is nil, go with the default health script.
This seems to be the simplest approach, as it neither adds a ConfigMap dependency to the operator nor breaks compatibility.
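
(A minimal Go sketch of what such an optional field could look like, mirroring the YAML example above. Field and type names are assumptions; in the real code the field would be added to the existing DragonflySpec.)

package v1alpha1

// CustomHealthCheckConfigMap points at a user-managed ConfigMap that holds
// the custom readiness script. Purely illustrative.
type CustomHealthCheckConfigMap struct {
	// Name of the ConfigMap holding the custom readiness script.
	Name string `json:"name"`
	// Namespace the ConfigMap lives in ("ns" in the YAML example above).
	Namespace string `json:"ns,omitempty"`
}

// SpecFragment stands in for the existing Dragonfly spec: the new field is a
// pointer, so it is nil by default and existing resources are unaffected.
type SpecFragment struct {
	// +optional
	CustomHealthCheckConfigMap *CustomHealthCheckConfigMap `json:"customHealthCheckConfigMap,omitempty"`
}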

@Abhra303
Contributor

So the existing code looks good. You just need to add an optional struct field to the operator CRD (see other merged PRs for examples) which is nil by default.
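
(And the nil-is-default behavior, sketched with assumed script paths; the operator's real probe construction will differ.)

package example

// readinessCommand is a hypothetical helper: with no custom ConfigMap
// configured it falls back to the image's built-in healthcheck.sh, otherwise
// it runs the user-provided script at an assumed mount path.
func readinessCommand(customConfigMapName *string) []string {
	if customConfigMapName == nil {
		return []string{"/bin/sh", "/usr/local/bin/healthcheck.sh"} // assumed default path
	}
	return []string{"/bin/sh", "/scripts/readiness.sh"} // assumed mount path for the custom script
}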


It("Should create configmap successfully", func() {
err := k8sClient.Create(ctx, &corev1.ConfigMap{
TypeMeta: metav1.TypeMeta{

You don't need to set TypeMeta explicitly, as it is already a ConfigMap struct.
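
(For illustration, the quoted test could then read roughly like this; a sketch reusing the suite's k8sClient and ctx as in the snippet above, with a placeholder ConfigMap name, namespace, and script body.)

It("Should create configmap successfully", func() {
	err := k8sClient.Create(ctx, &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "dragonfly-readiness", // placeholder name
			Namespace: "default",             // placeholder namespace
		},
		Data: map[string]string{
			"readiness.sh": "#!/bin/sh\nexit 0\n", // placeholder script body
		},
	})
	Expect(err).To(BeNil())
})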

@NegatioN
Author

NegatioN commented Oct 24, 2025

Hi @Abhra303, thanks for the great feedback. I can definitely try to modify the PR to fit one of your suggestions.
However, while looking around, I was also wondering whether this sort of check of the snapshot-loading status could belong in this part of the code, which deals with determining whether a replica is stable or not (or some very similar stage, specifically for snapshotting).
Would this be a better or worse solution?

If that were desirable, we wouldn't need to manage any new resources, and there wouldn't be any breaking changes. We would, however, lose the configurability of the "provide your own readiness check" solution.

@NegatioN
Author

Actually, upon further investigation, the code path I alluded to earlier (which my coworker made me aware of) already does wait for snapshots:
_, err := redisClient.Ping(ctx).Result() will, while a snapshot is being loaded, return an explicit error with the message Dragonfly is loading the dataset in memory.
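
(A small sketch of that distinction, assuming the go-redis v9 client; the operator may pin a different client version, and the error-message match is based on the text quoted above.)

package example

import (
	"context"
	"strings"

	"github.com/redis/go-redis/v9"
)

// isLoadingSnapshot treats the "loading the dataset in memory" error as
// "up but not ready yet", and anything else as a real failure.
func isLoadingSnapshot(ctx context.Context, rdb *redis.Client) (bool, error) {
	_, err := rdb.Ping(ctx).Result()
	if err == nil {
		return false, nil // reachable and done loading
	}
	if strings.Contains(err.Error(), "loading the dataset in memory") {
		return true, nil // Dragonfly is up but still restoring the snapshot
	}
	return false, err // some other failure (network, auth, ...)
}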

I think this entire PR and the related GitHub issues are based on a misunderstanding that there is downtime, when there is not (or there might have been in older versions of the operator?).

As a consequence, I feel like closing this PR is the right move. If the end goal is to make a configurable readiness check, maybe that should be done in another PR. Possible use cases I see for that are saving large snapshots, which might make a node unresponsive, and redirecting traffic based on that 🤷

Sorry for wasting the time of people who reviewed this PR.

@NegatioN closed this on Oct 24, 2025