Skip to content

WIP [OCPBUGS-82081] Add preStop lifecycle hook to DNS pods to reduce upgrade disruption#475

Draft
davidesalerno wants to merge 1 commit into
openshift:masterfrom
davidesalerno:ocpbugs82081master
Draft

WIP [OCPBUGS-82081] Add preStop lifecycle hook to DNS pods to reduce upgrade disruption#475
davidesalerno wants to merge 1 commit into
openshift:masterfrom
davidesalerno:ocpbugs82081master

Conversation

@davidesalerno
Copy link
Copy Markdown
Contributor

@davidesalerno davidesalerno commented May 12, 2026

During cluster upgrades, DNS pods are terminated as nodes reboot. There is a race between the kubelet sending SIGTERM to CoreDNS and kube-proxy/OVN removing the pod from the DNS service endpoints. If SIGTERM arrives first, CoreDNS begins its lameduck shutdown while clients are still being routed to the pod, resulting in DNS lookup timeouts.

Add a preStop lifecycle hook with a 5-second sleep to the dns container. This gives the endpoints controller and kube-proxy/OVN time to remove the pod from the ClusterIP service before CoreDNS receives SIGTERM and begins its 20-second lameduck period.

Set terminationGracePeriodSeconds to 40 to accommodate the full shutdown sequence: 5s preStop + 20s CoreDNS lameduck + 15s buffer.

Add Lifecycle to the daemonset change detection so the hook is applied on upgrade without requiring other container field changes.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved DNS container shutdown behavior with graceful termination handling, including a 5-second pre-shutdown grace period and 40-second pod termination window.
  • Improvements

    • Enhanced configuration change detection for DNS DaemonSet lifecycle settings, ensuring updates are properly applied and synchronized.

During cluster upgrades, DNS pods are terminated as nodes reboot. There
is a race between the kubelet sending SIGTERM to CoreDNS and
kube-proxy/OVN removing the pod from the DNS service endpoints. If
SIGTERM arrives first, CoreDNS begins its lameduck shutdown while
clients are still being routed to the pod, resulting in DNS lookup
timeouts.

Add a preStop lifecycle hook with a 5-second sleep to the dns container.
This gives the endpoints controller and kube-proxy/OVN time to remove
the pod from the ClusterIP service before CoreDNS receives SIGTERM and
begins its 20-second lameduck period.

Set terminationGracePeriodSeconds to 40 to accommodate the full
shutdown sequence: 5s preStop + 20s CoreDNS lameduck + 15s buffer.

Add Lifecycle to the daemonset change detection so the hook is applied
on upgrade without requiring other container field changes.

Signed-off-by: Davide Salerno <dsalerno@redhat.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 12, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign davidesalerno for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Walkthrough

The PR adds graceful shutdown handling to the DNS DaemonSet by introducing a preStop lifecycle hook that sleeps 5 seconds before termination and setting a 40-second termination grace period. The controller's change detection logic is updated to recognize container Lifecycle differences.

Changes

DNS Graceful Shutdown Configuration

Layer / File(s) Summary
DNS DaemonSet manifest with graceful shutdown
pkg/manifests/assets/dns/daemonset.yaml
The dns container adds a preStop lifecycle hook executing sleep 5 to allow graceful connection draining, and the Pod spec sets terminationGracePeriodSeconds: 40 to provide sufficient time before forced termination.
Controller change detection for lifecycle updates
pkg/operator/controller/controller_dns_daemonset.go
daemonsetConfigChanged now detects when a container's Lifecycle field differs between current and expected DNS daemonset specs, triggering configuration reconciliation when lifecycle hooks are added or modified.
Tests for graceful shutdown configuration
pkg/operator/controller/controller_dns_daemonset_test.go
TestDesiredDNSDaemonset verifies the DNS container includes the preStop hook and the pod spec declares the 40-second termination grace period; TestDaemonsetConfigChanged adds a test case confirming that lifecycle changes are detected as configuration drift.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: adding a preStop lifecycle hook to DNS pods to reduce upgrade disruption, which aligns with the primary objective of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/manifests/assets/dns/daemonset.yaml`:
- Line 62: The preStop lifecycle hook currently uses command ["sleep","5"]
(preStop.exec) which is not present in the minimal CoreDNS image and will always
fail; remove or disable the preStop.exec block that calls sleep in the DaemonSet
and instead implement graceful shutdown via the CoreDNS health plugin/operator
configuration (i.e., update the Corefile/operator settings rather than relying
on a shell utility in the container).

In `@pkg/operator/controller/controller_dns_daemonset_test.go`:
- Around line 81-85: The test currently reads c.Lifecycle.PreStop.Exec.Command
without guarding Exec and can panic; update the assertions to explicitly check
that c.Lifecycle and c.Lifecycle.PreStop are non-nil and then that
c.Lifecycle.PreStop.Exec is non-nil before accessing Command (e.g., replace the
else-if using reflect.DeepEqual with a nested check that calls t.Error/t.Errorf
if Exec is nil, and only then compares Exec.Command to []string{"sleep","5"}),
referencing the test's container variable c and its
Lifecycle.PreStop.Exec.Command for locating the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 42cd4913-477e-4e19-94e4-d606170fe168

📥 Commits

Reviewing files that changed from the base of the PR and between 3d21411 and 29d48ea.

📒 Files selected for processing (3)
  • pkg/manifests/assets/dns/daemonset.yaml
  • pkg/operator/controller/controller_dns_daemonset.go
  • pkg/operator/controller/controller_dns_daemonset_test.go

lifecycle:
preStop:
exec:
command: ["sleep", "5"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does the OpenShift CoreDNS image used by cluster-dns-operator include a sleep executable in PATH for Kubernetes preStop exec hooks?

💡 Result:

No—the OpenShift CoreDNS image used by cluster-dns-operator does not include a general-purpose sleep executable in PATH for Kubernetes preStop exec hooks. Evidence: 1) The OpenShift-specific CoreDNS image Dockerfile only copies the CoreDNS binary into /usr/bin/ and does not add sleep (or any other shell tooling) into the image, so there is no basis to assume /bin/sleep or /usr/bin/sleep exists and is on PATH. The Dockerfile shows only a COPY of coredns to /usr/bin/ and an ENTRYPOINT of /usr/bin/coredns. [1] 2) The OpenShift DNS operator’s “graceful shutdown” behavior is implemented via the CoreDNS health plugin being configured to sleep during shutdown, not via a Pod lifecycle preStop exec that would require an in-image sleep binary. The operator commit explicitly says it configures “CoreDNS’s health plugin to sleep … when CoreDNS is shut down.” [2] Practical implication: - If you try to configure a Kubernetes preStop: exec: ['sleep', 'N'] for the OpenShift CoreDNS container, it is likely to fail because sleep is not present in the image by design. (This aligns with the general CoreDNS ecosystem issue where preStop can fail when the image lacks sleep.) [3] If you want, tell me your OpenShift version (e.g., 4.14/4.15/4.16/4.17) and I can point you to the exact image tag/digest that your cluster-dns-operator uses and whether it matches the Dockerfile behavior above.

Citations:


The sleep command is not available in the OpenShift CoreDNS image and will cause the preStop hook to fail.

The OpenShift CoreDNS container image is a minimal image containing only the CoreDNS binary, with no shell utilities like sleep. Using command: ["sleep", "5"] in the preStop.exec hook will fail at runtime, and the intended drain delay will not occur. If graceful shutdown is required, use the CoreDNS health plugin configuration (which is the operator's standard approach) rather than a Pod lifecycle hook that depends on missing binaries.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/manifests/assets/dns/daemonset.yaml` at line 62, The preStop lifecycle
hook currently uses command ["sleep","5"] (preStop.exec) which is not present in
the minimal CoreDNS image and will always fail; remove or disable the
preStop.exec block that calls sleep in the DaemonSet and instead implement
graceful shutdown via the CoreDNS health plugin/operator configuration (i.e.,
update the Corefile/operator settings rather than relying on a shell utility in
the container).

Comment on lines +81 to +85
if c.Lifecycle == nil || c.Lifecycle.PreStop == nil {
t.Error("expected dns container to have a preStop lifecycle hook")
} else if !reflect.DeepEqual(c.Lifecycle.PreStop.Exec.Command, []string{"sleep", "5"}) {
t.Errorf("unexpected preStop command: %v", c.Lifecycle.PreStop.Exec.Command)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard against nil PreStop.Exec before reading Command.

Current assertion can panic if PreStop exists but Exec is nil; make it a clean test failure instead.

Suggested fix
-				if c.Lifecycle == nil || c.Lifecycle.PreStop == nil {
+				if c.Lifecycle == nil || c.Lifecycle.PreStop == nil || c.Lifecycle.PreStop.Exec == nil {
 					t.Error("expected dns container to have a preStop lifecycle hook")
 				} else if !reflect.DeepEqual(c.Lifecycle.PreStop.Exec.Command, []string{"sleep", "5"}) {
 					t.Errorf("unexpected preStop command: %v", c.Lifecycle.PreStop.Exec.Command)
 				}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if c.Lifecycle == nil || c.Lifecycle.PreStop == nil {
t.Error("expected dns container to have a preStop lifecycle hook")
} else if !reflect.DeepEqual(c.Lifecycle.PreStop.Exec.Command, []string{"sleep", "5"}) {
t.Errorf("unexpected preStop command: %v", c.Lifecycle.PreStop.Exec.Command)
}
if c.Lifecycle == nil || c.Lifecycle.PreStop == nil || c.Lifecycle.PreStop.Exec == nil {
t.Error("expected dns container to have a preStop lifecycle hook")
} else if !reflect.DeepEqual(c.Lifecycle.PreStop.Exec.Command, []string{"sleep", "5"}) {
t.Errorf("unexpected preStop command: %v", c.Lifecycle.PreStop.Exec.Command)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/operator/controller/controller_dns_daemonset_test.go` around lines 81 -
85, The test currently reads c.Lifecycle.PreStop.Exec.Command without guarding
Exec and can panic; update the assertions to explicitly check that c.Lifecycle
and c.Lifecycle.PreStop are non-nil and then that c.Lifecycle.PreStop.Exec is
non-nil before accessing Command (e.g., replace the else-if using
reflect.DeepEqual with a nested check that calls t.Error/t.Errorf if Exec is
nil, and only then compares Exec.Command to []string{"sleep","5"}), referencing
the test's container variable c and its Lifecycle.PreStop.Exec.Command for
locating the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant