Add cluster-health-monitor Task and CronJob for dogfooding cluster #3118
Draft
vdemeester wants to merge 1 commit into tektoncd:main
[APPROVALNOTIFIER] This PR is NOT APPROVED
Force-pushed from 26522bc to 9499e30.
Add a lightweight monitoring solution for the Tekton dogfooding cluster that checks CronJob, Job, and PipelineRun health daily and creates GitHub issues when problems are detected.

The Task clones plumbing and runs standalone scripts from scripts/, making it easy for maintainers to run checks locally with just kubectl access to the cluster.

CronJob/Job checks (check-cronjobs.sh):
- Stuck active jobs blocking concurrencyPolicy=Forbid
- CronJobs that haven't succeeded in a configurable threshold
- Pods with ImagePullBackOff errors
- Failed jobs

PipelineRun checks (check-runs.sh):
- Infrastructure failures always flagged (timeouts, image pull, etc)
- Consistently failing pipelines (all last N runs failed)
- Regressions detected (was succeeding, now all-failing)
- Flaky pipelines with mixed results are skipped (not actionable)

Issue management (report.sh):
- Creates a single issue per incident
- Adds comment updates on subsequent runs
- Auto-closes with resolved comment when cluster recovers

Key design decisions:
- CronJob creates TaskRun directly via kubectl (no EventListener), independent of the trigger infrastructure it monitors
- concurrencyPolicy: Replace + activeDeadlineSeconds: 600 so the monitor itself can never get stuck
- ghcr.io/ images only to avoid CRI-O short name issues

Relates to tektoncd#3119
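To make the PipelineRun rules in the commit message concrete, here is a minimal sketch of how a run history might be classified. It is not the code from check-runs.sh: the function name `classify_runs`, its argument order, and the output labels are all hypothetical, and the reading of "regression" (a success somewhere before the failing window) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not taken from check-runs.sh.
# Usage: classify_runs WINDOW STATUS...
#   WINDOW  number of most recent runs to inspect (the "N" in "last N runs")
#   STATUS  run results, newest first, each "Succeeded" or "Failed"
# The body runs in a subshell ( ... ) so its variables stay local.
classify_runs() (
  window=$1
  shift
  total=$#
  i=0
  failed_recent=0
  succeeded_before=0

  for status in "$@"; do
    if [ "$i" -lt "$window" ]; then
      # Inside the recent window: count failures.
      if [ "$status" = "Failed" ]; then
        failed_recent=$((failed_recent + 1))
      fi
    elif [ "$status" = "Succeeded" ]; then
      # Older than the window: remember that it used to pass.
      succeeded_before=1
    fi
    i=$((i + 1))
  done

  if [ "$total" -lt "$window" ]; then
    echo "insufficient-data"
  elif [ "$failed_recent" -eq 0 ]; then
    echo "healthy"
  elif [ "$failed_recent" -lt "$window" ]; then
    echo "flaky"                  # mixed results: skipped, not actionable
  elif [ "$succeeded_before" -eq 1 ]; then
    echo "regression"             # was succeeding, now all-failing
  else
    echo "consistently-failing"   # all last N runs failed
  fi
)
```

For example, `classify_runs 3 Failed Failed Failed Succeeded` prints `regression`, while `classify_runs 3 Failed Succeeded Failed` prints `flaky`. Infrastructure failures are not modelled here, since the commit message says they are always flagged regardless of history.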
Force-pushed from 9499e30 to 445f269.
Changes
Add a lightweight monitoring solution that checks the health of CronJobs,
Jobs, PipelineRuns, and TaskRuns in the dogfooding cluster and creates
GitHub issues when problems are detected.
Problem
We currently have no monitoring or alerting for the Tekton dogfooding
cluster. When things break (image pull failures, node replacements, stuck
jobs), nobody is notified until someone manually checks.
On 2026-02-11, all cluster nodes were replaced. The new nodes run CRI-O, which
enforces fully-qualified image names. All 28 cronjobs started failing with
`ImagePullBackOff` and, because they use `concurrencyPolicy: Forbid`, stuck jobs
blocked all subsequent runs. This went unnoticed for 2+ days.
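A staleness check of roughly the following shape would flag a CronJob that has not succeeded within a threshold, which is the condition that went unnoticed here. This is a sketch only: `cronjob_is_stale` and its arguments are hypothetical, and it assumes the caller has already converted the CronJob's last successful run time (for instance `.status.lastSuccessfulTime`) to epoch seconds.

```shell
#!/bin/sh
# Hypothetical helper, not taken from check-cronjobs.sh.
# Usage: cronjob_is_stale LAST_SUCCESS_EPOCH NOW_EPOCH THRESHOLD_HOURS
# Exits 0 (stale) when the last success is older than the threshold,
# or when LAST_SUCCESS_EPOCH is empty (the CronJob never succeeded).
# The body runs in a subshell ( ... ) so its variables stay local.
cronjob_is_stale() (
  last_success=$1
  now=$2
  threshold_hours=$3
  [ -z "$last_success" ] ||
    [ $(( $now - $last_success )) -gt $(( $threshold_hours * 3600 )) ]
)

# Example: a job whose last success was 49 hours ago, with a 48 hour threshold.
if cronjob_is_stale 23600 200000 48; then
  echo "stale: no successful run within 48h"
fi
```

Passing timestamps in as arguments keeps the comparison independent of `kubectl` and of the local `date` implementation, so it can be exercised without cluster access.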
Solution
A `cluster-health-monitor` Task + CronJob that runs daily and checks:
- Stuck jobs blocking `concurrencyPolicy=Forbid`, missed schedules
- `ImagePullBackOff` pods, failed jobs
- namespaces (`default`, `tekton-ci`, `tekton-nightly`, `bastion-p`, `bastion-z`)

When issues are detected, it creates a GitHub issue in `tektoncd/plumbing`
with structured labels and a detailed report (with dedup to avoid spam).
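The dedup behaviour can be pictured as a small decision table over two inputs: whether this run found problems, and whether an incident issue is already open. The sketch below is an assumption about the shape of that logic, not the contents of report.sh; `issue_action` and its action names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not taken from report.sh.
# Usage: issue_action PROBLEMS_FOUND OPEN_ISSUE_NUMBER
#   PROBLEMS_FOUND     "true" if this run detected problems, anything else otherwise
#   OPEN_ISSUE_NUMBER  number of the already-open incident issue, or empty
issue_action() {
  if [ "$1" = "true" ] && [ -z "$2" ]; then
    echo "create"    # new incident: open a single issue
  elif [ "$1" = "true" ]; then
    echo "comment"   # ongoing incident: update the existing issue
  elif [ -n "$2" ]; then
    echo "close"     # cluster recovered: close with a resolved comment
  else
    echo "none"      # healthy and nothing open: do nothing
  fi
}
```

For example, `issue_action true ""` prints `create` and `issue_action false 42` prints `close`. How the open issue is looked up (for example by label) is left out, since the description does not show it.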
Key design decisions
- CronJob creates the TaskRun directly via `kubectl` (no EventListener), independent
  of the trigger infrastructure it monitors
- `ghcr.io/` images only (`kubectl`, `tkn`), to avoid CRI-O short name issues
  that currently break `docker.io/` workloads
- `concurrencyPolicy: Replace` + `activeDeadlineSeconds: 600`, so the
  monitor itself can never get stuck
- An `emptyDir` volume passes the report between the 3 steps

Files
/kind feature
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide for more details.