Add cluster-health-monitor Task and CronJob for dogfooding cluster by vdemeester · Pull Request #3118 · tektoncd/plumbing

vdemeester · 2026-02-13T15:11:46Z

Changes

Add a lightweight monitoring solution that checks the health of CronJobs,
Jobs, PipelineRuns, and TaskRuns in the dogfooding cluster and creates
GitHub issues when problems are detected.

Problem

We currently have no monitoring or alerting for the Tekton dogfooding
cluster. When things break (image pull failures, node replacements, stuck
jobs), nobody is notified until someone manually checks.

On 2026-02-11, all cluster nodes were replaced. The new nodes run CRI-O, which
enforces fully-qualified image names. All 28 cronjobs started failing with
ImagePullBackOff, and because they use concurrencyPolicy: Forbid, the stuck
jobs blocked all subsequent runs. This went unnoticed for 2+ days.
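
To make the short-name failure mode concrete, here is a minimal sketch of a check that tells a fully-qualified image reference from a short name. It is not taken from this PR. It assumes the usual image reference convention, where the first path component counts as a registry host only if it contains a dot or a colon, or is `localhost`; edge cases such as upper-case components are ignored. The function name and the example image names are hypothetical.

```sh
# Sketch only: returns 0 if the image reference names its registry explicitly.
# Assumption: the first path component is a registry host only when it
# contains "." or ":" or equals "localhost".
is_fully_qualified() {
  case "$1" in
    */*) ;;           # has a path separator, inspect the first component
    *) return 1 ;;    # bare name such as "alpine:3.19" is a short name
  esac
  first="${1%%/*}"    # everything before the first "/"
  case "$first" in
    localhost|*.*|*:*) return 0 ;;
    *) return 1 ;;    # e.g. "library/alpine" resolves via a default registry
  esac
}
```

For example, `is_fully_qualified ghcr.io/example/kubectl:v1` succeeds, while `is_fully_qualified alpine:3.19` fails.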

Solution

A cluster-health-monitor Task + CronJob that runs daily and checks:

- CronJob health: suspended cronjobs, stuck active jobs blocking
  concurrencyPolicy=Forbid, missed schedules
- Job health: ImagePullBackOff pods, failed jobs
- PipelineRun/TaskRun health: recent failures across all monitored
  namespaces (default, tekton-ci, tekton-nightly, bastion-p, bastion-z)
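
The CronJob part of these checks can be sketched as a small decision function. This is an illustration, not the code in this PR: the function name, the output labels, and the one-hour default threshold are all assumptions. The inputs correspond to fields available from `kubectl get cronjobs -o json` (`.spec.concurrencyPolicy`, `.spec.suspend`) plus the age of the oldest job listed under `.status.active`, which the caller is assumed to have computed already.

```sh
# Sketch only. Usage:
#   classify_cronjob POLICY SUSPENDED OLDEST_ACTIVE_AGE_SECONDS [THRESHOLD_SECONDS]
# An age of 0 means no job is currently active.
# The subshell body ( ... ) keeps the variables local to the function.
classify_cronjob() (
  policy="$1" suspended="$2" age="${3:-0}" threshold="${4:-3600}"
  if [ "$suspended" = "true" ]; then
    echo "suspended"
  elif [ "$policy" = "Forbid" ] && [ "$age" -gt "$threshold" ]; then
    # With Forbid, a job that never finishes blocks every later schedule.
    echo "stuck-blocking"
  else
    echo "ok"
  fi
)
```

With these assumptions, `classify_cronjob Forbid false 7200` prints `stuck-blocking`, whereas the same age under `Replace` prints `ok` because the next schedule would replace the running job.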

When issues are detected, it creates a GitHub issue in tektoncd/plumbing
with structured labels and a detailed report (with dedup to avoid spam).
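
The dedup behaviour can be read as a four-way decision, following the issue management described in the commit message of this PR (one issue per incident, comments on later runs, auto-close on recovery). The sketch below only models that decision. The function name and the `yes`/`no` inputs are hypothetical, and how an existing open issue is located (for example by label) is left out because the PR text does not specify it.

```sh
# Sketch only. Usage: decide_action PROBLEMS_FOUND OPEN_ISSUE_EXISTS
# Both arguments are "yes" or "no"; prints the action the reporter would take.
decide_action() {
  case "$1:$2" in
    yes:no)  echo "create"  ;;   # new incident: open a single issue
    yes:yes) echo "comment" ;;   # ongoing incident: add an update, no new issue
    no:yes)  echo "close"   ;;   # cluster recovered: close with a resolved comment
    *)       echo "none"    ;;   # healthy and nothing open: do nothing
  esac
}
```

For instance, `decide_action yes yes` prints `comment`, which is what keeps a multi-day incident from producing one issue per daily run.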

Key design decisions

- CronJob → TaskRun directly via kubectl (no EventListener), independent
  of the trigger infrastructure it monitors
- ghcr.io/ images only (kubectl, tkn) to avoid the CRI-O short name issues
  that currently break docker.io/ workloads
- concurrencyPolicy: Replace + activeDeadlineSeconds: 600, so the
  monitor itself can never get stuck
- Shared workspace (emptyDir) passes the report between the 3 steps
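
As a rough illustration of how the CronJob-related decisions fit together, the sketch below prints a CronJob manifest from a shell function. It is not the cronjob.yaml from this PR. The schedule, image, ServiceAccount name, and mounted file path are placeholders, and placing activeDeadlineSeconds on the Job spec is an assumption.

```sh
# Sketch only: prints an illustrative CronJob manifest to stdout.
render_monitor_cronjob() {
  cat <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-health-monitor
spec:
  schedule: "0 6 * * *"            # placeholder: daily, the hour is arbitrary
  concurrencyPolicy: Replace       # a new run replaces a hung previous run
  jobTemplate:
    spec:
      activeDeadlineSeconds: 600   # assumption: set on the Job spec
      template:
        spec:
          serviceAccountName: cluster-health-monitor   # placeholder name
          restartPolicy: Never
          containers:
            - name: create-taskrun
              image: ghcr.io/example/kubectl:latest    # placeholder image
              # placeholder path: assumes the TaskRun manifest is mounted here
              command: ["kubectl", "create", "-f", "/config/taskrun.yaml"]
EOF
}
```

The output can be validated without touching the cluster using `render_monitor_cronjob | kubectl apply --dry-run=client -f -`. The container calls `kubectl create` rather than going through an EventListener, which is what keeps the monitor independent of the trigger infrastructure.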

Files

tekton/cronjobs/dogfooding/cluster-health-monitor/
├── README.md           # Architecture docs
├── rbac.yaml           # ServiceAccount + ClusterRole/Binding
├── task.yaml           # 3-step Task (check-cronjobs, check-pr-tr, report)
├── cronjob.yaml        # Daily CronJob creating a TaskRun
└── kustomization.yaml  # Kustomize bundle

/kind feature

Submitter Checklist

These are the criteria that every PR should meet. Please check them off as you
review them:

Includes docs (if user facing)
Commit messages follow commit message best practices

See the contribution guide
for more details.

tekton-robot · 2026-02-13T15:11:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from vdemeester after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Add a lightweight monitoring solution for the Tekton dogfooding cluster
that checks CronJob, Job, and PipelineRun health daily and creates GitHub
issues when problems are detected.

The Task clones plumbing and runs standalone scripts from scripts/, making
it easy for maintainers to run checks locally with just kubectl access to
the cluster.

CronJob/Job checks (check-cronjobs.sh):
- Stuck active jobs blocking concurrencyPolicy=Forbid
- CronJobs that haven't succeeded in a configurable threshold
- Pods with ImagePullBackOff errors
- Failed jobs

PipelineRun checks (check-runs.sh):
- Infrastructure failures always flagged (timeouts, image pull, etc)
- Consistently failing pipelines (all last N runs failed)
- Regressions detected (was succeeding, now all-failing)
- Flaky pipelines with mixed results are skipped (not actionable)

Issue management (report.sh):
- Creates a single issue per incident
- Adds comment updates on subsequent runs
- Auto-closes with resolved comment when cluster recovers

Key design decisions:
- CronJob creates TaskRun directly via kubectl (no EventListener),
  independent of the trigger infrastructure it monitors
- concurrencyPolicy: Replace + activeDeadlineSeconds: 600 so the monitor
  itself can never get stuck
- ghcr.io/ images only to avoid CRI-O short name issues

Relates to tektoncd#3119
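
The PipelineRun rules in this commit message (consistently failing, regression, flaky) can be sketched as a classifier over a pipeline's recent run history. This is one possible reading, not the logic in check-runs.sh: the function name, the output labels, and the `insufficient-data` case are assumptions, and the infrastructure-failure rule is not modelled. Statuses are passed newest first; anything other than `Failed` among the N most recent runs is treated as not failed.

```sh
# Sketch only. Usage: classify_runs N STATUS...   (statuses newest first)
# The subshell body ( ... ) keeps the variables local to the function.
classify_runs() (
  n="$1"; shift
  total="$#"
  seen=0 failed=0 older_success=0
  for status in "$@"; do
    seen=$((seen + 1))
    if [ "$seen" -le "$n" ]; then
      # Count failures among the N most recent runs.
      if [ "$status" = "Failed" ]; then failed=$((failed + 1)); fi
    elif [ "$status" = "Succeeded" ]; then
      # An older success means the pipeline used to work.
      older_success=1
    fi
  done
  if [ "$total" -lt "$n" ]; then
    echo "insufficient-data"       # assumption: too few runs to judge
  elif [ "$failed" -eq 0 ]; then
    echo "healthy"
  elif [ "$failed" -lt "$n" ]; then
    echo "flaky"                   # mixed results: skipped, not actionable
  elif [ "$older_success" -eq 1 ]; then
    echo "regression"              # was succeeding, now all-failing
  else
    echo "consistently-failing"    # all last N runs failed, no earlier success
  fi
)
```

Under this reading, `classify_runs 3 Failed Failed Failed Succeeded` prints `regression`, while `classify_runs 3 Failed Succeeded Failed` prints `flaky` and would not trigger an issue.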

tekton-robot added the kind/feature (categorizes issue or PR as related to a new feature) and do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) labels Feb 13, 2026

tekton-robot requested review from AlanGreene and wlynch February 13, 2026 15:11

tekton-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Feb 13, 2026

vdemeester mentioned this pull request Feb 13, 2026

Improve monitoring and alerting for CronJobs, PipelineRuns, and TaskRuns in the dogfooding cluster #3119

Open

tekton-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Feb 24, 2026

vdemeester force-pushed the feat/cluster-health-monitor branch from 26522bc to 9499e30 February 24, 2026 10:42

vdemeester force-pushed the feat/cluster-health-monitor branch from 9499e30 to 445f269 March 11, 2026 10:21

vdemeester wants to merge 1 commit into tektoncd:main from vdemeester:feat/cluster-health-monitor
