Add cluster-health-monitor Task and CronJob for dogfooding cluster #3118

Draft
vdemeester wants to merge 1 commit into tektoncd:main from vdemeester:feat/cluster-health-monitor
@vdemeester
Member

Changes

Add a lightweight monitoring solution that checks the health of CronJobs,
Jobs, PipelineRuns, and TaskRuns in the dogfooding cluster and creates
GitHub issues when problems are detected.

Problem

We currently have no monitoring or alerting for the Tekton dogfooding
cluster. When things break (image pull failures, node replacements, stuck
jobs), nobody is notified until someone manually checks.

On 2026-02-11, all cluster nodes were replaced. The new nodes run CRI-O, which
enforces fully qualified image names. All 28 CronJobs started failing with
ImagePullBackOff, and because they use concurrencyPolicy: Forbid, the stuck
jobs blocked all subsequent runs. This went unnoticed for 2+ days.

Solution

A cluster-health-monitor Task + CronJob that runs daily and checks:

  1. CronJob health: suspended cronjobs, stuck active jobs blocking
    concurrencyPolicy=Forbid, missed schedules
  2. Job health: ImagePullBackOff pods, failed jobs
  3. PipelineRun/TaskRun health: recent failures across all monitored
    namespaces (default, tekton-ci, tekton-nightly, bastion-p, bastion-z)
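
As a rough illustration of the CronJob check, the sketch below filters a tab-separated listing for suspended CronJobs and for Forbid CronJobs that still have an active job. The function name, the column layout, and the jsonpath query shown in the comment are assumptions made for this example, not the contents of the PR's task.yaml. An active job is only a candidate here; deciding that it is actually stuck would also need an age threshold.

```shell
# Hypothetical helper: reads rows of
#   namespace <TAB> name <TAB> concurrencyPolicy <TAB> suspend <TAB> active job names
# on stdin and prints one finding per line.
flag_cronjobs() {
  awk -F'\t' '
    # Suspended CronJobs never schedule new jobs.
    $4 == "true" { printf "SUSPENDED %s/%s\n", $1, $2 }
    # With Forbid, any still-active job blocks the next scheduled run.
    $3 == "Forbid" && $5 != "" { printf "BLOCKED %s/%s active=%s\n", $1, $2, $5 }
  '
}

# One way the rows could be produced (assumed query, requires cluster access):
#   kubectl get cronjobs -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.concurrencyPolicy}{"\t"}{.spec.suspend}{"\t"}{.status.active[*].name}{"\n"}{end}' | flag_cronjobs
```

Keeping the filter separate from the kubectl call means it can be exercised locally with canned input.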

When issues are detected, it creates a GitHub issue in tektoncd/plumbing
with structured labels and a detailed report (with dedup to avoid spam).

Key design decisions

  • CronJob → TaskRun directly via kubectl (no EventListener) — independent
    of the trigger infrastructure it monitors
  • ghcr.io/ images only (kubectl, tkn) — avoids CRI-O short name issues
    that currently break docker.io/ workloads
  • concurrencyPolicy: Replace + activeDeadlineSeconds: 600 — the
    monitor itself can never get stuck
  • Shared workspace (emptyDir) passes the report between the 3 steps
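
To make the first and last decisions concrete, here is a minimal sketch of how the CronJob's container might create the TaskRun without going through an EventListener. The function name, the ServiceAccount name, the workspace name `report`, and the use of the `tekton.dev/v1` API are all assumptions for illustration; the actual manifest lives in cronjob.yaml and may differ.

```shell
# Hypothetical helper: prints a TaskRun manifest for the given Task and
# ServiceAccount. All names passed in are placeholders chosen by the caller.
render_taskrun() {
  task_name=$1
  service_account=$2
  cat <<EOF
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  generateName: ${task_name}-
spec:
  serviceAccountName: ${service_account}
  taskRef:
    name: ${task_name}
  workspaces:
    - name: report
      emptyDir: {}
EOF
}

# Inside the CronJob's container the manifest would be piped to kubectl.
# `create` rather than `apply` is used because the manifest relies on
# generateName:
#   render_taskrun cluster-health-monitor monitor-sa | kubectl create -f -
```

Because the only dependency is kubectl and the API server, this path keeps working even when the trigger infrastructure is down.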

Files

tekton/cronjobs/dogfooding/cluster-health-monitor/
├── README.md           # Architecture docs
├── rbac.yaml           # ServiceAccount + ClusterRole/Binding
├── task.yaml           # 3-step Task (check-cronjobs, check-pr-tr, report)
├── cronjob.yaml        # Daily CronJob creating a TaskRun
└── kustomization.yaml  # Kustomize bundle

/kind feature

Submitter Checklist

These are the criteria that every PR should meet. Please check them off as you
review them:

See the contribution guide for more details.

@tekton-robot added the kind/feature (a PR related to a new feature) and do-not-merge/work-in-progress (a PR that should not merge because it is a work in progress) labels Feb 13, 2026
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from vdemeester after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the size/L label (a PR that changes 100-499 lines, ignoring generated files) Feb 13, 2026
@tekton-robot added the size/XL label (a PR that changes 500-999 lines, ignoring generated files) and removed the size/L label Feb 24, 2026
@vdemeester force-pushed the feat/cluster-health-monitor branch from 26522bc to 9499e30 February 24, 2026 10:42
Add a lightweight monitoring solution for the Tekton dogfooding cluster
that checks CronJob, Job, and PipelineRun health daily and creates
GitHub issues when problems are detected.

The Task clones plumbing and runs standalone scripts from scripts/,
making it easy for maintainers to run checks locally with just
kubectl access to the cluster.

CronJob/Job checks (check-cronjobs.sh):
- Stuck active jobs blocking concurrencyPolicy=Forbid
- CronJobs that haven't succeeded in a configurable threshold
- Pods with ImagePullBackOff errors
- Failed jobs
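
The image pull check could look roughly like the sketch below, which scans a pod listing for containers waiting on an image pull. The function name, the input layout, and the jsonpath query in the comment are assumptions for illustration and are not taken from check-cronjobs.sh.

```shell
# Hypothetical helper: reads rows of
#   namespace <TAB> pod <TAB> space-separated waiting reasons
# on stdin and prints namespace/pod for pods stuck pulling an image.
flag_image_pull_failures() {
  awk -F'\t' '
    $3 ~ /(^| )(ImagePullBackOff|ErrImagePull)( |$)/ { printf "%s/%s\n", $1, $2 }
  '
}

# One way the rows could be produced (assumed query, requires cluster access):
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' | flag_image_pull_failures
```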

PipelineRun checks (check-runs.sh):
- Infrastructure failures always flagged (timeouts, image pull, etc)
- Consistently failing pipelines (all last N runs failed)
- Regressions detected (was succeeding, now all-failing)
- Flaky pipelines with mixed results are skipped (not actionable)
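
The history-based rules above can be sketched as a small classifier. This is one possible reading, not the logic in check-runs.sh: it assumes results are passed newest first as `S` (succeeded) or `F` (failed), that a regression means the last N runs all failed while some older run succeeded, and that fewer than N runs is treated as insufficient data. Infrastructure failures, which are always flagged, are outside this sketch.

```shell
# Hypothetical helper: classify_runs WINDOW RESULT...
# RESULT values are "S" or "F", newest first.
# Note: plain POSIX sh has no local variables, so these names are global.
classify_runs() {
  window=$1
  shift
  seen=0
  failed=0
  older_success=0
  for result in "$@"; do
    if [ "$seen" -lt "$window" ]; then
      # Still inside the most recent WINDOW runs.
      if [ "$result" = "F" ]; then
        failed=$((failed + 1))
      fi
      seen=$((seen + 1))
    elif [ "$result" = "S" ]; then
      # A success older than the window: the pipeline used to work.
      older_success=1
    fi
  done

  if [ "$seen" -lt "$window" ]; then
    echo "insufficient-data"
  elif [ "$failed" -eq 0 ]; then
    echo "healthy"
  elif [ "$failed" -lt "$window" ]; then
    echo "flaky"                 # mixed results: skipped, not actionable
  elif [ "$older_success" -eq 1 ]; then
    echo "regression"            # was succeeding, now all-failing
  else
    echo "consistently-failing"  # all last N runs failed
  fi
}
```

For example, `classify_runs 3 F F F S S` reports a regression, while `classify_runs 3 F S F` is treated as flaky and would be skipped.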

Issue management (report.sh):
- Creates a single issue per incident
- Adds comment updates on subsequent runs
- Auto-closes with resolved comment when cluster recovers
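
The issue lifecycle described above reduces to a small decision table, sketched here. The function name and its arguments are hypothetical, and the `gh` commands in the comments are only one way to carry out each action; the source does not say which client report.sh uses.

```shell
# Hypothetical helper: decide_issue_action PROBLEMS_FOUND OPEN_ISSUE
#   PROBLEMS_FOUND  "yes" or "no"
#   OPEN_ISSUE      number of the existing open incident issue, or empty
decide_issue_action() {
  problems_found=$1
  open_issue=$2
  if [ "$problems_found" = "yes" ]; then
    if [ -z "$open_issue" ]; then
      echo "create"    # e.g. gh issue create --title ... --body-file report.md
    else
      echo "comment"   # e.g. gh issue comment "$open_issue" --body-file report.md
    fi
  elif [ -n "$open_issue" ]; then
    echo "close"       # e.g. gh issue close "$open_issue" --comment "Resolved"
  else
    echo "none"        # healthy cluster, no open incident
  fi
}
```

Keying every run off a single open issue is what keeps repeated failures from producing one issue per day.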

Key design decisions:
- CronJob creates TaskRun directly via kubectl (no EventListener),
  independent of the trigger infrastructure it monitors
- concurrencyPolicy: Replace + activeDeadlineSeconds: 600 so the
  monitor itself can never get stuck
- ghcr.io/ images only to avoid CRI-O short name issues

Relates to tektoncd#3119
@vdemeester force-pushed the feat/cluster-health-monitor branch from 9499e30 to 445f269 March 11, 2026 10:21