Copilot AI commented Sep 8, 2025

This PR implements Prometheus metrics to enable reliable alerting on image push failures in kube-snapshot, addressing the need for explicit, machine-readable indicators of push operations.

Problem

Previously, when image push operations failed (due to registry auth issues, network problems, or other errors), the only way to detect failures was through log parsing or generic task status monitoring. This made it difficult for platform teams to:

  • Set up reliable alerts for push failures
  • Monitor push success rates and failure patterns
  • Distinguish between different types of failures (auth, network, etc.)
  • Build dashboards showing operational health

Solution

Added three Prometheus counters that expose the outcome of each image push and each task phase transition:

New Metrics

  • snapshot_image_push_total - Counter tracking all push attempts with labels:
    • namespace, pod, image - Resource identification
    • status - "success" or "failure"
    • reason - Failure categorization (auth_error, push_failed, image_not_exists, runtime_error)
  • snapshot_image_push_failures_total - Counter tracking only failed pushes with detailed failure reasons
  • snapshot_tasks_total - Counter tracking task phase transitions (CREATED, COMPLETED, FAILED)

Key Features

  • Detailed failure categorization: Auth errors, push failures, missing images, and runtime issues are clearly distinguished
  • Rich labeling: Full context (namespace, pod, image) for targeted alerting
  • Backward compatible: No breaking changes to existing functionality
  • Production ready: Metrics enabled by default on :8080

Example Usage

# Start with metrics enabled (default)
./bin/manager

# Access metrics
curl http://localhost:8080/metrics | grep snapshot_

# Example metrics output:
snapshot_image_push_total{namespace="default",pod="app",image="registry.com/app:v1",status="failure",reason="auth_error"} 1
snapshot_image_push_failures_total{namespace="default",pod="app",image="registry.com/app:v1",reason="auth_error"} 1

Alerting Examples

# Alert on any push failures in production
- alert: CriticalImagePushFailure  
  expr: increase(snapshot_image_push_failures_total{namespace="production"}[5m]) > 0
  annotations:
    summary: "Image push failed for {{ $labels.pod }}: {{ $labels.reason }}"

# Alert on high failure rate
- alert: HighImagePushFailureRate
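  # Aggregate both sides before dividing: the two counters carry different label sets (only the total has "status"), so a bare division would match no series
  expr: sum(rate(snapshot_image_push_failures_total[5m])) / sum(rate(snapshot_image_push_total[5m])) > 0.1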
  annotations:
    summary: "Image push failure rate exceeds 10%"

Implementation Details

  • Added internal/metrics package with Prometheus counter definitions
  • Instrumented reconcilePushImage function to record metrics on success/failure
  • Updated default metrics configuration to enable endpoint on :8080
  • Added tests for the metrics functionality
  • Documentation with alerting rules and Grafana dashboard examples

The implementation provides the explicit, machine-readable indicators requested, enabling platform teams to build reliable monitoring and alerting around image push operations.

Fixes #55.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://storage.googleapis.com/storage/v1/b/kubebuilder-tools/o/kubebuilder-tools-1.30.0-linux-amd64.tar.gz
    • Triggering command: /home/REDACTED/work/kube-snapshot/kube-snapshot/bin/setup-envtest-release-0.18 use 1.30.0 --bin-dir /home/REDACTED/work/kube-snapshot/kube-snapshot/bin -p path (http block)



Copilot AI commented Sep 8, 2025

@kebe7jun 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

Copilot AI changed the title from "[WIP] [Feature Request] Support alerting on image push failure" to "Add Prometheus metrics for image push failure alerting" Sep 8, 2025
Copilot AI requested a review from kebe7jun September 8, 2025 10:19
Copilot finished work on behalf of kebe7jun September 8, 2025 10:19
CoderTH (Collaborator) commented Sep 17, 2025

@kebe7jun When can this PR be merged?

kebe7jun marked this pull request as ready for review September 17, 2025 08:38
kebe7jun (Collaborator) replied, quoting CoderTH:

"@kebe7jun When can this PR be merged?"

Could you help test and review it?


codecov bot commented Sep 17, 2025

Codecov Report

❌ Patch coverage is 48.48485% with 17 lines in your changes missing coverage. Please review.

Files with missing lines                             Patch %   Lines
internal/controller/snapshotpodtask_controller.go    11.76%    11 Missing and 4 partials ⚠️
cmd/main.go                                          0.00%     2 Missing ⚠️

@@            Coverage Diff             @@
##             main      #56      +/-   ##
==========================================
+ Coverage   22.03%   23.60%   +1.57%     
==========================================
  Files           6        7       +1     
  Lines         926      953      +27     
==========================================
+ Hits          204      225      +21     
- Misses        701      706       +5     
- Partials       21       22       +1     

Flag         Coverage Δ
unittests    23.60% <48.48%> (+1.57%) ⬆️


Files with missing lines                             Coverage Δ
internal/metrics/metrics.go                          100.00% <100.00%> (ø)
cmd/main.go                                          0.00% <0.00%> (ø)
internal/controller/snapshotpodtask_controller.go    45.02% <11.76%> (+0.57%) ⬆️

Development

Successfully merging this pull request may close these issues.

[Feature Request] Support alerting on image push failure

3 participants