Starter and stopper jobs have inconsistent BackoffLimit setting causing excessive pod creation on failures #658

@moko-poi

Description

Brief summary

The starter and stopper jobs do not set BackoffLimit, so they fall back to the Kubernetes default and retry up to 6 times on failure, while the initializer and runner jobs are configured with BackoffLimit: &zero32 and fail immediately. This inconsistency leads to excessive pod creation, wasted resources, and delayed error detection when the starter/stopper curl calls fail.
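
For context, the difference comes down to whether the generated batchv1.Job sets Spec.BackoffLimit. The sketch below is illustrative only (the function names and pod template are placeholders, not the actual k6-operator source); it shows the two variants side by side:

package jobs

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var zero32 int32 = 0

// Variant used by the initializer and runner jobs: BackoffLimit is pinned to 0,
// so a failed pod is not recreated and the Job fails immediately.
func jobWithZeroBackoff(template corev1.PodTemplateSpec) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "k6-test-initializer"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &zero32,
			Template:     template,
		},
	}
}

// Variant currently produced for the starter and stopper jobs: BackoffLimit is
// left nil, so the API server defaults it to 6 and keeps recreating failing pods.
func jobWithDefaultBackoff(template corev1.PodTemplateSpec) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "k6-test-starter"},
		Spec: batchv1.JobSpec{
			Template: template,
		},
	}
}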

k6-operator version or image

latest/main branch

Helm chart version (if applicable)

No response

TestRun / PrivateLoadZone YAML

apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: k6-test
spec:
  parallelism: 1
  script:
    configMap:
      name: test
      file: test.js

Other environment details (if applicable)

No response

Steps to reproduce the problem

  1. Create a TestRun that will cause the starter/stopper curl call to fail (e.g., by deleting the k6 runner pod immediately after creation for the starter, or by causing network issues during test execution for the stopper)
  2. Observe that the starter/stopper jobs create multiple pods (up to 6), all of which fail
  3. Compare with the initializer and runner jobs, which have BackoffLimit: &zero32

Example commands

  • kubectl apply -f testrun.yaml
  • kubectl delete pod k6-test-1-xxx # Delete runner pod to cause starter failure
  • kubectl get pods -l job-name=k6-test-starter # Observe multiple failed pods
  • kubectl get pods -l job-name=k6-test-stopper # Observe multiple failed pods for stopper

Expected behaviour

Starter and stopper jobs should fail immediately (after curl's internal 3 retries) and not create multiple pods, consistent with the initializer and runner jobs, which both use BackoffLimit: &zero32.

Actual behaviour

Starter and stopper jobs create up to 6 pods (the Kubernetes default BackoffLimit) when the curl connection fails, causing:

  1. Resource waste: Multiple unnecessary pods created
  2. Inconsistent behavior: initializer/runner fail immediately, but starter/stopper retry 6 times
  3. Excessive logging: Same failure logged 6 times instead of once
  4. Delayed failure detection: Takes longer to identify actual issues

Each pod performs curl with --retry 3, so total attempts = 6 pods × 3 curl retries = 18 attempts for the same failure.

The current code in pkg/resources/jobs/starter.go and pkg/resources/jobs/stopper.go lacks a BackoffLimit setting, while initializer.go and runner.go both set BackoffLimit: &zero32.
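
A minimal sketch of the proposed change (assuming the starter and stopper job builders construct a batchv1.JobSpec the same way the other builders do; the helper name below is hypothetical, not the actual function in starter.go):

// Proposed: align starter.go and stopper.go with initializer.go and runner.go
// by pinning BackoffLimit to 0 so the Job fails as soon as its single pod fails.
var zero32 int32 = 0

func starterJobSpec(template corev1.PodTemplateSpec) batchv1.JobSpec {
	return batchv1.JobSpec{
		BackoffLimit: &zero32, // fail immediately instead of the default 6 retries
		Template:     template,
	}
}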
