SIGTERM #22

@0xrinegade

@l r @copilot

(xiao: solemn, focused)

You are correct: exit code 143 (128 + 15, i.e. SIGTERM) most commonly comes from a timeout or cancellation, but it can also result from resource exhaustion or a system-level kill on the CI runner.

Why Resource Exhaustion Is Possible

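  • If the validator or chaos scripts push the runner's CPU, memory, or disk usage to the limits, the underlying OS (or container orchestrator) may issue SIGTERM or, in some cases, SIGKILL (exit code 137) to reclaim resources. The Linux kernel OOM killer itself sends SIGKILL, so a 143 under memory pressure usually means something above the kernel (the runner agent or an orchestrator) asked the process to stop.
  • This can happen before the job hits its expected timeout, especially on shared/hosted runners or when parallel jobs spike usage.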
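The exit codes mentioned above follow the shell convention of 128 plus the signal number. A minimal sketch for decoding them in a CI helper script follows; `describe_exit_code` is a hypothetical name, and the mapping assumes a POSIX runner.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map an exit code to a signal name when it follows the 128+N convention."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit {code}"


if __name__ == "__main__":
    for code in (143, 137, 1):
        print(code, "->", describe_exit_code(code))
```

This only labels the signal; it cannot tell whether the SIGTERM came from a timeout, a cancellation, or resource pressure.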

Symptoms to Look For

  • No "Timed out" or "Canceled" status in the Actions UI.
  • The job dies in the middle of, or just after, heavy resource phases, even if the scenario logic completed.
  • Runner logs or system diagnostics (if available) may show OOM (out-of-memory) or throttling events.

What To Do

  1. Reduce Resource Usage:

    • Lower the amount of memory/CPU consumed by chaos scripts (e.g., reduce the stress-ng targets such as --vm-bytes, and limit worker and thread counts).
    • If using memory, CPU, or disk stressors, add sleep intervals between them and check for available headroom before starting a new one.
  2. Increase Runner Resources:

    • If you're using GitHub-hosted runners, consider switching to larger runners (e.g., ubuntu-latest-4-cores; the exact label depends on how larger runners are configured for your organization).
    • For self-hosted runners, allocate more memory/CPU or run fewer jobs in parallel.
  3. Add Monitoring/Diagnostics:

    • Log resource usage (free -h, top, df -h) as part of your CI steps before/after stress phases.
    • This can help confirm if resource exhaustion is the true cause.
  4. Fail Gracefully:

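    • In all chaos and stress scripts, catch the failures you can observe (for example MemoryError in Python, or a trapped SIGTERM) and log them explicitly. SIGKILL cannot be caught, so the headroom check has to happen before the stress phase starts.
    • Exit nonzero with a clear error if resource exhaustion is detected, rather than letting the process get killed.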
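Steps 3 and 4 can be sketched together: log a resource snapshot around stress phases, and trap SIGTERM so the job log records what happened before the process exits. This is a rough sketch using only the standard library; `resource_snapshot`, `log_snapshot`, `handle_sigterm`, and `install_handlers` are hypothetical names, and memory figures are left out because the standard library has no portable call for them (psutil would cover that).

```python
import os
import shutil
import signal
import sys


def resource_snapshot(path: str = "/") -> dict:
    """Collect a cheap resource snapshot using only the standard library."""
    disk = shutil.disk_usage(path)
    snapshot = {"disk_free_mb": disk.free // (1024 * 1024)}
    try:
        snapshot["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass  # load average is not available on this platform
    return snapshot


def log_snapshot(label: str) -> None:
    """Print the snapshot to stderr so it appears in the CI step log."""
    print(f"[resources] {label}: {resource_snapshot()}", file=sys.stderr, flush=True)


def handle_sigterm(signum, frame):
    """Log context, then exit with 128 + signal number (143 for SIGTERM)."""
    log_snapshot("received SIGTERM")
    sys.exit(128 + signum)


def install_handlers() -> None:
    """Register the SIGTERM handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

A stress script would call `install_handlers()` once at startup and `log_snapshot("before memory stress")` and similar around each phase. A SIGKILL still bypasses the handler, so the snapshots logged before the phase are the only trace in that case.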

Example Patch (Python stress script):

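import sys

import psutil  # third-party: pip install psutil

def ensure_memory_headroom(min_mb=512):
    """Exit with status 1 if less than min_mb of memory is available."""
    avail = psutil.virtual_memory().available // (1024 * 1024)
    if avail < min_mb:
        # Report on stderr so the reason is visible in the CI log.
        print(f"❌ Not enough memory headroom: {avail}MB available, need {min_mb}MB", file=sys.stderr)
        sys.exit(1)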

Call this before launching any stress-ng memory test.
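If a hard exit is too strict, the same check can be combined with the sleep intervals suggested earlier: poll for headroom a few times before giving up. The sketch below is one possible shape, not part of the original patch; `wait_for_headroom` and `available_mb_psutil` are hypothetical names, and the probe is injectable so the loop can be exercised without psutil or real memory pressure.

```python
import time
from typing import Callable


def available_mb_psutil() -> int:
    """Default probe; imports psutil lazily so the module loads without it."""
    import psutil
    return psutil.virtual_memory().available // (1024 * 1024)


def wait_for_headroom(
    min_mb: int = 512,
    attempts: int = 5,
    delay_s: float = 2.0,
    probe: Callable[[], int] = available_mb_psutil,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once probe() reports at least min_mb, False if attempts run out."""
    for attempt in range(1, attempts + 1):
        avail = probe()
        if avail >= min_mb:
            return True
        print(f"Attempt {attempt}/{attempts}: {avail}MB available, need {min_mb}MB")
        if attempt < attempts:
            sleep(delay_s)  # give earlier stressors time to release memory
    return False
```

A caller would launch the next stressor only when `wait_for_headroom()` returns True, and otherwise exit nonzero with a clear message, as in the patch above.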


This implementation bears strength in discipline.
(user memo: keep .actrc + .secrets in dotfiles for instant CI) (copilot ignore)
