Open
Description
@l r @copilot
(xiao: solemn, focused)
You are correct: exit code 143 (128 + 15, meaning the process was terminated by SIGTERM) most commonly comes from a timeout or cancellation, but it can also result from resource exhaustion or a system-level kill on the CI runner.
Why Resource Exhaustion Is Possible
- If the validator or chaos scripts push the runner's CPU, memory, or disk usage to the limit, the underlying OS or container orchestrator may send SIGTERM (exit code 143) or, in some cases, SIGKILL (exit code 137) to reclaim resources.
- This can happen before the job hits its expected timeout, especially on shared/hosted runners or if parallel jobs spike usage.
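To make the exit-code arithmetic above concrete, here is a minimal sketch that decodes a shell-style exit code into the signal behind it. The helper name `describe_exit_code` is hypothetical. It relies only on the common convention that a code above 128 means "terminated by signal (code - 128)".

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code; codes above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "an unknown signal"
        return f"exit code {code}: terminated by {name} (signal {signum})"
    return f"exit code {code}: process exited on its own"


if __name__ == "__main__":
    # 143 and 137 are the two codes discussed in this issue.
    for code in (143, 137, 1):
        print(describe_exit_code(code))
```

This only tells you which signal ended the process, not who sent it, so it narrows the search rather than confirming resource exhaustion.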
Symptoms to Look For
- No "Timed out" or "Canceled" status in the Actions UI.
- The job dies during or just after a resource-heavy phase, even if the scenario logic has completed.
- Runner logs or system diagnostics (if available) may show OOM (out-of-memory) or throttling events.
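One way to look for OOM events from inside the job is to read the cgroup v2 `memory.events` file, which carries an `oom_kill` counter. The sketch below assumes a Linux runner with cgroup v2 and that the file is readable at the default path shown. That path is an assumption and may differ on your runner (the job's own cgroup is listed in `/proc/self/cgroup`). The function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse 'key value' lines from a cgroup v2 memory.events file into a dict."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def oom_kill_count(events_path: str = "/sys/fs/cgroup/memory.events") -> Optional[int]:
    """Return the oom_kill counter, or None if the file is missing or unreadable.

    The default path is an assumption; adjust it to the cgroup the job runs in.
    """
    try:
        text = Path(events_path).read_text()
    except OSError:
        return None
    return parse_memory_events(text).get("oom_kill")


if __name__ == "__main__":
    print(f"oom_kill counter: {oom_kill_count()}")
```

A counter that increases between the start and end of a stress phase would support the resource-exhaustion theory. A `None` result only means the file could not be read.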
What To Do
- Reduce Resource Usage:
  - Lower the amount of memory/CPU consumed by chaos scripts (e.g., decrease target in stress-ng, limit thread counts).
  - If using memory/CPU/disk stress, add sleep intervals and check for available headroom before starting a new stressor.
- Increase Runner Resources:
  - If you're using GitHub-hosted runners, consider switching to larger runners (e.g., ubuntu-latest-4-cores).
  - For self-hosted, allocate more memory/CPU or run fewer jobs in parallel.
- Add Monitoring/Diagnostics:
  - Log resource usage (free -h, top, df -h) as part of your CI steps before/after stress phases.
  - This can help confirm if resource exhaustion is the true cause.
- Fail Gracefully:
  - In all chaos and stress scripts, catch OOM and resource exceptions and log them explicitly.
  - Exit nonzero with a clear error if resource exhaustion is detected, rather than letting the process get killed.
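If the stress scripts are already Python, the monitoring step can be done in-process instead of shelling out. The following is a rough sketch using only the standard library. It assumes a Linux runner (it reads `/proc/meminfo`), and the function names and log format are hypothetical.

```python
import os
import shutil
from typing import Optional


def parse_mem_available_mb(meminfo_text: str) -> Optional[int]:
    """Return MemAvailable from /proc/meminfo content in MiB, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024  # the value is reported in kB
    return None


def log_resource_snapshot(label: str, path: str = "/") -> None:
    """Print one line with available memory, free disk space, and 1-minute load."""
    try:
        with open("/proc/meminfo") as f:
            mem_mb = parse_mem_available_mb(f.read())
    except OSError:
        mem_mb = None  # not Linux, or /proc is not mounted
    disk_free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    try:
        load1 = os.getloadavg()[0]
    except (OSError, AttributeError):
        load1 = None  # load average is unavailable on this platform
    # flush=True so the line reaches the CI log even if the job is killed soon after.
    print(f"[{label}] mem_available_mb={mem_mb} disk_free_mb={disk_free_mb} load1={load1}",
          flush=True)


if __name__ == "__main__":
    log_resource_snapshot("before-stress")
    # ... run the stress phase here ...
    log_resource_snapshot("after-stress")
```

Calling it before and after each stress phase gives a rough trend in the job log, which is usually enough to see whether memory or disk was close to exhausted when the job died.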
Example Patch (Python stress script):

    import sys

    import psutil  # third-party dependency: pip install psutil

    def ensure_memory_headroom(min_mb=512):
        """Exit with a clear error if less than min_mb MB of memory is available."""
        avail = psutil.virtual_memory().available // (1024 * 1024)
        if avail < min_mb:
            print(f"❌ Not enough memory headroom: {avail}MB available, need {min_mb}MB")
            sys.exit(1)

Call this before launching any stress-ng memory test.
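The "add sleep intervals and check for headroom" and "fail gracefully" suggestions can be combined into a small polling helper. This is only a sketch: `wait_for_headroom` and its parameters are hypothetical, and the memory reading is passed in as a callable so it can wrap the psutil expression from the patch above or any other source.

```python
import sys
import time
from typing import Callable


def wait_for_headroom(get_available_mb: Callable[[], int],
                      min_mb: int = 512,
                      timeout_s: float = 60.0,
                      poll_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until at least min_mb is available; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        avail = get_available_mb()
        if avail >= min_mb:
            return True
        if time.monotonic() >= deadline:
            print(f"Not enough memory headroom after {timeout_s}s: "
                  f"{avail}MB available, need {min_mb}MB", file=sys.stderr)
            return False
        sleep(poll_s)  # back off before checking again


if __name__ == "__main__":
    # Stand-in reading for demonstration; a real script would query the system.
    if not wait_for_headroom(lambda: 1024, min_mb=512):
        sys.exit(1)  # explicit nonzero exit instead of waiting to be killed
    print("headroom OK, starting next stressor")
```

Returning a boolean rather than exiting inside the helper leaves the caller free to skip a stressor, log more context, or exit nonzero with its own message.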
This implementation bears strength in discipline.
(user memo: keep .actrc + .secrets in dotfiles for instant CI) (copilot ignore)