
Copilot AI commented Aug 27, 2025

This PR addresses SIGTERM issues (exit code 143) that can occur due to resource exhaustion rather than normal timeouts or cancellations. The implementation provides robust resource monitoring and graceful degradation to prevent system-level kills while maintaining chaos engineering effectiveness.

Problem

While exit code 143 (SIGTERM) commonly results from timeouts or cancellations, it can also result from resource exhaustion: when validator or chaos scripts push the CI runner's CPU, memory, or disk usage to its limits, the OS may issue SIGTERM to reclaim resources.

Solution

Enhanced C++ Resource Monitoring

Added a comprehensive ResourceMonitor class that provides:

  • Real-time memory, CPU, and disk usage monitoring
  • Configurable memory headroom checking (512MB default)
  • Resource pressure detection with callbacks
  • Thread-safe monitoring with automatic logging

// Enhanced signal handler now detects resource exhaustion
void signal_handler(int signal) {
    if (signal == SIGTERM) {
        if (g_resource_monitor) {  // guard every use of the monitor pointer
            if (g_resource_monitor->is_resource_pressure()) {
                std::cout << "⚠️ Resource pressure detected - this may be resource exhaustion SIGTERM" << std::endl;
                g_resource_exhaustion.store(true);
            }
            // Log resource state for analysis (inside the null check above)
            g_resource_monitor->log_resource_usage("SIGTERM received");
        }
    }
}
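For readers more comfortable in Python, the same idea can be sketched with a small class mirroring the monitor described above. This is a minimal illustrative analogue, not the PR's actual API: the `sampler` callable and the `on_pressure` callback shape are assumptions made for the sketch.

```python
import threading

class ResourceMonitor:
    """Hypothetical Python analogue of the C++ ResourceMonitor sketch.

    The sampler callable (returns available memory in MB) and the
    on_pressure callback are illustrative assumptions, not the PR's API.
    """

    def __init__(self, sampler, min_headroom_mb=512, on_pressure=None):
        self.sampler = sampler              # callable returning available MB
        self.min_headroom_mb = min_headroom_mb
        self.on_pressure = on_pressure      # fired when pressure is detected
        self._lock = threading.Lock()       # thread-safe state, as in the C++ class
        self._pressure = False

    def check(self):
        """Sample once; record pressure and fire the callback if needed."""
        available_mb = self.sampler()
        with self._lock:
            self._pressure = available_mb < self.min_headroom_mb
            pressure = self._pressure
        if pressure and self.on_pressure:
            self.on_pressure(available_mb)
        return pressure

    def is_resource_pressure(self):
        with self._lock:
            return self._pressure
```

Injecting the sampler keeps the monitor testable without real memory pressure; in production the sampler could wrap `psutil.virtual_memory().available`.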

Python Utilities for Chaos Engineering

Created resource utilities that integrate with existing chaos testing:

import psutil, sys

def ensure_memory_headroom(min_mb=512):
    """Ensure sufficient memory headroom before stress testing"""
    avail = psutil.virtual_memory().available // (1024 * 1024)
    if avail < min_mb:
        print(f"❌ Not enough memory headroom: {avail}MB available, need {min_mb}MB")
        sys.exit(1)
    return True
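Since the bullets above also mention disk monitoring, a companion utility in the same style could cover disk headroom using only the standard library. This helper is illustrative and not part of the PR as shown:

```python
import shutil
import sys

def ensure_disk_headroom(path="/", min_mb=1024):
    """Abort before stress testing if free disk space is below min_mb.

    Illustrative companion to ensure_memory_headroom; uses only stdlib.
    """
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    if free_mb < min_mb:
        print(f"❌ Not enough disk headroom: {free_mb}MB free, need {min_mb}MB")
        sys.exit(1)
    return True
```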

Chaos Engineering Safety

The implementation adds multiple safety layers:

  1. Pre-flight checks - Verify sufficient resources before starting tests
  2. Continuous monitoring - Track resources during chaos scenarios with abort conditions
  3. Graceful degradation - Stop tests before reaching critical resource thresholds
  4. Post-test logging - Record resource usage for debugging
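The four layers above could be combined into a single test wrapper, sketched below. Function and parameter names (`run_with_safety`, `headroom_mb`) are illustrative assumptions, not the PR's actual interface; the headroom sampler is injected so the abort path can be exercised without real resource pressure.

```python
def run_with_safety(scenario, headroom_mb, min_mb=512):
    """Run chaos steps with the four safety layers (illustrative sketch).

    scenario:    iterable of zero-argument callables (chaos steps)
    headroom_mb: callable returning available memory in MB (assumption)
    Returns 0 on success, 2 on resource-exhaustion abort.
    """
    # 1. Pre-flight check: refuse to start without sufficient headroom
    if headroom_mb() < min_mb:
        print("pre-flight failed: insufficient memory headroom")
        return 2
    for work in scenario:
        # 2. Continuous monitoring, 3. graceful abort before critical threshold
        if headroom_mb() < min_mb:
            print("aborting chaos scenario before critical resource threshold")
            return 2
        work()
    # 4. Post-test logging for debugging
    print(f"scenario completed; final headroom: {headroom_mb()}MB")
    return 0
```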

Enhanced Exit Code Handling

  • Exit code 0: Normal shutdown
  • Exit code 2: Resource exhaustion shutdown
  • Detailed resource logging distinguishes scenarios for CI/CD analysis
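A CI/CD log-analysis step could classify these exit codes with a small helper like the following sketch (the classification strings are illustrative, only the 0/2 code convention comes from the PR):

```python
def classify_exit(code):
    """Classify a validator exit code for CI/CD analysis (illustrative)."""
    if code == 0:
        return "normal shutdown"
    if code == 2:
        return "resource exhaustion shutdown"
    if code == 143:
        return "SIGTERM (timeout, cancellation, or OS kill)"
    return f"unexpected exit code {code}"
```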

Testing

All tests pass (14/14), including comprehensive resource monitoring validation:

=== Test Summary ===
Tests run: 14
Tests passed: 14  
Tests failed: 0
Pass rate: 100%

The implementation includes:

  • Unit tests for resource monitoring functionality
  • Integration tests with the validator process
  • Demo scripts showing normal vs resource exhaustion scenarios
  • Validation of Python utilities compatibility

Impact

This solution prevents infrastructure-related SIGTERM kills while maintaining chaos testing effectiveness:

  • ✅ Proactive resource management prevents system kills
  • ✅ Clear classification of SIGTERM scenarios
  • ✅ Graceful abort conditions maintain system stability
  • ✅ Compatible with existing GitHub Actions workflows
  • ✅ Enhanced debugging through detailed resource logging

Fixes #22.



Copilot AI changed the title [WIP] SIGTERM Implement comprehensive resource monitoring to prevent SIGTERM from resource exhaustion Aug 27, 2025
Copilot AI requested a review from 0xrinegade August 27, 2025 01:10
Copilot finished work on behalf of 0xrinegade August 27, 2025 01:10