
Copilot AI commented Aug 27, 2025

This PR addresses SIGTERM issues (exit code 143) that can occur due to resource exhaustion rather than normal timeouts or cancellations. The implementation provides robust resource monitoring and graceful degradation to prevent system-level kills while maintaining chaos engineering effectiveness.

Problem

While exit code 143 (SIGTERM) commonly results from timeouts or cancellations, it can also result from resource exhaustion: when validator or chaos scripts push the CI runner's CPU, memory, or disk usage to its limits, the OS may issue SIGTERM to reclaim resources.

Solution

Enhanced C++ Resource Monitoring

Added a comprehensive ResourceMonitor class that provides:

  • Real-time memory, CPU, and disk usage monitoring
  • Configurable memory headroom checking (512MB default)
  • Resource pressure detection with callbacks
  • Thread-safe monitoring with automatic logging

// Enhanced signal handler now detects resource exhaustion
void signal_handler(int signal) {
    if (signal == SIGTERM) {
        if (g_resource_monitor) {  // guard every use of the monitor pointer
            if (g_resource_monitor->is_resource_pressure()) {
                std::cout << "⚠️ Resource pressure detected - this may be resource exhaustion SIGTERM" << std::endl;
                g_resource_exhaustion.store(true);
            }
            // Log resource state for analysis (inside the null check above)
            g_resource_monitor->log_resource_usage("SIGTERM received");
        }
    }
}
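For readers more comfortable in Python, the same idea can be sketched with a small class mirroring the monitor described above. This is a minimal illustrative analogue, not the PR's actual API: the `sampler` callable and the `on_pressure` callback shape are assumptions made for the sketch.

```python
import threading

class ResourceMonitor:
    """Hypothetical Python analogue of the C++ ResourceMonitor sketch.

    The sampler callable (returns available memory in MB) and the
    on_pressure callback are illustrative assumptions, not the PR's API.
    """

    def __init__(self, sampler, min_headroom_mb=512, on_pressure=None):
        self.sampler = sampler              # callable returning available MB
        self.min_headroom_mb = min_headroom_mb
        self.on_pressure = on_pressure      # fired when pressure is detected
        self._lock = threading.Lock()       # thread-safe state, as in the C++ class
        self._pressure = False

    def check(self):
        """Sample once; record pressure and fire the callback if needed."""
        available_mb = self.sampler()
        with self._lock:
            self._pressure = available_mb < self.min_headroom_mb
            pressure = self._pressure
        if pressure and self.on_pressure:
            self.on_pressure(available_mb)
        return pressure

    def is_resource_pressure(self):
        with self._lock:
            return self._pressure
```

Injecting the sampler keeps the monitor testable without real memory pressure; in production the sampler could wrap `psutil.virtual_memory().available`.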

Python Utilities for Chaos Engineering

Created resource utilities that integrate with existing chaos testing:

import psutil, sys

def ensure_memory_headroom(min_mb=512):
    """Ensure sufficient memory headroom before stress testing"""
    avail = psutil.virtual_memory().available // (1024 * 1024)
    if avail < min_mb:
        print(f"❌ Not enough memory headroom: {avail}MB available, need {min_mb}MB")
        sys.exit(1)
    return True
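Since the bullets above also mention disk monitoring, a companion utility in the same style could cover disk headroom using only the standard library. This helper is illustrative and not part of the PR as shown:

```python
import shutil
import sys

def ensure_disk_headroom(path="/", min_mb=1024):
    """Abort before stress testing if free disk space is below min_mb.

    Illustrative companion to ensure_memory_headroom; uses only stdlib.
    """
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    if free_mb < min_mb:
        print(f"❌ Not enough disk headroom: {free_mb}MB free, need {min_mb}MB")
        sys.exit(1)
    return True
```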

Chaos Engineering Safety

The implementation adds multiple safety layers:

  1. Pre-flight checks - Verify sufficient resources before starting tests
  2. Continuous monitoring - Track resources during chaos scenarios with abort conditions
  3. Graceful degradation - Stop tests before reaching critical resource thresholds
  4. Post-test logging - Record resource usage for debugging
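The four layers above could be combined into a single test wrapper, sketched below. Function and parameter names (`run_with_safety`, `headroom_mb`) are illustrative assumptions, not the PR's actual interface; the headroom sampler is injected so the abort path can be exercised without real resource pressure.

```python
def run_with_safety(scenario, headroom_mb, min_mb=512):
    """Run chaos steps with the four safety layers (illustrative sketch).

    scenario:    iterable of zero-argument callables (chaos steps)
    headroom_mb: callable returning available memory in MB (assumption)
    Returns 0 on success, 2 on resource-exhaustion abort.
    """
    # 1. Pre-flight check: refuse to start without sufficient headroom
    if headroom_mb() < min_mb:
        print("pre-flight failed: insufficient memory headroom")
        return 2
    for work in scenario:
        # 2. Continuous monitoring, 3. graceful abort before critical threshold
        if headroom_mb() < min_mb:
            print("aborting chaos scenario before critical resource threshold")
            return 2
        work()
    # 4. Post-test logging for debugging
    print(f"scenario completed; final headroom: {headroom_mb()}MB")
    return 0
```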

Enhanced Exit Code Handling

  • Exit code 0: Normal shutdown
  • Exit code 2: Resource exhaustion shutdown
  • Detailed resource logging distinguishes scenarios for CI/CD analysis
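A CI/CD log-analysis step could classify these exit codes with a small helper like the following sketch (the classification strings are illustrative, only the 0/2 code convention comes from the PR):

```python
def classify_exit(code):
    """Classify a validator exit code for CI/CD analysis (illustrative)."""
    if code == 0:
        return "normal shutdown"
    if code == 2:
        return "resource exhaustion shutdown"
    if code == 143:
        return "SIGTERM (timeout, cancellation, or OS kill)"
    return f"unexpected exit code {code}"
```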

Testing

All tests pass (14/14), including comprehensive resource monitoring validation:

=== Test Summary ===
Tests run: 14
Tests passed: 14  
Tests failed: 0
Pass rate: 100%

The implementation includes:

  • Unit tests for resource monitoring functionality
  • Integration tests with the validator process
  • Demo scripts showing normal vs resource exhaustion scenarios
  • Validation of Python utilities compatibility

Impact

This solution prevents infrastructure-related SIGTERM kills while maintaining chaos testing effectiveness:

  • ✅ Proactive resource management prevents system kills
  • ✅ Clear classification of SIGTERM scenarios
  • ✅ Graceful abort conditions maintain system stability
  • ✅ Compatible with existing GitHub Actions workflows
  • ✅ Enhanced debugging through detailed resource logging

Fixes #22.



Copilot AI changed the title [WIP] SIGTERM Implement comprehensive resource monitoring to prevent SIGTERM from resource exhaustion Aug 27, 2025
Copilot AI requested a review from 0xrinegade August 27, 2025 01:10
Copilot finished work on behalf of 0xrinegade August 27, 2025 01:10