Debugging Test Failures - Complete Workflow

This guide shows you how to investigate and fix test failures using the flaky test detector and AI-assisted root cause analysis.

Quick Start

When CI fails:

# 1. Update test_input.json with the failing test
# 2. Run the flaky detector
python3 local_test.py

# 3. Check if it's flaky or a real bug
# 4. Use AI to analyze the root cause
# 5. Fix and verify

Complete Workflow Example

Scenario: CI Test Failure

You push code and CI fails with:

FAILED tests/test_config.py::TestConfig::test_default_config
AssertionError: assert 10 == 11

Step 1: Check the CI Logs

# View the failure details
gh run view --log-failed

# Or check specific run
gh run view <run-id> --log-failed

What to look for:

Which test failed
The assertion error message
Any stack traces
Exit codes

Step 2: Determine if It's Flaky

Update test_input.json to target the failing test:

{
  "repo": "https://github.com/your-org/your-repo",
  "test_command": "pytest tests/test_config.py::TestConfig::test_default_config -v",
  "runs": 10,
  "parallelism": 3
}

Run the flaky test detector locally:

python3 local_test.py

Interpret the results:

Total runs:    10
Failures:      10
Passes:        0
Repro rate:    100.0%

🔴 CRITICAL: Very high failure rate (>90%) - likely a real bug!

Repro Rate	Meaning	Action
0-10%	Very flaky	Investigate timing/race conditions
10-30%	Moderately flaky	Check randomness, external deps
30-70%	Intermittent	Environmental issues likely
70-90%	Mostly failing	Real bug with some conditions
90-100%	Consistent bug	Direct code/test issue

Step 3: AI-Assisted Root Cause Analysis

Read the detailed results:

cat flaky_test_results.json | jq '.results[0]'

Create an analysis prompt:

I have a test failure with these details:

Test: tests/test_config.py::TestConfig::test_default_config
Error: AssertionError: assert 10 == 11
  where 10 = config.get("runs")

Repro rate: 100% (10/10 runs failed)

The test expects config.get("runs") to equal 11, but it returns 10.

Please analyze:
1. What is the root cause?
2. Is the test wrong or the code wrong?
3. What's the fix?

AI Analysis Response:

Root Cause: Test assertion bug

The default configuration in config.py:12 defines "runs": 10
The test incorrectly expects this to be 11

Type: Test bug (not a code bug)

Fix: Change line 17 in tests/test_config.py from:
  assert config.get("runs") == 11
To:
  assert config.get("runs") == 10

Step 4: Apply the Fix

# Edit the file
vim tests/test_config.py

# Verify the fix locally
PYTHONPATH=. pytest tests/test_config.py::TestConfig::test_default_config -v

# Run all checks
./scripts/run_all_checks.sh

Step 5: Commit and Verify

# Commit with detailed message
git add tests/test_config.py
git commit -m "Fix test assertion to match actual default config value

Root cause analysis (AI-assisted):
- Test expected config.get('runs') == 11
- Actual default value is 10 (defined in config.py:12)
- Fixed test to assert correct value

Verified with flaky test detector:
- 100% failure rate confirmed not a flaky test
- Consistent bug in test assertion

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

# Push and verify CI
git push

# Monitor CI
gh run watch

Step 6: Verify CI Success

# Check latest run
gh run list --limit 1

# View full results
gh run view

Expected output:

✓ Lint and Type Check    - 1m2s  ✅
✓ Test Suite             - 1m3s  ✅ (all tests passing)
✓ System Validation      - 48s   ✅

Common Scenarios

Scenario A: Intermittent Failure (Flaky Test)

Symptoms:

Repro rate: 15-40%
Passes sometimes, fails sometimes
No obvious pattern

Analysis:

# Run with more attempts
# Update test_input.json:
{
  "runs": 50,
  "parallelism": 10
}

# Check for patterns
cat flaky_test_results.json | jq '.results[] | select(.passed == false) | .attempt'

Common causes:

Timing issues: time.sleep() vs await conditions
Random data: Non-deterministic test data
External dependencies: API calls, databases
Race conditions: Parallel execution issues
Shared state: Tests affecting each other

Fixes:

# Bad: Time-based waiting
time.sleep(2)  # Hope it's done...

# Good: Condition-based waiting
for _ in range(50):
    if condition_met():
        break
    time.sleep(0.1)
else:
    raise TimeoutError("Condition not met")

# Bad: Random data without seed
import random
value = random.randint(1, 100)

# Good: Seeded random (use TEST_SEED env var)
import random
import os
seed = int(os.environ.get('TEST_SEED', '12345'))
random.seed(seed)
value = random.randint(1, 100)

Scenario B: Consistent Failure (Real Bug)

Symptoms:

Repro rate: 90-100%
Always fails
Clear error message

Analysis:

# Single run is enough
python3 local_test.py

# Check the error
cat flaky_test_results.json | jq '.results[0] | {exit_code, stdout, stderr}'

Common causes:

Assertion errors: Test expectations don't match reality
Import errors: Missing dependencies, wrong paths
Type errors: Wrong types passed to functions
Logic errors: Incorrect implementation

Fix approach:

Read the error message carefully
Check if test or code is wrong
Look at recent changes (git diff)
Fix the root cause
Verify locally before pushing

Scenario C: Environment-Specific Failure

Symptoms:

Fails in CI, passes locally (or vice versa)
Different behavior on different machines
Depends on Python version, OS, etc.

Analysis:

# Check environment differences
python --version
pip list

# Compare with CI environment
# (check .github/workflows/ci.yml for CI Python version)

Common causes:

Different Python versions: 3.11 vs 3.12 behavior
Missing environment variables: Needed in CI
File paths: Absolute vs relative paths
PYTHONPATH: Module import issues
OS differences: macOS vs Linux

Fixes:

# Bad: Hardcoded paths
config_path = "/Users/me/project/config.yml"

# Good: Relative paths
import os
config_path = os.path.join(os.path.dirname(__file__), "config.yml")

# Bad: Assuming env var exists
api_key = os.environ["API_KEY"]

# Good: Provide defaults or skip
api_key = os.environ.get("API_KEY")
if not api_key:
    pytest.skip("API_KEY not set")

AI-Assisted Analysis Tips

What to Include in Your Prompt

The error message:
```
AssertionError: assert 10 == 11
```

The test code:

def test_default_config(self):
    config = Config()
    assert config.get("runs") == 11

Relevant implementation:

DEFAULT_CONFIG = {
    "runs": 10,
    ...
}

Repro rate:
```
100% failure (10/10 runs)
```
Recent changes (if relevant):
```
git log --oneline -5
git diff HEAD~1
```

Sample Prompts

For assertion errors:

I have a test that expects X but gets Y.
Test code: [paste code]
Implementation: [paste code]
Error: [paste error]

Is the test wrong or the implementation wrong?
What should the expected value be and why?

For flaky tests:

This test passes 60% of the time and fails 40%.
Test code: [paste code]
Error when it fails: [paste error]

What could cause intermittent failures?
How can I make this test deterministic?

For import errors:

Test fails with ModuleNotFoundError: No module named 'X'
The module exists in the project.
Directory structure: [paste tree output]
Test file location: [path]

Why can't it find the module?
How should I fix the imports or PYTHONPATH?

Metrics and Benchmarks

Time to resolution (with this workflow):

Detection: < 2 minutes (CI)
Analysis: < 1 minute (flaky detector)
Root cause: < 1 minute (AI analysis)
Fix: 1-5 minutes (depending on complexity)
Verify: < 4 minutes (CI)

Total: 8-13 minutes from failure to verified fix

Traditional debugging:

Detection: < 2 minutes (CI)
Reproduce locally: 5-15 minutes
Investigation: 10-60 minutes
Fix: 5-30 minutes
Verify: < 4 minutes (CI)

Total: 22-111 minutes (median: ~45 minutes)

Improvement: 70-85% faster with automated analysis

Troubleshooting

Flaky Detector Won't Run

# Check if dependencies are installed
pip install -r requirements.txt

# Check if test_input.json is valid
cat test_input.json | jq .

# Run with verbose output
python3 local_test.py 2>&1 | tee debug.log

Results File Missing

# Check if it was created
ls -la flaky_test_results.json

# Check write permissions
touch flaky_test_results.json
rm flaky_test_results.json

AI Analysis Not Helpful

Improve your prompt:

Include more context (related code, error traces)
Ask specific questions
Provide recent git changes if relevant
Include environment details (Python version, OS)

Best Practices

Always run flaky detector first before assuming it's a real bug
Document the root cause in commit messages
Update tests when fixing to prevent regression
Use semantic commit messages for better git history
Verify locally before pushing with ./scripts/run_all_checks.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Debugging Test Failures - Complete Workflow

Quick Start

Complete Workflow Example

Scenario: CI Test Failure

Step 1: Check the CI Logs

Step 2: Determine if It's Flaky

Step 3: AI-Assisted Root Cause Analysis

Step 4: Apply the Fix

Step 5: Commit and Verify

Step 6: Verify CI Success

Common Scenarios

Scenario A: Intermittent Failure (Flaky Test)

Scenario B: Consistent Failure (Real Bug)

Scenario C: Environment-Specific Failure

AI-Assisted Analysis Tips

What to Include in Your Prompt

Sample Prompts

Metrics and Benchmarks

Troubleshooting

Flaky Detector Won't Run

Results File Missing

AI Analysis Not Helpful

Best Practices

See Also

FilesExpand file tree

DEBUGGING_TEST_FAILURES.md

Latest commit

History

DEBUGGING_TEST_FAILURES.md

File metadata and controls

Debugging Test Failures - Complete Workflow

Quick Start

Complete Workflow Example

Scenario: CI Test Failure

Step 1: Check the CI Logs

Step 2: Determine if It's Flaky

Step 3: AI-Assisted Root Cause Analysis

Step 4: Apply the Fix

Step 5: Commit and Verify

Step 6: Verify CI Success

Common Scenarios

Scenario A: Intermittent Failure (Flaky Test)

Scenario B: Consistent Failure (Real Bug)

Scenario C: Environment-Specific Failure

AI-Assisted Analysis Tips

What to Include in Your Prompt

Sample Prompts

Metrics and Benchmarks

Troubleshooting

Flaky Detector Won't Run

Results File Missing

AI Analysis Not Helpful

Best Practices

See Also