This guide shows you how to investigate and fix test failures using the flaky test detector and AI-assisted root cause analysis.
When CI fails:
# 1. Update test_input.json with the failing test
# 2. Run the flaky detector
python3 local_test.py
# 3. Check if it's flaky or a real bug
# 4. Use AI to analyze the root cause
# 5. Fix and verify

You push code and CI fails with:
FAILED tests/test_config.py::TestConfig::test_default_config
AssertionError: assert 10 == 11
# View the failure details
gh run view --log-failed
# Or check specific run
gh run view <run-id> --log-failed

What to look for:
- Which test failed
- The assertion error message
- Any stack traces
- Exit codes
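If the log is long, a small script can pull out just the failing test IDs to drop into test_input.json. This is a sketch (not part of the repo) and assumes pytest's standard FAILED <node-id> short-summary lines:

```python
# extract_failed_tests.py - hypothetical helper, not part of the repo.
# Assumes pytest's standard "FAILED tests/...::TestX::test_y" summary lines.
import re
import sys

FAILED_RE = re.compile(r"^FAILED\s+(\S+)")

def failed_tests(log_text: str) -> list[str]:
    """Return unique pytest node IDs found on FAILED lines."""
    seen: list[str] = []
    for line in log_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

if __name__ == "__main__":
    # Usage: gh run view --log-failed > ci.log && python3 extract_failed_tests.py ci.log
    with open(sys.argv[1]) as f:
        for node_id in failed_tests(f.read()):
            print(node_id)
```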
Update test_input.json to target the failing test:
{
"repo": "https://github.com/your-org/your-repo",
"test_command": "pytest tests/test_config.py::TestConfig::test_default_config -v",
"runs": 10,
"parallelism": 3
}

Run the flaky test detector locally:
python3 local_test.py

Interpret the results:
Total runs: 10
Failures: 10
Passes: 0
Repro rate: 100.0%
🔴 CRITICAL: Very high failure rate (>90%) - likely a real bug!
| Repro Rate | Meaning | Action |
|---|---|---|
| 0-10% | Very flaky | Investigate timing/race conditions |
| 10-30% | Moderately flaky | Check randomness, external deps |
| 30-70% | Intermittent | Environmental issues likely |
| 70-90% | Mostly failing | Real bug with some conditions |
| 90-100% | Consistent bug | Direct code/test issue |
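The same thresholds can be applied to the detector's JSON output directly. A minimal sketch, assuming the schema suggested by the jq examples below (a results list with a passed flag per run); adjust field names to the real file:

```python
# classify_repro.py - a sketch, not part of the repo. The field names
# ("results", "passed") are assumptions; adjust to the actual schema.
import json

def classify(repro_rate: float) -> str:
    """Map a failure repro rate (0-100%) to the interpretation table above."""
    if repro_rate <= 10:
        return "Very flaky - investigate timing/race conditions"
    if repro_rate <= 30:
        return "Moderately flaky - check randomness, external deps"
    if repro_rate <= 70:
        return "Intermittent - environmental issues likely"
    if repro_rate <= 90:
        return "Mostly failing - real bug with some conditions"
    return "Consistent bug - direct code/test issue"

with open("flaky_test_results.json") as f:
    results = json.load(f)["results"]

failures = sum(1 for r in results if not r.get("passed"))
rate = 100.0 * failures / len(results)
print(f"Repro rate: {rate:.1f}% -> {classify(rate)}")
```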
Read the detailed results:
cat flaky_test_results.json | jq '.results[0]'

Create an analysis prompt:
I have a test failure with these details:
Test: tests/test_config.py::TestConfig::test_default_config
Error: AssertionError: assert 10 == 11
where 10 = config.get("runs")
Repro rate: 100% (10/10 runs failed)
The test expects config.get("runs") to equal 11, but it returns 10.
Please analyze:
1. What is the root cause?
2. Is the test wrong or the code wrong?
3. What's the fix?
AI Analysis Response:
Root Cause: Test assertion bug
The default configuration in config.py:12 defines "runs": 10
The test incorrectly expects this to be 11
Type: Test bug (not a code bug)
Fix: Change line 17 in tests/test_config.py from:
assert config.get("runs") == 11
To:
assert config.get("runs") == 10
# Edit the file
vim tests/test_config.py
# Verify the fix locally
PYTHONPATH=. pytest tests/test_config.py::TestConfig::test_default_config -v
# Run all checks
./scripts/run_all_checks.sh

# Commit with detailed message
git add tests/test_config.py
git commit -m "Fix test assertion to match actual default config value
Root cause analysis (AI-assisted):
- Test expected config.get('runs') == 11
- Actual default value is 10 (defined in config.py:12)
- Fixed test to assert correct value
Verified with flaky test detector:
- 100% failure rate confirmed not a flaky test
- Consistent bug in test assertion
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
# Push and verify CI
git push
# Monitor CI
gh run watch

# Check latest run
gh run list --limit 1
# View full results
gh run view

Expected output:
✓ Lint and Type Check - 1m2s
✓ Test Suite - 1m3s
  (all tests passing)
✓ System Validation - 48s
Symptoms:
- Repro rate: 15-40%
- Passes sometimes, fails sometimes
- No obvious pattern
Analysis:
# Run with more attempts
# Update test_input.json:
{
"runs": 50,
"parallelism": 10
}
# Check for patterns
cat flaky_test_results.json | jq '.results[] | select(.passed == false) | .attempt'
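Beyond eyeballing the jq output, a short script can show where in the run order the failures cluster; failures that only appear on later attempts often point at shared state or resource exhaustion. A sketch, with the same schema assumptions as the jq queries:

```python
# failure_patterns.py - a sketch, not part of the repo. Field names
# ("results", "passed", "attempt") are assumed from the jq examples above.
import json

with open("flaky_test_results.json") as f:
    results = json.load(f)["results"]

failed_attempts = sorted(r["attempt"] for r in results if not r.get("passed"))
passed_attempts = sorted(r["attempt"] for r in results if r.get("passed"))

print(f"Failures on attempts: {failed_attempts}")
print(f"Passes on attempts:   {passed_attempts}")

# Heuristic: if every failure happens after every pass, suspect shared state
# or resource exhaustion rather than pure randomness.
if failed_attempts and passed_attempts and min(failed_attempts) > max(passed_attempts):
    print("Failures cluster late in the run order - check for shared state.")
```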
Common causes:
- Timing issues: time.sleep() vs waiting on a condition
- Random data: Non-deterministic test data
- External dependencies: API calls, databases
- Race conditions: Parallel execution issues
- Shared state: Tests affecting each other
Fixes:
# Bad: Time-based waiting
time.sleep(2) # Hope it's done...
# Good: Condition-based waiting
for _ in range(50):
    if condition_met():
        break
    time.sleep(0.1)
else:
    raise TimeoutError("Condition not met")
# Bad: Random data without seed
import random
value = random.randint(1, 100)
# Good: Seeded random (use TEST_SEED env var)
import random
import os
seed = int(os.environ.get('TEST_SEED', '12345'))
random.seed(seed)
value = random.randint(1, 100)

Symptoms:
- Repro rate: 90-100%
- Always fails
- Clear error message
Analysis:
# Single run is enough
python3 local_test.py
# Check the error
cat flaky_test_results.json | jq '.results[0] | {exit_code, stdout, stderr}'

Common causes:
- Assertion errors: Test expectations don't match reality
- Import errors: Missing dependencies, wrong paths
- Type errors: Wrong types passed to functions
- Logic errors: Incorrect implementation
Fix approach:
- Read the error message carefully
- Check if test or code is wrong
- Look at recent changes (git diff)
- Fix the root cause
- Verify locally before pushing
Symptoms:
- Fails in CI, passes locally (or vice versa)
- Different behavior on different machines
- Depends on Python version, OS, etc.
Analysis:
# Check environment differences
python --version
pip list
# Compare with CI environment
# (check .github/workflows/ci.yml for CI Python version)
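To make the comparison concrete, a small script can dump the details that most often differ between local and CI; run it in both places and diff the output. A sketch only, extend it with whatever your tests depend on:

```python
# env_report.py - a sketch, not part of the repo.
import os
import platform
import sys

print(f"Python:     {sys.version.split()[0]}")
print(f"Platform:   {platform.platform()}")
print(f"CWD:        {os.getcwd()}")
print(f"PYTHONPATH: {os.environ.get('PYTHONPATH', '<not set>')}")
print(f"TEST_SEED:  {os.environ.get('TEST_SEED', '<not set>')}")
print(f"CI:         {os.environ.get('CI', '<not set>')}")  # GitHub Actions sets CI=true
```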
Common causes:
- Different Python versions: 3.11 vs 3.12 behavior
- Missing environment variables: Needed in CI
- File paths: Absolute vs relative paths
- PYTHONPATH: Module import issues
- OS differences: macOS vs Linux
Fixes:
# Bad: Hardcoded paths
config_path = "/Users/me/project/config.yml"
# Good: Relative paths
import os
config_path = os.path.join(os.path.dirname(__file__), "config.yml")
# Bad: Assuming env var exists
api_key = os.environ["API_KEY"]
# Good: Provide defaults or skip
api_key = os.environ.get("API_KEY")
if not api_key:
    pytest.skip("API_KEY not set")

When asking the AI for analysis, include:

- The error message:
  AssertionError: assert 10 == 11
- The test code:
  def test_default_config(self):
      config = Config()
      assert config.get("runs") == 11
- Relevant implementation:
  DEFAULT_CONFIG = { "runs": 10, ... }
- Repro rate:
  100% failure (10/10 runs)
- Recent changes (if relevant):
  git log --oneline -5
  git diff HEAD~1
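If you do this often, the prompt can be assembled straight from the detector output. A sketch, with the same assumed field names as above ("results", "passed", "stderr"):

```python
# build_prompt.py - a sketch, not part of the repo; adjust field names to the
# actual flaky_test_results.json schema.
import json

with open("flaky_test_results.json") as f:
    results = json.load(f)["results"]

failures = [r for r in results if not r.get("passed")]
rate = 100.0 * len(failures) / len(results)
error = failures[0].get("stderr", "") if failures else "<no failures captured>"

prompt = (
    "I have a test failure with these details:\n\n"
    f"Repro rate: {rate:.0f}% ({len(failures)}/{len(results)} runs failed)\n"
    f"Error output from one failing run:\n{error}\n\n"
    "Please analyze:\n"
    "1. What is the root cause?\n"
    "2. Is the test wrong or the code wrong?\n"
    "3. What's the fix?\n"
)
print(prompt)
```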
For assertion errors:
I have a test that expects X but gets Y.
Test code: [paste code]
Implementation: [paste code]
Error: [paste error]
Is the test wrong or the implementation wrong?
What should the expected value be and why?
For flaky tests:
This test passes 60% of the time and fails 40%.
Test code: [paste code]
Error when it fails: [paste error]
What could cause intermittent failures?
How can I make this test deterministic?
For import errors:
Test fails with ModuleNotFoundError: No module named 'X'
The module exists in the project.
Directory structure: [paste tree output]
Test file location: [path]
Why can't it find the module?
How should I fix the imports or PYTHONPATH?
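One common answer to that last question is a conftest.py at the repository root that puts the project on sys.path, so tests import the same way locally and in CI (the alternative is prefixing commands with PYTHONPATH=. as shown earlier). A minimal sketch, assuming the modules live at the repo root rather than under src/:

```python
# conftest.py (repository root) - a sketch; adjust the path for src/ layouts.
import os
import sys

# Make project modules importable without relying on PYTHONPATH being set.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
```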
Time to resolution (with this workflow):
- Detection: < 2 minutes (CI)
- Analysis: < 1 minute (flaky detector)
- Root cause: < 1 minute (AI analysis)
- Fix: 1-5 minutes (depending on complexity)
- Verify: < 4 minutes (CI)
Total: 8-13 minutes from failure to verified fix
Traditional debugging:
- Detection: < 2 minutes (CI)
- Reproduce locally: 5-15 minutes
- Investigation: 10-60 minutes
- Fix: 5-30 minutes
- Verify: < 4 minutes (CI)
Total: 22-111 minutes (median: ~45 minutes)
Improvement: 70-85% faster with automated analysis
# Check if dependencies are installed
pip install -r requirements.txt
# Check if test_input.json is valid
cat test_input.json | jq .
# Run with verbose output
python3 local_test.py 2>&1 | tee debug.log

# Check if it was created
ls -la flaky_test_results.json
# Check write permissions
touch flaky_test_results.json
rm flaky_test_results.json

Improve your prompt:
- Include more context (related code, error traces)
- Ask specific questions
- Provide recent git changes if relevant
- Include environment details (Python version, OS)
- Always run flaky detector first before assuming it's a real bug
- Document the root cause in commit messages
- Update tests when fixing to prevent regression
- Use semantic commit messages for better git history
- Verify locally before pushing with ./scripts/run_all_checks.sh
- PREVENTING_CI_FAILURES.md - How to avoid failures
- QUICK_REFERENCE.md - Command cheat sheet
- CICD_INTEGRATION.md - Automated workflow setup
- QUALITY_CHECKS.md - Development standards