Automatically detect flaky tests by running them multiple times in parallel. Integrates seamlessly with GitHub Actions to provide immediate feedback on test failures.
# Run automated setup in your repository
bash <(curl -s https://raw.githubusercontent.com/runpod/testflake/main/setup.sh)See Getting Started Guide for detailed instructions.
- Getting Started - Quick setup guide (5 minutes)
- Quick Reference - Command cheat sheet
- Configuration Guide - Complete config reference
- All Documentation - Complete documentation index
Problem: Your CI tests fail randomly. Is it flaky or a real bug?
Solution: Automatically runs failing tests 20+ times to determine:
- π΄ 100% failure = Real bug, needs fixing
- π‘ 10-90% failure = Flaky test, needs stabilizing
- β 0% failure = One-time glitch, ignore
Result: Know immediately whether to fix the code or fix the test.
- π Auto-trigger: Runs automatically when PR tests fail
- β‘ Fast: Parallel execution with configurable workers
- π Multi-language: Python, Go, TypeScript, JavaScript
- π§ Easy setup: One script to install
- π Clear results: PR comments with severity and recommendations
- π‘οΈ Battle-tested: 96 tests, 91% coverage, production-ready
Catch issues before they reach CI with our multi-layer defense system:
# Run all CI checks locally in 30-60 seconds
./scripts/run_all_checks.sh
# Or validate the entire system end-to-end
python scripts/validate_flaky_detector.pyResults:
- β 90-95% reduction in CI debugging time
- β CI passes on first try >90% of the time
- β Faster feedback - 30-60s locally vs 3-5 min in CI
- β Prevents common bugs - Variable shadowing, type errors, shell quoting issues
- β System validation - End-to-end testing of the entire flaky detector
Four-layer defense:
- IDE/Editor - Real-time linting
- Pre-commit Hooks - Automatic checks on commit
- Local Test Script - Comprehensive verification before push
- CI Pipeline - Final safety net with system validation
Documentation:
- Preventing CI Failures β - Complete guide with examples
- Debugging Test Failures β - AI-assisted root cause analysis workflow
- Quick Reference β - Developer cheat sheet
- Quality Checks β - All tools and configurations
- Parallel Test Execution: Run tests multiple times concurrently to quickly identify flakiness
- Seed Randomization: Each test run uses a unique random seed to expose timing-dependent bugs
- Multi-Language Support: Python/pytest (built-in), Go, TypeScript/Jest, and more (see docs/MULTI_LANGUAGE.md)
- Automatic Dependency Installation: Installs requirements.txt automatically from cloned repositories
- π Auto-Trigger on Test Failures: Automatically runs when PR tests fail to immediately determine if it's flaky or a real bug
- CI/CD Integration: Deep integration with GitHub Actions with automatic PR comments and severity indicators
- Multi-Channel Reporting: Post results to PR comments with actionable recommendations
- Configuration File Support: Customize behavior per-repository with
.flaky-detector.yml - Historical Tracking: SQLite database tracks test results over time with trend analysis
- Interactive Dashboard: Streamlit-based dashboard for visualizing flakiness patterns
- Comprehensive Error Handling: Robust error handling for network issues, timeouts, and test failures
- Resource Cleanup: Automatic cleanup of temporary directories and working directory restoration
- Security Hardened: Protected against command injection with proper input validation
- Fully Tested: 40+ tests with 96% code coverage across all main modules
- Code Quality: Multi-layer defense system with ruff, pylint, mypy, bandit, actionlint
- CI/CD Quality Gates: Comprehensive automated checks with pre-commit hooks and local testing
- Workflow Validation: Catch GitHub Actions issues before CI, with optional AI suggestions
- CI Failure Prevention: 90-95% reduction in CI debugging time through early issue detection
Catch workflow errors before they reach CI with automated validation:
# Run all checks before pushing (comprehensive local testing)
./scripts/run_all_checks.sh
# Install pre-commit hooks (validates workflows automatically - no API key needed)
pip install pre-commit && pre-commit install
# Local validation (no API key needed)
python scripts/workflow_utils/validate_and_fix.py
# Optional: Get AI-powered fix suggestions (requires API key)
export ANTHROPIC_API_KEY="your-api-key"
python scripts/workflow_utils/validate_and_fix.py --ai-suggestFeatures:
- β Pre-commit hooks validate workflows before every commit (no setup required)
- π€ Optional AI suggestions using Claude API (requires
ANTHROPIC_API_KEY) - π¬ Validation results posted on PRs
- π Comprehensive validation reports in CI
- π‘οΈ Multi-layer defense: IDE β pre-commit β local script β CI
Note: Validation works fully without an API key. AI suggestions are an optional enhancement.
Documentation:
- Quality Checks β - All validation tools and setup
- Preventing CI Failures β - Best practices and common pitfalls
- Python 3.12 or higher
- Git installed on your system
- RunPod account (for deployment)
# Clone the repository
git clone https://github.com/runpod/testflake.git
cd testflake
# Install dependencies
pip install -r requirements.txt# Clone the repository
git clone https://github.com/runpod/testflake.git
cd testflake
# Install core dependencies
uv sync
# Install with dashboard support (optional)
uv sync --extra dashboard
# Install with development tools (optional)
uv sync --extra dev
# Install all extras
uv sync --all-extras# Core installation
pip install -e .
# With dashboard support
pip install -e ".[dashboard]"
# With development tools
pip install -e ".[dev]"
# With all optional dependencies
pip install -e ".[dashboard,dev]"Note on Dependencies: All package versions are pinned to specific releases (e.g., pytest==9.0.2) for reproducibility and stability. See requirements.txt for the complete list of pinned versions.
Optional Dependencies:
dashboard: Streamlit-based interactive dashboard (streamlit, plotly, pandas)dev: Development tools (ruff, mypy, pytest-cov)
Customize flaky test detector behavior per-repository with .flaky-detector.yml:
# Example configuration
runs: 150 # More thorough testing
parallelism: 15 # Faster execution
severity_thresholds:
medium: 0.05 # More sensitive to flakiness
ignore_patterns:
- "test_known_flaky_*" # Skip certain testsSee Configuration Guide for full reference.
Track test flakiness trends over time with the interactive dashboard:
streamlit run dashboard.py
# Opens at http://localhost:8501Dashboard features:
- π Overview metrics and statistics
- π Flakiness trend visualization over time
- π₯ Most flaky test commands
- π― Severity distribution charts
- π Filterable test run history
Test the flaky test detector with the included example:
# Run the example flaky test
pytest tests/test_flaky.py
# Run with a specific seed
TEST_SEED=12345 pytest tests/test_flaky.py
# Run multiple times to see flakiness
for i in {1..10}; do pytest tests/test_flaky.py; done
# Run all tests (40+ tests)
pytest tests/ -v
# Run with coverage report (only tested modules)
pytest tests/ --cov=worker --cov=config --cov=database --cov-report=term-missing
# Or use pytest built-in settings
pytest tests/ # Uses settings from pyproject.toml
# Run integration tests
python3 test_new_features.pyExplore complete flaky test examples for all supported languages:
# Python/pytest
cd examples/python
pip install -r requirements.txt
TEST_SEED=12345 pytest test_flaky.py -v
# Go
cd examples/go
GO_TEST_SEED=12345 go test -v
# TypeScript/Jest
cd examples/typescript-jest
npm install
JEST_SEED=12345 npm test
# TypeScript/Vitest
cd examples/typescript-vitest
npm install
VITE_TEST_SEED=12345 npm test
# JavaScript/Mocha
cd examples/javascript-mocha
npm install
MOCHA_SEED=12345 npm testEach example includes:
- β 6-12 realistic flaky test patterns
- β Seed configuration for reproducible randomness
- β Complete README with usage instructions
- β All necessary dependencies and configuration files
- β TEST_RESULTS.md with validation from 20-run analysis
Validation Results:
- Python: 26.7% average flakiness (most balanced)
- Go: 35.6% average flakiness (8 patterns tested)
- TypeScript/Jest: 44.0% average flakiness (10 patterns tested)
- TypeScript/Vitest: 50.5% average flakiness (partial reproducibility)
- JavaScript/Mocha: 43.8% average flakiness (12 patterns tested)
All examples have been validated with 20 test runs using different seeds, confirming reproducibility and realistic flaky behavior patterns.
See examples/README.md for detailed documentation.
You can test the worker function locally without deploying to RunPod:
# Start the worker (it will wait for jobs)
python worker.pyTo send a test job to the local worker, you'll need to use the RunPod SDK:
import runpod
# Configure for local testing
runpod.api_key = "your-api-key"
# Send a test job
result = runpod.run_sync(
endpoint_id="your-endpoint-id",
input={
"repo": "https://github.com/runpod/testflake",
"test_command": "pytest tests/test_flaky.py",
"runs": 50,
"parallelism": 5
}
)
print(result)This project includes comprehensive quality checks. See QUALITY_CHECKS.md for full details.
Run all checks locally:
# Lint code
ruff check .
# Auto-fix linting issues
ruff check . --fix
# Format code
ruff format .
# Type check
mypy worker.py config.py database.py
# Run tests with coverage (90% minimum, only tested modules)
pytest tests/ --cov=worker --cov=config --cov=database --cov-fail-under=90
# Run all checks at once
ruff check . && mypy worker.py config.py database.py && pytest tests/ --cov=worker --cov=config --cov=database --cov-fail-under=90Quality Standards:
- β Ruff linting (PEP 8, imports, bugbear, simplify)
- β Mypy type checking (strict mode)
- β 90% minimum test coverage (current: 96.7%)
- β Coverage measured on core modules only (worker, config, database)
- β Automated in CI/CD (see CI/CD Integration below)
Note: Coverage only measures the core modules we have tests for (worker.py, config.py, database.py), not UI code (dashboard.py) or integration scripts (scripts/).
The serverless function accepts the following input parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
repo |
string | Yes | - | Git repository URL (must start with https:// or git@) |
test_command |
string | Yes | - | Test command to execute (e.g., pytest tests/) |
runs |
integer | No | 10 | Number of times to run the test (1-1000) |
parallelism |
integer | No | 4 | Number of parallel workers (1-50) |
test_input.json - Simple test configuration:
{
"repo": "https://github.com/runpod/testflake",
"test_command": "pytest tests/test_flaky.py",
"runs": 50,
"parallelism": 5
}input.json - Production configuration:
{
"repo": "https://github.com/runpod/testflake",
"test_command": "pytest tests/test_flaky.py",
"runs": 100,
"parallelism": 8
}The flaky test detector includes two automated workflows:
Ensures code quality with automated checks:
Stage 1: Lint and Type Check
- β Ruff linting (code style, imports, common bugs)
- β Code formatting check
- β Mypy type checking (strict mode)
Stage 2: Test Suite (runs after lint passes)
- β Full test suite (40+ tests)
- β Coverage reporting (90% minimum required)
- β Coverage reports uploaded as artifacts
- β PR comments with coverage status
- β Change detection with commit tracking
- β Detailed summary with file changes and commit history
Workflow: .github/workflows/ci.yml
Change Detection Features:
- Automatically identifies code changes since last successful run
- Shows commit history with authors and messages
- Lists changed files by category (Python files, test files, workflow files)
- Highlights potentially breaking changes when tests fail
- Provides diff statistics in expandable sections
Automatically detects flaky tests when CI fails:
Setup Steps:
-
Add GitHub Secrets
Go to:
Settings β Secrets and variables β Actions β New repository secretAdd these secrets:
RUNPOD_API_KEY = <your RunPod API key> RUNPOD_ENDPOINT_ID = <your endpoint ID> SLACK_WEBHOOK_URL = <optional, for Slack notifications>Get your RunPod credentials from:
- API Key: https://www.runpod.io/console/user/settings
- Endpoint ID: Your RunPod serverless endpoint
-
Using GitHub CLI (alternative):
gh secret set RUNPOD_API_KEY --body "your-api-key" gh secret set RUNPOD_ENDPOINT_ID --body "your-endpoint-id" gh secret set SLACK_WEBHOOK_URL --body "your-slack-webhook" # optional
-
Verify Workflow Configuration
Edit
.github/workflows/flaky-test-detector.ymlline 5:workflows: ["CI"] # Match your CI workflow name
-
Test the Integration
Create a test branch with a failing test:
git checkout -b test-flaky-detection # Make a test fail temporarily git commit -am "Test flaky detector" git push -u origin test-flaky-detection
Create a PR β CI fails β Flaky detector runs automatically β Check PR comments
What Happens Automatically:
- CI tests fail
- Flaky detector workflow triggers
- Runs failed test 100x in parallel on RunPod
- Analyzes failure pattern
- Posts PR comment with severity:
- π΄ CRITICAL (>90%) - Real bug, not flaky
- π HIGH (50-90%) - Very unstable, fix before merge
- π‘ MEDIUM (10-50%) - Flaky test, should fix
- π’ LOW (1-10%) - Occasional flakiness
- β NONE (0%) - One-time issue, safe to merge
- Sends Slack notification (if configured)
- Uploads detailed results as artifacts
Workflow: .github/workflows/flaky-test-detector.yml
Cost: ~$0.024 per detection run (100 tests, 2 minutes)
To enable Slack notifications that automatically tag commit authors:
-
Get Slack Webhook URL
# Create incoming webhook in Slack: # Workspace Settings β Apps β Incoming Webhooks β Add to Slack gh secret set SLACK_WEBHOOK_URL --body "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
-
Find Slack User IDs
- Open Slack β Click on user's profile
- Click "β― More" β "Copy member ID"
- Example:
U01234ABCD
-
Create GitHub-to-Slack Mapping
# Create a JSON mapping of GitHub username β Slack user ID gh secret set GITHUB_SLACK_MAP --body '{ "octocat": "U01234ABCD", "github-username": "U56789EFGH", "another-user": "U01112IJKL" }'
Slack notification will include:
- Flakiness severity with color coding
- Test statistics (runs, failures, rate)
- Recent commits with author mentions
- Files changed (if available)
- Direct tags/mentions for commit authors
- Button to view in GitHub Actions
Example notification:
π‘ MEDIUM Flaky Test Detected
Repository: user/repo
Failure Rate: 35.0%
Total Runs: 100
Failed Runs: 35
Recent Commits (3):
β’ a1b2c3d Update worker.py validation - @john-slack
β’ e4f5g6h Fix timing issue - @jane-slack
β’ i7j8k9l Add error handling - @bob-slack
FYI: @john-slack, @jane-slack, @bob-slack
[View in GitHub Actions]
Workflow: .github/workflows/flaky-test-detector.yml
Cost: ~$0.024 per detection run (100 tests, 2 minutes)
The project includes two Dockerfile options:
Includes Python, Node.js, and Go runtimes for testing projects in multiple languages.
- Size: ~2.1 GB
- Supports: Python, Go, TypeScript/Jest, TypeScript/Vitest, JavaScript/Mocha
- Use when: You have polyglot projects or need to test multiple languages
# Build multi-language image
docker build -t your-username/flaky-test-detector:latest .
# Push to Docker Hub
docker push your-username/flaky-test-detector:latestSmaller image with only Python runtime for Python/pytest projects.
- Size: ~1.5 GB
- Supports: Python/pytest only
- Use when: You only need Python test support
- Note: Includes all dependencies (Streamlit, Plotly, etc.). For a minimal production image (~285MB), use requirements-minimal.txt (contains only runpod, pytest, PyYAML)
# Build Python-only image
docker build -f Dockerfile.python-only -t your-username/flaky-test-detector:python-only .
# Push to Docker Hub
docker push your-username/flaky-test-detector:python-onlyIncluded Dockerfile provides the multi-language setup:
FROM python:3.12-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
git curl wget ca-certificates gnupg \
&& rm -rf /var/lib/apt/lists/*
# Install Node.js 20.x for TypeScript/JavaScript
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
apt-get install -y nodejs && \
rm -rf /var/lib/apt/lists/*
# Install Go 1.22
RUN wget -q https://go.dev/dl/go1.22.0.linux-amd64.tar.gz && \
tar -C /usr/local -xzf go1.22.0.linux-amd64.tar.gz && \
rm go1.22.0.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY worker.py run.sh .
RUN chmod +x run.sh
# Verify all runtimes
RUN python --version && node --version && go version
CMD ["./run.sh"]- Log in to RunPod
- Navigate to "Serverless" section
- Click "New Endpoint"
- Configure your endpoint:
- Name: Flaky Test Detector
- Docker Image:
your-username/flaky-test-detector:latest - Container Disk: 10 GB (adjust based on your needs)
- GPU Type: CPU or GPU based on your test requirements
- Click "Deploy"
After deployment, note your endpoint ID from the RunPod dashboard. You'll use this to send jobs.
π Configuration Guide: See TEST_INPUT_FILES.md for detailed information about configuring test runs, including local path support and best practices.
Using the RunPod Python SDK:
import runpod
runpod.api_key = "your-runpod-api-key"
# Run a flaky test detection job
job = runpod.Endpoint("your-endpoint-id").run(
{
"repo": "https://github.com/your-org/your-repo",
"test_command": "pytest tests/test_checkout.py::test_payment_processing",
"runs": 100,
"parallelism": 10
}
)
# Wait for results
result = job.output()
print(result)Using cURL:
curl -X POST https://api.runpod.ai/v2/your-endpoint-id/run \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-runpod-api-key" \
-d '{
"input": {
"repo": "https://github.com/your-org/your-repo",
"test_command": "pytest tests/test_checkout.py",
"runs": 100,
"parallelism": 10
}
}'The function returns a summary of test results:
{
"total_runs": 100,
"parallelism": 10,
"failures": 23,
"repro_rate": 0.23,
"results": [
{
"attempt": 0,
"exit_code": 0,
"stdout": "test output...",
"stderr": "",
"passed": true
},
{
"attempt": 1,
"exit_code": 1,
"stdout": "test output...",
"stderr": "assertion error...",
"passed": false
}
]
}Output Fields:
total_runs: Total number of test executionsparallelism: Number of parallel workers usedfailures: Number of failed test runsrepro_rate: Failure rate as a decimal (0.23 = 23% failure rate)results: Array of individual test run results, sorted by attempt number
# Test for race conditions in concurrent operations
runpod.Endpoint("your-endpoint-id").run({
"repo": "https://github.com/your-org/api-service",
"test_command": "pytest tests/test_concurrent_api.py -v",
"runs": 200,
"parallelism": 20
})# Run tests with different random seeds
runpod.Endpoint("your-endpoint-id").run({
"repo": "https://github.com/your-org/game-engine",
"test_command": "pytest tests/test_game_logic.py",
"runs": 500,
"parallelism": 25
})# Ensure tests are stable before merging
runpod.Endpoint("your-endpoint-id").run({
"repo": "https://github.com/your-org/web-app",
"test_command": "pytest tests/integration/",
"runs": 50,
"parallelism": 10
})Error: Failed to clone repository
Solutions:
- Verify the repository URL is correct and accessible
- For private repositories, ensure authentication is configured
- Check if the repository requires SSH keys or tokens
Error: Warning: Failed to install dependencies
Solutions:
- Check that
requirements.txtis valid - Verify all package names and versions are correct
- Ensure compatible Python version (3.12+)
Error: TIMEOUT
Solutions:
- Individual test runs have a 5-minute timeout
- Consider splitting long-running tests into smaller units
- Reduce the number of parallel workers if system resources are limited
Solutions:
- Reduce the
parallelismparameter - Increase the container memory allocation in RunPod settings
- Check for memory leaks in your test suite
Error: Invalid repository URL or ValueError
Solutions:
- Ensure repository URLs start with
https://orgit@ - Avoid special characters in test commands
- Use proper quoting for complex test commands
Error: ImportError while importing test module 'tests/test_config.py'
Problem: Tests can't find project modules (config, database, worker) because the project root isn't in Python's import path in GitHub Actions.
Solution: Add PYTHONPATH environment variable to your test job:
- name: Run tests
env:
PYTHONPATH: ${{ github.workspace }}
run: |
pytest tests/Why this happens:
- Locally: Current directory is automatically in
sys.path - GitHub Actions: Project root must be explicitly added to
PYTHONPATH - The fix ensures Python looks in the workspace root for imports
Alternative solution: Install as editable package:
- name: Install package
run: pip install -e .- Repository URLs are validated to prevent command injection
- Test commands are parsed with
shlex.split()for safe execution - Input parameters have strict bounds checking
- Temporary directories are automatically cleaned up
- Security scanning with Bandit in pre-commit hooks
π Complete Documentation Index β
- Getting Started Guide - β Quick setup (5 minutes)
- Quick Reference - β Command cheat sheet
- Configuration Guide - Complete config reference
- Migration Guide - Moving to a new repo/org
- RunPod Deployment - β Deploy to RunPod serverless
- CI/CD Integration - GitHub Actions setup
- Debugging Test Failures - Complete workflow
- Multi-Language Support - Go, TypeScript, JavaScript
- Preventing CI Failures - Multi-layer defense
- Architecture - System design & internals
- Quality Checks - Development standards
- Examples - Flaky test examples in 5 languages
For Developers:
# Before every push - runs all CI checks locally
./scripts/run_all_checks.sh
# Install pre-commit hooks for automatic validation
pre-commit install
# Read the prevention guide
cat docs/PREVENTING_CI_FAILURES.mdFor RunPod Deployment:
# Build and deploy to RunPod serverless
docker build -t your-username/testflake:latest .
docker push your-username/testflake:latest
# Read the deployment tutorial
cat docs/RUNPOD_TUTORIAL.mdFor CI/CD Integration:
# Set up GitHub secrets
gh secret set RUNPOD_API_KEY --body "your-key"
gh secret set RUNPOD_ENDPOINT_ID --body "your-id"
# Read the integration guide
cat docs/CICD_INTEGRATION.mdContributions are welcome! Please follow our multi-layer quality process:
- Fork the repository
- Create a feature branch
- Install development tools:
pip install -e ".[dev]" - Install pre-commit hooks:
pre-commit install
- Make your changes
- Add tests if applicable
- Run comprehensive checks:
./scripts/run_all_checks.sh - Pre-commit hooks will run automatically on
git commit
- Ensure all checks pass locally:
./scripts/run_all_checks.sh - Verify tests pass with coverage:
pytest --cov=. - Check code quality:
ruff check . # Linting ruff format . # Formatting pylint scripts/ tests/ # Deep analysis mypy scripts/ tests/ # Type checking
- Read PREVENTING_CI_FAILURES.md for best practices
- Submit a pull request
- β All tests must pass
- β Coverage must stay β₯90%
- β Ruff linting must pass
- β Pylint score β₯8.0/10
- β No mypy type errors
- β Workflow validation (if modifying .github/)
- β Security scan with bandit
Note: Our CI rarely fails because of the multi-layer defense system. If CI fails, it's usually caught by local checks first!
This project is provided as-is for detecting flaky tests in your codebase.
For issues or questions:
- Open an issue on GitHub
- Check the RunPod documentation
- Review the
docs/CLAUDE.mdfile for development guidance