Version: 1.0.0
Date: 2025-01-07
Purpose: Automated deployment scripts for RETE2 staging and production rollout
This directory contains automation scripts for deploying RETE2 using the Strangler Fig pattern with phased production rollout.
Scripts:
- staging-deploy.sh - Automated staging deployment
- production-rollout.sh - Phased production rollout manager
- validate-staging.sh - Staging validation suite
# Simple one-command deployment
./staging-deploy.sh
# Or with custom staging host
STAGING_HOST=my-staging.example.com ./staging-deploy.sh

What it does:
- ✅ Checks prerequisites (Go, tests, tools)
- ✅ Builds optimized binary
- ✅ Creates shadow mode configuration (0% RETE2)
- ✅ Deploys to staging server
- ✅ Sets up systemd service
- ✅ Validates deployment
Duration: ~10-15 minutes
# Run after 4-6 hours of staging runtime
./validate-staging.sh
# Or with custom endpoints
STAGING_HOST=my-staging.example.com \
PROMETHEUS_URL=http://prometheus:9090 \
./validate-staging.sh

What it does:
- ✅ Runs 10 automated tests
- ✅ Checks for divergences (must be 0)
- ✅ Validates performance metrics
- ✅ Checks circuit breaker, cache, memory
- ✅ Provides GO/NO-GO recommendation
Duration: ~2-3 minutes
# Check current status
./production-rollout.sh --status
# Deploy specific phase
./production-rollout.sh --phase 0 # Shadow mode
./production-rollout.sh --phase 1 # 1% traffic
./production-rollout.sh --phase 2 # 10% traffic
./production-rollout.sh --phase 3 # 25% traffic
./production-rollout.sh --phase 4 # 50% traffic (CRITICAL)
./production-rollout.sh --phase 5 # 75% traffic
./production-rollout.sh --phase 6 # 100% traffic
# Emergency rollback
./production-rollout.sh --rollback 0
# Validate current phase
./production-rollout.sh --validate

Phases:
- Phase 0: Shadow (0% RETE2, 100% comparison) - 48h soak
- Phase 1: Canary (1% traffic) - 6h soak
- Phase 2: Early (10% traffic) - 8h soak
- Phase 3: Quarter (25% traffic) - 12h soak
- Phase 4: Half (50% traffic) - 24h soak ⚠️ CRITICAL GATE
- Phase 5: Majority (75% traffic) - 48h soak
- Phase 6: Full (100% traffic) - 7+ days soak
Duration: 2-4 weeks total (staging → production 100%)
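The phase schedule above can be expressed as a small lookup, which is handy for sanity-checking the total soak time. This is a minimal sketch, not the logic inside production-rollout.sh: the function names are hypothetical, and the values are copied from the phase list (with "7+ days" taken as its 168-hour minimum).

```bash
# Hypothetical helpers; percentages and soak times mirror the phase list above.

# Percentage of traffic routed to RETE2 in a given phase.
phase_traffic_percent() {
  case "$1" in
    0) echo 0 ;;
    1) echo 1 ;;
    2) echo 10 ;;
    3) echo 25 ;;
    4) echo 50 ;;
    5) echo 75 ;;
    6) echo 100 ;;
    *) echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# Minimum soak time, in hours, before moving on from a given phase.
phase_min_soak_hours() {
  case "$1" in
    0) echo 48 ;;
    1) echo 6 ;;
    2) echo 8 ;;
    3) echo 12 ;;
    4) echo 24 ;;
    5) echo 48 ;;
    6) echo 168 ;;  # "7+ days" taken as its minimum
    *) echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# Sum of the minimum soak times across phases 0-6.
total_min_soak_hours() {
  total=0
  for p in 0 1 2 3 4 5 6; do
    total=$((total + $(phase_min_soak_hours "$p")))
  done
  echo "$total"
}
```

The minimum soak times add up to 314 hours, roughly 13 days, which is consistent with the 2-4 week estimate once staging and approvals are included.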
- Staging: ../../docs/STAGING_DEPLOYMENT_PLAYBOOK.md (1,159 lines)
- Production: ../../docs/PRODUCTION_ROLLOUT_PLAYBOOK.md (1,064 lines)
- Checklist: ../../docs/DEPLOYMENT_CHECKLIST.md (764 lines)
- ../../DEPLOYMENT_READY.md - Start here for overview
- Migration Guide: ../../docs/MIGRATION_GUIDE.md
- Troubleshooting: ../../docs/MIGRATION_TROUBLESHOOTING.md
- Rollback Procedure: ../../docs/ROLLBACK_PROCEDURE.md
- Configuration: ../../docs/MIGRATION_CONFIG.md
All scripts support configuration via environment variables:
# Staging deployment
STAGING_HOST=staging.example.com
STAGING_USER=deploy
STAGING_PORT=8080
# Production rollout
PROD_HOST=production.example.com
PROD_USER=deploy
PROMETHEUS_URL=http://prometheus:9090
# Validation
MIN_COMPARISONS=1000
MAX_DIVERGENCES=0

If not specified, scripts use these defaults:
- Staging host: staging.example.com
- Production host: production.example.com
- User: deploy
- Prometheus: http://prometheus:9090
- App port: 8080
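One common way to implement "environment variable or default" in a shell script is the `${VAR:=default}` expansion. The sketch below assumes that approach; the function name is hypothetical and the actual scripts may set their defaults differently. Only the defaults listed above are included.

```bash
# Apply the documented defaults to any variable that is unset or empty.
# ":" is a no-op command; the ${VAR:=default} expansion performs the assignment.
load_deploy_defaults() {
  : "${STAGING_HOST:=staging.example.com}"
  : "${STAGING_USER:=deploy}"
  : "${STAGING_PORT:=8080}"
  : "${PROD_HOST:=production.example.com}"
  : "${PROD_USER:=deploy}"
  : "${PROMETHEUS_URL:=http://prometheus:9090}"
}
```

Values already present in the environment are left untouched, so `STAGING_PORT=9999 ./staging-deploy.sh` would still override the default under this scheme.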
Purpose: Fully automated staging deployment
Features:
- Prerequisites validation (Go, curl, jq, SSH, tests)
- Optimized binary build (CGO_ENABLED=0, Linux/amd64)
- Shadow mode configuration (0% RETE2)
- SSH deployment automation
- Systemd service setup
- Health check validation
- Color-coded output
Exit Codes:
- 0 - Success
- 1 - Error (prerequisites, build, or deployment failed)
Logs: Output to stdout with colored status
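The prerequisites validation listed above could look roughly like the following. This is a sketch under the assumption that each prerequisite is a command on PATH; `require_tools` is a hypothetical name, not a function taken from staging-deploy.sh.

```bash
# Return 0 if every named tool is on PATH.
# Otherwise report the missing tools on stderr and return 1.
require_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing required tools:$missing" >&2
    return 1
  fi
}
```

A deployment script would typically call it once near the top, for example `require_tools go curl jq ssh || exit 1`, so that a missing tool produces exit code 1 before any build or deployment step starts.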
Purpose: Phased production rollout with validation
Features:
- 6-phase rollout management (0% → 100%)
- Runtime configuration updates via API
- Automatic metrics validation (Prometheus)
- Soak period management with countdown
- Periodic validation during soak (every 30min)
- Rollback automation
- Status reporting
- Multiple operation modes
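To make the soak countdown and the 30-minute validation cadence concrete, here is a small sketch of the arithmetic involved. Both helper names are hypothetical and the output format is an assumption; production-rollout.sh may present its countdown differently.

```bash
# Format a remaining soak time, given in seconds, as "HHh MMm".
format_remaining() {
  secs=$1
  if [ "$secs" -lt 0 ]; then secs=0; fi
  printf '%02dh %02dm\n' $((secs / 3600)) $((secs % 3600 / 60))
}

# Number of periodic validations in a soak of N hours at 30-minute intervals.
validations_during_soak() {
  echo $(( $1 * 60 / 30 ))
}
```

For example, `format_remaining 5400` prints `01h 30m`, and a 6-hour Phase 1 soak corresponds to 12 periodic validations.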
Modes:
--status # Show current rollout status
--phase N # Execute specific phase (0-6)
--auto # Auto-execute all phases (with confirmations)
--rollback [N] # Rollback to phase N (default: 0)
--validate # Validate current phase metrics
--help # Show usage

Validation Checks:
- Error rate (threshold: <0.1%)
- Divergences (must be 0)
- Circuit breaker state (must be CLOSED)
- Fallback rate (threshold: <1%)
- Memory growth (phases 4+, threshold: <10%)
Exit Codes:
- 0 - Success
- 1 - Validation failed or user abort
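The validation thresholds above translate into a simple gate. The sketch below assumes the metric values have already been fetched (for instance from Prometheus) and are passed in as arguments; the function name and argument order are hypothetical. The memory growth check for phases 4+ is left out to keep the example short.

```bash
# check_rollout_metrics ERROR_RATE_PCT DIVERGENCES BREAKER_STATE FALLBACK_RATE_PCT
# Prints one line per failed check and returns 0 only if every check passes.
check_rollout_metrics() {
  error_rate=$1
  divergences=$2
  breaker=$3
  fallback_rate=$4
  failed=0

  # awk handles the floating-point comparisons that [ ] cannot.
  if ! awk -v v="$error_rate" 'BEGIN { exit !(v < 0.1) }'; then
    echo "FAIL: error rate ${error_rate}% (threshold: <0.1%)"
    failed=1
  fi
  if [ "$divergences" -ne 0 ]; then
    echo "FAIL: $divergences divergences (must be 0)"
    failed=1
  fi
  if [ "$breaker" != "CLOSED" ]; then
    echo "FAIL: circuit breaker is $breaker (must be CLOSED)"
    failed=1
  fi
  if ! awk -v v="$fallback_rate" 'BEGIN { exit !(v < 1) }'; then
    echo "FAIL: fallback rate ${fallback_rate}% (threshold: <1%)"
    failed=1
  fi
  return "$failed"
}
```

Calling `check_rollout_metrics 0.05 0 CLOSED 0.5` returns 0, while any single value outside its threshold yields a non-zero status, matching the exit code convention above.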
Purpose: Comprehensive staging validation
10 Automated Tests:
- Service Health - Health endpoint, metrics endpoint
- Prometheus Integration - Scraping, target UP, metrics collection
- Comparator Validation ⚠️ CRITICAL - Divergences must be 0
- Performance Metrics - Latency, error rate, throughput
- Circuit Breaker - State, failures, successes
- Cache Performance - Hit rate (>80%), eviction rate
- Memory Stability - Growth (<5%), GC frequency
- Fallback Behavior - Count, rate
- Grafana Dashboard - Availability check
- Log Analysis - Manual check prompts
Output:
========================================
Validation Summary
========================================
Passed: 8
Warnings: 2
Failed: 0
✅ ALL CHECKS PASSED WITH WARNINGS
Review warnings before proceeding to production
Exit Codes:
- 0 - All checks passed (warnings acceptable)
- 1 - One or more checks failed (BLOCKER)
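The relationship between the summary counts and the exit code can be sketched as follows. The function name is hypothetical, and the wording of the failure line is an assumption; only the two "ALL CHECKS PASSED" messages come from the sample output above.

```bash
# print_validation_summary PASSED WARNINGS FAILED
# Prints the counts and a verdict; returns 1 if any check failed, else 0.
print_validation_summary() {
  passed=$1
  warnings=$2
  failed=$3
  echo "Passed: $passed"
  echo "Warnings: $warnings"
  echo "Failed: $failed"
  if [ "$failed" -gt 0 ]; then
    echo "ONE OR MORE CHECKS FAILED"  # assumed wording
    return 1
  elif [ "$warnings" -gt 0 ]; then
    echo "ALL CHECKS PASSED WITH WARNINGS"
  else
    echo "ALL CHECKS PASSED"
  fi
}
```

Warnings alone do not change the exit status, so a caller such as a CI job can rely on the return code to block promotion only when a check actually fails.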
# 1. Deploy staging
./staging-deploy.sh
# 2. Wait 4-6 hours for data accumulation
# Monitor: http://staging:3000/d/rete2-migration
# 3. Validate staging
./validate-staging.sh
# Expected output: ✅ ALL CHECKS PASSED
# Critical: 0 divergences RETE1 vs RETE2

# Deploy to production in shadow mode
./production-rollout.sh --phase 0
# Wait 48+ hours, monitor divergences
# Must be 0 before proceeding

# Execute phases sequentially with soak periods
./production-rollout.sh --phase 1 # Wait 6h
./production-rollout.sh --phase 2 # Wait 8h
./production-rollout.sh --phase 3 # Wait 12h
./production-rollout.sh --phase 4 # Wait 24h (CRITICAL GATE)
./production-rollout.sh --phase 5 # Wait 48h
./production-rollout.sh --phase 6 # Wait 7+ days

# Check status anytime
./production-rollout.sh --status
# Validate current phase
./production-rollout.sh --validate

# Instant rollback to shadow mode (0%)
./production-rollout.sh --rollback 0
# Rollback to previous phase
./production-rollout.sh --rollback 3 # Back to 25%

- All local tests pass: make test
- No race conditions: go test ./... -race
- Staging infrastructure provisioned
- Prometheus + Grafana ready
- Staging validated with 0 divergences
- Baseline metrics collected
- Team trained on runbook
- Rollback drills completed (2+)
- On-call schedule established
- Stakeholders notified
GO (proceed):
- ✅ 0 divergences RETE1 vs RETE2 (MANDATORY)
- ✅ All metrics within thresholds
- ✅ Service stable for soak period
- ✅ Team approval (Tech Lead + Ops Lead)
NO-GO (stop/rollback):
- ❌ Any divergences detected
- ❌ Metrics exceed critical thresholds
- ❌ Service crashes or instability
- ❌ Team not ready
Cause: SSH access issue
Fix:
# Test SSH access
ssh $STAGING_USER@$STAGING_HOST "echo OK"
# If this fails, check SSH keys or credentials

Cause: Prometheus not accessible or not scraping
Fix:
# Check Prometheus health
curl http://prometheus:9090/-/healthy
# Check targets
curl http://prometheus:9090/api/v1/targets

Cause: RETE1 and RETE2 producing different results
Fix:
- Check divergence logs: ssh staging "sudo cat /var/log/rete2/divergences.log"
- Analyze divergence types (error, value, count, factset)
- Fix code issue in RETE2
- Re-deploy staging and re-validate
DO NOT proceed to production with divergences
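When analyzing divergence types, a per-type count is usually the first thing worth looking at. The sketch below is illustrative only: it assumes each log line carries a `type=<name>` field, which is a guess at the format of divergences.log rather than a documented fact, and the function name is hypothetical.

```bash
# Count divergences per type from log lines read on stdin.
# ASSUMPTION: each line contains a field of the form type=<name>,
# e.g. "type=value"; adjust the pattern to the real log format.
count_divergence_types() {
  awk '
    match($0, /type=[a-z]+/) {
      t = substr($0, RSTART + 5, RLENGTH - 5)  # strip the "type=" prefix
      counts[t]++
    }
    END { for (t in counts) print t, counts[t] }
  ' | sort
}
```

Under that assumption it could be fed directly from the staging host, for example `ssh staging "sudo cat /var/log/rete2/divergences.log" | count_divergence_types`.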
Cause: Performance degradation or errors
Fix:
# Check specific metrics
curl 'http://prometheus:9090/api/v1/query?query=rete2_errors_total'
# Review Grafana dashboard
# Dashboard: http://grafana:3000/d/rete2-production
# If persistent, rollback:
./production-rollout.sh --rollback 0

- 0 divergences (MANDATORY)
- Service stable 24+ hours
- All validation tests pass
- Rollback <60s tested
- Baseline metrics collected
- 100% traffic on RETE2
- 30+ days stable
- Error rate ≤ RETE1 baseline
- Latency ≤ RETE1 baseline
- 0 critical incidents
- RETE1 decommissioned
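The "Rollback <60s tested" criterion above can be checked during a drill by timing the rollback command. This is a minimal sketch with a hypothetical helper name; it measures wall-clock time at one-second resolution and says nothing about how long traffic takes to drain.

```bash
# completes_within MAX_SECONDS COMMAND [ARGS...]
# Runs the command and succeeds only if it exits 0 in under MAX_SECONDS.
completes_within() {
  max=$1
  shift
  start=$(date +%s)
  "$@" || return 1
  elapsed=$(( $(date +%s) - start ))
  echo "elapsed: ${elapsed}s (limit: ${max}s)"
  [ "$elapsed" -lt "$max" ]
}
```

In a rollback drill this might be used as `completes_within 60 ./production-rollout.sh --rollback 0`.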
- Migration Guide Review (1h) - Read docs/MIGRATION_GUIDE.md
- Runbook Training (30min) - Read docs/MIGRATION_TROUBLESHOOTING.md
- Rollback Drills (1h) - Practice emergency rollback
- Dashboard Training (30min) - Grafana dashboard familiarization
Total: 3-4 hours per team member
- Documentation: ../../docs/
- Playbooks: See links above
- Runbook: ../../docs/MIGRATION_TROUBLESHOOTING.md
- Emergency: See DEPLOYMENT_CHECKLIST.md for contacts
- DO NOT hardcode credentials in scripts
- Use SSH keys for authentication
- Use environment variables for sensitive config
- Rotate credentials after deployment
- Limit SSH access to deployment users
- Use sudo only when necessary
- Audit all production changes
- Log all deployments
- Initial release
- Staging deployment automation
- Production phased rollout (6 phases)
- Automated validation suite
- Rollback automation
- Comprehensive documentation
- Follow existing script structure
- Use set -euo pipefail for error handling
- Add colored output (RED/GREEN/YELLOW/BLUE)
- Include --help usage
- Document in this README
- Make executable: chmod +x script.sh
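A new script that follows these conventions might start from a skeleton like the one below. It is a sketch, not a copy of the existing scripts: the file name new-script.sh, the log helper names, and the exact color codes are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical skeleton for a new script; names are illustrative.
set -euo pipefail

# ANSI color codes for status output.
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'  # reset

log_info()  { printf "${BLUE}[INFO]${NC} %s\n" "$*"; }
log_ok()    { printf "${GREEN}[OK]${NC} %s\n" "$*"; }
log_warn()  { printf "${YELLOW}[WARN]${NC} %s\n" "$*"; }
log_error() { printf "${RED}[ERROR]${NC} %s\n" "$*" >&2; }

usage() {
  cat <<'EOF'
Usage: new-script.sh [--help]

Options:
  --help   Show this help and exit
EOF
}

main() {
  for arg in "$@"; do
    case "$arg" in
      --help) usage; return 0 ;;
      *) log_error "Unknown option: $arg"; usage >&2; return 1 ;;
    esac
  done
  log_info "Starting"
  # Script steps go here.
  log_ok "Done"
}

main "$@"
```

Keeping the work inside `main` and returning a status from it means `set -e` turns any failure into a non-zero exit code, in line with the exit code conventions used by the other scripts.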
# Dry-run mode (if supported)
DRY_RUN=1 ./staging-deploy.sh
# Test in isolated environment first
# Never test directly in production

- Project Status: ../../NEXT_STEPS.md
- Remaining Work: ../../RETE2_FINAL_REMAINING_WORK.md
- Session Notes: ../../SESSION_2025-01-07_DEPLOYMENT_AUTOMATION.md
Version: 1.0.0
Last Updated: 2025-01-07
Maintainer: RETE2 Migration Team
Ready to deploy? Start with:
./staging-deploy.sh

Questions? See ../../DEPLOYMENT_READY.md