README.md

🚀 RETE2 Deployment Scripts

Version: 1.0.0
Date: 2025-01-07
Purpose: Automated deployment scripts for RETE2 staging and production rollout


📋 Overview

This directory contains automation scripts for deploying RETE2 using the Strangler Fig pattern with phased production rollout.

Scripts:

  1. staging-deploy.sh - Automated staging deployment
  2. production-rollout.sh - Phased production rollout manager
  3. validate-staging.sh - Staging validation suite

🚀 Quick Start

1. Deploy to Staging

# Simple one-command deployment
./staging-deploy.sh

# Or with custom staging host
STAGING_HOST=my-staging.example.com ./staging-deploy.sh

What it does:

  • ✅ Checks prerequisites (Go, tests, tools)
  • ✅ Builds optimized binary
  • ✅ Creates shadow mode configuration (0% RETE2)
  • ✅ Deploys to staging server
  • ✅ Sets up systemd service
  • ✅ Validates deployment

Duration: ~10-15 minutes


2. Validate Staging

# Run after 4-6 hours of staging runtime
./validate-staging.sh

# Or with custom endpoints
STAGING_HOST=my-staging.example.com \
PROMETHEUS_URL=http://prometheus:9090 \
./validate-staging.sh

What it does:

  • ✅ Runs 10 automated tests
  • ✅ Checks for divergences (must be 0)
  • ✅ Validates performance metrics
  • ✅ Checks circuit breaker, cache, memory
  • ✅ Provides GO/NO-GO recommendation

Duration: ~2-3 minutes


3. Production Rollout

# Check current status
./production-rollout.sh --status

# Deploy specific phase
./production-rollout.sh --phase 0  # Shadow mode
./production-rollout.sh --phase 1  # 1% traffic
./production-rollout.sh --phase 2  # 10% traffic
./production-rollout.sh --phase 3  # 25% traffic
./production-rollout.sh --phase 4  # 50% traffic (CRITICAL)
./production-rollout.sh --phase 5  # 75% traffic
./production-rollout.sh --phase 6  # 100% traffic

# Emergency rollback
./production-rollout.sh --rollback 0

# Validate current phase
./production-rollout.sh --validate

Phases:

  • Phase 0: Shadow (0% RETE2, 100% comparison) - 48h soak
  • Phase 1: Canary (1% traffic) - 6h soak
  • Phase 2: Early (10% traffic) - 8h soak
  • Phase 3: Quarter (25% traffic) - 12h soak
  • Phase 4: Half (50% traffic) - 24h soak ⚠️ CRITICAL GATE
  • Phase 5: Majority (75% traffic) - 48h soak
  • Phase 6: Full (100% traffic) - 7+ days soak
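The phase table above maps naturally onto a pair of lookup functions. The following is a minimal sketch, not code taken from production-rollout.sh; the function names are hypothetical, and the 168-hour figure for phase 6 assumes the documented minimum of 7 days.

```shell
# Sketch only: values mirror the phase list in this README.

# Print the RETE2 traffic percentage for a phase number (0-6).
phase_traffic() {
  case "$1" in
    0) echo 0 ;;
    1) echo 1 ;;
    2) echo 10 ;;
    3) echo 25 ;;
    4) echo 50 ;;
    5) echo 75 ;;
    6) echo 100 ;;
    *) echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# Print the minimum soak period in hours for a phase number (0-6).
phase_soak_hours() {
  case "$1" in
    0) echo 48 ;;
    1) echo 6 ;;
    2) echo 8 ;;
    3) echo 12 ;;
    4) echo 24 ;;
    5) echo 48 ;;
    6) echo 168 ;;  # 7 days, the documented minimum for phase 6
    *) echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# Sum the minimum soak time across all seven phases.
total_soak_hours() {
  local total=0 phase
  for phase in 0 1 2 3 4 5 6; do
    total=$((total + $(phase_soak_hours "$phase")))
  done
  echo "$total"
}
```

Calling total_soak_hours prints 314, roughly 13 days of minimum soak time before staging, approvals, and any repeated phases are counted.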

Duration: 2-4 weeks total (staging → production 100%)


📚 Documentation

Playbooks (Step-by-Step Guides)

  • Staging: ../../docs/STAGING_DEPLOYMENT_PLAYBOOK.md (1,159 lines)
  • Production: ../../docs/PRODUCTION_ROLLOUT_PLAYBOOK.md (1,064 lines)
  • Checklist: ../../docs/DEPLOYMENT_CHECKLIST.md (764 lines)

Executive Summary

  • ../../DEPLOYMENT_READY.md - Start here for overview

Technical Documentation

  • Migration Guide: ../../docs/MIGRATION_GUIDE.md
  • Troubleshooting: ../../docs/MIGRATION_TROUBLESHOOTING.md
  • Rollback Procedure: ../../docs/ROLLBACK_PROCEDURE.md
  • Configuration: ../../docs/MIGRATION_CONFIG.md

🔧 Configuration

Environment Variables

All scripts support configuration via environment variables:

# Staging deployment
STAGING_HOST=staging.example.com
STAGING_USER=deploy
STAGING_PORT=8080

# Production rollout
PROD_HOST=production.example.com
PROD_USER=deploy
PROMETHEUS_URL=http://prometheus:9090

# Validation
MIN_COMPARISONS=1000
MAX_DIVERGENCES=0

Default Values

If not specified, scripts use these defaults:

  • Staging host: staging.example.com
  • Production host: production.example.com
  • User: deploy
  • Prometheus: http://prometheus:9090
  • App port: 8080
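One common way to implement such defaults in shell is the `${VAR:=default}` parameter expansion, which assigns a value only when the variable is unset or empty. The sketch below assumes that approach; load_defaults and print_config are hypothetical helper names, and the scripts themselves may apply defaults differently.

```shell
# Sketch only: apply the documented defaults without overriding values
# that the caller already exported.
load_defaults() {
  : "${STAGING_HOST:=staging.example.com}"
  : "${STAGING_USER:=deploy}"
  : "${STAGING_PORT:=8080}"
  : "${PROD_HOST:=production.example.com}"
  : "${PROD_USER:=deploy}"
  : "${PROMETHEUS_URL:=http://prometheus:9090}"
}

# Print the effective configuration so it can be confirmed before deploying.
print_config() {
  echo "staging: ${STAGING_USER}@${STAGING_HOST}:${STAGING_PORT}"
  echo "production: ${PROD_USER}@${PROD_HOST}"
  echo "prometheus: ${PROMETHEUS_URL}"
}

load_defaults
```

With this pattern, `STAGING_HOST=my-staging.example.com ./staging-deploy.sh` overrides only the host and leaves the other values at their defaults.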

⚙️ Script Details

staging-deploy.sh (304 lines)

Purpose: Fully automated staging deployment

Features:

  • Prerequisites validation (Go, curl, jq, SSH, tests)
  • Optimized binary build (CGO_ENABLED=0, Linux/amd64)
  • Shadow mode configuration (0% RETE2)
  • SSH deployment automation
  • Systemd service setup
  • Health check validation
  • Color-coded output
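The prerequisite validation can be pictured as a loop over the required tools. This is a minimal sketch rather than the code in staging-deploy.sh; check_prerequisites is a hypothetical name, and the tool list in the usage note is taken from the feature list above.

```shell
# Sketch only: report every missing tool in one pass instead of stopping
# at the first, so the operator can fix them all at once.
check_prerequisites() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing prerequisite: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A deployment script could call `check_prerequisites go curl jq ssh || exit 1` before building anything.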

Exit Codes:

  • 0 - Success
  • 1 - Error (prerequisites, build, or deployment failed)

Logs: Output to stdout with colored status


production-rollout.sh (457 lines)

Purpose: Phased production rollout with validation

Features:

  • Phased rollout management (phases 0-6, 0% → 100%)
  • Runtime configuration updates via API
  • Automatic metrics validation (Prometheus)
  • Soak period management with countdown
  • Periodic validation during soak (every 30min)
  • Rollback automation
  • Status reporting
  • Multiple operation modes

Modes:

--status       # Show current rollout status
--phase N      # Execute specific phase (0-6)
--auto         # Auto-execute all phases (with confirmations)
--rollback [N] # Rollback to phase N (default: 0)
--validate     # Validate current phase metrics
--help         # Show usage
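The modes above could be dispatched with a single case statement. The sketch below is hypothetical and only prints the action it would take; parse_mode is an illustrative name, and the assumption that --rollback accepts targets 0 to 6 is inferred from the phase numbering, not confirmed by the script.

```shell
# Sketch only: map command-line arguments to an action name.
parse_mode() {
  case "${1:-}" in
    --status)   echo "status" ;;
    --validate) echo "validate" ;;
    --auto)     echo "auto" ;;
    --help)     echo "help" ;;
    --phase)
      case "${2:-}" in
        [0-6]) echo "phase $2" ;;
        *) echo "error: --phase needs a number from 0 to 6" >&2; return 1 ;;
      esac ;;
    --rollback)
      # N is optional and defaults to 0 (shadow mode).
      case "${2:-0}" in
        [0-6]) echo "rollback ${2:-0}" ;;
        *) echo "error: --rollback target must be 0 to 6" >&2; return 1 ;;
      esac ;;
    *) echo "error: unknown mode '${1:-}' (try --help)" >&2; return 1 ;;
  esac
}
```

For example, `parse_mode --rollback` prints `rollback 0`, matching the documented default.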

Validation Checks:

  • Error rate (threshold: <0.1%)
  • Divergences (must be 0)
  • Circuit breaker state (must be CLOSED)
  • Fallback rate (threshold: <1%)
  • Memory growth (phases 4+, threshold: <10%)
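The thresholds above can be expressed as a small gate function. This sketch assumes the metric values have already been fetched and are passed in as arguments; the real script reads them from Prometheus. The names below and check_phase_metrics are hypothetical.

```shell
# Succeed when $1 is strictly below $2. awk performs the decimal
# comparison, which shell integer arithmetic cannot do.
below() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 < limit + 0) }'
}

# Usage: check_phase_metrics ERROR_PCT DIVERGENCES BREAKER_STATE FALLBACK_PCT [MEMORY_GROWTH_PCT]
# The optional fifth argument is meant for phases 4 and later.
check_phase_metrics() {
  local failed=0
  below "$1" 0.1      || { echo "FAIL: error rate $1% (need <0.1%)"; failed=1; }
  [ "$2" -eq 0 ]      || { echo "FAIL: $2 divergences (need 0)"; failed=1; }
  [ "$3" = "CLOSED" ] || { echo "FAIL: circuit breaker is $3 (need CLOSED)"; failed=1; }
  below "$4" 1        || { echo "FAIL: fallback rate $4% (need <1%)"; failed=1; }
  [ -z "${5:-}" ] || below "$5" 10 || { echo "FAIL: memory growth $5% (need <10%)"; failed=1; }
  [ "$failed" -eq 0 ] && echo "PASS: all checks within thresholds"
  return "$failed"
}
```

Reporting every failed check rather than stopping at the first gives the operator the full picture before deciding whether to roll back.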

Exit Codes:

  • 0 - Success
  • 1 - Validation failed or user abort

validate-staging.sh (465 lines)

Purpose: Comprehensive staging validation

10 Automated Tests:

  1. Service Health - Health endpoint, metrics endpoint
  2. Prometheus Integration - Scraping, target UP, metrics collection
  3. Comparator Validation ⚠️ CRITICAL - Divergences must be 0
  4. Performance Metrics - Latency, error rate, throughput
  5. Circuit Breaker - State, failures, successes
  6. Cache Performance - Hit rate (>80%), eviction rate
  7. Memory Stability - Growth (<5%), GC frequency
  8. Fallback Behavior - Count, rate
  9. Grafana Dashboard - Availability check
  10. Log Analysis - Manual check prompts

Output:

========================================
Validation Summary
========================================

Passed:   8
Warnings: 2
Failed:   0

✅ ALL CHECKS PASSED WITH WARNINGS

Review warnings before proceeding to production

Exit Codes:

  • 0 - All checks passed (warnings acceptable)
  • 1 - One or more checks failed (BLOCKER)
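The relationship between the summary counts and the exit code can be sketched as a simple tally. The counters and helper names below are illustrative and are not taken from validate-staging.sh.

```shell
# Sketch only: running totals for the validation summary.
PASSED=0
WARNINGS=0
FAILED=0

# Record one test result: record_result pass|warn|fail "Test name"
record_result() {
  case "$1" in
    pass) PASSED=$((PASSED + 1)) ;;
    warn) WARNINGS=$((WARNINGS + 1)) ;;
    fail) FAILED=$((FAILED + 1)) ;;
    *) echo "unknown result: $1" >&2; return 2 ;;
  esac
}

# Print the summary and return the documented exit code:
# 0 when nothing failed (warnings allowed), 1 when any check failed.
summarize() {
  echo "Passed:   $PASSED"
  echo "Warnings: $WARNINGS"
  echo "Failed:   $FAILED"
  [ "$FAILED" -eq 0 ]
}
```

Because warnings do not affect the exit code, a CI job that gates on the script's status should still surface the warning count for human review.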

🎯 Typical Workflow

Day 1: Staging Deployment

# 1. Deploy staging
./staging-deploy.sh

# 2. Wait 4-6 hours for data accumulation
# Monitor: http://staging:3000/d/rete2-migration

# 3. Validate staging
./validate-staging.sh

# Expected output: ✅ ALL CHECKS PASSED
# Critical: 0 divergences RETE1 vs RETE2

Day 2-3: Production Phase 0 (Shadow)

# Deploy to production in shadow mode
./production-rollout.sh --phase 0

# Wait 48+ hours, monitor divergences
# Must be 0 before proceeding

Week 2-3: Phased Rollout

# Execute phases sequentially with soak periods
./production-rollout.sh --phase 1  # Wait 6h
./production-rollout.sh --phase 2  # Wait 8h
./production-rollout.sh --phase 3  # Wait 12h
./production-rollout.sh --phase 4  # Wait 24h (CRITICAL GATE)
./production-rollout.sh --phase 5  # Wait 48h
./production-rollout.sh --phase 6  # Wait 7+ days
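Since each phase should start only after the current one validates cleanly, the two documented commands can be chained. This is a hedged sketch: advance_phase and the ROLLOUT_CMD override are hypothetical, and it assumes `--validate` exits non-zero on failure, as the exit codes listed earlier suggest.

```shell
# Sketch only: validate the current phase, then start the next one.
# ROLLOUT_CMD is a hypothetical override so the function can be exercised
# against a stub instead of the real script.
advance_phase() {
  local rollout="${ROLLOUT_CMD:-./production-rollout.sh}"
  local next_phase="$1"
  if "$rollout" --validate; then
    "$rollout" --phase "$next_phase"
  else
    echo "Validation failed; staying on the current phase" >&2
    return 1
  fi
}
```

For example, `advance_phase 2` would move from 1% to 10% traffic only if the phase 1 metrics pass. The soak period itself still has to elapse before calling it.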

Monitoring During Rollout

# Check status anytime
./production-rollout.sh --status

# Validate current phase
./production-rollout.sh --validate

Emergency Rollback

# Instant rollback to shadow mode (0%)
./production-rollout.sh --rollback 0

# Roll back to an earlier phase
./production-rollout.sh --rollback 3  # Back to 25%

⚠️ Critical Requirements

Before Staging Deployment

  • All local tests pass: make test
  • No race conditions: go test ./... -race
  • Staging infrastructure provisioned
  • Prometheus + Grafana ready

Before Production Deployment

  • Staging validated with 0 divergences
  • Baseline metrics collected
  • Team trained on runbook
  • Rollback drills completed (2+)
  • On-call schedule established
  • Stakeholders notified

GO/NO-GO Criteria

GO (proceed):

  • ✅ 0 divergences RETE1 vs RETE2 (MANDATORY)
  • ✅ All metrics within thresholds
  • ✅ Service stable for soak period
  • ✅ Team approval (Tech Lead + Ops Lead)

NO-GO (stop/rollback):

  • ❌ Any divergences detected
  • ❌ Metrics exceed critical thresholds
  • ❌ Service crashes or instability
  • ❌ Team not ready

🆘 Troubleshooting

"Connection refused" during staging deploy

Cause: SSH access issue

Fix:

# Test SSH access
ssh "$STAGING_USER@$STAGING_HOST" "echo OK"

# If this fails, check SSH keys or credentials

"Prometheus query failed" during validation

Cause: Prometheus not accessible or not scraping

Fix:

# Check Prometheus health
curl http://prometheus:9090/-/healthy

# Check targets
curl http://prometheus:9090/api/v1/targets

"Divergences detected" in validation

Cause: RETE1 and RETE2 producing different results

Fix: ⚠️ BLOCKER FOR PRODUCTION

  1. Check divergence logs:

    ssh staging "sudo cat /var/log/rete2/divergences.log"

  2. Analyze divergence types (error, value, count, factset)

  3. Fix code issue in RETE2

  4. Re-deploy staging and re-validate

DO NOT proceed to production with divergences


"Metrics validation failed" during rollout

Cause: Performance degradation or errors

Fix:

# Check specific metrics
curl 'http://prometheus:9090/api/v1/query?query=rete2_errors_total'

# Review Grafana dashboard
# Dashboard: http://grafana:3000/d/rete2-production

# If the failure persists, roll back:
./production-rollout.sh --rollback 0
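PromQL expressions often contain characters such as brackets and parentheses that must be URL-encoded. curl's `-G` and `--data-urlencode` options handle this. The helper below is a hypothetical convenience wrapper, not part of the scripts; only the metric name rete2_errors_total comes from this README.

```shell
# Sketch only: send a PromQL expression to the Prometheus HTTP API as a
# URL-encoded GET parameter. Falls back to the documented default URL.
prom_query() {
  curl -sS -G "${PROMETHEUS_URL:-http://prometheus:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `prom_query 'rate(rete2_errors_total[5m])'` asks for the per-second error rate over the last five minutes without any manual escaping.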

📊 Success Metrics

Staging Qualified ✅

  • 0 divergences (MANDATORY)
  • Service stable 24+ hours
  • All validation tests pass
  • Rollback <60s tested
  • Baseline metrics collected

Production Migration Complete ✅

  • 100% traffic on RETE2
  • 30+ days stable
  • Error rate ≤ RETE1 baseline
  • Latency ≤ RETE1 baseline
  • 0 critical incidents
  • RETE1 decommissioned

🎓 Training & Support

Required Training

  • Migration Guide Review (1h) - Read ../../docs/MIGRATION_GUIDE.md
  • Runbook Training (30min) - Read ../../docs/MIGRATION_TROUBLESHOOTING.md
  • Rollback Drills (1h) - Practice emergency rollback
  • Dashboard Training (30min) - Grafana dashboard familiarization

Total: 3-4 hours per team member

Support Channels

  • Documentation: ../../docs/
  • Playbooks: See links above
  • Runbook: ../../docs/MIGRATION_TROUBLESHOOTING.md
  • Emergency: See ../../docs/DEPLOYMENT_CHECKLIST.md for contacts

🔐 Security Notes

Credentials

  • DO NOT hardcode credentials in scripts
  • Use SSH keys for authentication
  • Use environment variables for sensitive config
  • Rotate credentials after deployment

Access Control

  • Limit SSH access to deployment users
  • Use sudo only when necessary
  • Audit all production changes
  • Log all deployments

📝 Change Log

v1.0.0 (2025-01-07)

  • Initial release
  • Staging deployment automation
  • Production phased rollout (phases 0-6)
  • Automated validation suite
  • Rollback automation
  • Comprehensive documentation

🀝 Contributing

Adding New Scripts

  1. Follow existing script structure
  2. Use set -euo pipefail for error handling
  3. Add colored output (RED/GREEN/YELLOW/BLUE)
  4. Include --help usage
  5. Document in this README
  6. Make executable: chmod +x script.sh
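A starting point that follows these conventions might look like the skeleton below. It is a hypothetical template, not a copy of an existing script: the log helper names, the exact color codes, and the choice to disable colors when output is not a terminal are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical skeleton for a new script in this directory.
set -euo pipefail

# Enable colors only when stdout is a terminal, so piped logs stay clean.
if [ -t 1 ]; then
  RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m'
else
  RED=''; GREEN=''; YELLOW=''; BLUE=''; NC=''
fi

log_info()  { printf '%b[INFO]%b %s\n'  "$BLUE"   "$NC" "$*"; }
log_ok()    { printf '%b[OK]%b %s\n'    "$GREEN"  "$NC" "$*"; }
log_warn()  { printf '%b[WARN]%b %s\n'  "$YELLOW" "$NC" "$*"; }
log_error() { printf '%b[ERROR]%b %s\n' "$RED"    "$NC" "$*" >&2; }

usage() {
  echo "Usage: ${0##*/} [--help]"
}

main() {
  case "${1:-}" in
    --help) usage ;;
    "")     log_info "Nothing to do yet; add the script's work here" ;;
    *)      log_error "Unknown option: $1"; usage >&2; return 1 ;;
  esac
}

main "$@"
```

Keeping all work inside main makes it easy to add further options later without changing the error-handling setup at the top.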

Testing Scripts

# Dry-run mode (if supported)
DRY_RUN=1 ./staging-deploy.sh

# Test in isolated environment first
# Never test directly in production

📚 Additional Resources

  • Project Status: ../../NEXT_STEPS.md
  • Remaining Work: ../../RETE2_FINAL_REMAINING_WORK.md
  • Session Notes: ../../SESSION_2025-01-07_DEPLOYMENT_AUTOMATION.md

Version: 1.0.0
Last Updated: 2025-01-07
Maintainer: RETE2 Migration Team

Ready to deploy? Start with:

./staging-deploy.sh

Questions? See ../../DEPLOYMENT_READY.md