Welcome to this Technical Case Study exploring High Availability (HA) deployments, Observability, and automated alerting. This project demonstrates a production-grade infrastructure setup utilizing Nginx as a Reverse Proxy, Docker Compose, and a custom Python Log-Watcher to ensure system reliability and seamless failovers.
In traditional software deployments, releasing a new version often requires stopping the old application before starting the new one. This leads to Downtime, which is unacceptable for mission-critical applications.
Blue/Green Deployment solves this by running two identical production environments (Pool Blue and Pool Green).
- Only one environment is live at any given time, serving all production traffic.
- When deploying a new release, we deploy it to the inactive environment.
- Once tested and verified, we instantly flip the traffic switch at the load balancer (Nginx) level.
- Result: Zero-Downtime Deployments, instant rollbacks, and a safety net for unexpected crashes.
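At the Nginx level, the traffic flip boils down to which pool the upstream points at. A minimal sketch of that idea (the upstream names and ports here are assumptions, not the repository's actual config):

```nginx
upstream app_pool {
    server app_blue:8080;          # active pool
    server app_green:8080 backup;  # standby; only receives traffic if blue fails
}

server {
    listen 80;
    location / {
        proxy_pass http://app_pool;
        # Retry the request on the other pool if the active one errors out
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```

With the `backup` directive, Nginx fails over to the green pool automatically when blue stops responding; a deliberate switch is done by changing which server carries `backup` and reloading Nginx.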
This repository simulates a highly available application with built-in observability. Here's how the different pieces connect:
```mermaid
graph TD
    User([End User]) -->|HTTP Requests| Nginx[Nginx Reverse Proxy]

    subgraph "Application Pools (Docker)"
        AppBlue[App Blue<br/>Active Pool]
        AppGreen[App Green<br/>Standby Pool]
    end

    Nginx -->|Routes Traffic| AppBlue
    Nginx -.->|Failover Traffic| AppGreen
    Nginx -->|Writes Structured Logs| Logs[(/logs/nginx/access.log)]
    Logs -->|Tails in Real-Time| Watcher[Python Log-Watcher]
    Watcher -->|Anomaly Detected!| Slack[Slack Alerts]
```
- The Reverse Proxy (Nginx): Acts as the entry point, routing traffic to the currently active application pool. It generates structured logs containing crucial metrics like `upstream_status`, `request_time`, and `pool`.
- The Observability Layer (Python Watcher): A standalone Python daemon continuously tails the Nginx logs. It relies on a sliding-window approach rather than absolute counts to dynamically detect anomalies.
- The Alerting Mechanism: If the watcher detects a traffic failover or a spike in 5xx errors (e.g., crossing a 2% threshold), it instantly fires a structured notification to a designated Slack channel, allowing on-call engineers to react swiftly.
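The sliding-window detection described above can be sketched roughly like this (a minimal illustration using the 2%-over-200-requests example, not the repository's actual watcher code):

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the 5xx error rate over the last N requests (sliding window)."""

    def __init__(self, window_size: int = 200, threshold: float = 0.02):
        self.window = deque(maxlen=window_size)  # stores 1 for error, 0 for ok
        self.threshold = threshold

    def observe(self, upstream_status: int) -> bool:
        """Record one request; return True if the error rate crosses the threshold."""
        self.window.append(1 if upstream_status >= 500 else 0)
        # Don't alert until the window has enough samples to be meaningful
        if len(self.window) < self.window.maxlen:
            return False
        return sum(self.window) / len(self.window) > self.threshold


monitor = ErrorRateMonitor()
statuses = [200] * 195 + [502] * 5  # 5 errors in 200 requests = 2.5%
alerts = [monitor.observe(s) for s in statuses]
print(alerts[-1])  # True: 2.5% > 2% threshold
```

A real watcher would additionally apply a cooldown between alerts so a sustained error burst produces one Slack message, not hundreds.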
Want to see it in action? Follow these beginner-friendly steps:
```bash
# Clone the repository
git clone https://github.com/moriadim/stage2-devops.git
cd stage2-devops

# Set up your environment variables
cp .env.example .env
```

Tip: Make sure to edit `.env` and add your `SLACK_WEBHOOK_URL`.
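For reference, the relevant entry in `.env` would look something like this (the webhook path is a placeholder; any other variables in the real `.env.example` are not shown here):

```dotenv
# Incoming webhook for the watcher's alerts
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
```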
```bash
# We added a smart Makefile to make your life easier! Give it a try:
make up
```

Time to test the Blue/Green failover mechanism by simulating an application crash.
```bash
# Start injecting errors to force a failover
make chaos

# Watch the Nginx routing logs and the Python watcher simultaneously
make logs
```

Watch your terminal for live logs and check your Slack for an alert: `Failover Detected - pool changed from blue → green!`
```bash
# Wipe out all containers, volumes, and logs to start fresh
make clean
```

Building this case study was a fantastic deep dive into modern SRE practices. Here is what I learned:
- The Art of Log Parsing (Regex): Parsing plain-text logs efficiently using regex in Python taught me how to extract structured, actionable data (`key=value` pairs) out of unstructured streams.
- Balancing Alert Thresholds: Initially, raw error counts created a lot of noise. By implementing a sliding window with a percentage threshold (e.g., 2% errors over the last 200 requests) and adding cooldown periods, I learned how to combat alert fatigue for on-call engineers.
- Graceful Degradation: Accepting that failures will happen, and designing a system (Blue/Green + Nginx upstream) that handles those failures seamlessly without dropping the user's connection.
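The `key=value` parsing mentioned above can be done with a small regex (a sketch; the field names and exact log format of the repository's Nginx config are assumptions):

```python
import re

# Matches key=value pairs; the value may be double-quoted to allow spaces
KV_PATTERN = re.compile(r'(\w+)=(".*?"|\S+)')


def parse_log_line(line: str) -> dict:
    """Turn a structured key=value Nginx log line into a dictionary."""
    return {key: value.strip('"') for key, value in KV_PATTERN.findall(line)}


line = 'status=200 upstream_status=502 request_time=0.031 pool=blue'
fields = parse_log_line(line)
print(fields["pool"], fields["upstream_status"])  # blue 502
```

Compiling the pattern once and reusing it per line keeps the hot tailing loop cheap, which matters when the proxy is under load.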
This repository is designed to be a sandbox for learning High Availability and Monitoring. Feel free to break things, observe how the system reacts, and level up your SRE skills!