CloudSentinel

AI-powered platform for real-time infrastructure monitoring, anomaly detection, and incident investigation. Ingests host and Kubernetes metrics via Kafka, runs ML-based anomaly detection, triggers AI agent investigations, and surfaces everything through a live dashboard.

Architecture

psutil (host metrics)          Kubernetes API
        ↓                             ↓
 metric_generator.py          k8s-collector/main.py
        ↓ metrics.raw                 ↓ k8s.pods / k8s.events / k8s.deployments / k8s.nodes
        └──────────────┬─────────────┘
                       ↓
              anomaly-service
          (Isolation Forest / Autoencoder + threshold rules)
                       ↓ anomalies.detected
              Backend (Spring Boot)
          - Persists anomalies, metrics, K8s state
          - Triggers AI investigator agent (Gemini)
          - REST API + WebSocket (STOMP) server
                       ↓ WebSocket /ws  (STOMP over native WS)
              Frontend (Next.js)
          - Live metric dashboard (pushed, no polling)
          - Anomaly feed, investigations, cluster view

Services

Service	Tech	Port	Purpose
`frontend`	Next.js 16 + TypeScript + Tailwind	3000	Dashboard UI
`backend`	Spring Boot 4 + Java 25	8080	REST API, auth, Kafka consumer/producer
`ai`	FastAPI + Python 3.12	8000	Agent service (Gemini-powered investigation agent)
`anomaly-service`	Python 3.12	—	ML anomaly detection (Kafka consumer/producer)
`k8s-collector`	Python 3.12	—	Kubernetes metrics collector (one-shot or scheduled)
`kafka`	Confluent Kafka 7.6	9092	Message bus
`postgres`	PostgreSQL 16	5432	Persistent storage (db: `astraquant-db`)
`redis`	Redis 7	6379	Caching

Kafka Topics

Topic	Producer	Consumer	Purpose
`metrics.raw`	metric_generator.py	anomaly-service	Host CPU/memory/disk every 5s
`anomalies.detected`	anomaly-service	Backend	ML anomaly events
`k8s.pods`	k8s-collector	Backend	Pod status snapshots
`k8s.deployments`	k8s-collector	Backend	Deployment replica counts
`k8s.events`	k8s-collector	Backend	Cluster warning events
`k8s.nodes`	k8s-collector	Backend	Node status
`incidents.created`	Backend	(future)	Incident notifications
`logs.raw`	(future)	(future)	Raw log events
`features.processed`	(future)	(future)	Engineered features

Topics are auto-created by the kafka-init container at startup (3 partitions, replication factor 1).
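
As a rough illustration of what the kafka-init step amounts to, the sketch below builds one `kafka-topics --create` command per topic, using the partition and replication settings stated above. The bootstrap address `kafka:9092`, the helper name `create_topic_command`, and the use of `--if-not-exists` are assumptions; the actual init script may differ.

```python
# Sketch: build the kafka-topics commands an init container could run.
# Assumptions: broker reachable at kafka:9092; topic names copied from the table above.
TOPICS = [
    "metrics.raw",
    "anomalies.detected",
    "k8s.pods",
    "k8s.deployments",
    "k8s.events",
    "k8s.nodes",
    "incidents.created",
    "logs.raw",
    "features.processed",
]


def create_topic_command(topic, bootstrap="kafka:9092", partitions=3, replication=1):
    """Return the argv list that creates one topic, skipping it if it already exists."""
    return [
        "kafka-topics",
        "--bootstrap-server", bootstrap,
        "--create",
        "--if-not-exists",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication),
    ]


def all_commands():
    """Return one shell-ready command string per topic."""
    return [" ".join(create_topic_command(name)) for name in TOPICS]
```

Printing `all_commands()` gives nine lines that can be compared against whatever the kafka-init container actually runs.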

Getting Started

Prerequisites

Docker and Docker Compose
A .env file in the project root with:

DB_USERNAME=your_db_user
DB_PASSWORD=your_db_password
GEMINI_KEY=your_google_gemini_api_key
JWT_SECRET=your_jwt_secret
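
Before starting the stack, it can help to confirm that all four variables are present. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the parser handles only plain KEY=value lines (no quoting or variable expansion).

```python
from pathlib import Path

# The four variables listed above.
REQUIRED_KEYS = ("DB_USERNAME", "DB_PASSWORD", "GEMINI_KEY", "JWT_SECRET")


def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text):
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def check_env_file(path=".env"):
    """Read the file at path (if any) and report which required keys are missing."""
    env_path = Path(path)
    text = env_path.read_text() if env_path.exists() else ""
    return missing_keys(text)
```

Calling `check_env_file()` from the project root returns an empty list when the file is complete.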

Run the full stack

docker compose up --build

The frontend will be available at http://localhost:3000, which redirects to the dashboard at /dashboard. The backend API is at http://localhost:8080.

Local development

Backend (requires a backend/.env with the same vars as above):

cd backend && ./mvnw spring-boot:run

AI agent service:

cd ai && source venv/bin/activate && uvicorn app.main:app --reload --port 8000

Anomaly service (connects to Kafka on localhost:9092 by default):

cd ai && source venv/bin/activate && python anomaly-service/main.py

Frontend:

cd frontend && npm run dev

Start the host metric generator (metric_generator.py) manually from a Python shell:

from app.services.metric_generator import run
run()

Build & Test

Backend:

cd backend
./mvnw clean package          # build JAR
./mvnw test                   # run all tests
./mvnw test -Dtest=ClassName  # run single test class
./mvnw verify                 # build + integration tests

Frontend:

cd frontend
npm run build   # production build
npm run lint    # ESLint

AI / anomaly service:

cd ai && source venv/bin/activate
python -m pytest tests/
python tests/python-test.py

Frontend Pages

Route	Description
`/`	Redirects to `/dashboard`
`/dashboard`	Live CPU, memory, and disk gauges — pushed via WebSocket, no polling
`/metrics-table`	Historical metrics table with color-coded usage values
`/anomalies`	Detected anomaly feed — history loaded once via REST, new arrivals pushed via WebSocket
`/investigations`	Investigation list — history loaded once via REST, new investigations pushed live
`/investigations/[id]`	Investigation detail: timeline, evidence, status controls — live agent progress via WebSocket
`/k8s/overview`	Cluster stat cards (nodes, running/failed pods, deployments)
`/k8s/pods`	Pod table with status badges and restart counts
`/k8s/deployments`	Deployment replica health
`/k8s/timeline`	Merged anomaly + cluster event feed — history via REST, new events pushed live
`/simulation-lab`	Run built-in failure scenarios (cpu-spike, memory-leak, crash-loop, bad-deployment)

Backend REST API

Method	Path	Description
GET	`/metrics/latest`	Most recent metric sample (from Redis)
GET	`/metrics/history`	All stored metric samples
GET	`/anomalies/`	All detected anomalies
GET	`/investigations`	All investigations
GET	`/investigations/{id}`	Investigation detail (timeline, evidence)
PATCH	`/investigations/{id}/status`	Update investigation status
PATCH	`/investigations/{id}/findings`	Update root cause, confidence, summary
GET	`/k8s/pods`	All pod records
GET	`/k8s/deployments`	All deployment records
GET	`/k8s/events`	All cluster events
GET	`/k8s/nodes`	All node records
GET	`/simulations/scenarios`	List available simulation scenarios
GET	`/simulations`	List active/past simulation runs
POST	`/simulations`	Start a simulation scenario
GET	`/simulations/{runId}`	Get simulation run status and events
DELETE	`/simulations/{runId}`	Stop a simulation run
GET	`/models`	List ML model versions
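
To show how the endpoints above might be called from a script, here is a small sketch using only the Python standard library. The PATCH body shape (`{"status": ...}`), the value `RESOLVED`, the investigation id 42, and sending the JWT as a Bearer token are all assumptions; the README does not specify them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"


def build_request(method, path, body=None, token=None):
    """Build (but do not send) a request against the backend REST API."""
    headers = {"Accept": "application/json"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if token:
        # Assumption: JWTs are passed in an Authorization: Bearer header.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def send(request, timeout=5):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


latest = build_request("GET", "/metrics/latest")
# Assumption: the status endpoint accepts a JSON object with a "status" field.
update = build_request("PATCH", "/investigations/42/status", body={"status": "RESOLVED"})
```

With the backend running, `send(latest)` would return the most recent metric sample as a Python object.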

Auth endpoints (/auth/register, /auth/login) issue JWTs valid for 1 hour.
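
A client can check how much of the one-hour lifetime remains by reading the token's payload. The sketch below assumes the backend sets the standard `exp` claim (seconds since the epoch), which the README does not confirm. It decodes the payload without verifying the signature, so it is suitable only for client-side bookkeeping.

```python
import base64
import json
import time


def decode_jwt_payload(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # base64url segments drop their padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def seconds_until_expiry(token, now=None):
    """Return the seconds left before the exp claim (negative if already expired)."""
    now = time.time() if now is None else now
    return decode_jwt_payload(token)["exp"] - now
```

A client could request a fresh token from /auth/login once `seconds_until_expiry` drops below some margin of its choosing.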

WebSocket API

Connect to ws://localhost:8080/ws using STOMP. No authentication required.

Topic	Payload type	Triggered by
`/topic/metrics`	`MetricSampleEntity`	Every `metrics.raw` Kafka message
`/topic/anomalies`	`AnomalyEntity`	Every `anomalies.detected` Kafka message
`/topic/events`	`ClusterEventEntity`	Every `k8s.events` Kafka message
`/topic/investigations`	`InvestigationEntity`	New investigation created or status/findings updated
`/topic/investigations/{id}`	`InvestigationDetailResponse`	Status or findings updated for that investigation
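
For clients that do not use a STOMP library, the frames exchanged over /ws are simple text. The sketch below builds CONNECT and SUBSCRIBE frames and parses an incoming MESSAGE frame following the STOMP 1.2 wire format. The helper names and the subscription id `sub-0` are hypothetical, header escaping is ignored, and the WebSocket transport itself is not shown.

```python
NULL = "\x00"  # every STOMP frame ends with a NUL byte


def stomp_frame(command, headers=None, body=""):
    """Serialize a frame: command line, header lines, blank line, body, NUL."""
    lines = [command]
    for key, value in (headers or {}).items():
        lines.append(f"{key}:{value}")
    return "\n".join(lines) + "\n\n" + body + NULL


def parse_stomp_frame(raw):
    """Split a raw frame into (command, headers, body)."""
    content = raw.split(NULL, 1)[0]
    head, _, body = content.partition("\n\n")
    command, *header_lines = head.split("\n")
    headers = dict(line.split(":", 1) for line in header_lines if line)
    return command, headers, body


connect = stomp_frame("CONNECT", {"accept-version": "1.2", "host": "localhost"})
subscribe = stomp_frame("SUBSCRIBE", {"id": "sub-0", "destination": "/topic/anomalies"})
```

After sending `connect` and receiving a CONNECTED frame, sending `subscribe` should cause the broker to deliver MESSAGE frames whose body is the JSON payload listed in the table.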

Anomaly Detection

The anomaly-service consumes metrics.raw and runs two detection stages followed by an explanation step:

Warm-up: threshold rules fire immediately, before any feature window is available (e.g. CPU > 90%, memory > 85%).
ML detection: once a feature window is populated, an Isolation Forest or Autoencoder model scores each sample. Of the models that are loaded, the one with the highest priority is used.
Explainer: contributing factors (which metric deviated and by how much) are appended to each anomaly event before it is published to anomalies.detected.
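
The flow described above can be sketched in a few lines. This is not the service's code: a per-metric z-score stands in for the Isolation Forest / Autoencoder so the example needs no ML dependencies. The window size, the z-score cut-off, the event fields, and the choice to keep threshold rules active after warm-up are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev

# Thresholds taken from the examples in the text.
THRESHOLDS = {"cpu": 90.0, "memory": 85.0}
WINDOW_SIZE = 30  # hypothetical feature-window length
Z_LIMIT = 3.0     # hypothetical cut-off for the stand-in scorer


class Detector:
    def __init__(self):
        self.window = deque(maxlen=WINDOW_SIZE)

    def threshold_hits(self, sample):
        """Threshold stage: rules that need no history."""
        return {k: sample[k] for k, limit in THRESHOLDS.items() if sample[k] > limit}

    def deviations(self, sample):
        """Model stage stand-in: per-metric z-score against the window."""
        scores = {}
        for key in sample:
            values = [s[key] for s in self.window]
            spread = pstdev(values)
            if spread > 0:
                scores[key] = (sample[key] - mean(values)) / spread
        return scores

    def process(self, sample):
        """Return an anomaly event dict, or None if the sample looks normal."""
        event = None
        hits = self.threshold_hits(sample)
        if hits:
            event = {"source": "threshold", "factors": hits}
        elif len(self.window) == WINDOW_SIZE:
            scores = self.deviations(sample)
            outliers = {k: round(z, 2) for k, z in scores.items() if abs(z) > Z_LIMIT}
            if outliers:
                # Explainer step: record which metrics deviated and by how much.
                event = {"source": "model", "factors": outliers}
        self.window.append(sample)
        return event
```

Feeding samples such as `{"cpu": 95.0, "memory": 40.0}` to `Detector().process` yields a threshold event immediately, while model events appear only after the window has filled.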

Models are trained via anomaly-service/model/train.py and loaded at startup. In Docker, the trained model files are volume-mounted from ./ai/anomaly-service/model.

When the backend receives an anomaly on anomalies.detected, it persists it and triggers the investigator_agent (Gemini-powered) to open an investigation, analyze root cause, and record evidence.

Key Technical Notes

Backend JPA: ddl-auto=create-drop in dev — all tables are dropped and recreated on each restart.
Lombok: MetricSampleEntity uses @Data + @NoArgsConstructor — the no-arg constructor is required by JPA.
Tailwind CSS v4: uses @import "tailwindcss" in globals.css, not a tailwind.config.js.
WebSocket: STOMP over native WebSocket; endpoint /ws; in-memory broker on /topic; WebSocketBroadcastService handles all broadcasts from Kafka consumers and InvestigationService.
K8s collector: deployed as the k8s-collector service in Docker Compose (network_mode: host to access ~/.kube/config). The same module also runs embedded inside the ai service, where it starts on startup and is disabled gracefully if no kubeconfig is found.
