Event-driven platform for AI-assisted root cause analysis. Correlates production alerts using sliding time windows, detects deployment regressions, and generates structured RCA reports via a local LLM — with hallucination guardrails.
📊 Benchmark Results: 8 scenarios · 0 hallucination flags · avg 7,894 ms RCA latency · 100% guardrail pass rate
synthetic-generator (:8083)
│ X-Trace-Id header generated
▼
alerts.raw (Kafka)
│
▼
alert-ingestion-service (:8081)
· Prometheus / Alertmanager webhook receiver
· Redis SETNX deduplication (5 min TTL)
· Anti-corruption enricher
│
▼
alerts.raw (Kafka) ──── deployments.events (Kafka)
│ │
└───────────────────────────┘
│
▼
incident-correlation-engine (:8082)
· Topology-aware sliding window (Redis graph)
· Cross-service alert grouping via dependency graph
· Deployment correlation (10 min lookback)
· Causal chain builder + confidence scoring
· TraceId propagated via Kafka headers
│
▼
incidents.created (Kafka)
│
▼
ai-orchestrator (:8000)
· RAG: pgvector similarity search (BAAI/bge-small-en-v1.5)
· LangGraph 5-node pipeline with retry
· Qwen2.5 14B via Ollama (fully local)
· Hallucination guardrails (6 checks)
· Stores embeddings after each RCA
│
▼
rca.generated (Kafka)
| Service | Language | Port | Responsibility |
|---|---|---|---|
| alert-ingestion-service | Java / Spring Boot | 8081 | Receives Prometheus/Alertmanager webhooks, deduplicates via Redis, publishes to Kafka |
| incident-correlation-engine | Java / Spring Boot | 8082 | Topology-aware sliding window correlation, deployment detection, causal chain builder |
| synthetic-incident-generator | Java / Spring Boot | 8083 | Generates 8 realistic failure scenarios with distributed traceId |
| ai-orchestrator | Python / FastAPI | 8000 | RAG similarity search + LangGraph RCA pipeline + hallucination guardrails |
| Topic | Partitions | Description |
|---|---|---|
| alerts.raw | 3 | Raw alerts from Prometheus or synthetic generator |
| alerts.correlated | 3 | Deduplicated and enriched alerts |
| deployments.events | 2 | Service deployment lifecycle events |
| incidents.created | 3 | Correlated incidents from correlation engine |
| incidents.enriched | 3 | Incidents enriched with deployment context |
| rca.generated | 1 | Final AI-generated RCA reports (ordered) |
Distributed TraceId
Every alert gets a traceId generated at the synthetic generator and propagated via Kafka headers through the entire pipeline — alerts.raw → incidents.created → rca.generated. Each log line includes the traceId, enabling end-to-end request tracing across all services.
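The header hand-off described above can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes the Kafka header key mirrors the `X-Trace-Id` HTTP header from the architecture diagram, and it models headers as a list of `(str, bytes)` tuples, the shape common Python Kafka clients use. The helper names `extract_trace_id` and `with_trace_id` are hypothetical.

```python
import uuid

# Assumption: the Kafka header key mirrors the X-Trace-Id HTTP header.
TRACE_HEADER = "X-Trace-Id"


def extract_trace_id(headers):
    """Return the traceId found in Kafka-style headers, or mint a new one."""
    for key, value in headers or []:
        if key == TRACE_HEADER and value is not None:
            return value.decode("utf-8")
    return str(uuid.uuid4())


def with_trace_id(headers, trace_id):
    """Return a copy of the headers carrying exactly one trace header."""
    kept = [(k, v) for k, v in (headers or []) if k != TRACE_HEADER]
    return kept + [(TRACE_HEADER, trace_id.encode("utf-8"))]


# Read the header from the consumed record, log it, re-publish it downstream.
incoming = [("content-type", b"application/json"), (TRACE_HEADER, b"abc-123")]
trace_id = extract_trace_id(incoming)
outgoing = with_trace_id([("source", b"incident-correlation-engine")], trace_id)
print(f"[traceId={trace_id}] publishing downstream with headers {outgoing}")
```

Because the id travels in headers, the message body schema stays unchanged at every hop.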
Topology-Aware Correlation
The correlation engine maintains a service dependency graph in Redis. When an alert fires on order-service, it checks if any open window contains a related service (payment-service, inventory-service). If so, the alert joins that window instead of creating a new one — enabling cross-service incident grouping beyond time-bucket windowing.
api-gateway → order-service → payment-service
→ inventory-service
→ notification-service
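A rough sketch of the window-joining rule, using an in-memory dict in place of the Redis-backed graph. It assumes that order-service fans out to all three downstream services in the diagram above and that "related" means a direct neighbour in either direction. The function names and the unrelated `search-service` are hypothetical.

```python
# Assumption: order-service calls all three downstream services shown above.
DEPENDENCIES = {
    "api-gateway": {"order-service"},
    "order-service": {"payment-service", "inventory-service", "notification-service"},
}


def neighbours(service):
    """Direct upstream and downstream neighbours of a service."""
    related = set(DEPENDENCIES.get(service, set()))
    related |= {up for up, downs in DEPENDENCIES.items() if service in downs}
    return related


def assign_window(open_windows, service):
    """Join an open window holding a related service, else open a new one.

    open_windows maps window id -> set of services already in that window.
    """
    related = neighbours(service) | {service}
    for window_id, members in open_windows.items():
        if members & related:
            members.add(service)
            return window_id
    window_id = f"window-{len(open_windows) + 1}"
    open_windows[window_id] = {service}
    return window_id


windows = {}
first = assign_window(windows, "payment-service")   # opens window-1
second = assign_window(windows, "order-service")    # related, joins window-1
third = assign_window(windows, "search-service")    # unrelated, opens window-2
print(first, second, third)
```

A production version would also need to bound how many hops count as "related" so that one large window does not swallow every alert.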
RAG Similarity Search
After each RCA is generated, the incident is embedded using BAAI/bge-small-en-v1.5 (384 dimensions, fully local) and stored in pgvector. On the next incident, the top-3 most similar past RCAs are retrieved and injected into the LLM prompt as few-shot examples — improving accuracy for recurring failure patterns.
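The retrieval step amounts to a nearest-neighbour lookup over stored embeddings. The sketch below does this in plain Python with toy 3-dimensional vectors to show the idea. In the real service pgvector performs the search in the database over 384-dimensional vectors, and cosine similarity as the metric is an assumption here. The record ids are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query, stored, k=3):
    """Return the k stored records most similar to the query embedding."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: real embeddings have 384 dimensions.
past_rcas = [
    {"id": "rca-1", "embedding": [1.0, 0.0, 0.0]},
    {"id": "rca-2", "embedding": [0.9, 0.1, 0.0]},
    {"id": "rca-3", "embedding": [0.0, 1.0, 0.0]},
    {"id": "rca-4", "embedding": [0.0, 0.0, 1.0]},
]
query = [1.0, 0.05, 0.0]
examples = top_k_similar(query, past_rcas)
print([record["id"] for record in examples])  # ['rca-1', 'rca-2', 'rca-3']
```

The three retrieved RCAs would then be formatted into the prompt as few-shot examples.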
Hallucination Guardrails
Six validation checks run on every LLM output before publishing: unknown service names, summary length, empty causal chains, unknown services in remediation, overconfident claims without deployment evidence, and duplicate steps. 100% pass rate across all benchmark runs.
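One possible shape for the six checks is sketched below, using the field names from the sample RCA output. The flag names, the two thresholds, and the regex for spotting service names are illustrative assumptions, as is the unknown `billing-service`. The actual guardrail layer may differ.

```python
import re

# Service names taken from the dependency graph shown in this README.
KNOWN_SERVICES = {"api-gateway", "order-service", "payment-service",
                  "inventory-service", "notification-service"}
MAX_SUMMARY_CHARS = 500         # assumption: the real limit is not documented
MAX_CONFIDENCE_NO_DEPLOY = 0.7  # assumption: illustrative threshold only
SERVICE_TOKEN = re.compile(r"\b[a-z0-9-]+-(?:service|gateway)\b")


def validate_rca(rca, known=KNOWN_SERVICES):
    """Return a list of flag names; an empty list means the RCA passed."""
    flags = []
    chain = rca.get("causalChain", [])
    steps = rca.get("remediationSteps", [])
    summary = rca.get("rootCauseSummary", "")

    # 1. Unknown service names in the root cause or causal chain.
    named = {rca.get("rootCauseService")} | {link.get("service") for link in chain}
    if named - known:
        flags.append("UNKNOWN_SERVICE")
    # 2. Summary missing or too long.
    if not summary or len(summary) > MAX_SUMMARY_CHARS:
        flags.append("SUMMARY_LENGTH")
    # 3. Empty causal chain.
    if not chain:
        flags.append("EMPTY_CAUSAL_CHAIN")
    # 4. Unknown services mentioned in remediation steps.
    if set(SERVICE_TOKEN.findall(" ".join(steps))) - known:
        flags.append("UNKNOWN_REMEDIATION_SERVICE")
    # 5. High confidence without any deployment evidence.
    has_deploy = bool(rca.get("triggeringDeploymentVersion")) or any(
        link.get("type") == "DEPLOYMENT" for link in chain)
    if rca.get("confidenceScore", 0.0) > MAX_CONFIDENCE_NO_DEPLOY and not has_deploy:
        flags.append("OVERCONFIDENT_WITHOUT_DEPLOYMENT")
    # 6. Duplicate remediation steps (ignoring case and surrounding whitespace).
    normalised = [step.strip().lower() for step in steps]
    if len(set(normalised)) != len(normalised):
        flags.append("DUPLICATE_STEPS")
    return flags


good = {
    "rootCauseService": "payment-service",
    "triggeringDeploymentVersion": "v3.9.7",
    "confidenceScore": 0.80,
    "rootCauseSummary": "Deployment of v3.9.7 on payment-service introduced a regression.",
    "causalChain": [
        {"type": "DEPLOYMENT", "service": "payment-service"},
        {"type": "CASCADING_FAILURE", "service": "order-service"},
    ],
    "remediationSteps": ["Rollback payment-service to previous stable version",
                         "Investigate changes introduced in v3.9.7"],
}
bad = {
    "rootCauseService": "billing-service",  # hypothetical unknown service
    "rootCauseSummary": "",
    "confidenceScore": 0.95,
    "causalChain": [],
    "remediationSteps": ["Restart billing-service", "restart billing-service "],
}
print(validate_rca(good))  # []
print(validate_rca(bad))   # all six flags
```

An empty `hallucinationFlags` list in the published report then corresponds to `validate_rca` returning `[]`.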
- Docker Desktop
- Java 21 (Eclipse Temurin)
- Python 3.11+
- Maven 3.9+
- Ollama with qwen2.5:14b pulled
make up

Run all make commands from the project root, not from inside infra/.
Verify at http://localhost:8080 (Kafka UI) that all 6 topics exist.
Other useful commands:
make down # stop containers (keeps volumes)
make clean # stop + delete all volumes (full reset)
make logs # tail all container logs
make kafka-topics # list all Kafka topics
make psql # open Postgres shell
make redis-cli # open Redis CLI

cd shared-schema
mvn clean install -DskipTests

cd synthetic-incident-generator && mvn spring-boot:run
cd alert-ingestion-service && mvn spring-boot:run
cd incident-correlation-engine && mvn spring-boot:run

cd ai-orchestrator
pip install -r requirements.txt
python main.py

# Linux / macOS
curl -X POST http://localhost:8083/api/scenarios/trigger \
-H "Content-Type: application/json" \
-d '{"scenarioType":"DEPLOYMENT_REGRESSION","targetService":"payment-service","eventDelayMs":1000}'
# Windows PowerShell
Invoke-WebRequest -Uri "http://localhost:8083/api/scenarios/trigger" `
-Method POST -ContentType "application/json" `
-Body '{"scenarioType":"DEPLOYMENT_REGRESSION","targetService":"payment-service","eventDelayMs":1000}'

- Kafka UI → http://localhost:8080 (watch messages flow through topics)
- Correlation windows → http://localhost:8082/api/correlation/windows
- Grafana → http://localhost:3000 (admin/admin)
| Scenario | Chain |
|---|---|
| DEPLOYMENT_REGRESSION | Deploy → CPU spike → latency → downstream errors |
| MEMORY_LEAK | Memory climbs → GC pressure → OOM restart |
| KAFKA_LAG_CASCADE | Consumer lag → downstream starvation |
| DB_CONNECTION_EXHAUSTION | Pool exhausted → service errors |
| THREAD_POOL_STARVATION | Thread pool full → request queuing |
| CASCADING_FAILURE | A → B → C multi-hop degradation |
| TRANSIENT_SPIKE | Single alert fires and resolves (false positive test) |
| TRAFFIC_SURGE | External traffic spike, no deployment |
{
"rootCauseService": "payment-service",
"triggeringDeploymentVersion": "v3.9.7",
"confidenceScore": 0.80,
"modelUsed": "qwen2.5:14b",
"latencyMs": 18092,
"hallucinationFlags": [],
"rootCauseSummary": "Deployment of v3.9.7 on payment-service introduced a regression causing CPU saturation, elevated p99 latency, and cascading errors on downstream services.",
"causalChain": [
{ "type": "DEPLOYMENT", "service": "payment-service", "description": "v3.9.7 deployed (rolling)" },
{ "type": "ALERT", "service": "payment-service", "description": "HighCpuUsage fired (HIGH)" },
{ "type": "ALERT", "service": "payment-service", "description": "HighP99Latency fired (HIGH)" },
{ "type": "CASCADING_FAILURE", "service": "order-service", "description": "HighErrorRate fired (CRITICAL)" }
],
"remediationSteps": [
"Rollback payment-service to previous stable version",
"Investigate changes introduced in v3.9.7",
"Review DB connection pool and cache TTL config changes",
"Redeploy corrected version after validation"
]
}

| Service | URL | Credentials |
|---|---|---|
| Kafka UI | http://localhost:8080 | — |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | — |
| PostgreSQL | localhost:5432 | dia_user / dia_pass |
| Redis | localhost:6379 | — |
Why Redis for correlation windows? Redis TTL doubles as the sliding window's idle timeout at no extra cost. No separate timer is needed to track inactivity: the TTL resets on every new alert, and once it expires the flush scheduler picks the window up.
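The idle-timeout behaviour can be illustrated with an in-memory stand-in for Redis. This is a sketch only: the class name, the 300-second timeout, and the explicit `now` argument are assumptions made so the example runs without a Redis server or a real clock.

```python
class SlidingWindowStore:
    """In-memory stand-in for Redis keys whose TTL resets on every write."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.windows = {}  # key -> {"alerts": [...], "expires_at": float}

    def add_alert(self, key, alert, now):
        """Append the alert and push the expiry forward (the TTL reset)."""
        window = self.windows.setdefault(key, {"alerts": [], "expires_at": 0.0})
        window["alerts"].append(alert)
        window["expires_at"] = now + self.idle_timeout_s

    def flush_expired(self, now):
        """Remove and return every window that has been idle past its timeout."""
        expired = {key: window["alerts"]
                   for key, window in self.windows.items()
                   if window["expires_at"] <= now}
        for key in expired:
            del self.windows[key]
        return expired


store = SlidingWindowStore(idle_timeout_s=300)  # assumption: 5 min idle timeout
store.add_alert("payment-service", "HighCpuUsage", now=0)
store.add_alert("payment-service", "HighP99Latency", now=200)  # expiry moves to 500

still_open = store.flush_expired(now=400)  # nothing flushed: window is still live
closed = store.flush_expired(now=500)      # idle for 300 s: window is flushed
print(still_open, closed)
```

Without the reset on the second alert the window would have closed at 300 s and the two alerts would have landed in different incidents.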
Why topology-aware windowing over time-bucket windowing?
Time-bucket windowing groups alerts by when they fired. But a CPU spike on payment-service at T+0 and an error on order-service at T+3m might be in different time buckets yet causally related. Topology-aware windowing uses the service dependency graph to group alerts by reachability — not just timing.
Why Kafka headers for traceId? Headers are metadata — embedding traceId in the message body would pollute the schema. Headers let you add/remove tracing without breaking existing consumers. Every service reads the header, logs it, and re-publishes it downstream.
Why RAG over pure LLM generation? Without RAG, the LLM generates from scratch every time. With RAG, it sees confirmed past RCAs for similar incidents. For recurring patterns (same service, same failure type), the second RCA is substantially more specific than the first.
Why LangGraph over raw API calls? LangGraph gives explicit typed state and conditional edges. On a parse failure, the retry loops back to the generation step only, without re-running prompt preparation or RAG retrieval. A linear chain framework cannot express that partial retry directly.
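The retry shape can be shown without LangGraph itself. The plain-Python sketch below is a stand-in, not the project's graph: the step names, the `generate` callback, and `max_attempts` are hypothetical. It only demonstrates that a parse failure re-enters the generation step while preparation and retrieval run once.

```python
import json


def run_pipeline(incident, generate, max_attempts=3):
    """Run prepare -> retrieve -> generate, retrying only generate on bad JSON."""
    calls = {"prepare": 0, "retrieve": 0, "generate": 0}

    calls["prepare"] += 1
    state = {"incident": incident, "prompt": f"Analyse incident {incident['id']}"}

    calls["retrieve"] += 1
    state["examples"] = []  # stand-in for the RAG lookup

    state["rca"] = None
    for attempt in range(1, max_attempts + 1):
        calls["generate"] += 1
        raw = generate(state, attempt)
        try:
            state["rca"] = json.loads(raw)
            break  # parsed cleanly, leave the retry loop
        except json.JSONDecodeError:
            continue  # conditional edge back to generation only
    return state, calls


def fake_llm(state, attempt):
    """Hypothetical model call: malformed output first, valid JSON second."""
    if attempt == 1:
        return "Sure! Here is the RCA: {oops"
    return '{"rootCauseService": "payment-service"}'


state, calls = run_pipeline({"id": "inc-1"}, fake_llm)
print(calls)         # {'prepare': 1, 'retrieve': 1, 'generate': 2}
print(state["rca"])  # {'rootCauseService': 'payment-service'}
```

In a graph framework the same behaviour comes from a conditional edge that routes a failed parse back to the generation node.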
Why manual Kafka ack? At-least-once delivery semantics. Offset is only committed after the incident is successfully published downstream. If the service crashes mid-processing, the alert replays on restart.
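The commit-after-publish rule can be simulated with a list standing in for a partition and an integer standing in for the committed offset. The function and variable names are hypothetical, and a real consumer would commit through its Kafka client instead of returning an offset.

```python
def consume(messages, committed_offset, publish):
    """Process messages from the committed offset; return the new offset.

    The offset advances only after publish() succeeds, so a failure leaves
    the failing message uncommitted and it is replayed on the next run.
    """
    for index in range(committed_offset, len(messages)):
        try:
            publish(messages[index])
        except RuntimeError:
            return committed_offset  # simulated crash: nothing further is acked
        committed_offset = index + 1  # manual ack after a successful publish
    return committed_offset


published = []
failures_left = {"alert-2": 1}  # alert-2 fails once, then succeeds


def flaky_publish(message):
    """Hypothetical downstream publish that fails once for alert-2."""
    if failures_left.get(message, 0) > 0:
        failures_left[message] -= 1
        raise RuntimeError("broker unavailable")
    published.append(message)


alerts = ["alert-1", "alert-2", "alert-3"]
after_crash = consume(alerts, 0, flaky_publish)              # stops before alert-2
after_restart = consume(alerts, after_crash, flaky_publish)  # replays alert-2
print(after_crash, after_restart, published)
```

At-least-once also means duplicates are possible: a crash after the publish but before the commit replays a message that already went downstream, so consumers should tolerate repeats.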
Why separate Java and Python services? The Java services form a pure event-driven pipeline: deterministic, fast, testable. The Python AI layer can be iterated on independently without touching the correlation logic, which keeps the separation of concerns clear.
deterministic-incident-analyzer/
├── infra/ # Docker Compose, Kafka, pgvector Postgres, Redis, Prometheus, Alertmanager, Grafana
├── shared-schema/ # Java event contracts (AlertEvent, IncidentEvent, DeploymentEvent, TraceContext)
├── alert-ingestion-service/ # Spring Boot — Alertmanager webhook receiver + Redis dedup
├── incident-correlation-engine/ # Spring Boot — topology-aware sliding window + deployment correlation
│ └── topology/ # Service dependency graph (Redis-backed)
├── synthetic-incident-generator/ # Spring Boot — 8 scenario types with traceId generation
├── ai-orchestrator/ # Python FastAPI — RAG + LangGraph RCA pipeline
│ ├── agents/ # LangGraph 5-node RCA graph
│ ├── rag/ # pgvector embeddings + similarity search
│ ├── guardrails/ # Hallucination validation layer
│ └── messaging/ # Kafka consumer with traceId propagation
├── docs/ # Benchmark results
└── Makefile