deterministic-incident-analyzer

Event-driven platform for AI-assisted root cause analysis. Correlates production alerts using sliding time windows, detects deployment regressions, and generates structured RCA reports via a local LLM — with hallucination guardrails.

📊 Benchmark Results: 8 scenarios · 0 hallucination flags · avg 7,894 ms RCA latency · 100% guardrail pass rate

Architecture

synthetic-generator (:8083)
        │ X-Trace-Id header generated
        ▼
   alerts.raw (Kafka)
        │
        ▼
alert-ingestion-service (:8081)
  · Prometheus / Alertmanager webhook receiver
  · Redis SETNX deduplication (5 min TTL)
  · Anti-corruption enricher
        │
        ▼
   alerts.raw (Kafka) ──── deployments.events (Kafka)
        │                           │
        └───────────────────────────┘
                        │
                        ▼
   incident-correlation-engine (:8082)
     · Topology-aware sliding window (Redis graph)
     · Cross-service alert grouping via dependency graph
     · Deployment correlation (10 min lookback)
     · Causal chain builder + confidence scoring
     · TraceId propagated via Kafka headers
                        │
                        ▼
              incidents.created (Kafka)
                        │
                        ▼
          ai-orchestrator (:8000)
            · RAG: pgvector similarity search (BAAI/bge-small-en-v1.5)
            · LangGraph 5-node pipeline with retry
            · Qwen2.5 14B via Ollama (fully local)
            · Hallucination guardrails (6 checks)
            · Stores embeddings after each RCA
                        │
                        ▼
             rca.generated (Kafka)

Services

| Service | Language | Port | Responsibility |
| --- | --- | --- | --- |
| `alert-ingestion-service` | Java / Spring Boot | 8081 | Receives Prometheus/Alertmanager webhooks, deduplicates via Redis, publishes to Kafka |
| `incident-correlation-engine` | Java / Spring Boot | 8082 | Topology-aware sliding window correlation, deployment detection, causal chain builder |
| `synthetic-incident-generator` | Java / Spring Boot | 8083 | Generates 8 realistic failure scenarios with distributed traceId |
| `ai-orchestrator` | Python / FastAPI | 8000 | RAG similarity search + LangGraph RCA pipeline + hallucination guardrails |
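
The deduplication step in `alert-ingestion-service` is written in Java, but the idea behind "SETNX with a 5 min TTL" can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the fingerprint fields (`service`, `alertName`, `severity`), the `dedup:` key prefix, and the `InMemoryRedis` stand-in are hypothetical and exist only so the example runs without a Redis server. The `set(name, value, nx=True, ex=...)` call mirrors the redis-py client signature.

```python
import hashlib
import time


class InMemoryRedis:
    """Tiny stand-in for a Redis client so the sketch runs without a server.

    Mimics only the subset of redis-py's `set(name, value, nx=..., ex=...)`
    used below: returns True when the key was written, None when `nx` blocked it.
    The value itself is not stored because only key existence matters here.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # key -> absolute expiry time

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        expires_at = self._expiry.get(name)
        live = expires_at is not None and expires_at > now
        if nx and live:
            return None
        self._expiry[name] = (now + ex) if ex is not None else float("inf")
        return True


def fingerprint(alert):
    """Hash the fields that identify 'the same alert' (an assumed choice)."""
    raw = "|".join([alert["service"], alert["alertName"], alert["severity"]])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def is_first_occurrence(client, alert, ttl_seconds=300):
    """True if the alert should be published, False if it is a duplicate."""
    key = "dedup:" + fingerprint(alert)
    return client.set(key, "1", nx=True, ex=ttl_seconds) is True
```

Because the write and the existence check happen in one atomic command, two ingestion instances receiving the same alert at the same moment cannot both treat it as new.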

Kafka Topics

| Topic | Partitions | Description |
| --- | --- | --- |
| `alerts.raw` | 3 | Raw alerts from Prometheus or synthetic generator |
| `alerts.correlated` | 3 | Deduplicated and enriched alerts |
| `deployments.events` | 2 | Service deployment lifecycle events |
| `incidents.created` | 3 | Correlated incidents from correlation engine |
| `incidents.enriched` | 3 | Incidents enriched with deployment context |
| `rca.generated` | 1 | Final AI-generated RCA reports (ordered) |

Key Design Features

Distributed TraceId: Every alert gets a traceId, generated at the synthetic generator and propagated via Kafka headers through the entire pipeline (alerts.raw → incidents.created → rca.generated). Each log line includes the traceId, enabling end-to-end request tracing across all services.
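
The read-and-re-publish pattern can be sketched as follows. This is an illustrative sketch, not the project's code: it assumes headers are represented as a list of `(key, bytes)` tuples (the shape used by common Python Kafka clients) and that the Kafka header key is `X-Trace-Id`, borrowed from the architecture diagram. The function names are hypothetical.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # assumed Kafka header key, borrowed from the diagram


def extract_trace_id(headers):
    """Return the traceId from a list of (key, bytes) headers, or None."""
    for key, value in headers or []:
        if key == TRACE_HEADER and value is not None:
            return value.decode("utf-8")
    return None


def outgoing_headers(incoming_headers):
    """Build headers for the downstream message, reusing the incoming traceId.

    A new id is minted only when none arrived (for example at the first producer).
    Returns (trace_id, headers) so the caller can also put the id in its logs.
    """
    trace_id = extract_trace_id(incoming_headers) or str(uuid.uuid4())
    passthrough = [(k, v) for k, v in (incoming_headers or []) if k != TRACE_HEADER]
    return trace_id, passthrough + [(TRACE_HEADER, trace_id.encode("utf-8"))]
```

Each service would call `outgoing_headers` once per consumed record, log the returned id, and attach the returned headers to whatever it publishes next.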

Topology-Aware Correlation: The correlation engine maintains a service dependency graph in Redis. When an alert fires on order-service, the engine checks whether any open window contains a related service (payment-service, inventory-service). If so, the alert joins that window instead of creating a new one, which enables cross-service incident grouping beyond time-bucket windowing.

api-gateway → order-service → payment-service
                            → inventory-service
           → notification-service
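
The window-joining rule can be sketched with the dependency graph above held in a plain dictionary instead of Redis. Two assumptions are made here and are not confirmed by the source: the diagram is read as api-gateway calling both order-service and notification-service, and "related" is taken to mean a direct neighbour in either direction. The window ids and function names are hypothetical.

```python
# Directed edges as read from the diagram above (caller -> callees).
DEPENDENCIES = {
    "api-gateway": ["order-service", "notification-service"],
    "order-service": ["payment-service", "inventory-service"],
}


def related_services(service):
    """Direct neighbours of `service`, upstream or downstream (an assumption)."""
    related = set(DEPENDENCIES.get(service, []))
    related.update(src for src, dsts in DEPENDENCIES.items() if service in dsts)
    return related


def assign_window(service, open_windows):
    """Return the id of the window the alert joins, creating one if needed.

    `open_windows` maps window id -> set of services already in that window.
    """
    candidates = related_services(service) | {service}
    for window_id, members in open_windows.items():
        if members & candidates:
            members.add(service)
            return window_id
    window_id = f"window-{len(open_windows) + 1}"
    open_windows[window_id] = {service}
    return window_id
```

With this rule, an alert on payment-service followed by one on order-service ends up in a single window, while an alert on notification-service opens its own.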

RAG Similarity Search: After each RCA is generated, the incident is embedded using BAAI/bge-small-en-v1.5 (384 dimensions, fully local) and stored in pgvector. On the next incident, the top-3 most similar past RCAs are retrieved and injected into the LLM prompt as few-shot examples, improving accuracy for recurring failure patterns.
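
The top-3 retrieval can be illustrated in pure Python. In the actual service the ranking would presumably happen inside pgvector (which provides a cosine distance operator, `<=>`); the sketch below only shows the ranking logic. The ids and the toy 3-dimensional vectors are made up for illustration, since real embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_embedding, stored, k=3):
    """Return the ids of the k stored RCAs closest to the query, best first.

    `stored` is a list of (rca_id, embedding) pairs.
    """
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [rca_id for rca_id, _ in ranked[:k]]


# Toy data: four past RCAs and one new incident embedding.
past_rcas = [
    ("rca-a", [1.0, 0.0, 0.0]),
    ("rca-b", [0.9, 0.1, 0.0]),
    ("rca-c", [0.0, 1.0, 0.0]),
    ("rca-d", [0.0, 0.0, 1.0]),
]
few_shot_ids = top_k_similar([1.0, 0.05, 0.0], past_rcas)
```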

Hallucination Guardrails: Six validation checks run on every LLM output before publishing. They cover unknown service names, summary length, empty causal chains, unknown services in remediation, overconfident claims without deployment evidence, and duplicate steps. The pass rate is 100% across all benchmark runs.
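
A validator covering the six checks might look like the sketch below. It reuses the field names from the sample RCA output, but everything else is an assumption: the flag names, the summary length bounds, the 0.9 confidence threshold, the regex used to spot service names in free text, and the reading of "duplicate steps" as duplicate remediation steps.

```python
import re

# Assumed naming convention: lowercase, hyphenated, ending in -service or -gateway.
SERVICE_PATTERN = re.compile(r"\b[a-z0-9]+(?:-[a-z0-9]+)*-(?:service|gateway)\b")


def validate_rca(rca, known_services, has_deployment_evidence,
                 min_summary=20, max_summary=500, max_unbacked_confidence=0.9):
    """Return a list of flag strings; an empty list means every check passed."""
    flags = []
    chain = rca.get("causalChain", [])
    steps = rca.get("remediationSteps", [])

    # 1. Unknown service names in the root cause or causal chain.
    named = {rca.get("rootCauseService")} | {link.get("service") for link in chain}
    if any(name not in known_services for name in named if name):
        flags.append("UNKNOWN_SERVICE")
    # 2. Summary length outside the allowed range.
    if not min_summary <= len(rca.get("rootCauseSummary", "")) <= max_summary:
        flags.append("SUMMARY_LENGTH")
    # 3. Empty causal chain.
    if not chain:
        flags.append("EMPTY_CAUSAL_CHAIN")
    # 4. Unknown services mentioned in remediation text.
    mentioned = {m for step in steps for m in SERVICE_PATTERN.findall(step)}
    if mentioned - set(known_services):
        flags.append("UNKNOWN_SERVICE_IN_REMEDIATION")
    # 5. High confidence without deployment evidence to back it.
    confidence = rca.get("confidenceScore", 0.0)
    if confidence > max_unbacked_confidence and not has_deployment_evidence:
        flags.append("OVERCONFIDENT")
    # 6. Duplicate remediation steps (compared case-insensitively).
    normalized = [step.strip().lower() for step in steps]
    if len(normalized) != len(set(normalized)):
        flags.append("DUPLICATE_STEPS")
    return flags
```

The returned list maps naturally onto the `hallucinationFlags` field of the RCA report: an empty list corresponds to a clean pass.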

Quick Start

Prerequisites

Docker Desktop
Java 21 (Eclipse Temurin)
Python 3.11+
Maven 3.9+
Ollama with qwen2.5:14b pulled

1. Start infra

make up

Run all make commands from the project root, not from inside infra/.

Verify at http://localhost:8080 (Kafka UI) that all 6 topics exist.

Other useful commands:

make down          # stop containers (keeps volumes)
make clean         # stop + delete all volumes (full reset)
make logs          # tail all container logs
make kafka-topics  # list all Kafka topics
make psql          # open Postgres shell
make redis-cli     # open Redis CLI

2. Install shared schema

cd shared-schema
mvn clean install -DskipTests

3. Start Java services (separate terminals)

cd synthetic-incident-generator && mvn spring-boot:run
cd alert-ingestion-service      && mvn spring-boot:run
cd incident-correlation-engine  && mvn spring-boot:run

4. Start AI orchestrator

cd ai-orchestrator
pip install -r requirements.txt
python main.py

5. Trigger a scenario

# Linux / macOS
curl -X POST http://localhost:8083/api/scenarios/trigger \
  -H "Content-Type: application/json" \
  -d '{"scenarioType":"DEPLOYMENT_REGRESSION","targetService":"payment-service","eventDelayMs":1000}'

# Windows PowerShell
Invoke-WebRequest -Uri "http://localhost:8083/api/scenarios/trigger" `
  -Method POST -ContentType "application/json" `
  -Body '{"scenarioType":"DEPLOYMENT_REGRESSION","targetService":"payment-service","eventDelayMs":1000}'
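
# Python (any OS)
If neither curl nor PowerShell is convenient, the same request can be sent from Python using only the standard library. This is a convenience sketch: the endpoint and payload fields are taken from the commands above, while the function names and the 10 second timeout are arbitrary choices. The response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request


def build_trigger_request(scenario_type, target_service, event_delay_ms=1000,
                          base_url="http://localhost:8083"):
    """Build (but do not send) the POST request for the trigger endpoint."""
    payload = {
        "scenarioType": scenario_type,
        "targetService": target_service,
        "eventDelayMs": event_delay_ms,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/scenarios/trigger",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_scenario(scenario_type, target_service, **kwargs):
    """Send the request; requires the generator to be running on :8083."""
    request = build_trigger_request(scenario_type, target_service, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `trigger_scenario("DEPLOYMENT_REGRESSION", "payment-service")` is then equivalent to the curl command.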

6. Watch the pipeline

Kafka UI → http://localhost:8080 — watch messages flow through topics
Correlation windows → http://localhost:8082/api/correlation/windows
Grafana → http://localhost:3000 (admin/admin)

Failure Scenarios

| Scenario | Chain |
| --- | --- |
| `DEPLOYMENT_REGRESSION` | Deploy → CPU spike → latency → downstream errors |
| `MEMORY_LEAK` | Memory climbs → GC pressure → OOM restart |
| `KAFKA_LAG_CASCADE` | Consumer lag → downstream starvation |
| `DB_CONNECTION_EXHAUSTION` | Pool exhausted → service errors |
| `THREAD_POOL_STARVATION` | Thread pool full → request queuing |
| `CASCADING_FAILURE` | A → B → C multi-hop degradation |
| `TRANSIENT_SPIKE` | Single alert fires and resolves (false positive test) |
| `TRAFFIC_SURGE` | External traffic spike, no deployment |

Sample RCA Output

{
  "rootCauseService": "payment-service",
  "triggeringDeploymentVersion": "v3.9.7",
  "confidenceScore": 0.80,
  "modelUsed": "qwen2.5:14b",
  "latencyMs": 18092,
  "hallucinationFlags": [],
  "rootCauseSummary": "Deployment of v3.9.7 on payment-service introduced a regression causing CPU saturation, elevated p99 latency, and cascading errors on downstream services.",
  "causalChain": [
    { "type": "DEPLOYMENT",        "service": "payment-service", "description": "v3.9.7 deployed (rolling)" },
    { "type": "ALERT",             "service": "payment-service", "description": "HighCpuUsage fired (HIGH)" },
    { "type": "ALERT",             "service": "payment-service", "description": "HighP99Latency fired (HIGH)" },
    { "type": "CASCADING_FAILURE", "service": "order-service",   "description": "HighErrorRate fired (CRITICAL)" }
  ],
  "remediationSteps": [
    "Rollback payment-service to previous stable version",
    "Investigate changes introduced in v3.9.7",
    "Review DB connection pool and cache TTL config changes",
    "Redeploy corrected version after validation"
  ]
}

Infrastructure

| Service | URL | Credentials |
| --- | --- | --- |
| Kafka UI | http://localhost:8080 | none |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | none |
| PostgreSQL | localhost:5432 | dia_user / dia_pass |
| Redis | localhost:6379 | none |

Key Design Decisions

Why Redis for correlation windows? Redis TTL acts as the sliding window idle timeout for free. No separate timer is needed to track idleness, because the TTL resets on every new alert. Once the TTL expires naturally, the flush scheduler picks the window up.
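
The "TTL as idle timeout" behaviour can be modelled in memory to make the sliding effect visible. This is a toy model rather than the engine's code: the class name, the 120 second timeout used in the usage note, and the injected clock are all assumptions. Against a real Redis the expiry refresh would be an `EXPIRE` on the window key after each write.

```python
class TtlWindowStore:
    """In-memory model of 'one key per window, TTL = idle timeout'."""

    def __init__(self, idle_timeout_s, clock):
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock  # callable returning the current time in seconds
        self._windows = {}   # window id -> (expires_at, [alerts])

    def add_alert(self, window_id, alert):
        """Append the alert and push the expiry forward, like EXPIRE on each write."""
        now = self._clock()
        expires_at, alerts = self._windows.get(window_id, (0.0, []))
        if expires_at <= now:
            alerts = []  # previous window already lapsed; start a fresh one
        alerts.append(alert)
        self._windows[window_id] = (now + self._idle_timeout_s, alerts)

    def is_open(self, window_id):
        """A window is open until its most recent expiry time passes."""
        expires_at, _ = self._windows.get(window_id, (0.0, []))
        return expires_at > self._clock()
```

With a 120 second timeout, alerts at t=0 and t=100 keep the window open until t=220: the window closes 120 seconds after the last alert, not 120 seconds after the first.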

Why topology-aware windowing over time-bucket windowing? Time-bucket windowing groups alerts by when they fired. But a CPU spike on payment-service at T+0 and an error on order-service at T+3m might be in different time buckets yet causally related. Topology-aware windowing uses the service dependency graph to group alerts by reachability — not just timing.
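
A small worked example shows how fixed buckets split causally related alerts. The 5 minute bucket size and the timestamps are hypothetical, chosen only so that two alerts three minutes apart straddle a bucket boundary.

```python
BUCKET_SECONDS = 300  # hypothetical 5-minute bucket


def bucket_of(timestamp_s):
    """Index of the fixed time bucket a timestamp (in seconds) falls into."""
    return int(timestamp_s // BUCKET_SECONDS)


cpu_spike_at = 240            # payment-service alert, 4 minutes into bucket 0
order_errors_at = 240 + 180   # order-service alert three minutes later

# False: the two alerts land in buckets 0 and 1 despite being 3 minutes apart.
same_bucket = bucket_of(cpu_spike_at) == bucket_of(order_errors_at)
```

A purely time-bucketed correlator would report these as two incidents, whereas grouping by graph reachability keeps them together because order-service depends on payment-service.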

Why Kafka headers for traceId? Headers are metadata — embedding traceId in the message body would pollute the schema. Headers let you add/remove tracing without breaking existing consumers. Every service reads the header, logs it, and re-publishes it downstream.

Why RAG over pure LLM generation? Without RAG, the LLM generates from scratch every time. With RAG, it sees confirmed past RCAs for similar incidents. For recurring patterns (same service, same failure type), the second RCA is substantially more specific than the first.

Why LangGraph over raw API calls? LangGraph gives explicit typed state and conditional edges. The retry on parse failure loops back to only the generation step, without re-running prompt preparation or RAG retrieval. This is not possible with linear chain frameworks.
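
The control flow that the conditional edge provides can be shown in plain Python. This sketch deliberately does not use the LangGraph API, and it collapses the 5-node graph into the steps relevant to the retry. The step names, the prompt text, and the `llm` callable signature are hypothetical.

```python
import json


def run_pipeline(incident, llm, max_attempts=3):
    """Run prepare -> retrieve -> generate -> parse, retrying only generate/parse."""
    state = {
        "incident": incident,
        "calls": {"prepare": 0, "retrieve": 0, "generate": 0},
    }

    # Steps that run exactly once per incident.
    state["calls"]["prepare"] += 1
    state["prompt"] = f"Explain the root cause of: {incident}"
    state["calls"]["retrieve"] += 1
    state["examples"] = []  # placeholder for retrieved past RCAs

    # Conditional edge: on parse failure, go back to generation only.
    for _ in range(max_attempts):
        state["calls"]["generate"] += 1
        raw = llm(state["prompt"], state["examples"])
        try:
            state["rca"] = json.loads(raw)
            return state
        except json.JSONDecodeError:
            continue
    raise RuntimeError(f"no parseable output after {max_attempts} attempts")
```

If the first model reply is not valid JSON and the second is, the call counters end at one prepare, one retrieve, and two generate calls, which is the saving described above.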

Why manual Kafka ack? At-least-once delivery semantics. Offset is only committed after the incident is successfully published downstream. If the service crashes mid-processing, the alert replays on restart.
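
The ordering that gives at-least-once delivery is short enough to sketch without a Kafka client. The `correlate`, `publish`, and `commit` callables and the record shape are hypothetical stand-ins for the real consumer and producer.

```python
def handle_record(record, correlate, publish, commit):
    """Process one record; commit its offset only after the publish succeeds.

    If `correlate` or `publish` raises, `commit` is never reached, so the
    record is redelivered after a restart or rebalance.
    """
    incident = correlate(record["value"])
    publish("incidents.created", incident)
    commit(record["offset"])
```

The trade-off is that a crash between the publish and the commit replays the record, so downstream consumers may occasionally see the same incident twice and should tolerate duplicates.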

Why separate Java and Python services? The Java services are a pure event-driven pipeline: deterministic, fast, testable. The Python AI layer can be iterated independently without touching the correlation logic. Clear separation of concerns.

Project Structure

deterministic-incident-analyzer/
├── infra/                          # Docker Compose, Kafka, pgvector Postgres, Redis, Prometheus, Alertmanager, Grafana
├── shared-schema/                  # Java event contracts (AlertEvent, IncidentEvent, DeploymentEvent, TraceContext)
├── alert-ingestion-service/        # Spring Boot — Alertmanager webhook receiver + Redis dedup
├── incident-correlation-engine/    # Spring Boot — topology-aware sliding window + deployment correlation
│   └── topology/                   # Service dependency graph (Redis-backed)
├── synthetic-incident-generator/   # Spring Boot — 8 scenario types with traceId generation
├── ai-orchestrator/                # Python FastAPI — RAG + LangGraph RCA pipeline
│   ├── agents/                     # LangGraph 5-node RCA graph
│   ├── rag/                        # pgvector embeddings + similarity search
│   ├── guardrails/                 # Hallucination validation layer
│   └── messaging/                  # Kafka consumer with traceId propagation
├── docs/                           # Benchmark results
└── Makefile
