C0ffeeMachine/Deterministic-Incident-Analyzer

deterministic-incident-analyzer

Event-driven platform for AI-assisted root cause analysis. Correlates production alerts using sliding time windows, detects deployment regressions, and generates structured RCA reports via a local LLM — with hallucination guardrails.

Java · Spring Boot · Kafka · Python · LangGraph · Ollama · pgvector · Redis

📊 Benchmark Results: 8 scenarios · 0 hallucination flags · avg 7,894 ms RCA latency · 100% guardrail pass rate


Architecture

synthetic-incident-generator (:8083)
        │ X-Trace-Id header generated
        ▼
   alerts.raw (Kafka)
        │
        ▼
alert-ingestion-service (:8081)
  · Prometheus / Alertmanager webhook receiver
  · Redis SETNX deduplication (5 min TTL)
  · Anti-corruption enricher
        │
        ▼
   alerts.raw (Kafka) ──── deployments.events (Kafka)
        │                           │
        └───────────────────────────┘
                        │
                        ▼
   incident-correlation-engine (:8082)
     · Topology-aware sliding window (Redis graph)
     · Cross-service alert grouping via dependency graph
     · Deployment correlation (10 min lookback)
     · Causal chain builder + confidence scoring
     · TraceId propagated via Kafka headers
                        │
                        ▼
              incidents.created (Kafka)
                        │
                        ▼
          ai-orchestrator (:8000)
            · RAG: pgvector similarity search (BAAI/bge-small-en-v1.5)
            · LangGraph 5-node pipeline with retry
            · Qwen2.5 14B via Ollama (fully local)
            · Hallucination guardrails (6 checks)
            · Stores embeddings after each RCA
                        │
                        ▼
             rca.generated (Kafka)
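
The "Deployment correlation (10 min lookback)" step in the diagram above can be illustrated with a minimal Python sketch. The correlation engine itself is a Java service, so this is not its code: the `Deployment` shape and the `find_triggering_deployment` helper are hypothetical, and only the 10-minute lookback comes from the diagram.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional, Set

LOOKBACK = timedelta(minutes=10)  # from the diagram; everything else is illustrative


@dataclass(frozen=True)
class Deployment:
    service: str
    version: str
    deployed_at: datetime


def find_triggering_deployment(
    first_alert_at: datetime,
    affected_services: Set[str],
    deployments: Iterable[Deployment],
) -> Optional[Deployment]:
    """Return the latest deployment to an affected service that landed
    within the lookback window before the first alert, or None."""
    candidates = [
        d
        for d in deployments
        if d.service in affected_services
        and first_alert_at - LOOKBACK <= d.deployed_at <= first_alert_at
    ]
    return max(candidates, key=lambda d: d.deployed_at, default=None)
```

Picking the most recent candidate is an assumption made for this sketch; the real engine may rank candidates differently, for example through its confidence scoring.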

Services

| Service | Language | Port | Responsibility |
| --- | --- | --- | --- |
| alert-ingestion-service | Java / Spring Boot | 8081 | Receives Prometheus/Alertmanager webhooks, deduplicates via Redis, publishes to Kafka |
| incident-correlation-engine | Java / Spring Boot | 8082 | Topology-aware sliding window correlation, deployment detection, causal chain builder |
| synthetic-incident-generator | Java / Spring Boot | 8083 | Generates 8 realistic failure scenarios with distributed traceId |
| ai-orchestrator | Python / FastAPI | 8000 | RAG similarity search + LangGraph RCA pipeline + hallucination guardrails |
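
The Redis deduplication performed by alert-ingestion-service relies on SETNX with a 5 minute TTL. The following is a minimal sketch of that idea in Python, using an in-memory stand-in instead of a real Redis client so it runs anywhere. The `InMemoryStore` class, the `dedup:` key prefix, and the fields hashed in `fingerprint` are assumptions for illustration; the actual service is written in Java and may fingerprint alerts differently.

```python
import hashlib
import time

DEDUP_TTL_SECONDS = 300  # the 5 min TTL from the architecture diagram


class InMemoryStore:
    """Stand-in for Redis that supports only SET with NX and EX options."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at)
        self._clock = clock

    def set(self, key, value, nx=False, ex=None):
        now = self._clock()
        entry = self._data.get(key)
        live = entry is not None and entry[1] > now
        if nx and live:
            return None  # mirrors redis-py: None when NX blocks the write
        expires_at = now + ex if ex is not None else float("inf")
        self._data[key] = (value, expires_at)
        return True


def fingerprint(alert):
    # Hypothetical fingerprint fields; the real service may hash others.
    raw = "|".join(str(alert.get(k, "")) for k in ("service", "alertName", "severity"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def is_duplicate(store, alert):
    """True if the same alert was already seen within the TTL."""
    key = "dedup:" + fingerprint(alert)
    return store.set(key, "1", nx=True, ex=DEDUP_TTL_SECONDS) is None
```

With a real redis-py client, `client.set(key, "1", nx=True, ex=300)` has the same return convention, so `is_duplicate` would work unchanged.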

Kafka Topics

| Topic | Partitions | Description |
| --- | --- | --- |
| alerts.raw | 3 | Raw alerts from Prometheus or synthetic generator |
| alerts.correlated | 3 | Deduplicated and enriched alerts |
| deployments.events | 2 | Service deployment lifecycle events |
| incidents.created | 3 | Correlated incidents from correlation engine |
| incidents.enriched | 3 | Incidents enriched with deployment context |
| rca.generated | 1 | Final AI-generated RCA reports (ordered) |

Key Design Features

Distributed TraceId: Every alert gets a traceId generated at the synthetic generator and propagated via Kafka headers through the entire pipeline: alerts.raw → incidents.created → rca.generated. Each log line includes the traceId, enabling end-to-end request tracing across all services.

Topology-Aware Correlation: The correlation engine maintains a service dependency graph in Redis. When an alert fires on order-service, the engine checks whether any open window contains a related service (payment-service, inventory-service). If so, the alert joins that window instead of creating a new one, which enables cross-service incident grouping beyond time-bucket windowing.

api-gateway → order-service → payment-service
                            → inventory-service
           → notification-service
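
A minimal Python sketch of the window-joining rule, using the dependency graph drawn above. It assumes that two services count as "related" when one can reach the other along dependency edges; the real engine stores its graph in Redis and may define relatedness differently. The `assign_window` helper and the `window-N` ids are hypothetical.

```python
# Dependency edges copied from the diagram above (caller -> callees).
DEPENDS_ON = {
    "api-gateway": {"order-service", "notification-service"},
    "order-service": {"payment-service", "inventory-service"},
}


def reachable(graph, start):
    """All services reachable from start by following dependency edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def related(graph, a, b):
    # Assumption: related means reachable in either direction.
    return a == b or b in reachable(graph, a) or a in reachable(graph, b)


def assign_window(graph, open_windows, service):
    """Join the first open window holding a related service, else open one.

    open_windows maps a window id to the set of services already in it.
    """
    for window_id, members in open_windows.items():
        if any(related(graph, service, m) for m in members):
            members.add(service)
            return window_id
    window_id = f"window-{len(open_windows) + 1}"
    open_windows[window_id] = {service}
    return window_id
```

Under this definition an alert on order-service joins a window opened by payment-service, while an alert on notification-service opens its own window because neither service depends on the other.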

RAG Similarity Search: After each RCA is generated, the incident is embedded using BAAI/bge-small-en-v1.5 (384 dimensions, fully local) and stored in pgvector. On the next incident, the top-3 most similar past RCAs are retrieved and injected into the LLM prompt as few-shot examples, improving accuracy for recurring failure patterns.
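
The top-3 retrieval step can be sketched in plain Python as a cosine-similarity ranking. This is an in-memory illustration only: the project performs the search in pgvector, the toy vectors below have 3 dimensions rather than 384, and the `top_k_similar` helper is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_embedding, stored, k=3):
    """Rank stored (rca_id, embedding) pairs by similarity to the query.

    Returns the k best (rca_id, score) pairs, most similar first.
    """
    scored = [(rca_id, cosine_similarity(query_embedding, emb)) for rca_id, emb in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional embeddings standing in for 384-dimensional ones.
past_rcas = [
    ("rca-a", [1.0, 0.0, 0.0]),
    ("rca-b", [0.9, 0.1, 0.0]),
    ("rca-c", [0.0, 1.0, 0.0]),
    ("rca-d", [0.0, 0.0, 1.0]),
    ("rca-e", [0.5, 0.5, 0.0]),
]
examples = top_k_similar([1.0, 0.0, 0.0], past_rcas)
```

The ids in `examples` would then be used to look up the stored RCA texts that get injected into the prompt.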

Hallucination Guardrails: Every LLM output passes six validation checks before publishing. They cover unknown service names, summary length, empty causal chains, unknown services in remediation, overconfident claims without deployment evidence, and duplicate steps. The pass rate is 100% across all benchmark runs.
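
A minimal sketch of how the six checks could be expressed, using the field names from the sample RCA output in this README. The flag names, the summary length bounds, the 0.9 confidence threshold, and the regex used to spot service names in remediation text are all assumptions for illustration, not the project's actual rules.

```python
import re

KNOWN_SERVICES = {
    "api-gateway", "order-service", "payment-service",
    "inventory-service", "notification-service",
}
MIN_SUMMARY_CHARS, MAX_SUMMARY_CHARS = 20, 500  # hypothetical bounds
OVERCONFIDENT_THRESHOLD = 0.9                   # hypothetical threshold
SERVICE_PATTERN = re.compile(r"\b[a-z0-9-]+-service\b|\bapi-gateway\b")


def validate_rca(rca, known=KNOWN_SERVICES):
    """Return a list of flag names; an empty list means all checks passed."""
    flags = []
    chain = rca.get("causalChain") or []
    steps = rca.get("remediationSteps") or []

    named = {rca.get("rootCauseService")} | {e.get("service") for e in chain}
    if any(s not in known for s in named):
        flags.append("UNKNOWN_SERVICE")

    summary = rca.get("rootCauseSummary") or ""
    if not MIN_SUMMARY_CHARS <= len(summary) <= MAX_SUMMARY_CHARS:
        flags.append("SUMMARY_LENGTH")

    if not chain:
        flags.append("EMPTY_CAUSAL_CHAIN")

    mentioned = {m for step in steps for m in SERVICE_PATTERN.findall(step)}
    if mentioned - known:
        flags.append("UNKNOWN_SERVICE_IN_REMEDIATION")

    confident = rca.get("confidenceScore", 0.0) >= OVERCONFIDENT_THRESHOLD
    if confident and not rca.get("triggeringDeploymentVersion"):
        flags.append("OVERCONFIDENT_WITHOUT_DEPLOYMENT")

    normalized = [s.strip().lower() for s in steps]
    if len(normalized) != len(set(normalized)):
        flags.append("DUPLICATE_STEPS")

    return flags
```

The returned list plays the role of the `hallucinationFlags` field in the sample output, where an empty list indicates a clean report.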


Quick Start

Prerequisites

  • Docker Desktop
  • Java 21 (Eclipse Temurin)
  • Python 3.11+
  • Maven 3.9+
  • Ollama with qwen2.5:14b pulled

1. Start infra

make up

Run all make commands from the project root, not from inside infra/.

Verify at http://localhost:8080 (Kafka UI) that all 6 topics exist.

Other useful commands:

make down          # stop containers (keeps volumes)
make clean         # stop + delete all volumes (full reset)
make logs          # tail all container logs
make kafka-topics  # list all Kafka topics
make psql          # open Postgres shell
make redis-cli     # open Redis CLI

2. Install shared schema

cd shared-schema
mvn clean install -DskipTests

3. Start Java services (separate terminals)

cd synthetic-incident-generator && mvn spring-boot:run
cd alert-ingestion-service      && mvn spring-boot:run
cd incident-correlation-engine  && mvn spring-boot:run

4. Start AI orchestrator

cd ai-orchestrator
pip install -r requirements.txt
python main.py

5. Trigger a scenario

# Linux / macOS
curl -X POST http://localhost:8083/api/scenarios/trigger \
  -H "Content-Type: application/json" \
  -d '{"scenarioType":"DEPLOYMENT_REGRESSION","targetService":"payment-service","eventDelayMs":1000}'

# Windows PowerShell
Invoke-WebRequest -Uri "http://localhost:8083/api/scenarios/trigger" `
  -Method POST -ContentType "application/json" `
  -Body '{"scenarioType":"DEPLOYMENT_REGRESSION","targetService":"payment-service","eventDelayMs":1000}'
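
For scripted runs, the same request can be built from Python with only the standard library. This is a sketch equivalent to the curl and PowerShell commands above; the `build_trigger_request` and `send` helper names are hypothetical, while the URL and payload fields are copied from those commands.

```python
import json
import urllib.request

GENERATOR_URL = "http://localhost:8083/api/scenarios/trigger"


def build_trigger_request(scenario_type, target_service, event_delay_ms=1000):
    """Build the POST request that triggers a synthetic scenario."""
    payload = {
        "scenarioType": scenario_type,
        "targetService": target_service,
        "eventDelayMs": event_delay_ms,
    }
    return urllib.request.Request(
        GENERATOR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request; requires the generator to be running on :8083."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `send(build_trigger_request("DEPLOYMENT_REGRESSION", "payment-service"))` issues the same request as the commands above.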

6. Watch the pipeline


Failure Scenarios

| Scenario | Chain |
| --- | --- |
| DEPLOYMENT_REGRESSION | Deploy → CPU spike → latency → downstream errors |
| MEMORY_LEAK | Memory climbs → GC pressure → OOM restart |
| KAFKA_LAG_CASCADE | Consumer lag → downstream starvation |
| DB_CONNECTION_EXHAUSTION | Pool exhausted → service errors |
| THREAD_POOL_STARVATION | Thread pool full → request queuing |
| CASCADING_FAILURE | A → B → C multi-hop degradation |
| TRANSIENT_SPIKE | Single alert fires and resolves (false positive test) |
| TRAFFIC_SURGE | External traffic spike, no deployment |

Sample RCA Output

{
  "rootCauseService": "payment-service",
  "triggeringDeploymentVersion": "v3.9.7",
  "confidenceScore": 0.80,
  "modelUsed": "qwen2.5:14b",
  "latencyMs": 18092,
  "hallucinationFlags": [],
  "rootCauseSummary": "Deployment of v3.9.7 on payment-service introduced a regression causing CPU saturation, elevated p99 latency, and cascading errors on downstream services.",
  "causalChain": [
    { "type": "DEPLOYMENT",        "service": "payment-service", "description": "v3.9.7 deployed (rolling)" },
    { "type": "ALERT",             "service": "payment-service", "description": "HighCpuUsage fired (HIGH)" },
    { "type": "ALERT",             "service": "payment-service", "description": "HighP99Latency fired (HIGH)" },
    { "type": "CASCADING_FAILURE", "service": "order-service",   "description": "HighErrorRate fired (CRITICAL)" }
  ],
  "remediationSteps": [
    "Rollback payment-service to previous stable version",
    "Investigate changes introduced in v3.9.7",
    "Review DB connection pool and cache TTL config changes",
    "Redeploy corrected version after validation"
  ]
}

Infrastructure

| Service | URL | Credentials |
| --- | --- | --- |
| Kafka UI | http://localhost:8080 | |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | |
| PostgreSQL | localhost:5432 | dia_user / dia_pass |
| Redis | localhost:6379 | |

Key Design Decisions

Why Redis for correlation windows? The Redis TTL acts as the sliding window idle timeout for free. No scheduler is needed to track how long a window has been idle, because the TTL resets on every new alert. Once the TTL expires naturally, the flush scheduler picks the window up.
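
The idle-timeout behaviour can be simulated in a few lines of Python. This sketch replaces Redis with an in-memory dict and an injectable clock; the `SlidingWindowStore` class, the 120 second timeout, and the way `flush_expired` finds expired windows are assumptions, since the README does not describe how the flush scheduler detects expiry.

```python
import time


class SlidingWindowStore:
    """In-memory simulation of windows whose expiry resets on each alert."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._windows = {}  # window_id -> (alerts, expires_at)

    def add_alert(self, window_id, alert):
        alerts, _ = self._windows.get(window_id, ([], 0.0))
        alerts.append(alert)
        # Equivalent of resetting the key TTL: push the expiry forward.
        self._windows[window_id] = (alerts, self._clock() + self.idle_timeout_s)

    def flush_expired(self):
        """Remove and return every window that has been idle past its timeout."""
        now = self._clock()
        expired = {
            wid: alerts
            for wid, (alerts, expires_at) in self._windows.items()
            if expires_at <= now
        }
        for wid in expired:
            del self._windows[wid]
        return expired


# Simulated timeline with a fake clock (seconds).
now = [0.0]
store = SlidingWindowStore(idle_timeout_s=120, clock=lambda: now[0])
store.add_alert("w1", "HighCpuUsage")
now[0] = 100.0
store.add_alert("w1", "HighP99Latency")  # resets expiry to t=220
now[0] = 200.0
still_open = store.flush_expired()       # nothing expired yet
now[0] = 220.0
flushed = store.flush_expired()          # window w1 is now idle and flushed
```

The second alert at t=100 keeps the window open past the original t=120 deadline, which is the sliding behaviour the TTL reset provides.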

Why topology-aware windowing over time-bucket windowing? Time-bucket windowing groups alerts by when they fired. But a CPU spike on payment-service at T+0 and an error on order-service at T+3m might be in different time buckets yet causally related. Topology-aware windowing uses the service dependency graph to group alerts by reachability — not just timing.

Why Kafka headers for traceId? Headers are metadata — embedding traceId in the message body would pollute the schema. Headers let you add/remove tracing without breaking existing consumers. Every service reads the header, logs it, and re-publishes it downstream.
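
The read, log, and re-publish pattern can be sketched as follows. Kafka client libraries for Python commonly represent headers as a list of `(str, bytes)` tuples, which is what this sketch assumes. The header key `X-Trace-Id` is taken from the architecture diagram; whether the Kafka header uses the same key is an assumption, as are the helper names.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # assumed to match the header named in the diagram


def extract_trace_id(headers):
    """Return the traceId carried in the headers, or None if absent."""
    for key, value in headers or []:
        if key == TRACE_HEADER and value is not None:
            return value.decode("utf-8")
    return None


def outgoing_headers(incoming):
    """Build headers for the downstream message, preserving the traceId.

    A new traceId is generated only when the incoming message has none.
    """
    trace_id = extract_trace_id(incoming) or uuid.uuid4().hex
    passthrough = [(k, v) for k, v in (incoming or []) if k != TRACE_HEADER]
    return trace_id, passthrough + [(TRACE_HEADER, trace_id.encode("utf-8"))]
```

Because the traceId never touches the message body, a consumer that ignores headers keeps working unchanged.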

Why RAG over pure LLM generation? Without RAG, the LLM generates from scratch every time. With RAG, it sees confirmed past RCAs for similar incidents. For recurring patterns (same service, same failure type), the second RCA is substantially more specific than the first.

Why LangGraph over raw API calls? LangGraph gives explicit typed state and conditional edges. The retry on parse failure loops back to the generation step only, without re-running prompt preparation or RAG retrieval. That kind of targeted loop is not possible with linear chain frameworks.
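
The control flow described here can be shown without the LangGraph library. The sketch below is plain Python that mimics a conditional edge looping back to generation; it does not use LangGraph's API, and the step names, the `MAX_ATTEMPTS` budget, and the stub callables are hypothetical.

```python
import json

MAX_ATTEMPTS = 3  # hypothetical retry budget


def run_rca(incident, retrieve, generate):
    """Run retrieval and prompt preparation once, retrying only generation."""
    state = {"incident": incident, "attempts": 0, "rca": None}
    # These two steps run exactly once, even if generation is retried.
    state["examples"] = retrieve(incident)
    state["prompt"] = json.dumps({"incident": incident, "examples": state["examples"]})
    # Conditional edge: a parse failure loops back to generation only.
    while state["attempts"] < MAX_ATTEMPTS:
        state["attempts"] += 1
        raw = generate(state["prompt"])
        try:
            state["rca"] = json.loads(raw)
            break
        except json.JSONDecodeError:
            continue
    return state


# Demo with stub callables: the first model reply is not valid JSON.
calls = {"retrieve": 0, "generate": 0}
replies = iter(["not json", '{"rootCauseService": "payment-service"}'])


def stub_retrieve(incident):
    calls["retrieve"] += 1
    return []


def stub_generate(prompt):
    calls["generate"] += 1
    return next(replies)


result = run_rca({"id": "inc-1"}, stub_retrieve, stub_generate)
```

In the demo, retrieval runs once while generation runs twice, which is the saving the targeted retry is meant to provide.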

Why manual Kafka ack? Manual acknowledgment gives at-least-once delivery semantics. The offset is committed only after the incident is successfully published downstream. If the service crashes mid-processing, the alert replays on restart.
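
The commit-after-publish ordering can be simulated with an in-memory partition. This sketch is not Kafka client code: `InMemoryPartition`, `process_from_committed`, and the simulated crash are hypothetical stand-ins that show why a record is replayed when processing fails before the commit.

```python
class InMemoryPartition:
    """A list of records plus the committed offset a restart resumes from."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0


def process_from_committed(partition, handle):
    """Process records in order, committing only after handle() succeeds."""
    offset = partition.committed
    while offset < len(partition.records):
        handle(partition.records[offset])  # may raise: offset stays uncommitted
        offset += 1
        partition.committed = offset       # manual ack after success


def run_until_done(partition, handle, max_restarts=3):
    """Simulate service restarts: each restart resumes at the committed offset."""
    restarts = 0
    while partition.committed < len(partition.records):
        try:
            process_from_committed(partition, handle)
        except RuntimeError:
            restarts += 1
            if restarts > max_restarts:
                raise
    return restarts


published = []
_failed_once = set()


def publish_downstream(record):
    # Simulated crash the first time "alert-2" is handled.
    if record == "alert-2" and record not in _failed_once:
        _failed_once.add(record)
        raise RuntimeError("simulated crash before publish")
    published.append(record)
```

Running `run_until_done(InMemoryPartition(["alert-1", "alert-2", "alert-3"]), publish_downstream)` replays `alert-2` after one restart. Had the crash happened between publishing and committing, the record would have been published twice, which is why downstream consumers of an at-least-once pipeline should tolerate duplicates.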

Why separate Java and Python services? The Java services form a pure event-driven pipeline: deterministic, fast, testable. The Python AI layer can be iterated independently without touching the correlation logic. This keeps a clear separation of concerns.


Project Structure

deterministic-incident-analyzer/
├── infra/                          # Docker Compose, Kafka, pgvector Postgres, Redis, Prometheus, Alertmanager, Grafana
├── shared-schema/                  # Java event contracts (AlertEvent, IncidentEvent, DeploymentEvent, TraceContext)
├── alert-ingestion-service/        # Spring Boot — Alertmanager webhook receiver + Redis dedup
├── incident-correlation-engine/    # Spring Boot — topology-aware sliding window + deployment correlation
│   └── topology/                   # Service dependency graph (Redis-backed)
├── synthetic-incident-generator/   # Spring Boot — 8 scenario types with traceId generation
├── ai-orchestrator/                # Python FastAPI — RAG + LangGraph RCA pipeline
│   ├── agents/                     # LangGraph 5-node RCA graph
│   ├── rag/                        # pgvector embeddings + similarity search
│   ├── guardrails/                 # Hallucination validation layer
│   └── messaging/                  # Kafka consumer with traceId propagation
├── docs/                           # Benchmark results
└── Makefile
