AI Operations

When to use this runbook: monitoring, alerting, and incident response for the AI orchestration platform.

Prerequisites
When to use this
Testing Strategy
Key Metrics
Critical Alerts
Automated Maintenance Jobs
Procedure — Daily Operational Checklist
Procedure — Weekly Review
Procedure — Monthly Review
Incident Runbooks
Verification
Rollback
Troubleshooting

Prerequisites

Backend (`powernode-backend@default`), worker (`powernode-worker@default`), and frontend (`powernode-frontend@default`) services running; verify with `sudo scripts/systemd/powernode-installer.sh status`
`ai.monitoring.read` and `ai.autonomy.manage` permissions granted to operators
Sidekiq dashboard accessible (default `http://localhost:4567`)
Sentry / APM credentials configured (see production-deployment.md)

When to use this

Daily operational sweeps for the AI platform
After a deploy that touched orchestration code
When an automated alert fires (mission stuck, agent quarantined, trust score collapse, etc.)
Weekly / monthly governance reviews

Testing Strategy

Testing Pyramid

        ┌──────────────┐
        │  E2E Tests   │   ~10% — Mission/workflow execution scenarios
        │  Playwright  │
        ├──────────────┤
        │ Integration  │   ~30% — API + DB + WebSocket
        │   Tests      │
        ├──────────────┤
        │ Unit Tests   │   ~60% — Services, models, components
        │ RSpec + Jest │
        └──────────────┘

Coverage Targets

| Layer | Target | Focus |
| --- | --- | --- |
| Services | 85%+ | Orchestration, autonomy, security, memory, RAG |
| Models | 90%+ | Trust scoring, delegation, guardrails, missions |
| Controllers | 80%+ | AI controllers under `/api/v1/ai` |
| Jobs | 75%+ | Mission phase jobs, maintenance jobs |
| Frontend components | 70%+ | Mission dashboard, agent index, team index |
| Frontend services | 90%+ | API integration layer |

Key Test Patterns

```ruby
# Setup with permissions
user = user_with_permissions('ai.missions.manage')

# Auth headers for request specs
headers = auth_headers_for(user)

# Response helpers
expect_success_response(json_response_data)
expect_error_response("Not found", :not_found)

# Shared examples
include_examples 'requires authentication'
include_examples 'requires permission', 'ai.missions.manage'
include_examples 'scopes to current account'
```

Key Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| API response time | P95 across AI controllers | > 500ms |
| API error rate | 5xx errors | > 5% |
| Mission completion rate | Successful missions | < 80% |
| Ralph task success rate | Passed vs failed tasks | < 85% |
| Trust score distribution | Agent tier breakdown | > 50% supervised |
| Memory consolidation lag | STM entries pending promotion | > 1000 |
| RAG query latency | Average retrieval time | > 2s |
| Guardrail block rate | Blocked vs allowed requests | > 20% |
| Circuit breaker opens | Open provider breakers | > 3 |
| Skill conflict count | Unresolved conflicts | > 10 |
| Budget utilisation | Per-agent budget usage | > 90% |
| Queue depth | AI execution queue | > 1000 jobs |
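
The thresholds above can be expressed as data and checked against a metrics snapshot. This is a rough sketch only: the metric keys and the `breached_metrics` helper are hypothetical names, and the snapshot is assumed to carry values in the same units as the table (for the per-agent budget, the highest usage across agents).

```ruby
# Alert rules from the Key Metrics table: [metric key, comparison, threshold].
# Keys are illustrative; units follow the table (ms, percent, counts, seconds).
ALERT_RULES = [
  [:api_p95_ms,                 :>, 500],
  [:api_error_rate_pct,         :>, 5],
  [:mission_completion_pct,     :<, 80],
  [:ralph_task_success_pct,     :<, 85],
  [:supervised_agents_pct,      :>, 50],
  [:stm_pending_promotion,      :>, 1000],
  [:rag_query_latency_s,        :>, 2],
  [:guardrail_block_rate_pct,   :>, 20],
  [:open_circuit_breakers,      :>, 3],
  [:unresolved_skill_conflicts, :>, 10],
  [:agent_budget_used_pct,      :>, 90],
  [:ai_queue_depth,             :>, 1000]
].freeze

# Returns the keys of all metrics in `snapshot` that cross their threshold.
# Metrics absent from the snapshot are skipped rather than treated as breaches.
def breached_metrics(snapshot)
  breached = ALERT_RULES.select do |key, comparison, threshold|
    value = snapshot[key]
    !value.nil? && value.public_send(comparison, threshold)
  end
  breached.map(&:first)
end

if __FILE__ == $PROGRAM_NAME
  snapshot = { api_p95_ms: 620, mission_completion_pct: 91.0, ai_queue_depth: 40 }
  puts breached_metrics(snapshot).inspect
end
```

All comparisons are strict, matching the table: a queue depth of exactly 1000 jobs does not count as a breach.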

Critical Alerts

AI Platform availability (P1):

```yaml
- alert: AIPlatformHighErrorRate
  expr: |
    (sum(rate(ai_api_requests_500[5m])) /
     sum(rate(ai_api_requests_total[5m]))) > 0.05
  for: 5m
  labels:
    severity: critical
```

Agent quarantine surge:

```yaml
- alert: MassAgentQuarantine
  expr: ai_quarantine_records_active > 10
  for: 5m
  labels:
    severity: high
```

Trust score widespread decay:

```yaml
- alert: TrustScoreWidespreadDecay
  expr: ai_agents_supervised_tier_count / ai_agents_total > 0.5
  for: 1h
  labels:
    severity: warning
```
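
Each rule pairs a threshold expression with a `for:` hold period, so a brief spike does not page anyone. The sketch below approximates that behaviour for the error-rate rule, assuming evenly spaced samples with integer timestamps in seconds. The `Sample` struct and `firing?` helper are hypothetical, and real Prometheus evaluation differs in detail.

```ruby
# One evaluation sample: request counts observed at time `at` (seconds).
Sample = Struct.new(:at, :errors_5xx, :total)

ERROR_RATIO_LIMIT = 0.05 # mirrors "> 0.05" in AIPlatformHighErrorRate
HOLD_SECONDS = 5 * 60    # mirrors "for: 5m"

# Ratio of 5xx responses to all responses; 0.0 when there was no traffic.
def error_ratio(sample)
  return 0.0 if sample.total.zero?

  sample.errors_5xx.to_f / sample.total
end

# True when the most recent unbroken run of breaching samples has lasted
# at least HOLD_SECONDS. Any non-breaching sample resets the run.
def firing?(samples)
  sorted = samples.sort_by(&:at)
  streak_start = nil
  sorted.each do |sample|
    if error_ratio(sample) > ERROR_RATIO_LIMIT
      streak_start ||= sample.at
    else
      streak_start = nil
    end
  end
  return false if streak_start.nil?

  sorted.last.at - streak_start >= HOLD_SECONDS
end

if __FILE__ == $PROGRAM_NAME
  samples = (0..5).map { |i| Sample.new(i * 60, 10, 100) }
  puts firing?(samples)
end
```

The reset on any healthy sample is what distinguishes a sustained incident from a transient burst of errors.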

Automated Maintenance Jobs

The worker runs the following jobs on the schedules shown. Full job catalogue: worker-operations.md.

| Job | Schedule | Purpose |
| --- | --- | --- |
| Trust score decay | 2:00 AM daily | Decay idle agent trust toward 0.5 baseline |
| Learning decay | 3:45 AM daily | Exponential decay on stale compound learnings |
| Memory consolidation | 4:00 AM daily | STM → LTM promotion (access ≥ 3) |
| Context rot detection | 4:00 AM daily | Archive context entries with staleness ≥ 0.9 |
| Skill conflict scan | 4:15 AM daily | Detect overlapping / contradictory skills |
| Skill stale decay | 5:00 AM weekly | Reduce effectiveness of unused skills |
| Skill re-embedding | 5:00 AM weekly | Refresh skill embeddings for discovery |
| Knowledge doc sync | 5:30 AM daily | Sync knowledge to documentation files |
| Skill gap detection | 3:00 AM monthly | Identify missing capabilities |
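
To make the numeric rules in the table concrete, here is an illustrative sketch of three of them: decay toward the 0.5 baseline, promotion at three or more accesses, and archiving at staleness 0.9 or higher. The decay `rate`, the hash keys, and all helper names are assumptions made for the example; the actual job logic lives in the worker and may differ.

```ruby
TRUST_BASELINE = 0.5         # from "Decay idle agent trust toward 0.5 baseline"
PROMOTION_MIN_ACCESSES = 3   # from "STM → LTM promotion (access ≥ 3)"
ARCHIVE_STALENESS = 0.9      # from "staleness ≥ 0.9"

# Moves `score` a fraction `rate` of the way toward the baseline.
# `rate` is a hypothetical tuning value, not taken from the platform.
def decay_trust(score, rate: 0.1)
  score + (TRUST_BASELINE - score) * rate
end

# Short-term memory entries that qualify for promotion to long-term memory.
def promotable(entries)
  entries.select { |entry| entry[:access_count] >= PROMOTION_MIN_ACCESSES }
end

# Context entries stale enough to be archived.
def archivable(entries)
  entries.select { |entry| entry[:staleness] >= ARCHIVE_STALENESS }
end

if __FILE__ == $PROGRAM_NAME
  puts decay_trust(0.9).round(3)
  puts promotable([{ id: 1, access_count: 3 }, { id: 2, access_count: 1 }]).inspect
end
```

Decaying toward a baseline, rather than toward zero, means idle agents with low scores drift upward while idle agents with high scores drift downward.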

Procedure — Daily Operational Checklist

Roughly 10 minutes per day.

Check dashboard for anomalies (error rates, latencies)
Review overnight quarantine records
Verify all services running: `sudo scripts/systemd/powernode-installer.sh status`
Check mission pipeline — any stuck missions?
Review cost tracking — any budget alerts?

Procedure — Weekly Review

Roughly 30 minutes per week.

Analyse trust score distribution across agents
Review skill conflict report
Check memory consolidation metrics
Review RAG query quality scores
Audit guardrail block rates
Check model router optimisation recommendations

Procedure — Monthly Review

Roughly 2 hours per month.

Trust tier promotion / demotion analysis
Cost attribution deep-dive by provider / model / agent
Skill gap detection results
Security audit trail review (high-risk events)
Memory tier capacity planning
Knowledge base freshness assessment

Incident Runbooks

Mission Stuck in Phase

Check mission status: `Ai::Mission.find(id).current_phase`
Check worker job status: `systemctl status powernode-worker@default`
Review worker logs: `journalctl -u powernode-worker@default -f`
Check for failed Ralph tasks: `mission.ralph_loop.ralph_tasks.failed`
Retry the current phase: `mission_orchestrator.retry_phase!`

Agent Quarantined Unexpectedly

Check the quarantine record: `Ai::QuarantineRecord.for_agent(agent_id).active`
Review trigger reason and source
Check behavioural fingerprint anomalies
If false positive: restore the agent and tune thresholds
Review the security audit trail for context

Trust Score Collapsed

Check recent executions: `agent.agent_executions.recent`
Review dimension breakdown: `agent.agent_trust_score.dimensions`
Check for emergency demotion events
If legitimate: agent will recover naturally with successful executions
If anomalous: investigate the security audit trail

Verification

After any intervention:

`GET /api/v1/ai/monitoring/health` → `data.healthy: true`
`GET /api/v1/ai/missions?status=stuck` → returns no items
`sudo scripts/systemd/powernode-installer.sh status` → all services active
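
The two API checks above can be scripted once the response bodies are in hand. The following is a sketch under stated assumptions: the health payload exposes `data.healthy` as described, and the missions payload returns its items as an array under `data` (the array shape is a guess, as are the helper names). How the bodies are fetched and authenticated is left out.

```ruby
require "json"

# True only when the health payload reports data.healthy == true.
# Malformed or unexpectedly shaped bodies count as unhealthy.
def platform_healthy?(body)
  JSON.parse(body).dig("data", "healthy") == true
rescue JSON::ParserError, TypeError
  false
end

# Number of missions in a stuck-missions response.
# Assumes the items are returned as an array under "data".
def stuck_mission_count(body)
  items = JSON.parse(body).fetch("data", [])
  raise ArgumentError, "expected data to be an array" unless items.is_a?(Array)

  items.length
end

if __FILE__ == $PROGRAM_NAME
  health = '{"data":{"healthy":true}}'
  missions = '{"data":[]}'
  ok = platform_healthy?(health) && stuck_mission_count(missions).zero?
  puts(ok ? "verification passed" : "verification failed")
end
```

Treating an unparseable health response as unhealthy keeps the check conservative after an intervention.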

Reliability targets:

| Target | Threshold |
| --- | --- |
| API availability | 99.9% uptime |
| API response time (P95) | < 200ms |
| Mission success rate | > 80% |
| Ralph task pass rate | > 85% |
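
An availability target is easier to reason about as a downtime budget. As a worked example, 99.9% over a 30-day window leaves 0.1% of 43,200 minutes, or about 43.2 minutes of permitted downtime. The helper below is a hypothetical illustration of that arithmetic; the window length is a choice, not something the platform prescribes.

```ruby
# Allowed downtime, in minutes, for an availability target over `days` days.
def downtime_budget_minutes(availability_pct, days)
  total_minutes = days * 24 * 60
  (total_minutes * (100.0 - availability_pct) / 100.0).round(2)
end

if __FILE__ == $PROGRAM_NAME
  puts downtime_budget_minutes(99.9, 30) # 30-day window
  puts downtime_budget_minutes(99.9, 7)  # 7-day window
end
```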

Performance targets:

| Target | Threshold |
| --- | --- |
| RAG query latency (P95) | < 2s |
| Memory consolidation | < 5 minutes |
| Trust evaluation | < 100ms |
| Model route decision | < 50ms |

Rollback

If an intervention destabilises an agent or mission:

`POST /api/v1/ai/autonomy/kill_switch` → emergency halt
Investigate via `GET /api/v1/ai/kill_switch/status`
Restore from the most recent good agent state (via the admin UI or the Rails console: `Ai::Agent.find(id).restore_from_history!(version: N)`)
`POST /api/v1/ai/autonomy/kill_switch/release` once safe
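
For operators who prefer scripting the HTTP steps, the sketch below builds the three requests with Ruby's standard `Net::HTTP` classes, using the paths exactly as listed above. Bearer-token authentication, the `build_request` helper, and the `ROLLBACK_STEPS` constant are assumptions for illustration. The restore step is omitted because it happens in the admin UI or the Rails console.

```ruby
require "net/http"

# HTTP steps of the rollback, in order. The restore step between "status"
# and "release" is performed manually and is not represented here.
ROLLBACK_STEPS = [
  [:post, "/api/v1/ai/autonomy/kill_switch"],         # emergency halt
  [:get,  "/api/v1/ai/kill_switch/status"],           # investigate
  [:post, "/api/v1/ai/autonomy/kill_switch/release"]  # release once safe
].freeze

# Builds (but does not send) a request for one step.
# Bearer-token auth is an assumption; adapt to the deployment's scheme.
def build_request(verb, path, token)
  request_class = verb == :post ? Net::HTTP::Post : Net::HTTP::Get
  request = request_class.new(path)
  request["Authorization"] = "Bearer #{token}"
  request["Accept"] = "application/json"
  request
end

if __FILE__ == $PROGRAM_NAME
  ROLLBACK_STEPS.each do |verb, path|
    request = build_request(verb, path, "example-token")
    puts "#{request.method} #{request.path}"
  end
end
```

The requests are built separately from sending so that each one can be issued deliberately, with the release sent only after the restore has been confirmed.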

Troubleshooting

Quick Reference Commands

```shell
# Backend tests
cd server && bundle exec rspec

# Frontend tests
cd frontend && CI=true npm test

# TypeScript check
cd frontend && npx tsc --noEmit

# Service status
sudo scripts/systemd/powernode-installer.sh status

# Backend logs
journalctl -u powernode-backend@default -f

# Worker monitor
systemctl status powernode-worker@default
```

| Symptom | Likely cause | First action |
| --- | --- | --- |
| `429 Too Many Requests` from `/api/v1/ai/*` | Rate limit exceeded | Check `current_quota_usage` on the data source / provider |
| Memory consolidation never runs | Worker scheduler stopped | Restart `powernode-worker@default`; verify the `sidekiq-cron` job list |
| RAG retrieval returns no chunks | Embedding job is lagging | Check `AiKnowledgeGraphMaintenanceJob` and run a manual sync |
| Skill conflict count > 10 | Recent skill changes introduced overlaps | Run `platform.skill_health`, then resolve via `resolve_contradiction` |

Related runbooks

worker-operations.md — Worker job reference
production-deployment.md — Initial AI service deployment
performance-tuning.md — Latency & throughput tuning

Materials previously at

docs/platform/AI_ORCHESTRATION_OPERATIONS.md

Last verified: 2026-05-17
