When to use this runbook: monitoring, alerting, and incident response for the AI orchestration platform.
- Prerequisites
- When to use this
- Testing Strategy
- Key Metrics
- Critical Alerts
- Automated Maintenance Jobs
- Procedure — Daily Operational Checklist
- Procedure — Weekly Review
- Procedure — Monthly Review
- Incident Runbooks
- Verification
- Rollback
- Troubleshooting
- Backend (`powernode-backend@default`), worker (`powernode-worker@default`), and frontend (`powernode-frontend@default`) services running; verify with `sudo scripts/systemd/powernode-installer.sh status`
- `ai.monitoring.read` and `ai.autonomy.manage` permissions for operators
- Sidekiq dashboard accessible (default `http://localhost:4567`)
- Sentry / APM credentials configured (see production-deployment.md)
- Daily operational sweeps for the AI platform
- After a deploy that touched orchestration code
- When an automated alert fires (mission stuck, agent quarantined, trust score collapse, etc.)
- Weekly / monthly governance reviews
┌──────────────┐
│ E2E Tests │ ~10% — Mission/workflow execution scenarios
│ Playwright │
├──────────────┤
│ Integration │ ~30% — API + DB + WebSocket
│ Tests │
├──────────────┤
│ Unit Tests │ ~60% — Services, models, components
│ RSpec + Jest │
└──────────────┘
| Layer | Target | Focus |
|---|---|---|
| Services | 85%+ | Orchestration, autonomy, security, memory, RAG |
| Models | 90%+ | Trust scoring, delegation, guardrails, missions |
| Controllers | 80%+ | AI controllers under /api/v1/ai |
| Jobs | 75%+ | Mission phase jobs, maintenance jobs |
| Frontend components | 70%+ | Mission dashboard, agent index, team index |
| Frontend services | 90%+ | API integration layer |
# Setup with permissions
user = user_with_permissions('ai.missions.manage')
# Auth headers for request specs
headers = auth_headers_for(user)
# Response helpers
expect_success_response(json_response_data)
expect_error_response("Not found", :not_found)
# Shared examples
include_examples 'requires authentication'
include_examples 'requires permission', 'ai.missions.manage'
include_examples 'scopes to current account'

| Metric | Description | Alert Threshold |
|---|---|---|
| API response time | P95 across AI controllers | > 500ms |
| API error rate | 5xx errors | > 5% |
| Mission completion rate | Successful missions | < 80% |
| Ralph task success rate | Passed vs failed tasks | < 85% |
| Trust score distribution | Agent tier breakdown | > 50% supervised |
| Memory consolidation lag | STM entries pending promotion | > 1000 |
| RAG query latency | Average retrieval time | > 2s |
| Guardrail block rate | Blocked vs allowed requests | > 20% |
| Circuit breaker opens | Open provider breakers | > 3 |
| Skill conflict count | Unresolved conflicts | > 10 |
| Budget utilisation | Per-agent budget usage | > 90% |
| Queue depth | AI execution queue | > 1000 jobs |
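
The thresholds in the table above can be checked mechanically against a metrics snapshot. The following is a minimal sketch, not the platform's own alerting code: the metric key names are hypothetical, and percentages are assumed to be expressed as fractions between 0 and 1.

```ruby
# Alert thresholds copied from the Key Metrics table.
# Key names are hypothetical; rates and shares are assumed to be fractions (0..1).
THRESHOLDS = {
  api_p95_ms:                 { op: :>, limit: 500 },
  api_error_rate:             { op: :>, limit: 0.05 },
  mission_completion_rate:    { op: :<, limit: 0.80 },
  ralph_task_success_rate:    { op: :<, limit: 0.85 },
  supervised_agent_share:     { op: :>, limit: 0.50 },
  stm_pending_promotion:      { op: :>, limit: 1000 },
  rag_query_latency_s:        { op: :>, limit: 2.0 },
  guardrail_block_rate:       { op: :>, limit: 0.20 },
  open_circuit_breakers:      { op: :>, limit: 3 },
  unresolved_skill_conflicts: { op: :>, limit: 10 },
  budget_utilisation:         { op: :>, limit: 0.90 },
  ai_queue_depth:             { op: :>, limit: 1000 }
}.freeze

# Returns the names of metrics whose sampled value breaches its threshold.
# Metrics missing from the sample are skipped rather than treated as breaches.
def breached_metrics(sample)
  breaches = THRESHOLDS.select do |name, rule|
    value = sample[name]
    !value.nil? && value.public_send(rule[:op], rule[:limit])
  end
  breaches.keys
end

sample = { api_p95_ms: 620, api_error_rate: 0.01, mission_completion_rate: 0.76, ai_queue_depth: 1000 }
p breached_metrics(sample)
```

The comparisons are strict, matching the table's `>` and `<` wording, so a queue depth of exactly 1000 jobs does not count as a breach in this sketch.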
AI Platform availability (P1):
- alert: AIPlatformHighErrorRate
  expr: |
    (sum(rate(ai_api_requests_500[5m])) /
     sum(rate(ai_api_requests_total[5m]))) > 0.05
  for: 5m
  labels:
    severity: critical

Agent quarantine surge:
- alert: MassAgentQuarantine
  expr: ai_quarantine_records_active > 10
  for: 5m
  labels:
    severity: high

Trust score widespread decay:
- alert: TrustScoreWidespreadDecay
  expr: ai_agents_supervised_tier_count / ai_agents_total > 0.5
  for: 1h
  labels:
    severity: warning

The worker runs the following jobs on the schedules shown. Full job catalogue: worker-operations.md.
| Job | Schedule | Purpose |
|---|---|---|
| Trust score decay | 2:00 AM daily | Decay idle agent trust toward 0.5 baseline |
| Learning decay | 3:45 AM daily | Exponential decay on stale compound learnings |
| Memory consolidation | 4:00 AM daily | STM → LTM promotion (access ≥ 3) |
| Context rot detection | 4:00 AM daily | Archive context entries with staleness ≥ 0.9 |
| Skill conflict scan | 4:15 AM daily | Detect overlapping / contradictory skills |
| Skill stale decay | 5:00 AM weekly | Reduce effectiveness of unused skills |
| Skill re-embedding | 5:00 AM weekly | Refresh skill embeddings for discovery |
| Knowledge doc sync | 5:30 AM daily | Sync knowledge to documentation files |
| Skill gap detection | 3:00 AM monthly | Identify missing capabilities |
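
Three of the jobs above apply simple numeric rules: trust decays toward a 0.5 baseline, short-term memory is promoted at three or more accesses, and context entries are archived at staleness 0.9 or higher. The sketch below illustrates those rules only. The linear decay form and the `rate` value are assumptions made for illustration, and the method names are hypothetical, not the platform's actual job code.

```ruby
BASELINE = 0.5

# Move an idle agent's trust score a fraction of the way toward the baseline.
# `rate` is an assumed per-run decay factor, not a documented platform value.
def decay_trust(score, rate: 0.1)
  (score + (BASELINE - score) * rate).round(4)
end

# STM entries qualify for LTM promotion once accessed at least 3 times.
def promote_to_ltm?(access_count)
  access_count >= 3
end

# Context entries are archived once staleness reaches 0.9.
def archive_context?(staleness)
  staleness >= 0.9
end

puts decay_trust(0.9) # a high score drifts down toward 0.5
puts decay_trust(0.2) # a low score drifts up toward 0.5
```

Under this assumed form, repeated runs converge on the baseline from either side, which is consistent with the "toward 0.5 baseline" wording in the table.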
Roughly 10 minutes per day.
- Check dashboard for anomalies (error rates, latencies)
- Review overnight quarantine records
- Verify all services running: `sudo scripts/systemd/powernode-installer.sh status`
- Check mission pipeline: any stuck missions?
- Review cost tracking: any budget alerts?
Roughly 30 minutes per week.
- Analyse trust score distribution across agents
- Review skill conflict report
- Check memory consolidation metrics
- Review RAG query quality scores
- Audit guardrail block rates
- Check model router optimisation recommendations
Roughly 2 hours per month.
- Trust tier promotion / demotion analysis
- Cost attribution deep-dive by provider / model / agent
- Skill gap detection results
- Security audit trail review (high-risk events)
- Memory tier capacity planning
- Knowledge base freshness assessment

Mission stuck:
- Check mission status: `Ai::Mission.find(id).current_phase`
- Check worker job status: `systemctl status powernode-worker@default`
- Review worker logs: `journalctl -u powernode-worker@default -f`
- Check for failed Ralph tasks: `mission.ralph_loop.ralph_tasks.failed`
- Retry the current phase: `mission_orchestrator.retry_phase!`

Agent quarantined:
- Check the quarantine record: `Ai::QuarantineRecord.for_agent(agent_id).active`
- Review trigger reason and source
- Check behavioural fingerprint anomalies
- If false positive: restore the agent and tune thresholds
- Review the security audit trail for context

Trust score collapse:
- Check recent executions: `agent.agent_executions.recent`
- Review dimension breakdown: `agent.agent_trust_score.dimensions`
- Check for emergency demotion events
- If legitimate: agent will recover naturally with successful executions
- If anomalous: investigate the security audit trail
After any intervention:
- `GET /api/v1/ai/monitoring/health` → `data.healthy: true`
- `GET /api/v1/ai/missions?status=stuck` → returns no items
- `sudo scripts/systemd/powernode-installer.sh status` → all services `active`
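
The two API checks above can be scripted once the response bodies are in hand. This is a rough sketch under stated assumptions: the health payload is taken to be a JSON object with a `data.healthy` boolean (as the check implies), and the stuck-missions listing is assumed to return its items as an array under `data`. The helper names are hypothetical; adjust the envelope to match the real API.

```ruby
require 'json'

# Parse a response body, returning nil when it is not a JSON object.
def parse_object(body)
  parsed = JSON.parse(body)
  parsed.is_a?(Hash) ? parsed : nil
rescue JSON::ParserError
  nil
end

# True only when the health payload reports data.healthy == true.
def platform_healthy?(body)
  data = parse_object(body)&.fetch('data', nil)
  data.is_a?(Hash) && data['healthy'] == true
end

# True when the stuck-missions listing is an empty array.
# Assumes items live under "data"; this envelope is not confirmed by the runbook.
def no_stuck_missions?(body)
  items = parse_object(body)&.fetch('data', nil)
  items.is_a?(Array) && items.empty?
end

puts platform_healthy?('{"data":{"healthy":true}}')
puts no_stuck_missions?('{"data":[]}')
```

Malformed or unexpected bodies are treated as failures, so a proxy error page or partial response does not pass verification by accident.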
Reliability targets:
| Target | Threshold |
|---|---|
| API availability | 99.9% uptime |
| API response time (P95) | < 200ms |
| Mission success rate | > 80% |
| Ralph task pass rate | > 85% |
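
To make the 99.9% availability target concrete, it helps to convert it into an allowed-downtime budget. The sketch below does that arithmetic; the 30-day and 7-day windows are assumptions for illustration, since the table does not state a measurement window.

```ruby
# Allowed downtime, in minutes, for an availability target over a window of days.
def downtime_budget_minutes(availability, days: 30)
  total_minutes = days * 24 * 60
  (total_minutes * (1.0 - availability)).round(1)
end

puts downtime_budget_minutes(0.999)          # 30-day window
puts downtime_budget_minutes(0.999, days: 7) # weekly window
```

Over a 30-day window, 99.9% leaves about 43.2 minutes of downtime, so a single extended incident can consume the whole month's budget.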
Performance targets:
| Target | Threshold |
|---|---|
| RAG query latency (P95) | < 2s |
| Memory consolidation | < 5 minutes |
| Trust evaluation | < 100ms |
| Model route decision | < 50ms |
If an intervention destabilises an agent or mission:
- `POST /api/v1/ai/autonomy/kill_switch` → emergency halt
- Investigate via `GET /api/v1/ai/kill_switch/status`
- Restore from the most recent good agent state (via admin UI or Rails console: `Ai::Agent.find(id).restore_from_history!(version: N)`)
- `POST /api/v1/ai/autonomy/kill_switch/release` once safe
# Backend tests
cd server && bundle exec rspec
# Frontend tests
cd frontend && CI=true npm test
# TypeScript check
cd frontend && npx tsc --noEmit
# Service status
sudo scripts/systemd/powernode-installer.sh status
# Backend logs
journalctl -u powernode-backend@default -f
# Worker monitor
systemctl status powernode-worker@default

| Symptom | Likely cause | First action |
|---|---|---|
| 429 Too Many Requests from /api/v1/ai/* | Rate limiting | Check current_quota_usage on the data source / provider |
| Memory consolidation never runs | Worker scheduler stopped | Restart powernode-worker@default; verify sidekiq-cron job list |
| RAG retrieval returns no chunks | Embedding job behind | Check AiKnowledgeGraphMaintenanceJob + run manual sync |
| Skill conflict count > 10 | Recent skill changes introduced overlaps | Run platform.skill_health, resolve via resolve_contradiction |
- worker-operations.md — Worker job reference
- production-deployment.md — Initial AI service deployment
- performance-tuning.md — Latency & throughput tuning
docs/platform/AI_ORCHESTRATION_OPERATIONS.md
Last verified: 2026-05-17