
Witness & Deacon: Alarm-driven orchestration with on-demand LLM triage #442

@jrf0110

Description

Overview

In local Gastown, the Witness and Deacon are persistent AI agent sessions running continuous patrol loops in tmux. They burn LLM tokens on every cycle, even though ~90% of their behavior is mechanical — threshold checks, protocol message routing, session liveness detection, timer evaluation. The LLM's role is largely to run gt commands, parse structured output, and call deterministic Go handler functions. Genuine reasoning is only needed for a small set of ambiguous situations.

The cloud should not replicate this. The TownDO alarm IS the patrol loop. It should run all mechanical checks as deterministic code, and only spawn short-lived LLM agent sessions when a check produces an ambiguous result that requires reasoning.

Parent: #204 / #419

Background: What Local Gastown's Witness & Deacon Actually Do

Three execution layers in local Gastown

  1. Go Daemon — Pure Go process on a 3-minute heartbeat. All behavior is mechanical. Handles session liveness, crash loop detection, orphan cleanup, GUPP checks, heartbeat freshness.
  2. Deacon — LLM agent in tmux running mol-deacon-patrol formula. Continuous loop: inbox check → orphan cleanup → spawn triggers → gate evaluation → convoy checks → health scan → zombie scan → plugin run → loop.
  3. Witness (one per rig) — LLM agent in tmux running mol-witness-patrol formula. Continuous loop: inbox check → process cleanups → check refinery → survey workers → loop.

Mechanical behaviors (deterministic — no LLM needed)

These are implemented as Go handler functions that the LLM agents invoke but don't reason about:

Witness mechanical behaviors:

  • Receive POLECAT_DONE → verify git state → create cleanup wisp → send MERGE_READY to Refinery
  • Receive MERGED → verify commit on main → nuke polecat worktree
  • Receive MERGE_FAILED → send rework notification to polecat identity
  • Zombie detection: cross-reference agent metadata state with container process state
  • Done-intent label detection for crash recovery (stale done-intent label on agent bead)
  • Hung session detection (30-min inactivity threshold on last_activity_at)
  • GUPP violation detection (hook set + no progress for 30 min)
  • Orphaned work detection (hook set + no container process)
  • Auto-nuke clean agents, flag dirty ones for triage
  • Swarm completion tracking (count closed vs total tracked beads)
  • Message deduplication (processed message ID set)
  • Timer gate evaluation (created_at + timeout < now)
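
Several of these checks are pure timestamp arithmetic. A minimal sketch of the timer-gate and hung-session checks, with illustrative type and constant names (`TimerGate`, `AgentRecord`, `HUNG_THRESHOLD_MS` are assumptions, not the actual Gastown types):

```typescript
// Hypothetical shapes for the records these checks inspect.
interface TimerGate {
  createdAt: number; // epoch ms
  timeoutMs: number;
}

interface AgentRecord {
  lastActivityAt: number; // epoch ms
}

// 30-minute inactivity threshold from the behavior list above.
const HUNG_THRESHOLD_MS = 30 * 60 * 1000;

// A gate fires once created_at + timeout < now.
function gateExpired(gate: TimerGate, now: number): boolean {
  return gate.createdAt + gate.timeoutMs < now;
}

// A session is hung once last_activity_at is older than the threshold.
function sessionHung(agent: AgentRecord, now: number): boolean {
  return now - agent.lastActivityAt > HUNG_THRESHOLD_MS;
}
```

Nothing here needs an LLM: both checks are total functions over data the TownDO already stores.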

Deacon mechanical behaviors:

  • Heartbeat freshness check (timestamp age thresholds)
  • Redispatch failed beads (cooldown timer + attempt counter + max retry threshold)
  • Stale hook detection (dead session + hooked bead OR unknown assignee + age > 1 hour)
  • Convoy completion checks (all tracked beads closed → land convoy)
  • Stranded convoy detection (open beads with no assigned agent)
  • Orphan process cleanup (TTY-based process filter)
  • Gate evaluation (elapsed time > timeout)
  • Plugin gate evaluation (cooldown/cron/condition checks)
  • Dog pool maintenance (pool size, work duration vs timeout, chronic failure tracking)
  • Log rotation and wisp compaction
  • Crash loop detection (restart count + timing with exponential backoff)
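
Crash loop detection is likewise mechanical. A sketch under assumed state fields (`restartCount`, `lastRestartAt`) and assumed tuning constants:

```typescript
// Hypothetical per-agent restart state tracked by the daemon.
interface RestartState {
  restartCount: number;
  lastRestartAt: number; // epoch ms
}

const BASE_BACKOFF_MS = 30_000; // assumed base delay
const MAX_RESTARTS = 5;         // assumed crash-loop cap

// Allow a restart only if the count is under the cap and the
// exponentially growing cooldown since the last restart has elapsed.
function mayRestart(state: RestartState, now: number): boolean {
  if (state.restartCount >= MAX_RESTARTS) return false; // crash loop: stop restarting
  const backoff = BASE_BACKOFF_MS * 2 ** state.restartCount;
  return now - state.lastRestartAt >= backoff;
}
```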

Intelligent behaviors (genuinely need LLM reasoning)

These are the ~10% of behaviors where deterministic code can't make the right call:

  • Dirty polecat triage: must read git status/git diff output and judge whether uncommitted changes are valuable work worth saving, disposable artifacts, or a confused state requiring escalation
  • Refinery queue health assessment: local Gastown's formula explicitly says "no hardcoded thresholds — use your judgment." Must reason about queue depth, staleness patterns, and time context
  • Live agent progress inspection: must interpret agent conversation/activity output to determine whether an agent is stuck, thinking deeply, or making slow but real progress
  • Help request handling: when a polecat sends HELP, must understand the problem domain and craft contextual guidance
  • Escalation assessment: must understand an escalation's context to decide whether to handle locally, forward to the Mayor, or alert a human
  • Zombie scan confirmation: verify that automated zombie detection results are correct before taking destructive action (nuke)
  • Contextual notification composition: compose convoy completion summaries, escalation descriptions, and handoff notes

Cloud Architecture: Alarm + On-Demand Triage Agents

TownDO alarm handler (deterministic code)

The TownDO alarm runs on a configurable interval (15s when active, 5m when idle) and handles all mechanical patrol behaviors:

TownDO.alarm()
  ├── schedulePendingWork()         // existing: dispatch pending beads to polecats
  ├── processReviewQueue()          // existing: trigger refinery for ready MRs
  ├── witnessPatrol()               // enhanced: full mechanical witness behaviors
  │     ├── processProtocolMail()   // POLECAT_DONE → MERGE_READY → MERGED flow
  │     ├── detectZombies()         // agent_metadata.status vs container process state
  │     ├── detectGUPPViolations()  // hook set + stale updated_at
  │     ├── detectOrphanedWork()    // hook set + no container process
  │     ├── autoNukeCleanAgents()   // cleanup_status = 'clean' → nuke
  │     ├── checkTimerGates()       // created_at + timeout < now
  │     └── flagDirtyForTriage()    // dirty/ambiguous → queue for LLM triage
  ├── deaconPatrol()                // enhanced: full mechanical deacon behaviors
  │     ├── redispatchFailedBeads() // cooldown + retry counter + max attempts
  │     ├── detectStaleHooks()      // dead process + hooked bead
  │     ├── checkConvoyCompletion() // all tracked beads closed → land
  │     ├── feedStrandedConvoys()   // open beads with no assignee → auto-sling
  │     ├── evaluateGates()         // timer/condition gate checks
  │     └── reEscalateStale()       // unacknowledged escalations past threshold
  ├── containerHealthCheck()        // existing: ping container, restart if dead
  └── maybeDispatchTriageAgent()    // NEW: if dirty/ambiguous items queued, spawn triage
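
The interval switch (15s active / 5m idle) can be expressed as a small helper the alarm handler calls when rescheduling itself. This is a sketch in Cloudflare Durable Object style; `hasActiveWork` and the exact patrol method names are assumptions from the tree above:

```typescript
const ACTIVE_INTERVAL_MS = 15_000;       // 15s when the town has active work
const IDLE_INTERVAL_MS = 5 * 60 * 1000;  // 5m when idle

// Pick the next alarm delay based on whether any beads/agents are active.
function nextAlarmDelay(hasActiveWork: boolean): number {
  return hasActiveWork ? ACTIVE_INTERVAL_MS : IDLE_INTERVAL_MS;
}

// Inside the DO, the alarm would end each cycle by rescheduling itself,
// roughly (method names assumed from the tree above):
//
// async alarm() {
//   await this.witnessPatrol();
//   await this.deaconPatrol();
//   await this.containerHealthCheck();
//   await this.maybeDispatchTriageAgent();
//   await this.ctx.storage.setAlarm(Date.now() + nextAlarmDelay(this.hasActiveWork()));
// }
```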

On-demand triage agent (LLM, spawned when needed)

When the alarm's mechanical checks produce results that need reasoning, it queues them as triage request beads (type = 'triage_request') with structured context. When the queue is non-empty, the alarm dispatches a short-lived triage agent in the container:

// In TownDO alarm handler
const triageQueue = await this.listBeads({ type: 'triage_request', status: 'open' });
if (triageQueue.length > 0) {
  await this.dispatchTriageAgent(triageQueue);
}

The triage agent gets a focused system prompt:

You are a Gastown triage agent. You will be given a list of situations that 
require judgment. For each one, assess the situation and take one of the 
prescribed actions. Be decisive. When done, call gt_done.

Situations to assess:
1. [DIRTY_POLECAT] Agent "Toast" has uncommitted changes after completion.
   Git status: <output>
   Git diff --stat: <output>
   Options: COMMIT_AND_PUSH | DISCARD | ESCALATE

2. [STUCK_AGENT] Agent "Maple" has not made progress in 45 minutes.
   Last activity: <timestamp>
   Recent conversation tail: <last 20 lines>
   Options: NUDGE | RESTART | ESCALATE

3. [HELP_REQUEST] Agent "Shadow" sent HELP: "Can't resolve merge conflict in auth.ts"
   Context: <bead body>
   Options: PROVIDE_GUIDANCE | ESCALATE_TO_MAYOR

The triage agent processes each item, takes an action (via tool calls back to the TownDO), and exits. Session lifetime: seconds to minutes, not hours. LLM cost: proportional to actual ambiguity in the system, not to wall-clock uptime.
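
The numbered prompt above can be rendered mechanically from the queued triage beads. A sketch, with a simplified assumed bead shape:

```typescript
// Simplified, assumed projection of a triage_request bead.
interface TriageBead {
  triageType: string; // e.g. "stuck_agent"
  summary: string;    // one-line description of the situation
  context: string;    // structured context, pre-rendered as text
  options: string[];  // prescribed actions
}

// Render queued beads into the numbered "Situations to assess" list.
function buildTriagePrompt(beads: TriageBead[]): string {
  const header = "Situations to assess:";
  const items = beads.map((b, i) =>
    `${i + 1}. [${b.triageType.toUpperCase()}] ${b.summary}\n` +
    `   Context: ${b.context}\n` +
    `   Options: ${b.options.join(" | ")}`
  );
  return [header, ...items].join("\n");
}
```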

Triage request bead schema

Triage requests are beads (consistent with #441):

-- No separate table needed. Uses the universal beads table.
-- type = 'triage_request'
-- metadata JSON contains the structured context:
{
  "triage_type": "dirty_polecat",          -- or "stuck_agent", "help_request", "queue_health", "zombie_confirm"
  "agent_bead_id": "...",                  -- which agent this concerns
  "context": {                             -- type-specific context
    "git_status": "...",
    "git_diff_stat": "...",
    "cleanup_status": "has_uncommitted"
  },
  "options": ["COMMIT_AND_PUSH", "DISCARD", "ESCALATE"]
}
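
One possible TypeScript shape for that metadata JSON, with field names taken from the schema sketch (the union members and the validation helper are illustrative assumptions):

```typescript
type TriageType =
  | "dirty_polecat"
  | "stuck_agent"
  | "help_request"
  | "queue_health"
  | "zombie_confirm";

interface TriageRequestMetadata {
  triage_type: TriageType;
  agent_bead_id: string;            // which agent this concerns
  context: Record<string, unknown>; // type-specific context
  options: string[];                // prescribed actions for the LLM to pick from
}

// Sanity check before a request is handed to the triage agent:
// it must name an agent and offer at least one action.
function isValidTriageRequest(m: TriageRequestMetadata): boolean {
  return m.agent_bead_id.length > 0 && m.options.length > 0;
}
```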

Triage agent tools

The triage agent needs a narrow tool set (subset of the existing plugin):

  • gt_triage_resolve: resolve a triage request with a chosen action; the TownDO executes the action (nuke, restart, escalate, etc.)
  • gt_mail_send: send contextual guidance to a stuck agent
  • gt_escalate: forward to the Mayor or a human
  • gt_nudge: send a message to a running agent's session
  • gt_done: signal that the triage session is complete

When the alarm does NOT spawn a triage agent

Most patrol cycles will have zero ambiguous situations. The alarm runs, all checks pass (or produce clear mechanical outcomes like auto-nuke), and no triage agent is needed. The LLM is only invoked when the system encounters genuine uncertainty.

Expected frequency: triage agents spawn on <10% of alarm cycles in a healthy town. In a town with many stuck or failing agents, they'll spawn more often — which is correct, because that's when reasoning is most valuable.

What This Replaces

  • Go Daemon (3-min heartbeat) → TownDO alarm (15s active / 5m idle)
  • Boot agent (ephemeral AI triage per tick) → not needed; the TownDO alarm is already the external observer
  • Deacon (persistent AI patrol loop) → deaconPatrol() in the alarm handler (mechanical) + on-demand triage agent (intelligent)
  • Witness (persistent AI patrol loop per rig) → witnessPatrol() in the alarm handler (mechanical) + on-demand triage agent (intelligent)
  • Dogs (Deacon's helper agents) → not needed for Phase 2.5; narrow tasks are handled by the triage agent or alarm code

Why the watchdog chain simplifies

Local Gastown needs Boot→Deacon→Witness because "a hung Deacon can't detect it's hung" — Boot provides an external observer. In the cloud, DO alarms are the external observer. They're durable (re-fire after eviction), managed by the Cloudflare runtime (not by user code that can hang), and independent of the container. If the container dies, the alarm still fires and detects dead agents. The three-tier watchdog chain collapses to: DO alarm (always fires) → mechanical checks → triage agent (when needed).

One risk: a logic bug in the alarm handler could silently break the town. Mitigation: a Cron Trigger that pings each active town's health endpoint independently of the DO alarm, providing an external watchdog analogous to Boot.

Implementation Plan

Step 1: Expand alarm handler with full mechanical patrol

Move all mechanical witness/deacon behaviors into the TownDO alarm. The current witnessPatrol() already does basic stale-agent checks — expand it to cover the full protocol mail flow (POLECAT_DONE → MERGE_READY → MERGED), zombie detection, GUPP violations, orphaned work, convoy completion, stale hook cleanup, gate evaluation, and escalation re-escalation.

Step 2: Triage request queue

When mechanical checks produce ambiguous results (dirty polecat, stuck agent, help request), create triage request beads instead of taking immediate action.

Step 3: Triage agent dispatch

When triage requests are queued, the alarm dispatches a short-lived triage agent session in the container with the focused prompt and narrow tool set. The agent processes all pending requests and exits.

Step 4: Triage agent tools

Add gt_triage_resolve tool to the plugin. This tool takes a triage request bead ID and a chosen action, and the TownDO executes the action (nuke agent, restart agent, send mail, escalate, etc.).
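
A minimal sketch of the TownDO side of gt_triage_resolve. The action names come from the triage options above; the `TownActions` interface and its method names are assumptions about the TownDO's internal API:

```typescript
type TriageAction = "COMMIT_AND_PUSH" | "DISCARD" | "ESCALATE" | "NUDGE" | "RESTART";

// Hypothetical slice of the TownDO's internal operations.
interface TownActions {
  nukeAgent(agentBeadId: string): void;
  restartAgent(agentBeadId: string): void;
  escalate(agentBeadId: string): void;
}

// Map the triage agent's chosen action onto a deterministic TownDO operation.
function resolveTriage(town: TownActions, agentBeadId: string, action: TriageAction): string {
  switch (action) {
    case "DISCARD":
      town.nukeAgent(agentBeadId);
      return "nuked";
    case "RESTART":
      town.restartAgent(agentBeadId);
      return "restarted";
    case "ESCALATE":
      town.escalate(agentBeadId);
      return "escalated";
    default:
      return "unhandled"; // e.g. COMMIT_AND_PUSH runs through the container instead
  }
}
```

The key property: the LLM only picks the action; the destructive work stays in deterministic TownDO code.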

Step 5: External health watchdog

Add a Cron Trigger (or separate DO with its own alarm) that periodically verifies each active town's alarm is firing and its container is responsive. This replaces Boot's role as the external observer.
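
The watchdog's core check is again deterministic. A sketch assuming each town's health endpoint reports when its alarm last fired; the staleness threshold and field names are assumptions:

```typescript
// Assumed per-town health report, e.g. fetched by the Cron Trigger worker.
interface TownHealth {
  name: string;
  lastAlarmAt: number; // epoch ms, as reported by the town's health endpoint
}

// The alarm should fire at least every 5m; allow slack before alerting.
const MAX_ALARM_AGE_MS = 10 * 60 * 1000;

// Return the towns whose alarms appear to have stopped firing, so the
// watchdog can re-arm them or page a human.
function findStaleTowns(towns: TownHealth[], now: number): string[] {
  return towns
    .filter((t) => now - t.lastAlarmAt > MAX_ALARM_AGE_MS)
    .map((t) => t.name);
}
```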

Acceptance Criteria

  • All mechanical witness behaviors run in the TownDO alarm without LLM involvement
  • All mechanical deacon behaviors run in the TownDO alarm without LLM involvement
  • Ambiguous situations produce triage request beads with structured context
  • Triage agent is dispatched only when triage requests are queued
  • Triage agent has a focused prompt and narrow tool set
  • Triage agent processes all pending requests and exits (no persistent session)
  • External health watchdog exists independent of the TownDO alarm
  • No persistent Witness or Deacon LLM sessions running continuously
