diff --git a/docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md b/docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md
new file mode 100644
index 000000000..b23abbe5a
--- /dev/null
+++ b/docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md
@@ -0,0 +1,168 @@
+---
+date: 2026-05-07
+topic: erin-phase-isolation
+---
+
+# Erin Phase Isolation: Subagent Wrapping for Long-Running Phases
+
+## Problem Frame
+
+Two distinct pains drive this work, and conflating them obscures the right design:
+
+1. **Context bloat.** When Erin runs a long phase (`/ce:work` especially) as an in-thread skill invocation, every tool call, every file read, every reviewer panel output accumulates in the main conversation. Long features burn through the context window before Erin even reaches `compound`. Jeff has to manually `/compact` mid-workflow.
+
+2. **Memory loss across `/compact`.** When Jeff does `/compact` mid-workflow, Erin loses the thread of which phase is current, what decisions were made, what the user just said. He copy-pastes a re-priming reminder to keep going.
+
+These are different problems with different fixes. Pain #2 is solved by Erin writing a small `run-state.md` after each phase and reading it on resume — no subagent isolation needed. Pain #1 is the part that actually requires dispatching long phases out of the main thread so their tool churn doesn't accumulate.
+
+Two rounds of document review tightened the design. The first pass cut a "Tina persona" framing, generic primitive in `ce-run`, JSONL event log, and 4-of-5 wrapped phases. The second pass surfaced epistemic gaps: spike thresholds were unfalsifiable, "observable evidence" was self-reported by the entity being checked, and the `needs-input` round-trip path was destructive (not just wasteful) for a phase like `/ce:work` that already mutates filesystem and git state. v1 now ships: a spike with numeric pass/fail thresholds, the `work` phase wrapped, Erin-verified evidence via direct `git diff` checks, and a hard halt on any subagent that needs user input mid-phase (no dialogue-relay protocol; the user re-runs `/ce:run erin` with the answer in args).
+
+## User Flow
+
+```mermaid
+flowchart TB
+    Jeff[Jeff main session] -->|/ce:run erin feature| Erin[Erin Opus, main thread]
+    Erin -->|brainstorm + plan in-thread| Plan[Plan doc]
+    Erin -->|writes run-state.md PRE-dispatch| State[(run-state.md)]
+    Erin -->|wrapped: true, Agent dispatch| Sub[Work subagent<br/>Opus, isolated context]
+    Sub -->|streams to Jeff terminal| Jeff
+    Sub -->|writes handoff + artifacts| Disk[(handoff + plan + commits)]
+    Sub -->|compact return ≤200 words| Erin
+    Erin -->|git diff --stat verification| GitCheck{Evidence<br/>matches?}
+    GitCheck -->|yes| Erin2[Erin updates run-state POST-return]
+    GitCheck -->|no| Retry[Re-spawn or surface]
+    Erin2 -->|review + compound in-thread| Done[Done]
+
+    JeffCompact[Jeff /compact mid-run] -.->|context reset| Erin
+    Erin -.->|reads run-state + scans handoffs/ for any newer| State
+```
+
+## Requirements
+
+**Phase 0 — Platform Spike (v0 prerequisite, hard gate)**
+
+- R1. Before any other implementation, run a measurement spike that dispatches an Opus subagent which itself dispatches two parallel Sonnet subagents. The spike MUST capture, with explicit numeric pass/fail thresholds:
+  - **(a) Main-thread token accumulation.** Measure parent-context token growth attributable to the wrapped phase. **Pass:** parent grows by less than 20% of the total tokens consumed by the leaf subagents (i.e., isolation captures >80%). **Fail:** parent grows by more than 50% of leaf token cost.
+  - **(b) Streaming fidelity.** Capture what surfaces in the user's terminal during the wrapped phase. **Pass:** at minimum, each leaf tool call's name and abbreviated input is visible in real time. **Fail:** terminal shows nothing until subagent returns ("black box for N minutes" UX).
+  - **(c) Reliability.** Run the dispatch at least 5 times in succession. **Pass:** all 5 succeed end-to-end. **Fail:** any run fails outright; investigate before proceeding.
+  - **(d) Per-phase baseline.** As part of the spike, also capture per-phase token shares from the most recent 3 real `/ce:run erin` workflows on this codebase. This produces the baseline against which v1's success criterion (≥30% main-thread reduction) is measured AND validates the assumption that `work` is the largest phase.
+- R2. The spike's findings document MUST live at `docs/solutions/2026-05-XX-agent-tool-depth-2-spike.md` and include the four measurements above with pass/fail status against the thresholds in R1. The rest of v1 is gated on (a), (b), and (c) all passing. (d) informs v1 scope but is not a pass/fail gate.
+- R3. If any of (a)–(c) fail at the thresholds in R1, the rest of v1 is redesigned (or abandoned). If (d) shows `work` is not the largest source of bloat, v1 scope re-considers which phase to wrap first before proceeding.
+
+**Phase Wrapping Primitive (post-spike)**
+
+- R4. Erin's orchestrator file (`erin.md` in `ce-reviewers-jsl`) MUST support a `wrapped: true` flag on individual phase entries.
+- R5. When a phase has `wrapped: true`, Erin dispatches that phase via the `Agent` tool to a fresh Opus subagent. The subagent runs the phase's `skill:` (e.g., `ce:work`) in its isolated context, persists artifacts to disk, and returns a compact summary.
+- R6. The wrapping logic lives in Erin's behavior (in `erin.md` itself), NOT as a generic primitive in `ce-run`. `ce-run` is unchanged. When orchestrator #2 needs the same pattern, the abstraction extracts itself.
+- R7. Phase args (`$ARGUMENTS`, `$PLAN_PATH`) MUST pass through to the subagent prompt unchanged.
+- R8. Subagent prompt scope: in v1, the wrapped subagent receives the skill invocation + args ONLY. It does NOT receive Erin's review-preferences, persona definition, or other orchestrator-level prose. (Moot in v1 because review isn't wrapped; revisit when v2 wraps `review` or `plan-review`.)
+- R9. Model resolution: subagent runs on Opus by default (judgment-bearing). Workers it spawns follow per-skill defaults (typically Sonnet for reviewers, Haiku for research).
+
+**v1 Wrapped Phases (intentionally narrow)**
+
+- R10. v1 wraps **only the `work` phase**. This is the single largest source of context bloat in the working hypothesis (validated in spike R1(d)). Other phases stay in-thread.
+- R11. Phases NOT wrapped in v1: `brainstorm`, `plan`, `plan-review`, `user-plan-review`, `review`, `user-scenarios` (all stages), `everyday-usability`, `todo-resolve`, `test-browser`, `feature-video`, `compound`. Some are interactive; some are short enough that wrapping adds ceremony beyond savings.
+- R12. v2 candidates for wrapping: `review` and `plan-review`. Adding them depends on (a) the spike confirming depth-2 streaming and (b) v1's `work` wrap demonstrating real context savings on a real workflow.
+
+**Phase Handoff Contract**
+
+- R13. The wrapped phase's subagent MUST produce a handoff document at `docs/plans/<plan-filename-stem>/handoffs/<phase>.md`, where `<plan-filename-stem>` is the plan's filename minus the `.md` extension. (Stable `plan_id` in plan frontmatter is deferred to a future iteration if rename collisions become a real problem in practice.)
+- R14. Handoff frontmatter required fields: `phase`, `plan_filename`, `started`, `completed`, `status`, `claimed_files_modified`, `claimed_lines_changed_delta`. The last two are the subagent's *self-reported* numbers; Erin verifies them independently (R18).
+- R15. Handoff body required sections: `## Outcome` (one sentence), `## Artifacts` (paths, commit SHAs), `## Recommended Next Phase Action`. Optional: `## Open Questions`. If the subagent has anything to flag for Erin's between-phase judgment, include `## Judgment Calls For Erin`; omit the section entirely when nothing applies (no stub `None.`).
+- R16. `status` enum: `success | partial | failed | needs-input`. `partial` means the phase made forward progress but couldn't complete (e.g., test failures the subagent couldn't resolve). `needs-input` means the underlying skill required user dialogue — see R23.
+- R17. The subagent's return value to Erin MUST be a compact summary (≤200 words) containing: `status`, claimed evidence numbers, one-sentence outcome, judgment-call count (if any), recommended next action, handoff doc path.
+
+**Erin-Verified Evidence (Independent Ground Truth)**
+
+- R18. Before dispatching a wrapped phase, Erin captures a pre-dispatch snapshot: the current git HEAD SHA. After the subagent returns, Erin runs `git diff --stat <pre-sha>..HEAD` and `git log <pre-sha>..HEAD --oneline` in the main thread to obtain *independent* counts of files modified and lines changed. These are the **verified** numbers; the subagent's `claimed_*` fields are claims to compare against.
+- R19. Sanity-check rule: if the subagent returned `success` OR `partial` AND the verified evidence shows zero files modified AND zero lines changed, treat the result as suspicious. Trigger a re-spawn (R20) on first occurrence; surface to user on second.
+- R20. If the subagent's `claimed_*` fields disagree with verified counts by more than a tolerance band (e.g., claimed says 50 files, verified says 5), this is also a sanity-check failure — the subagent is hallucinating progress. Trigger re-spawn with corrective prompt naming the discrepancy.
+
+**Failure Recovery**
+
+- R21. On detectable subagent failure (Agent tool error, malformed handoff frontmatter, missing required artifact paths, evidence sanity-check failure per R19/R20), Erin re-spawns once with a corrective prompt that includes the prior return value verbatim plus the specific complaint.
+- R22. If the second attempt also fails, Erin surfaces to Jeff with: phase name, both return values, the specific complaint, and four options — retry, fall back to in-thread skill execution, edit the handoff manually, or abandon.
+- R23. **`needs-input` is a hard halt in v1, not a round-trip.** When the subagent encounters a question that requires user dialogue, it captures the question(s) in `## Judgment Calls For Erin` and returns `status: needs-input`. Erin reads the questions, surfaces them to Jeff, and stops. Jeff resolves the questions and re-invokes `/ce:run erin <args-with-answers-folded-in>` for a fresh workflow. **There is no in-flight resume.** This is a deliberate non-goal: re-spawning a partially-completed `/ce:work` is destructive (duplicate commits, conflicting state, silent skips), and supporting safe resume requires a skill-level redesign that's out of v1 scope.
+
+**Erin Disagrees Response**
+
+- R24. After reading a successful handoff and verifying evidence, Erin may judge the subagent's `Recommended Next Phase Action` differently. Three responses are available — Erin uses judgment, not a protocol, to choose:
+  - **Override:** Erin proceeds with a different next-phase decision based on her between-phase view.
+  - **Re-dispatch:** Erin re-spawns the subagent with a corrective prompt naming the specific judgment Erin disagrees with.
+  - **Escalate:** Erin surfaces the disagreement to Jeff with both views.
+- R25. Run-state.md (R26) records every phase transition regardless of whether Erin agreed or overrode. This avoids the asymmetric-logging pathology (logging only divergence biases toward override).
+
+**Run-State for `/compact` Survival**
+
+- R26. Erin MUST maintain `docs/plans/<plan-filename-stem>/run-state.md` as the durable workflow thread. The file's content includes: phases complete, current phase, key decisions made by Erin, recent user input, recommended next action.
+- R27. Write ordering for wrapped phases:
+  - **Pre-dispatch:** Erin updates run-state.md with `current_phase: <name>`, `current_phase_status: dispatched`, `pre_dispatch_sha: <git HEAD>`, then dispatches the subagent.
+  - **Post-return:** Erin updates run-state.md with `current_phase_status: completed` (or `failed`/`needs-input`), records the verified evidence, and notes the next action.
+  - The pre-dispatch write ensures that a `/compact` between dispatch and return doesn't leave run-state.md silently stale.
+- R28. On `/ce:run erin` resume after `/compact`: Erin's first action is to (a) read `run-state.md`, (b) scan `docs/plans/<plan-filename-stem>/handoffs/` for any handoff file with mtime newer than run-state.md's last write — if found, the handoff returned during a `/compact` window and Erin reconciles it before continuing.
+- R29. For pre-plan phases (brainstorm, plan-creation), there is no plan-filename-stem yet. Run-state during these phases lives at `docs/runs/<YYYY-MM-DD-HH-MM-SS>-<topic-slug>/run-state.md`. Once a plan file exists, Erin migrates run-state.md into the plan's directory and links forward from the run-state location.
+
+**Explicit Non-Goals (v1 scope)**
+
+- R30. v1 does not build a dialogue-relay protocol for wrapped phases. Interactive phases stay in-thread; wrapped phases that hit `needs-input` halt and require a fresh `/ce:run`. This may be revisited if real-world usage shows `needs-input` halts are common AND a safe resume mechanism is feasible.
+- R31. v1 does not support parallel wrapped phases. The wrapped subagent itself may spawn parallel reviewer/worker subagents (`/ce:work` already does), but Erin doesn't run multiple wrapped phases concurrently.
+- R32. v1 does not thread persistence across separate `/ce:run` invocations. `run-state.md` covers within a single workflow only.
+
+## Success Criteria
+
+- **Spike clears the gate.** R1 passes (a)/(b)/(c) at the numeric thresholds. R1(d) baseline data is captured and confirms `work` is the right phase to wrap first; if not, v1 scope adjusts.
+- **Measurable context savings.** Running `/ce:run erin <a representative feature>` with `work` wrapped consumes meaningfully less main-thread context than the same workflow today. Concrete target: ≥30% reduction in main-thread tokens at the point of entering `compound`, measured against the baseline captured in R1(d). If R1(d) shows `work` is, e.g., 70% of main-thread tokens today, the actual savings should approach that number minus subagent-summary overhead.
+- **No regressions.** Existing orchestrators behave exactly as today. Only `erin.md` changes (and only the `work` phase entry).
+- **`/compact` survives mid-run.** Jeff can `/compact` after any phase completes; Erin reads `run-state.md` on resume, scans for newer handoffs, and continues without losing thread. Pre-dispatch write ordering ensures the `/compact`-between-dispatch-and-return window doesn't corrupt state.
+- **Sanity checks fire on synthetic tests.** Test cases: (a) subagent returns `success` with verified zero files modified → re-spawn fires; (b) subagent's `claimed_files_modified` substantially exceeds verified count → re-spawn fires; (c) `partial` with zero verified evidence → suspicious-success path triggers.
+- **Failure recovery exercised.** Test cases for malformed handoff (R21), double failure (R22), and `needs-input` halt (R23) all behave as specified — particularly R23: the workflow stops cleanly and Jeff sees the captured questions.
+
+## Scope Boundaries
+
+- **Out of scope: a `tina.md` persona file.** v1 treats wrapping as infrastructure. The wrapped subagent is "the work-phase subagent," not a named character.
+- **Out of scope: cross-repo persona work.** v1 only changes `erin.md` (in `ce-reviewers-jsl`). No new files in either repo.
+- **Out of scope: generic `wrapped: <persona>` primitive in `ce-run`.** Erin handles dispatch herself.
+- **Out of scope: JSONL event log.** Foreground subagent terminal streaming is sufficient visibility for v1.
+- **Out of scope: wrapping more than `work` in v1.** `review`, `plan-review` are v2 contingent on v1 evidence.
+- **Out of scope: dialogue-relay for `needs-input`.** See R23 / R30.
+- **Out of scope: stable `plan_id` in plan frontmatter.** v1 uses filename-stem; introduce when rename collisions are observed in practice.
+- **Out of scope: parallel wrapped phases.** See R31.
+- **Out of scope: cross-`/ce:run` continuity.** See R32.
+
+## Key Decisions
+
+- **Spike first with numeric thresholds.** Document review surfaced that the prior spike spec was a vibes-check ("token delta," "reliable"). v1 specifies pass/fail numbers per axis (R1) so the gate is testable, not arguable.
+- **Erin-verified evidence via `git diff`, not subagent self-report.** The first sanity-check spec relied on the subagent reporting its own work counts — same entity as the one reporting `success`, so a hallucinating subagent fools both checks identically. v1 has Erin run `git diff --stat` against a pre-dispatch snapshot in the main thread for independent ground truth.
+- **`needs-input` is a hard halt in v1 (C-forbid path).** Re-running `/ce:work` from scratch with augmented args is destructive (duplicate commits, conflicts, silent skips). Building safe resume is a skill-level redesign out of v1 scope. v1 halts cleanly; Jeff re-runs.
+- **Erin-specific dispatch, not a generic primitive.** Reversed under document review (premature framework). Extracts to `ce-run` later when orchestrator #2 has stated need.
+- **Wrap one phase in v1, not five.** `work` is the working hypothesis for largest bloat source; spike R1(d) validates. `review`, `plan-review` are v2.
+- **Infrastructure, not persona.** `wrapped: true`, no `tina.md`. The wrapper has no judgment; it's plumbing.
+- **Handoff includes claimed evidence; Erin verifies independently.** Two-source check catches subagent hallucination, not just honest-zero-work.
+- **`run-state.md` separate from `phase-handoff.md`.** Different readers, different writers, different lifecycles. Pre-dispatch write ordering prevents the `/compact`-corruption window.
+- **Erin's three responses to disagreement are judgment, not protocol.** Original draft over-formalized this with audit logging; review flagged the asymmetry pathology. v1: Erin uses judgment; run-state logs all phase transitions equally.
+- **Filename-stem for handoff dir, not stable `plan_id`.** Premature infrastructure for v1; legacy fallback already covers what `plan_id` was meant to fix.
+- **Drop `tool_calls_count` from handoff fields.** Redundant for the v1 sanity check.
+
+## Dependencies / Assumptions
+
+- Spike (R1) confirms depth-2 Agent dispatch behavior. **Hard prerequisite — failure triggers redesign.**
+- The `Agent` tool reliably passes a custom system prompt / persona definition to subagents. (Already used at depth-1 by reviewer panels; spike validates depth-2.)
+- Erin can shell out to `git diff --stat` and `git log` in the main thread for evidence verification. (Bash tool is available; this is essentially free.)
+- Plans created by `/ce:plan` have stable filenames during a single `/ce:run` workflow. (Today's convention; rename mid-workflow is rare and out of v1 scope.)
+
+## Outstanding Questions
+
+### Resolve Before Planning
+*(none — strategic decisions are locked)*
+
+### Deferred to Planning
+
+- [Affects R1][Needs research] Spike implementation form: ad-hoc Bash invocation, or written-to-disk regression test we keep around? Recommend the latter — durable artifact, runnable on every Claude Code platform update.
+- [Affects R20][Technical] Tolerance band for evidence-discrepancy sanity check. "Substantially exceeds" needs a number. 2× ratio? Absolute delta? Lock during planning.
+- [Affects R8, R9][Technical] If/when v2 wraps `review`, the subagent will need access to Erin's review-preferences (named-persona team composition). Plan for this explicitly when v2 lands; v1 ducks it because review isn't wrapped.
+- [Affects R29][Technical] Migration of run-state when a plan file is created mid-workflow (brainstorm → plan transition). Linking forward + leaving a pointer at the original location is the obvious shape; lock specifics during planning.
+- [Affects R27][Technical] Atomic-write semantics for run-state.md. If Erin crashes between writes, the file shouldn't be left half-written. Probably write-to-temp + rename. Lock during planning.
+
+## Next Steps
+
+→ `/ce:plan` for structured implementation planning. The plan MUST schedule R1 (the spike) as the first implementation unit, with a hard gate that the rest of the plan only executes if R1's pass/fail thresholds are met. The plan MUST also explicitly include capturing R1(d) baseline data before declaring v1 success criteria measurable.
diff --git a/docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md b/docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md
new file mode 100644
index 000000000..94e619ec3
--- /dev/null
+++ b/docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md
@@ -0,0 +1,348 @@
+---
+title: "feat(erin): subagent wrapping for the work phase"
+type: feat
+status: active
+date: 2026-05-07
+origin: docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md
+spike: docs/solutions/2026-05-07-agent-tool-depth-2-spike.md
+---
+
+> **🔬 Unit 1 spike result (2026-05-07):** Direct depth-2 Agent dispatch is NOT supported by the platform — subagents lack the `Agent` tool. Two workarounds verified: (A) constrain `/ce:work-wrapped` to inline-only execution with no Agent-dispatching sub-skills, sufficient for v1; (B) use `claude -p` subprocess to bypass the constraint when nested dispatch is genuinely needed (works, ~16s overhead per subprocess, streaming fidelity regression). v1 proceeds with Workaround A. See [findings](../solutions/2026-05-07-agent-tool-depth-2-spike.md). Units 2–4 below are revised to reflect Workaround A.
+
+# Erin Phase Isolation: Subagent Wrapping for the Work Phase
+
+**Target repos:**
+- `ce-reviewers-jsl` — `orchestrators/erin.md`
+- `compound-engineering-plugin` — `ce-run` skill (small hook), spike findings, dogfood notes
+
+## Overview
+
+Wrap Erin's `work` phase as an `Agent`-tool subagent so its tool churn runs in an isolated context instead of the main thread. Add a small `run-state.md` Erin maintains so workflows survive `/compact` mid-flow. Test it by running `/ce:run erin` on a real feature; let dogfood evidence drive what comes next.
+
+This plan was rewritten substantially after Jason/Charles/Marty/Melissa/Sandy plan-review converged on "the previous plan is overbuilt — ship the cupcake, eat it, let usage tell you which layer to add next." (see origin: `docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md`)
+
+## Problem Frame
+
+Two distinct pains drive this work, and the prior version of this plan re-conflated them:
+
+1. **Context bloat.** `/ce:work` runs in-thread; its tool churn fills the main conversation context, forcing manual `/compact` mid-workflow.
+2. **Memory loss across `/compact`.** When Jeff `/compact`s, Erin loses the workflow thread.
+
+Pain #2 is solved by **a single small file Erin reads on resume** — no subagent infrastructure. Pain #1 is solved by **dispatching `/ce:work` as a subagent**. v1 ships both, but treats them as independent mechanisms with the smallest possible coupling.
+
+The success signal is qualitative: did Jeff complete a representative real workflow with `/ce:run erin` *without* manually `/compact`-ing mid-flow, AND if he chose to `/compact`, did Erin resume cleanly? Token-percentage targets were reviewer-flagged as output metrics in disguise; the real outcome is Jeff's flow state.
+
+## Requirements Trace
+
+This plan satisfies the load-bearing requirements from the origin document. Many requirements (R30–R32 explicit non-goals, several deferred-to-planning items) are now covered by *omission* rather than positive design — the simplification was the point.
+
+- **R1, R2, R3** (platform spike, hard gate): Unit 1 — qualitative gate, not numeric thresholds (per plan-review)
+- **R4–R9** (`wrapped: true` flag, dispatch, args, prompt scope, model resolution): Units 2 + 3
+- **R10–R12** (only `work` wrapped in v1): Unit 3
+- **R13–R17** (handoff contract): inlined into Unit 3 prose, not extracted as a spec — minimal schema
+- **R18, R19** (Erin verifies via git diff, empty-success check): Unit 3 — keep these; drop floor/ratio/calibration as premature
+- **R20** (discrepancy ratio): **DROPPED from v1.** Reviewer consensus: ship just the empty-success check; if false positives bite, add tiers. Re-add on evidence, not anticipation.
+- **R21** (re-spawn once with corrective prompt): Unit 3
+- **R22** (surface to user on second failure): Unit 3
+- **R23** (needs-input is hard halt): Unit 3
+- **R24, R25** (Erin disagrees / override / re-dispatch / escalate): **DROPPED as a formal protocol.** Erin uses judgment as she always has. Re-add if a real disagreement loop appears.
+- **R26–R28** (run-state for `/compact` survival): Unit 3
+- **R29** (pre-plan run-state migration): **DROPPED.** Run-state stays at `docs/runs/<run-id>/` for the entire workflow; plan file gets a single frontmatter line forward-linking to the run dir if desired. No migration, no `MOVED.md`, no recovery branch.
+- **R30–R32** (explicit non-goals): preserved as omissions in Erin's prose
+- **Synthetic test scenarios**: **DROPPED.** Replaced by Unit 4 (one real-feature dogfood). The mock fixture was reviewer-flagged as "fiction validating fiction."
+
+## Scope Boundaries
+
+- **No `tina.md` persona file.** Wrapping is infrastructure.
+- **No contract spec document.** The handoff format is described inline in Erin's prose. If a second consumer needs the contract documented separately, extract then.
+- **No FLOOR / ratio / calibration logic.** Just empty-success check.
+- **No discrepancy claimed-vs-verified machinery.** Drop `claimed_*` fields from the handoff entirely; `git diff` is ground truth, no need to police a self-report Erin can verify directly.
+- **No mock fixture, no scenario test suite.** Dogfood validates v1.
+- **No atomic write-temp-rename ceremony.** Erin uses the Write tool; mtime reconciliation is the recovery mechanism.
+- **No pre-plan migration.** Run-state lives in `docs/runs/<run-id>/` for the workflow's lifetime.
+- **No plan-rename detection.** Acknowledged as rare; surface naturally if it happens.
+- **No formal Erin-disagrees protocol.** Just judgment.
+- **No bounded re-dispatch counts.** Trust Erin; revisit if loops appear.
+- **No JSONL event log.** Foreground subagent terminal streaming is the visibility channel.
+- **No dialogue-relay for needs-input.** Hard halt; user re-runs `/ce:run erin` with answers in args.
+- **No generic `wrapped: <persona>` primitive in `ce-run`.** Just a small `wrapped: true` recognition hook.
+- **Only the `work` phase wrapped in v1.** `review`, `plan-review`, `everyday-usability` are reconsidered after dogfood evidence.
+- **Wrapped `/ce:work` is pinned to Inline execution strategy.** Per Unit 1 spike: subagents lack the `Agent` tool, so `/ce:work`'s Serial-subagent / Parallel-subagent / Swarm strategies cannot run inside the wrapped subagent. Erin's dispatch prompt explicitly forces Inline strategy and forbids nested skill invocations that would dispatch (e.g., `/ce:review`). Plans that genuinely need internal fan-out are out of scope for v1; if the need appears, escalate to Workaround B per the spike findings.
+
+## Context & Research
+
+### Relevant Code and Patterns
+
+- `ce-reviewers-jsl/orchestrators/erin.md` — confirmed exists; gains `wrapped: true` on the work phase + a new behavior section.
+- `compound-engineering-plugin/plugins/compound-engineering/skills/ce-run/SKILL.md` — confirmed exists; gains a small wrapped-phase hook.
+- Existing reviewer subagent dispatches in `/ce:review` — depth-1 pattern reference.
+- Other orchestrator files (`angie.md`, `lfg.md`, `max.md`) — prose style reference.
+
+### Institutional Learnings
+
+- CLAUDE.md flags subagent-spawning-subagents as a known constraint for *orchestrators*. This plan dispatches the `/ce:work` *skill* at depth-2, not Erin herself. Spike (Unit 1) verifies depth-2 skill dispatch.
+- The `orchestrating-swarms` skill exists for a different shape (TeammateTool, persistent inboxes); not applicable.
+
+## Key Technical Decisions
+
+- **Ship the cupcake.** v1 is: dispatch `/ce:work` via Agent + write a tiny run-state file + read it on resume + verify with `git diff`. That's it. Every defensive subsystem the prior plan added (FLOOR, ratio rule, mock fixture, atomic ceremony, plan-rename detection, three-option Erin-disagrees protocol) was reviewer-flagged as defending against problems that haven't happened. Cut to minimum; add tiers on evidence.
+- **Run-state and wrapping are independent mechanisms.** Pain #2 (`/compact` survival) is fully solved by run-state.md alone. Pain #1 (context bloat) is solved by wrapping. They ship together because the work touches one file (Erin's persona), but they don't depend on each other functionally — if the spike fails, run-state still ships and is useful.
+- **Qualitative spike gate, not numeric thresholds.** Per Charles + Jason + Marty: the platform doesn't expose per-subagent token cost as a public API. Inventing 20%/50% thresholds against an undefined primitive is fighting the constraint. Real gate: 5 successful depth-2 dispatches, terminal streaming visible, AND the dogfood run feels meaningfully less context-cramped than today. The latter is Jeff's call.
+- **Drop `claimed_*` from the handoff.** Per Charles: if `git diff` is truth, the subagent shouldn't be asked to claim. Erin reads `git diff --stat <pre-sha>` (no `..HEAD`, includes uncommitted) for verified evidence. The handoff carries outcome prose and artifacts only. This eliminates the entire discrepancy-detection subsystem.
+- **Just the empty-success check, no tiered sanity.** If `status: success` but `git diff` shows zero files and zero lines → re-spawn once with corrective prompt; surface on second failure. Floor / ratio rules added on evidence, not anticipation.
+- **`ce-run` gets a small recognition hook.** One paragraph in ce-run's phase loop: "if `wrapped: true`, follow the orchestrator's wrapped-phase behavior instead of in-thread skill invocation." The orchestrator (Erin) owns dispatch parameters and post-return logic. Hook is a deterministic executor branch, not a generic primitive.
+- **`needs-input` is a hard halt.** Subagent surfaces the question via the handoff; Erin halts; user re-runs `/ce:run erin` with answers in args. No in-flight resume; preserves /ce:work idempotency.
+- **Run-state in `docs/runs/<run-id>/` for the entire workflow.** No migration to plan dir. The plan file may include a `run_dir:` frontmatter line forward-linking, but that's optional decoration. One location, no branching state-handling.
+- **Use `git diff --stat <pre-sha>` (no `..HEAD`).** Includes working-tree changes; legitimate `/ce:work` runs may stage but not commit.
+- **No ceremony around atomic writes.** Erin uses the Write tool; on a partial-write crash, the resume protocol reads run-state and reconciles against handoff mtime. The recovery is mtime-based, not write-protocol-based.
+- **Trust Erin's between-phase judgment.** Drop the formal override/re-dispatch/escalate protocol. If real disagreement loops appear in dogfood, add a bound; otherwise leave Erin's prose unencumbered.
+
+## Open Questions
+
+### Resolved During Planning
+
+- **Spike form:** qualitative observable gate (5 successful dispatches + streaming visible + dogfood-feel). No numeric thresholds.
+- **Discrepancy strategy:** empty-success only in v1; tiered rules added on evidence.
+- **Atomic write strategy:** none required; mtime reconciliation is recovery.
+- **Run-state location across plan transition:** stays in `docs/runs/<run-id>/`; no migration.
+- **Erin disagrees protocol:** none; judgment as today.
+- **`ce-run` change:** small recognition hook, one paragraph.
+
+### Deferred to Implementation
+
+- **Handoff frontmatter exact fields.** Probably just `phase`, `started`, `completed`, `status`. Locked during Unit 3.
+- **Run-state frontmatter exact fields.** Probably `current_phase`, `current_phase_status`, `pre_dispatch_sha` (when applicable), `last_updated`. Locked during Unit 3.
+- **Spike harness form.** Bash if non-interactive Claude Code supports depth-2 dispatch reliably; manual reproducible scenarios otherwise. Decided in Unit 1.
+- **Corrective-prompt template for re-spawn.** Drafted during Unit 3, refined post-dogfood if needed.
+
+## High-Level Technical Design
+
+> *Directional guidance for review, not implementation specification.*
+
+**Erin's wrapped-phase behavior (Unit 3 prose):**
+
+```
+on entering a wrapped phase (ce-run yields control via the hook):
+  1. write run-state.md: current_phase=<name>, current_phase_status=about_to_dispatch
+  2. capture pre_dispatch_sha = git rev-parse HEAD
+  3. update run-state.md: current_phase_status=dispatched, pre_dispatch_sha=<sha>
+  4. dispatch via Agent tool (model: opus, prompt: skill invocation + args)
+
+on subagent return:
+  5. read handoff at docs/runs/<run-id>/handoffs/<phase>.md (or docs/plans/...
+     handoffs/ if a plan dir is preferred — locked in Unit 3)
+  6. parse(git diff --stat <pre_dispatch_sha>) → verified files, verified lines
+  7. empty-success check: if status==success AND verified files=0 AND verified lines=0
+        → suspicious; re-spawn once with corrective prompt
+        → if second attempt also empty: surface to Jeff with options
+  8. if status==needs-input: halt, surface questions, no resume
+  9. else: update run-state.md with verified evidence + outcome
+  10. proceed (Erin's judgment as today; no formal disagreement protocol)
+
+on /ce:run erin resume after /compact:
+  i. read run-state.md
+  ii. ls handoffs/ for any handoff with mtime > run-state.md mtime → reconcile
+  iii. proceed from current_phase + current_phase_status
+```
+
+**Run-state.md (minimal):**
+
+```markdown
+---
+run_id: 2026-05-07-14-23-00-erin-foo
+current_phase: work
+current_phase_status: completed
+pre_dispatch_sha: abc123  # only when wrapped phase is in flight or recently completed
+last_updated: 2026-05-07T14:48:30Z
+---
+
+## Workflow Trace
+- 2026-05-07T13:00Z brainstorm: completed
+- 2026-05-07T13:45Z plan: completed
+- 2026-05-07T14:23Z work: dispatched (wrapped, sha abc123)
+- 2026-05-07T14:48Z work: completed (verified 7 files / 142 lines)
+
+## Recommended Next Action
+[One sentence — what comes next.]
+```
+
+**Handoff.md (minimal — no claimed_*):**
+
+```markdown
+---
+phase: work
+started: 2026-05-07T14:23Z
+completed: 2026-05-07T14:48Z
+status: success
+---
+
+## Outcome
+[One sentence.]
+
+## Artifacts
+- Plan: docs/plans/...
+- Commits: <sha1> <sha2>
+
+## Recommended Next Phase Action
+[One sentence.]
+
+## Judgment Calls For Erin
+[Optional — omit when nothing to flag.]
+```
+
+## Implementation Units
+
+- [ ] **Unit 1: Platform spike — depth-2 dispatch reliability + streaming visibility (HARD GATE)**
+
+**Goal:** Verify Claude Code's `Agent` tool supports depth-2 subagent dispatch (Opus parent → Opus child → 2 parallel Sonnet grandchildren) with visible streaming. Qualitative gate, not numeric thresholds.
+
+**Requirements:** R1, R2, R3 (revised — qualitative)
+
+**Dependencies:** None.
+
+**Files:**
+- Create: `compound-engineering-plugin/tests/spikes/depth-2-dispatch.<ext>` (form determined by feasibility)
+- Create: `compound-engineering-plugin/docs/solutions/2026-05-07-agent-tool-depth-2-spike.md`
+
+**Approach:**
+- **Sub-step (0): Feasibility probe.** Determine if non-interactive Claude Code (`-p` mode or equivalent) supports depth-2 Agent dispatch. If yes → Bash harness. If no → manual reproducible scenarios.
+- **Sub-step (1): Dispatch.** 5 runs of: Opus subagent → 2 parallel Sonnet subagents performing small synthetic tasks. Capture: success/failure per run, qualitative streaming observation.
+- **Sub-step (2): Findings doc.** Records: probe result, harness form, 5/5 dispatch result, streaming observation. Optional notes on observed parent-context impact (qualitative; "felt lighter / similar / heavier" is acceptable).
+
+Time-box to one week. If after one week the probe shows non-interactive depth-2 isn't reliable, harness becomes manual scenarios — that's not a fail, it's a form choice.
+
+**Test scenarios:**
+- *Happy path:* 5/5 runs succeed; streaming visible.
+- *Failure path — dispatch unreliable:* if any run fails (excluding transient model rate limits), spike fails; redesign before further units.
+- *Failure path — streaming opaque:* if leaf tool calls are invisible during the dispatch, spike fails on streaming; redesign.
+
+**Verification:**
+- Findings doc exists.
+- 5/5 dispatch and streaming both pass: proceed to Unit 2.
+- Either fails: halt and surface to Jeff with the findings.
+
+---
+
+- [ ] **Unit 2: ce-run wrapped-phase recognition hook**
+
+**Goal:** Add a small branch to ce-run's phase loop that recognizes `wrapped: true` and yields to the orchestrator's wrapped-phase behavior.
+
+**Requirements:** R5 (revised — hook, not primitive)
+
+**Dependencies:** Unit 1 passes.
+
+**Files:**
+- Modify: `compound-engineering-plugin/plugins/compound-engineering/skills/ce-run/SKILL.md`
+
+**Approach:**
+- In Step 5 ("Execute phases"), after the existing optional/skip-when logic, add: "If the phase has `wrapped: true` in its frontmatter, do NOT invoke the skill in-thread. Instead, follow the orchestrator's wrapped-phase behavior section in its persona file. The orchestrator owns dispatch parameters; ce-run only recognizes the flag and yields control."
+- One-paragraph rationale note: this is a hook with one defined consumer (Erin), not a generic primitive. No schemas defined here.
+
+**Test scenarios:**
+- *Integration:* `/ce:run erin <feature>` reaches the work phase → ce-run yields to Erin's wrapped-phase prose. Verified by Unit 4 dogfood.
+- *No-regression:* other orchestrators (Angie, Edith, Max, etc.) without `wrapped: true` continue dispatching in-thread. Verified by inspection.
+
+**Verification:**
+- ce-run/SKILL.md contains the new branch + rationale.
+- All other ce-run instructions unchanged.
+
+---
+
+- [ ] **Unit 3: Erin orchestrator behavior update**
+
+**Goal:** Add `wrapped: true` to the work phase entry. Add minimal behavior prose covering dispatch, git-diff verification, empty-success check, needs-input halt, run-state writes, and resume protocol.
+
+**Requirements:** R4, R7-R12, R18-R19, R21-R23, R26-R28 (revised — minimal)
+
+**Dependencies:** Units 1 and 2 complete.
+
+**Files:**
+- Modify: `ce-reviewers-jsl/orchestrators/erin.md`
+
+**Approach:**
+- Frontmatter: add `wrapped: true` to the `work` phase entry.
+- New behavior section "## Wrapped phases":
+  - **Dispatch sequence.** Write run-state.md `about_to_dispatch` → capture pre-dispatch SHA → update run-state with SHA → Agent dispatch (model opus, prompt: skill invocation + args; no review-preferences in v1).
+  - **Inline-only execution constraint (Workaround A).** Subagents in Claude Code do not have the `Agent` tool (verified by Unit 1 spike). The wrapped `/ce:work` invocation MUST therefore be constrained to inline execution. Erin's dispatch prompt must include: (1) "Choose **Inline** execution strategy in Phase 1 step 4; do NOT use Serial subagents, Parallel subagents, or Swarm Mode — they will fail at depth-2." (2) "Do NOT invoke `/ce:review`, `/ce:plan`, or any other skill that internally dispatches via `Agent` — those are separate Erin phases and must not be nested inside the wrapped `work` phase." Without these clauses the wrapped phase will error mid-flight when it tries to dispatch and the Agent tool isn't available. If a future plan genuinely needs internal fan-out, escalate to Workaround B (`claude -p` subprocess) per the spike findings — out of scope for v1.
+  - **Verification on return.** Read handoff. Run `git diff --stat <pre_sha>` (no `..HEAD`) for verified files/lines.
+  - **Empty-success check.** If `status in {success, partial}` AND verified files=0 AND verified lines=0 → re-spawn once with corrective prompt naming the missing evidence; surface on second occurrence.
+  - **Failure recovery.** Re-spawn once on detectable failure (Agent error, malformed handoff frontmatter, missing referenced artifacts); surface on second.
+  - **needs-input is a hard halt.** Subagent's questions surfaced to Jeff; workflow stops; Jeff re-runs `/ce:run erin` with answers in args.
+  - **Run-state writes.** Pre-dispatch (with SHA), post-return (with verified evidence + outcome). Write tool's built-in atomicity is sufficient.
+  - **Resume protocol.** Read run-state.md; scan `<run-dir>/handoffs/` for any handoff with mtime newer than run-state.md mtime → reconcile; proceed from `current_phase` + `current_phase_status`.
+- **Run-state location.** `docs/runs/<run-id>/run-state.md` for the workflow's entire lifetime. No migration.
+- **No formal Erin-disagrees protocol** — Erin reads the handoff and uses judgment as she always has between phases.
+
+**Test scenarios:** Coverage by Unit 4 dogfood.
+
+**Verification:**
+- Erin's frontmatter has `wrapped: true` on exactly the work phase.
+- Body contains the new "## Wrapped phases" section.
+- Existing prose (review-preferences, persona voice, gate descriptions, non-wrapped behavior) preserved unchanged.
+
+---
+
+- [ ] **Unit 4: Dogfood on one real feature**
+
+**Goal:** Run `/ce:run erin` on a real feature end-to-end with the wrapped work phase. Note what worked, what broke, what feels different. Decide v2 scope from the experience, not anticipation.
+
+**Requirements:** Validates the entire v1 (success criteria, all behavioral requirements).
+
+**Dependencies:** Units 1, 2, 3 complete.
+
+**Files:**
+- Create: `compound-engineering-plugin/docs/solutions/2026-05-XX-erin-wrapping-dogfood.md` — running notes during the dogfood run + post-mortem
+
+**Approach:**
+- Pick a real feature Jeff would run `/ce:run erin` on anyway (not synthetic). Run the workflow end-to-end with v1's wrapped work phase.
+- During the run, note: did the work-phase dispatch work? Did streaming feel adequate? Did context feel less crowded than today? Did the empty-success check fire (or not)? Did `needs-input` hit (and what happened)? Did Jeff `/compact` mid-flow (intentionally or not)? Did resume work?
+- After the run, write a short post-mortem in solutions/: what behaved as designed, what surprised, what should v2 add or change. This becomes the input to whatever comes next (wrapping `review`? Running per-phase token measurement? Adding the FLOOR rule because of a real false positive?).
+
+**Test scenarios:**
+- *Happy path:* workflow completes; Jeff's qualitative read is "context felt better."
+- *Friction path:* something breaks (sanity check fires wrongly, streaming opaque, resume confused) — captured in the post-mortem with enough detail to inform v2.
+
+**Verification:**
+- The dogfood run completed (success or instructive failure).
+- Post-mortem doc exists with concrete observations and v2 recommendations.
+
+## System-Wide Impact
+
+- **Interaction graph:** ce-run gains a small wrapped-phase recognition branch (Unit 2). Erin's prose owns dispatch and post-return behavior (Unit 3). Other orchestrators unaffected.
+- **Error propagation:** Failures in the wrapped subagent surface via the handoff doc and return value. Erin retries once or surfaces.
+- **State lifecycle:**
+  - `/compact` between dispatch and return: handoff-mtime > run-state-mtime detection on resume reconciles.
+  - `/compact` between SHA capture and run-state SHA-update: pre-dispatch run-state write happens FIRST as recovery anchor.
+  - `/compact` between subagent return and Erin's post-return run-state write: same handoff-mtime detection.
+  - Half-written run-state.md: Write tool's built-in atomicity covers this on most platforms; mtime reconciliation provides ground truth if state looks stale.
+- **API surface parity:** No external API changes. Wrapped subagent prompt scope excludes review-preferences in v1 — fine because review isn't wrapped.
+- **Unchanged invariants:** all other orchestrators, `/ce:work` skill, `/ce:review` reviewer dispatch (depth-1), existing plans/runs.
+
+## Risks & Dependencies
+
+| Risk | Type | Response |
+|------|------|----------|
+| Spike (Unit 1) finds depth-2 dispatch unreliable | Failure response | Halt; surface findings; redesign |
+| Spike finds streaming opaque ("black box for N minutes") | Failure response | Halt; depth-2 isn't viable for v1's UX bet |
+| ce-run hook doesn't reliably trigger Erin's wrapped behavior | Mitigation via Unit 4 | Dogfood catches it; iterate ce-run prose if needed |
+| `/ce:work` is not idempotent enough for `needs-input` re-runs | Acceptance | R23 hard halt accepts this in v1; documented "wasted work, not destructive" |
+| Empty-success check false-positives (e.g., legitimate /ce:work runs that stage but don't commit) | Mitigation | `git diff --stat <pre-sha>` without `..HEAD` includes working tree |
+| Wrapping `work` doesn't actually save much main-thread context (assumption unmeasured) | Acceptance | Unit 4 dogfood validates qualitatively. If v1 doesn't help, redesign — don't build v2 |
+| Plan file renamed mid-workflow → handoff dir orphaned (kept at runs/<run-id>/, not plans/<stem>/, in v1) | Mitigation by location | Run-state lives at `docs/runs/<run-id>/` regardless of plan filename. Decoupled. |
+| Future contributor reads erin.md and doesn't grok the wrapping pattern | Acceptance + Unit 4 + post-merge | Sandy's cognitive-load concern noted; if confusion appears in real use, add a "How wrapping works" doc. v1 keeps it inline. |
+
+## Documentation / Operational Notes
+
+- **Spike findings** (Unit 1) → `docs/solutions/`. Institutional learning for future depth-2 work.
+- **Dogfood post-mortem** (Unit 4) → `docs/solutions/`. Drives v2 scope decisions on evidence.
+- **ce-run hook** (Unit 2) — documented inline in SKILL.md as "hook, not primitive."
+- **Erin persona** (Unit 3) — only user-visible behavior change. Jeff will notice that `/ce:run erin` runs `work` differently.
+- **No deployment, monitoring, feature-flag, or rollout concerns.** Local tooling. Rollout = merge → behavior changes on next `/ce:run erin`.
+
+## Sources & References
+
+- **Origin document:** `docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md`
+- **Plan-review feedback** (Jason / Charles / Marty / Melissa / Sandy) drove the v3 simplification. Their convergent guidance: ship the cupcake; let dogfood drive what comes next.
+- **CLAUDE.md (compound-engineering-plugin):** subagent-spawning-subagents constraints
+- **Reference orchestrators:** `ce-reviewers-jsl/orchestrators/erin.md`, `lfg.md`, `max.md`
+- **Modified skill:** `compound-engineering-plugin/plugins/compound-engineering/skills/ce-run/SKILL.md` — small wrapped-phase hook
diff --git a/docs/solutions/2026-05-07-agent-tool-depth-2-spike.md b/docs/solutions/2026-05-07-agent-tool-depth-2-spike.md
new file mode 100644
index 000000000..e354f3fd6
--- /dev/null
+++ b/docs/solutions/2026-05-07-agent-tool-depth-2-spike.md
@@ -0,0 +1,107 @@
+# Agent Tool Depth-2 Dispatch Spike — Findings
+
+**Date:** 2026-05-07
+**Plan reference:** [docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md](../plans/2026-05-07-001-feat-erin-phase-isolation-plan.md), Unit 1
+**Harness:** [tests/spikes/depth-2-dispatch.md](../../tests/spikes/depth-2-dispatch.md)
+**Verdict:** ⚠️ **CONDITIONAL — direct depth-2 Agent dispatch is not supported, but two workarounds verified to sidestep the constraint.**
+
+## Summary
+
+The spike was the hard gate for the Erin phase-isolation feature. It tested whether Claude Code's `Agent` tool supports depth-2 subagent dispatch — i.e., whether a subagent spawned via `Agent` from the main thread can itself spawn further subagents via `Agent`. **It does not.** Subagents in Claude Code receive `Bash`, `Read`, `Edit`, `Write`, `Skill`, and several utility tools, but **not** the `Agent` tool. They have no mechanism to dispatch sub-subagents.
+
+The plan's entire wrapping architecture rests on the assumption that depth-2 dispatch works: Erin (main thread, depth-0) would dispatch `/ce:work` as a subagent (depth-1), and `/ce:work` would then spawn its internal reviewer panel via `Agent` (depth-2). Since depth-2 dispatch is unavailable, **the architecture as planned cannot ship.** All subsequent units (ce-run hook, erin.md update, dogfood) are gated on this finding and do not proceed.
+
+The spike did its job: it surfaced the architectural blocker cheaply, before any of the dependent code was written. This is exactly the failure mode the hard-gate sequencing was designed to catch.
+
+## What was tested
+
+**Form chosen:** Manual reproducible scenario (markdown harness in `tests/spikes/depth-2-dispatch.md`), executed inline from a main-thread Claude Code session via 5 parallel Agent dispatches. A `claude -p` Bash harness was not pursued because main-thread interactive Claude can run depth-2 dispatch directly when the platform supports it — which is the actual question. The markdown form is durable; re-running it on any future Claude Code update is a paste-and-go operation.
+
+**Synthetic task:** Each trial coordinator was instructed to spawn two parallel Sonnet leaf subagents via `Agent`. Each leaf would run a single `Bash` command counting markdown files in a deterministic directory (orchestrators or reviewers in `ce-reviewers-jsl`). Coordinator would sum the counts and return.
+
+**Trials run:** 5 (in parallel from main thread).
+
+## Results
+
+| Trial | Coordinator type | Outcome | Reason |
+|-------|------------------|---------|--------|
+| 1 | Opus general-purpose subagent | ❌ Could not dispatch | Reported "Agent tool is not available in this environment - it's not in the deferred tools list and ToolSearch found no match" |
+| 2 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fell back to direct Bash; reported orchestrators=10, reviewers=14, total=24 |
+| 3 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |
+| 4 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |
+| 5 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |
+
+**0 of 5 trials achieved depth-2 dispatch.** All 5 coordinators independently and consistently reported the same finding: their tool environment does not include `Agent`, even though the platform's general-purpose subagent description claims `(Tools: *)`. The wildcard description is misleading; in practice `Agent` is excluded from subagent contexts.
+
+Four of the five coordinators recovered gracefully by computing the leaf task directly via `Bash` (the wrong primitive for the spike's intent, but at least proves leaf-task competence). All four returned the same correct totals (`orchestrators=10 reviewers=14 total=24`), which validates that leaf-level work isn't itself the failure mode — only the depth-2 dispatch capability.
+
+## Pass/fail per gate criterion
+
+- **(a) Main-thread token isolation:** Cannot be measured — there was no successful depth-2 dispatch to measure isolation against.
+- **(b) Streaming fidelity:** Cannot be observed at depth-2 — the depth-2 layer doesn't exist.
+- **(c) Reliability (5/5 success):** **FAIL.** 0 of 5 trials succeeded.
+
+Per the plan: "If any of (a), (b), or (c) fail at thresholds OR sub-step (1) yields no measurement primitive, halt the rest of the plan and surface to Jeff with the findings doc." (c) failed unambiguously.
+
+## Why the spike caught what design review couldn't
+
+The earlier adversarial-document review (F1, HIGH 0.85 confidence) flagged exactly this risk: *"The foundational Agent tool assumption is asserted, not verified... CLAUDE.md flags subagent-spawning-subagents as a known failure mode."* The plan responded by making R1 a hard gate with explicit numeric thresholds — but the gate's value turns out to be even simpler than calibrating thresholds: just running the dispatch once was enough. The platform doesn't permit it.
+
+The 0-of-5 result is a binary platform constraint, not a calibration question. No amount of threshold-tuning, harness sophistication, or persona-prose precision would have changed the answer.
+
+## Implications for the feature
+
+1. **The current architecture cannot ship.** Wrapping `/ce:work` via `Agent` would create a subagent that cannot run `/ce:work`'s internal reviewer-panel dispatches (those are also `Agent` calls). The wrapped phase would either error or produce broken output.
+2. **The plan's Units 2–4 do not execute.** ce-run hook, erin.md update, and dogfood all depended on a working depth-2 dispatch. Halted.
+3. **Some narrower designs may still be viable** (see Possible Paths Forward). The spike fails the *original* hypothesis but doesn't preclude other approaches to the same user pain.
+
+## Workarounds verified after the initial fail
+
+### Workaround A: Inline-only wrapped `/ce:work` ✅ Viable for v1
+
+The platform constraint is "subagents cannot call Agent." But `/ce:work` doesn't *intrinsically* need Agent — most of its work is Bash + Read + Edit + Write, all of which subagents have. The Agent calls inside `/ce:work` are:
+
+1. **"Choose Execution Strategy: Subagents"** — an *optional* optimization for large parallel-able plans. Default is inline.
+2. **Invoking `/ce:review`** — but `/ce:review` is a *separate Erin phase* that runs after `work`, not inside it.
+
+If `/ce:work-wrapped` is constrained to **inline execution only** AND **does not invoke `/ce:review` or other Agent-dispatching skills internally**, the depth-2 constraint never applies. Erin dispatches `/ce:work-wrapped` at depth-1 (works), and `/ce:work-wrapped` does its work using only the tools available to subagents.
+
+Trade-off: large plans that would have benefited from parallel subagent execution lose that strategy when wrapped. Acceptable for v1; if a real plan needs it, run unwrapped or use Workaround B.
+
+### Workaround B: Subprocess via `claude -p` ✅ Verified working
+
+**Test setup:** Main thread → Agent subagent (no Agent tool, confirmed) → Bash invocation of `claude -p --dangerously-skip-permissions "..."` → fresh top-level Claude session (HAS Agent) → 2 parallel Sonnet sub-subagents performing the same file-counting task.
+
+**Result:** Subprocess completed cleanly, exit code 0, ~16s wall-clock. SUBPROCESS_RESULT line returned correct counts (orchestrators=10, reviewers=14, total=24). No permission prompts. No errors.
+
+**Mechanism:** The platform's depth-2 ban applies within a single Claude Code process. Spawning a fresh `claude -p` process resets the depth count from that subprocess's perspective. The subprocess can then dispatch its own Agent subagents normally.
+
+**Trade-offs:**
+- Latency overhead ~5-15s per subprocess (process startup, fresh session). Real fan-outs will be dominated by subagent work itself, so the relative cost is small.
+- **Streaming fidelity regression.** The subprocess's tool calls return as a single block to the parent's Bash output rather than streaming during execution. The user terminal would see "subprocess running..." rather than live tool calls during the wrapped phase. This is a real UX regression vs. depth-1 streaming.
+- Each subprocess is a fresh API session — system prompt re-cached, CLAUDE.md re-loaded. Modest cost.
+- Permission model: `--dangerously-skip-permissions` was used. For wrapped phases inside an already-authorized workflow, this is probably fine, but worth flagging.
+- Authentication inherits from environment (Anthropic API key, OAuth, etc.). Worked transparently.
+- Recursion: nothing prevents subprocess-of-subprocess, but each level adds overhead.
+
+**When to use B over A:** A is sufficient for v1's "wrap `/ce:work`" use case. B is the escape hatch when future wrapped phases genuinely need internal Agent fan-out (e.g., wrapping `/ce:review` or any phase that internally dispatches reviewer panels).
+
+## Possible paths forward (for Jeff to decide)
+
+These are sketches, not commitments. Each needs its own brainstorm/plan if pursued.
+
+1. **Ship the run-state.md half only.** The brainstorm explicitly noted that pain #2 (loss of workflow thread across `/compact`) is fully solved by Erin writing a small persistent file — no subagent isolation needed. That half doesn't depend on depth-2 dispatch and could ship for all orchestrators in days. Pain #1 (context bloat in `/ce:work`) remains unaddressed.
+2. **Wrap `/ce:work` only at the top level (no internal reviewer panels).** If `/ce:work` could be restructured so its internal Agent dispatches happen *outside* the wrapped boundary, depth-1 wrapping would be sufficient. This is a substantial `/ce:work` redesign with unclear feasibility.
+3. **Trim `/ce:work`'s in-thread footprint instead of wrapping it.** Marty's plan-review point: the cheapest fix may be upstream — restructure `/ce:work` to produce less main-thread chatter (e.g., write tool transcripts to disk rather than echoing them, summarize rather than log). Doesn't require depth-2 dispatch.
+4. **Accept the constraint and use external session boundaries.** Jeff already has the option of running `/ce:work` in a dedicated Claude Code session (or worktree). Tooling around starting/resuming such sessions cleanly would address the same pain without changing dispatch semantics.
+5. **Petition the platform.** If depth-2 dispatch is a deliberate Claude Code constraint, working around it may be wrong; if it's an oversight, requesting the capability through proper channels may unblock the original architecture cleanly.
+
+## Recommended next step
+
+**Halt this plan.** Update `docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md` status to `blocked` (or close it as `failed`) with a pointer to this findings doc. Decide which of the paths above to pursue and start a new brainstorm for that direction. Do not attempt Units 2–4 of the current plan.
+
+## Artifact integrity notes
+
+- Trial output is deterministic on this codebase as of 2026-05-07: `orchestrators=10, reviewers=14, total=24`. If a future re-run of the harness produces different numbers, the codebase has changed (which is fine) — what matters is whether the *dispatch* succeeds, not the totals.
+- The harness doc (`tests/spikes/depth-2-dispatch.md`) remains useful as a regression test: any future Claude Code update that DOES enable depth-2 dispatch will produce 5/5 successful trials when the harness is re-run, signaling that the architectural blocker has lifted.
+- This findings doc is the canonical record of the spike's outcome. It is not superseded by future re-runs unless explicitly revised.
diff --git a/plugins/compound-engineering/skills/ce-run/SKILL.md b/plugins/compound-engineering/skills/ce-run/SKILL.md
index e12d2802e..54557e09d 100644
--- a/plugins/compound-engineering/skills/ce-run/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-run/SKILL.md
@@ -52,15 +52,17 @@ For each phase in the `phases` list from frontmatter:
 
 1. **Check if optional** — If the phase has `optional: true` and `skip-when`, evaluate whether to skip based on the feature description and context. Explain your reasoning to the user.
 
-2. **Invoke the skill** — Run the skill specified in `skill:`, passing `args:` with variable substitution:
+2. **Check if wrapped** — If the phase has `wrapped: true` in its frontmatter, do NOT invoke the skill in-thread. Instead, yield control to the orchestrator: follow the orchestrator persona's wrapped-phase behavior section (e.g., Erin's `## Wrapped phases`) for dispatch, post-return verification, and run-state management. The orchestrator owns dispatch parameters, the constraints injected into the wrapped subagent's prompt, and all post-return logic. ce-run's role here is recognition only — it does not define a generic `wrapped: <persona>` primitive, schema, or dispatch shape; it just yields. After the orchestrator's wrapped-phase behavior returns control, continue with steps 4–6 (gate evaluation, state tracking, signals) for this phase as normal. If `wrapped: true` is absent (or false), proceed to step 3 (in-thread invocation).
+
+3. **Invoke the skill** — Run the skill specified in `skill:`, passing `args:` with variable substitution:
    - `$ARGUMENTS` → the feature description from step 1
    - `$PLAN_PATH` → the path to the plan file created during the plan phase
 
-3. **Evaluate the gate** — If the phase has a `gate:`, verify the gate conditions are met before proceeding to the next phase. If the gate fails, retry or ask the user for guidance (depending on orchestrator personality).
+4. **Evaluate the gate** — If the phase has a `gate:`, verify the gate conditions are met before proceeding to the next phase. If the gate fails, retry or ask the user for guidance (depending on orchestrator personality).
 
-4. **Track state** — Remember the plan file path when created, track which phases have completed, note key decisions.
+5. **Track state** — Remember the plan file path when created, track which phases have completed, note key decisions.
 
-5. **Handle signals** — If the phase has a `signal:` instead of a `skill:`, output that signal (e.g., `<promise>DONE</promise>`).
+6. **Handle signals** — If the phase has a `signal:` instead of a `skill:`, output that signal (e.g., `<promise>DONE</promise>`).
 
 ### Variable threading
 
diff --git a/tests/spikes/depth-2-dispatch.md b/tests/spikes/depth-2-dispatch.md
new file mode 100644
index 000000000..dd6393eee
--- /dev/null
+++ b/tests/spikes/depth-2-dispatch.md
@@ -0,0 +1,45 @@
+# Depth-2 Agent Dispatch Spike
+
+**Purpose:** Verify Claude Code's `Agent` tool supports depth-2 subagent dispatch (Opus parent → Opus child → 2 parallel Sonnet grandchildren) with visible terminal streaming. This is the hard gate for the Erin phase-isolation feature (`docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md`, Unit 1).
+
+**Form:** Manual reproducible scenario. The form choice is documented in the findings doc (`docs/solutions/2026-05-07-agent-tool-depth-2-spike.md`). A Bash harness using `claude -p` was not pursued for the initial validation because main-thread interactive Claude can run depth-2 dispatch directly, which is the actual platform behavior the spike needs to validate. This markdown scenario is durable and re-runnable on any Claude Code update.
+
+## How to run
+
+Open a Claude Code session in any repo (the synthetic task references absolute paths so the cwd doesn't matter). Paste the prompt below five times in sequence (or in parallel using multiple Agent tool calls in one message). Each run is one trial.
+
+## The prompt to paste
+
+```
+Spawn an Opus subagent via the Agent tool with this prompt:
+
+"You are a trial coordinator for a depth-2 dispatch spike. Spawn TWO parallel Sonnet subagents using the Agent tool with these prompts:
+
+  Sonnet A prompt: 'Count the markdown files in /Users/jeffcasimir/Projects/ce-reviewers-jsl/orchestrators/ using a single shell command. Return the integer count.'
+
+  Sonnet B prompt: 'Count the markdown files in /Users/jeffcasimir/Projects/ce-reviewers-jsl/reviewers/ using a single shell command. Return the integer count.'
+
+Wait for both Sonnets to return. Sum their results. Return a single line of the form 'TRIAL_RESULT: orchestrators=N reviewers=M total=K' where N, M, K are integers."
+
+Pass model: 'opus' to the outer Agent call and let the Sonnet subagents inherit. After the Opus subagent returns, report:
+  - Did the dispatch succeed end-to-end?
+  - Did you (the user, observing your terminal during the run) see the Sonnet subagents' tool calls streaming in real time, or did the dispatch appear as a single opaque block?
+  - The TRIAL_RESULT line.
+```
+
+## What to observe per trial
+
+1. **Did depth-2 succeed?** The trial coordinator (Opus) was supposed to spawn 2 Sonnet leaves and return a TRIAL_RESULT. If it errored, hung, or returned without the expected format, that trial failed.
+2. **Did leaf streaming render in your terminal?** During the trial, did you see Sonnet tool calls (their `Bash` invocations or similar) streaming in real time, or did the terminal sit silent until the Opus subagent returned?
+3. **Did parent context feel materially different?** Subjective — did your main-thread context grow noticeably during the trial, or did it stay light because the leaf activity was isolated?
+
+## Pass/fail gates (per the plan)
+
+- **5/5 dispatch success.** Transient model rate limits don't count as fails — re-run those iterations.
+- **Streaming visible.** Leaf tool call names + abbreviated input visible in real time during the run. If the trial appears as a "black box for N seconds," streaming fidelity fails.
+
+If either gate fails, the spike fails and the rest of the plan halts pending redesign.
+
+## Recording results
+
+Append per-trial observations to `docs/solutions/2026-05-07-agent-tool-depth-2-spike.md` under the "Trials" section. The findings doc is the canonical artifact; this file is the runnable harness.