feat: report supervisor startup errors to CLI via gateway #289

@johntmyers

Problem

When a sandbox starts with --from (a custom container image), the sandbox supervisor can fail to start for various reasons:

  1. Policy parse failure — the container ships a stale policy.yaml with removed fields (e.g. inference). discover_policy_from_path catches the error and silently falls back to the restrictive default. The user sees a working sandbox with the wrong policy and no explanation.
  2. User validation failure — the container doesn't have a sandbox user. The pod crashloops with no actionable message.
  3. Network namespace setup failure — same crashloop-with-no-message outcome.
  4. Any other fatal startup error — OPA engine construction, TLS generation, etc. Same outcome.

The root cause is that the supervisor has no channel to report errors back to the gateway. When run_sandbox returns Err(...), the process exits, the pod restarts, and the CRD watcher sees DependenciesNotReadyProvisioning forever. The CLI eventually times out (120s) with a generic message.

Regression context

Commit e3ea796 removed the inference field from the PolicyFile serde struct (which uses deny_unknown_fields). Any container image that still ships a policy.yaml with an inference: section fails to parse. The error is swallowed and the restrictive default (all network blocked) is synced to the gateway as the baseline. The intent is NOT to add backward-compat serde fields — the YAML schema is the schema. Instead, the error should be reported to the user.
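The difference between swallowing and propagating the parse error can be sketched without serde. This is a minimal, stdlib-only illustration, not the project's code: `Policy`, `KNOWN_KEYS`, `parse_policy`, `discover_lenient`, and `discover_strict` are hypothetical names, and the toy parser only mimics the way `deny_unknown_fields` rejects a top-level key the schema no longer knows.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the real policy type.
#[derive(Debug, Default, PartialEq)]
pub struct Policy {
    pub entries: BTreeMap<String, String>,
}

/// Top-level keys this toy schema accepts (assumed for illustration only).
const KNOWN_KEYS: &[&str] = &["version", "network", "filesystem"];

/// Parses top-level `key: value` lines and rejects unknown keys,
/// mimicking what `deny_unknown_fields` does for a serde struct.
pub fn parse_policy(text: &str) -> Result<Policy, String> {
    let mut policy = Policy::default();
    for line in text.lines() {
        let line = line.trim_end();
        // Skip blank lines, comments, and indented (nested) lines.
        if line.is_empty() || line.starts_with('#') || line.starts_with(' ') {
            continue;
        }
        let (key, value) = line
            .split_once(':')
            .ok_or_else(|| format!("malformed line: {line}"))?;
        if !KNOWN_KEYS.contains(&key) {
            return Err(format!("unknown field `{key}`"));
        }
        policy.entries.insert(key.to_string(), value.trim().to_string());
    }
    Ok(policy)
}

/// Behavior described as the regression: the error is swallowed and the
/// default policy is returned, so the caller never learns anything failed.
pub fn discover_lenient(text: &str) -> Policy {
    parse_policy(text).unwrap_or_default()
}

/// Proposed behavior: the error propagates so the caller can report it.
pub fn discover_strict(text: &str) -> Result<Policy, String> {
    parse_policy(text)
}
```

With a stale file containing an `inference:` section, `discover_lenient` returns the default policy with no signal, while `discover_strict` returns an error naming the offending field.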

Solution

Add a generalized supervisor startup error reporting path:

  1. Sandbox supervisor hits a fatal startup error
  2. Before exiting, calls a new ReportSupervisorError gRPC RPC (fire-and-forget, 5s timeout)
  3. Gateway sets phase=Error with condition Ready=False, reason=SupervisorError, message=<the error>; persists the change; notifies the watch bus; acks the RPC immediately; then spawns async K8s resource deletion
  4. CLI watch loop (existing) sees Error phase, displays the real error message, exits
  5. Supervisor exits non-zero — pod is already being deleted, no crashloop
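Step 2 of the flow above (try once, wait a bounded time, exit regardless) can be sketched with the standard library alone. This is an illustration under assumptions: `report_best_effort` is a hypothetical helper, and the closure stands in for the real `ReportSupervisorError` call, which would be an async gRPC request rather than a thread.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `send` on a helper thread and waits at most `timeout` for it.
/// Returns true only if the report was acknowledged in time.
/// The caller is expected to exit either way.
pub fn report_best_effort<F>(send: F, timeout: Duration) -> bool
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out;
        // a failed send is deliberately ignored.
        let _ = tx.send(send());
    });
    // A timeout, a failed report, and a panicked sender all count as
    // "not reported"; none of them blocks the caller beyond `timeout`.
    matches!(rx.recv_timeout(timeout), Ok(Ok(())))
}
```

A supervisor `main` could call this with a 5 second timeout on the error path and then exit non-zero whether it returned `true` or `false`, which matches the "no hang if the gateway is unreachable" requirement.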

Key design decisions

  • Fire-and-forget RPC: the supervisor tries once with a short timeout, then exits regardless of the outcome
  • Async resource deletion: the gateway acks the RPC immediately, then uses tokio::spawn to delete K8s resources in the background (this avoids a deadlock where the pod can't terminate while it waits for the RPC response)
  • handle_applied preservation: the CRD watcher must not overwrite a gateway-set SupervisorError phase with Provisioning
  • Error truncation: the gateway truncates messages to ~4KB
  • discover_policy_from_path becomes fallible: parse/validation errors propagate instead of being silently swallowed
  • Zero CLI changes: the existing watch loop at run.rs:2030 already handles the Error phase
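The error truncation decision has one subtlety worth illustrating: cutting a Rust string at a fixed byte offset can land inside a multi-byte UTF-8 character. A minimal sketch, assuming an exact limit of 4096 bytes (the issue only says "~4KB") and hypothetical names `MAX_ERROR_BYTES` and `truncate_message`:

```rust
/// Assumed limit; the issue says "~4KB" without fixing an exact number.
pub const MAX_ERROR_BYTES: usize = 4096;

/// Returns the longest prefix of `msg` that fits in `max_bytes`
/// without splitting a UTF-8 character.
pub fn truncate_message(msg: &str, max_bytes: usize) -> &str {
    if msg.len() <= max_bytes {
        return msg;
    }
    // Walk back from the limit to the nearest character boundary.
    // Offset 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !msg.is_char_boundary(end) {
        end -= 1;
    }
    &msg[..end]
}
```

Returning a borrowed slice keeps the helper allocation-free; the gateway handler could copy the result into the condition message before persisting.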

Implementation Plan

Full plan with code sketches: architecture/plans/supervisor-startup-error-reporting.md

Sequenced work items

| Step | Scope | Files | Independently Mergeable? |
|------|-------|-------|--------------------------|
| 1 | Proto: add ReportSupervisorError RPC + messages | proto/openshell.proto | Yes (additive) |
| 2 | Supervisor: fire-and-forget gRPC client function | grpc_client.rs, lib.rs | Yes (dead code until step 3) |
| 3 | Supervisor: wrap main.rs error path to call error reporter | main.rs | Yes (fire-and-forget, gateway rejects until step 5) |
| 4 | Supervisor: make discover_policy_from_path fallible | lib.rs (3 functions) | Yes (behavior change, testable standalone) |
| 5 | Gateway: implement ReportSupervisorError handler | grpc.rs | Yes (no callers until step 2+3 deployed) |
| 6 | Gateway: protect handle_applied from overwriting SupervisorError | sandbox/mod.rs | Yes (purely defensive) |

Recommended merge order: 1 → 2+5 (parallel) → 3+6 (parallel) → 4 (last — this is the behavior change)

Acceptance Criteria

  • Sandbox created with --from using a container with a stale/malformed policy.yaml shows the parse error to the user during openshell sandbox create
  • Sandbox transitions to Error phase with the actual error message visible via openshell sandbox get
  • K8s resources are cleaned up automatically (no orphaned crashlooping pods)
  • If the gateway is unreachable when the supervisor tries to report, the supervisor still exits (no hang)
  • Other supervisor startup errors (user validation, netns setup, etc.) are also reported through the same channel
  • handle_applied does not overwrite a SupervisorError phase with Provisioning
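The last criterion amounts to a small merge rule in the CRD watcher. The sketch below is an assumption about shape, not the project's types: `Phase` and `merge_phase` are hypothetical, and the real `handle_applied` presumably works on richer status objects.

```rust
/// Hypothetical, simplified sandbox phase.
#[derive(Debug, Clone, PartialEq)]
pub enum Phase {
    Provisioning,
    Ready,
    Error { reason: String },
}

/// Decides which phase to store when the watcher observes an applied
/// resource. A gateway-set SupervisorError is kept if the watcher would
/// otherwise reset it to Provisioning; every other update passes through.
pub fn merge_phase(stored: &Phase, observed: Phase) -> Phase {
    let sticky = matches!(
        stored,
        Phase::Error { reason } if reason.as_str() == "SupervisorError"
    );
    if sticky && observed == Phase::Provisioning {
        stored.clone()
    } else {
        observed
    }
}
```

Keeping the rule in a pure function like this would let the guard be unit-tested without a cluster, which fits the "purely defensive" framing of step 6.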

Labels: enhancement
