Description
Problem
When a sandbox starts with --from (a custom container image), the sandbox supervisor can fail to start for various reasons:
- Policy parse failure: the container ships a stale `policy.yaml` with removed fields (e.g. `inference`). `discover_policy_from_path` catches the error and silently falls back to the restrictive default. The user sees a working sandbox with the wrong policy and no explanation.
- User validation failure: the container doesn't have a `sandbox` user. The pod crashloops with no actionable message.
- Network namespace setup failure: same crashloop-with-no-message outcome.
- Any other fatal startup error: OPA engine construction, TLS generation, etc. Same outcome.
The root cause is that the supervisor has no channel to report errors back to the gateway. When run_sandbox returns Err(...), the process exits, the pod restarts, and the CRD watcher sees DependenciesNotReady → Provisioning forever. The CLI eventually times out (120s) with a generic message.
Regression context
Commit e3ea796 removed the inference field from the PolicyFile serde struct (which uses deny_unknown_fields). Any container image that still ships a policy.yaml with an inference: section fails to parse. The error is swallowed and the restrictive default (all network blocked) is synced to the gateway as the baseline. The intent is NOT to add backward-compat serde fields — the YAML schema is the schema. Instead, the error should be reported to the user.
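The difference between swallowing and propagating the parse error can be sketched as follows. This is a minimal, std-only illustration and not the project's code: the real parser uses serde with `deny_unknown_fields`, whereas `parse_policy_strict` here is a hand-rolled stand-in that only inspects top-level keys, and `discover_lenient`, `discover_fallible`, and the allowed field names used with them are hypothetical.

```rust
/// Hypothetical stand-in for a strict parse: rejects any top-level key that
/// is not in `allowed`, roughly what `deny_unknown_fields` does in serde.
pub fn parse_policy_strict(yaml: &str, allowed: &[&str]) -> Result<Vec<String>, String> {
    let mut keys = Vec::new();
    for (idx, line) in yaml.lines().enumerate() {
        // Only top-level keys matter here: skip blanks, comments, nested lines.
        let is_nested = line.starts_with(char::is_whitespace);
        if line.trim().is_empty() || line.starts_with('#') || is_nested {
            continue;
        }
        let key = match line.split_once(':') {
            Some((k, _)) => k.trim(),
            None => return Err(format!("line {}: expected `key:`", idx + 1)),
        };
        if !allowed.contains(&key) {
            return Err(format!("line {}: unknown field `{}`", idx + 1, key));
        }
        keys.push(key.to_string());
    }
    Ok(keys)
}

/// Behavior described in the problem statement: the error is swallowed and
/// an (empty, restrictive) default is used instead.
pub fn discover_lenient(yaml: &str, allowed: &[&str]) -> Vec<String> {
    parse_policy_strict(yaml, allowed).unwrap_or_default()
}

/// Proposed behavior: the error propagates to the caller with some context.
pub fn discover_fallible(yaml: &str, allowed: &[&str]) -> Result<Vec<String>, String> {
    parse_policy_strict(yaml, allowed).map_err(|e| format!("policy.yaml: {e}"))
}
```

With a stale file containing an `inference:` section, `discover_lenient` returns the empty default with no trace of why, while `discover_fallible` returns an error that names the offending field and can be forwarded to the user.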
Solution
Add a generalized supervisor startup error reporting path:
- Sandbox supervisor hits a fatal startup error
- Before exiting, calls a new `ReportSupervisorError` gRPC RPC (fire-and-forget, 5s timeout)
- Gateway sets `phase=Error` with condition `Ready=False, reason=SupervisorError, message=<the error>`, persists, notifies watch bus, acks immediately, then spawns async K8s resource deletion
- CLI watch loop (existing) sees `Error` phase, displays the real error message, exits
- Supervisor exits non-zero: pod is already being deleted, no crashloop
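The supervisor side of this flow (one attempt, bounded wait, exit regardless) might look roughly like the sketch below. It uses only the standard library so it runs without dependencies; the real implementation would presumably use the async gRPC client with a tokio timeout. The `send` closure stands in for the `ReportSupervisorError` call, and `ReportOutcome`, `report_once`, and `exit_code_after_report` are hypothetical names.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Result of the single best-effort report attempt.
#[derive(Debug, PartialEq)]
pub enum ReportOutcome {
    Acked,
    Failed(String),
    TimedOut,
}

/// Runs `send` once on a helper thread and waits at most `timeout` for it.
/// `send` is a placeholder for the real `ReportSupervisorError` RPC.
pub fn report_once<F>(send: F, timeout: Duration) -> ReportOutcome
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already timed out, the receiver is gone; ignore that.
        let _ = tx.send(send());
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(())) => ReportOutcome::Acked,
        Ok(Err(e)) => ReportOutcome::Failed(e),
        Err(RecvTimeoutError::Timeout) => ReportOutcome::TimedOut,
        Err(RecvTimeoutError::Disconnected) => {
            ReportOutcome::Failed("reporter ended without a result".to_string())
        }
    }
}

/// Reports the startup error (best effort) and returns the exit code to use.
/// The outcome is deliberately ignored: the supervisor exits non-zero either way.
pub fn exit_code_after_report<F>(send: F, timeout: Duration) -> i32
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let _ = report_once(send, timeout);
    1
}
```

The point of the design is that a slow or unreachable gateway can delay the exit by at most the timeout (5s in the plan above) and can never prevent it.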
Key design decisions
- Fire-and-forget RPC: supervisor tries once with a short timeout, then exits regardless
- Async resource deletion: gateway acks the RPC immediately, then uses `tokio::spawn` to delete K8s resources in the background (avoids a deadlock where the pod can't terminate while waiting for the RPC response)
- `handle_applied` preservation: CRD watcher must not overwrite a gateway-set `SupervisorError` phase with `Provisioning`
- Error truncation: messages truncated to ~4KB on the gateway
- `discover_policy_from_path` becomes fallible: parse/validation errors propagate instead of being silently swallowed
- Zero CLI changes: existing watch loop at `run.rs:2030` already handles `Error` phase
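Two of the gateway-side decisions above, error truncation and ack-before-delete, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the exact 4096-byte cap (the plan only says "~4KB"), the `SandboxRecord` shape, and the function names. A standard-library thread is swapped in for `tokio::spawn` so the example runs without dependencies.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Assumed cap; the plan only specifies "~4KB".
pub const MAX_ERROR_BYTES: usize = 4096;

/// Truncates `msg` to at most `max_bytes` without splitting a UTF-8 character.
pub fn truncate_error(msg: &str, max_bytes: usize) -> &str {
    if msg.len() <= max_bytes {
        return msg;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !msg.is_char_boundary(end) {
        end -= 1;
    }
    &msg[..end]
}

/// Hypothetical in-memory stand-in for the persisted sandbox state.
#[derive(Debug, Default)]
pub struct SandboxRecord {
    pub phase: String,
    pub message: String,
    pub resources_deleted: bool,
}

/// Records the error, then returns (the "ack") while deletion runs in the
/// background. The returned handle exists only so callers can wait in tests.
pub fn handle_report(record: &Arc<Mutex<SandboxRecord>>, error: &str) -> JoinHandle<()> {
    {
        let mut r = record.lock().unwrap();
        r.phase = "Error".to_string();
        r.message = truncate_error(error, MAX_ERROR_BYTES).to_string();
    } // Lock released here, before any cleanup starts.
    let background = Arc::clone(record);
    thread::spawn(move || {
        // Placeholder for the K8s resource deletion.
        background.lock().unwrap().resources_deleted = true;
    })
}
```

Because the state update completes before the function returns and the deletion is detached, the caller gets its response without waiting on pod termination, which is the deadlock the async deletion is meant to avoid.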
Implementation Plan
Full plan with code sketches: architecture/plans/supervisor-startup-error-reporting.md
Sequenced work items
| Step | Scope | Files | Independently Mergeable? |
|---|---|---|---|
| 1 | Proto: add `ReportSupervisorError` RPC + messages | `proto/openshell.proto` | Yes (additive) |
| 2 | Supervisor: fire-and-forget gRPC client function | `grpc_client.rs`, `lib.rs` | Yes (dead code until step 3) |
| 3 | Supervisor: wrap `main.rs` error path to call error reporter | `main.rs` | Yes (fire-and-forget, gateway rejects until step 5) |
| 4 | Supervisor: make `discover_policy_from_path` fallible | `lib.rs` (3 functions) | Yes (behavior change, testable standalone) |
| 5 | Gateway: implement `ReportSupervisorError` handler | `grpc.rs` | Yes (no callers until step 2+3 deployed) |
| 6 | Gateway: protect `handle_applied` from overwriting `SupervisorError` | `sandbox/mod.rs` | Yes (purely defensive) |
Recommended merge order: 1 → 2+5 (parallel) → 3+6 (parallel) → 4 (last — this is the behavior change)
Acceptance Criteria
- Sandbox created with `--from` using a container with a stale/malformed `policy.yaml` shows the parse error to the user during `openshell sandbox create`
- Sandbox transitions to `Error` phase with the actual error message visible via `openshell sandbox get`
- K8s resources are cleaned up automatically (no orphaned crashlooping pods)
- If the gateway is unreachable when the supervisor tries to report, the supervisor still exits (no hang)
- Other supervisor startup errors (user validation, netns setup, etc.) are also reported through the same channel
- `handle_applied` does not overwrite a `SupervisorError` phase with `Provisioning`
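
The last criterion amounts to a small merge rule in the CRD watcher. The sketch below shows one way to express it; the `Phase` and `Status` types and the `merge_applied` function are hypothetical and only model the rule, not the actual `handle_applied` code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Provisioning,
    Ready,
    Error,
}

/// Hypothetical, simplified view of a sandbox status.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Status {
    pub phase: Phase,
    pub reason: Option<String>,
}

/// Decides which status to keep when the watcher observes an applied object.
/// A gateway-set `SupervisorError` is sticky against `Provisioning`; every
/// other observation replaces the current status as before.
pub fn merge_applied(current: &Status, observed: Status) -> Status {
    let gateway_set_error = current.phase == Phase::Error
        && current.reason.as_deref() == Some("SupervisorError");
    if gateway_set_error && observed.phase == Phase::Provisioning {
        current.clone()
    } else {
        observed
    }
}
```

Keeping the check narrow (only `SupervisorError`, only against `Provisioning`) matches the "purely defensive" scope of step 6 and leaves all other phase transitions untouched.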