feat(stack): confirm down/purge when services are serving traffic #512

Open: bussyjd wants to merge 1 commit into main from feat/down-purge-confirm-live-services

bussyjd (Collaborator) commented May 23, 2026

Summary

  • New safety gate on obol stack down and obol stack purge that lists live ServiceOffers + sell-inference host gateways before tearing them down.
  • In a TTY the operator gets a [y/N] prompt; in non-interactive shells the call fails closed with an error pointing at --yes.
  • --yes / -y flag added to both subcommands as the explicit non-interactive escape hatch. --force keeps its existing meaning (delete data dir even when root-owned) and does not imply --yes.

Why

inference.v1337.org went dark for ~17h on 2026-05-22. Forensics on spark2 showed the k3d cluster received a manual stop (hasBeenManuallyStopped=true in dockerd log) inside a 22-second session-4154 cgroup that originated from an inbound SSH from 127.0.0.1 — i.e. a script on the box ssh'd to localhost as claude and ran obol stack down (or equivalent) without ever touching a tty. Today there is nothing in the CLI that distinguishes "operator typed this" from "a stale cleanup script targeted the wrong stack". This PR adds that check.

How

  • internal/stack/safety.go::ConfirmRunningServicesLoss(cfg, u, action, skipConfirm)
    • Cluster side: kubectl get serviceoffers.obol.org -A -o json, filter to PaymentGateReady=True AND RoutePublished=True (deliberately not requiring Ready=True, because Registered stays False for unregistered offers indefinitely — gating on Ready would let the prompt skip real production offers like aeon).
    • Host side: walk <StateDir>/sell-inference/*/gateway.pid, keep the ones whose PID is alive (syscall.Signal(0)).
  • stack.Down(cfg, u, skipConfirm) and stack.Purge(cfg, u, force, skipConfirm) call the gate first.
  • The internal stack up cleanup path (helmfile-sync failure auto-rollback) passes skipConfirm=true — the operator didn't trigger that call.
  • cmd/obol/main.go: adds --yes / -y on down and purge, threads it through.

Test plan

  • go build ./...
  • go test ./internal/stack/ ./cmd/obol/ ./internal/tunnel/ -count=1 — all passing except a pre-existing failure (TestWarnIfNoChatModel_EmitsWarnWhenNoModels) that reproduces on main without this change
  • New tests cover: empty snapshot passes through silently, live gateway in non-interactive mode returns --yes-mentioning error and proceed=false, --yes passes through with a warn line, dead PID files are ignored, gateReady requires both conditions True, priceSummary covers per-request/per-MTok/per-hour/empty
  • go run ./cmd/obol stack down --help and stack purge --help render the new --yes flag with usage copy
  • Not yet run: interactive smoke test against a running stack. spark2 was the obvious candidate, but I didn't want to gate this PR on cycling a prod cluster.

Notes for reviewers

  • Safety gate runs before the existing tunnel.ConfirmQuickTunnelLoss gate, since live offers are a bigger blast radius than a quick-tunnel URL change.
  • Default answer is No (opposite of ConfirmQuickTunnelLoss's default-Yes), since the destructive action is also bigger.
  • internal/stack/safety.go defines a local rawOffer struct rather than importing monetizeapi — keeps the safety package free of CRD typing for the same reasons as the rest of internal/stack.

Add a safety gate to `obol stack down` and `obol stack purge` that
inspects in-cluster ServiceOffers (PaymentGateReady=True AND
RoutePublished=True) and host-side sell-inference gateways with alive
PID files, then prompts before tearing the stack down. In non-
interactive shells the gate fails closed unless --yes is passed.

This closes the silent-down vector that took inference.v1337.org
offline for 17 hours: a non-interactive `ssh host '<obol stack down>'`
from a stale worktree's script wiped the production stack with no
operator confirmation and no audit trail of the running services.

- internal/stack/safety.go: ConfirmRunningServicesLoss + discovery
- stack.Down(cfg, u, skipConfirm), stack.Purge(cfg, u, force, skipConfirm)
- cmd/obol: --yes / -y flag on `stack down` and `stack purge`
- internal stack-up cleanup passes skipConfirm=true (system-initiated)
- --force keeps its existing meaning (delete data dir); --yes is the
  new escape hatch for the safety prompt only