ateapi/syncer: release actor when host pod is deleted #75
Open
Davanum Srinivas (dims) wants to merge 1 commit into
Davanum Srinivas (dims) added a commit to dims/openshell-driver-substrate that referenced this pull request May 24, 2026
Six-beat OpenShell-on-Substrate scenario: cold ask, suspend, idle, follow-up with memory preserved, exfil deny, and pod-kill migration. Verified end-to-end on a kind cluster running substrate `main` plus `agent-substrate/substrate#75` (`ateapi/syncer: release actor when host pod is deleted`), which closes the gap behind Beat 6.

The example reuses `tests/integration/build-image.sh` for the supervisor image; the helpdesk-specific files (Python agent, OPA policy data, OpenShell route config, substrate ActorTemplate, thin derivative Dockerfile, six-beat driver script) live under `examples/helpdesk/`. `README.md` is self-contained: prereqs, quick-start, expected output, troubleshooting, cleanup.

`routes.yaml` ships as a template carrying a `<your-ollama-cloud-key>` placeholder; operators stage `routes.local.yaml` with a real Ollama Cloud free-tier key. `*.local.yaml` is gitignored at the example root.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
Davanum Srinivas (dims) added a commit to dims/openshell-driver-substrate that referenced this pull request May 24, 2026
Both the top-level README and docs/poc-intro.md were dated to before the M3 wiring landed and before the driver became load-bearing in a real gateway. Update both to:

- Note M3.14 + M3.16 commits on dims/OpenShell@chore/gvisor-degraded-netns as the gateway-side wiring that makes the crate load-bearing.
- Surface the driver-driven 10-beat helpdesk demo at examples/helpdesk/ as the canonical integration showcase.
- Add agent-substrate/substrate#75 (actor migration on pod loss) to the companion-change list in poc-intro.md.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
Davanum Srinivas (dims) added a commit to dims/openshell-driver-substrate that referenced this pull request May 24, 2026
The base README only mentioned the helpdesk example in passing inside the "What's in the box" table, and never linked to docs/poc-intro.md at all. Both are the actual entry points for newcomers: the joint architecture overview and the demo walkthrough.

Add a "Read first" block right under the title that names both with one-line summaries, so a teammate landing on the repo can pick the right entry point without scanning past the prerequisites.

Also add agent-substrate/substrate#75 (actor migration on pod loss) to the Companion Changes table; it was missing despite being a prerequisite for the helpdesk demo's pod-kill migration beat.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
94ff296 to 314c39d
Today, if a worker pod that hosts an active actor is forcibly
destroyed, substrate does not migrate the actor. `Actor.AteomPodName`
still points at the dead pod and the router times out forwarding to a
dead IP.
The `WorkerPoolSyncer`'s pod-informer DeleteFunc and soft-delete branch
(`syncWorkerToStore` with `DeletionTimestamp != nil`) now check whether
the worker being removed is bound to an actor, and if so, reset that
actor to `STATUS_SUSPENDED` before the worker row goes away.
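
To make the two delete paths concrete, here is a minimal sketch of how both the hard delete (informer `DeleteFunc`) and the soft delete (`DeletionTimestamp != nil`) could funnel into one "release the actor, then drop the worker row" step. Every type and method in it (`Pod`, `Worker`, `Syncer`, `removeWorker`, `onPodDelete`, `deleting`) is a hypothetical stand-in written for illustration; only the name `syncWorkerToStore` comes from this PR, and its body here is an assumption, not the real implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
type Pod struct {
	Name              string
	DeletionTimestamp *time.Time // non-nil once deletion has been requested
}

// Worker is a hypothetical worker row; ActorID is empty when the worker is free.
type Worker struct {
	PodName string
	ActorID string
}

// Syncer is a hypothetical stand-in for the worker-pool syncer.
type Syncer struct {
	workers  map[string]Worker
	released []string // actor IDs handed to the release step, in order
}

// removeWorker releases any bound actor first, then drops the worker row.
func (s *Syncer) removeWorker(podName string) {
	w, ok := s.workers[podName]
	if !ok {
		return // row already gone: a repeated delete event is a no-op
	}
	if w.ActorID != "" {
		s.released = append(s.released, w.ActorID)
	}
	delete(s.workers, podName)
}

// onPodDelete models the informer DeleteFunc (hard delete).
func (s *Syncer) onPodDelete(p Pod) { s.removeWorker(p.Name) }

// syncWorkerToStore models the add/update path, including the soft-delete branch.
func (s *Syncer) syncWorkerToStore(p Pod) {
	if p.DeletionTimestamp != nil {
		s.removeWorker(p.Name)
		return
	}
	if _, ok := s.workers[p.Name]; !ok {
		s.workers[p.Name] = Worker{PodName: p.Name}
	}
}

// deleting builds a pod that is already marked for deletion.
func deleting(name string) Pod {
	now := time.Now()
	return Pod{Name: name, DeletionTimestamp: &now}
}

func main() {
	s := &Syncer{workers: map[string]Worker{
		"pod-a": {PodName: "pod-a", ActorID: "actor-1"},
		"pod-b": {PodName: "pod-b", ActorID: "actor-2"},
	}}
	s.syncWorkerToStore(deleting("pod-a")) // soft delete releases actor-1
	s.onPodDelete(Pod{Name: "pod-a"})      // late hard delete finds no row
	s.onPodDelete(Pod{Name: "pod-b"})      // hard delete releases actor-2
	fmt.Println(s.released, len(s.workers))
}
```

Routing both paths through one function keeps the ordering (actor first, worker row second) in a single place, and makes a late delete after a soft delete harmless.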
The new helper `releaseActorOnDeadWorker` reads the actor, only acts if
the actor still claims this worker (a concurrent SuspendActor /
DeleteActor that already advanced state is respected), clears
`AteomPod{Namespace,Name,Ip}` and `InProgressSnapshot`, preserves
`LastSnapshot` (the previous *successful* checkpoint), and writes via
version-checked `UpdateActor`. On `ErrPersistenceRetry` we drop the
attempt and rely on the next informer event (resync, late delete) to
retry — there is no separate lock or in-handler retry budget.
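
The read, check, clear, and version-checked write sequence described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the `Store`, its in-memory map, the integer `Version` field, the string-typed snapshot fields, and the function signature are all hypothetical. Only the names `releaseActorOnDeadWorker`, `UpdateActor`, `ErrPersistenceRetry`, and the actor field names are taken from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the actor status enum.
type Status int

const (
	StatusRunning Status = iota
	StatusSuspended
)

// Actor is a hypothetical, flattened view of the actor record.
type Actor struct {
	ID                 string
	Version            int64
	Status             Status
	AteomPodNamespace  string
	AteomPodName       string
	AteomPodIP         string
	InProgressSnapshot string
	LastSnapshot       string
}

// ErrPersistenceRetry signals a version conflict (assumed semantics).
var ErrPersistenceRetry = errors.New("persistence: version conflict, retry")

// Store is a hypothetical in-memory store with optimistic concurrency.
type Store struct{ actors map[string]Actor }

func (s *Store) GetActor(id string) (Actor, bool) {
	a, ok := s.actors[id]
	return a, ok
}

// UpdateActor writes only if the caller read the current version.
func (s *Store) UpdateActor(a Actor) error {
	cur, ok := s.actors[a.ID]
	if !ok || cur.Version != a.Version {
		return ErrPersistenceRetry
	}
	a.Version++
	s.actors[a.ID] = a
	return nil
}

// releaseActorOnDeadWorker reports whether it actually released the actor.
func releaseActorOnDeadWorker(s *Store, actorID, ns, pod string) (bool, error) {
	a, ok := s.GetActor(actorID)
	if !ok {
		return false, nil // actor already deleted
	}
	if a.AteomPodNamespace != ns || a.AteomPodName != pod {
		return false, nil // actor no longer claims this worker; respect that
	}
	a.Status = StatusSuspended
	a.AteomPodNamespace, a.AteomPodName, a.AteomPodIP = "", "", ""
	a.InProgressSnapshot = "" // the in-flight checkpoint died with the pod
	// LastSnapshot is deliberately left untouched.
	if err := s.UpdateActor(a); errors.Is(err, ErrPersistenceRetry) {
		return false, nil // drop the attempt; a later informer event retries
	} else if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	s := &Store{actors: map[string]Actor{"a1": {
		ID: "a1", Version: 1, Status: StatusRunning,
		AteomPodNamespace: "ns", AteomPodName: "pod-a", AteomPodIP: "10.0.0.5",
		InProgressSnapshot: "snap-2", LastSnapshot: "snap-1",
	}}}
	released, err := releaseActorOnDeadWorker(s, "a1", "ns", "pod-a")
	a, _ := s.GetActor("a1")
	fmt.Println(released, err, a.Status == StatusSuspended, a.LastSnapshot)
}
```

Because the claim check runs on every call, a second delete event for the same pod is a no-op, which is what lets the handler drop a conflicted write instead of retrying in place.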
`STATUS_SUSPENDED` is reused as the target state rather than
introducing a new value — the post-orphan invariant is identical and
reusing keeps the proto / printer / dialer / router contracts
unchanged. The next request through atenet triggers an implicit
resume; `findFreeWorker` picks any free worker in the pool;
`LastSnapshot` is restored.
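
As a rough illustration of the resume side, the sketch below shows a suspended actor being rebound to a free worker and restored from `LastSnapshot`. Everything here is assumed for the example: the `Worker` and `Actor` shapes, the `Suspended` flag, `resumeIfSuspended`, `ErrNoFreeWorker`, and the first-free selection inside this `findFreeWorker` (the description only says it picks any free worker). It does not reflect how atenet or the real `findFreeWorker` are implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker is a hypothetical worker row; ActorID is empty when the worker is free.
type Worker struct {
	PodName string
	IP      string
	ActorID string
}

// Actor is a hypothetical, trimmed-down actor record.
type Actor struct {
	ID           string
	Suspended    bool
	AteomPodName string
	AteomPodIP   string
	LastSnapshot string
	RestoredFrom string // snapshot used by the most recent resume (illustrative)
}

// ErrNoFreeWorker is a hypothetical error for an exhausted pool.
var ErrNoFreeWorker = errors.New("no free worker in pool")

// findFreeWorker returns the index of the first unbound worker, if any.
func findFreeWorker(pool []Worker) (int, bool) {
	for i, w := range pool {
		if w.ActorID == "" {
			return i, true
		}
	}
	return -1, false
}

// resumeIfSuspended binds a suspended actor to a free worker and records
// that it was restored from its last successful snapshot.
func resumeIfSuspended(a *Actor, pool []Worker) error {
	if !a.Suspended {
		return nil // already bound; nothing to do
	}
	i, ok := findFreeWorker(pool)
	if !ok {
		return ErrNoFreeWorker
	}
	pool[i].ActorID = a.ID
	a.AteomPodName, a.AteomPodIP = pool[i].PodName, pool[i].IP
	a.RestoredFrom = a.LastSnapshot
	a.Suspended = false
	return nil
}

func main() {
	pool := []Worker{
		{PodName: "pod-b", IP: "10.0.0.6", ActorID: "actor-9"},
		{PodName: "pod-c", IP: "10.0.0.7"},
	}
	a := &Actor{ID: "actor-1", Suspended: true, LastSnapshot: "snap-1"}
	err := resumeIfSuspended(a, pool)
	fmt.Println(err, a.AteomPodName, a.RestoredFrom)
}
```

The point of the sketch is that the resume path needs nothing specific to pod loss: an actor released by the syncer looks the same as one suspended deliberately.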
Test plan:
- `TestSyncer_DeleteBoundWorker_ClearsActor` — RUNNING actor → pod
delete → SUSPENDED with cleared bind fields, InProgressSnapshot
dropped, LastSnapshot preserved.
All existing tests in `cmd/ateapi/internal/controlapi/` still pass.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
314c39d to 803304b
Today, if a worker pod that hosts an active actor is forcibly destroyed, substrate does not migrate the actor. The actor's AteomPodName still points at the dead pod and the router times out forwarding to a dead IP.

WorkerPoolSyncer's pod-informer DeleteFunc and soft-delete branch now check whether the worker being removed is bound to an actor, and if so, reset that actor to `STATUS_SUSPENDED` before the worker row goes away.

The new helper releaseActorOnDeadWorker reads the actor, only acts if the actor still claims this worker (a concurrent SuspendActor or DeleteActor that already advanced state is respected), clears the pod-binding fields and InProgressSnapshot, preserves LastSnapshot, and writes via version-checked UpdateActor. On ErrPersistenceRetry we drop the attempt and let the next informer event retry: no separate lock, no in-handler retry budget.

`STATUS_SUSPENDED` is reused rather than introducing a new state. The post-orphan invariant is identical and reusing keeps the proto, printer, dialer, and router contracts unchanged. The next request through atenet triggers an implicit resume; findFreeWorker picks any free worker in the pool; LastSnapshot is restored.

Test plan:

`TestSyncer_DeleteBoundWorker_ClearsActor` covers RUNNING actor → pod delete → SUSPENDED with cleared bind fields, InProgressSnapshot dropped, LastSnapshot preserved. All existing tests in `cmd/ateapi/internal/controlapi/` still pass.