Track runtime ownership to prevent cross-runtime status corruption#4434
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main    #4434      +/-   ##
==========================================
+ Coverage   69.57%   69.62%   +0.04%
==========================================
  Files         489      489
  Lines       50193    50251      +58
==========================================
+ Hits        34920    34985      +65
+ Misses      12590    12574      -16
- Partials     2683     2692       +9
```

View full report in Codecov by Sentry.
**amirejaz** left a comment
Review: Track runtime ownership to prevent cross-runtime status corruption
The approach is correct and well-structured — RuntimeName in RunConfig is the right place to record ownership, the isOwnedByActiveRuntime guard is clean, and the backwards-compatibility intent is understood. The comments below are a mix of one correctness concern and follow-up UX issues that the new RuntimeName field now makes tractable.
Overall: the correctness concern in pkg/workloads/statuses/file_status.go (isOwnedByActiveRuntime, line 103) should be addressed before merging; the rest are follow-up issues.
Follow-up issues worth tracking
Stale status in thv list for cross-runtime workloads
When Docker is active and a go-microvm workload is in file state as running, handleRuntimeMissing now correctly skips reconciliation — but the workload still appears as running in thv list with no indication it belongs to a different runtime. RuntimeName is now persisted and could be surfaced as a RUNTIME column in thv list.
thv stop/restart on a cross-runtime workload silently succeeds
stopContainerWorkload in pkg/workloads/manager.go treats ErrWorkloadNotFound as a non-error — logs a warning, returns nil, and never updates the status file. So the workload stays running from the user's perspective with no error surfaced. Combined with the stale-status issue above, a user can: (1) see foo as running, (2) run thv stop foo — no error, (3) see foo still as running. Now that RuntimeName is available in RunConfig, stopSingleWorkload could detect the mismatch early and return a clear error like: "workload 'foo' was created with the 'go-microvm' runtime; switch to that runtime to manage it."
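A minimal sketch of such an early guard follows. The `checkRuntimeOwnership` function and the reduced `RunConfig` are hypothetical illustrations of the suggested check, not the actual manager code; the error wording is taken from the suggestion above.

```go
package main

import "fmt"

// RunConfig is reduced here to the one field this sketch needs; the real
// RunConfig in pkg/runner/config.go has many more fields.
type RunConfig struct {
	RuntimeName string `json:"runtime_name,omitempty"`
}

// checkRuntimeOwnership is a hypothetical early guard for the stop path:
// fail fast with an actionable error when the workload was created by a
// different runtime, instead of silently succeeding.
func checkRuntimeOwnership(name string, cfg RunConfig, activeRuntime string) error {
	// Legacy workloads (empty RuntimeName) fall through to the existing path.
	if cfg.RuntimeName == "" || cfg.RuntimeName == activeRuntime {
		return nil
	}
	return fmt.Errorf("workload %q was created with the %q runtime; switch to that runtime to manage it",
		name, cfg.RuntimeName)
}

func main() {
	if err := checkRuntimeOwnership("foo", RunConfig{RuntimeName: "go-microvm"}, "docker"); err != nil {
		fmt.Println("error:", err)
	}
}
```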
```go
// Empty means legacy workload; treat as owned by active runtime.
if config.RuntimeName == "" {
```
Correctness concern: the legacy-workload default preserves the exact bug this PR is fixing.
Any workload created before this PR has RuntimeName = "". The empty → assume ours default means: if a user has a go-microvm workload (no RuntimeName) and switches to Docker, isOwnedByActiveRuntime returns true, handleRuntimeMissing is reached, Docker doesn't know the workload, and it gets marked unhealthy — exactly the bug in #4432.
The PR description calls this "preserving current behavior until they are restarted", but the current behavior IS the bug. The conservative default is only safe for single-runtime users, which is precisely not the scenario this PR targets.
A safer default for legacy workloads would be to return false (skip reconciliation) rather than true (assume ours). The trade-off: a genuinely broken legacy workload whose container is missing would no longer be auto-marked unhealthy until it is restarted with a newer ToolHive version. That seems like an acceptable temporary regression compared to continuing to corrupt cross-runtime status files for all existing workloads.
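The proposed change amounts to flipping the legacy branch of the helper. The sketch below simplifies it to a pure function over an already-loaded runtime name (the real helper reads the RunConfig from disk and takes a context):

```go
package main

import "fmt"

// isOwnedByActiveRuntime sketches the reviewer's proposed default: an empty
// RuntimeName (legacy workload) returns false, so reconciliation is skipped
// rather than risking cross-runtime status corruption.
func isOwnedByActiveRuntime(runtimeName, activeRuntime string) bool {
	if runtimeName == "" {
		return false // legacy: ownership unknown, skip reconciliation
	}
	return runtimeName == activeRuntime
}

func main() {
	fmt.Println(isOwnedByActiveRuntime("", "docker"))           // legacy: skip
	fmt.Println(isOwnedByActiveRuntime("docker", "docker"))     // owned
	fmt.Println(isOwnedByActiveRuntime("go-microvm", "docker")) // other runtime
}
```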
Good catch, and I agree this is the most important thing to get right here.
So, the concern about flipping to false is that it breaks health checks for the 100% of current users who are on a single runtime. Nobody is using the experimental go-microvm runtime yet... it's not even on main. Defaulting to "skip reconciliation" means a genuinely broken single-runtime workload just silently stays marked as running when it shouldn't be. That's a real regression for everyone today vs. a theoretical fix for a scenario that doesn't exist yet.
That said, I did add a migration in the follow-up commit that addresses the root of your concern. migrateRuntimeName now stamps legacy workloads with the active runtime's name, but only after reconciliation confirms the workload is actually healthy (container running, proxy OK). So the empty-string case gets progressively eliminated as workloads are validated. By the time anyone actually uses multi-runtime, their workloads will already be stamped.
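The stamping logic described above might look roughly like this. The function shape and the `healthy` flag are illustrative; the real `migrateRuntimeName` derives health from the container and proxy checks and persists the stamped config via the status manager.

```go
package main

import "fmt"

// RunConfig is reduced to the field this sketch needs.
type RunConfig struct {
	RuntimeName string `json:"runtime_name,omitempty"`
}

// migrateRuntimeName sketches the opportunistic migration: stamp a legacy
// workload with the active runtime's name only after reconciliation has
// confirmed the workload is healthy. Returns true when a stamp was applied.
func migrateRuntimeName(cfg *RunConfig, activeRuntime string, healthy bool) bool {
	if cfg.RuntimeName != "" {
		return false // already stamped, nothing to migrate
	}
	if !healthy {
		return false // no evidence the active runtime owns this workload
	}
	cfg.RuntimeName = activeRuntime // the real code persists this to disk
	return true
}

func main() {
	cfg := RunConfig{}
	stamped := migrateRuntimeName(&cfg, "docker", true)
	fmt.Println(stamped, cfg.RuntimeName)
}
```

The key property is that the empty-string state is only ever replaced when health has been verified, so ownership is never claimed blindly.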
```go
// Skip reconciliation for workloads owned by a different runtime to avoid
// corrupting their status files (see #4432).
if !f.isOwnedByActiveRuntime(ctx, workloadName) {
```
Nit: the fileWorkload.Remote guard above (line 999) makes this isOwnedByActiveRuntime check unreachable for remote workloads — they already exit before reaching here.
This is correct behavior (remote workloads belong to no runtime), but the ordering makes the intent implicit. A short comment on the Remote guard like // Remote workloads bypass runtime-ownership checks — they have no owning runtime would make it clear this is intentional and not a gap in the cross-runtime protection.
Done! Added a comment that explains both why the early return exists and that it intentionally bypasses the runtime-ownership check below.
```go
// RunConfig and compares it against f.runtime.Name(). Legacy workloads that
// predate the runtime_name field (empty string) are conservatively treated as
// owned by the active runtime so that existing validation behaviour is preserved.
func (f *fileStatusManager) isOwnedByActiveRuntime(ctx context.Context, workloadName string) bool {
```
Follow-up: double GetReader per workload during list.
For each workload, GetReader is called once in isRemoteWorkload and again here in isOwnedByActiveRuntime. Both involve file I/O and happen sequentially per workload. For users with many workloads this adds up. Already noted in the PR special notes — worth a tracked follow-up issue so it doesn't get lost.
Filed #4456 to track this. Note that with the migration commit it's actually 3 reads per legacy workload now (isRemoteWorkload + isOwnedByActiveRuntime + migrateRuntimeName), though that drops to 2 once migration is done. Worth cleaning up properly.
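One way back to a single read per workload is a small per-pass cache shared by the callers. In this sketch the loader function and `cachedConfig` type are stand-ins for `GetReader` plus JSON decoding; the cache type and its API are hypothetical, not existing ToolHive code.

```go
package main

import "fmt"

// cachedConfig stands in for the decoded RunConfig fields reconciliation needs.
type cachedConfig struct {
	RuntimeName string
	Remote      bool
}

// runConfigCache loads each workload's config at most once per reconciliation
// pass, so isRemoteWorkload, isOwnedByActiveRuntime, and migrateRuntimeName
// can share one disk read instead of issuing three.
type runConfigCache struct {
	load  func(name string) (cachedConfig, error) // stand-in for GetReader + decode
	cache map[string]cachedConfig
	Reads int // actual loads performed, for illustration
}

func newRunConfigCache(load func(string) (cachedConfig, error)) *runConfigCache {
	return &runConfigCache{load: load, cache: map[string]cachedConfig{}}
}

func (c *runConfigCache) get(name string) (cachedConfig, error) {
	if cfg, ok := c.cache[name]; ok {
		return cfg, nil // served from cache, no disk I/O
	}
	cfg, err := c.load(name)
	if err != nil {
		return cachedConfig{}, err
	}
	c.Reads++
	c.cache[name] = cfg
	return cfg, nil
}

func main() {
	c := newRunConfigCache(func(string) (cachedConfig, error) {
		return cachedConfig{RuntimeName: "docker"}, nil
	})
	c.get("foo") // first call hits the loader
	c.get("foo") // subsequent calls are cache hits
	c.get("foo")
	fmt.Println("loads:", c.Reads)
}
```

A cache scoped to a single list/reconciliation pass avoids staleness concerns, since the file is not expected to change mid-pass.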
```go
// RuntimeName is the registered name of the container runtime that owns this
// workload (e.g., "docker", "kubernetes"). Used during reconciliation to avoid
// corrupting status files of workloads managed by a different runtime.
RuntimeName string `json:"runtime_name,omitempty" yaml:"runtime_name,omitempty"`
```
Note: existing workloads are never auto-migrated.
RuntimeName is stamped only at creation time (SaveState). Existing workloads on disk have RuntimeName = "" and will never be updated unless the workload is deleted and recreated. This is fine operationally, but means the cross-runtime protection only applies to workloads created after this PR is deployed — which ties back to the correctness concern in isOwnedByActiveRuntime.
There was a problem hiding this comment.
Addressed this in the follow-up commit. Legacy workloads now get their RuntimeName stamped opportunistically during reconciliation, but only when the runtime confirms the workload is healthy. So we're not blindly claiming ownership... we only stamp when we have strong evidence the workload actually belongs to the active runtime. The empty-string fallback in isOwnedByActiveRuntime stays as a safety net, but it should be a transient state now rather than permanent.
When multiple runtimes coexist (e.g., podman and go-microvm), running CLI commands with one runtime active permanently corrupts the status files of workloads managed by the other runtime, marking them as "unhealthy." This happens because the reconciliation logic assumes all workloads belong to the single active runtime.

Add a RuntimeName field to RunConfig that records which runtime created a workload, and a Name() method to the Runtime interface so the active runtime can identify itself. During reconciliation, skip workloads whose owning runtime differs from the active one. Legacy workloads without a RuntimeName are conservatively treated as owned by the active runtime.

Fixes #4432

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iliation

Legacy workloads created before runtime ownership tracking have an empty RuntimeName field. Rather than leaving them permanently in the legacy state, opportunistically stamp them with the active runtime's name when reconciliation confirms the workload is healthy (container running, proxy OK). This ensures the cross-runtime protection from the previous commit covers existing workloads progressively.

Also adds a clarifying comment on the Remote guard in handleRuntimeMissing per reviewer feedback, and improves diagnostics with a DEBUG log when GetReader fails during migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from c26952a to 6841de3
Summary
When multiple runtimes coexist (e.g., podman and an experimental runtime like go-microvm), running any CLI command (`thv list`, `thv status`, `thv stop`, `thv restart`) with one runtime active permanently corrupts the status files of workloads managed by the other runtime, marking them as "unhealthy." The root cause is that no data structure tracks which runtime owns a workload, and the reconciliation logic in `fileStatusManager` assumes all workloads belong to the single active runtime.

This PR adds runtime ownership tracking so that reconciliation skips workloads belonging to a different runtime:

- Adds `Name() string` to the `Runtime` interface so the active runtime can identify itself
- Adds a `RuntimeName` field to `RunConfig` (persisted to disk) to record which runtime created a workload
- Guards `handleRuntimeMissing()` and `validateRunningWorkload()` in the file status manager to skip workloads owned by a different runtime
- Stamps `RuntimeName` at all workload creation entry points (CLI, API, MCP server)

Fixes #4432
Type of change

Test plan

- Tests pass (`task test`)
- Lint passes (`task lint-fix`)

Changes

- `pkg/container/runtime/types.go`: add `Name() string` to the `Runtime` interface
- `pkg/container/docker/client.go`: implement `Name()` returning `"docker"`
- `pkg/container/kubernetes/client.go`: implement `Name()` returning `"kubernetes"`
- `pkg/runner/config.go`: add `RuntimeName` field to `RunConfig`
- `cmd/thv/app/run.go`: stamp `RuntimeName` at both CLI save sites
- `pkg/api/v1/workload_service.go`: stamp `RuntimeName` in the API create path
- `pkg/mcp/server/run_server.go`: stamp `RuntimeName` in the MCP server path
- `pkg/workloads/statuses/file_status.go`: add the `isOwnedByActiveRuntime` helper; guard `validateRunningWorkload` and `handleRuntimeMissing`; add `migrateRuntimeName` for legacy workloads

Does this introduce a user-facing change?
Workloads managed by one runtime are no longer incorrectly marked as "unhealthy" when a different runtime is active. Existing workloads without the new `runtime_name` field are migrated automatically: the first time reconciliation confirms they are healthy, they are stamped with the active runtime's name.

Special notes for reviewers

- `RuntimeName` uses `omitempty`, so old RunConfigs deserialize with `""`. The `isOwnedByActiveRuntime` helper treats empty as "assume ours" to preserve existing health-check behavior for single-runtime users. The `migrateRuntimeName` function then progressively stamps legacy workloads, so the empty-string fallback is a transient state rather than permanent.
- `migrateRuntimeName` only stamps a workload after the runtime confirms it is actually healthy (container running, proxy OK). We never blindly claim ownership of a workload we haven't verified.
- The experimental go-microvm runtime is not on main yet, so it does not implement `Name()` here. It will need to add that method when it merges.
- `GetReader` is called multiple times per workload during reconciliation (`isRemoteWorkload`, `isOwnedByActiveRuntime`, `migrateRuntimeName`). Tracked in "RunConfig is read from disk multiple times per workload during reconciliation" (#4456).

Generated with Claude Code