Track runtime ownership to prevent cross-runtime status corruption#4434

Open
JAORMX wants to merge 3 commits into main from worktree-fix+runtime-ownership-status

Conversation

Collaborator

@JAORMX JAORMX commented Mar 30, 2026

Summary

When multiple runtimes coexist (e.g., podman and an experimental runtime like go-microvm), running any CLI command (thv list, thv status, thv stop, thv restart) with one runtime active permanently corrupts the status files of workloads managed by the other runtime, marking them as "unhealthy." The root cause is that no data structure tracks which runtime owns a workload, and the reconciliation logic in fileStatusManager assumes all workloads belong to the single active runtime.

This PR adds runtime ownership tracking so that reconciliation skips workloads belonging to a different runtime:

  • Add Name() string to the Runtime interface so the active runtime can identify itself
  • Add RuntimeName field to RunConfig (persisted to disk) to record which runtime created a workload
  • Guard handleRuntimeMissing() and validateRunningWorkload() in the file status manager to skip workloads owned by a different runtime
  • Stamp RuntimeName at all workload creation entry points (CLI, API, MCP server)
  • Opportunistically migrate legacy workloads (those created before this change) by stamping them with the active runtime's name when reconciliation confirms they are healthy
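
Taken together, the bullets above amount to a small interface change plus a guard. A minimal sketch of the idea follows; the `Runtime` interface, `RunConfig`, and `isOwnedByActiveRuntime` here are simplified stand-ins that mirror the names in this PR, not ToolHive's actual types:

```go
package main

import "fmt"

// Runtime is a simplified stand-in for ToolHive's container runtime
// interface, extended with the Name() method this PR adds.
type Runtime interface {
	Name() string
}

type dockerRuntime struct{}

func (dockerRuntime) Name() string { return "docker" }

// RunConfig stands in for the persisted per-workload config; RuntimeName
// records which runtime created the workload.
type RunConfig struct {
	RuntimeName string `json:"runtime_name,omitempty"`
}

// isOwnedByActiveRuntime mirrors the guard described above: an empty
// RuntimeName (legacy workload) is treated as owned by the active runtime.
func isOwnedByActiveRuntime(active Runtime, cfg RunConfig) bool {
	if cfg.RuntimeName == "" {
		return true // legacy workload: assume ours
	}
	return cfg.RuntimeName == active.Name()
}

func main() {
	active := dockerRuntime{}
	for _, cfg := range []RunConfig{
		{RuntimeName: "docker"},     // ours: reconcile normally
		{RuntimeName: "go-microvm"}, // other runtime: skip reconciliation
		{RuntimeName: ""},           // legacy: assume ours
	} {
		fmt.Printf("%q owned=%v\n", cfg.RuntimeName, isOwnedByActiveRuntime(active, cfg))
	}
}
```

Reconciliation (`handleRuntimeMissing`, `validateRunningWorkload`) consults this guard and returns early for workloads it does not own, so it never writes an "unhealthy" status for them.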

Fixes #4432

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Changes

| File | Change |
| --- | --- |
| pkg/container/runtime/types.go | Add `Name() string` to `Runtime` interface |
| pkg/container/docker/client.go | Implement `Name()` returning `"docker"` |
| pkg/container/kubernetes/client.go | Implement `Name()` returning `"kubernetes"` |
| pkg/runner/config.go | Add `RuntimeName` field to `RunConfig` |
| cmd/thv/app/run.go | Set `RuntimeName` at both CLI save sites |
| pkg/api/v1/workload_service.go | Set `RuntimeName` in API create path |
| pkg/mcp/server/run_server.go | Set `RuntimeName` in MCP server path |
| pkg/workloads/statuses/file_status.go | Add `isOwnedByActiveRuntime` helper; guard `validateRunningWorkload` and `handleRuntimeMissing`; add `migrateRuntimeName` for legacy workloads |

Does this introduce a user-facing change?

Workloads managed by one runtime are no longer incorrectly marked as "unhealthy" when a different runtime is active. Existing workloads without the new runtime_name field are migrated automatically: the first time reconciliation confirms they are healthy, they are stamped with the active runtime's name.

Special notes for reviewers

  • Backwards compatibility: RuntimeName uses omitempty, so old RunConfigs deserialize with "". The isOwnedByActiveRuntime helper treats empty as "assume ours" to preserve existing health-check behavior for single-runtime users. The migrateRuntimeName function then progressively stamps legacy workloads, so the empty-string fallback is a transient state rather than permanent.
  • Migration safety: migrateRuntimeName only stamps a workload after the runtime confirms it is actually healthy (container running, proxy OK). We never blindly claim ownership of a workload we haven't verified.
  • go-microvm runtime is not yet on main, so it does not implement Name() here. It will need to add that method when it merges.
  • There is an inefficiency where GetReader is called multiple times per workload during reconciliation (isRemoteWorkload, isOwnedByActiveRuntime, migrateRuntimeName). Tracked in #4456 ("RunConfig is read from disk multiple times per workload during reconciliation").
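
The backwards-compatibility note above hinges on how `omitempty` behaves on (de)serialization: a config written before this PR simply has no runtime_name key, and unmarshalling it yields the empty string. A small sketch, with `RunConfig` reduced to two fields and `loadConfig` as a hypothetical helper rather than ToolHive's API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunConfig is a minimal stand-in for the persisted config; only the
// relevant fields are shown. omitempty keeps runtime_name out of the
// JSON entirely while it is empty.
type RunConfig struct {
	Image       string `json:"image"`
	RuntimeName string `json:"runtime_name,omitempty"`
}

// loadConfig is a hypothetical helper standing in for reading a saved
// RunConfig from disk.
func loadConfig(data []byte) (RunConfig, error) {
	var cfg RunConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	// A RunConfig written before this PR has no runtime_name key at all;
	// it deserializes with RuntimeName == "", the "assume ours" case.
	cfg, err := loadConfig([]byte(`{"image":"ghcr.io/example/server:latest"}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("legacy RuntimeName: %q\n", cfg.RuntimeName)

	// Once stamped, the field is persisted on the next save.
	cfg.RuntimeName = "docker"
	out, _ := json.Marshal(cfg)
	fmt.Println(string(out))
}
```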

Generated with Claude Code

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 30, 2026

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 71.18644% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.62%. Comparing base (11cd266) to head (6841de3).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/workloads/statuses/file_status.go | 73.46% | 7 Missing and 6 partials ⚠️ |
| pkg/container/kubernetes/client.go | 0.00% | 2 Missing ⚠️ |
| pkg/mcp/server/run_server.go | 60.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4434      +/-   ##
==========================================
+ Coverage   69.57%   69.62%   +0.04%     
==========================================
  Files         489      489              
  Lines       50193    50251      +58     
==========================================
+ Hits        34920    34985      +65     
+ Misses      12590    12574      -16     
- Partials     2683     2692       +9     

Contributor

@amirejaz amirejaz left a comment


Review: Track runtime ownership to prevent cross-runtime status corruption

The approach is correct and well-structured — RuntimeName in RunConfig is the right place to record ownership, the isOwnedByActiveRuntime guard is clean, and the backwards-compatibility intent is understood. The comments below are a mix of one correctness concern and follow-up UX issues that the new RuntimeName field now makes tractable.

Overall: the correctness concern in pkg/workloads/statuses/file_status.go (isOwnedByActiveRuntime, line 103) should be addressed before merging; the rest are follow-up issues.

Follow-up issues worth tracking

Stale status in thv list for cross-runtime workloads
When Docker is active and a go-microvm workload is in file state as running, handleRuntimeMissing now correctly skips reconciliation — but the workload still appears as running in thv list with no indication it belongs to a different runtime. RuntimeName is now persisted and could be surfaced as a RUNTIME column in thv list.

thv stop/restart on a cross-runtime workload silently succeeds
stopContainerWorkload in pkg/workloads/manager.go treats ErrWorkloadNotFound as a non-error — logs a warning, returns nil, and never updates the status file. So the workload stays running from the user's perspective with no error surfaced. Combined with the stale-status issue above, a user can: (1) see foo as running, (2) run thv stop foo — no error, (3) see foo still as running. Now that RuntimeName is available in RunConfig, stopSingleWorkload could detect the mismatch early and return a clear error like: "workload 'foo' was created with the 'go-microvm' runtime; switch to that runtime to manage it."
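
The early mismatch detection suggested here could look roughly like the following; `checkRuntimeOwnership`, `errRuntimeMismatch`, and the error wording are hypothetical illustrations, not existing ToolHive code:

```go
package main

import (
	"errors"
	"fmt"
)

// RunConfig stands in for the persisted per-workload config.
type RunConfig struct {
	RuntimeName string
}

var errRuntimeMismatch = errors.New("workload is managed by a different runtime")

// checkRuntimeOwnership sketches the guard a stopSingleWorkload-style path
// could run before acting: compare the workload's recorded runtime with the
// active one and fail loudly on a mismatch instead of silently succeeding.
func checkRuntimeOwnership(name string, cfg RunConfig, activeRuntime string) error {
	if cfg.RuntimeName != "" && cfg.RuntimeName != activeRuntime {
		return fmt.Errorf("%w: workload %q was created with the %q runtime; switch to that runtime to manage it",
			errRuntimeMismatch, name, cfg.RuntimeName)
	}
	return nil
}

func main() {
	// With Docker active, stopping a go-microvm workload now surfaces an
	// error rather than a silent no-op.
	err := checkRuntimeOwnership("foo", RunConfig{RuntimeName: "go-microvm"}, "docker")
	fmt.Println(err)
}
```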

}

// Empty means legacy workload; treat as owned by active runtime.
if config.RuntimeName == "" {
Contributor

Correctness concern: the legacy-workload default preserves the exact bug this PR is fixing.

Any workload created before this PR has RuntimeName = "". The empty → assume ours default means: if a user has a go-microvm workload (no RuntimeName) and switches to Docker, isOwnedByActiveRuntime returns true, handleRuntimeMissing is reached, Docker doesn't know the workload, and it gets marked unhealthy — exactly the bug in #4432.

The PR description calls this "preserving current behavior until they are restarted", but the current behavior IS the bug. The conservative default is only safe for single-runtime users, which is precisely not the scenario this PR targets.

A safer default for legacy workloads would be to return false (skip reconciliation) rather than true (assume ours). The trade-off: a genuinely broken legacy workload whose container is missing would no longer be auto-marked unhealthy until it is restarted with a newer ToolHive version. That seems like an acceptable temporary regression compared to continuing to corrupt cross-runtime status files for all existing workloads.

Collaborator Author

Good catch, and I agree this is the most important thing to get right here.

So, the concern about flipping to false is that it breaks health checks for the 100% of current users who are on a single runtime. Nobody is using the experimental go-microvm runtime yet; it's not even on main. Defaulting to "skip reconciliation" means a genuinely broken single-runtime workload just silently stays marked as running when it shouldn't. That's a real regression for everyone today versus a theoretical fix for a scenario that doesn't exist yet.

That said, I did add a migration in the follow-up commit that addresses the root of your concern. migrateRuntimeName now stamps legacy workloads with the active runtime's name, but only after reconciliation confirms the workload is actually healthy (container running, proxy OK). So the empty-string case gets progressively eliminated as workloads are validated. By the time anyone actually uses multi-runtime, their workloads will already be stamped.
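
The migration logic described in this reply can be sketched as follows. The real migrateRuntimeName works off the status manager and GetReader, so the `healthy` flag and `save` callback here are simplified stand-ins:

```go
package main

import "fmt"

// RunConfig stands in for the persisted per-workload config.
type RunConfig struct {
	RuntimeName string
}

// migrateRuntimeName sketches the opportunistic migration: a legacy workload
// (empty RuntimeName) is stamped with the active runtime's name only once
// reconciliation has confirmed it is healthy (container running, proxy OK).
func migrateRuntimeName(cfg *RunConfig, activeRuntime string, healthy bool, save func(*RunConfig) error) error {
	if cfg.RuntimeName != "" {
		return nil // already stamped; nothing to migrate
	}
	if !healthy {
		return nil // never claim ownership of an unverified workload
	}
	cfg.RuntimeName = activeRuntime
	return save(cfg)
}

func main() {
	cfg := RunConfig{} // legacy workload, no runtime recorded
	saved := false
	_ = migrateRuntimeName(&cfg, "docker", true, func(*RunConfig) error {
		saved = true // persist the stamped config back to disk
		return nil
	})
	fmt.Printf("RuntimeName=%q saved=%v\n", cfg.RuntimeName, saved)
}
```

The key property is that the unhealthy and already-stamped paths are no-ops, so the migration can run on every reconciliation pass without risk of claiming another runtime's workload.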


// Skip reconciliation for workloads owned by a different runtime to avoid
// corrupting their status files (see #4432).
if !f.isOwnedByActiveRuntime(ctx, workloadName) {
Contributor

Nit: the fileWorkload.Remote guard above (line 999) makes this isOwnedByActiveRuntime check unreachable for remote workloads — they already exit before reaching here.

This is correct behavior (remote workloads belong to no runtime), but the ordering makes the intent implicit. A short comment on the Remote guard like // Remote workloads bypass runtime-ownership checks — they have no owning runtime would make it clear this is intentional and not a gap in the cross-runtime protection.

Collaborator Author

Done! Added a comment that explains both why the early return exists and that it intentionally bypasses the runtime-ownership check below.

// RunConfig and compares it against f.runtime.Name(). Legacy workloads that
// predate the runtime_name field (empty string) are conservatively treated as
// owned by the active runtime so that existing validation behaviour is preserved.
func (f *fileStatusManager) isOwnedByActiveRuntime(ctx context.Context, workloadName string) bool {
Contributor

Follow-up: double GetReader per workload during list.

For each workload, GetReader is called once in isRemoteWorkload and again here in isOwnedByActiveRuntime. Both involve file I/O and happen sequentially per workload. For users with many workloads this adds up. Already noted in the PR special notes — worth a tracked follow-up issue so it doesn't get lost.

Collaborator Author

Filed #4456 to track this. Note that with the migration commit it's actually 3 reads per legacy workload now (isRemoteWorkload + isOwnedByActiveRuntime + migrateRuntimeName), though that drops to 2 once migration is done. Worth cleaning up properly.
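
One possible shape for the #4456 cleanup is a per-reconciliation-pass memoized read, so each workload's RunConfig is loaded from disk once and reused by all three checks. This cache type and its `load` callback are hypothetical, not part of this PR:

```go
package main

import "fmt"

// RunConfig stands in for the persisted per-workload config.
type RunConfig struct {
	RuntimeName string
}

// configCache memoizes RunConfig reads for the duration of one
// reconciliation pass. load stands in for the GetReader-based disk read.
type configCache struct {
	load  func(name string) (RunConfig, error)
	cache map[string]RunConfig
}

func newConfigCache(load func(string) (RunConfig, error)) *configCache {
	return &configCache{load: load, cache: make(map[string]RunConfig)}
}

func (c *configCache) get(name string) (RunConfig, error) {
	if cfg, ok := c.cache[name]; ok {
		return cfg, nil // served from memory, no disk I/O
	}
	cfg, err := c.load(name)
	if err != nil {
		return RunConfig{}, err
	}
	c.cache[name] = cfg
	return cfg, nil
}

func main() {
	reads := 0
	cc := newConfigCache(func(string) (RunConfig, error) {
		reads++ // counts actual disk reads
		return RunConfig{RuntimeName: "docker"}, nil
	})
	cc.get("foo") // isRemoteWorkload
	cc.get("foo") // isOwnedByActiveRuntime
	cc.get("foo") // migrateRuntimeName
	fmt.Println("disk reads:", reads)
}
```

A cache scoped to a single pass avoids staleness concerns: it is discarded when the pass ends, so later passes still observe on-disk changes.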

// RuntimeName is the registered name of the container runtime that owns this
// workload (e.g., "docker", "kubernetes"). Used during reconciliation to avoid
// corrupting status files of workloads managed by a different runtime.
RuntimeName string `json:"runtime_name,omitempty" yaml:"runtime_name,omitempty"`
Contributor

Note: existing workloads are never auto-migrated.

RuntimeName is stamped only at creation time (SaveState). Existing workloads on disk have RuntimeName = "" and will never be updated unless the workload is deleted and recreated. This is fine operationally, but means the cross-runtime protection only applies to workloads created after this PR is deployed — which ties back to the correctness concern in isOwnedByActiveRuntime.

Collaborator Author

Addressed this in the follow-up commit. Legacy workloads now get their RuntimeName stamped opportunistically during reconciliation, but only when the runtime confirms the workload is healthy. So we're not blindly claiming ownership; we only stamp when we have strong evidence the workload actually belongs to the active runtime. The empty-string fallback in isOwnedByActiveRuntime stays as a safety net, but it should now be a transient state rather than a permanent one.

JAORMX and others added 2 commits March 31, 2026 09:53
When multiple runtimes coexist (e.g., podman and go-microvm), running
CLI commands with one runtime active permanently corrupts the status
files of workloads managed by the other runtime, marking them as
"unhealthy." This happens because the reconciliation logic assumes all
workloads belong to the single active runtime.

Add a RuntimeName field to RunConfig that records which runtime created
a workload, and a Name() method to the Runtime interface so the active
runtime can identify itself. During reconciliation, skip workloads whose
owning runtime differs from the active one. Legacy workloads without a
RuntimeName are conservatively treated as owned by the active runtime.

Fixes #4432

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iliation

Legacy workloads created before runtime ownership tracking have an empty
RuntimeName field. Rather than leaving them permanently in the legacy
state, opportunistically stamp them with the active runtime's name when
reconciliation confirms the workload is healthy (container running,
proxy OK). This ensures the cross-runtime protection from the previous
commit covers existing workloads progressively.

Also adds a clarifying comment on the Remote guard in handleRuntimeMissing
per reviewer feedback, and improves diagnostics with a DEBUG log when
GetReader fails during migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX force-pushed the worktree-fix+runtime-ownership-status branch from c26952a to 6841de3 Compare March 31, 2026 07:17

Labels

size/M Medium PR: 300-599 lines changed


Development

Successfully merging this pull request may close these issues.

Single-runtime assumption corrupts status files of workloads managed by other runtimes
