Track runtime ownership to prevent cross-runtime status corruption#4434
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main    #4434      +/-   ##
==========================================
+ Coverage   69.57%   69.62%   +0.04%
==========================================
  Files         489      489
  Lines       50193    50251      +58
==========================================
+ Hits        34920    34985      +65
+ Misses      12590    12574      -16
- Partials     2683     2692       +9
```

View full report in Codecov by Sentry.
**amirejaz** left a comment
Review: Track runtime ownership to prevent cross-runtime status corruption
The approach is correct and well-structured — RuntimeName in RunConfig is the right place to record ownership, the isOwnedByActiveRuntime guard is clean, and the backwards-compatibility intent is understood. The comments below are a mix of one correctness concern and follow-up UX issues that the new RuntimeName field now makes tractable.
Overall: the correctness concern in pkg/workloads/statuses/file_status.go (isOwnedByActiveRuntime, line 103) should be addressed before merging; the rest are follow-up issues.
Follow-up issues worth tracking
Stale status in thv list for cross-runtime workloads
When Docker is active and a go-microvm workload is in file state as running, handleRuntimeMissing now correctly skips reconciliation — but the workload still appears as running in thv list with no indication it belongs to a different runtime. RuntimeName is now persisted and could be surfaced as a RUNTIME column in thv list.
thv stop/restart on a cross-runtime workload silently succeeds
stopContainerWorkload in pkg/workloads/manager.go treats ErrWorkloadNotFound as a non-error — logs a warning, returns nil, and never updates the status file. So the workload stays running from the user's perspective with no error surfaced. Combined with the stale-status issue above, a user can: (1) see foo as running, (2) run thv stop foo — no error, (3) see foo still as running. Now that RuntimeName is available in RunConfig, stopSingleWorkload could detect the mismatch early and return a clear error like: "workload 'foo' was created with the 'go-microvm' runtime; switch to that runtime to manage it."
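A minimal sketch of such an early guard follows. The `checkRuntimeOwnership` function and the reduced `RunConfig` are hypothetical illustrations of the suggested check, not the actual manager code; the error wording is taken from the suggestion above.

```go
package main

import "fmt"

// RunConfig is reduced here to the one field this sketch needs; the real
// RunConfig in pkg/runner/config.go has many more fields.
type RunConfig struct {
	RuntimeName string `json:"runtime_name,omitempty"`
}

// checkRuntimeOwnership is a hypothetical early guard for the stop path:
// fail fast with an actionable error when the workload was created by a
// different runtime, instead of silently succeeding.
func checkRuntimeOwnership(name string, cfg RunConfig, activeRuntime string) error {
	// Legacy workloads (empty RuntimeName) fall through to the existing path.
	if cfg.RuntimeName == "" || cfg.RuntimeName == activeRuntime {
		return nil
	}
	return fmt.Errorf("workload %q was created with the %q runtime; switch to that runtime to manage it",
		name, cfg.RuntimeName)
}

func main() {
	if err := checkRuntimeOwnership("foo", RunConfig{RuntimeName: "go-microvm"}, "docker"); err != nil {
		fmt.Println("error:", err)
	}
}
```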
```go
// Empty means legacy workload; treat as owned by active runtime.
if config.RuntimeName == "" {
```
Correctness concern: the legacy-workload default preserves the exact bug this PR is fixing.
Any workload created before this PR has RuntimeName = "". The empty → assume ours default means: if a user has a go-microvm workload (no RuntimeName) and switches to Docker, isOwnedByActiveRuntime returns true, handleRuntimeMissing is reached, Docker doesn't know the workload, and it gets marked unhealthy — exactly the bug in #4432.
The PR description calls this "preserving current behavior until they are restarted", but the current behavior IS the bug. The conservative default is only safe for single-runtime users, which is precisely not the scenario this PR targets.
A safer default for legacy workloads would be to return false (skip reconciliation) rather than true (assume ours). The trade-off: a genuinely broken legacy workload whose container is missing would no longer be auto-marked unhealthy until it is restarted with a newer ToolHive version. That seems like an acceptable temporary regression compared to continuing to corrupt cross-runtime status files for all existing workloads.
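The proposed change amounts to flipping the legacy branch of the helper. The sketch below simplifies it to a pure function over an already-loaded runtime name (the real helper reads the RunConfig from disk and takes a context):

```go
package main

import "fmt"

// isOwnedByActiveRuntime sketches the reviewer's proposed default: an empty
// RuntimeName (legacy workload) returns false, so reconciliation is skipped
// rather than risking cross-runtime status corruption.
func isOwnedByActiveRuntime(runtimeName, activeRuntime string) bool {
	if runtimeName == "" {
		return false // legacy: ownership unknown, skip reconciliation
	}
	return runtimeName == activeRuntime
}

func main() {
	fmt.Println(isOwnedByActiveRuntime("", "docker"))           // legacy: skip
	fmt.Println(isOwnedByActiveRuntime("docker", "docker"))     // owned
	fmt.Println(isOwnedByActiveRuntime("go-microvm", "docker")) // other runtime
}
```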
Good catch, and I agree this is the most important thing to get right here.
So, the concern about flipping to false is that it breaks health checks for the 100% of current users who are on a single runtime. Nobody is using the experimental go-microvm runtime yet... it's not even on main. Defaulting to "skip reconciliation" means a genuinely broken single-runtime workload just silently stays marked as running when it shouldn't be. That's a real regression for everyone today vs. a theoretical fix for a scenario that doesn't exist yet.
That said, I did add a migration in the follow-up commit that addresses the root of your concern. migrateRuntimeName now stamps legacy workloads with the active runtime's name, but only after reconciliation confirms the workload is actually healthy (container running, proxy OK). So the empty-string case gets progressively eliminated as workloads are validated. By the time anyone actually uses multi-runtime, their workloads will already be stamped.
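The stamping logic described above might look roughly like this. The function shape and the `healthy` flag are illustrative; the real `migrateRuntimeName` derives health from the container and proxy checks and persists the stamped config via the status manager.

```go
package main

import "fmt"

// RunConfig is reduced to the field this sketch needs.
type RunConfig struct {
	RuntimeName string `json:"runtime_name,omitempty"`
}

// migrateRuntimeName sketches the opportunistic migration: stamp a legacy
// workload with the active runtime's name only after reconciliation has
// confirmed the workload is healthy. Returns true when a stamp was applied.
func migrateRuntimeName(cfg *RunConfig, activeRuntime string, healthy bool) bool {
	if cfg.RuntimeName != "" {
		return false // already stamped, nothing to migrate
	}
	if !healthy {
		return false // no evidence the active runtime owns this workload
	}
	cfg.RuntimeName = activeRuntime // the real code persists this to disk
	return true
}

func main() {
	cfg := RunConfig{}
	stamped := migrateRuntimeName(&cfg, "docker", true)
	fmt.Println(stamped, cfg.RuntimeName)
}
```

The key property is that the empty-string state is only ever replaced when health has been verified, so ownership is never claimed blindly.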
```go
// Skip reconciliation for workloads owned by a different runtime to avoid
// corrupting their status files (see #4432).
if !f.isOwnedByActiveRuntime(ctx, workloadName) {
```
Nit: the fileWorkload.Remote guard above (line 999) makes this isOwnedByActiveRuntime check unreachable for remote workloads — they already exit before reaching here.
This is correct behavior (remote workloads belong to no runtime), but the ordering makes the intent implicit. A short comment on the Remote guard like // Remote workloads bypass runtime-ownership checks — they have no owning runtime would make it clear this is intentional and not a gap in the cross-runtime protection.
Done! Added a comment that explains both why the early return exists and that it intentionally bypasses the runtime-ownership check below.
```go
// RunConfig and compares it against f.runtime.Name(). Legacy workloads that
// predate the runtime_name field (empty string) are conservatively treated as
// owned by the active runtime so that existing validation behaviour is preserved.
func (f *fileStatusManager) isOwnedByActiveRuntime(ctx context.Context, workloadName string) bool {
```
Follow-up: double GetReader per workload during list.
For each workload, GetReader is called once in isRemoteWorkload and again here in isOwnedByActiveRuntime. Both involve file I/O and happen sequentially per workload. For users with many workloads this adds up. Already noted in the PR special notes — worth a tracked follow-up issue so it doesn't get lost.
Filed #4456 to track this. Note that with the migration commit it's actually 3 reads per legacy workload now (isRemoteWorkload + isOwnedByActiveRuntime + migrateRuntimeName), though that drops to 2 once migration is done. Worth cleaning up properly.
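One way back to a single read per workload is a small per-pass cache shared by the callers. In this sketch the loader function and `cachedConfig` type are stand-ins for `GetReader` plus JSON decoding; the cache type and its API are hypothetical, not existing ToolHive code.

```go
package main

import "fmt"

// cachedConfig stands in for the decoded RunConfig fields reconciliation needs.
type cachedConfig struct {
	RuntimeName string
	Remote      bool
}

// runConfigCache loads each workload's config at most once per reconciliation
// pass, so isRemoteWorkload, isOwnedByActiveRuntime, and migrateRuntimeName
// can share one disk read instead of issuing three.
type runConfigCache struct {
	load  func(name string) (cachedConfig, error) // stand-in for GetReader + decode
	cache map[string]cachedConfig
	Reads int // actual loads performed, for illustration
}

func newRunConfigCache(load func(string) (cachedConfig, error)) *runConfigCache {
	return &runConfigCache{load: load, cache: map[string]cachedConfig{}}
}

func (c *runConfigCache) get(name string) (cachedConfig, error) {
	if cfg, ok := c.cache[name]; ok {
		return cfg, nil // served from cache, no disk I/O
	}
	cfg, err := c.load(name)
	if err != nil {
		return cachedConfig{}, err
	}
	c.Reads++
	c.cache[name] = cfg
	return cfg, nil
}

func main() {
	c := newRunConfigCache(func(string) (cachedConfig, error) {
		return cachedConfig{RuntimeName: "docker"}, nil
	})
	c.get("foo") // first call hits the loader
	c.get("foo") // subsequent calls are cache hits
	c.get("foo")
	fmt.Println("loads:", c.Reads)
}
```

A cache scoped to a single list/reconciliation pass avoids staleness concerns, since the file is not expected to change mid-pass.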
```go
// RuntimeName is the registered name of the container runtime that owns this
// workload (e.g., "docker", "kubernetes"). Used during reconciliation to avoid
// corrupting status files of workloads managed by a different runtime.
RuntimeName string `json:"runtime_name,omitempty" yaml:"runtime_name,omitempty"`
```
Note: existing workloads are never auto-migrated.
RuntimeName is stamped only at creation time (SaveState). Existing workloads on disk have RuntimeName = "" and will never be updated unless the workload is deleted and recreated. This is fine operationally, but means the cross-runtime protection only applies to workloads created after this PR is deployed — which ties back to the correctness concern in isOwnedByActiveRuntime.
There was a problem hiding this comment.
Addressed this in the follow-up commit. Legacy workloads now get their RuntimeName stamped opportunistically during reconciliation, but only when the runtime confirms the workload is healthy. So we're not blindly claiming ownership... we only stamp when we have strong evidence the workload actually belongs to the active runtime. The empty-string fallback in isOwnedByActiveRuntime stays as a safety net, but it should be a transient state now rather than permanent.
When multiple runtimes coexist (e.g., podman and go-microvm), running CLI commands with one runtime active permanently corrupts the status files of workloads managed by the other runtime, marking them as "unhealthy." This happens because the reconciliation logic assumes all workloads belong to the single active runtime.

Add a RuntimeName field to RunConfig that records which runtime created a workload, and a Name() method to the Runtime interface so the active runtime can identify itself. During reconciliation, skip workloads whose owning runtime differs from the active one. Legacy workloads without a RuntimeName are conservatively treated as owned by the active runtime.

Fixes #4432

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iliation

Legacy workloads created before runtime ownership tracking have an empty RuntimeName field. Rather than leaving them permanently in the legacy state, opportunistically stamp them with the active runtime's name when reconciliation confirms the workload is healthy (container running, proxy OK). This ensures the cross-runtime protection from the previous commit covers existing workloads progressively.

Also adds a clarifying comment on the Remote guard in handleRuntimeMissing per reviewer feedback, and improves diagnostics with a DEBUG log when GetReader fails during migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from c26952a to 6841de3
Summary
When multiple runtimes coexist (e.g., podman and an experimental runtime like go-microvm), running any CLI command (`thv list`, `thv status`, `thv stop`, `thv restart`) with one runtime active permanently corrupts the status files of workloads managed by the other runtime, marking them as "unhealthy." The root cause is that no data structure tracks which runtime owns a workload, and the reconciliation logic in `fileStatusManager` assumes all workloads belong to the single active runtime.

This PR adds runtime ownership tracking so that reconciliation skips workloads belonging to a different runtime:

- Adds `Name() string` to the `Runtime` interface so the active runtime can identify itself
- Adds a `RuntimeName` field to `RunConfig` (persisted to disk) to record which runtime created a workload
- Guards `handleRuntimeMissing()` and `validateRunningWorkload()` in the file status manager to skip workloads owned by a different runtime
- Stamps `RuntimeName` at all workload creation entry points (CLI, API, MCP server)

Fixes #4432
Type of change

Test plan

- Tests pass (`task test`)
- Lint passes (`task lint-fix`)

Changes

- `pkg/container/runtime/types.go`: add `Name() string` to the `Runtime` interface
- `pkg/container/docker/client.go`: implement `Name()` returning `"docker"`
- `pkg/container/kubernetes/client.go`: implement `Name()` returning `"kubernetes"`
- `pkg/runner/config.go`: add `RuntimeName` field to `RunConfig`
- `cmd/thv/app/run.go`: stamp `RuntimeName` at both CLI save sites
- `pkg/api/v1/workload_service.go`: stamp `RuntimeName` in the API create path
- `pkg/mcp/server/run_server.go`: stamp `RuntimeName` in the MCP server path
- `pkg/workloads/statuses/file_status.go`: add the `isOwnedByActiveRuntime` helper; guard `validateRunningWorkload` and `handleRuntimeMissing`; add `migrateRuntimeName` for legacy workloads

Does this introduce a user-facing change?
Workloads managed by one runtime are no longer incorrectly marked as "unhealthy" when a different runtime is active. Existing workloads without the new `runtime_name` field are migrated automatically: the first time reconciliation confirms they are healthy, they are stamped with the active runtime's name.

Special notes for reviewers

- `RuntimeName` uses `omitempty`, so old RunConfigs deserialize with `""`. The `isOwnedByActiveRuntime` helper treats empty as "assume ours" to preserve existing health-check behavior for single-runtime users. The `migrateRuntimeName` function then progressively stamps legacy workloads, so the empty-string fallback is a transient state rather than permanent.
- `migrateRuntimeName` only stamps a workload after the runtime confirms it is actually healthy (container running, proxy OK). We never blindly claim ownership of a workload we haven't verified.
- The experimental go-microvm runtime is not on main yet, so it does not implement `Name()` here. It will need to add that method when it merges.
- `GetReader` is called multiple times per workload during reconciliation (`isRemoteWorkload`, `isOwnedByActiveRuntime`, `migrateRuntimeName`). Tracked in "RunConfig is read from disk multiple times per workload during reconciliation" (#4456).

Generated with Claude Code