fix: reduce retained tool output memory#2854
Conversation
This comment was marked as low quality.
This comment was marked as low quality.
2104c15 to
bb2f838
Compare
This comment was marked as outdated.
This comment was marked as outdated.
aheritier
left a comment
There was a problem hiding this comment.
The PR attacks four distinct memory sources and the implementation is mostly clean. The WithoutPayload slimming, the watchDone shutdown guarantee, and the ReadFileMeta.Content removal are all sound. One blocking issue must be resolved before merge.
Blocking
Disk leak: spooled MCP media files are never cleaned up.
writeMediaFile creates temp directories under os.TempDir() with prefix docker-agent-mcp-media-* but nothing removes them when the session ends or the ToolCallResult is GC'd. The PR itself calls defer os.RemoveAll(filepath.Dir(img.FilePath)) in the test, which highlights the production gap. Over a long session with many large media responses this trades the memory leak for a disk leak.
Fix options:
- Register a cleanup func on the
GatewayToolsetlifecycle (cf.cleanUppattern already used there), called onStop. - Use a single session-scoped temp directory created once at
Toolsetconstruction and removed atomically on teardown — simpler than one dir per payload and easier to reason about.
Non-blocking
-
Double
WithoutPayload()inreasoningblock.go—messages.goalready callsmsg.Result.WithoutPayload()before passing the result toUpdateToolResult;reasoningblock.gothen callsresult.WithoutPayload()again internally. Harmless (idempotent on a stripped result) but confusing. Choose one call site and remove the other. -
Missing test: disk-write fallback in
encodeMedia— theslog.Warn+ inline fallback branch is untested. Worth covering with an injected error (e.g. a package-levelwriteMediaFilevar that tests can replace). -
Missing test: threshold boundary —
TestProcessMCPContentSpoolsLargeMediaonly checksmaxInlineMediaBytes+1. Add a case for exactlymaxInlineMediaBytesbytes → inline, andmaxInlineMediaBytes+1→ spooled. -
TestSupervisor_StopWaitsForWatcheris weak coverage for the concurrency fix — the test exercises the sequential path (Start → Stop), not the concurrent-Stop path that thes.stoppingguard is specifically designed to handle. A test callingStoptwice concurrently from two goroutines while the watcher is live would be a stronger regression guard. The< time.Secondtiming assertion is also brittle under heavy CI load;goleakwould be more idiomatic. -
readImageFileinfilesystem.gostill inlines large local images — up tochat.MaxImageBytes(~4.5 MB) as base64. This is inconsistent with the new MCP spooling strategy. Out of scope for this PR, but worth a follow-up issue.
|
|
||
| func writeMediaFile(data []byte, mimeType string) (string, error) { | ||
| dir, err := os.MkdirTemp("", "docker-agent-mcp-media-*") | ||
| if err != nil { |
There was a problem hiding this comment.
[BLOCKING] Disk leak — these temp directories are never removed.
os.MkdirTemp creates a directory in os.TempDir() on every call to writeMediaFile, but nothing cleans it up when the ToolCallResult is discarded or the session ends. Over a long session with many large media tool responses this will exhaust disk space.
Suggested fix: introduce a session-scoped temp directory (created once, removed on Stop) instead of a per-payload directory. Alternatively, register a cleanup closure on the Toolset lifecycle the same way GatewayToolset does today.
| assert.Equal(t, "image/png", img.MimeType) | ||
| require.NotEmpty(t, img.FilePath) | ||
| defer os.RemoveAll(filepath.Dir(img.FilePath)) | ||
|
|
There was a problem hiding this comment.
The test correctly cleans up with defer os.RemoveAll(...), but this highlights that the production code does not. This defer will be unnecessary once a lifecycle cleanup hook is added on the production side.
| entry.msg.ToolStatus = status | ||
| entry.msg.ToolResult = result | ||
| entry.msg.ToolResult = result.WithoutPayload() | ||
|
|
There was a problem hiding this comment.
[Non-blocking] Double WithoutPayload() call.
The caller in messages.go already passes msg.Result.WithoutPayload() here, so result is already a stripped copy. Calling result.WithoutPayload() a second time is idempotent but redundant and may confuse future readers. Either remove WithoutPayload() from this line (since it's already stripped at the call site), or remove it from the call site in messages.go and let UpdateToolResult own the stripping — but not both.
| start := time.Now() | ||
| assert.NilError(t, s.Stop(t.Context())) | ||
| assert.Check(t, time.Since(start) < time.Second) | ||
| assert.Check(t, is.Equal(s.State().State, lifecycle.StateStopped)) |
There was a problem hiding this comment.
[Non-blocking] Brittle timing assertion and weak concurrency coverage.
This < time.Second bound will be flaky under heavy CI load. More importantly, the test only covers the sequential Start→Stop path, not the concurrent-Stop path that the s.stopping guard was added to fix. A test with two goroutines calling Stop concurrently while the watcher is alive (e.g. blocked in sess.Wait()) would be a stronger regression guard. Consider using goleak to assert no goroutines are left behind.
bb2f838 to
6bbc74b
Compare
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟢 APPROVE
Reviewed fix: reduce retained tool output memory (PR #2854).
Scope: 10 files changed — watcher shutdown sequencing (supervisor.go), metadata deduplication (filesystem.go), MCP media spooling to disk (mcp.go), TUI payload slimming (messages.go), and supporting tests.
Verification summary: The drafter raised 5 hypotheses (3 medium, 2 low). After code-level verification, all 5 were dismissed:
| # | File | Hypothesis | Verdict | Reason |
|---|---|---|---|---|
| 1 | supervisor.go |
RestartAndWait blocks indefinitely on ctx cancel during backoff |
✅ DISMISSED | RestartAndWait is not changed by this PR; the new watchDone mechanism applies to Stop() only. No regression introduced. |
| 2 | filesystem.go |
TOCTOU between resolveAndCheckPath and readFile in WalkDir |
✅ DISMISSED | The diff only removes a duplicate metadata field and stale assignment. No I/O path changed; TOCTOU was already addressed in prior commit eb7bb600. |
| 3 | messages.go |
Data race on renderedItems LRU cache between View()/Update() |
✅ DISMISSED | bubbletea runs View() and Update() on the same single-threaded loop goroutine — no concurrency. The diff only swaps msg.Result → msg.Result.WithoutPayload() with no new cache access. |
| 4 | supervisor.go |
Panic in OnFailed/OnRestart callbacks uncaught in watcher |
✅ DISMISSED | Callback invocations are unchanged; lack of panic recovery is pre-existing, not introduced by this PR. |
| 5 | filesystem.go |
Repair loop capped at 3 passes may silently fail | ✅ DISMISSED | No such loop exists anywhere in the changed code. |
Positive observations:
- The
watchDonechannel approach insupervisor.gois clean and correct: created before the watcher goroutine spawns, closed in the goroutine'sdefer, and waited on by both the first and concurrentStop()callers. Thestoppingguard path now correctly blocks instead of silently returning, which is the right fix. - Spooling MCP media to disk via
ensureMediaDir/cleanupMediaDiris well-structured, with proper fallback to inline base64 on spool failure and cleanup tied toStop(). WithoutPayload()inmessages.gocorrectly avoids retaining large payloads in the TUI state.- The new tests (
TestSupervisor_StopWaitsForWatcher,TestSupervisor_StopConcurrent) are solid —waitParkedeliminates the racy connect-then-stop path and the concurrent variant exercises thestoppingguard properly.
Uh oh!
There was an error while loading. Please reload this page.