feat(069-A2): actor-owned usage aggregate + persistence#560
Conversation
Spec 069 Stream A2 (T005–T010): in-memory usage rollup owned by the ActivityService goroutine, published to readers as a copy-on-write snapshot via atomic pointer, persisted to BBolt with cold-start load-or-rebuild. - UsageAggregate/ToolUsage/TimeBucket + incremental Apply (internal/runtime/usage_aggregate.go) - UsageStore: atomic-pointer COW snapshot; lock-free, non-blocking reads - ActivityService owns it: Apply on save in handleEvent, UsageSnapshot() reader, periodic 30s flush + flush-on-shutdown, cold-start load-or-rebuild (one full scan) - activity_stats BBolt bucket (versioned key) + byte-oriented persist/load + ScanAllActivities - observability.usage_cache_ttl (5s) + usage_persist_interval (30s) config: defaults + hot-reload - docs (configuration.md) + spec trace (tasks T005–T010, data-model A2 notes) TDD: aggregate math, snapshot/reads-never-block, persistence round-trip, cold-start rebuild vs load, and config defaults all covered. Full internal/runtime -race suite green (approval-hash canary safe). Related #745 Co-Authored-By: Paperclip <noreply@paperclip.ing>
Deploying mcpproxy-docs with
|
| Latest commit: |
17d8537
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://90a4451f.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://069-a2-usage-aggregate.mcpproxy-docs.pages.dev |
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
📦 Build ArtifactsWorkflow Run: View Run Available Artifacts
How to DownloadOption 1: GitHub Web UI (easiest)
Option 2: GitHub CLI gh run download 26756712352 --repo smart-mcp-proxy/mcpproxy-go
|
|
Critic (Codex) review — Dumbris's PR #560
|
|
Critic (Codex) review — Dumbris's PR #560 Strengths: The aggregate is scoped to the A2 backend stream, uses copy-on-write snapshots for readers, adds persistence tests, and updates config docs/OpenAPI for the new observability fields. Findings:
Checks: Provenance check: ok |
Apply previously cloned the entire usage aggregate (every tool + every
time bucket) on every activity write via publishLocked, making the
write hot path O(tools×buckets) instead of O(1) — violating spec 069
CN-002 ("aggregate update O(1) per activity write; must not block the
request hot path").
Decouple publish from write: Apply now mutates the working aggregate
under a short writer lock and only marks the snapshot stale (atomic
dirty flag, O(1), no clone). The clone is deferred to Snapshot
(publish-on-read): the first reader after a write burst materializes
one fresh snapshot off the hot path; reads with no pending writes stay
lock-free. The A3 endpoint and the 30s persist flush are the only
readers, so clones are rare relative to writes.
Test-first (ENG-1): TestUsageStore_ApplyDoesNotPublishPerWrite asserts
500 writes trigger zero publishes and exactly one clone on first read;
BenchmarkUsageStore_Apply shows 1 alloc/op with 1000 primed tools
(was O(tools) allocs/op). Existing snapshot/replace contract tests and
the full -race runtime suite stay green.
Related #560
Related MCP-835
|
Addressed CodexReviewer's hot-path regression finding in commit |
Addresses CodexReviewer findings #2 and #3 on PR #560 (the O(1) fix cleared #1 once CI went green). #2 — blocked tool attempts were missing from the usage aggregate. They are persisted as blocked `policy_decision` records, but `Apply` dropped all non-tool_call records and `handlePolicyDecision` never fed the aggregate, so the contract's per-tool `blocked` field was permanently 0. `Apply` now also folds blocked policy_decisions: a blocked attempt never executed, so it increments only `Blocked` + `LastUsed` — not `Calls`, latency, bytes, or the executed-call timeline. `handlePolicyDecision` calls `usage.Apply` on save success so the live path matches a cold-start rebuild-from-scan. Extracted a `tool()` get-or-create helper. #3 — `observability.usage_persist_interval` claimed hot-reload but was only read at construction. `DetectConfigChanges` now flags an `observability` change and `ApplyConfig` pushes the new cadence into the running ActivityService via `SetUsagePersistInterval` (the flush loop already re-reads the interval each cycle). Test-first (ENG-1): aggregate counts blocked-only (not Calls/latency/ timeline); live `handlePolicyDecision` folds blocked into the snapshot; `DetectConfigChanges` detects observability as hot-reloadable; end-to-end `ApplyConfig` applies the new interval to a running runtime. Repointed the "ignores non-tool_calls" test to a non-blocked decision. Contract documents `blocked` semantics. Full internal/runtime+config+storage -race green; lint 0; personal+server builds. Related #560 Related MCP-835 Co-Authored-By: Paperclip <noreply@paperclip.ing>
|
Commit
Tests: |
Spec 069 Stream A2 (Backend) — usage aggregate + persistence
Implements tasks T005–T010 from
specs/069-observability-usage-graphs/tasks.md. Builds on A1 (#553, byte capture). Unblocks A3 (/api/v1/activity/usageendpoint, MCP-750).What landed
internal/runtime/usage_aggregate.go—UsageAggregate/ToolUsage/TimeBucketwith incrementalApply(record).UsageStorepublishes an immutable deep copy viaatomic.Pointer(copy-on-write);Snapshot()reads are lock-free and never block.ActivityServiceownership — the aggregate is mutated only on the ActivityService goroutine (Applyon save success inhandleEvent), exposed viaUsageSnapshot(). Periodic flush (default 30s) + flush-on-shutdown; cold start does exactly one full-scan rebuild when no snapshot exists.internal/storage/activity_stats.go— newactivity_statsBBolt bucket, versioned keyusage_aggregate_v1. Byte-orientedSaveUsageSnapshot/LoadUsageSnapshot+ScanAllActivities(avoids a storage→runtime import cycle; runtime owns JSON encode/decode + orchestration).observability.usage_cache_ttl(5s) +usage_persist_interval(30s): defaults inDefaultConfig, repaired inValidate(hot-reload re-runs it), and the flush loop re-reads the interval each cycle.Design decisions (informed assumptions — see
data-model.md§2 "A2 implementation notes")start: …T11:00:00Z), bounded ring of24*90buckets (~90d retention), oldest evicted. Minute/5-min 24h granularity deferred — hourly bars render cleanly and keep the aggregate bounded. A3 trims/selects the window over these.SizedRespCalls/SizedReqCallssoavg_resp_bytes/avg_req_byteseach exclude their own 0-byte (legacy) records exactly. Contractsized_calls→SizedRespCalls.{10,25,50,100,250,500,1000,2500,5000,10000}+ overflow);Percentile(p)returns the bucket upper bound.Verification (ENG-7/ENG-8)
-race), persistence round-trip, cold-start load-vs-rebuild, handleEvent integration, and config defaults.go test ./internal/runtime/... ./internal/storage/... ./internal/config/... -race✅ (full runtime suite incl.TestCalculateToolApprovalHash_Stabilitycanary —ActivityRecordconsumers unchanged)../scripts/run-linter.sh→ 0 issues.gofmtclean. Personal +-tags serverbuilds ✅.docs/configuration.mdObservability section;oas/swagger.yamlregenerated (config schema). No REST endpoint added (that's A3).Related #745