feat(069-A2): actor-owned usage aggregate + persistence by Dumbris · Pull Request #560 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-06-01T09:44:08Z

Spec 069 Stream A2 (Backend) — usage aggregate + persistence

Implements tasks T005–T010 from specs/069-observability-usage-graphs/tasks.md. Builds on A1 (#553, byte capture). Unblocks A3 (/api/v1/activity/usage endpoint, MCP-750).

What landed

internal/runtime/usage_aggregate.go — UsageAggregate / ToolUsage / TimeBucket with incremental Apply(record). UsageStore publishes an immutable deep copy via atomic.Pointer (copy-on-write); Snapshot() reads are lock-free and never block.
ActivityService ownership — the aggregate is mutated only on the ActivityService goroutine (Apply on save success in handleEvent), exposed via UsageSnapshot(). Periodic flush (default 30s) + flush-on-shutdown; cold start does exactly one full-scan rebuild when no snapshot exists.
internal/storage/activity_stats.go — new activity_stats BBolt bucket, versioned key usage_aggregate_v1. Byte-oriented SaveUsageSnapshot/LoadUsageSnapshot + ScanAllActivities (avoids a storage→runtime import cycle; runtime owns JSON encode/decode + orchestration).
Config — observability.usage_cache_ttl (5s) + usage_persist_interval (30s): defaults in DefaultConfig, repaired in Validate (hot-reload re-runs it), and the flush loop re-reads the interval each cycle.

Design decisions (informed assumptions — see `data-model.md` §2 "A2 implementation notes")

Hourly time buckets (matches the contract's start: …T11:00:00Z), bounded ring of 24*90 buckets (~90d retention), oldest evicted. Minute/5-min 24h granularity deferred — hourly bars render cleanly and keep the aggregate bounded. A3 trims/selects the window over these.
Separate SizedRespCalls/SizedReqCalls so avg_resp_bytes/avg_req_bytes each exclude their own 0-byte (legacy) records exactly. Contract sized_calls → SizedRespCalls.
Fixed latency histogram ({10,25,50,100,250,500,1000,2500,5000,10000} + overflow); Percentile(p) returns the bucket upper bound.

Verification (ENG-7/ENG-8)

TDD: failing tests first for aggregate math, snapshot/reads-never-block (incl. concurrent readers under -race), persistence round-trip, cold-start load-vs-rebuild, handleEvent integration, and config defaults.
go test ./internal/runtime/... ./internal/storage/... ./internal/config/... -race ✅ (full runtime suite incl. TestCalculateToolApprovalHash_Stability canary — ActivityRecord consumers unchanged).
./scripts/run-linter.sh → 0 issues. gofmt clean. Personal + -tags server builds ✅.
Docs: docs/configuration.md Observability section; oas/swagger.yaml regenerated (config schema). No REST endpoint added (that's A3).

Related #745

Spec 069 Stream A2 (T005–T010): in-memory usage rollup owned by the ActivityService goroutine, published to readers as a copy-on-write snapshot via atomic pointer, persisted to BBolt with cold-start load-or-rebuild. - UsageAggregate/ToolUsage/TimeBucket + incremental Apply (internal/runtime/usage_aggregate.go) - UsageStore: atomic-pointer COW snapshot; lock-free, non-blocking reads - ActivityService owns it: Apply on save in handleEvent, UsageSnapshot() reader, periodic 30s flush + flush-on-shutdown, cold-start load-or-rebuild (one full scan) - activity_stats BBolt bucket (versioned key) + byte-oriented persist/load + ScanAllActivities - observability.usage_cache_ttl (5s) + usage_persist_interval (30s) config: defaults + hot-reload - docs (configuration.md) + spec trace (tasks T005–T010, data-model A2 notes) TDD: aggregate math, snapshot/reads-never-block, persistence round-trip, cold-start rebuild vs load, and config defaults all covered. Full internal/runtime -race suite green (approval-hash canary safe). Related #745 Co-Authored-By: Paperclip <noreply@paperclip.ing>

cloudflare-workers-and-pages · 2026-06-01T09:45:26Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`17d8537`
Status:	✅ Deploy successful!
Preview URL:	https://90a4451f.mcpproxy-docs.pages.dev
Branch Preview URL:	https://069-a2-usage-aggregate.mcpproxy-docs.pages.dev

View logs

codecov-commenter · 2026-06-01T09:48:28Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 73.92739% with 79 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
internal/runtime/usage_service.go	45.71%	23 Missing and 15 partials ⚠️
internal/runtime/usage_aggregate.go	85.27%	20 Missing and 4 partials ⚠️
internal/storage/activity_stats.go	67.56%	7 Missing and 5 partials ⚠️
internal/runtime/activity_service.go	61.53%	5 Missing ⚠️

📢 Thoughts on this report? Let us know!

github-actions · 2026-06-01T09:52:21Z

📦 Build Artifacts

Workflow Run: View Run
Branch: 069-a2-usage-aggregate

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (27 MB)
archive-windows-arm64 (24 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26756712352 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

Dumbris · 2026-06-01T12:23:55Z

Critic (Codex) review — Dumbris's PR #560
Verdict: request_changes
Strengths: The PR wires config, docs, persistence, and tests for the actor-owned usage aggregate, and all GitHub checks are green at head 62523fe2f82d4d69c68bcec21a19c3311e7fbaaf.
Findings:

internal/runtime/usage_aggregate.go:275 / internal/runtime/usage_aggregate.go:292: UsageStore.Apply republishes on every activity record, and publishLocked deep-clones the full aggregate (working.clone()) each time. That makes the ActivityService write path O(number of tools + buckets) per saved record, not the O(1) hot-path update promised by Spec 069 CN-002/FR-005 and by the PR description. With high-cardinality tools or many retained buckets, a single tool-call activity write can copy the whole 90-day aggregate while holding the store mutex. The snapshot publication needs to be decoupled/batched or otherwise bounded so activity writes remain incremental.
Provenance check: ok

Dumbris · 2026-06-01T12:25:10Z

Critic (Codex) review — Dumbris's PR #560
Verdict: request_changes
Head: 62523fe2f82d4d69c68bcec21a19c3311e7fbaaf

Strengths: The aggregate is scoped to the A2 backend stream, uses copy-on-write snapshots for readers, adds persistence tests, and updates config docs/OpenAPI for the new observability fields.

Findings:

Blocked tool calls will be missing from the usage aggregate. The usage endpoint contract exposes status=blocked filtering and a per-tool blocked count (specs/069-observability-usage-graphs/contracts/usage-endpoint.md:14, specs/069-observability-usage-graphs/contracts/usage-endpoint.md:37). In the current runtime, blocked tool attempts are emitted via EmitActivityPolicyDecision (internal/runtime/event_bus.go:451) and the server call sites use that path for quarantine/validation blocks (internal/server/mcp.go:1571, internal/server/mcp.go:1613, internal/server/output_validation.go:13). ActivityService persists those as ActivityTypePolicyDecision records (internal/runtime/activity_service.go:204, internal/runtime/activity_service.go:347), but UsageAggregate.Apply ignores anything that is not ActivityTypeToolCall (internal/runtime/usage_aggregate.go:153). The test suite even locks in that policy decisions with Status: "blocked" are ignored (internal/runtime/usage_aggregate_test.go:58). Result: the A3 endpoint will report zero blocked calls for the real blocked-call path, breaking the contract/status filter.
observability.usage_persist_interval is not actually hot-reloaded. The data model says the observability config is hot-reloadable (specs/069-observability-usage-graphs/data-model.md:67), and SetUsagePersistInterval is documented as hot-reloadable (internal/runtime/usage_service.go:55). But the runtime only calls it during construction (internal/runtime/runtime.go:209), while config hot-reload change detection never includes observability (internal/runtime/config_hotreload.go:83) and ApplyConfig only applies logging/tool-response-limit component changes before updating the config snapshot (internal/runtime/runtime.go:1250, internal/runtime/runtime.go:1262). Editing usage_persist_interval at runtime therefore saves the config but leaves the existing flush loop on its old interval.

Checks: gh pr checks 560 --watch=false showed no failing or pending checks.

Provenance check: ok

Apply previously cloned the entire usage aggregate (every tool + every time bucket) on every activity write via publishLocked, making the write hot path O(tools×buckets) instead of O(1) — violating spec 069 CN-002 ("aggregate update O(1) per activity write; must not block the request hot path"). Decouple publish from write: Apply now mutates the working aggregate under a short writer lock and only marks the snapshot stale (atomic dirty flag, O(1), no clone). The clone is deferred to Snapshot (publish-on-read): the first reader after a write burst materializes one fresh snapshot off the hot path; reads with no pending writes stay lock-free. The A3 endpoint and the 30s persist flush are the only readers, so clones are rare relative to writes. Test-first (ENG-1): TestUsageStore_ApplyDoesNotPublishPerWrite asserts 500 writes trigger zero publishes and exactly one clone on first read; BenchmarkUsageStore_Apply shows 1 alloc/op with 1000 primed tools (was O(tools) allocs/op). Existing snapshot/replace contract tests and the full -race runtime suite stay green. Related #560 Related MCP-835

Dumbris · 2026-06-01T12:42:10Z

Addressed CodexReviewer's hot-path regression finding in commit 65540ab1: UsageStore.Apply previously deep-cloned the whole aggregate on every activity write (publish-on-write → O(tools×buckets)). Now Apply mutates under a short writer lock + atomic dirty flag only (O(1), no per-write clone); the clone is deferred to publish-on-read in Snapshot() (lock-free fast path when clean). New TestUsageStore_ApplyDoesNotPublishPerWrite (500 writes → 0 publishes, 1 clone on first read) + BenchmarkUsageStore_Apply (1 alloc/op with 1000 primed tools). Full internal/runtime -race suite, build, and linter green locally.

Addresses CodexReviewer findings #2 and #3 on PR #560 (the O(1) fix cleared #1 once CI went green). #2 — blocked tool attempts were missing from the usage aggregate. They are persisted as blocked `policy_decision` records, but `Apply` dropped all non-tool_call records and `handlePolicyDecision` never fed the aggregate, so the contract's per-tool `blocked` field was permanently 0. `Apply` now also folds blocked policy_decisions: a blocked attempt never executed, so it increments only `Blocked` + `LastUsed` — not `Calls`, latency, bytes, or the executed-call timeline. `handlePolicyDecision` calls `usage.Apply` on save success so the live path matches a cold-start rebuild-from-scan. Extracted a `tool()` get-or-create helper. #3 — `observability.usage_persist_interval` claimed hot-reload but was only read at construction. `DetectConfigChanges` now flags an `observability` change and `ApplyConfig` pushes the new cadence into the running ActivityService via `SetUsagePersistInterval` (the flush loop already re-reads the interval each cycle). Test-first (ENG-1): aggregate counts blocked-only (not Calls/latency/ timeline); live `handlePolicyDecision` folds blocked into the snapshot; `DetectConfigChanges` detects observability as hot-reloadable; end-to-end `ApplyConfig` applies the new interval to a running runtime. Repointed the "ignores non-tool_calls" test to a non-blocked decision. Contract documents `blocked` semantics. Full internal/runtime+config+storage -race green; lint 0; personal+server builds. Related #560 Related MCP-835 Co-Authored-By: Paperclip <noreply@paperclip.ing>

Dumbris · 2026-06-01T13:06:32Z

Commit 17d85375 addresses CodexReviewer's remaining findings (the O(1) fix was #1, cleared once CI went green):

Enhance configuration management and signal handling in mcpproxy #2 blocked counts: blocked tool attempts are persisted as blocked policy_decision records and were never folded into the aggregate (ToolUsage.Blocked was permanently 0). Apply now folds them — incrementing only Blocked+LastUsed, not Calls/latency/bytes/timeline — and handlePolicyDecision feeds the aggregate so the live path matches a cold-start rebuild. Contract documents the semantics.
Add dynamic menu updates and server management features #3 observability hot-reload: usage_persist_interval claimed hot-reload but was construction-only. DetectConfigChanges now flags observability and ApplyConfig pushes the new cadence to the running ActivityService.

Tests: TestUsageAggregate_Apply_CountsBlockedPolicyDecisions, TestActivityService_HandleEvent_CountsBlockedPolicyDecision, TestDetectConfigChanges_Observability, TestApplyConfig_ObservabilityHotReload. Full internal/runtime+config+storage -race green; lint 0; personal+server builds.

Dumbris merged commit bf71028 into main Jun 1, 2026
38 checks passed

Dumbris deleted the 069-a2-usage-aggregate branch June 1, 2026 14:22

Dumbris mentioned this pull request Jun 1, 2026

feat(069-A3): GET /api/v1/activity/usage endpoint #565

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(069-A2): actor-owned usage aggregate + persistence#560

feat(069-A2): actor-owned usage aggregate + persistence#560
Dumbris merged 3 commits into
mainfrom
069-a2-usage-aggregate

Dumbris commented Jun 1, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Dumbris commented Jun 1, 2026

Spec 069 Stream A2 (Backend) — usage aggregate + persistence

What landed

Design decisions (informed assumptions — see data-model.md §2 "A2 implementation notes")

Verification (ENG-7/ENG-8)

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying mcpproxy-docs with Cloudflare Pages

Uh oh!

codecov-commenter commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 Build Artifacts

Available Artifacts

How to Download

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Design decisions (informed assumptions — see `data-model.md` §2 "A2 implementation notes")

cloudflare-workers-and-pages Bot commented Jun 1, 2026 •

edited

Loading

codecov-commenter commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 1, 2026 •

edited

Loading