Skip to content

feat(069-A2): actor-owned usage aggregate + persistence#560

Merged
Dumbris merged 3 commits into
mainfrom
069-a2-usage-aggregate
Jun 1, 2026
Merged

feat(069-A2): actor-owned usage aggregate + persistence#560
Dumbris merged 3 commits into
mainfrom
069-a2-usage-aggregate

Conversation

@Dumbris

@Dumbris Dumbris commented Jun 1, 2026

Copy link
Copy Markdown
Member

Spec 069 Stream A2 (Backend) — usage aggregate + persistence

Implements tasks T005–T010 from specs/069-observability-usage-graphs/tasks.md. Builds on A1 (#553, byte capture). Unblocks A3 (/api/v1/activity/usage endpoint, MCP-750).

What landed

  • internal/runtime/usage_aggregate.goUsageAggregate / ToolUsage / TimeBucket with incremental Apply(record). UsageStore publishes an immutable deep copy via atomic.Pointer (copy-on-write); Snapshot() reads are lock-free and never block.
  • ActivityService ownership — the aggregate is mutated only on the ActivityService goroutine (Apply on save success in handleEvent), exposed via UsageSnapshot(). Periodic flush (default 30s) + flush-on-shutdown; cold start does exactly one full-scan rebuild when no snapshot exists.
  • internal/storage/activity_stats.go — new activity_stats BBolt bucket, versioned key usage_aggregate_v1. Byte-oriented SaveUsageSnapshot/LoadUsageSnapshot + ScanAllActivities (avoids a storage→runtime import cycle; runtime owns JSON encode/decode + orchestration).
  • Configobservability.usage_cache_ttl (5s) + usage_persist_interval (30s): defaults in DefaultConfig, repaired in Validate (hot-reload re-runs it), and the flush loop re-reads the interval each cycle.

Design decisions (informed assumptions — see data-model.md §2 "A2 implementation notes")

  • Hourly time buckets (matches the contract's start: …T11:00:00Z), bounded ring of 24*90 buckets (~90d retention), oldest evicted. Minute/5-min 24h granularity deferred — hourly bars render cleanly and keep the aggregate bounded. A3 trims/selects the window over these.
  • Separate SizedRespCalls/SizedReqCalls so avg_resp_bytes/avg_req_bytes each exclude their own 0-byte (legacy) records exactly. Contract sized_callsSizedRespCalls.
  • Fixed latency histogram ({10,25,50,100,250,500,1000,2500,5000,10000} + overflow); Percentile(p) returns the bucket upper bound.

Verification (ENG-7/ENG-8)

  • TDD: failing tests first for aggregate math, snapshot/reads-never-block (incl. concurrent readers under -race), persistence round-trip, cold-start load-vs-rebuild, handleEvent integration, and config defaults.
  • go test ./internal/runtime/... ./internal/storage/... ./internal/config/... -race ✅ (full runtime suite incl. TestCalculateToolApprovalHash_Stability canary — ActivityRecord consumers unchanged).
  • ./scripts/run-linter.sh → 0 issues. gofmt clean. Personal + -tags server builds ✅.
  • Docs: docs/configuration.md Observability section; oas/swagger.yaml regenerated (config schema). No REST endpoint added (that's A3).

Related #745

Spec 069 Stream A2 (T005–T010): in-memory usage rollup owned by the
ActivityService goroutine, published to readers as a copy-on-write snapshot
via atomic pointer, persisted to BBolt with cold-start load-or-rebuild.

- UsageAggregate/ToolUsage/TimeBucket + incremental Apply (internal/runtime/usage_aggregate.go)
- UsageStore: atomic-pointer COW snapshot; lock-free, non-blocking reads
- ActivityService owns it: Apply on save in handleEvent, UsageSnapshot() reader,
  periodic 30s flush + flush-on-shutdown, cold-start load-or-rebuild (one full scan)
- activity_stats BBolt bucket (versioned key) + byte-oriented persist/load + ScanAllActivities
- observability.usage_cache_ttl (5s) + usage_persist_interval (30s) config: defaults + hot-reload
- docs (configuration.md) + spec trace (tasks T005–T010, data-model A2 notes)

TDD: aggregate math, snapshot/reads-never-block, persistence round-trip,
cold-start rebuild vs load, and config defaults all covered. Full
internal/runtime -race suite green (approval-hash canary safe).

Related #745

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 1, 2026

Copy link
Copy Markdown

Deploying mcpproxy-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: 17d8537
Status: ✅  Deploy successful!
Preview URL: https://90a4451f.mcpproxy-docs.pages.dev
Branch Preview URL: https://069-a2-usage-aggregate.mcpproxy-docs.pages.dev

View logs

@codecov-commenter

codecov-commenter commented Jun 1, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 73.92739% with 79 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
internal/runtime/usage_service.go 45.71% 23 Missing and 15 partials ⚠️
internal/runtime/usage_aggregate.go 85.27% 20 Missing and 4 partials ⚠️
internal/storage/activity_stats.go 67.56% 7 Missing and 5 partials ⚠️
internal/runtime/activity_service.go 61.53% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

📦 Build Artifacts

Workflow Run: View Run
Branch: 069-a2-usage-aggregate

Available Artifacts

  • archive-darwin-amd64 (28 MB)
  • archive-darwin-arm64 (25 MB)
  • archive-linux-amd64 (16 MB)
  • archive-linux-arm64 (14 MB)
  • archive-windows-amd64 (27 MB)
  • archive-windows-arm64 (24 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (21 MB)
  • installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26756712352 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

@Dumbris

Dumbris commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

Critic (Codex) review — Dumbris's PR #560
Verdict: request_changes
Strengths: The PR wires config, docs, persistence, and tests for the actor-owned usage aggregate, and all GitHub checks are green at head 62523fe2f82d4d69c68bcec21a19c3311e7fbaaf.
Findings:

  • internal/runtime/usage_aggregate.go:275 / internal/runtime/usage_aggregate.go:292: UsageStore.Apply republishes on every activity record, and publishLocked deep-clones the full aggregate (working.clone()) each time. That makes the ActivityService write path O(number of tools + buckets) per saved record, not the O(1) hot-path update promised by Spec 069 CN-002/FR-005 and by the PR description. With high-cardinality tools or many retained buckets, a single tool-call activity write can copy the whole 90-day aggregate while holding the store mutex. The snapshot publication needs to be decoupled/batched or otherwise bounded so activity writes remain incremental.
    Provenance check: ok

@Dumbris

Dumbris commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

Critic (Codex) review — Dumbris's PR #560
Verdict: request_changes
Head: 62523fe2f82d4d69c68bcec21a19c3311e7fbaaf

Strengths: The aggregate is scoped to the A2 backend stream, uses copy-on-write snapshots for readers, adds persistence tests, and updates config docs/OpenAPI for the new observability fields.

Findings:

  1. Blocked tool calls will be missing from the usage aggregate. The usage endpoint contract exposes status=blocked filtering and a per-tool blocked count (specs/069-observability-usage-graphs/contracts/usage-endpoint.md:14, specs/069-observability-usage-graphs/contracts/usage-endpoint.md:37). In the current runtime, blocked tool attempts are emitted via EmitActivityPolicyDecision (internal/runtime/event_bus.go:451) and the server call sites use that path for quarantine/validation blocks (internal/server/mcp.go:1571, internal/server/mcp.go:1613, internal/server/output_validation.go:13). ActivityService persists those as ActivityTypePolicyDecision records (internal/runtime/activity_service.go:204, internal/runtime/activity_service.go:347), but UsageAggregate.Apply ignores anything that is not ActivityTypeToolCall (internal/runtime/usage_aggregate.go:153). The test suite even locks in that policy decisions with Status: "blocked" are ignored (internal/runtime/usage_aggregate_test.go:58). Result: the A3 endpoint will report zero blocked calls for the real blocked-call path, breaking the contract/status filter.

  2. observability.usage_persist_interval is not actually hot-reloaded. The data model says the observability config is hot-reloadable (specs/069-observability-usage-graphs/data-model.md:67), and SetUsagePersistInterval is documented as hot-reloadable (internal/runtime/usage_service.go:55). But the runtime only calls it during construction (internal/runtime/runtime.go:209), while config hot-reload change detection never includes observability (internal/runtime/config_hotreload.go:83) and ApplyConfig only applies logging/tool-response-limit component changes before updating the config snapshot (internal/runtime/runtime.go:1250, internal/runtime/runtime.go:1262). Editing usage_persist_interval at runtime therefore saves the config but leaves the existing flush loop on its old interval.

Checks: gh pr checks 560 --watch=false showed no failing or pending checks.

Provenance check: ok

Apply previously cloned the entire usage aggregate (every tool + every
time bucket) on every activity write via publishLocked, making the
write hot path O(tools×buckets) instead of O(1) — violating spec 069
CN-002 ("aggregate update O(1) per activity write; must not block the
request hot path").

Decouple publish from write: Apply now mutates the working aggregate
under a short writer lock and only marks the snapshot stale (atomic
dirty flag, O(1), no clone). The clone is deferred to Snapshot
(publish-on-read): the first reader after a write burst materializes
one fresh snapshot off the hot path; reads with no pending writes stay
lock-free. The A3 endpoint and the 30s persist flush are the only
readers, so clones are rare relative to writes.

Test-first (ENG-1): TestUsageStore_ApplyDoesNotPublishPerWrite asserts
500 writes trigger zero publishes and exactly one clone on first read;
BenchmarkUsageStore_Apply shows 1 alloc/op with 1000 primed tools
(was O(tools) allocs/op). Existing snapshot/replace contract tests and
the full -race runtime suite stay green.

Related #560
Related MCP-835
@Dumbris

Dumbris commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

Addressed CodexReviewer's hot-path regression finding in commit 65540ab1: UsageStore.Apply previously deep-cloned the whole aggregate on every activity write (publish-on-write → O(tools×buckets)). Now Apply mutates under a short writer lock + atomic dirty flag only (O(1), no per-write clone); the clone is deferred to publish-on-read in Snapshot() (lock-free fast path when clean). New TestUsageStore_ApplyDoesNotPublishPerWrite (500 writes → 0 publishes, 1 clone on first read) + BenchmarkUsageStore_Apply (1 alloc/op with 1000 primed tools). Full internal/runtime -race suite, build, and linter green locally.

Addresses CodexReviewer findings #2 and #3 on PR #560 (the O(1) fix
cleared #1 once CI went green).

#2 — blocked tool attempts were missing from the usage aggregate. They
are persisted as blocked `policy_decision` records, but `Apply` dropped
all non-tool_call records and `handlePolicyDecision` never fed the
aggregate, so the contract's per-tool `blocked` field was permanently 0.
`Apply` now also folds blocked policy_decisions: a blocked attempt never
executed, so it increments only `Blocked` + `LastUsed` — not `Calls`,
latency, bytes, or the executed-call timeline. `handlePolicyDecision`
calls `usage.Apply` on save success so the live path matches a
cold-start rebuild-from-scan. Extracted a `tool()` get-or-create helper.

#3 — `observability.usage_persist_interval` claimed hot-reload but was
only read at construction. `DetectConfigChanges` now flags an
`observability` change and `ApplyConfig` pushes the new cadence into the
running ActivityService via `SetUsagePersistInterval` (the flush loop
already re-reads the interval each cycle).

Test-first (ENG-1): aggregate counts blocked-only (not Calls/latency/
timeline); live `handlePolicyDecision` folds blocked into the snapshot;
`DetectConfigChanges` detects observability as hot-reloadable; end-to-end
`ApplyConfig` applies the new interval to a running runtime. Repointed
the "ignores non-tool_calls" test to a non-blocked decision. Contract
documents `blocked` semantics. Full internal/runtime+config+storage
-race green; lint 0; personal+server builds.

Related #560
Related MCP-835

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@Dumbris

Dumbris commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

Commit 17d85375 addresses CodexReviewer's remaining findings (the O(1) fix was #1, cleared once CI went green):

  • Enhance configuration management and signal handling in mcpproxy #2 blocked counts: blocked tool attempts are persisted as blocked policy_decision records and were never folded into the aggregate (ToolUsage.Blocked was permanently 0). Apply now folds them — incrementing only Blocked+LastUsed, not Calls/latency/bytes/timeline — and handlePolicyDecision feeds the aggregate so the live path matches a cold-start rebuild. Contract documents the semantics.
  • Add dynamic menu updates and server management features #3 observability hot-reload: usage_persist_interval claimed hot-reload but was construction-only. DetectConfigChanges now flags observability and ApplyConfig pushes the new cadence to the running ActivityService.

Tests: TestUsageAggregate_Apply_CountsBlockedPolicyDecisions, TestActivityService_HandleEvent_CountsBlockedPolicyDecision, TestDetectConfigChanges_Observability, TestApplyConfig_ObservabilityHotReload. Full internal/runtime+config+storage -race green; lint 0; personal+server builds.

@Dumbris Dumbris merged commit bf71028 into main Jun 1, 2026
38 checks passed
@Dumbris Dumbris deleted the 069-a2-usage-aggregate branch June 1, 2026 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants