Refactor monetize reconciliation into the serviceoffer controller #299

Draft
bussyjd wants to merge 11 commits into feat/monetize-path from codex/serviceoffer-controller

bussyjd (Collaborator) commented Mar 29, 2026

Summary

This PR moves sell-side monetization reconciliation out of the obol-agent runtime and into a dedicated Kubernetes controller.

The main outcome is that ServiceOffer becomes the single source of truth for monetized HTTP offers, while the request path (x402-verifier) remains a separate stateless service that derives live routes directly from Kubernetes state instead of a shared mutable ConfigMap.

Why

The previous flow relied on monetize.py running inside the agent runtime, periodic polling, and imperative mutation of shared x402 pricing config. That created a few structural problems:

  • reconciliation depended on the obol-agent process being alive
  • route publication lagged behind changes because it was poll-driven
  • live pricing state was stored in a shared rendered artifact (x402-pricing), which introduced race conditions and cleanup complexity
  • external registration side effects were not clearly owned by a controller/finalizer path

This PR keeps the controller and verifier separate, but gives each a cleaner boundary:

  • serviceoffer-controller owns cluster convergence and registration lifecycle
  • x402-verifier owns only the live payment-gating request path

What changed

1. Added a dedicated serviceoffer-controller

  • added a new controller binary: cmd/serviceoffer-controller
  • added controller reconciliation logic under internal/serviceoffercontroller
  • the controller watches ServiceOffer and reconciles the Kubernetes resources needed to publish a paid route
  • the controller updates status.conditions and status.observedGeneration
  • delete-time cleanup is now controller-owned via finalizer logic instead of CLI best-effort cleanup

2. Kept ServiceOffer as the source of truth

An earlier design path introduced a separate PaymentRoute projection. This PR intentionally does not keep that layer.

Instead:

  • ServiceOffer remains the only dynamic intent object
  • the controller reconciles from ServiceOffer
  • the verifier also reads ServiceOffer directly and rebuilds its in-memory route table from informer-backed cluster state

This keeps the model simpler and avoids duplicating routing state in another CRD.
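
To make the "rebuild the route table from ServiceOffer state" idea concrete, here is a minimal sketch. The `ServiceOffer` struct is a trimmed, hypothetical view of the CRD: the `Path`, `Price`, and `Published` fields are assumptions for illustration, and the real verifier reads from an informer-backed store rather than a slice.

```go
package main

import (
	"fmt"
	"sort"
)

// ServiceOffer is a hypothetical, trimmed view of the CRD with only the
// fields this sketch needs.
type ServiceOffer struct {
	Name      string
	Path      string // assumed: URL path the offer is served on
	Price     string // assumed: price per request, as a string
	Published bool   // assumed: stands in for whatever readiness signal the verifier checks
}

// Route is one entry in the verifier's in-memory route table.
type Route struct {
	Pattern string
	Price   string
	Offer   string
}

// routesFromOffers rebuilds the whole table from current state. Unpublished
// offers are skipped, and the result is sorted so rebuilds are deterministic.
func routesFromOffers(offers []ServiceOffer) []Route {
	routes := make([]Route, 0, len(offers))
	for _, o := range offers {
		if !o.Published {
			continue
		}
		routes = append(routes, Route{Pattern: o.Path, Price: o.Price, Offer: o.Name})
	}
	sort.Slice(routes, func(i, j int) bool { return routes[i].Pattern < routes[j].Pattern })
	return routes
}

func main() {
	offers := []ServiceOffer{
		{Name: "llm", Path: "/services/llm", Price: "0.001", Published: true},
		{Name: "draft", Path: "/services/draft", Price: "0.002", Published: false},
	}
	for _, r := range routesFromOffers(offers) {
		fmt.Println(r.Pattern, r.Price, r.Offer)
	}
}
```

Because the table is derived in full on each rebuild, deleting or unpublishing an offer removes its route without any separate cleanup step.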

3. Isolated registration side effects with RegistrationRequest

  • added a RegistrationRequest CRD
  • the controller now owns creation and observation of registration work instead of letting it leak into the request-serving path
  • registration publication and cleanup move closer to a proper controller/finalizer model

4. Simplified x402-verifier

  • x402-verifier no longer relies on the dynamic x402-pricing ConfigMap as the live source for per-offer routes
  • it now derives route rules from published ServiceOffer objects
  • /.well-known/agent-registration.json is no longer served by the verifier, which reduces the amount of non-request-path responsibility in that service
  • file-based config remains for static verifier settings, but not for live ServiceOffer routing state

5. Reduced agent-side monetization responsibilities

  • rewrote internal/embed/skills/sell/scripts/monetize.py into a much thinner compatibility layer
  • it now behaves as a CRUD/status/wait/publish helper instead of being the long-lived reconciliation owner
  • reduced the monetize RBAC footprint for obol-agent to reflect that it is no longer the control-plane reconciler

6. Updated schemas, docs, and tests to match the new model

  • extended registration-related schema/types so skills and domains flow through the new control plane
  • updated architecture/design docs and plans so the old poll-driven ConfigMap-mutating design is explicitly historical
  • updated x402 BDD/E2E and controller/unit test coverage for the new source-of-truth and controller model

Key files

  • cmd/serviceoffer-controller/main.go
  • internal/serviceoffercontroller/controller.go
  • internal/serviceoffercontroller/render.go
  • internal/embed/infrastructure/base/templates/registrationrequest-crd.yaml
  • internal/embed/infrastructure/base/templates/serviceoffer-crd.yaml
  • internal/embed/infrastructure/base/templates/x402.yaml
  • internal/embed/infrastructure/base/templates/obol-agent-monetize-rbac.yaml
  • internal/x402/serviceoffer_source.go
  • internal/x402/verifier.go
  • internal/embed/skills/sell/scripts/monetize.py

Validation

Passed locally:

python3 -m py_compile internal/embed/skills/sell/scripts/monetize.py
go test ./internal/serviceoffercontroller ./internal/embed ./internal/x402 -run 'Test(ServiceOfferCRD_|RegistrationRequestCRD_|MonetizeRBAC_|BuildMiddleware|BuildHTTPRoute|BuildRegistrationRequest|BuildActiveRegistrationDocument|RegistrationDataURL|SetConditionUpdatesExistingEntry|RoutesFromStore|RoutesFromStore_IgnoresUnpublishedOffers|Verifier_NoForwardedURI_Returns403|Verifier_FreeRoute_Returns200|Verifier_PaidRoute_NoPayment_Returns402|Verifier_PaidRoute_ValidPayment_Returns200|Verifier_PaidRoute_RejectedPayment_Returns402|Verifier_VerifyOnly_SkipsSettle|Verifier_Readyz|Verifier_InvalidChain|WatchConfig_)'
go test ./... -run TestDoesNotExist
go test -c -tags integration -o /tmp/obol-integration-compile/openclaw.test ./internal/openclaw
go test -c -tags integration -o /tmp/obol-integration-compile/x402.test ./internal/x402

Runtime integration note

I also made a real integration attempt with:

go test -tags integration -run TestIntegration_PaymentGate_FullLifecycle -timeout 30m ./internal/x402

The run now gets past the earlier missing-k3d bootstrap blocker, creates the cluster, and enters obol stack up. It did not complete within the session because it was still in the Docker image build/bootstrap path for the x402 stack. I am calling this out because it is a meaningful improvement over the previous blocked state, but it is not yet a completed end-to-end runtime pass.

Follow-up

  • finish a full runtime integration pass in CI/local once the image-build/bootstrap path is stable enough to complete consistently
  • if desired, post an update on issue #296 ("Replace monetize.py reconciliation loop with controller-runtime operator") summarizing the final design choice (watching ServiceOffer directly instead of introducing PaymentRoute); I attempted this through the GitHub app, but the app did not have permission to comment on issues in this repo

bussyjd and others added 5 commits March 29, 2026 08:56
Resolve CLI/ERC-8004 conflicts for the ServiceOffer controller branch and replace the buyer proxy's x402 retry transport with a replay-safe local implementation so request bodies survive 402 retries under Go 1.26.
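
For readers unfamiliar with the replay problem, here is a minimal sketch of a body-preserving retry transport. It is not the implementation in this branch: the `replayTransport` type, the `payment` hook, and the use of an `X-PAYMENT` request header are assumptions for illustration. The idea is to buffer the request body once so that both the initial attempt and the post-402 retry send identical bytes.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// replayTransport is a hypothetical RoundTripper that retries once on 402.
type replayTransport struct {
	base http.RoundTripper
	// payment builds the payment header from the 402 challenge (assumed hook).
	payment func(challenge *http.Response) (string, error)
}

func (t *replayTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Buffer the body up front; a consumed stream cannot be sent twice.
	var buf []byte
	if req.Body != nil && req.Body != http.NoBody {
		b, err := io.ReadAll(req.Body)
		req.Body.Close()
		if err != nil {
			return nil, err
		}
		buf = b
	}
	attempt := func(header string) (*http.Response, error) {
		r := req.Clone(req.Context()) // never mutate the caller's request
		r.Body = http.NoBody
		if len(buf) > 0 {
			r.Body = io.NopCloser(bytes.NewReader(buf))
		}
		if header != "" {
			r.Header.Set("X-PAYMENT", header)
		}
		return t.base.RoundTrip(r)
	}
	resp, err := attempt("")
	if err != nil || resp.StatusCode != http.StatusPaymentRequired {
		return resp, err
	}
	header, perr := t.payment(resp)
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
	if perr != nil {
		return nil, perr
	}
	return attempt(header)
}

// demo runs the transport against a local server that demands payment first
// and then echoes the request body.
func demo() (int, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-PAYMENT") == "" {
			io.Copy(io.Discard, r.Body)
			w.WriteHeader(http.StatusPaymentRequired)
			return
		}
		io.Copy(w, r.Body)
	}))
	defer srv.Close()
	client := &http.Client{Transport: &replayTransport{
		base:    http.DefaultTransport,
		payment: func(*http.Response) (string, error) { return "demo-payment", nil },
	}}
	resp, err := client.Post(srv.URL, "text/plain", strings.NewReader("hello"))
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(out), err
}

func main() {
	code, body, err := demo()
	fmt.Println(code, body, err)
}
```

Buffering trades memory for correctness, so a production transport would likely cap the buffered size; that limit is omitted here.
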
bussyjd changed the base branch from main to feat/monetize-path on March 30, 2026 04:21

bussyjd commented Mar 30, 2026

Validation rerun on codex/serviceoffer-controller after resolving the feat/monetize-path merge conflicts.

Passed:

go test ./internal/embed ./internal/serviceoffercontroller ./internal/erc8004 ./internal/kubectl ./internal/schemas
go test ./cmd/obol ./internal/network ./internal/tunnel ./internal/x402/...
python3 -m unittest tests/test_sell_registration_metadata.py
python3 -m py_compile internal/embed/skills/sell/scripts/monetize.py

Notes:

  • Worktree was clean before rerunning.
  • This covers the controller/renderer path, ERC-8004 types/client, kubectl apply helpers, CLI sell path, network/tunnel plumbing, x402 verifier/buyer path, and the Python registration metadata compatibility helpers.
  • I did not rerun the full live seller/buyer integration flows in this pass; this comment is reporting the automated test rerun only.

When OBOL_DEVELOPMENT=true, Docker builds from the project root pick up
.workspace/data/ directories that contain root-owned PVC mounts from
previous clusters, causing "permission denied" errors during context
scanning.

Exclude .workspace/ and .worktrees/ from the Docker build context via
.dockerignore.

Fixes #304
bussyjd force-pushed the codex/serviceoffer-controller branch from badcfc0 to 98fc024 on March 30, 2026 05:33

bussyjd commented Mar 30, 2026

Follow-up validation after rerunning the flow suite on codex/serviceoffer-controller.

What I fixed while running the flows:

  • flow-07-sell-verify.sh: fixed tunnel URL extraction under set -euo pipefail
  • flow-08-buy.sh: fixed the same tunnel extraction bug and made the buy flow fall back to local obol.stack when the quick tunnel URL exists but does not actually return the expected 402 on the service route
  • flow-10-anvil-facilitator.sh: hardened local facilitator startup so the facilitator keeps running after the script exits, and made the cluster facilitator host alias probe the live cluster instead of hardcoding a macOS/Linux assumption
  • internal/stack/stack.go: added dev-mode prewarm/import of external images so fresh OBOL_DEVELOPMENT=true clusters do not spend most of bootstrap waiting on internet pulls for third-party images

Flow status after those fixes:

  • flow-01-prerequisites.sh: pass
  • flow-02-stack-init-up.sh: pass
  • flow-03-inference.sh: pass
  • flow-04-agent.sh: pass (after importing the OpenClaw image into the k3d cluster cache)
  • flow-05-network.sh: pass
  • flow-06-sell-setup.sh: pass
  • flow-07-sell-verify.sh: pass
  • flow-10-anvil-facilitator.sh: pass
  • flow-08-buy.sh: pass
  • flow-09-lifecycle.sh: pass

Concrete seller/buyer proof from the successful rerun:

  • local 402: pass
  • tunnel 402: pass
  • paid inference: pass
  • buyer USDC: 1000000000 -> 999999000
  • seller USDC: 291036851 -> 291037851
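
The two balance changes above can be checked against each other with a few lines of arithmetic. This sketch assumes the balances are raw token base units and that the test token uses USDC's usual 6 decimals; the helper names are illustrative only.

```go
package main

import "fmt"

// delta returns the signed change between two raw balances.
func delta(before, after int64) int64 { return after - before }

// formatUSDC renders base units as a decimal string, assuming 6 decimals.
func formatUSDC(units int64) string {
	sign := ""
	if units < 0 {
		sign = "-"
		units = -units
	}
	return fmt.Sprintf("%s%d.%06d", sign, units/1_000_000, units%1_000_000)
}

func main() {
	buyer := delta(1000000000, 999999000)
	seller := delta(291036851, 291037851)

	// The buyer's debit should equal the seller's credit.
	fmt.Println(buyer, seller, buyer+seller == 0, formatUSDC(seller))
	// prints: -1000 1000 true 0.001000
}
```

Under those assumptions, the buyer paid exactly what the seller received: 1000 base units, or 0.001 USDC.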

Root cause of the earlier buy failure:

  • the facilitator path was not actually stable from the cluster’s point of view
  • the old flow hardcoded the cluster host alias and backgrounded the facilitator in a way that let it disappear after startup
  • once the script selected the reachable alias and the facilitator stayed resident, the verifier/facilitator path verified and settled payments successfully again

fix: exclude .workspace from Docker build context (#304)