feat(benchmark): scaffold Code→UI Benchmark comparators (chrome-devtools-mcp + WebMCP reference) and /benchmark v1 results page by Narrator · Pull Request #48 · patchorbit/domscribe

Narrator · 2026-06-06T20:10:21Z

Summary

This PR delivers sprint 2613 Task B (benchmark-comparators-and-page) — the comparator implementations and the public results-page scaffold for the Code→UI Benchmark v1.

Three things land:

- packages/domscribe-benchmark-comparators/: a new sibling package containing the Comparator interface, the WebMCP-conformant in-repo reference server (per RFC 0004's fallback, since no public WebMCP-conformant server installs cleanly in-sprint), and the chrome-devtools-mcp comparator shim. 17 unit tests, 84% statement coverage (83% lines). The runner in @domscribe/benchmark (Task A) imports Comparator implementations from here.
- chrome-devtools-mcp@1.1.1 pinned as a root dev-dep: an exact-version pin (no caret) so the published 45-cell results table is reproducible byte-for-byte. The lockfile diff includes pnpm's routine forward-resolution of cookie / cors / qs / http-errors transitive patches.
- /benchmark static results page (docs/code-to-ui-benchmark/v1-results.md): the public artifact that RFC 0004's category-naming falsifier reads. Cells are schema-true placeholders; Task A's runner populates them from results.json. Linked from README.md (Features section) and TECHNICAL_SPEC.md (§9 Framework Support Matrix). Methodology, comparator table with external-validity flag, reproduction command, and falsifier text are all included.

Plus: docs/code-to-ui-benchmark/vendor-submission-amendment-plan.md — the checklist for amending Sprint 2612 IDE-vendor-directory submissions with the /benchmark link while review windows are still open.
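The exact-version pin on chrome-devtools-mcp lends itself to a mechanical check. The following is a minimal sketch, not part of this PR: `isExactPin` is a hypothetical helper, the `devDependencies` object is an assumed slice of the root package.json, and "exact" is simplified to a bare `major.minor.patch` with an optional prerelease or build suffix.

```typescript
// Hypothetical helper (not in this PR): true only for a bare semver version
// such as "1.1.1", so carets, tildes, ranges, and dist-tags are rejected.
function isExactPin(spec: string): boolean {
  return /^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$/.test(spec);
}

// Assumed slice of the root package.json; only the pin named in the PR is listed.
const devDependencies: { [pkg: string]: string } = {
  "chrome-devtools-mcp": "1.1.1",
};

// Collect every dev-dep whose spec is not an exact version.
const unpinned = Object.keys(devDependencies).filter(
  (pkg) => !isExactPin(devDependencies[pkg]),
);

console.log(
  unpinned.length === 0
    ? "all dev-deps are exact pins"
    : "not pinned: " + unpinned.join(", "),
);
```

A check along these lines could run in CI so that a later caret or tilde on the comparator dependency is caught before it changes the published table.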

Why a separate package boundary

The comparators have a different stability surface than the benchmark spec: the WebMCP reference is intentionally drop-in replaceable once a real external WebMCP-conformant server is installable, and the chrome-devtools-mcp shim is a thin wrapper over a pinned external dev-dep. Co-locating them with the spec/runner in @domscribe/benchmark would couple two distinct evolution rhythms. Each package can publish independently when its surface changes.
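To illustrate the drop-in replaceability described above, here is a minimal sketch under stated assumptions. The PR confirms only that a `Comparator` exposes `run(prompt)`; the `Outcome` shape, the `id` and `externalValidity` fields, and both factory functions below are hypothetical stand-ins, not the package's actual API.

```typescript
// Assumed outcome shape; the PR does not show the real type.
type Outcome = { kind: "pass" } | { kind: "fail"; reason: string };

// Assumed contract: only run(prompt) is confirmed by the PR description.
interface Comparator {
  readonly id: string;
  readonly externalValidity: "external" | "in-repo-fallback";
  run(prompt: string): Promise<Outcome>;
}

// Stand-in for the in-repo WebMCP reference (stub logic, illustration only).
function createInRepoReference(): Comparator {
  return {
    id: "webmcp-reference",
    externalValidity: "in-repo-fallback",
    async run(prompt: string): Promise<Outcome> {
      if (prompt.trim() === "") return { kind: "fail", reason: "empty prompt" };
      return { kind: "pass" };
    },
  };
}

// Hypothetical future external server wrapped behind the same contract.
function createExternalWebMcp(
  send: (prompt: string) => Promise<boolean>,
): Comparator {
  return {
    id: "webmcp-external",
    externalValidity: "external",
    async run(prompt: string): Promise<Outcome> {
      if (await send(prompt)) return { kind: "pass" };
      return { kind: "fail", reason: "server rejected prompt" };
    },
  };
}

// The caller depends only on the interface, so either implementation fits.
async function score(comparator: Comparator, prompts: string[]): Promise<number> {
  let passed = 0;
  for (const prompt of prompts) {
    const outcome = await comparator.run(prompt);
    if (outcome.kind === "pass") passed += 1;
  }
  return passed;
}
```

Because `score` never names a concrete implementation, replacing the in-repo reference with an external server would be a change at the call site only.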

Acceptance criteria mapping (PM task B)

| Criterion | This PR | Notes |
| --- | --- | --- |
| `chrome-devtools-mcp` installed as a pinned dev-dep | ✅ | `package.json` line; lockfile updated |
| Comparator wired to run against 5 scenarios × 3 fixtures = 15 cells | ⚠️ Partial | Comparator class is wired; fixture instrumentation and the runner invocation live in Task A's `@domscribe/benchmark`. The 15 cells materialise once Task A's runner imports `createChromeDevtoolsMcpComparator` and calls `run(prompt)` against the fixtures |
| WebMCP-conformant reference wired | ✅ | In-repo reference per RFC 0004 fallback; `/benchmark` page surfaces the `external-validity: in-repo-fallback` caveat verbatim |
| `/benchmark` page live, linked from README.md and TECHNICAL_SPEC.md | ✅ | `docs/code-to-ui-benchmark/v1-results.md` with last-run timestamp slot, comparator versions, methodology link to the spec, and reproduction command snippet |
| Sprint 2612 vendor-directory submissions amended | ⚠️ Conditional | `docs/submissions.md` and the four submission PRs are not yet on `main` as of sprint 2613 kickoff; see `vendor-submission-amendment-plan.md` for the trigger-conditional procedure |
| Engineer logs ≥8h of OSS reference-integration pairing time | ⚠️ Out of scope for this PR | Pairing time will appear in sprint retro notes per PM acceptance |
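The cell arithmetic in the table can be made concrete with a short sketch. It assumes the 45-cell table is three columns (RCP-v1 plus the two comparators) of 15 cells each; the fixture names and column ids are placeholders, since the PR does not list them.

```typescript
// Scenario ids S1 to S5 are named in the PR.
const scenarios = ["S1", "S2", "S3", "S4", "S5"] as const;
// Fixture names are placeholders; the PR does not list them.
const fixtures = ["fixture-a", "fixture-b", "fixture-c"] as const;
// Column ids are assumed labels for the three columns the PR describes.
const columns = ["rcp-v1", "chrome-devtools-mcp", "webmcp-reference"] as const;

type Cell = { column: string; scenario: string; fixture: string };

// One column is the full scenario × fixture cross product.
function cellsFor(column: string): Cell[] {
  const cells: Cell[] = [];
  for (const scenario of scenarios) {
    for (const fixture of fixtures) {
      cells.push({ column, scenario, fixture });
    }
  }
  return cells;
}

// The whole table is every column's cells concatenated.
const allCells = columns.reduce<Cell[]>(
  (acc, column) => acc.concat(cellsFor(column)),
  [],
);

console.log(cellsFor("chrome-devtools-mcp").length, allCells.length); // 15 45
```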

Dependencies on Task A (Sprint 2613 batch A)

This PR is structurally a Task B deliverable; Task A (benchmark-rcp-foundation — @domscribe/benchmark package, docs/code-to-ui-benchmark/v1.md spec, RCP-v1 column with 15 cells) has not yet landed on main as of this PR. The Comparator interface here was designed to be the contract Task A's runner consumes, but:

The benchmark spec doc (docs/code-to-ui-benchmark/v1.md) — referenced from this PR's v1-results.md and from the comparators README — does not yet exist. Links resolve to a 404 until Task A merges.
The fixture instrumentation for S1–S5 lives with Task A. Until that lands, the 15 cells per comparator column remain as — placeholders on /benchmark.
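The placeholder behaviour described above can be sketched as follows. This is an illustration only: the real results.json schema is not shown in this PR, so the flat `column/scenario/fixture` key format, the `Results` type, and `renderCell` are all assumptions.

```typescript
// Assumed results.json shape: a flat map from "column/scenario/fixture"
// to an outcome label. The real schema lives with Task A.
type Results = { [cell: string]: "pass" | "fail" | undefined };

// The dash that /benchmark shows for cells that have not been run yet.
const PLACEHOLDER = "\u2014";

// Render one table cell, falling back to the placeholder when no result exists.
function renderCell(
  results: Results,
  column: string,
  scenario: string,
  fixture: string,
): string {
  const outcome = results[column + "/" + scenario + "/" + fixture];
  return outcome === undefined ? PLACEHOLDER : outcome;
}

// Before Task A's runner has produced anything, every cell is a placeholder.
const emptyResults: Results = {};
// After a hypothetical partial run, only the recorded cell is filled in.
const partialResults: Results = { "chrome-devtools-mcp/S1/fixture-a": "pass" };

console.log(renderCell(emptyResults, "chrome-devtools-mcp", "S1", "fixture-a"));
console.log(renderCell(partialResults, "chrome-devtools-mcp", "S1", "fixture-a"));
```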

When Task A merges, the runner's import statement should look approximately like:

```typescript
import {
  createChromeDevtoolsMcpComparator,
  createWebMcpReferenceComparator,
} from '@domscribe/benchmark-comparators';
```

No rebase conflict expected — Task A and this PR touch disjoint paths (packages/domscribe-benchmark/* vs. packages/domscribe-benchmark-comparators/*).
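The disjoint-paths claim is easy to check in code, with one subtlety: `packages/domscribe-benchmark` is a string prefix of `packages/domscribe-benchmark-comparators`, so a bare prefix match would report a false overlap. The sketch below assumes hypothetical file lists and helper names (`isUnder`, `sharedFiles`); it is not tooling from this repository.

```typescript
// True when `file` lives under directory `dir`. The trailing slash matters:
// without it, "packages/domscribe-benchmark" would also match files under
// "packages/domscribe-benchmark-comparators".
function isUnder(dir: string, file: string): boolean {
  const root = dir.endsWith("/") ? dir : dir + "/";
  return file.startsWith(root);
}

// Files touched by both change sets; an empty result means disjoint paths.
function sharedFiles(a: string[], b: string[]): string[] {
  const seen = new Set(a);
  return b.filter((file) => seen.has(file));
}

// Hypothetical changed-file lists for Task A and for this PR.
const taskAFiles = ["packages/domscribe-benchmark/src/runner.ts"];
const thisPrFiles = ["packages/domscribe-benchmark-comparators/src/index.ts"];

console.log(isUnder("packages/domscribe-benchmark", thisPrFiles[0])); // false
console.log(sharedFiles(taskAFiles, thisPrFiles).length); // 0
```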

Pre-existing blocker (escalation, not a blocker for this PR)

Sprint 2612's RCP-v1 stamp (PE RFC 0003 falsifier) has not landed on main as of sprint 2613 kickoff: there is no docs/rcp/v1.md and no docs/submissions.md. The PM plan's replanning trigger #1 names this as a "full Sprint 2613 replan" trigger because the benchmark has no v1 subject without RCP v1. This PR does not depend on RCP v1 landing because the comparators are framework-agnostic, but Task A and the /benchmark page's RCP-v1 column do depend on it. Surfacing here for visibility; a separate Slack escalation to Kaushik covers the broader sprint-thesis risk.

Test plan

- `pnpm nx run domscribe-benchmark-comparators:lint`: passes
- `pnpm nx run domscribe-benchmark-comparators:test`: 17/17 tests pass; coverage is 84% statements / 71% branches / 92% functions / 83% lines (all above configured thresholds)
- `pnpm nx run domscribe-benchmark-comparators:build`: compiles cleanly
- `pnpm nx affected -t lint --base=main~1`: the comparators package and all other affected packages pass; pre-existing domscribe-test-fixtures lint errors are unrelated to this PR (verified by stashing changes and re-running lint on base)
- Manual verification that the /benchmark v1-results page renders correctly on GitHub: to be confirmed once Task A populates results.json
- Comparator integration test against fixtures: owned by Task A's runner PR

Theory-of-Constraints sequencing: install-path readiness (publishConfig flip + protocol-freeze CI gate) is the new trunk; vendor-directory submissions and the Cursor forum reply land in days 9-14 once @domscribe/mcp@1.0.0 is publicly installable. Adopts PE RFC 0003's falsifier (any of three branches missing fails the sprint) and inherits DOP's 2026-07-10 falsifier.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…s with WebMCP-conformant in-repo reference + chrome-devtools-mcp shim
Stands up the comparators sibling package consumed by @domscribe/benchmark's runner (sprint 2613 Task A) to score the third and fourth columns of the Code→UI Benchmark v1 results table. Implements the WebMCP-conformant in-repo reference per RFC 0004's fallback (no public WebMCP-conformant server installs cleanly in-sprint) and the chrome-devtools-mcp comparator as a thin client shim. Comparator interface, types, and outcomes are defined here so the runner can call any future external WebMCP server as a drop-in replacement. 17 unit tests, 84% line coverage.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…chmark comparator
The chrome-devtools-mcp@1.1.1 dev-dep backs the second comparator column on /benchmark. Version is pinned (no caret) so the published results table is reproducible byte-for-byte against the exact transport release we tested. Lockfile diff includes pnpm's routine forward-resolution of cookie / cors / qs / http-errors transitive patches; these are minor patches with no API impact.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ission amendment plan; link from README and TECHNICAL_SPEC
Publishes the static /benchmark v1 results page (docs/code-to-ui-benchmark/v1-results.md) as the public artifact RFC 0004's category-naming falsifier reads. Results table cells are schema-true placeholders; the runner in @domscribe/benchmark (sprint 2613 Task A) populates them from results.json. README "More" section and TECHNICAL_SPEC §9 link to the page, satisfying the "linked from README and TECHNICAL_SPEC" arm of Task B's acceptance criteria. Also publishes the Sprint 2612 vendor-submission amendment plan: the procedure to append the /benchmark link to the four IDE-vendor-directory submissions while their review windows are still open. Plan is explicit about the current sprint-2612 state (submissions tracker not yet on main as of sprint 2613 kickoff) so the trigger to execute is conditional on those landing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

nx-cloud · 2026-06-06T20:11:55Z

View your CI Pipeline Execution ↗ for commit 9250faf

| Command | Status | Duration | Result |
| --- | --- | --- | --- |
| `nx run domscribe-test-fixtures:integration--web...` | ✅ Succeeded | 1m 37s | View ↗ |
| `nx run domscribe-test-fixtures:integration--web...` | ✅ Succeeded | 1m 39s | View ↗ |
| `nx run domscribe-test-fixtures:integration--web...` | ✅ Succeeded | 1m 38s | View ↗ |
| `nx run domscribe-test-fixtures:integration--web...` | ✅ Succeeded | 1m 20s | View ↗ |
| `nx run domscribe-test-fixtures:install-fixture-...` | ✅ Succeeded | 51s | View ↗ |
| `nx run domscribe-test-fixtures:install-fixture-...` | ✅ Succeeded | 43s | View ↗ |
| `nx run domscribe-test-fixtures:install-fixture-...` | ✅ Succeeded | 22s | View ↗ |
| `nx run domscribe-test-fixtures:install-fixture-...` | ✅ Succeeded | 15s | View ↗ |
| `Additional runs (18)` | ✅ Succeeded | ... | View ↗ |


☁️ Nx Cloud last updated this comment at 2026-06-06 20:16:53 UTC

Narrator · 2026-06-06T22:39:33Z

Closing agent-generated PR during workspace cleanup.

pm-1 and others added 4 commits June 6, 2026 11:25

Narrator marked this pull request as ready for review June 6, 2026 20:10

Narrator closed this Jun 6, 2026

Narrator deleted the feat/benchmark-comparators-and-page branch June 6, 2026 22:39

Narrator wants to merge 4 commits into main from feat/benchmark-comparators-and-page
