feat(benchmark): scaffold Code→UI Benchmark comparators (chrome-devtools-mcp + WebMCP reference) and /benchmark v1 results page#48

Closed
Narrator wants to merge 4 commits into main from feat/benchmark-comparators-and-page

@Narrator Narrator commented Jun 6, 2026

Summary

This PR delivers sprint 2613 Task B (benchmark-comparators-and-page) — the comparator implementations and the public results-page scaffold for the Code→UI Benchmark v1.

Three things land:

  1. packages/domscribe-benchmark-comparators/: new sibling package containing the Comparator interface, the WebMCP-conformant in-repo reference server (RFC 0004's fallback, used because no public WebMCP-conformant server installs cleanly in-sprint), and the chrome-devtools-mcp comparator shim. 17 unit tests; 84% statement coverage and 83% line coverage. The runner in @domscribe/benchmark (Task A) imports Comparator implementations from here.

  2. chrome-devtools-mcp@1.1.1 pinned as root dev-dep — exact-version pin (no caret) so the published 45-cell results table is reproducible byte-for-byte. Lockfile diff includes pnpm's routine forward-resolution of cookie / cors / qs / http-errors transitive patches.

  3. /benchmark static results page (docs/code-to-ui-benchmark/v1-results.md) — public artifact RFC 0004's category-naming falsifier reads. Schema-true placeholder cells; Task A's runner populates them from results.json. Linked from README.md (Features section) and TECHNICAL_SPEC.md (§9 Framework Support Matrix). Methodology, comparator table with external-validity flag, reproduction command, and falsifier text all included.
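The exact-version pin in item 2 can be checked mechanically. The following is a minimal sketch, not code from this PR: the helper names `isExactPin` and `findUnpinned` are hypothetical, and the manifest fragment is illustrative only.

```typescript
// Hypothetical helpers (not part of the PR). An exact pin is a bare semver
// version such as "1.1.1"; ranges such as "^1.1.1" or "~1.1.1" are rejected.
const EXACT_VERSION =
  /^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$/;

function isExactPin(spec: string): boolean {
  return EXACT_VERSION.test(spec.trim());
}

// Returns the names in `mustBePinned` that are missing or not exactly pinned.
function findUnpinned(
  devDependencies: Record<string, string>,
  mustBePinned: readonly string[],
): string[] {
  return mustBePinned.filter((name) => {
    const spec = devDependencies[name];
    return spec === undefined || !isExactPin(spec);
  });
}

// Illustrative manifest fragment mirroring the pin described above.
const devDependencies: Record<string, string> = {
  "chrome-devtools-mcp": "1.1.1",
};
const unpinned = findUnpinned(devDependencies, ["chrome-devtools-mcp"]);
```

A check like this could run in CI so that a later dependency bump that reintroduces a caret is caught before the results table is regenerated.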

Plus: docs/code-to-ui-benchmark/vendor-submission-amendment-plan.md — the checklist for amending Sprint 2612 IDE-vendor-directory submissions with the /benchmark link while review windows are still open.

Why a separate package boundary

The comparators have a different stability surface than the benchmark spec: the WebMCP reference is intentionally drop-in replaceable once a real external WebMCP-conformant server is installable, and the chrome-devtools-mcp shim is a thin wrapper over a pinned external dev-dep. Co-locating them with the spec/runner in @domscribe/benchmark would couple two distinct evolution rhythms. Each package can publish independently when its surface changes.
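To illustrate the drop-in replaceability argued above, here is a rough sketch of what a runner-facing contract could look like. The PR confirms only that a Comparator interface exists and that the runner calls run(prompt); every field name, the outcome values, and the stub factory below are assumptions made for illustration.

```typescript
// Assumed shapes: only `Comparator` and `run(prompt)` are named in the PR.
type Outcome = "pass" | "fail" | "error";

interface ComparatorResult {
  outcome: Outcome;
  detail?: string;
}

interface Comparator {
  readonly id: string;
  // "in-repo-fallback" echoes the external-validity caveat on /benchmark.
  readonly externalValidity: "external" | "in-repo-fallback";
  run(prompt: string): Promise<ComparatorResult>;
}

// Hypothetical stand-in for the in-repo reference; it does no real work.
function createStubReferenceComparator(): Comparator {
  return {
    id: "webmcp-reference",
    externalValidity: "in-repo-fallback",
    async run(prompt: string): Promise<ComparatorResult> {
      return prompt.trim().length > 0
        ? { outcome: "pass" }
        : { outcome: "error", detail: "empty prompt" };
    },
  };
}

// The runner depends only on the interface, so a future external server
// can replace the stub without any change here.
async function runAll(
  comparators: readonly Comparator[],
  prompt: string,
): Promise<Map<string, ComparatorResult>> {
  const results = new Map<string, ComparatorResult>();
  for (const comparator of comparators) {
    try {
      results.set(comparator.id, await comparator.run(prompt));
    } catch (err) {
      results.set(comparator.id, { outcome: "error", detail: String(err) });
    }
  }
  return results;
}
```

Because `runAll` sees only the interface, swapping the in-repo reference for an external server would be a change to the factory call, not to the runner.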

Acceptance criteria mapping (PM task B)

| Criterion | This PR | Notes |
| --- | --- | --- |
| chrome-devtools-mcp installed as a pinned dev-dep | | package.json line; lockfile updated |
| Comparator wired to run against 5 scenarios × 3 fixtures = 15 cells | ⚠️ Partial | Comparator class is wired; fixture instrumentation and the runner invocation live in Task A's @domscribe/benchmark. The 15 cells materialise once Task A's runner imports createChromeDevtoolsMcpComparator and calls run(prompt) against the fixtures |
| WebMCP-conformant reference wired | | In-repo reference per RFC 0004 fallback; /benchmark page surfaces the external-validity: in-repo-fallback caveat verbatim |
| /benchmark page live, linked from README.md and TECHNICAL_SPEC.md | | docs/code-to-ui-benchmark/v1-results.md with last-run timestamp slot, comparator versions, methodology link to the spec, and reproduction command snippet |
| Sprint 2612 vendor-directory submissions amended | ⚠️ Conditional | docs/submissions.md and the four submission PRs are not yet on main as of sprint 2613 kickoff; see vendor-submission-amendment-plan.md for the trigger-conditional procedure |
| Engineer logs ≥8h of OSS reference-integration pairing time | ⚠️ Out of scope for this PR | Pairing time will appear in sprint retro notes per PM acceptance |

Dependencies on Task A (Sprint 2613 batch A)

This PR is structurally a Task B deliverable. Task A (benchmark-rcp-foundation: the @domscribe/benchmark package, the docs/code-to-ui-benchmark/v1.md spec, and the RCP-v1 column with 15 cells) has not yet landed on main as of this PR. The Comparator interface here was designed to be the contract Task A's runner consumes, but:

  • The benchmark spec doc (docs/code-to-ui-benchmark/v1.md) — referenced from this PR's v1-results.md and from the comparators README — does not yet exist. Links resolve to a 404 until Task A merges.
  • The fixture instrumentation for S1–S5 lives with Task A. Until that lands, the 15 cells per comparator column remain as placeholders on /benchmark.
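The placeholder cells mentioned above can be pictured as a 5 × 3 grid per comparator column that starts out pending and is overlaid with results once they exist. This is only a sketch: the scenario ids S1 to S5 come from this PR, while the fixture names, the cell shape, and the results format are hypothetical, since the results.json schema is not shown here.

```typescript
// Scenario ids are from the PR text; fixture names are placeholders.
const SCENARIOS = ["S1", "S2", "S3", "S4", "S5"] as const;
const FIXTURES = ["fixture-a", "fixture-b", "fixture-c"] as const;

// Hypothetical cell shape: "pending" marks a placeholder not yet populated.
type CellStatus = "pending" | "pass" | "fail";
interface Cell {
  key: string; // "<scenario>/<fixture>"
  status: CellStatus;
}

// Build one comparator column: every scenario paired with every fixture.
function emptyColumn(): Cell[] {
  const cells: Cell[] = [];
  for (const scenario of SCENARIOS) {
    for (const fixture of FIXTURES) {
      cells.push({ key: `${scenario}/${fixture}`, status: "pending" });
    }
  }
  return cells;
}

// Overlay known results; cells without a result stay pending.
function populate(
  column: readonly Cell[],
  results: Partial<Record<string, "pass" | "fail">>,
): Cell[] {
  return column.map((cell) => ({
    ...cell,
    status: results[cell.key] ?? cell.status,
  }));
}

const column = populate(emptyColumn(), { "S1/fixture-a": "pass" });
const pendingCount = column.filter((c) => c.status === "pending").length;
```

With three columns of this shape, the grid totals the 45 cells of the published results table.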

When Task A merges, the runner's import statement should look approximately like:

import {
  createChromeDevtoolsMcpComparator,
  createWebMcpReferenceComparator,
} from '@domscribe/benchmark-comparators';

No rebase conflict expected — Task A and this PR touch disjoint paths (packages/domscribe-benchmark/* vs. packages/domscribe-benchmark-comparators/*).

Pre-existing blocker (escalation, not a blocker for this PR)

Sprint 2492's RCP-v1 stamp (PE RFC 0003 falsifier) has not landed on main as of sprint 2613 kickoff — no docs/rcp/v1.md, no docs/submissions.md. The PM plan's replanning trigger #1 names this as a "full Sprint 2613 replan" trigger because the benchmark has no v1 subject without RCP v1. This PR does not depend on RCP v1 landing because the comparators are framework-agnostic, but Task A and the /benchmark page's RCP-v1 column do depend on it. Surfacing here for visibility; a separate Slack escalation to Kaushik covers the broader sprint-thesis risk.

Test plan

  • pnpm nx run domscribe-benchmark-comparators:lint — passes
  • pnpm nx run domscribe-benchmark-comparators:test — 17/17 tests pass; 84% statements / 71% branches / 92% functions / 83% lines coverage (all above configured thresholds)
  • pnpm nx run domscribe-benchmark-comparators:build — compiles cleanly
  • pnpm nx affected -t lint --base=main~1 — comparators package + all other affected packages pass; pre-existing domscribe-test-fixtures lint errors are unrelated to this PR (verified by stashing changes and re-running lint on base)
  • Manual verification that /benchmark v1-results page renders correctly on GitHub — to be confirmed once Task A populates results.json
  • Comparator integration test against fixtures — owned by Task A's runner PR

pm-1 and others added 4 commits June 6, 2026 11:25
Theory-of-Constraints sequencing: install-path readiness (publishConfig
flip + protocol-freeze CI gate) is the new trunk; vendor-directory
submissions and the Cursor forum reply land in days 9-14 once
@domscribe/mcp@1.0.0 is publicly installable.

Adopts PE RFC 0003's falsifier (any of three branches missing fails
the sprint) and inherits DOP's 2026-07-10 falsifier.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s with WebMCP-conformant in-repo reference + chrome-devtools-mcp shim

Stands up the comparators sibling package consumed by @domscribe/benchmark's runner
(sprint 2613 Task A) to score the third and fourth columns of the Code→UI Benchmark
v1 results table. Implements the WebMCP-conformant in-repo reference per RFC 0004's
fallback (no public WebMCP-conformant server installs cleanly in-sprint) and the
chrome-devtools-mcp comparator as a thin client shim. Comparator interface, types,
and outcomes are defined here so the runner can call any future external WebMCP
server as a drop-in replacement. 17 unit tests, 84% line coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…chmark comparator

The chrome-devtools-mcp@1.1.1 dev-dep backs the second comparator column on
/benchmark. Version is pinned (no caret) so the published results table is
reproducible byte-for-byte against the exact transport release we tested.
Lockfile diff includes pnpm's routine forward-resolution of cookie / cors / qs /
http-errors transitive patches; these are minor patches with no API impact.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ission amendment plan; link from README and TECHNICAL_SPEC

Publishes the static /benchmark v1 results page (docs/code-to-ui-benchmark/v1-results.md)
as the public artifact RFC 0004's category-naming falsifier reads. Results table cells
are schema-true placeholders; the runner in @domscribe/benchmark (sprint 2613 Task A)
populates them from results.json. README "More" section and TECHNICAL_SPEC §9 link to
the page, satisfying the "linked from README and TECHNICAL_SPEC" arm of Task B's
acceptance criteria.

Also publishes the Sprint 2612 vendor-submission amendment plan — the procedure to
append the /benchmark link to the four IDE-vendor-directory submissions while their
review windows are still open. Plan is explicit about the current sprint-2612 state
(submissions tracker not yet on main as of sprint 2613 kickoff) so the trigger to
execute is conditional on those landing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Narrator Narrator marked this pull request as ready for review June 6, 2026 20:10
nx-cloud Bot commented Jun 6, 2026

View your CI Pipeline Execution ↗ for commit 9250faf

| Command | Status | Duration |
| --- | --- | --- |
| nx run domscribe-test-fixtures:integration--web... | ✅ Succeeded | 1m 37s |
| nx run domscribe-test-fixtures:integration--web... | ✅ Succeeded | 1m 39s |
| nx run domscribe-test-fixtures:integration--web... | ✅ Succeeded | 1m 38s |
| nx run domscribe-test-fixtures:integration--web... | ✅ Succeeded | 1m 20s |
| nx run domscribe-test-fixtures:install-fixture-... | ✅ Succeeded | 51s |
| nx run domscribe-test-fixtures:install-fixture-... | ✅ Succeeded | 43s |
| nx run domscribe-test-fixtures:install-fixture-... | ✅ Succeeded | 22s |
| nx run domscribe-test-fixtures:install-fixture-... | ✅ Succeeded | 15s |
| Additional runs (18) | ✅ Succeeded | ... |

☁️ Nx Cloud last updated this comment at 2026-06-06 20:16:53 UTC

Narrator commented Jun 6, 2026

Closing agent-generated PR during workspace cleanup.

@Narrator Narrator closed this Jun 6, 2026
@Narrator Narrator deleted the feat/benchmark-comparators-and-page branch June 6, 2026 22:39