feat(benchmark): scaffold Code→UI Benchmark comparators (chrome-devtools-mcp + WebMCP reference) and /benchmark v1 results page #48
Closed
Narrator wants to merge 4 commits into
Theory-of-Constraints sequencing: install-path readiness (publishConfig flip + protocol-freeze CI gate) is the new trunk; vendor-directory submissions and the Cursor forum reply land in days 9-14 once @domscribe/mcp@1.0.0 is publicly installable. Adopts PE RFC 0003's falsifier (any of three branches missing fails the sprint) and inherits DOP's 2026-07-10 falsifier.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…s with WebMCP-conformant in-repo reference + chrome-devtools-mcp shim
Stands up the comparators sibling package consumed by @domscribe/benchmark's runner (sprint 2613 Task A) to score the third and fourth columns of the Code→UI Benchmark v1 results table. Implements the WebMCP-conformant in-repo reference per RFC 0004's fallback (no public WebMCP-conformant server installs cleanly in-sprint) and the chrome-devtools-mcp comparator as a thin client shim. Comparator interface, types, and outcomes are defined here so the runner can call any future external WebMCP server as a drop-in replacement. 17 unit tests, 84% line coverage.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…chmark comparator
The chrome-devtools-mcp@1.1.1 dev-dep backs the second comparator column on /benchmark. Version is pinned (no caret) so the published results table is reproducible byte-for-byte against the exact transport release we tested. Lockfile diff includes pnpm's routine forward-resolution of cookie / cors / qs / http-errors transitive patches; these are minor patches with no API impact.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ission amendment plan; link from README and TECHNICAL_SPEC
Publishes the static /benchmark v1 results page (docs/code-to-ui-benchmark/v1-results.md) as the public artifact RFC 0004's category-naming falsifier reads. Results table cells are schema-true placeholders; the runner in @domscribe/benchmark (sprint 2613 Task A) populates them from results.json. README "More" section and TECHNICAL_SPEC §9 link to the page, satisfying the "linked from README and TECHNICAL_SPEC" arm of Task B's acceptance criteria. Also publishes the Sprint 2612 vendor-submission amendment plan: the procedure to append the /benchmark link to the four IDE-vendor-directory submissions while their review windows are still open. The plan is explicit about the current sprint-2612 state (submissions tracker not yet on main as of sprint 2613 kickoff), so the trigger to execute is conditional on those landing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closing agent-generated PR during workspace cleanup.
Summary
This PR delivers sprint 2613 Task B (`benchmark-comparators-and-page`): the comparator implementations and the public results-page scaffold for the Code→UI Benchmark v1.

Three things land:

- `packages/domscribe-benchmark-comparators/`: a new sibling package containing the `Comparator` interface, the WebMCP-conformant in-repo reference server (per RFC 0004's fallback, since no public WebMCP-conformant server installs cleanly in-sprint), and the `chrome-devtools-mcp` comparator shim. 17 unit tests, 84% statement coverage. The runner in `@domscribe/benchmark` (Task A) imports `Comparator` implementations from here.
- `chrome-devtools-mcp@1.1.1` pinned as a root dev-dep: an exact-version pin (no caret) so the published 45-cell results table is reproducible byte-for-byte. The lockfile diff includes pnpm's routine forward-resolution of `cookie` / `cors` / `qs` / `http-errors` transitive patches.
- `/benchmark` static results page (`docs/code-to-ui-benchmark/v1-results.md`): the public artifact that RFC 0004's category-naming falsifier reads. Cells are schema-true placeholders; Task A's runner populates them from `results.json`. Linked from `README.md` (Features section) and `TECHNICAL_SPEC.md` (§9 Framework Support Matrix). Methodology, comparator table with `external-validity` flag, reproduction command, and falsifier text are all included.

Plus:

- `docs/code-to-ui-benchmark/vendor-submission-amendment-plan.md`: the checklist for amending Sprint 2612 IDE-vendor-directory submissions with the `/benchmark` link while review windows are still open.

Why a separate package boundary
The comparators have a different stability surface than the benchmark spec: the WebMCP reference is intentionally drop-in replaceable once a real external WebMCP-conformant server is installable, and the `chrome-devtools-mcp` shim is a thin wrapper over a pinned external dev-dep. Co-locating them with the spec/runner in `@domscribe/benchmark` would couple two distinct evolution rhythms. Each package can publish independently when its surface changes.

Acceptance criteria mapping (PM task B)

- `chrome-devtools-mcp` installed as a pinned dev-dep: `package.json` line; lockfile updated
- `@domscribe/benchmark`: the 15 cells materialise once Task A's runner imports `createChromeDevtoolsMcpComparator` and calls `run(prompt)` against the fixtures
- `/benchmark` page surfaces the `external-validity: in-repo-fallback` caveat verbatim
- `/benchmark` page live, linked from README.md and TECHNICAL_SPEC.md: `docs/code-to-ui-benchmark/v1-results.md` with last-run timestamp slot, comparator versions, methodology link to the spec, and reproduction command snippet
- `docs/submissions.md` and the four submission PRs are not yet on `main` as of sprint 2613 kickoff; see `vendor-submission-amendment-plan.md` for the trigger-conditional procedure

Dependencies on Task A (Sprint 2613 batch A)
This PR is structurally a Task B deliverable; Task A (`benchmark-rcp-foundation`: the `@domscribe/benchmark` package, the `docs/code-to-ui-benchmark/v1.md` spec, and the RCP-v1 column with 15 cells) has not yet landed on `main` as of this PR. The `Comparator` interface here was designed to be the contract Task A's runner consumes, but:

- The spec (`docs/code-to-ui-benchmark/v1.md`), referenced from this PR's `v1-results.md` and from the comparators README, does not yet exist. Links resolve to a `404` until Task A merges.
- Results cells remain `—` placeholders on `/benchmark`.

When Task A merges, the runner's import statement should look approximately like:
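A minimal sketch of that import and call shape. Only `Comparator`, `createChromeDevtoolsMcpComparator`, and `run(prompt)` are named in this PR; the package specifier `@domscribe/benchmark-comparators`, the `ComparatorOutcome` fields, the `scoreColumn` helper, and the stub factory body are assumptions made so the example runs on its own.

```typescript
// Hypothetical import once both packages are on main (package name is assumed):
// import { createChromeDevtoolsMcpComparator } from '@domscribe/benchmark-comparators';

// Assumed outcome shape; the real fields are defined in the comparators package.
interface ComparatorOutcome {
  comparator: string;
  prompt: string;
  status: 'pass' | 'fail' | 'error';
}

// Contract the runner consumes: every comparator exposes run(prompt).
interface Comparator {
  readonly name: string;
  run(prompt: string): Promise<ComparatorOutcome>;
}

// Stand-in for the real factory so this sketch is self-contained.
function createChromeDevtoolsMcpComparator(): Comparator {
  const name = 'chrome-devtools-mcp';
  return {
    name,
    async run(prompt: string): Promise<ComparatorOutcome> {
      // A real shim would forward the prompt to the pinned chrome-devtools-mcp client.
      return { comparator: name, prompt, status: 'error' };
    },
  };
}

// Hypothetical runner helper: one results cell per (comparator, prompt) pair.
async function scoreColumn(
  comparators: Comparator[],
  prompts: string[],
): Promise<ComparatorOutcome[]> {
  const cells: ComparatorOutcome[] = [];
  for (const comparator of comparators) {
    for (const prompt of prompts) {
      cells.push(await comparator.run(prompt));
    }
  }
  return cells;
}
```

Because the helper depends only on the `Comparator` interface, a future external WebMCP server wrapped in the same interface could be passed in the same array without changing the runner.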
No rebase conflict is expected: Task A and this PR touch disjoint paths (`packages/domscribe-benchmark/*` vs. `packages/domscribe-benchmark-comparators/*`).

Pre-existing blocker (escalation, not a blocker for this PR)

Sprint 2492's RCP-v1 stamp (PE RFC 0003 falsifier) has not landed on `main` as of sprint 2613 kickoff: there is no `docs/rcp/v1.md` and no `docs/submissions.md`. The PM plan's replanning trigger #1 names this as a "full Sprint 2613 replan" trigger because the benchmark has no v1 subject without RCP v1. This PR does not depend on RCP v1 landing because the comparators are framework-agnostic, but Task A and the `/benchmark` page's RCP-v1 column do depend on it. Surfacing here for visibility; a separate Slack escalation to Kaushik covers the broader sprint-thesis risk.

Test plan
- `pnpm nx run domscribe-benchmark-comparators:lint`: passes
- `pnpm nx run domscribe-benchmark-comparators:test`: 17/17 tests pass; 84% statements / 71% branches / 92% functions / 83% lines coverage (all above configured thresholds)
- `pnpm nx run domscribe-benchmark-comparators:build`: compiles cleanly
- `pnpm nx affected -t lint --base=main~1`: comparators package and all other affected packages pass; pre-existing `domscribe-test-fixtures` lint errors are unrelated to this PR (verified by stashing changes and re-running lint on base)
- `/benchmark` v1-results page renders correctly on GitHub: to be confirmed once Task A populates `results.json`