4 changes: 4 additions & 0 deletions README.md
@@ -96,6 +96,10 @@ Click any element in the browser overlay, describe the change in plain English,
- 📁 **Annotations live in your repo** — stored as JSON files in `.domscribe/annotations/`, exposed via REST APIs that MCP wraps for agent access
- 📡 **Real-time feedback** — WebSocket relay pushes agent responses to the browser overlay as they happen

### Code→UI Benchmark

We score five source-mapped scenarios (S1 source-position query, S2 style provenance with source, S3 multi-instance enumeration, S4 runtime context probe, S5 annotation→source roundtrip) across three fixtures (Vite/React, Nuxt/Vue, Next.js) and three MCP comparators (RCP v1, `chrome-devtools-mcp`, an in-repo WebMCP-conformant reference). See **[Code→UI Benchmark v1 results](./docs/code-to-ui-benchmark/v1-results.md)** for the 45-cell table, methodology, and reproduction command.

---

## Manual Setup
4 changes: 4 additions & 0 deletions TECHNICAL_SPEC.md
@@ -1186,6 +1186,10 @@ Nx Task Runner (controls fixture-level parallelism)
| Vue | 3.3+ | Webpack 5 | Yes | VNode | Full |
| Vue | 3.3+ | Nuxt 3 | Yes | VNode | Full |

### Code→UI Benchmark

The Code→UI Benchmark v1 scores Domscribe's RCP-v1 surface against `chrome-devtools-mcp` and an in-repo WebMCP-conformant reference across five source-mapped scenarios (S1–S5) and three fixtures (Vite/React, Nuxt/Vue, Next.js) — 45 cells in total. See [`docs/code-to-ui-benchmark/v1-results.md`](./docs/code-to-ui-benchmark/v1-results.md) for the public results page, methodology, and reproduction command. The comparator implementations live in [`packages/domscribe-benchmark-comparators/`](./packages/domscribe-benchmark-comparators/README.md); the benchmark spec and runner live in `@domscribe/benchmark`.

---

## 10. CI/CD Pipeline
71 changes: 71 additions & 0 deletions docs/code-to-ui-benchmark/v1-results.md
@@ -0,0 +1,71 @@
# Code→UI Benchmark v1 — Results

**Spec:** [`v1.md`](./v1.md) (sprint 2613 — landed by Task A)
**Suite:** `@domscribe/benchmark@1.0.0`
**Comparators:** [`@domscribe/benchmark-comparators`](../../packages/domscribe-benchmark-comparators/README.md)

This page hosts the public 45-cell results table for the Code→UI Benchmark. The benchmark measures five source-mapped scenarios (S1–S5) across three fixtures (Vite/React, Nuxt/Vue, Next.js) and three MCP comparators (RCP v1, chrome-devtools-mcp, in-repo WebMCP-conformant reference).
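The 45-cell count follows from the cross product of scenarios, fixtures, and comparators. A minimal sketch of that enumeration is below; the constant and type names are hypothetical and are not taken from `@domscribe/benchmark`, only the labels mirror the ones used on this page.

```typescript
// Hypothetical identifiers for illustration; only the labels come from this page.
const SCENARIOS = ["S1", "S2", "S3", "S4", "S5"] as const;
const FIXTURES = ["Vite", "Nuxt", "Next"] as const;
const COMPARATORS = ["rcp-v1", "chrome-devtools-mcp", "webmcp-reference"] as const;

type Cell = {
  scenario: (typeof SCENARIOS)[number];
  fixture: (typeof FIXTURES)[number];
  comparator: (typeof COMPARATORS)[number];
};

// One cell per (scenario, fixture, comparator) triple: 5 x 3 x 3.
function enumerateCells(): Cell[] {
  const cells: Cell[] = [];
  for (const scenario of SCENARIOS) {
    for (const fixture of FIXTURES) {
      for (const comparator of COMPARATORS) {
        cells.push({ scenario, fixture, comparator });
      }
    }
  }
  return cells;
}

console.log(enumerateCells().length); // 45
```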

> [!NOTE]
> Last-run timestamp, comparator versions, and the 45-cell results table are populated by the runner in `@domscribe/benchmark` (sprint 2613 Task A). This document is the scaffold; the results below are placeholders until the runner produces `results.json` and the page is regenerated. The placeholders are intentional and unambiguous — empty cells get a literal `—`, not a value.

## Reproduce this

```bash
git clone https://github.com/patchorbit/domscribe.git
cd domscribe
pnpm install
pnpm nx test domscribe-benchmark
# Runner writes results.json next to this page; re-render with:
pnpm nx run domscribe-benchmark:render-results
```

## Comparators

| Name | External validity | Version | Notes |
| --------------------- | ----------------- | ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `rcp-v1` | Subject | (RCP v1 spec, [`docs/rcp/v1.md`](../rcp/v1.md)) | The artifact under measurement; RCP v1 is documentation-versioned per RFC 0003 |
| `chrome-devtools-mcp` | External | `1.1.1` (pinned in root `package.json`) | Generic runtime browser-MCP; no build-time source correlation |
| `webmcp-reference` | In-repo fallback | `0.1.0-in-repo-reference` | Minimal WebMCP-conformant reference per RFC 0004 fallback; replace with external WebMCP server when one is installable in-sprint |

### Why one column is an in-repo reference

The benchmark would prefer three external comparators. As of sprint 2613 there is no public WebMCP-conformant server that installs cleanly on a clean clone — the in-repo reference exists so the third column is reproducible. It is, by definition, a comparison where we wrote both sides; that caveat is recorded here and on the column header in the results table. Replacing the reference with a real external server is the structural fix and is tracked as a follow-on.

## Results

> **Status:** Pending Task A's runner. The table below is the schema-true placeholder; cells are populated by `results.json` at runner-completion time.

| Scenario | Fixture | `rcp-v1` | `chrome-devtools-mcp` | `webmcp-reference` |
| ------------------------------- | ------- | -------- | --------------------- | ------------------ |
| S1 source-position query | Vite | — | — | — |
| S1 source-position query | Nuxt | — | — | — |
| S1 source-position query | Next | — | — | — |
| S2 style provenance with source | Vite | — | — | — |
| S2 style provenance with source | Nuxt | — | — | — |
| S2 style provenance with source | Next | — | — | — |
| S3 multi-instance enumeration | Vite | — | — | — |
| S3 multi-instance enumeration | Nuxt | — | — | — |
| S3 multi-instance enumeration | Next | — | — | — |
| S4 runtime context probe | Vite | — | — | — |
| S4 runtime context probe | Nuxt | — | — | — |
| S4 runtime context probe | Next | — | — | — |
| S5 annotation→source roundtrip | Vite | — | — | — |
| S5 annotation→source roundtrip | Nuxt | — | — | — |
| S5 annotation→source roundtrip | Next | — | — | — |

**Cell legend:** `pass` (answered correctly per the binary rubric in v1.md), `refused` (tool reported it does not implement the surface), `wrong` (answered incorrectly), `—` (not yet run).
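As an illustration of the legend, the sketch below renders one table row from a partial result set and falls back to the dash placeholder for cells that have not run. The `Results` shape and the `scenario|fixture|comparator` key format are assumptions made for this example; the actual `results.json` schema is defined by the runner and may differ.

```typescript
// The three recorded outcomes from the legend; "not yet run" is the absence of an outcome.
type Outcome = "pass" | "refused" | "wrong";

// The literal dash placeholder used in the table (U+2014).
const NOT_RUN = "\u2014";

// Assumed shape: "scenario|fixture|comparator" -> outcome. Not the real results.json schema.
type Results = Partial<Record<string, Outcome>>;

const COLUMNS = ["rcp-v1", "chrome-devtools-mcp", "webmcp-reference"];

// Render one Markdown table row; missing cells get the placeholder, never a guessed value.
function renderRow(results: Results, scenario: string, fixture: string): string {
  const cells = COLUMNS.map((column) => results[`${scenario}|${fixture}|${column}`] ?? NOT_RUN);
  return `| ${scenario} | ${fixture} | ${cells.join(" | ")} |`;
}

const partialResults: Results = {
  "S1|Vite|rcp-v1": "pass",
  "S1|Vite|chrome-devtools-mcp": "refused",
};

console.log(renderRow(partialResults, "S1", "Vite"));
```

Keeping "not yet run" out of the `Outcome` type means a refusal can never be confused with a missing value.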

## Methodology

Each comparator implements a single `Comparator` interface (see [`packages/domscribe-benchmark-comparators/src/types/`](../../packages/domscribe-benchmark-comparators/src/types/index.ts)) and is invoked by the runner with a `ScenarioPrompt`. The binary rubric for each scenario (what counts as `pass`) is defined in `v1.md`. No tool gets credit for partial answers; refusal is a recorded outcome, not a missing value.
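The binary scoring rule can be sketched as follows. Every type here (`ComparatorSketch`, `ScenarioPromptSketch`, `ComparatorAnswer`) is a hypothetical stand-in written for this example; the real `Comparator` and `ScenarioPrompt` definitions live in the comparators package and may look different (for instance, they are probably asynchronous). The exact-match check is also an assumption standing in for the per-scenario rubric in `v1.md`.

```typescript
// Hypothetical stand-ins; not the real Comparator / ScenarioPrompt types.
interface ScenarioPromptSketch {
  scenarioId: string;
  fixture: string;
  question: string;
}

type ComparatorAnswer =
  | { kind: "answer"; value: string }
  | { kind: "refused"; reason: string };

interface ComparatorSketch {
  name: string;
  run(prompt: ScenarioPromptSketch): ComparatorAnswer; // synchronous for brevity
}

type Verdict = "pass" | "refused" | "wrong";

// Binary rubric: a refusal is recorded as such; anything short of the expected answer is wrong.
function scoreCell(comparator: ComparatorSketch, prompt: ScenarioPromptSketch, expected: string): Verdict {
  const answer = comparator.run(prompt);
  if (answer.kind === "refused") return "refused";
  return answer.value === expected ? "pass" : "wrong";
}

// Illustrative prompt and expected value; the path is made up for this example.
const samplePrompt: ScenarioPromptSketch = { scenarioId: "S1", fixture: "Vite", question: "Where is the button defined?" };
const expectedAnswer = "src/App.tsx:12:5";

const exact: ComparatorSketch = { name: "exact", run: () => ({ kind: "answer", value: "src/App.tsx:12:5" }) };
const partialAnswer: ComparatorSketch = { name: "partial", run: () => ({ kind: "answer", value: "src/App.tsx" }) };
const refuser: ComparatorSketch = { name: "refuser", run: () => ({ kind: "refused", reason: "surface not implemented" }) };

console.log(scoreCell(exact, samplePrompt, expectedAnswer));
console.log(scoreCell(partialAnswer, samplePrompt, expectedAnswer));
console.log(scoreCell(refuser, samplePrompt, expectedAnswer));
```

The partial answer (file without position) scores `wrong`, which is the "no credit for partial answers" rule in miniature.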

The fixtures expose components at known source positions, documented per fixture in `packages/domscribe-test-fixtures/fixtures/*/README.md`. Source-position changes to fixture components are gated by the same `protocol:freeze-check` CI gate that guards RCP v1. In sprint 2613 that gate was extended to also fire when the benchmark spec is edited without a corresponding runner update.

## Falsifier

**Engineering portion** (sprint 2613 close, 2026-07-10): all 45 cells populated by a reproducible runner; this page linked from `README.md` and `TECHNICAL_SPEC.md`.

**Adoption portion** (sprint 2614 close, 2026-08-07): one of the following:

- a third-party public artifact names "Code→UI" or "runtime-context protocol" as Domscribe's category descriptor;
- an OSS reference-integration PR cites a benchmark scenario by ID;
- an external repo forks `@domscribe/benchmark` to run their own scoring.

See [RFC 0004](../rfcs/0004-code-to-ui-benchmark-v1.md) for the full falsifier and retreat path.

## Visual editors (qualitative)

Cursor Visual Editor, Stagewise, and similar click-driven UI tools are out of the scoring table by design — they have no programmatic query surface and comparing MCP returns to human clicks is not a reproducible comparison. The category-level note: visual editors solve UI→Code (point and tell) without source-correlation guarantees; the Code→UI direction (an agent querying a running app) is the surface this benchmark measures. The [README](../../README.md#features) describes both directions.
42 changes: 42 additions & 0 deletions docs/code-to-ui-benchmark/vendor-submission-amendment-plan.md
@@ -0,0 +1,42 @@
# Code→UI Benchmark — Sprint 2612 vendor-submission amendment plan

**Owner:** staff-swe-4 (sprint 2613 Task B)
**Trigger to execute:** `docs/code-to-ui-benchmark/v1-results.md` is live and populated by Task A's runner with non-placeholder results.

## Why this exists

Sprint 2612 (PE [RFC 0003](../rfcs/0003-documentation-versioned-rcp-v1.md) / [sprint 2492 plan](../sprints/2492.md)) commits to filing four IDE-vendor-directory submissions (Cursor MCP / Cline / Continue / Codex) and one Cursor forum reply. The Code→UI Benchmark page is a category-level credibility surface that strengthens every one of those submissions, but only if the link is added to each submission while its review window is still open. This document is the checklist Task B's owner runs at the moment the results page goes live.

## Current state (as of sprint 2613 kickoff, 2026-06-26)

| Sprint 2612 deliverable | Expected location | State on `main` | Amendable? |
| ------------------------------------------- | --------------------- | --------------- | ---------------------------- |
| `docs/submissions.md` (submission tracker) | `docs/submissions.md` | **Missing** | Depends on sprint 2612 close |
| `docs/install/<channel>.json` × 8 | `docs/install/` | **Missing** | Depends on sprint 2612 close |
| `docs/posts/2026-06-cursor-forum-146166.md` | `docs/posts/` | **Missing** | Depends on sprint 2612 close |
| Cursor MCP directory PR | external repo | Not filed | Once filed |
| Cline directory PR | external repo | Not filed | Once filed |
| Continue directory PR | external repo | Not filed | Once filed |
| Codex directory PR | external repo | Not filed | Once filed |

**Implication:** Sprint 2612's vendor-submission deliverables have not landed on `main` as of sprint 2613 kickoff. The amendment plan below is the procedure to execute the moment they do. If sprint 2612 closes without those deliverables landing, the scope changes from "amend existing submissions" to "include the benchmark link from the first submission", and the procedure becomes simpler because there is nothing to amend.

## Amendment procedure

For each Sprint 2612 submission that is still in review (per `docs/submissions.md` review-status column) when this PR's [`v1-results.md`](./v1-results.md) is populated:

1. **Read the submission's review status** in `docs/submissions.md`. If it is `merged` or `closed`, skip to step 4.
2. **Push an addendum commit** to the submission PR adding the `/benchmark` link to the README/description section of the directory entry. Use this language verbatim so the four submissions converge on the same category descriptor:

> Domscribe is the Code→UI category — runtime-context queries from a coding agent to a running browser, source-correlated by the build-time RCP v1 protocol. See [Code→UI Benchmark v1 results](https://github.com/patchorbit/domscribe/blob/main/docs/code-to-ui-benchmark/v1-results.md) for a 45-cell scoring against `chrome-devtools-mcp` and a WebMCP-conformant reference across React, Vue, and Next.js fixtures.

3. **Update `docs/submissions.md`** with a row noting the amendment date and the commit SHA of the addendum.
4. **For submissions whose review window is closed** (no longer accepting commits), add a one-line entry to [`docs/sprints/2613.md`](../sprints/2613.md) under the "Sprint 2612 submission amendments — could not amend" section, naming the directory and the close reason.
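The branching in the steps above can be sketched as a small decision function. The `in-review` status value and all type names are assumptions for illustration (only `merged` and `closed` appear in the procedure); the real status vocabulary is whatever `docs/submissions.md` ends up using.

```typescript
// "merged" and "closed" come from step 1; "in-review" is an assumed label for the remaining case.
type SubmissionStatus = "in-review" | "merged" | "closed";

interface AmendmentPlan {
  pushAddendum: boolean;        // step 2: addendum commit on the submission PR
  updateTracker: boolean;       // step 3: new row in docs/submissions.md
  recordCouldNotAmend: boolean; // step 4: one-line entry in docs/sprints/2613.md
}

// Step 1 reads the status; merged or closed submissions skip straight to step 4.
function planAmendment(status: SubmissionStatus): AmendmentPlan {
  const windowClosed = status === "merged" || status === "closed";
  return {
    pushAddendum: !windowClosed,
    updateTracker: !windowClosed,
    recordCouldNotAmend: windowClosed,
  };
}

console.log(planAmendment("in-review"));
console.log(planAmendment("merged"));
```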

## Cursor forum reply (sprint 2612 T9)

If the forum reply at [`forum.cursor.com/t/146166`](https://forum.cursor.com/t/click-to-source-from-browser-visual-editor-inspect-open-file-line/146166) has been posted and the thread is still active, post a one-paragraph follow-up linking to `/benchmark`. Do not edit the original reply; Cursor's forum etiquette favors transparent threading over silent edits. If the thread has gone cold (no replies in 7+ days at the moment this amendment is executed), do not bump it; surface it to Kaushik instead.
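The 7-day rule can be expressed as a small check. The function and action names are hypothetical, and the behavior when the original reply was never posted is an assumption, since this plan does not cover that case.

```typescript
type ForumAction = "post-follow-up" | "surface-to-owner" | "none";

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const COLD_AFTER_DAYS = 7; // "no replies in 7+ days" counts as cold

function decideForumAction(replyPosted: boolean, lastReplyAt: Date, now: Date): ForumAction {
  // Assumption: with no original reply there is nothing to follow up on.
  if (!replyPosted) return "none";
  const daysSinceLastReply = (now.getTime() - lastReplyAt.getTime()) / MS_PER_DAY;
  // Cold thread: do not bump it, escalate instead. Active thread: post the follow-up.
  return daysSinceLastReply >= COLD_AFTER_DAYS ? "surface-to-owner" : "post-follow-up";
}

const lastReply = new Date("2026-07-01T00:00:00Z");
console.log(decideForumAction(true, lastReply, new Date("2026-07-05T00:00:00Z"))); // 4 days: post-follow-up
console.log(decideForumAction(true, lastReply, new Date("2026-07-09T00:00:00Z"))); // 8 days: surface-to-owner
```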

## Cutoff for executing this plan

Per [RFC 0004 falsifier](../rfcs/0004-code-to-ui-benchmark-v1.md): by **2026-07-10 sprint 2613 close**. After that date the engineering window is closed and any unexecuted amendments are deferred to the sprint 2614 retro decision.