83 changes: 83 additions & 0 deletions docs/sprints/3071-rfc-0001-baseline.harness-self-test.json
@@ -0,0 +1,83 @@

> @domscribe/test-fixtures@0.0.1 test:falsifier /tmp/wt-sprint-3071-task-a/packages/domscribe-test-fixtures
> tsx styling/scripts/falsifier.ts

{
  "mode": "self-test",
  "total": 10,
  "passes": 10,
  "fails": 0,
  "oneShotRate": 1,
  "annotations": [
    {
      "id": "A001",
      "fixture": "tailwind",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A002",
      "fixture": "tailwind",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A003",
      "fixture": "tailwind",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A004",
      "fixture": "tailwind",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A005",
      "fixture": "tailwind",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A101",
      "fixture": "styled",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A102",
      "fixture": "styled",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A103",
      "fixture": "styled",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A104",
      "fixture": "styled",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    },
    {
      "id": "A105",
      "fixture": "styled",
      "passed": true,
      "pixelDiffRatio": 0,
      "diffPixels": 0
    }
  ]
}
100 changes: 100 additions & 0 deletions docs/sprints/3071-rfc-0001-baseline.md
@@ -0,0 +1,100 @@
# Sprint 3071 — RFC 0001 baseline + positioning verdict

**Author:** Staff SWE (sprint run 3071, issue #51)
**Date:** 2026-06-08
**Decides:** the positioning language for `verify_after_edit` (RFC 0002).
**Does not decide:** whether to ship `verify_after_edit`. That bet is made by the DOP memo and RFC 0002; this doc only settles how the tool is framed.

---

## TL;DR

| Quantity | Value | Source |
| ----------------------------------------------------------- | --------------------------------------- | ----------------------------------------------- |
| RFC 0001 falsifier (≥70% agent one-shot styling completion) | **unmeasured** | no agent-integration harness exists on `main` |
| RFC 0001 mechanism self-test | **10/10 (100%), 0 pixel diff** | `styling/scripts/falsifier.ts --mode=self-test` |
| Positioning verdict | **self-correction layer (<85% branch)** | conservative default in absence of measurement |
| Slack alert (≥85% trigger) | **not posted** | threshold neither met nor measurable |

The lift of the comparator into `@domscribe/verify` (this PR, Task A3) is independently validated by the self-test: the harness re-imports the comparator and continues to grade all 10 baselines at 0 pixel diff.

## What the harness can measure today

The RFC 0001 falsifier harness (`packages/domscribe-test-fixtures/styling/scripts/falsifier.ts`) supports three modes:

1. **`self-test`** — builds the Tailwind and styled-components fixture apps, screenshots each annotation's `afterRoute`, and diffs against the committed baseline. Expected pass rate is **100% by design** — this is the harness's own correctness check, not a measurement of agent capability. The README is explicit:

> It does not invoke an agent. The agent-integration loop is built on top of this — see `--mode=measure`.

2. **`record`** — re-captures the baseline PNGs from the canonical `/after` routes.

3. **`measure --agent-output=<dir>`** — production grading: reads one screenshot per annotation from an external directory (produced by an agent-integration harness) and diffs against the baseline. **This is the mode that would actually answer "what is the agent's one-shot styling completion rate?"**
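To make the grading step concrete, here is a minimal sketch of what "diff against the baseline" could look like for one annotation. It is an illustration under assumptions, not the comparator shipped in `@domscribe/verify`: the names `diffRgba`, `gradeAnnotation`, and `maxRatio` are hypothetical, images are assumed to be already decoded to same-sized RGBA buffers, and the comparison is exact (no anti-aliasing tolerance). Only the output field names `diffPixels`, `pixelDiffRatio`, and `passed` come from the harness JSON.

```typescript
// Hypothetical sketch; the real comparator lives in @domscribe/verify and may differ.
interface DiffResult {
  diffPixels: number;
  pixelDiffRatio: number;
}

interface Grade extends DiffResult {
  id: string;
  passed: boolean;
}

/** Count pixels whose RGBA bytes differ between two same-sized images. */
function diffRgba(
  baseline: Uint8Array,
  candidate: Uint8Array,
  width: number,
  height: number,
): DiffResult {
  if (width <= 0 || height <= 0) {
    throw new Error('width and height must be positive');
  }
  const totalPixels = width * height;
  if (baseline.length !== totalPixels * 4 || candidate.length !== totalPixels * 4) {
    throw new Error('both buffers must hold width * height RGBA pixels');
  }
  let diffPixels = 0;
  for (let offset = 0; offset < totalPixels * 4; offset += 4) {
    const same =
      baseline[offset] === candidate[offset] &&
      baseline[offset + 1] === candidate[offset + 1] &&
      baseline[offset + 2] === candidate[offset + 2] &&
      baseline[offset + 3] === candidate[offset + 3];
    if (!same) diffPixels += 1;
  }
  return { diffPixels, pixelDiffRatio: diffPixels / totalPixels };
}

/** Grade one annotation: it passes when the diff ratio is within maxRatio (assumed 0). */
function gradeAnnotation(
  id: string,
  baseline: Uint8Array,
  candidate: Uint8Array,
  width: number,
  height: number,
  maxRatio = 0,
): Grade {
  const diff = diffRgba(baseline, candidate, width, height);
  return { id, passed: diff.pixelDiffRatio <= maxRatio, ...diff };
}
```

In `self-test` mode the candidate is a fresh screenshot of the canonical `afterRoute`, so a grade like this is expected to be zero-diff; in `measure` mode the candidate would come from the agent-output directory instead.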

## What is missing

The agent-integration loop required to run `--mode=measure` does **not** exist on `main`. Specifically, there is no harness that:

- Reads each annotation from `styling/annotations.json`,
- Drives an agent (Claude / Codex / similar) through the edit using the intent + source-file context,
- Boots the fixture from the post-edit source,
- Screenshots the rendered element into a per-annotation PNG,
- Hands the directory to `falsifier.ts --mode=measure`.

Until that loop exists, the inherited RFC 0001 falsifier (≥70% one-shot agent styling completion by sprint 2734+6) is **unmeasured**. The self-test pass rate is structurally **not** a substitute — the self-test screenshots the canonical-after route, not an agent's edit, so it cannot fall below 100% no matter how poorly an agent would perform.

## Self-test result (mechanism-only)

```
mode=self-test, total=10, passes=10, fails=0, oneShotRate=1.0
all annotations: pixelDiffRatio=0, diffPixels=0
```

Raw JSON: [`3071-rfc-0001-baseline.harness-self-test.json`](./3071-rfc-0001-baseline.harness-self-test.json).
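For readers checking the raw JSON by hand, the summary fields appear to be derivable from the `annotations` array. The sketch below recomputes them; the field names match the report, but the function name `summarize` and the formulas (`fails = total - passes`, `oneShotRate = passes / total`) are assumptions inferred from the data, not taken from `falsifier.ts`.

```typescript
// Field names mirror the harness JSON; the derivation itself is an assumption.
interface AnnotationResult {
  id: string;
  fixture: string;
  passed: boolean;
  pixelDiffRatio: number;
  diffPixels: number;
}

interface Report {
  mode: string;
  total: number;
  passes: number;
  fails: number;
  oneShotRate: number;
  annotations: AnnotationResult[];
}

/** Recompute the report summary from per-annotation results. */
function summarize(mode: string, annotations: AnnotationResult[]): Report {
  const total = annotations.length;
  const passes = annotations.filter((a) => a.passed).length;
  return {
    mode,
    total,
    passes,
    fails: total - passes,
    // Guard against an empty annotation set instead of emitting NaN.
    oneShotRate: total === 0 ? 0 : passes / total,
    annotations,
  };
}
```

Note that in `self-test` mode the `oneShotRate` field describes the harness, not an agent, even though the field name is shared with `measure` mode.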

The 100% pass rate means:

- The Vite build for both fixtures is reproducible.
- Chromium screenshot capture is deterministic under the pinned locale, fonts, and viewport in this environment.
- The lifted comparator in `@domscribe/verify` (this PR) diffs identically to the inline version it replaces — none of the 10 baseline diffs shifted off zero.

The 100% pass rate **does not mean** the agent's one-shot styling completion rate is 100%. That number is unknown.

## Methodology

- **Where:** ephemeral dev sandbox; node v20.19.4; pnpm 9.12.0; playwright 1.58.2 (chrome-headless-shell 1208); locale `en-US`, timezone `UTC`, viewport `800×600`, scale 1, animations disabled (matches the harness defaults).
- **Source:** worktree at `origin/main@a171724` (RFC 0001 Task B merge), plus the `@domscribe/verify` lift introduced by this PR.
- **Command:** `pnpm --filter @domscribe/test-fixtures test:falsifier`.
- **Reproducibility:** the same command on the same commit, on a CI runner with the documented Playwright cache, returns the same JSON. Re-recording baselines (`--mode=record`) is needed only if the canonical-after routes or the Chromium build change.
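The capture settings listed above can be written down as plain option objects. This is a sketch assuming the harness passes them to Playwright: the keys mirror Playwright's browser-context options (`locale`, `timezoneId`, `viewport`, `deviceScaleFactor`) and the `animations` screenshot option, but how `falsifier.ts` actually sets its defaults is not shown in this PR, and the constant names are hypothetical.

```typescript
// Hypothetical constants; keys follow Playwright option names, values follow Methodology.
const captureContextOptions = {
  locale: 'en-US',
  timezoneId: 'UTC',
  viewport: { width: 800, height: 600 },
  deviceScaleFactor: 1,
} as const;

// Screenshot-level option that freezes CSS animations and transitions.
const captureScreenshotOptions = {
  animations: 'disabled',
} as const;

/** One-line description, handy for logging next to a recorded baseline. */
function describeCapture(): string {
  const { locale, timezoneId, viewport, deviceScaleFactor } = captureContextOptions;
  return `${locale} ${timezoneId} ${viewport.width}x${viewport.height}@${deviceScaleFactor} animations=${captureScreenshotOptions.animations}`;
}
```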

## Positioning verdict

Per RFC 0002 §Implications-for-PM and issue #51, the baseline gates how `verify_after_edit` is framed:

- **≥85% → trust layer.** Verify catches the long tail; the build is conservative; PM may consider deferring relay registration (Task B) if capacity is tight.
- **<85% → self-correction layer.** Verify is load-bearing for the value loop; the full build proceeds.

The baseline is unmeasured. The conservative default in the absence of measurement is the **self-correction layer** branch — we cannot justify treating verify as a long-tail polish layer when we have no evidence the short tail is solved. The full build proceeds; the Slack alert (which fires only on ≥85%) is **not** posted.
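The decision rule can be stated as a small function. This is a sketch of the rule as described in this section, not code from the repository; `positioningVerdict`, `Verdict`, and `TRUST_THRESHOLD` are hypothetical names. The input is meant to be the `measure`-mode one-shot rate, with `undefined` standing for "unmeasured".

```typescript
type Positioning = 'trust layer' | 'self-correction layer';

interface Verdict {
  positioning: Positioning;
  postSlackAlert: boolean;
  measured: boolean;
}

// Threshold from RFC 0002 §Implications-for-PM as quoted in this doc.
const TRUST_THRESHOLD = 0.85;

/** Map an agent one-shot rate (or its absence) to a positioning verdict. */
function positioningVerdict(oneShotRate: number | undefined): Verdict {
  if (oneShotRate === undefined) {
    // Conservative default: no measurement means no evidence for the trust branch.
    return { positioning: 'self-correction layer', postSlackAlert: false, measured: false };
  }
  const trust = oneShotRate >= TRUST_THRESHOLD;
  return {
    positioning: trust ? 'trust layer' : 'self-correction layer',
    postSlackAlert: trust,
    measured: true,
  };
}
```

Feeding the self-test rate (1.0) into a function like this would wrongly select the trust branch, which is why the self-test number is kept out of the verdict.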

## What this means for Task B

No change. Task B (runtime `ScreenshotCapturer` + relay `verify_after_edit` MCP tool) ships as planned: soft-recommended in MCP prompts, with no lifecycle gate. The package-level value of `@domscribe/verify` is independent of the agent one-shot rate, because the harness already consumes it and the relay tool will consume it on the same contract.

## Follow-up — agent-integration harness (next sprint)

The cleanest way to retire this measurement gap is to add `--mode=agent` (or a separate driver script under `styling/scripts/`) that:

1. For each annotation in `annotations.json`, spawns the agent under test with a fixed prompt (intent + sourceFile + sourceLine + the merged RFC 0001 `styleSource` + `componentStyles`).
2. Applies the agent's edit to a scratch copy of the fixture, builds it, screenshots `afterRoute`.
3. Writes `<id>.png` into a deterministic agent-output directory.
4. Invokes the existing `--mode=measure` with that directory.

This is the prerequisite for measuring both the inherited RFC 0001 falsifier (≥70% one-shot) **and** the RFC 0002 falsifier (≥60% retry-resolution rate). It is sized as a separate sprint task and is out of scope for issue #51: the issue's "Out of scope" enumeration lists agent-side work as a P1 follow-up.
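The numbered steps above could be wired together roughly as follows. This is a design sketch, not an implementation: every step is injected through a hypothetical `DriverSteps` interface so the loop can be exercised with fakes, and none of these function names exist in the repository today. The `Annotation` fields used here (`id`, `intent`, `sourceFile`, `sourceLine`, `afterRoute`) are the ones this doc mentions; the real annotation shape may carry more.

```typescript
interface Annotation {
  id: string;
  intent: string;
  sourceFile: string;
  sourceLine: number;
  afterRoute: string;
}

// Hypothetical seams; each would wrap real tooling (agent CLI, Vite build, Playwright).
interface DriverSteps {
  makeScratchCopy(annotationId: string): Promise<string>;
  runAgent(annotation: Annotation, scratchDir: string): Promise<void>;
  buildFixture(scratchDir: string): Promise<void>;
  screenshot(scratchDir: string, route: string): Promise<Uint8Array>;
  writePng(path: string, png: Uint8Array): Promise<void>;
  /** Stand-in for invoking `--mode=measure`; resolves to the one-shot rate. */
  measure(agentOutputDir: string): Promise<number>;
}

/** Run every annotation through the agent, then hand the PNG directory to measure. */
async function runAgentMode(
  annotations: Annotation[],
  agentOutputDir: string,
  steps: DriverSteps,
): Promise<number> {
  // Sequential on purpose: one agent edit at a time keeps runs easy to reproduce.
  for (const annotation of annotations) {
    const scratchDir = await steps.makeScratchCopy(annotation.id);
    await steps.runAgent(annotation, scratchDir);
    await steps.buildFixture(scratchDir);
    const png = await steps.screenshot(scratchDir, annotation.afterRoute);
    await steps.writePng(`${agentOutputDir}/${annotation.id}.png`, png);
  }
  return steps.measure(agentOutputDir);
}
```

A real driver would also need a policy for agent or build failures (for example, counting the annotation as a fail rather than aborting the run); that choice is left open here.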

## References

- [RFC 0001 — Two-tier component-style attribution](../rfcs/0001-component-styles-capture.md)
- [RFC 0002 — Post-edit verification as an MCP diagnostic tool](../rfcs/0002-post-edit-verify-mcp-tool.md)
- Issue [#51](https://github.com/patchorbit/domscribe/issues/51), Issue [#52](https://github.com/patchorbit/domscribe/issues/52)
- PRs [#49](https://github.com/patchorbit/domscribe/pull/49), [#50](https://github.com/patchorbit/domscribe/pull/50) (RFC 0001 Tasks A and B)
- Harness source: [`packages/domscribe-test-fixtures/styling/scripts/falsifier.ts`](../../packages/domscribe-test-fixtures/styling/scripts/falsifier.ts)
- Harness README: [`packages/domscribe-test-fixtures/styling/README.md`](../../packages/domscribe-test-fixtures/styling/README.md)
13 changes: 13 additions & 0 deletions eslint.config.mjs
@@ -43,6 +43,19 @@ export default [
                'scope:adapter',
              ],
            },
            {
              // scope:test consumes the same packages adapters do — it
              // grades them. Notably, `@domscribe/test-fixtures` now imports
              // `@domscribe/verify` (scope:infra) so the harness and the
              // relay verify_after_edit tool share one comparator.
              sourceTag: 'scope:test',
              onlyDependOnLibsWithTags: [
                'scope:core',
                'scope:infra',
                'scope:build',
                'scope:adapter',
              ],
            },
          ],
        },
      ],
@@ -141,4 +141,31 @@ describe('migrateAnnotation', () => {

    expect(result.context.runtimeContext).toBeUndefined();
  });

  it('should migrate a v2 annotation up to v3 (additive verifyHistory, no field rewrite)', () => {
    // Simulates a v2 annotation persisted between RFC 0001 (v1→v2) and
    // RFC 0002 (v2→v3). The v2 → v3 step is purely additive (verifyHistory
    // is a new optional field) — pre-existing runtimeContext data must
    // survive untouched.
    const raw = buildRawAnnotation({
      metadata: { schemaVersion: 2 },
      context: {
        pageUrl: 'http://localhost:3000',
        pageTitle: 'Test',
        viewport: { width: 1920, height: 1080 },
        userAgent: 'test-agent',
        runtimeContext: {
          componentStyles: { computed: { padding: '16px' } },
        },
      },
    });

    const result = migrateAnnotation(raw);

    expect(result.metadata.schemaVersion).toBe(ANNOTATION_SCHEMA_VERSION);
    expect(result.context.runtimeContext).toEqual({
      componentStyles: { computed: { padding: '16px' } },
    });
    expect(result.context.verifyHistory).toBeUndefined();
  });
});
@@ -31,6 +31,16 @@ const migrationSteps: Record<number, (data: Record<string, unknown>) => void> =
  1: () => {
    // No-op: v1 → v2 is purely additive.
  },
  /**
   * v2 → v3: additive only (per RFC 0002).
   *
   * v3 adds optional `context.verifyHistory` (an array of `VerifyResult`
   * records emitted by the `verify_after_edit` MCP tool). The field is
   * absent on v2 payloads, so no field rewriting is required.
   */
  2: () => {
    // No-op: v2 → v3 is purely additive.
  },
};

/**
117 changes: 117 additions & 0 deletions packages/domscribe-core/src/lib/types/annotation.spec.ts
@@ -0,0 +1,117 @@
/**
 * Schema tests for RFC 0002 additions to @domscribe/core.
 *
 * Covers the additive surface: VerifyResultSchema, AnnotationContext.verifyHistory,
 * and the v3 schema-version bump. The pre-RFC 0002 annotation shape is exercised
 * exhaustively in `annotation-migrations.spec.ts` and the wider integration
 * suites; this spec is scoped to the new fields.
 */

import { describe, it, expect } from 'vitest';
import {
  ANNOTATION_SCHEMA_VERSION,
  AnnotationContextSchema,
  VerifyResultSchema,
  VerifyVerdictSchema,
} from './annotation.js';

describe('ANNOTATION_SCHEMA_VERSION', () => {
  it('is at v3 (RFC 0002 — verifyHistory)', () => {
    expect(ANNOTATION_SCHEMA_VERSION).toBe(3);
  });
});

describe('VerifyVerdictSchema', () => {
  it.each(['match', 'partial', 'no_change', 'regression'] as const)(
    'accepts %s',
    (verdict) => {
      expect(VerifyVerdictSchema.parse(verdict)).toBe(verdict);
    },
  );

  it('rejects unknown verdicts', () => {
    expect(() => VerifyVerdictSchema.parse('ok')).toThrow();
  });
});

describe('VerifyResultSchema', () => {
  it('parses a minimal match result (verdict + timestamp only)', () => {
    const parsed = VerifyResultSchema.parse({
      verdict: 'match',
      timestamp: '2026-06-08T12:00:00.000Z',
    });
    expect(parsed.verdict).toBe('match');
    expect(parsed.componentStylesDelta).toBeUndefined();
    expect(parsed.screenshotRef).toBeUndefined();
  });

  it('parses a fully-populated partial result with all delta arrays', () => {
    const parsed = VerifyResultSchema.parse({
      verdict: 'partial',
      timestamp: '2026-06-08T12:00:00.000Z',
      pixelDiffRatio: 0.012,
      componentStylesDelta: [
        { property: 'padding', before: '16px', after: '24px' },
      ],
      computedStyleDelta: [
        { property: 'background-color', before: null, after: 'rgb(0, 0, 0)' },
      ],
      boundingRectDelta: [{ field: 'height', before: 32, after: 40 }],
      screenshotRef: 'blob://relay/ann_x/post-edit-1.png',
      notes: 'padding matched intent; background-color regressed',
    });
    expect(parsed.componentStylesDelta).toHaveLength(1);
    expect(parsed.boundingRectDelta?.[0]?.field).toBe('height');
    expect(parsed.screenshotRef).toMatch(/^blob:\/\//);
  });

  it('rejects out-of-range pixelDiffRatio', () => {
    expect(() =>
      VerifyResultSchema.parse({
        verdict: 'match',
        timestamp: '2026-06-08T12:00:00.000Z',
        pixelDiffRatio: 1.5,
      }),
    ).toThrow();
  });

  it('rejects unknown BoundingRectDelta fields', () => {
    expect(() =>
      VerifyResultSchema.parse({
        verdict: 'partial',
        timestamp: '2026-06-08T12:00:00.000Z',
        boundingRectDelta: [
          // 'depth' is not a valid field; runtime rejection is the point.
          // No @ts-expect-error here: if `parse` accepts `unknown` input (as
          // schema parsers typically do), the directive would be flagged as unused.
          { field: 'depth', before: 0, after: 10 },
        ],
      }),
    ).toThrow();
  });
});

describe('AnnotationContextSchema.verifyHistory', () => {
  it('accepts a context without verifyHistory (older clients silently ignore)', () => {
    const parsed = AnnotationContextSchema.parse({
      pageUrl: 'http://localhost:3000',
      pageTitle: 'Test',
      viewport: { width: 1920, height: 1080 },
      userAgent: 'test-agent',
    });
    expect(parsed.verifyHistory).toBeUndefined();
  });

  it('accepts an append-only history of VerifyResults', () => {
    const parsed = AnnotationContextSchema.parse({
      pageUrl: 'http://localhost:3000',
      pageTitle: 'Test',
      viewport: { width: 1920, height: 1080 },
      userAgent: 'test-agent',
      verifyHistory: [
        { verdict: 'partial', timestamp: '2026-06-08T12:00:00.000Z' },
        { verdict: 'match', timestamp: '2026-06-08T12:00:05.000Z' },
      ],
    });
    expect(parsed.verifyHistory).toHaveLength(2);
    expect(parsed.verifyHistory?.[1]?.verdict).toBe('match');
  });
});