DiffSpot is a benchmark for fine-grained visual change detection in real-world web interfaces. Each example is a pair of near-identical screenshots that differ by a single programmatic CSS-level mutation; a VLM must describe what changed. Ground truth is recorded directly from the mutation that produced the pair.
- A clean probe for fine-grained perception. VLMs ace high-level imageβtext alignment but stumble on localized UI changes β DiffSpot isolates exactly that ability on real web interfaces.
- Difficulty is property-dependent. Neither pixel magnitude nor CLIP distance predicts recall β the bottleneck is nameability, not visual salience.
- Controllable by construction. A fully code-driven pipeline (mutate one CSS property β re-render β record) yields exact ground truth on a tunable difficulty gradient, with a grounding gate that discards no-effect and reflow-contaminated pairs.
Each pair differs by a single CSS-level mutation. The model is given both screenshots and must name the change.
Easy / Med / Hard / Diff are Recall on the has-diff pairs (1,300 per tier; 3,900 total); No-Diff is specificity on the 500 control pairs; Overall is per-case accuracy (TP + TN) / 4,400 β the official score. Judge: gpt-oss-120b, reasoning_effort=high (cross-judge Kendall Ο = 1.000 across gpt-oss-120b / Kimi-K2.5 / Qwen3.5-VL-397B). Bold = column max; underline = best open-weight. Trivial always-no-diff baseline: 11.4% Overall.
Tip
Gemini 3.1 Pro leads at 47.2%, 5.0 pp ahead of the best open-weight model (Kimi K2.5, 42.2%). Seven of thirteen models fall below 30%. Difficulty is strongly property-dependent β across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts recall.
Submit your model: see docs/submission.md.
git clone https://github.com/Tencent/DiffSpot.git
cd DiffSpot
pip install -e ".[api]"
export OPENAI_API_KEY=... # for both the baseline run and the judge
# 1. Run a baseline VLM (loads the HF dataset, writes predictions JSONL)
# Replace gpt-5.4 with your own model β see "Evaluate your own model" below.
python baselines/api/run_openai.py \
--model gpt-5.4 \
--output results/gpt-5.4/predictions.jsonl
# 2. Score predictions with the official LLM judge (gpt-oss-120b)
python scripts/evaluate.py \
--predictions results/gpt-5.4/predictions.jsonl \
--judge-model gpt-oss-120b \
--output results/gpt-5.4/scores.json
# 3. Print the official metrics table
python scripts/show_results.py results/gpt-5.4/scores.jsonTip
To target an OpenAI-compatible endpoint (vLLM, sglang, an internal gateway) for the judge, set OPENAI_BASE_URL. Anthropic / Google baselines use ANTHROPIC_API_KEY / GOOGLE_API_KEY.
from datasets import load_dataset
ds = load_dataset("tencent/DiffSpot", split="test") # 4,400 pairs
ex = ds[0]
ex["image_before"], ex["image_after"] # PIL.Image β before / after, in this order
ex["task_type"] # "visual_diff" | "no_diff"
ex["difficulty"] # easy | medium | hard | no_diff
ex["ground_truth_diff"] # natural-language GT (display only)
ex["mutation_dicts_json"] # JSON-encoded structured GT (scoring source)Other per-pair fields: mutation_types, domain, pixel_diff / target_diff / outside_diff, target_bbox_{x,y,w,h}. Offline / local parquet: diffspot.data.load(dataset_path="path/to/test-*.parquet").
The contract is simple: produce a predictions JSONL using the prompts in
diffspot/prompts/ verbatim, then score it. Copy a runner in
baselines/ (api/ or local/) as a template and swap in your model client.
Important
Two things silently wreck scores if you roll your own runner:
- Image order β feed
image_beforefirst,image_aftersecond (the prompt is anchored to "Image 1 = before, Image 2 = after"). max_tokensβ₯ 16384 attemperature 0. A thinking model on a small budget spends it on hidden reasoning and returns emptycontentβ parse failures, not errors.
Predictions JSONL β one record per pair:
{"id": "...", "split": "easy|medium|hard|no_diff", "model": "...", "prompt_version": "v1.0", "prediction": "<model's free-form diff list>"}Runners resume automatically (re-running continues from where they stopped). Validate before submitting:
python scripts/prepare_submission.py --predictions results/<model>/predictions.jsonl --output submissions/<model>.zipScoring uses the official judge gpt-oss-120b (reasoning_effort=high), two ways:
- Self-evaluate β the judge is open-weight; host it (vLLM / sglang) or point
OPENAI_BASE_URLat any OpenAI-compatible endpoint serving it, then runscripts/evaluate.py(Quick Start step 2). - Don't want to host a 120B judge? Submit only your predictions and we score them β see
docs/submission.md.
Metric definitions (Overall / Diff / No-Diff, and why No-Diff specificity matters): docs/evaluation.md.
| Variable | Used by |
|---|---|
OPENAI_API_KEY |
OpenAI baseline and the judge |
OPENAI_BASE_URL |
route the OpenAI client to a vLLM / sglang / gateway endpoint (judge or self-hosted models) |
ANTHROPIC_API_KEY |
Anthropic baseline |
GOOGLE_API_KEY (or GEMINI_API_KEY) |
Google baseline |
Requires Python β₯ 3.10. pip install -e ".[api]" pulls the API clients; self-hosted models in baselines/local/ expect an OpenAI-compatible endpoint (--endpoint).
DiffSpot/
βββ diffspot/ # Core package (data loader, judge, metrics)
β βββ prompts/ # VLM and judge prompt templates (versioned)
βββ baselines/ # Reference baseline runners
β βββ api/ # API-hosted models (OpenAI / Anthropic / Google)
β βββ local/ # Self-hosted models via sglang / vllm
βββ scripts/ # End-to-end CLIs (eval, metrics, submission prep)
βββ docs/ # Data card, evaluation protocol, submission guide
βββ examples/ # Sample predictions + walkthrough
βββ results/ # Per-model raw predictions + scores (reproducibility)
βββ leaderboard/ # Machine-readable leaderboard + auto-update tooling
βββ space/ # HuggingFace Space demo source
@misc{zhang2026diffspotvlmsspotfinegrained,
title={DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?},
author={Linhao Zhang and Aiwei Liu and Yuan Liu and Xiao Zhou},
year={2026},
eprint={2605.29615},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.29615},
}DiffSpot-Bench (code and dataset) is released under the MIT License β see LICENSE. Copyright (C) 2026 Tencent. All rights reserved.











