Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery
ResearchClawBench is a benchmark that measures whether AI coding agents can independently conduct scientific research — from reading raw data to producing publication-quality reports — and then rigorously evaluates the results against real human-authored papers.
Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: given the same data and tools a human researcher had, can an AI agent arrive at the same (or better) scientific conclusions?
- 🔄 Two-Stage Pipeline: Autonomous research + rigorous peer-review-style evaluation
- 🧪 40 Real-Science Tasks: 10 disciplines, complete datasets from published papers
- 👁️ Expert-Annotated Data: Tasks, checklists & datasets curated by domain experts
- 🤖 Multi-Agent Support: Claude Code, Codex CLI, OpenClaw & custom agents
- 🚀 Re-Discovery to New-Discovery: 50 = match the paper, 70+ = surpass it
- 📋 Fine-Grained Checklist: Per-item keywords, weights & reasoning
- 📡 Live Streaming UI: Watch agents code, plot & write in real-time
- 🍃 Lightweight Dependencies: Pure Flask + vanilla JS, no heavy frameworks
Most AI benchmarks evaluate what models know. We evaluate what agents can do.
- Real science, not toy problems. 40 tasks sourced from published papers across 10 disciplines, each with complete experimental datasets.
- Two-stage pipeline. Autonomous research first, rigorous evaluation second — just like peer review.
- Fine-grained, multimodal scoring. A weighted checklist with text and image criteria, judged by an LLM acting as a strict peer reviewer.
- Agent-agnostic. Ships with first-class support for Claude Code, Codex CLI, and OpenClaw. Bring your own agent in one line.
- From Re-Discovery to New-Discovery. A score around 50 means matching the original paper; 70 or above means surpassing it. The frontier is wide open.
Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:
flowchart TD
A["📄 High-Quality Paper Collection\n(Target Paper)"] --> B["🧑🔬 Human Expert Extraction\n(Core Task Instructions)"]
B --> C["📋 Evaluation Checklist\n(Criteria + Keywords + Weights)"]
B --> D["📂 Data & Related Work Collection\n(Datasets + Reference Papers)"]
C --> E["✅ Human Reproduction & Validation\n(Verify checklist is reproducible)"]
D --> E
style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px
- High-Quality Paper Collection: Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.
- Expert Task Extraction: Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.
- Checklist Design: Experts create a fine-grained evaluation checklist with weighted criteria (text and image items), each with specific technical keywords that a judge must verify.
- Data & Related Work Collection: The original datasets used in the paper are gathered, along with relevant reference materials, to form a self-contained research workspace.
- Human Reproduction & Validation: Human researchers independently reproduce the paper's results using only the provided data and instructions, verifying that every checklist item is achievable. This ensures the benchmark is fair and the checklist is grounded in reality.
ResearchClawBench operates in two distinct stages:
flowchart LR
subgraph Stage1["Stage 1 — Auto Research"]
A["Raw Data\n+ Instructions"] --> B["AI Agent\n(autonomous)"]
B --> C["Code\n+ Figures\n+ Report"]
end
subgraph Stage2["Stage 2 — Evaluation"]
C --> D["LLM Judge"]
E["Target Paper\n+ Checklist"] --> D
D --> F["Per-Item Scores\n+ Reasoning"]
end
style Stage1 fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
style Stage2 fill:#fff7ed,stroke:#f59e0b,stroke-width:2px
The AI agent receives a workspace containing raw datasets, reference materials, and task instructions. It must independently:
- Explore the data and understand the research question
- Write code to analyze, model, and visualize the data
- Produce a research report (`report/report.md`) with figures, methodology, results, and discussion
No hand-holding. No chain-of-thought hints. The agent works in its own sandboxed workspace with full tool access — just like a real researcher.
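
As a rough illustration of the Stage 1 contract, the sketch below checks whether a finished workspace contains a non-empty `report/report.md` and lists the figures that report links to. The helper names `find_report` and `referenced_figures` are hypothetical and are not part of the ResearchClawBench codebase; the only detail taken from this README is the report path.

```python
import re
from pathlib import Path
from typing import List, Optional

# Report location named in this README, relative to the agent workspace.
REPORT_RELATIVE_PATH = Path("report") / "report.md"

# Matches the target of a markdown image link such as ![caption](images/a.png).
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def find_report(workspace: Path) -> Optional[Path]:
    """Return the report path if it exists and is non-empty, else None."""
    report = workspace / REPORT_RELATIVE_PATH
    if report.is_file() and report.stat().st_size > 0:
        return report
    return None


def referenced_figures(report: Path) -> List[str]:
    """Return the image targets linked from the report, in document order."""
    text = report.read_text(encoding="utf-8")
    return IMAGE_LINK.findall(text)
```

A check like this only confirms that the expected artifact exists; it says nothing about the scientific quality of the report, which is what Stage 2 evaluates.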
Once the agent finishes, its report is evaluated against the original published paper using a fine-grained checklist. The judge receives the task instructions, the AI report, and the checklist criteria — then scores each item using a dual-mode rubric:
flowchart TD
subgraph Inputs
I["INSTRUCTIONS.md\n(task background)"]
R["Agent Report\n(text + figures)"]
CL["Checklist\n(from target paper)"]
end
I & R & CL --> J["Multimodal LLM Judge"]
J --> DET{"Determine\nEvaluation Mode"}
DET -->|"Quantitative\nresults"| OBJ["Mode A: Objective\n(Metric Optimization)"]
DET -->|"Qualitative\nreasoning"| SUB["Mode B: Subjective\n(Mechanism Analysis)"]
OBJ --> SO["Score by metric\naccuracy vs paper"]
SUB --> SS["Score by evidence\nstrength vs paper"]
SO & SS --> T["Per-Item Scores\n+ Reasoning\n→ Weighted Total"]
style Inputs fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
style J fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style OBJ fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
style SUB fill:#fce7f3,stroke:#ec4899,stroke-width:2px
style T fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
Each checklist item includes:
- Specific criteria extracted from the paper's key contributions
- Technical keywords the judge must verify (e.g., "ROC-AUC improvement", "Monte Carlo integration")
- Weight reflecting the item's importance
- Type: `text` for methodology/findings, `image` for figure comparison (multimodal vision)
The judge automatically determines which evaluation mode applies to each item, then scores it with the corresponding rubric (see below).
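
To make the checklist structure concrete, here is a minimal sketch of one item and of a weighted total. The field names and the `weighted_total` helper are assumptions made for illustration. The README says only that per-item scores are combined into a weighted total, so the weight-normalised average used here is one plausible reading and not necessarily the formula in `score.py`.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ChecklistItem:
    """One checklist entry (field names are illustrative assumptions)."""
    criterion: str       # what the report must demonstrate
    keywords: List[str]  # technical keywords the judge must verify
    weight: float        # relative importance of the item
    type: str            # "text" or "image"


def weighted_total(items: Sequence[ChecklistItem], scores: Sequence[float]) -> float:
    """Combine per-item scores (0-100) into a weight-normalised average."""
    if len(items) != len(scores):
        raise ValueError("need exactly one score per checklist item")
    if any(not 0 <= score <= 100 for score in scores):
        raise ValueError("scores must lie in the range 0-100")
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(item.weight * score for item, score in zip(items, scores))
    return weighted_sum / total_weight
```

Under this assumption the total stays on the same 0-100 scale as the individual items, so the "50 = match the paper" reading carries over to the aggregate.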
For checklist items involving specific numerical results, metrics, or quantitative outcomes:
| Score | Meaning |
|---|---|
| 0 | Criterion completely absent |
| 1–10 | Mentioned but no quantitative results provided |
| 11–20 | Results given but methodology has fundamental errors |
| 21–30 | Significant methodological flaws; metrics deviate severely |
| 31–40 | Methodology mostly correct but metrics notably worse than the paper |
| 41–50 | Metrics roughly comparable to the paper |
| 51–60 | Metrics slightly better than the paper |
| 61–70 | Metrics clearly better than the paper |
| 71–80 | Methodology and metrics both substantially improved |
| 81–90 | Metrics dramatically surpass the paper |
| 91–100 | Breakthrough results far exceeding the paper |
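
The objective rubric above is a lookup from a 0-100 score to a band. The sketch below encodes the table as data and resolves a score with `bisect`; the function name `describe_objective_score` is hypothetical, and the band texts are copied from the table.

```python
from bisect import bisect_left

# (upper bound of band, meaning), copied from the objective rubric table.
OBJECTIVE_BANDS = [
    (0, "Criterion completely absent"),
    (10, "Mentioned but no quantitative results provided"),
    (20, "Results given but methodology has fundamental errors"),
    (30, "Significant methodological flaws; metrics deviate severely"),
    (40, "Methodology mostly correct but metrics notably worse than the paper"),
    (50, "Metrics roughly comparable to the paper"),
    (60, "Metrics slightly better than the paper"),
    (70, "Metrics clearly better than the paper"),
    (80, "Methodology and metrics both substantially improved"),
    (90, "Metrics dramatically surpass the paper"),
    (100, "Breakthrough results far exceeding the paper"),
]
_UPPER_BOUNDS = [upper for upper, _ in OBJECTIVE_BANDS]


def describe_objective_score(score: int) -> str:
    """Return the rubric meaning for an integer score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in the range 0-100")
    # bisect_left finds the first band whose upper bound is >= score.
    return OBJECTIVE_BANDS[bisect_left(_UPPER_BOUNDS, score)][1]
```

The same pattern would apply to the subjective rubric by swapping in its band descriptions.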
For checklist items involving theoretical explanations, mechanistic insights, or interpretive analysis:
| Score | Meaning |
|---|---|
| 0 | Criterion completely absent |
| 1–10 | Mentioned only with vague, generic statements |
| 11–20 | Some description but no substantive analysis |
| 21–30 | Analysis attempted but evidence insufficient or logic has gaps |
| 31–40 | Correct direction but lacks depth; key arguments missing |
| 41–50 | Analysis depth and rigor comparable to the paper |
| 51–60 | More supporting evidence provided than the paper |
| 61–70 | More complete logical chain and more rigorous argumentation |
| 71–80 | Significantly deeper analysis with novel insights |
| 81–90 | Analysis depth far exceeds the paper |
| 91–100 | Original contributions with breakthrough insights |
Strict by design. The judge is highly skeptical of AI-generated content — plausible-sounding claims must be backed by concrete evidence. Longer reports do not score higher. Substance over style.
Each domain contains 4 carefully curated tasks with complete experimental data from real published research:
| Domain | Example Topics | Data Types |
|---|---|---|
| Astronomy | Black hole superradiance, Bayesian stellar inference | .dat, .csv |
| Chemistry | GNN molecular prediction, protein-ligand docking | .pdb, .sdf, .csv |
| Earth | Glacier mass balance, climate datasets | .csv, multi-region series |
| Energy | Battery degradation, renewable energy modeling | .xlsx, time series |
| Information | NLP benchmarks, deep learning analysis | .pdf, .tex, .ipynb |
| Life | Nanopore sequencing, genomic analysis | .csv, .xlsx |
| Material | Materials property prediction, pretrained models | .pt, .csv |
| Math | Multi-agent pathfinding, optimization | .json, .npy, grid maps |
| Neuroscience | Neural decoding, brain signal processing | .csv, .h5, .yaml |
| Physics | Quantum geometry, superfluid stiffness | .h5, .json, .csv |
40 tasks total — each a self-contained research challenge selected from high-quality human-authored publications, spanning the full spectrum from data analysis to novel scientific insight.
git clone https://github.com/InternScience/ResearchClawBench.git
cd ResearchClawBench
pip install -r evaluation/requirements.txt

Create evaluation/.env with your scoring model credentials:

OPENAI_API_KEY=sk-xxx
OPENAI_BASE_URL=https://api.openai.com/v1
SCORER_MODEL=gpt-5.1

python -m evaluation

Open http://localhost:5000, then browse tasks, pick an agent, hit Start Run, and watch the research happen live.
After a run completes, switch to the Evaluation tab and click Score. The multimodal LLM judge evaluates each checklist item and returns per-item scores with reasoning.
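
Scoring depends on the credentials in `evaluation/.env`, so it can help to confirm that the three variables listed earlier are present before launching. The parser below is a minimal sketch that handles only simple `KEY=value` lines; `parse_env_text` and `missing_keys` are hypothetical helpers, and how the project itself loads the file is not stated in this README.

```python
from typing import Dict, List

# Variable names taken from the .env example in this README.
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_BASE_URL", "SCORER_MODEL")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str) -> List[str]:
    """Return the required variables that are absent or empty."""
    values = parse_env_text(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_keys(open("evaluation/.env").read())` returns an empty list when all three variables are set.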
ResearchClawBench ships with built-in support for three frontier coding agents: Claude Code, Codex CLI, and OpenClaw.
Any command that reads an instruction file and works inside a directory can be used. In the UI, select Custom and enter your command using these placeholders:
| Placeholder | Replaced With |
|---|---|
| `{prompt_file}` | Absolute path to INSTRUCTIONS.md |
| `{workspace}` | Absolute path to the workspace directory |
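
One way the two placeholders might be expanded into a runnable command is sketched below. `build_command` is a hypothetical helper, not the project's implementation in `run_task.py`. It uses plain string replacement instead of `str.format` so that any other braces in a custom command are left untouched.

```python
import shlex
from pathlib import Path
from typing import List


def build_command(template: str, prompt_file: Path, workspace: Path) -> List[str]:
    """Fill {prompt_file} and {workspace} with absolute paths and split into argv."""
    filled = (
        template
        .replace("{prompt_file}", str(prompt_file.resolve()))
        .replace("{workspace}", str(workspace.resolve()))
    )
    # shlex.split honours the double quotes around each placeholder,
    # so paths that contain spaces remain a single argument.
    return shlex.split(filled)
```

The resulting list can be passed to `subprocess.run(..., cwd=workspace)`. This sketch assumes POSIX-style paths without embedded double quotes.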
Example:
my-agent run --instructions "{prompt_file}" --workdir "{workspace}"

Or add it as a preset in evaluation/config.py:

AGENT_PRESETS["my_agent"] = {
    "label": "My Agent",
    "icon": "M",
    "logo": "/static/logos/my_agent.svg",
    "cmd": 'my-agent run --instructions "{prompt_file}" --workdir "{workspace}"',
}

The built-in dashboard aggregates the best score per (task, agent) pair and displays:
- Frontier chart — best score per task across all agents
- Leaderboard table — clickable cells linking to individual runs
- Per-task breakdown — view any agent's report, code, and score reasoning
The frontier represents the state of the art — every point above 50 is uncharted territory where AI surpasses human researchers on that specific task.
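
The aggregation described above can be sketched in a few lines. The record shape `(task, agent, score)` and both function names are assumptions for illustration; the dashboard's actual data model is not documented here.

```python
from typing import Dict, Iterable, Tuple

Run = Tuple[str, str, float]  # (task, agent, score): assumed record shape


def best_per_pair(runs: Iterable[Run]) -> Dict[Tuple[str, str], float]:
    """Keep the highest score seen for each (task, agent) pair."""
    best: Dict[Tuple[str, str], float] = {}
    for task, agent, score in runs:
        key = (task, agent)
        if key not in best or score > best[key]:
            best[key] = score
    return best


def frontier(best: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Reduce per-pair bests to the best score per task across all agents."""
    front: Dict[str, float] = {}
    for (task, _agent), score in best.items():
        if task not in front or score > front[task]:
            front[task] = score
    return front
```

The first function corresponds to the cells of the leaderboard table and the second to the points of the frontier chart.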
ResearchClawBench/
├── evaluation/ # Core evaluation framework
│ ├── server.py # Flask API + SSE streaming
│ ├── run_task.py # Workspace setup + agent subprocess
│ ├── score.py # Multimodal LLM scoring engine
│ ├── config.py # Agent presets + constants
│ ├── utils.py # File tree, path safety, discovery
│ ├── static/app.js # Single-file frontend (~1200 LOC)
│ └── templates/index.html # Entry point
├── tasks/ # 40 research tasks
│ ├── Astronomy_000/
│ │ ├── task_info.json # Task description + data manifest
│ │ ├── data/ # Raw experimental datasets
│ │ ├── related_work/ # Reference papers
│ │ └── target_study/ # Paper + checklist + images
│ ├── Chemistry_000/
│ └── ... # 10 domains x 4 tasks
└── workspaces/ # Generated at runtime (gitignored)
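
Given the layout above, tasks could be discovered by scanning `tasks/` and grouping directories by domain. The `Domain_NNN` naming pattern is inferred from the two directory names shown in the tree (`Astronomy_000`, `Chemistry_000`) and is an assumption, as is the helper name `group_tasks_by_domain`.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# Assumed naming scheme: a domain name, an underscore, and a three-digit index.
TASK_DIR_PATTERN = re.compile(r"^(?P<domain>[A-Za-z]+)_(?P<index>\d{3})$")


def group_tasks_by_domain(tasks_root: Path) -> Dict[str, List[str]]:
    """Map each domain to the sorted task directory names found under tasks_root."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry in sorted(tasks_root.iterdir()):
        match = TASK_DIR_PATTERN.match(entry.name)
        if entry.is_dir() and match:
            groups[match.group("domain")].append(entry.name)
    return dict(groups)
```

With the full benchmark checked out, one would expect 10 domains with 4 tasks each, matching the "10 domains x 4 tasks" note in the tree.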
We welcome contributions in several forms:
- New tasks — Add research challenges in existing or new domains
- New agents — Add presets for emerging coding agents
- Bug reports — Open an issue
📧 Email: xu_wanghan@sjtu.edu.cn
If you would like to cite our work, please use the following BibTeX.
@article{xu2025probing,
title={Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows},
author={Xu, Wanghan and Zhou, Yuhao and Zhou, Yifan and Cao, Qinglong and Li, Shuo and Bu, Jia and Liu, Bo and Chen, Yixin and He, Xuming and Zhao, Xiangyu and others},
journal={arXiv preprint arXiv:2512.16969},
year={2025}
}

