@@ -72,54 +72,39 @@ CodeScaleBench is built on three core principles:
7272
7373### 3.1 High-Level Architecture Diagram
7474
75- ```
76- CodeScaleBench Architecture
77- ┌─────────────────────────────────────────────────────────────────────┐
78- │ TASK DEFINITIONS │
79- │ benchmarks/ │
80- │ ├── csb_sdlc_understand/ (10 tasks) ├── csb_org_crossrepo_tracing/ (22) │
81- │ ├── csb_sdlc_design/ (14 tasks) ├── csb_org_security/ (24) │
82- │ ├── csb_sdlc_fix/ (26 tasks) ├── csb_org_incident/ (20) │
83- │ ├── csb_sdlc_feature/ (23 tasks) ├── csb_org_onboarding/ (28) │
84- │ ├── csb_sdlc_refactor/ (16 tasks) ├── csb_org_compliance/ (18) │
85- │ ├── csb_sdlc_test/ (18 tasks) ├── csb_org_crossorg/ (15) │
86- │ ├── csb_sdlc_document/ (13 tasks) ├── csb_org_domain/ (20) │
87- │ ├── csb_sdlc_secure/ (12 tasks) ├── csb_org_migration/ (26) │
88- │ └── csb_sdlc_debug/ (18 tasks) ├── csb_org_org/ (15) │
89- │ 150 SDLC tasks (9 suites) ├── csb_org_platform/ (18) │
90- │ ├── csb_org_crossrepo/ (14) │
91- │ └── 220 Org tasks (11 suites) │
92- └───────────────────┬─────────────────────────────────────────────────┘
93- │
94- ▼
95- ┌─────────────────────────────────────────────────────────────────────┐
96- │ EXECUTION LAYER (Harbor) │
97- │ │
98- │ ┌──────────────────┐ ┌──────────────────┐ │
99- │ │ Config: Baseline │ │ Config: MCP │ │
100- │ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
101- │ │ │ Full source │ │ │ │ Truncated src│ │ │
102- │ │ │ Local tools │ │ │ │ 13 SG MCP │ │ │
103- │ │ │ No MCP │ │ │ │ tools │ │ │
104- │ │ │ Dockerfile │ │ │ │ Dockerfile. │ │ │
105- │ │ │ │ │ │ │ sg_only │ │ │
106- │ │ └──────────────┘ │ │ └──────────────┘ │ │
107- │ └──────────────────┘ └──────────────────┘ │
108- │ ▼ ▼ │
109- │ result.json result.json │
110- │ trajectory.jsonl trajectory.jsonl │
111- └───────────────────┬─────────────────────────────────────────────────┘
112- │
113- ▼
114- ┌─────────────────────────────────────────────────────────────────────┐
115- │ EVALUATION PIPELINE │
116- │ │
117- │ Layer 1: Deterministic Verifiers ──→ reward (0.0-1.0) │
118- │ Layer 2: Optional LLM Judge ──→ judge_score (0.0-1.0) │
119- │ Layer 3: IR Metrics Pipeline ──→ file_recall, MRR, TTFR │
120- │ Layer 4: Statistical Analysis ──→ bootstrap CIs, paired Δ │
121- │ Layer 5: Report Generation ──→ MANIFEST.json, reports │
122- └─────────────────────────────────────────────────────────────────────┘
75+ ```text
76+ CodeScaleBench Architecture
77+ ===========================
78+
79+ Task Definitions
80+ SDLC suites: 150 tasks across 9 suites
81+ Org suites: 220 tasks across 11 suites
82+
83+ (task + config)
84+ |
85+ v
86+ Execution Layer (Harbor)
87+ Baseline config:
88+ - full source code
89+ - local tools only
90+ - Dockerfile
91+ MCP config:
92+ - truncated local source
93+ - Sourcegraph MCP tools
94+ - Dockerfile.sg_only
95+
96+ Outputs per run:
97+ - result.json
98+ - trajectory.jsonl
99+
100+ |
101+ v
102+ Evaluation Pipeline
103+ 1) Deterministic verifier -> reward (0.0-1.0)
104+ 2) Optional LLM judge -> judge_score
105+ 3) IR metrics pipeline -> file_recall, MRR, TTFR
106+ 4) Statistical analysis -> paired deltas, bootstrap CIs
107+ 5) Report generation -> manifests and reports
123108```
124109
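Layer 4 of the pipeline reports paired deltas with bootstrap confidence intervals. A minimal sketch of that idea follows, assuming a percentile bootstrap over per-task reward differences; the function name `paired_bootstrap_ci`, its parameters, and the resample count are illustrative and not taken from the CodeScaleBench code.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, mcp, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean_delta, lower, upper) for paired per-task rewards.

    Hypothetical helper: the two lists must be aligned so that index i
    holds the same task under both configurations.
    """
    if len(baseline) != len(mcp) or not baseline:
        raise ValueError("need two equal-length, non-empty score lists")

    # Paired design: work on the per-task difference, not on the raw scores.
    deltas = [m - b for b, m in zip(baseline, mcp)]

    rng = random.Random(seed)
    n = len(deltas)
    # Resample tasks with replacement and record the mean delta each time.
    resampled_means = sorted(
        mean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )

    # Percentile interval read off the sorted bootstrap distribution.
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[
        min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    ]
    return mean(deltas), lower, upper
```

Resampling the differences rather than the two score lists independently keeps each task's baseline and MCP result tied together, which is what makes the comparison paired.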
125110### 3.2 Per-Task Directory Structure
@@ -426,27 +411,23 @@ Every task produces a single reward score (0.0--1.0) via a deterministic, in-con
426411
427412Harbor uploads each task's `tests/` directory to `/tests/` inside the container and invokes the entry-point script after the agent finishes. The entry point writes a floating-point score to `/logs/verifier/reward.txt`. All verifiers follow the exit-code-first convention: exit 0 if score > 0.0, exit 1 otherwise.
428413
429- ```
430- Harbor Container
431- ┌─────────────────────────────────────────────────────────┐
432- │ Agent writes to /workspace/ │
433- │ │ │
434- │ ▼ │
435- │ /tests/test.sh (SDLC tasks) │
436- │ /tests/eval.sh (Org tasks) │
437- │ │ │
438- │ ├── sources shared libraries as needed: │
439- │ │ ├── verifier_lib.sh (IR metrics helpers) │
440- │ │ ├── answer_json_verifier_lib.sh │
441- │ │ │ (artifact mode extraction) │
442- │ │ └── sgonly_verifier_wrapper.sh │
443- │ │ (repo restoration for MCP) │
444- │ │ │
445- │ ├── runs task-specific scoring logic │
446- │ │ │
447- │ ▼ │
448- │ /logs/verifier/reward.txt (0.0 -- 1.0) │
449- └─────────────────────────────────────────────────────────┘
414+ ```text
415+ Harbor Container Verifier Flow
416+ ------------------------------
417+ Agent writes outputs to /workspace/
418+ |
419+ v
420+ /tests/test.sh (SDLC) or /tests/eval.sh (Org)
421+ |
422+ +-- shared libs (as needed):
423+ | - verifier_lib.sh
424+ | - answer_json_verifier_lib.sh
425+ | - sgonly_verifier_wrapper.sh
426+ |
427+ +-- task-specific scoring logic
428+ |
429+ v
430+ /logs/verifier/reward.txt (0.0-1.0)
450431```
451432
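The reward-file and exit-code convention described above can be sketched in a few lines. The real entry points are shell scripts (`test.sh`, `eval.sh`); this Python version only illustrates the contract. The helper name `write_reward` and the clamping of out-of-range scores are assumptions, not documented behavior.

```python
from pathlib import Path


def write_reward(score, reward_path="/logs/verifier/reward.txt"):
    """Write the score to the reward file and return the process exit code.

    Hypothetical helper. Clamping to [0.0, 1.0] is an assumption made so
    the file always holds a value in the documented range.
    """
    score = max(0.0, min(1.0, float(score)))

    path = Path(reward_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{score}\n")

    # Exit-code-first convention: 0 if score > 0.0, 1 otherwise.
    return 0 if score > 0.0 else 1
```

A verifier would typically end with `sys.exit(write_reward(score))`, so the harness can read the exit code for a quick pass/fail signal and the reward file for the graded score.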
452433### 7.3 SDLC Task Verifiers (test.sh)
@@ -508,18 +489,20 @@ Four shared libraries handle cross-cutting concerns:
508489
509490A key design choice for fair MCP evaluation: during the agent's run, source code is truncated (empty files). At verification time, `sgonly_verifier_wrapper.sh` restores the full codebase:
510491
511- ```
512- Agent Runtime Verification Time
513- ───────────── ─────────────────
514- Dockerfile.sg_only: sgonly_verifier_wrapper.sh:
515- ┌────────────────┐ ┌──────────────────────┐
516- │ Truncated src │ │ Read clone manifest │
517- │ (empty files) │ ──Agent edits──→ │ Back up agent files │
518- │ │ │ Clone mirror repos │
519- │ Agent uses MCP │ │ Re-inject defects │
520- │ to read code │ │ Overlay agent changes│
521- └────────────────┘ │ Run original test.sh │
522- └──────────────────────┘
492+ ```text
493+ SG-Only Clone-at-Verify Flow
494+ ----------------------------
495+ Agent runtime (Dockerfile.sg_only):
496+ - local source is truncated
497+ - agent reads code via MCP and makes edits
498+
499+ Verification runtime (sgonly_verifier_wrapper.sh):
500+ 1) Read clone manifest
501+ 2) Back up agent-edited files
502+ 3) Clone mirror repositories
503+ 4) Re-inject defects (if task requires)
504+ 5) Overlay agent edits
505+ 6) Run original verifier script
523506```
524507
525508The clone manifest (`/tmp/.sg_only_clone_manifest.json`) is written at Docker build time and specifies which `sg-evals` mirrors to clone and where to place them. This ensures the verifier operates on the same full codebase as the baseline configuration, producing comparable scores.
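The back-up and overlay steps of this flow can be sketched as follows. This is not the wrapper's actual logic (the wrapper is a shell script, and the manifest schema is not shown here). Both function names are hypothetical, and treating every non-empty file as agent-edited is a simplifying assumption. The clone and defect re-injection steps are omitted.

```python
import shutil
from pathlib import Path


def back_up_agent_files(workspace, backup_dir):
    """Copy non-empty files out of the truncated workspace.

    Assumption: truncated sources are empty, so any non-empty file is
    worth preserving. backup_dir must live outside workspace.
    """
    workspace, backup_dir = Path(workspace), Path(backup_dir)
    saved = []
    for src in sorted(workspace.rglob("*")):
        if src.is_file() and src.stat().st_size > 0:
            rel = src.relative_to(workspace)
            dst = backup_dir / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            saved.append(rel.as_posix())
    return saved


def overlay_agent_files(backup_dir, restored_repo):
    """Copy backed-up files on top of the freshly restored checkout."""
    backup_dir, restored_repo = Path(backup_dir), Path(restored_repo)
    for src in sorted(backup_dir.rglob("*")):
        if src.is_file():
            dst = restored_repo / src.relative_to(backup_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
```

Backing up before cloning matters because restoring the mirror into the workspace would otherwise overwrite the agent's edits; overlaying afterwards puts them back on top of the full source tree.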
@@ -682,29 +665,21 @@ Each tool call is normalized into a structured event:
682665
683666The primary agent (`agents/claude_baseline_agent.py`, 2,090 lines) is a Harbor-compatible agent that wraps Claude Code for benchmark execution:
684667
685- ```
686- ┌─────────────────────────────────────────────────────────────┐
687- │ Claude Baseline Agent │
688- │ │
689- │ ┌─────────────────┐ ┌──────────────────────────────┐ │
690- │ │ Config Detection │ │ V5 Preamble Template │ │
691- │ │ BASELINE_MCP_TYPE│ │ ┌──────────────────────────┐ │ │
692- │ │ ├── none │ │ │ # Source Code Access │ │ │
693- │ │ ├── sourcegraph │ │ │ Files are NOT present. │ │ │
694- │ │ ├── sg_full │────│ │ Use Sourcegraph MCP │ │ │
695- │ │ └── artifact_full│ │ │ tools to read code. │ │ │
696- │ └─────────────────┘ │ │ {repo_scope} │ │ │
697- │ │ │ {workflow_tail} │ │ │
698- │ ┌─────────────────┐ │ └──────────────────────────┘ │ │
699- │ │ Repo Resolution │ └──────────────────────────────┘ │
700- │ │ _get_repo_display│ │
701- │ │ _get_repo_list │ ┌──────────────────────────────┐ │
702- │ │ Priority: │ │ System Prompt Assembly │ │
703- │ │ 1. ENV vars │ │ EVALUATION_CONTEXT + │ │
704- │ │ 2. Docker parse │ │ MCP-specific guidance + │ │
705- │ │ 3. Fallback │ │ Repo scoping rules │ │
706- │ └─────────────────┘ └──────────────────────────────┘ │
707- └─────────────────────────────────────────────────────────────┘
668+ ```text
669+ Claude Baseline Agent (Architecture)
670+ ------------------------------------
671+ Config detection:
672+ BASELINE_MCP_TYPE = none | sourcegraph | sg_full | artifact_full
673+
674+ Repo resolution:
675+ _get_repo_display / _get_repo_list
676+ Priority: env vars -> Docker metadata -> fallback
677+
678+ Prompt assembly:
679+ EVALUATION_CONTEXT
680+ + MCP guidance (when MCP config)
681+ + repo scope filters
682+ + workflow tail
708683```
709684
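The config detection and repo resolution priority shown above can be sketched as below. Only `BASELINE_MCP_TYPE` and its four values come from the diagram. The `REPO_NAME` variable, the regex over Dockerfile text, the default of `none` when the variable is unset, and both function names are assumptions for illustration, not the agent's actual `_get_repo_display` or `_get_repo_list` implementation.

```python
import os
import re

VALID_MCP_TYPES = {"none", "sourcegraph", "sg_full", "artifact_full"}

# Hypothetical pattern: first GitHub "owner/name" slug in the Dockerfile.
_REPO_PATTERN = re.compile(r"github\.com/([\w.-]+/[\w.-]+?)(?:\.git)?(?=\s|$)")


def detect_mcp_type(env=None):
    """Read BASELINE_MCP_TYPE; defaulting to 'none' is an assumption."""
    env = os.environ if env is None else env
    value = env.get("BASELINE_MCP_TYPE", "none").strip().lower()
    if value not in VALID_MCP_TYPES:
        raise ValueError(f"unknown BASELINE_MCP_TYPE: {value!r}")
    return value


def resolve_repo(env, dockerfile_text="", fallback="unknown/repo"):
    """Resolve a repo name: env var, then Docker metadata, then fallback."""
    # 1. Explicit environment variable (REPO_NAME is a hypothetical name).
    name = env.get("REPO_NAME", "").strip()
    if name:
        return name

    # 2. Parse the Dockerfile for a clone URL.
    match = _REPO_PATTERN.search(dockerfile_text)
    if match:
        return match.group(1)

    # 3. Nothing found: use the caller-supplied fallback.
    return fallback
```

Passing the environment in as a mapping, rather than reading `os.environ` directly in `resolve_repo`, keeps the priority order easy to exercise in isolation.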
710685### 9.2 MCP Preamble
@@ -740,23 +715,23 @@ The MCP configuration provides 13 Sourcegraph MCP tools:
740715
741716### 9.4 Docker Environment Variants
742717
743- ```
744- ┌────────────────────────────────────────────────────────────────┐
745- │ Three Dockerfile Variants │
746- │ │
747- │ Dockerfile (Baseline) Dockerfile.sg_only Dockerfile. │
748- │ ┌────────────────────┐ ┌──────────────────┐ artifact_only │
749- │ │ FROM base_image │ │ FROM base_image │ ┌────────────┐│
750- │ │ CLONE full repo │ │ CLONE + truncate │ │ FROM ubuntu ││
751- │ │ at pinned commit │ │ all source files │ │ No code ││
752- │ │ │ │ recommit (no git │ │ .artifact_ ││
753- │ │ Full source access │ │ history bypass) │ │ only_mode ││
754- │ │ │ │ │ │ marker file ││
755- │ │ Verifier runs │ │ Clone manifest │ │ ││
756- │ │ against local code │ │ for verifier │ │ Agent writes││
757- │ │ │ │ restoration │ │ answer.json ││
758- │ └────────────────────┘ └──────────────────┘ └────────────┘│
759- └────────────────────────────────────────────────────────────────┘
718+ ```text
719+ Three Dockerfile Variants
720+ -------------------------
721+ 1) Dockerfile (Baseline)
722+ - clone full repo at pinned commit
723+ - full local source access
724+ - verifier runs against local code
725+
726+ 2) Dockerfile.sg_only (MCP)
727+ - clone repo then truncate source files
728+ - remove local-source bypass paths
729+ - write clone manifest for verify-time restoration
730+
731+ 3) Dockerfile.artifact_only (Org artifact mode)
732+ - no source checkout
733+ - marker file: .artifact_only_mode
734+ - agent produces answer.json artifact
760735```
761736
762737**File extension truncation** (95+ types): `.py`, `.js`, `.ts`, `.go`, `.java`, `.rs`, `.c`, `.cpp`, `.h`, `.yaml`, `.toml`, `.json`, `.xml`, `.md`, and more.
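Extension-based truncation can be sketched as follows. The real truncation happens at Docker build time in `Dockerfile.sg_only`. The function name `truncate_sources` is hypothetical, the extension set lists only the types named above rather than the full 95+, and skipping `.git` is an assumption.

```python
from pathlib import Path

# Only the extensions named in the text; the full list has 95+ entries.
TRUNCATE_EXTENSIONS = frozenset({
    ".py", ".js", ".ts", ".go", ".java", ".rs", ".c", ".cpp", ".h",
    ".yaml", ".toml", ".json", ".xml", ".md",
})


def truncate_sources(root, extensions=TRUNCATE_EXTENSIONS):
    """Empty every file under root whose suffix is in extensions.

    Returns the relative paths that were truncated. Files keep their
    names and locations, so the directory layout stays visible.
    """
    root = Path(root)
    truncated = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if ".git" in rel.parts:
            continue  # assumption: leave repository metadata untouched
        if path.is_file() and path.suffix.lower() in extensions:
            path.write_bytes(b"")
            truncated.append(rel.as_posix())
    return truncated
```

Emptying files instead of deleting them preserves the tree structure, which is consistent with the "empty files" description of the truncated source.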