Commit 9b9ef1c

Clean docs artifacts ignores and align technical report diagrams
1 parent 7f3c426 commit 9b9ef1c

2 files changed: +115 -121 lines changed

.gitignore

Lines changed: 19 additions & 0 deletions
@@ -23,8 +23,14 @@ runs/*
 !runs/official/**
 !runs/analysis/
 !runs/analysis/**
+# Ignore newly generated run artifacts under official runs by default.
+# (Already tracked files remain tracked; this suppresses local run noise.)
+runs/official/**
 runs/official/**/agent/sessions/
 runs/official/**/agent/.mcp.json
+runs/official/_raw/
+runs/official/_meta/
+runs/official/_views_legacy/
 runs/analysis/**/agent/sessions/
 runs/analysis/**/agent/.mcp.json
 tasks/
@@ -35,6 +41,19 @@ tasks/
 !docs/official_results/tasks/**
 # Agent traces can contain OAuth tokens — exclude from repo
 docs/official_results/traces/
+# Local/generated analysis caches and intermediate figure exports
+docs/analysis/github_repo_size_kb_cache.json
+docs/assets/blog/codescalebench_mcp/figure_1_*
+docs/assets/blog/codescalebench_mcp/figure_2_*
+docs/assets/blog/codescalebench_mcp/figure_2b_*
+docs/assets/blog/codescalebench_mcp/figure_3_*
+docs/assets/blog/codescalebench_mcp/figure_4_*
+docs/assets/blog/codescalebench_mcp/figure_5_*
+docs/assets/blog/codescalebench_mcp/figure_6_*
+docs/assets/blog/codescalebench_mcp/figure_8_haiku_cost_size_vs_overall.*
+docs/assets/blog/codescalebench_mcp/figure_9_*
+scripts/daytona_snapshot_cleanup.py
+scripts/plot_csb_mcp_blog_figures.py
 ralph/
 ralph-*/
 reports/

docs/technical_reports/TECHNICAL_REPORT.md

Lines changed: 96 additions & 121 deletions
@@ -72,54 +72,39 @@ CodeScaleBench is built on three core principles:
 
 ### 3.1 High-Level Architecture Diagram
 
-```
-CodeScaleBench Architecture
-┌─────────────────────────────────────────────────────────────────────┐
-│ TASK DEFINITIONS │
-│ benchmarks/ │
-│ ├── csb_sdlc_understand/ (10 tasks) ├── csb_org_crossrepo_tracing/ (22) │
-│ ├── csb_sdlc_design/ (14 tasks) ├── csb_org_security/ (24) │
-│ ├── csb_sdlc_fix/ (26 tasks) ├── csb_org_incident/ (20) │
-│ ├── csb_sdlc_feature/ (23 tasks) ├── csb_org_onboarding/ (28) │
-│ ├── csb_sdlc_refactor/ (16 tasks) ├── csb_org_compliance/ (18) │
-│ ├── csb_sdlc_test/ (18 tasks) ├── csb_org_crossorg/ (15) │
-│ ├── csb_sdlc_document/ (13 tasks) ├── csb_org_domain/ (20) │
-│ ├── csb_sdlc_secure/ (12 tasks) ├── csb_org_migration/ (26) │
-│ └── csb_sdlc_debug/ (18 tasks) ├── csb_org_org/ (15) │
-│ 150 SDLC tasks (9 suites) ├── csb_org_platform/ (18) │
-│ ├── csb_org_crossrepo/ (14) │
-│ └── 220 Org tasks (11 suites) │
-└───────────────────┬─────────────────────────────────────────────────┘
-
-
-┌─────────────────────────────────────────────────────────────────────┐
-│ EXECUTION LAYER (Harbor) │
-│ │
-│ ┌──────────────────┐ ┌──────────────────┐ │
-│ │ Config: Baseline │ │ Config: MCP │ │
-│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
-│ │ │ Full source │ │ │ │ Truncated src│ │ │
-│ │ │ Local tools │ │ │ │ 13 SG MCP │ │ │
-│ │ │ No MCP │ │ │ │ tools │ │ │
-│ │ │ Dockerfile │ │ │ │ Dockerfile. │ │ │
-│ │ │ │ │ │ │ sg_only │ │ │
-│ │ └──────────────┘ │ │ └──────────────┘ │ │
-│ └──────────────────┘ └──────────────────┘ │
-│ ▼ ▼ │
-│ result.json result.json │
-│ trajectory.jsonl trajectory.jsonl │
-└───────────────────┬─────────────────────────────────────────────────┘
-
-
-┌─────────────────────────────────────────────────────────────────────┐
-│ EVALUATION PIPELINE │
-│ │
-│ Layer 1: Deterministic Verifiers ──→ reward (0.0-1.0) │
-│ Layer 2: Optional LLM Judge ──→ judge_score (0.0-1.0) │
-│ Layer 3: IR Metrics Pipeline ──→ file_recall, MRR, TTFR │
-│ Layer 4: Statistical Analysis ──→ bootstrap CIs, paired Δ │
-│ Layer 5: Report Generation ──→ MANIFEST.json, reports │
-└─────────────────────────────────────────────────────────────────────┘
+```text
+CodeScaleBench Architecture
+===========================
+
+Task Definitions
+SDLC suites: 150 tasks across 9 suites
+Org suites: 220 tasks across 11 suites
+
+(task + config)
+|
+v
+Execution Layer (Harbor)
+Baseline config:
+- full source code
+- local tools only
+- Dockerfile
+MCP config:
+- truncated local source
+- Sourcegraph MCP tools
+- Dockerfile.sg_only
+
+Outputs per run:
+- result.json
+- trajectory.jsonl
+
+|
+v
+Evaluation Pipeline
+1) Deterministic verifier -> reward (0.0-1.0)
+2) Optional LLM judge -> judge_score
+3) IR metrics pipeline -> file_recall, MRR, TTFR
+4) Statistical analysis -> paired deltas, bootstrap CIs
+5) Report generation -> manifests and reports
 ```
 
 ### 3.2 Per-Task Directory Structure
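The evaluation pipeline in the hunk above lists `file_recall` and `MRR` among its IR metrics. As a rough illustration only, the sketch below uses the textbook definitions of recall and mean reciprocal rank. The function names, and the assumption that each task yields a ranked list of retrieved file paths plus a ground-truth set, are hypothetical; the report's actual pipeline may define these metrics differently.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def file_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of relevant files that appear anywhere in the retrieved set."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # assumption: a task with no ground-truth files scores 0
    return len(relevant_set & set(retrieved)) / len(relevant_set)


def reciprocal_rank(ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """1 / rank of the first relevant file, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, path in enumerate(ranked, start=1):
        if path in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[tuple[Sequence[str], Iterable[str]]]
) -> float:
    """Average reciprocal rank over (ranked_files, relevant_files) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

Keeping the per-task reciprocal rank separate from the mean makes it straightforward to feed per-task values into a paired comparison between the baseline and MCP configurations.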
@@ -426,27 +411,23 @@ Every task produces a single reward score (0.0--1.0) via a deterministic, in-con
 
 Harbor uploads each task's `tests/` directory to `/tests/` inside the container and invokes the entry-point script after the agent finishes. The entry point writes a floating-point score to `/logs/verifier/reward.txt`. All verifiers follow the exit-code-first convention: exit 0 if score > 0.0, exit 1 otherwise.
 
-```
-Harbor Container
-┌─────────────────────────────────────────────────────────┐
-│ Agent writes to /workspace/ │
-│ │ │
-│ ▼ │
-│ /tests/test.sh (SDLC tasks) │
-│ /tests/eval.sh (Org tasks) │
-│ │ │
-│ ├── sources shared libraries as needed: │
-│ │ ├── verifier_lib.sh (IR metrics helpers) │
-│ │ ├── answer_json_verifier_lib.sh │
-│ │ │ (artifact mode extraction) │
-│ │ └── sgonly_verifier_wrapper.sh │
-│ │ (repo restoration for MCP) │
-│ │ │
-│ ├── runs task-specific scoring logic │
-│ │ │
-│ ▼ │
-│ /logs/verifier/reward.txt (0.0 -- 1.0) │
-└─────────────────────────────────────────────────────────┘
+```text
+Harbor Container Verifier Flow
+------------------------------
+Agent writes outputs to /workspace/
+|
+v
+/tests/test.sh (SDLC) or /tests/eval.sh (Org)
+|
++-- shared libs (as needed):
+| - verifier_lib.sh
+| - answer_json_verifier_lib.sh
+| - sgonly_verifier_wrapper.sh
+|
++-- task-specific scoring logic
+|
+v
+/logs/verifier/reward.txt (0.0-1.0)
 ```
 
 ### 7.3 SDLC Task Verifiers (test.sh)
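The hunk above describes the exit-code-first convention: the entry point writes a score to `/logs/verifier/reward.txt`, then exits 0 if the score is above 0.0 and 1 otherwise. The real entry points are shell scripts (`test.sh`, `eval.sh`); the Python sketch below only illustrates the contract. The helper name `write_reward`, the range check, and the plain-decimal file format are assumptions, not taken from the report.

```python
from pathlib import Path

# Path taken from the report; everything else here is illustrative.
REWARD_PATH = Path("/logs/verifier/reward.txt")


def write_reward(score: float, reward_path: Path = REWARD_PATH) -> int:
    """Write the score and return the exit code the verifier should use."""
    # Assumption: out-of-range (or NaN) scores are treated as a verifier bug.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    reward_path.parent.mkdir(parents=True, exist_ok=True)
    reward_path.write_text(f"{score}\n")  # assumption: plain decimal text
    # Exit-code-first convention: 0 if score > 0.0, 1 otherwise.
    return 0 if score > 0.0 else 1
```

A verifier written this way would typically end with `sys.exit(write_reward(score))`, so the reward file and the exit code cannot disagree.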
@@ -508,18 +489,20 @@ Four shared libraries handle cross-cutting concerns:
 
 A key design for fair MCP evaluation: during the agent's run, source code is truncated (empty files). At verification time, `sgonly_verifier_wrapper.sh` restores the full codebase:
 
-```
-Agent Runtime Verification Time
-───────────── ─────────────────
-Dockerfile.sg_only: sgonly_verifier_wrapper.sh:
-┌────────────────┐ ┌──────────────────────┐
-│ Truncated src │ │ Read clone manifest │
-│ (empty files) │ ──Agent edits──→ │ Back up agent files │
-│ │ │ Clone mirror repos │
-│ Agent uses MCP │ │ Re-inject defects │
-│ to read code │ │ Overlay agent changes│
-└────────────────┘ │ Run original test.sh │
-└──────────────────────┘
+```text
+SG-Only Clone-at-Verify Flow
+----------------------------
+Agent runtime (Dockerfile.sg_only):
+- local source is truncated
+- agent reads code via MCP and makes edits
+
+Verification runtime (sgonly_verifier_wrapper.sh):
+1) Read clone manifest
+2) Back up agent-edited files
+3) Clone mirror repositories
+4) Re-inject defects (if task requires)
+5) Overlay agent edits
+6) Run original verifier script
 ```
 
 The clone manifest (`/tmp/.sg_only_clone_manifest.json`) is written at Docker build time and specifies which `sg-evals` mirrors to clone and where to place them. This ensures the verifier operates on the same full codebase as the baseline configuration, producing comparable scores.
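Steps 2 and 5 of the clone-at-verify flow in the hunk above (back up agent-edited files, then overlay them onto the restored checkout) can be pictured with a short sketch. This is not the logic of `sgonly_verifier_wrapper.sh`, which is a shell script whose internals are not shown here. The function names are hypothetical, the list of edited paths is assumed to be known already, and reading the manifest, cloning mirrors, and re-injecting defects are deliberately left out.

```python
from __future__ import annotations

import shutil
from pathlib import Path
from typing import Iterable


def backup_agent_files(
    workspace: Path, edited: Iterable[str], backup: Path
) -> list[str]:
    """Copy agent-edited files (paths relative to workspace) into backup."""
    saved = []
    for rel in edited:
        src = workspace / rel
        if src.is_file():  # skip paths the agent deleted or never created
            dst = backup / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            saved.append(rel)
    return saved


def overlay_agent_files(
    backup: Path, restored_repo: Path, saved: Iterable[str]
) -> None:
    """Copy backed-up agent files over the freshly restored full checkout."""
    for rel in saved:
        dst = restored_repo / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(backup / rel, dst)
```

Backing up before restoring matters because the restore step would otherwise overwrite the agent's edits with the pristine mirror contents.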
@@ -682,29 +665,21 @@ Each tool call is normalized into a structured event:
 
 The primary agent (`agents/claude_baseline_agent.py`, 2,090 lines) is a Harbor-compatible agent that wraps Claude Code for benchmark execution:
 
-```
-┌─────────────────────────────────────────────────────────────┐
-│ Claude Baseline Agent │
-│ │
-│ ┌─────────────────┐ ┌──────────────────────────────┐ │
-│ │ Config Detection │ │ V5 Preamble Template │ │
-│ │ BASELINE_MCP_TYPE│ │ ┌──────────────────────────┐ │ │
-│ │ ├── none │ │ │ # Source Code Access │ │ │
-│ │ ├── sourcegraph │ │ │ Files are NOT present. │ │ │
-│ │ ├── sg_full │────│ │ Use Sourcegraph MCP │ │ │
-│ │ └── artifact_full│ │ │ tools to read code. │ │ │
-│ └─────────────────┘ │ │ {repo_scope} │ │ │
-│ │ │ {workflow_tail} │ │ │
-│ ┌─────────────────┐ │ └──────────────────────────┘ │ │
-│ │ Repo Resolution │ └──────────────────────────────┘ │
-│ │ _get_repo_display│ │
-│ │ _get_repo_list │ ┌──────────────────────────────┐ │
-│ │ Priority: │ │ System Prompt Assembly │ │
-│ │ 1. ENV vars │ │ EVALUATION_CONTEXT + │ │
-│ │ 2. Docker parse │ │ MCP-specific guidance + │ │
-│ │ 3. Fallback │ │ Repo scoping rules │ │
-│ └─────────────────┘ └──────────────────────────────┘ │
-└─────────────────────────────────────────────────────────────┘
+```text
+Claude Baseline Agent (Architecture)
+------------------------------------
+Config detection:
+BASELINE_MCP_TYPE = none | sourcegraph | sg_full | artifact_full
+
+Repo resolution:
+_get_repo_display / _get_repo_list
+Priority: env vars -> Docker metadata -> fallback
+
+Prompt assembly:
+EVALUATION_CONTEXT
++ MCP guidance (when MCP config)
++ repo scope filters
++ workflow tail
 ```
 
 ### 9.2 MCP Preamble
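The agent diagram in the hunk above names three responsibilities: config detection from `BASELINE_MCP_TYPE`, repo resolution with an env-vars-then-Docker-then-fallback priority, and prompt assembly. The sketch below shows one way those pieces could fit together. It is not the code in `agents/claude_baseline_agent.py`: the function names, the `REPO_NAME` variable, the default of `none` when the variable is unset, and the rule that any value other than `none` counts as an MCP config are all assumptions made for illustration.

```python
import os
from typing import Callable, Mapping, Optional

# Values listed in the diagram.
VALID_MCP_TYPES = ("none", "sourcegraph", "sg_full", "artifact_full")


def detect_mcp_type(env: Mapping[str, str] = os.environ) -> str:
    """Read BASELINE_MCP_TYPE; assumption: unset means 'none'."""
    value = env.get("BASELINE_MCP_TYPE", "none")
    if value not in VALID_MCP_TYPES:
        raise ValueError(f"unknown BASELINE_MCP_TYPE: {value!r}")
    return value


def resolve_repo(
    env: Mapping[str, str],
    docker_lookup: Callable[[], Optional[str]],
    fallback: str,
) -> str:
    """Priority: env var, then Docker metadata, then a fallback value."""
    # REPO_NAME is a hypothetical variable name, not taken from the report.
    return env.get("REPO_NAME") or docker_lookup() or fallback


def assemble_prompt(
    mcp_type: str,
    evaluation_context: str,
    mcp_guidance: str,
    repo_scope: str,
    workflow_tail: str,
) -> str:
    """Join prompt parts; MCP guidance is included only for MCP configs."""
    parts = [evaluation_context]
    if mcp_type != "none":  # assumption: every non-'none' type is an MCP config
        parts.append(mcp_guidance)
    parts.extend([repo_scope, workflow_tail])
    return "\n\n".join(part for part in parts if part)
```

Passing the environment in as a mapping, rather than reading `os.environ` inside each function, keeps the priority order easy to test without touching the real process environment.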
@@ -740,23 +715,23 @@ The MCP configuration provides 13 Sourcegraph MCP tools:
 
 ### 9.4 Docker Environment Variants
 
-```
-┌────────────────────────────────────────────────────────────────┐
-│ Three Dockerfile Variants │
-│ │
-│ Dockerfile (Baseline) Dockerfile.sg_only Dockerfile. │
-│ ┌────────────────────┐ ┌──────────────────┐ artifact_only │
-│ │ FROM base_image │ │ FROM base_image │ ┌────────────┐│
-│ │ CLONE full repo │ │ CLONE + truncate │ │ FROM ubuntu ││
-│ │ at pinned commit │ │ all source files │ │ No code ││
-│ │ │ │ recommit (no git │ │ .artifact_ ││
-│ │ Full source access │ │ history bypass) │ │ only_mode ││
-│ │ │ │ │ │ marker file ││
-│ │ Verifier runs │ │ Clone manifest │ │ ││
-│ │ against local code │ │ for verifier │ │ Agent writes││
-│ │ │ │ restoration │ │ answer.json ││
-│ └────────────────────┘ └──────────────────┘ └────────────┘│
-└────────────────────────────────────────────────────────────────┘
+```text
+Three Dockerfile Variants
+-------------------------
+1) Dockerfile (Baseline)
+- clone full repo at pinned commit
+- full local source access
+- verifier runs against local code
+
+2) Dockerfile.sg_only (MCP)
+- clone repo then truncate source files
+- remove local-source bypass paths
+- write clone manifest for verify-time restoration
+
+3) Dockerfile.artifact_only (Org artifact mode)
+- no source checkout
+- marker file: .artifact_only_mode
+- agent produces answer.json artifact
 ```
 
 **File extension truncation** (95+ types): `.py`, `.js`, `.ts`, `.go`, `.java`, `.rs`, `.c`, `.cpp`, `.h`, `.yaml`, `.toml`, `.json`, `.xml`, `.md`, and more.
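The truncation step for `Dockerfile.sg_only` in the hunk above empties source files by extension while leaving them in place. The sketch below illustrates the idea under stated assumptions: it covers only the extensions the report names (the full list of 95+ types is not shown), the helper name `truncate_sources` is hypothetical, and skipping `.git` is a guess at sensible behavior. The actual build most likely does this in shell inside the Dockerfile.

```python
from pathlib import Path

# Only the extensions named in the report; the full 95+ list is not shown.
TRUNCATE_SUFFIXES = frozenset({
    ".py", ".js", ".ts", ".go", ".java", ".rs", ".c",
    ".cpp", ".h", ".yaml", ".toml", ".json", ".xml", ".md",
})


def truncate_sources(root: Path, suffixes=TRUNCATE_SUFFIXES) -> int:
    """Empty every matching file under root; return how many were truncated."""
    count = 0
    for path in root.rglob("*"):
        # Assumption: git metadata is left alone.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_symlink() or not path.is_file():
            continue
        if path.suffix.lower() in suffixes:
            path.write_bytes(b"")  # keep the file, drop its contents
            count += 1
    return count
```

Keeping the empty files, rather than deleting them, preserves the directory layout so the agent can still see which paths exist while having to read their contents through MCP.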
