|
60 | 60 | - Teleport: `benchmarks/ccb_swebenchpro/tasks/instance_gravitational-teleport-0415e422f12454db0c22316cf3eaa5088d6b6322` |
61 | 61 | - Vuls: `benchmarks/ccb_swebenchpro/tasks/instance_future-architect-vuls-139f3a81b66c47e6d8f70ce6c4afe7a9196a6ea8` |
62 | 62 | - Tutanota: `benchmarks/ccb_swebenchpro/tasks/instance_tutao-tutanota-f373ac3808deefce8183dad8d16729839cc330c1-v2939aa9f4356f0dc9f523ee5ce19d09e08ab979b` |
63 | | -- Flipt: 53 tasks available — must identify correct one |
64 | | -- Qutebrowser: 96 tasks — must select 4 suitable ones |
| 63 | +- Flipt: `benchmarks/ccb_swebenchpro/tasks/instance_flipt-io-flipt-6fe76d024ee0c50ddb09c86f4ae0bd4c208fd65f` |
| 64 | +- Qutebrowser: 96 tasks — 4 selected (see US-001 progress) |
| 65 | + |
| 66 | +### Go Task Dockerfile Pattern |
| 67 | +- Go base images already include the Go toolchain, so no test framework needs to be installed |
| 68 | +- Simpler than Python: the Dockerfile needs only `FROM`, `mkdir /logs`, the `/workspace` symlink, `WORKDIR`, and `ENTRYPOINT` |
| 69 | +- No `pip install pytest` equivalent needed — `go test` is built into the Go toolchain |
65 | 70 |
|
66 | 71 | ## Progress |
67 | 72 |
|
|
120 | 125 | - The navprove task name "vault" doesn't match the actual bug (Python version drop) — task names are labels, content comes from source |
121 | 126 | --- |
122 | 127 |
|
| 128 | +## 2026-02-16 - US-004 |
| 129 | +- Populated 3 Go navprove tasks with content (teleport, vuls, flipt) |
| 130 | +- Files changed: |
| 131 | + - benchmarks/ccb_navprove/navprove-teleport-ssh-001/{instruction.md, environment/Dockerfile, tests/reference_fix.patch} |
| 132 | + - benchmarks/ccb_navprove/navprove-vuls-oval-001/{instruction.md, environment/Dockerfile, tests/reference_fix.patch} |
| 133 | + - benchmarks/ccb_navprove/navprove-flipt-cache-001/{instruction.md, environment/Dockerfile, tests/reference_fix.patch} |
| 134 | +- Source bugs: |
| 135 | + - Teleport: U2F multi-device auth limited to single token (9 files, 10353 byte patch) |
| 136 | + - Vuls: Trivy library scanner upgrade — stale imports, API changes, missing ecosystems (9 files, 122190 byte patch) |
| 137 | + - Flipt: Auth middleware doesn't support cookie tokens (1 file, 4755 byte patch) |
| 138 | +- All acceptance criteria verified: |
| 139 | + - `grep -cE '\.(py|go|ts|js|rs)'` returns 0 for all 3 instruction.md files |
| 140 | + - All reference_fix.patch files start with 'diff --git' and are >50 bytes |
| 141 | + - All instruction.md files are >200 bytes (1832, 1854, 1659) |
| 142 | + - All Dockerfiles have FROM with correct source base image |
| 143 | +- Go tasks don't need pytest because the SWE-bench Pro base images already have the Go toolchain installed |
| 144 | +- **Learnings for future iterations:** |
| 145 | + - Go navprove Dockerfiles are simpler than Python ones — no need to install pytest/pytest-timeout, just need the /workspace symlink |
| 146 | + - Vuls has by far the largest patch (122KB, 9 files), but the instruction can still be symptom-only by focusing on the user-facing failures (DB client init, missing ecosystem detection) |
| 147 | + - Flipt is the cleanest navprove task: single-file patch, clear symptom (cookie auth fails), narrow test scope |
| 148 | + - The test.sh scaffolds already correctly use `go test -run TestRegression -v -timeout 60s` for Go tasks |
| 149 | +--- |
| 150 | + |
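
The acceptance criteria listed in the US-004 entry could be re-checked mechanically. The following is a minimal sketch, assuming the `instruction.md`, `environment/Dockerfile`, and `tests/reference_fix.patch` layout shown in the changed-files list. The function name `check_task` and the failure messages are hypothetical and are not part of the repository. It only confirms that a `FROM` line exists; whether the base image is the correct one still needs a manual comparison against the source task.

```python
import re
from pathlib import Path

# Same pattern as the grep check: a dot followed by a source-file extension.
SOURCE_EXT = re.compile(r"\.(py|go|ts|js|rs)")


def check_task(task_dir):
    """Return a list of failed acceptance criteria for one task directory.

    Hypothetical helper; raises FileNotFoundError if an expected file is missing.
    """
    task = Path(task_dir)
    failures = []

    instruction = task / "instruction.md"
    text = instruction.read_text(encoding="utf-8")
    # Equivalent of `grep -cE` returning 0: no line may match the pattern.
    if any(SOURCE_EXT.search(line) for line in text.splitlines()):
        failures.append("instruction.md mentions a source file extension")
    if instruction.stat().st_size <= 200:
        failures.append("instruction.md is not larger than 200 bytes")

    patch = (task / "tests" / "reference_fix.patch").read_bytes()
    if not patch.startswith(b"diff --git"):
        failures.append("reference_fix.patch does not start with 'diff --git'")
    if len(patch) <= 50:
        failures.append("reference_fix.patch is not larger than 50 bytes")

    dockerfile = (task / "environment" / "Dockerfile").read_text(encoding="utf-8")
    # Only checks that a FROM line exists, not that the image is the right one.
    if not any(line.strip().upper().startswith("FROM ") for line in dockerfile.splitlines()):
        failures.append("Dockerfile has no FROM line")

    return failures
```

Calling `check_task("benchmarks/ccb_navprove/navprove-flipt-cache-001")` from the repository root would return an empty list when every criterion holds, and one message per failed criterion otherwise.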