{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T21:20:01Z","event_type":"updated","id":54,"issue_id":"CodeScaleBench-2kz","new_value":"{\"notes\":\"2026-03-09 OpenHands verification results — SYSTEMATIC FAILURE:\\n\\nAll 17/18 completed tasks show the same error pattern:\\n- OpenHands LocalRuntime fails to start: tenacity.RetryError during _wait_until_alive\\n- Error location: openhands/runtime/impl/local/local_runtime.py:393\\n- The action execution server (jupyter-kernelgateway + ipykernel) cannot bind/connect\\n- Agent never actually executes any actions → no output files → verifier scores 0.0\\n- OpenHands version: 1.4.0\\n\\n2 false-positive non-zero scores:\\n- element-web MCP (1.0): Tests passed on pre-existing code because agent made no changes — verifier scored the unmodified state which happened to pass some tests\\n- django-rate-limit (0.05): Same pattern — verifier scored partial on existing repo state\\n\\nROOT CAUSE: OpenHands LocalRuntime is incompatible with Daytona sandbox networking. The LocalRuntime expects to bind localhost ports for its action execution server (jupyter-kernelgateway), but Daytona sandbox networking may not support this.\\n\\nPOSSIBLE FIXES:\\n1. Switch to DockerRuntime inside Daytona (nested Docker) — unlikely to work\\n2. Configure OpenHands to use a different port/socket binding\\n3. Run OpenHands tasks on local Docker instead of Daytona\\n4. Downgrade OpenHands to a version with compatible runtime\\n5. 
Debug the specific RuntimeError inside _wait_until_alive\\n\\nsgonly_verifier_wrapper.sh fix VERIFIED WORKING: before_repo_set_cmd correctly ran on the element-web MCP task (git reset + checkout visible in verifier log).\\n\\n1/18 task still running (element-web baseline — SWEAP image build on Daytona is slow).\"}","old_value":"{\"id\":\"CodeScaleBench-2kz\",\"title\":\"Verify harness fixes by rerunning historical Claude/OpenHands failures\",\"description\":\"Run a focused verification batch to prove the current task-contract and harness hardening eliminates the earlier random patch churn.\\n\\nScope:\\n- Claude Code regression sentinels:\\n - mcp_ccx-onboard-search-207\\n - mcp_ccx-onboard-search-208\\n - mcp_ccx-onboard-search-210\\n - mcp_bustub-hyperloglog-impl-001\\n - mcp_django-sensitive-file-exclusion-001\\n - mcp_flink-window-late-data-fix-001\\n - mcp_element-web-unread-indicators-diverge-fix-001\\n - clickhouse-mergetree-arch-understand-001 (confirm Daytona/local routing now that storage metadata was corrected)\\n- OpenHands regression sentinel:\\n - ccx-onboard-search-212\\n\\nAcceptance criteria:\\n- Produce a small rerun manifest or manifests for the tasks above.\\n- Execute the reruns once accounts are ready.\\n- Confirm whether each task now completes as a valid run without ad hoc task-specific patches.\\n- Record any remaining failures as either harness bugs, task bugs, or infra issues with exact root cause.\\n- If clean, note which tasks should remain in the smoke/verification matrix as permanent regression sentinels.\\n\",\"notes\":\"2026-03-09 OpenHands verification subset launched:\\n- 9 tasks x 2 configs (baseline + MCP) = 18 runs via Daytona\\n- Covers ALL 9 verifier families: checklist, continuous, diff_similarity, f1, ir_checklist, oracle_checks, repo_state_heuristic, semantic_similarity, test_ratio\\n- Also includes sgonly_verifier_wrapper.sh fix for SWE-bench Pro before_repo_set_cmd\\n- Run dir: 
runs/staging/openhands_sonnet46_20260309_205917\\n- Accounts: 1,2,4,5 (account3 held)\\n- Monitoring for completion and reward extraction\",\"status\":\"in_progress\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T13:11:58Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T21:03:17Z\"}"}