Commit a8131df

sjarmak and claude committed
fix: kill OpenHands background daemons after exit to prevent Daytona session hang
OpenHands spawns persistent background processes (tmux, jupyter kernel gateway, ipykernel, action execution server) that outlive the main process. These orphans prevent Daytona's session-command from reporting an exit code, causing Harbor's `_poll_response` loop to hang indefinitely. The sandbox is left orphaned in `started` state with no result collection, verifier run, or teardown. Claude Code runs are unaffected because they don't spawn persistent background services.

The fix wraps the upstream OpenHands command in a group and appends pkill cleanup of known daemon processes, preserving the original exit code so Harbor proceeds normally through verification and finalization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6201cce commit a8131df

2 files changed: +70 -5 lines changed

agents/harnesses/openhands/agent.py

Lines changed: 59 additions & 5 deletions
@@ -55,13 +55,66 @@ def _litellm_codex_model(model_name: str) -> str:
 class OpenHandsHarnessAgent(BaselineHarnessMixin, OpenHands):
     """OpenHands CLI agent extended with evaluation context and MCP wiring."""
 
+    OPENHANDS_WORKSPACE_PREAMBLE = """## OpenHands Workspace Rules
+
+- The provided working directory is the only submission surface for this evaluation.
+- Do not `git clone`, `git init`, or fetch the target repository into the working directory or any of its subdirectories.
+- Use MCP to inspect remote code, then create or edit only the files you need directly in the working directory using repository-relative paths.
+- The verifier only reads changes from the provided working directory, not from any self-cloned checkout.
+"""
+
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         # LiteLLM inside the container needs a provider prefix (e.g. openai/gpt-5.3-codex)
         # or it raises "LLM Provider NOT provided". Normalize Codex models so env LLM_MODEL works.
         if self.model_name:
             self.model_name = _litellm_codex_model(self.model_name)
 
+    # OpenHands spawns persistent background daemons (tmux, jupyter kernel
+    # gateway, action execution server) that outlive the main process. These
+    # orphans prevent Daytona's session-command from ever reporting an exit
+    # code, which causes Harbor's _poll_response loop to hang indefinitely.
+    # We wrap the upstream command so that once the main pipeline exits we
+    # kill the leftovers, letting the session terminate cleanly.
+    _CLEANUP_SUFFIX = (
+        "; _oh_rc=$?; "
+        # Kill known OpenHands daemons that outlive the main process
+        "pkill -f 'jupyter-kernelgateway' 2>/dev/null; "
+        "pkill -f 'ipykernel_launcher' 2>/dev/null; "
+        "pkill -f 'openhands.runtime.action_execution_server' 2>/dev/null; "
+        "pkill -f 'tmux' 2>/dev/null; "
+        "exit $_oh_rc"
+    )
+
+    def create_run_agent_commands(self, instruction: str):
+        instruction = self._resolve_instruction_text(instruction)
+        instruction = self._prepare_instruction(instruction)
+        mcp_type = os.environ.get("BASELINE_MCP_TYPE", "none").lower()
+        if (
+            mcp_type in ("sourcegraph_full", "sourcegraph_base", "sourcegraph_isolated")
+            and "## OpenHands Workspace Rules" not in instruction
+        ):
+            instruction = f"{self.OPENHANDS_WORKSPACE_PREAMBLE}\n\n{instruction}"
+        self._save_instruction_artifact(instruction)
+        exec_inputs = OpenHands.create_run_agent_commands(self, instruction)
+        # Append daemon cleanup so Daytona session exits cleanly
+        for ei in exec_inputs:
+            ei.command = f"{{ {ei.command} }}{self._CLEANUP_SUFFIX}"
+        return exec_inputs
+
+    def _build_workspace_guidance(self, workdir: str) -> str:
+        repo_list = self._get_repo_list()
+        if not repo_list:
+            repo_list = [self._get_repo_display()]
+        repo_lines = "\n".join(f"- `github.com/{repo}`" for repo in repo_list)
+        return (
+            f"{self.OPENHANDS_WORKSPACE_PREAMBLE}\n\n"
+            "## Sourcegraph MCP\n\n"
+            "- Use the provided MCP tools before local edits.\n"
+            f"- Scope remote discovery to:\n{repo_lines}\n"
+            f"- Write your final file edits directly under `{workdir}` using repository-relative paths.\n"
+        )
+
     # Port for the in-container auth proxy (Sourcegraph needs "token" auth,
     # but OpenHands hardcodes "Bearer").
     _SG_PROXY_PORT = 18973
@@ -157,12 +210,13 @@ async def _configure_mcp(self, environment: BaseEnvironment) -> None:
         os.environ["OPENHANDS_MCP_SHTTP_SERVERS"] = repr(servers)
         os.environ["OPENHANDS_AGENT_ENABLE_MCP"] = "true"
 
-        # Upload CLAUDE.md for instruction context
-        claude_md = "## Sourcegraph MCP\nUse the provided MCP tools before local edits."
-        claude_md_path = self.logs_dir / "CLAUDE.md"
-        claude_md_path.write_text(claude_md)
+        # Upload guidance through OpenHands' native workspace instruction file.
+        guidance = self._build_workspace_guidance(workdir)
+        guidance_path = self.logs_dir / ".openhands_instructions"
+        guidance_path.write_text(guidance)
         await environment.upload_file(
-            source_path=str(claude_md_path), target_path=f"{workdir}/CLAUDE.md"
+            source_path=str(guidance_path),
+            target_path=f"{workdir}/.openhands_instructions",
         )
 
         # Save debug artifacts
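
The wrap-and-cleanup pattern used by `_CLEANUP_SUFFIX` can be tried in isolation. The following is a minimal sketch, not the harness code: `wrap_with_cleanup` and `run` are hypothetical helper names, the command is run through `sh -c` purely for the demo, and `false` stands in for the `pkill` calls so that running the sketch does not signal any real processes. It also assumes a newline before the closing `}`, since the shell requires a `;` or newline there; the diff itself does not show how the upstream command string ends.

```python
import subprocess


def wrap_with_cleanup(command: str, cleanup_cmds: list[str]) -> str:
    """Group `command`, save its exit status, run cleanup, then re-exit with it.

    Hypothetical helper mirroring the shape of _CLEANUP_SUFFIX in the diff.
    """
    cleanup = "".join(f"{c} 2>/dev/null; " for c in cleanup_cmds)
    # The newline before `}` guarantees the group is terminated even when
    # `command` does not end in `;` (an assumption of this sketch).
    return f"{{ {command}\n}}; _oh_rc=$?; {cleanup}exit $_oh_rc"


def run(command: str) -> int:
    """Run a shell command string and return its exit status."""
    return subprocess.run(["sh", "-c", command]).returncode


# `false` is a stand-in cleanup step that always fails; the wrapper should
# still report the status of the main command, not of the cleanup.
failing = run(wrap_with_cleanup("(exit 3)", ["false"]))
passing = run(wrap_with_cleanup("true", ["false"]))
print(failing, passing)
```

Because the status is captured in `_oh_rc` immediately after the group, a failing cleanup step cannot overwrite it. One thing worth checking in a real deployment: `pkill -f` matches against full command lines, so if the wrapped string is itself passed as an argument to a shell (as `sh -c` does here), that shell's own command line contains the patterns and may be signalled too. Whether that applies depends on how Daytona launches session commands, which the commit does not show.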

docs/ops/TROUBLESHOOTING.md

Lines changed: 11 additions & 0 deletions
@@ -24,3 +24,14 @@
 3. Check whether failure matches known fingerprint.
 4. Classify as infra / verifier / task / agent behavior.
 5. Choose isolated rerun or fix path.
+
+## Daytona / OpenHands Notes
+
+- Do not classify a trial as a Daytona image-build stall from `trial.log` alone. Some orphaned or crashed trials leave `trial.log` at `Building environment from ...` even after agent setup succeeded.
+- Before calling it a remote build issue, check for:
+  - `agent/setup/return-code.txt`
+  - `agent/instruction.txt`
+  - `agent/command-0/command.txt`
+- If those files exist, the environment build already progressed past Docker build and the failure is later in launcher orchestration or agent startup handoff.
+- For MCP harness triage, inspect `agent/instruction.txt` first and confirm it names the expected `github.com/sg-evals/...` mirror. A generic repo target such as `github.com/the codebase` indicates prompt wiring drift, not task difficulty.
+- **OpenHands orphaned sandbox / hung harness**: OpenHands spawns persistent background daemons (tmux, jupyter kernel gateway, ipykernel, action execution server) that outlive the main process. These orphans prevent Daytona's session-command from reporting an exit code, causing Harbor's `_poll_response` loop to hang indefinitely. The `OpenHandsHarnessAgent` in `agents/harnesses/openhands/agent.py` includes a `_CLEANUP_SUFFIX` that kills these daemons after the main pipeline exits. If you see a sandbox stuck in `started` state with no harness process running, this is the likely cause. Claude Code runs are unaffected because they don't spawn persistent background services.
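
The marker-file check described in the notes above could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the three paths are assumed to be relative to a trial's output directory, which the notes imply but do not state.

```python
from pathlib import Path

# Files named in the troubleshooting notes; if all exist, the trial got past
# the Docker build. Treating them as relative to the trial directory is an
# assumption of this sketch.
POST_BUILD_MARKERS = (
    "agent/setup/return-code.txt",
    "agent/instruction.txt",
    "agent/command-0/command.txt",
)


def missing_post_build_markers(trial_dir) -> list[str]:
    """Return the marker files that are absent under `trial_dir`."""
    root = Path(trial_dir)
    return [m for m in POST_BUILD_MARKERS if not (root / m).is_file()]


def passed_image_build(trial_dir) -> bool:
    """True when every marker exists, i.e. the failure is later than the build."""
    return not missing_post_build_markers(trial_dir)
```

A trial whose `trial.log` stops at `Building environment from ...` but for which `passed_image_build` returns `True` would then be triaged as a launcher or agent-startup problem rather than a remote build stall.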
