feat: add execute_tool API endpoint for direct tool execution on conversations#2356
feat: add execute_tool API endpoint for direct tool execution on conversations#2356
Conversation
…ersations
Add the ability to execute tools (like the terminal) directly on a
conversation without going through the agent loop. This enables pre-run
setup operations like running .openhands/setup.sh through the agent's
persistent terminal session so environment changes persist.
Changes:
- Add ExecuteToolRequest/Response models
- Add EventService.execute_tool() method
- Add ConversationService.execute_tool() method
- Add POST /conversations/{id}/execute_tool endpoint
- Implement RemoteConversation.execute_tool() (was NotImplementedError)
- Add AsyncRemoteWorkspace.execute_tool() convenience method
Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
|
@OpenHands please fix https://github.com/OpenHands/software-agent-sdk/actions/runs/22794005415/job/66125654112?pr=2356 and https://github.com/OpenHands/software-agent-sdk/actions/runs/22794005420/job/66125654134?pr=2356 Then help me review the PR and test if this PR is working - When you are done, upload artifacts to |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
- Fix pyright errors in event_service.py by capturing _conversation in a local variable before the closure (type narrowing) - Fix Observation ABC instantiation in remote_conversation.py by using a concrete _RemoteObservation subclass - Update test to verify new execute_tool implementation behavior - Fix line-too-long issues in docstrings and field descriptions Co-authored-by: openhands <openhands@all-hands.dev>
|
📁 PR Artifacts Notice This PR contains a
|
Summary of Work DoneI fixed both failing CI checks on PR #2356, reviewed the PR, functionally tested the new endpoint, and pushed everything (including CI Fixes (commit
|
Coverage Report •
|
||||||||||||||||||||||||||||||||||||||||||||||||||
|
@OpenHands please review this PR, can you test whether this works in the https://github.com/OpenHands/software-agent-sdk/blob/main/examples/02_remote_agent_server/01_convo_with_local_agent_server.py? We should able to programmatically access it (eg conversation.execute_tool) and make sure it works. When it actually works, please modify the appropriate docs in OpenHands/docs repo and send a PR |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
|
[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up. |
1 similar comment
|
[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up. |
|
@OpenHands Merge from main, resolve the conflict, and then /iterate until this PR is ready. You should keep iterating until the PRReviewBot actually approves this PR or provides no additional comments. |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
Resolve conflicts in conversation_router.py and models.py: - Keep both execute_tool and fork endpoints in conversation_router.py - Keep both ExecuteToolRequest/Response and AgentResponseResult in models.py Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - Solid design but needs refinement on error handling and eval risk consideration.
This PR solves a real problem (environment persistence for setup.sh) with a straightforward architectural approach. However, it introduces a new tool execution path that bypasses the agent loop, which has implications for benchmark behavior and requires human review before merging.
Detailed ReviewArchitecture ✅The layered approach is clean and follows existing patterns:
Testing Observations 🟢
Key Design Insight 💡The core insight here is solid: executing through the agent's persistent terminal session (via the tool executor) rather than ephemeral subprocesses correctly solves the setup.sh environment persistence problem. This is the right architectural choice. Minor ObservationIn [RISK ASSESSMENT]Overall PR: Rationale:
Recommendation: Per the repository's eval risk policy, PRs that change agent behavior or could affect benchmark performance should not be auto-approved. This PR falls squarely into that category. A human maintainer should:
Verdict: ❌ Needs human review before merging (not due to code quality, but due to eval risk policy) |
- Separate error handling: KeyError/ValueError → 400, other exceptions → 500 - Add logger for unexpected server errors - Document expected observation response format in remote_conversation.py - Simplify observation text extraction to use only the content array format Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed review feedback in d6c5a65. Ready for another look. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste - Clean implementation following existing patterns
Summary
The code quality is excellent. The layered architecture is clean, error handling properly separates client/server errors, and the implementation follows repository conventions. All previously flagged issues (observation extraction, error handling) have been resolved.
Evidence: The .pr/review.md document shows thorough manual testing including the critical environment persistence use case.
Eval Risk Policy Note
Per repository guidelines, this PR changes tool execution behavior (adds direct execution path bypassing agent loop) and requires human review + lightweight evals before merge. While the code is solid, the behavioral change could affect benchmark results.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
This PR introduces a new tool execution path that bypasses the standard agent loop. While architecturally sound and solving a real problem (setup.sh environment persistence), it represents a behavioral change that could affect evaluation/benchmark results. The implementation itself is low-risk (good error handling, follows patterns, proper testing), but the feature's impact on agent behavior places it in medium-risk territory.
Recommendation: Human maintainer should verify via lightweight evals that this doesn't affect benchmark performance before merging. The code itself is merge-ready from a quality perspective.
VERDICT:
✅ Code quality approved, awaiting eval verification: Implementation is solid and well-tested. Needs human review for benchmark impact assessment per repo policy.
KEY INSIGHT:
Solves the real environment persistence problem with a straightforward architectural approach. The key insight—executing through the agent's persistent terminal session rather than ephemeral subprocesses—correctly addresses the core issue.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified: The PR successfully delivers its stated goal — executing tools through the agent's persistent terminal session so environment changes persist.
Does this PR achieve its stated goal?
Yes. The PR set out to fix a critical problem: in V1, setup.sh was executed via AsyncRemoteWorkspace.execute_command() which runs commands in ephemeral subprocesses, so environment variable exports, sourced files, and PATH modifications don't persist to the agent's terminal session. This PR adds a POST /api/conversations/{id}/execute_tool endpoint that executes tools directly through the conversation's tool executor (the persistent terminal session), solving the persistence problem.
Evidence: I created a conversation with the terminal tool, then executed export MY_SETUP_VAR=setup_complete via the new endpoint. In a subsequent call, echo $MY_SETUP_VAR correctly returned setup_complete, proving the variable persisted across calls. This is exactly what setup.sh needs — environment changes that survive beyond a single command execution.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Server running (localhost:8000), dependencies installed |
| CI & Tests | ✅ 22/26 pass (agent-server-tests, sdk-tests, pre-commit, check, REST API all pass), 4 pending build jobs |
| Functional Verification | ✅ All core behaviors verified |
Functional Verification
Test 1: Basic tool execution
Step 1 — Create a conversation with terminal tool:
curl -X POST http://localhost:8000/api/conversations \
-H "Content-Type: application/json" \
-d '{"agent": {"llm": {"model": "gpt-4o-mini"}, "tools": [{"name": "terminal"}]}, "workspace": {"type": "local", "working_dir": "/tmp/test_workspace"}}'Got conversation ID: bae8b192-0e88-40f8-ae6d-00812b6f1899
Step 2 — Execute a simple command via new endpoint:
curl -X POST "http://localhost:8000/api/conversations/bae8b192-0e88-40f8-ae6d-00812b6f1899/execute_tool" \
-H "Content-Type: application/json" \
-d '{"tool_name": "terminal", "action": {"command": "echo hello world && pwd"}}'Response:
{
"observation": {
"content": [{"text": "hello world\n/tmp/test_workspace"}],
"exit_code": 0,
"is_error": false
},
"is_error": false
}✅ Command executed successfully through the persistent terminal session.
Test 2: Environment variable persistence (CORE USE CASE)
Step 1 — Set an environment variable:
curl -X POST "http://localhost:8000/api/conversations/$CONV_ID/execute_tool" \
-H "Content-Type: application/json" \
-d '{"tool_name": "terminal", "action": {"command": "export MY_SETUP_VAR=setup_complete"}}'Exit code: 0 ✅
Step 2 — Read the variable in a subsequent call:
curl -X POST "http://localhost:8000/api/conversations/$CONV_ID/execute_tool" \
-H "Content-Type: application/json" \
-d '{"tool_name": "terminal", "action": {"command": "echo $MY_SETUP_VAR"}}'Output: setup_complete
✅ CRITICAL VERIFICATION PASSED: The environment variable persisted across calls! This proves the endpoint executes through the agent's persistent terminal session, not an ephemeral subprocess. This is exactly what the PR set out to achieve — enabling setup.sh to export variables that remain available in the agent's session.
Test 3: Error handling — nonexistent tool
curl -X POST "http://localhost:8000/api/conversations/$CONV_ID/execute_tool" \
-d '{"tool_name": "nonexistent_tool", "action": {"command": "echo test"}}'Response:
{"detail":"\"Tool 'nonexistent_tool' not found. Available tools: ['terminal', 'finish', 'think']\""}✅ Helpful error message listing available tools.
Test 4: Error handling — nonexistent conversation
curl -X POST "http://localhost:8000/api/conversations/00000000-0000-0000-0000-000000000000/execute_tool" \
-d '{"tool_name": "terminal", "action": {"command": "echo test"}}'Response:
{"detail":"Not Found"}✅ Returns 404 as expected.
Test 5: Error handling — invalid action parameters
curl -X POST "http://localhost:8000/api/conversations/$CONV_ID/execute_tool" \
-d '{"tool_name": "terminal", "action": {"invalid_field": "value"}}'Response:
{
"detail": "2 validation errors for TerminalAction\ncommand\n Field required...\ninvalid_field\n Extra inputs are not permitted..."
}✅ Pydantic validation catches schema mismatches with clear error messages.
Issues Found
None. The implementation is solid and achieves its stated goal.
Architecture Assessment
The layered approach is clean and consistent with existing patterns:
- Router (
conversation_router.py) → ConversationService → EventService → LocalConversation.execute_tool() _ensure_agent_ready()handles lazy initialization of tools- Error handling covers tool-not-found, conversation-not-found, and validation errors
- The SDK's
RemoteConversation.execute_tool()(previously NotImplementedError) now calls the server endpoint AsyncRemoteWorkspace.execute_tool()provides a convenience method for workspace-level usage
The key insight — executing through the agent's persistent terminal session rather than ephemeral subprocesses — correctly addresses the setup.sh environment persistence issue that the PR description identifies.
Summary of Work DoneRequest Checklist
Conciseness CheckAll changes are directly related to the request:
No extraneous changes were introduced. The only files modified beyond the merge were |
Demonstrates direct tool execution on conversations via both the SDK (RemoteConversation.execute_tool) and raw HTTP (POST endpoint). Shows that environment changes persist across calls and into the agent session. Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands /codereview /github-pr-review 🤖 This comment was automatically posted by the pr-triage skill (OpenHands) on behalf of a maintainer. See PR #2906 for skill details. |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
xingyaoww
left a comment
There was a problem hiding this comment.
🟡 Acceptable — Solid feature solving a real problem (setup.sh env persistence). Architecture follows existing layered patterns. A few things worth addressing before merge.
Note: This touches tool execution code paths, so flagging for human maintainer review before merge (per eval-risk policy).
This review was generated by an AI agent (OpenHands).
| observation_data = result.get("observation", {}) | ||
| is_error = result.get("is_error", False) | ||
|
|
||
| # Server returns observation.model_dump(mode="json") which has: | ||
| # {"content": [{"text": "...", "kind": "text"}, ...], "is_error": bool} | ||
| # Extract text from the content array. | ||
| text = "" | ||
| content = observation_data.get("content", []) | ||
| if content and isinstance(content, list) and len(content) > 0: | ||
| text = content[0].get("text", "") | ||
|
|
||
| # Use a concrete subclass since Observation is abstract | ||
| class _RemoteObservation(Observation): | ||
| pass | ||
|
|
||
| return _RemoteObservation.from_text(text=text, is_error=is_error) |
There was a problem hiding this comment.
🟠 Important: Data loss on deserialization — The server serializes the full typed observation (e.g., TerminalObservation with command, exit_code, timeout, metadata), but the client only extracts the first text content item and wraps it in a bare _RemoteObservation. All tool-specific fields are silently dropped.
For the setup.sh use case, this means callers can't check exit_code to know if the command succeeded or timeout to detect hung scripts. Only is_error (a coarser signal) survives.
Consider either:
- Returning the raw observation dict alongside the
Observationobject so callers can access extra fields - Attempting to deserialize into the correct concrete type (e.g., via a discriminated union or type registry)
- At minimum, concatenating all text content items instead of just
content[0]
Also: the content and isinstance(content, list) and len(content) > 0 guard is redundant — if content: handles empty/None. And defining _RemoteObservation inside the method body creates a new class on every call — consider a module-level class _RemoteObservation(Observation): pass.
| def _execute(): | ||
| # Get the tool to resolve the action type | ||
| conversation._ensure_agent_ready() | ||
| tool = conversation.agent.tools_map.get(tool_name) | ||
| if tool is None: | ||
| available_tools = list(conversation.agent.tools_map.keys()) | ||
| raise KeyError( | ||
| f"Tool '{tool_name}' not found. Available tools: {available_tools}" | ||
| ) | ||
| # Validate the action dict against the tool's action type | ||
| action = tool.action_type.model_validate(action_dict) | ||
| observation = conversation.execute_tool(tool_name, action) | ||
| return observation.model_dump(mode="json") |
There was a problem hiding this comment.
🟡 Suggestion: The _execute() closure calls conversation._ensure_agent_ready() directly (line 383), then calls conversation.execute_tool() which calls _ensure_agent_ready() again internally (see local_conversation.py:1238). The explicit call here is redundant — execute_tool already handles lazy init.
Also worth noting: _ensure_agent_ready() is a private method on LocalConversation. Calling it from EventService couples this layer to the implementation detail. Consider just relying on execute_tool() to handle initialization, or extracting the tool lookup + action validation into a method on LocalConversation itself.
| def _execute(): | |
| # Get the tool to resolve the action type | |
| conversation._ensure_agent_ready() | |
| tool = conversation.agent.tools_map.get(tool_name) | |
| if tool is None: | |
| available_tools = list(conversation.agent.tools_map.keys()) | |
| raise KeyError( | |
| f"Tool '{tool_name}' not found. Available tools: {available_tools}" | |
| ) | |
| # Validate the action dict against the tool's action type | |
| action = tool.action_type.model_validate(action_dict) | |
| observation = conversation.execute_tool(tool_name, action) | |
| return observation.model_dump(mode="json") | |
| def _execute(): | |
| # Validate tool exists and action matches its schema | |
| tool = conversation.agent.tools_map.get(tool_name) | |
| if tool is None: | |
| # ensure_agent_ready is called by execute_tool, but we need | |
| # the tool reference here for action validation first | |
| conversation._ensure_agent_ready() | |
| tool = conversation.agent.tools_map.get(tool_name) | |
| if tool is None: | |
| available_tools = list(conversation.agent.tools_map.keys()) | |
| raise KeyError( | |
| f"Tool '{tool_name}' not found. Available tools: {available_tools}" | |
| ) | |
| # Validate the action dict against the tool's action type | |
| action = tool.action_type.model_validate(action_dict) | |
| observation = conversation.execute_tool(tool_name, action) | |
| return observation.model_dump(mode="json") |
Actually, the simplest fix: just drop the explicit _ensure_agent_ready() call and rely on execute_tool() doing it. But then you need to move the tool lookup after execute_tool (or restructure). WDYT about having LocalConversation.execute_tool accept a dict and do the validation internally?
| conversation_id, request.tool_name, request.action | ||
| ) | ||
| except KeyError as e: | ||
| raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail=str(e)) |
There was a problem hiding this comment.
🟢 Acceptable: The except KeyError before except Exception isn't redundant — it correctly maps known errors to 400 vs generic errors to 500. The separation serves as documentation of expected error types. 👍
| async def execute_tool( | ||
| self, | ||
| conversation_id: str, | ||
| tool_name: str, | ||
| action: dict[str, Any], | ||
| timeout: float = 600.0, | ||
| ) -> dict[str, Any]: | ||
| """Execute a tool on a conversation through the agent server. | ||
|
|
||
| This calls the agent server's execute_tool endpoint, which runs the tool | ||
| through the conversation's tool executor (e.g., the persistent terminal | ||
| session). This is useful for pre-run setup operations like running | ||
| .openhands/setup.sh where environment changes need to persist. | ||
|
|
||
| Args: | ||
| conversation_id: The conversation ID (hex string) | ||
| tool_name: The name of the tool to execute (e.g., 'terminal') | ||
| action: The action parameters as a dictionary | ||
| timeout: Timeout in seconds | ||
|
|
||
| Returns: | ||
| The response dict with 'observation' and 'is_error' keys | ||
| """ | ||
| payload = { | ||
| "tool_name": tool_name, | ||
| "action": action, | ||
| } | ||
| response = await self.client.post( | ||
| f"/api/conversations/{conversation_id}/execute_tool", | ||
| json=payload, | ||
| timeout=timeout, | ||
| ) | ||
| response.raise_for_status() | ||
| return response.json() |
There was a problem hiding this comment.
🟡 Suggestion: Is this method used by anything in this PR or the companion PR? RemoteConversation.execute_tool() goes through _send_request (the sync httpx client), not through this async workspace method. If this is for the companion PR's AsyncRemoteWorkspace usage, it would be cleaner to add it in that PR. Otherwise it's dead code that might drift out of sync with the actual endpoint contract.
| async def execute_tool(self, tool_name: str, action_dict: dict) -> dict: | ||
| """Execute a tool directly on the conversation. | ||
|
|
||
| Bypasses the agent loop. Useful for pre-run setup operations | ||
| like running .openhands/setup.sh through the agent's terminal | ||
| tool so environment changes persist in the agent's session. | ||
|
|
||
| Args: | ||
| tool_name: The name of the tool to execute (e.g., 'terminal') | ||
| action_dict: The action parameters as a dictionary | ||
|
|
||
| Returns: | ||
| The observation as a dictionary | ||
|
|
||
| Raises: | ||
| ValueError: If the service is inactive | ||
| KeyError: If the tool is not found | ||
| """ | ||
| if not self._conversation: | ||
| raise ValueError("inactive_service") | ||
|
|
||
| conversation = self._conversation | ||
| loop = asyncio.get_running_loop() | ||
|
|
||
| def _execute(): | ||
| # Get the tool to resolve the action type | ||
| conversation._ensure_agent_ready() | ||
| tool = conversation.agent.tools_map.get(tool_name) | ||
| if tool is None: | ||
| available_tools = list(conversation.agent.tools_map.keys()) | ||
| raise KeyError( | ||
| f"Tool '{tool_name}' not found. Available tools: {available_tools}" | ||
| ) | ||
| # Validate the action dict against the tool's action type | ||
| action = tool.action_type.model_validate(action_dict) | ||
| observation = conversation.execute_tool(tool_name, action) | ||
| return observation.model_dump(mode="json") | ||
|
|
||
| return await loop.run_in_executor(None, _execute) |
There was a problem hiding this comment.
🟡 Suggestion: Concurrency with run() — If a caller hits execute_tool while run() is in progress (the agent is actively using the terminal), both will attempt to execute tool commands concurrently on the same conversation. _ensure_agent_ready is thread-safe, but the tool executor itself (e.g., terminal session) may not handle concurrent commands gracefully.
Consider documenting that execute_tool is intended for pre-run setup only and may produce undefined behavior if called concurrently with run(). Or acquire the state lock to serialize access.
SummaryI completed the code review for PR #2356 ("feat: add execute_tool API endpoint for direct tool execution on conversations") as requested by the What was doneI posted a single structured GitHub PR review with 5 inline comments at: #2356 (review) Review findings
Per the repo's custom codereview guide (eval-risk policy), I submitted a COMMENT review (not APPROVE) since this PR touches tool execution code paths that could affect benchmark performance. Checklist
No code changes were madeThis was a review-only task — no commits, no code modifications. Only a GitHub review was posted. |
Description
Add the ability to execute tools (like the terminal) directly on a conversation without going through the agent loop. This enables pre-run setup operations like running
.openhands/setup.shthrough the agent's persistent terminal session so environment changes persist.Problem
In V1,
setup.shwas executed viaAsyncRemoteWorkspace.execute_command()which runs commands in an ephemeral subprocess — environment changes (exported variables, sourced files, PATH modifications, etc.) don't persist to the agent's terminal session. This madesetup.sheffectively broken in V1 compared to V0.Solution
Add a new
POST /api/conversations/{id}/execute_toolendpoint that allows executing a tool directly on a conversation. The endpoint:_ensure_agent_ready())LocalConversation.execute_tool()Changes
Agent Server:
models.py: AddedExecuteToolRequest/ExecuteToolResponsemodelsevent_service.py: AddedEventService.execute_tool()— runs a tool throughLocalConversation.execute_tool()in a thread executorconversation_service.py: AddedConversationService.execute_tool()— delegates toEventServiceconversation_router.py: AddedPOST /conversations/{id}/execute_toolendpoint with 404/400/500 error handling and loggingSDK:
remote_conversation.py: ImplementedRemoteConversation.execute_tool()(wasNotImplementedError) — calls the new server endpoint with proper observation deserializationasync_remote_workspace.py: AddedAsyncRemoteWorkspace.execute_tool()convenience methodCompanion PR
This PR is used by the OpenHands companion PR: OpenHands/OpenHands (feat/setup-sh-via-terminal-tool) which restructures the V1 startup flow to run
setup.shthrough the terminal tool.Testing
RemoteConversation.execute_tool()covering success, error, and deserialization pathsAgent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:1df7714-pythonRun
All tags pushed for this build
About Multi-Architecture Support
1df7714-python) is a multi-arch manifest supporting both amd64 and arm641df7714-python-amd64) are also available if needed