# Add SDK Guide for Critic Feature (Experimental) #263
**Draft**: xingyaoww wants to merge 8 commits into `main` from `xw/critic-model`
+183 −0

**Commits (8)**
- `44db99f` Add SDK guide for Critic feature (experimental) (openhands-agent)
- `e190585` Simplify critic documentation: focus on user usage (openhands-agent)
- `97437c7` Add 'What is a Critic?' section explaining use cases (openhands-agent)
- `79110c4` Add reference to SWE-Bench blog post and mention forthcoming technica… (openhands-agent)
- `5372457` Change 'evaluation model' to 'evaluator' in critic description (openhands-agent)
- `9e77c4c` Update critic score visualization example to match actual output (openhands-agent)
- `26a9826` Apply suggestion from @xingyaoww (xingyaoww)
- `450766e` Rename example file from 34_critic_model_example.py to 34_critic_exam… (openhands-agent)
**Files changed** (one new file, +183 lines):

---
title: Critic (Experimental)
description: Real-time evaluation of agent actions using an LLM-based critic model.
---

<Warning>
**This feature is highly experimental** and subject to change. The API, configuration, and behavior may evolve significantly based on feedback and testing.
</Warning>

<Note>
The critic model is hosted by the OpenHands LLM Provider and is currently free to use. This example is available on GitHub: [examples/01_standalone_sdk/34_critic_example.py](https://github.com/OpenHands/software-agent-sdk/blob/main/examples/01_standalone_sdk/34_critic_example.py)
</Note>

## What is a Critic?

A **critic** is an evaluator that analyzes agent actions and conversation history to predict the quality or success probability of agent decisions. The critic runs alongside the agent and provides:

- **Quality scores**: Probability scores between 0.0 and 1.0 indicating predicted success
- **Real-time feedback**: Scores computed during agent execution, not just at completion

You can use critic scores to build automated workflows, such as triggering the agent to reflect on and fix its previous solution when the critic indicates poor task performance, as in the sketch below.
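A minimal sketch of such a gate, assuming an `agent` configured with a critic as in the Quick Start below, and assuming `Conversation` accepts `workspace` and `callbacks` together (the Quick Start and callback examples each show them separately). The task, prompts, and 0.5 threshold are illustrative; 0.5 mirrors the critic's `success` cutoff:

```python
import os

from openhands.sdk import ActionEvent, Conversation, Event, MessageEvent

last_score: float | None = None


def track_score(event: Event) -> None:
    """Remember the most recent critic score observed during the run."""
    global last_score
    if isinstance(event, (ActionEvent, MessageEvent)) and event.critic_result is not None:
        last_score = event.critic_result.score


# `agent` is an Agent configured with a critic, as in the Quick Start example below.
conversation = Conversation(agent=agent, workspace=os.getcwd(), callbacks=[track_score])
conversation.send_message("Fix the failing test in tests/test_app.py.")  # illustrative task
conversation.run()

# Illustrative gate: when the critic is pessimistic, ask the agent to reflect and retry.
if last_score is not None and last_score < 0.5:
    conversation.send_message(
        "The critic scored your last attempt low. "
        "Please review your previous solution and fix any remaining issues."
    )
    conversation.run()
```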
<Note>
This critic is a more advanced extension of the approach described in our blog post [SOTA on SWE-Bench Verified with Inference-Time Scaling and Critic Model](https://openhands.dev/blog/sota-on-swe-bench-verified-with-inference-time-scaling-and-critic-model). A technical report with detailed evaluation metrics is forthcoming.
</Note>

## Quick Start

When using the OpenHands LLM Provider (`llm-proxy.*.all-hands.dev`), the critic is **automatically configured**; no additional setup is required.

```python icon="python" expandable examples/01_standalone_sdk/34_critic_example.py
"""Example demonstrating critic-based evaluation of agent actions.

This is EXPERIMENTAL.

This shows how to configure an agent with a critic to evaluate action quality
in real-time. The critic scores are displayed in the conversation visualizer.

For All-Hands LLM proxy (llm-proxy.*.all-hands.dev), the critic is auto-configured
using the same base_url with /vllm suffix and "critic" as the model name.
"""

import os
import re
import sys

from openhands.sdk import LLM, Agent, Conversation, Tool
from openhands.sdk.critic import APIBasedCritic
from openhands.sdk.critic.base import CriticBase
from openhands.tools.file_editor import FileEditorTool
from openhands.tools.task_tracker import TaskTrackerTool
from openhands.tools.terminal import TerminalTool


def get_required_env(name: str) -> str:
    value = os.getenv(name)
    if value:
        return value
    sys.exit(
        f"Missing required environment variable: {name}. "
        f"Set {name} before running this example."
    )


def get_default_critic(llm: LLM) -> CriticBase | None:
    """Auto-configure critic for All-Hands LLM proxy.

    When the LLM base_url matches `llm-proxy.*.all-hands.dev`, returns an
    APIBasedCritic configured with:
    - server_url: {base_url}/vllm
    - api_key: same as LLM
    - model_name: "critic"

    Returns None if base_url doesn't match or api_key is not set.
    """
    base_url = llm.base_url
    api_key = llm.api_key
    if base_url is None or api_key is None:
        return None

    # Match: llm-proxy.{env}.all-hands.dev (e.g., staging, prod, eval)
    pattern = r"^https?://llm-proxy\.[^./]+\.all-hands\.dev"
    if not re.match(pattern, base_url):
        return None

    return APIBasedCritic(
        server_url=f"{base_url.rstrip('/')}/vllm",
        api_key=api_key,
        model_name="critic",
    )


llm_api_key = get_required_env("LLM_API_KEY")

llm = LLM(
    model=os.getenv("LLM_MODEL", "anthropic/claude-sonnet-4-5-20250929"),
    api_key=llm_api_key,
    base_url=os.getenv("LLM_BASE_URL", None),
)

# Try auto-configuration for All-Hands proxy, fall back to explicit env vars
critic = get_default_critic(llm)
if critic is None:
    critic = APIBasedCritic(
        server_url=get_required_env("CRITIC_SERVER_URL"),
        api_key=get_required_env("CRITIC_API_KEY"),
        model_name=get_required_env("CRITIC_MODEL_NAME"),
    )


# Configure agent with critic
agent = Agent(
    llm=llm,
    tools=[
        Tool(name=TerminalTool.name),
        Tool(name=FileEditorTool.name),
        Tool(name=TaskTrackerTool.name),
    ],
    # Add critic to evaluate agent actions
    critic=critic,
)

cwd = os.getcwd()
conversation = Conversation(agent=agent, workspace=cwd)

conversation.send_message(
    "Create a file called GREETING.txt with a friendly greeting message."
)
conversation.run()

print("\nAll done! Check the output above for 'Critic Score' in the visualizer.")
```

```bash Running the Example
uv run python examples/01_standalone_sdk/34_critic_example.py
```

## Understanding Critic Results

Critic evaluations produce scores and feedback:

- **`score`**: Float between 0.0 and 1.0 representing predicted success probability
- **`message`**: Optional feedback with detailed probabilities
- **`success`**: Boolean property (`True` if `score >= 0.5`)

Results are automatically displayed in the conversation visualizer:

```
Critic Score: 0.4481 | predicted user sentiment: neutral 0.81
direction_change: 0.61 • vcs_update_requests: 0.31
```

### Accessing Results Programmatically

```python
from openhands.sdk import Event, ActionEvent, MessageEvent


def callback(event: Event):
    if isinstance(event, (ActionEvent, MessageEvent)):
        if event.critic_result is not None:
            print(f"Critic score: {event.critic_result.score:.3f}")
            print(f"Success: {event.critic_result.success}")


conversation = Conversation(agent=agent, callbacks=[callback])
```
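The same callback hook can also aggregate scores across a whole run. A sketch (the list-based aggregation and summary statistics are illustrative, not part of the SDK):

```python
from openhands.sdk import ActionEvent, Conversation, Event, MessageEvent

scores: list[float] = []


def collect_scores(event: Event) -> None:
    """Append every critic score emitted during the run."""
    if isinstance(event, (ActionEvent, MessageEvent)) and event.critic_result is not None:
        scores.append(event.critic_result.score)


# `agent` is an Agent configured with a critic, as in the Quick Start example.
conversation = Conversation(agent=agent, callbacks=[collect_scores])
conversation.send_message("Create a file called GREETING.txt with a friendly greeting message.")
conversation.run()

if scores:
    print(f"critic scores: min={min(scores):.3f} "
          f"mean={sum(scores) / len(scores):.3f} max={max(scores):.3f}")
```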
## Troubleshooting

### Critic Evaluations Not Appearing

- Verify the critic is properly configured and passed to the `Agent` (see the sanity check below)
- Ensure you're using the OpenHands LLM Provider (`llm-proxy.*.all-hands.dev`)
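A quick sanity check; this assumes the `Agent` exposes the critic it was constructed with as an attribute, which may change while the feature is experimental:

```python
# Confirm a critic was actually attached to the agent before debugging further.
# (Assumes `agent.critic` mirrors the constructor argument; hence the getattr guard.)
if getattr(agent, "critic", None) is None:
    print("No critic configured; critic scores will not appear.")
else:
    print(f"Critic configured: {type(agent.critic).__name__}")
```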
### API Authentication Errors

- Verify `LLM_API_KEY` is set correctly
- Check that the API key has not expired

## Next Steps

- **[Observability](/sdk/guides/observability)** - Monitor and log agent behavior
- **[Metrics](/sdk/guides/metrics)** - Collect performance metrics
- **[Stuck Detector](/sdk/guides/agent-stuck-detector)** - Detect unproductive agent patterns