Python: Add CuaAgentMiddleware for Computer-Use tool #1338
f-trycua wants to merge 16 commits into microsoft:main
@microsoft-github-policy-service agree company="Cua AI, Inc."
I've also been thinking about how to support .NET with this integration. Since Agent Framework already has built-in MCP support (see samples), we could create a Python MCP server that wraps Cua's
The flow would be:
Usage from C#:
// Connect to Cua MCP server
await using var mcpClient = await McpClient.CreateAsync(new StdioClientTransport(new()
{
    Command = "python",
    Arguments = ["-m", "cua.mcp.server"],
}));
var agent = chatClient.CreateAIAgent(
    instructions: "You are a desktop automation assistant.",
    tools: [.. (await mcpClient.ListToolsAsync()).Cast<AITool>()]
);
await agent.RunAsync("Open Firefox and search for 'Python tutorials'");
This approach would:
We have a pending PR for MCP server support on the Cua side (trycua/cua#427). Once that's merged, I can add C# samples and documentation in a follow-up PR or update this one. Thoughts?
Hey @ekzhu - I've addressed your feedback:
Happy to chat this week if helpful!
Hey @ekzhu - I've made some improvements to the API design:
Eliminated Dummy Variables - Created
Before:
# Had to use dummy client
dummy_client = OpenAIChatClient(model_id="gpt-4o-mini", api_key="dummy-not-used")
middleware = CuaAgentMiddleware(
    computer=computer,
    model="anthropic/claude-sonnet-4-5-20250929",
    instructions="You are an assistant.",
)
agent = ChatAgent(chat_client=dummy_client, middleware=[middleware])
After:
# Clean API with CuaChatClient
chat_client = CuaChatClient(
    model="anthropic/claude-sonnet-4-5-20250929",
    instructions="You are an assistant.",
)
middleware = CuaAgentMiddleware(computer=computer)
agent = ChatAgent(chat_client=chat_client, middleware=[middleware])
Standardized Examples - All samples now default to Linux on Docker (cross-platform), with macOS and Windows options shown as alternatives in comments.
Let me know if there's anything else you'd like me to address!
ekzhu
left a comment
I really like the new interface! It looks very good and polished. I had some issues running it locally with Docker though -- see my comments.
I am a bit concerned about the package name "cua" being overly broad here. I think it may make sense to rename it to "trycua" or something more specific. Right now, it feels like this is the official computer-use feature of the framework.
Another alternative is to move this package to a module inside agent-framework-lab; we can use the extra cua there.
# Create Cua chat client with model and instructions
chat_client = CuaChatClient(
    model="anthropic/claude-sonnet-4-5-20250929",
    instructions="You are a desktop automation assistant. Be precise and careful.",
For the examples, let's set the instructions through ChatAgent instead. This is to keep it consistent with the rest of the samples in the repo.
For the examples, let's set the instructions through ChatAgent instead. This is to keep it consistent with the rest of the samples in the repo.
Thanks for flagging this, @ekzhu! CuaAgentMiddleware intercepts the call and drives the run loop, so the chat client never gets a chance to apply its own system message—anything we put there gets ignored. If you need custom guidance, the simplest path is to include it in the prompt you send to agent.run(...); that text is preserved and reaches CUA exactly as written.
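To illustrate why client-level instructions are dropped in this setup, here is a minimal sketch using hypothetical stand-in classes. `Context`, `FakeChatClient`, and `TerminatingMiddleware` are invented for this example and are not Agent Framework or Cua APIs. It only shows the general pattern: a middleware that produces its own result and never awaits the next handler leaves the chat client, and any system message it would apply, unused.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional


@dataclass
class Context:
    """Hypothetical request context passed through the middleware chain."""
    messages: List[str]
    result: Optional[str] = None
    terminate: bool = False
    trace: List[str] = field(default_factory=list)


class FakeChatClient:
    """Stand-in chat client that would prepend its own instructions."""

    def __init__(self, instructions: str) -> None:
        self.instructions = instructions

    async def complete(self, ctx: Context) -> None:
        ctx.trace.append("chat_client")
        ctx.result = f"[{self.instructions}] " + " ".join(ctx.messages)


class TerminatingMiddleware:
    """Stand-in for a middleware that drives its own run loop."""

    async def process(
        self, ctx: Context, call_next: Callable[[Context], Awaitable[None]]
    ) -> None:
        ctx.trace.append("middleware")
        # The delegate (Cua, in the PR) only ever sees the user prompt.
        ctx.result = "delegate handled: " + " ".join(ctx.messages)
        ctx.terminate = True  # call_next is deliberately never awaited


async def run(prompt: str) -> Context:
    ctx = Context(messages=[prompt])
    client = FakeChatClient(instructions="ignored system message")
    await TerminatingMiddleware().process(ctx, client.complete)
    return ctx


# Guidance placed in the prompt itself is what reaches the delegate.
ctx = asyncio.run(run("Open Firefox. Be precise and careful."))
print(ctx.trace)
print(ctx.result)
```

In this sketch the trace contains only the middleware entry, and the client's instructions never appear in the result, which mirrors the behavior described in the reply above.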
async def main():
    """Run a basic computer use example with Claude."""
    # Initialize Cua computer (Linux Docker container)
    async with Computer(os_type="linux", provider_type="docker") as computer:
I had trouble running this example and it failed right here. I am on WSL Ubuntu and running Docker Desktop.
Traceback (most recent call last):
File "***/agent-framework/python/.venv/lib/python3.13/site-packages/computer/computer.py", line 493, in run
await self._interface.wait_for_ready(timeout=30)
File "***/agent-framework/python/.venv/lib/python3.13/site-packages/computer/interface/generic.py", line 817, in wait_for_ready
raise e
File "***/agent-framework/python/.venv/lib/python3.13/site-packages/computer/interface/generic.py", line 813, in wait_for_ready
await self._wait_for_ready_ws(timeout, interval)
File "***/agent-framework/python/.venv/lib/python3.13/site-packages/computer/interface/generic.py", line 938, in _wait_for_ready_ws
raise TimeoutError(error_msg)
TimeoutError: Could not connect to localhost after 30 seconds
...
TimeoutError: Could not connect to WebSocket interface at localhost:8000/ws: Could not connect to localhost after 30 seconds
I have already pulled the image, and I tried this even after I manually started the container from Docker Desktop.
Hi @ekzhu, I'm Adam from the Cua team. I reproduced the sample failure. The culprit is the Docker image name: the code uses trycua/cua-ubuntu:latest, but there’s no Linux/AMD64 manifest for that tag, so the container never starts and the WebSocket wait times out. Pulling trycua/cua-xfce:latest (which is published for AMD64) and tagging it locally as trycua/cua-ubuntu:latest fixes the run.
- Pull the AMD64 image Cua documents for Docker:
  docker pull --platform=linux/amd64 trycua/cua-xfce:latest
- Create a local tag so the provider can find it:
  docker tag trycua/cua-xfce:latest trycua/cua-ubuntu:latest
We’ll update the sample to point at the XFCE image so others don’t hit this.
On the Agent Framework side we’ll also land a tiny fix so CuaChatClient imports and applies @use_chat_middleware; that keeps the middleware hook active even when Cua handles the run loop.
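For anyone scripting the image workaround from this comment, a small sketch follows. The helper name `workaround_commands` is hypothetical; the image names and the two Docker commands are the ones quoted above. The script only prints the commands, so nothing is pulled or tagged unless you choose to run them.

```python
import shlex
from typing import List

SOURCE_IMAGE = "trycua/cua-xfce:latest"   # AMD64 image named in the comment above
TARGET_TAG = "trycua/cua-ubuntu:latest"   # tag the sample was looking for


def workaround_commands(
    source: str = SOURCE_IMAGE, target: str = TARGET_TAG
) -> List[List[str]]:
    """Return the docker argv lists for the pull-and-retag workaround."""
    return [
        ["docker", "pull", "--platform=linux/amd64", source],
        ["docker", "tag", source, target],
    ]


for argv in workaround_commands():
    # Printed only; to apply, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping the commands as argv lists avoids shell quoting issues if they are later passed to `subprocess.run`.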
Also, there is a merge conflict. It looks like uv.lock needs to be regenerated and the pyproject.toml file needs to be updated; just accept both changes.
Thanks so much for the feedback and the notes @ekzhu - super helpful. On the naming:
That said, we definitely don’t want it to appear like an “official” Microsoft Agent SDK package. If avoiding confusion is the main concern, a clean alternative for us could be cua-ai (or cua_ai), which still preserves the project identity while making the separation explicit. Happy to make that change if it aligns better with the project’s conventions. Let me know which direction you’d prefer - we’re flexible as long as the identity remains clear. |
9c14b49 to 80bd9cd
80bd9cd to 7c2bcee
Hi @markwallace-microsoft, those .NET, workflows, and lab labels were added while I briefly pulled in the wrong files. The PR is back to Python-only now, so could you remove those tags when you get a chance? Thanks!
# Create middleware
middleware = CuaAgentMiddleware(computer=computer)

# Create agent - no dummy variables needed!
Question: what does it mean by "no dummy variables needed"?
Hi @TaoChenOSU, thanks for spotting that. The “no dummy variables needed” comment is leftover from an earlier draft and will be removed.
Pin all CUA Docker samples to trycua/cua-xfce:latest for Windows/x64 support
Drop Anthropic instructions field so Claude requests keep working
add Windows, macOS, and Linux quickstarts under samples/getting_started/cua/setup/
refresh the CUA README to link to the new guides and modernize prerequisites
2c5566a to 5c43237
Hi @TaoChenOSU @ekzhu - what're the pending items left on this PR?
eavanvalkenburg
left a comment
Sorry it took a while to take another good look at this, but I have some structural issues with it. The most important is that this setup with the specific CuaChatClient doesn't work for me. The problem is that it breaks the expectation that users can interchange chat clients without effort, and that includes features like other chat middleware, local models, etc. None of that is possible with this design.
So I think what we should do is: 1) design a computer use content type (that can be used by all major computer-use-capable APIs, like Anthropic, OpenAI, and Google) and 2) use that with the CuaMiddleware here, where the middleware does nothing more than look at the response: if it is computer use content, execute it and ask for another completion; if it isn't, pass the response back up the stack and be done. That way the chat clients in AF will still be used, and we can inject the CUA Middleware without losing other functionality. Let me know if that makes sense, @f-trycua.
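The loop proposed in this review comment can be sketched roughly as follows. All names here (`TextContent`, `ComputerUseContent`, `computer_use_middleware`, the scripted client and executor) are hypothetical and only illustrate the control flow; they are not the content types or middleware signatures that Agent Framework ended up designing.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Union


@dataclass
class TextContent:
    text: str


@dataclass
class ComputerUseContent:
    """Hypothetical provider-agnostic computer-use request."""
    action: str          # e.g. "click", "type", "screenshot"
    argument: str = ""


Content = Union[TextContent, ComputerUseContent]
ChatClient = Callable[[List[Content]], Awaitable[Content]]
Executor = Callable[[ComputerUseContent], Awaitable[str]]


async def computer_use_middleware(
    messages: List[Content],
    call_next: ChatClient,
    execute: Executor,
    max_steps: int = 10,
) -> Content:
    """Call the real chat client, run any computer-use request, then ask again."""
    history = list(messages)
    for _ in range(max_steps):
        response = await call_next(history)
        if not isinstance(response, ComputerUseContent):
            return response  # not computer use: pass it back up the stack
        observation = await execute(response)
        history += [response, TextContent(f"result: {observation}")]
    raise RuntimeError("computer-use loop exceeded max_steps")


async def scripted_client(history: List[Content]) -> Content:
    """Fake chat client: requests one click, then answers in text."""
    if any(isinstance(item, ComputerUseContent) for item in history):
        return TextContent("Done: Firefox is open.")
    return ComputerUseContent(action="click", argument="Firefox icon")


async def fake_executor(request: ComputerUseContent) -> str:
    """Fake computer: pretends every action succeeds."""
    return f"{request.action} on {request.argument} ok"


final = asyncio.run(
    computer_use_middleware([TextContent("Open Firefox")], scripted_client, fake_executor)
)
print(final.text)
```

Because the middleware only wraps `call_next`, any chat client (or other chat middleware) can sit behind it, which is the interchangeability the review asks for. The `max_steps` cap is an added assumption to keep the sketch from looping forever.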
Hi @eavanvalkenburg, Thank you for the detailed feedback! I'd like to propose a refactor that addresses all your concerns. Here's how the new design would address each point:
Proposed Solution:
Proposed Change:
Current: Middleware never calls
Proposed: Middleware calls
The middleware would follow the standard pattern you described: look at response → execute if computer use content → ask for another completion → pass through if not.
I've created a design document with detailed before/after comparison and implementation proposal. Does this approach address your concerns? Happy to discuss any aspects before implementing!
@YeIIcw that is what I envisioned initially as well, so good to see that! I am working on those content types already, as we want to ensure parity with the dotnet version and be provider agnostic, so I'll let you know when we have the design ready.
I laid out the design here: #1108. We will need to finalize it, but that should give you some idea of how we are thinking about this. Please have a look and provide feedback on what else we should have in those tools and types.
Closing, because this hasn't moved since December.
Motivation and Context
This PR implements the integration between Microsoft Agent Framework and Cua as discussed in issue #1095.
Why is this needed?
Implementation approach:
Following @eavanvalkenburg's guidance in #1095, this uses the ChatMiddleware pattern rather than implementing Cua as a Tool. This delegates the entire agent loop to Cua while maintaining Agent Framework's orchestration and human-in-the-loop capabilities.
Why wrap ComputerAgent instead of just Computer?
- ComputerAgent provides the complete agent loop (model inference → parsing → computer actions → multi-step execution) with support for 100+ model configurations
- Computer is just the low-level tool for executing actions (click, type, screenshot, etc.)
- By wrapping ComputerAgent, we get all of Cua's model support for free without reimplementing provider-agnostic parsers for OpenCUA, InternVL, UI-Tars, GLM, etc.
Related issue: #1095
Description
This PR adds agent-framework-cua, a new integration package that provides CuaAgentMiddleware.
Key components:
- CuaAgentMiddleware - Middleware that intercepts chat requests and delegates to Cua's ComputerAgent
  - context.terminate = True
  - ComputerAgent (supports 100+ models)
  - require_approval, approval_interval)
  - ChatResponse format
- Type definitions - CuaModelId, CuaProviderType, CuaOSType, etc. for type safety
- Examples:
  - basic_example.py - Claude Sonnet 4.5 with Linux Docker
  - composite_agent_example.py - UI-Tars + GPT-4o composite agent
- Package structure - Follows existing integration patterns (agent-framework-redis, agent-framework-mem0)
Architecture:
The chat client becomes a no-op since CuaAgentMiddleware terminates middleware execution and returns the response directly from Cua.
Technical notes:
- cua-agent dependency)
- chat_client since middleware terminates execution before reaching it
- ChatMessage.content → ChatMessage.text/contents attribute usage in middleware
Contribution Checklist