Python: Add CuaAgentMiddleware for Computer-Use tool #1338

f-trycua · 2025-10-09T07:07:34Z

Motivation and Context

This PR implements the integration between Microsoft Agent Framework and Cua as discussed in issue #1095.

Why is this needed?

- Provides Agent Framework with 100+ model configurations (OpenAI, Anthropic, OpenCUA, InternVL, UI-Tars, GLM, etc.) without duplicating model-specific parsers
- Enables desktop automation capabilities across Windows, macOS, and Linux through Cua's virtualization infrastructure
- Supports composite agents (e.g., "UI-Tars+GPT-4o") combining grounding and planning models
- Leverages Cua's existing computer-use infrastructure instead of reimplementing it

Implementation approach:
Following @eavanvalkenburg's guidance in #1095, this uses the ChatMiddleware pattern rather than implementing Cua as a Tool. This delegates the entire agent loop to Cua while maintaining Agent Framework's orchestration and human-in-the-loop capabilities.

Why wrap ComputerAgent instead of just Computer?

- ComputerAgent provides the complete agent loop (model inference → parsing → computer actions → multi-step execution) with support for 100+ model configurations
- Computer is just the low-level tool for executing actions (click, type, screenshot, etc.)
- By wrapping ComputerAgent, we get all of Cua's model support for free without reimplementing provider-agnostic parsers for OpenCUA, InternVL, UI-Tars, GLM, etc.
- This architectural choice means Agent Framework benefits from Cua's ongoing model additions automatically
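
To make the split between the two layers concrete, here is a toy sketch of the loop described above. It does not use Cua's real classes: ToyComputer, ToyAgent, the scripted model outputs, and the "click 10 20" text format are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyComputer:
    """Hypothetical low-level executor: it only performs single actions."""
    log: list = field(default_factory=list)

    def execute(self, action: dict) -> None:
        self.log.append(action)


class ToyAgent:
    """Hypothetical agent loop: inference -> parsing -> action -> repeat."""

    def __init__(self, computer: ToyComputer, scripted_outputs: list[str], max_steps: int = 10):
        self.computer = computer
        # Scripted strings stand in for real model inference.
        self.scripted_outputs = scripted_outputs
        self.max_steps = max_steps

    @staticmethod
    def parse(raw: str) -> dict:
        # Made-up format: "<action> <arg> <arg> ...".
        name, *args = raw.split()
        return {"type": name, "args": args}

    def run(self, task: str) -> list:
        for raw in self.scripted_outputs[: self.max_steps]:
            if raw == "done":  # the "model" signals that the task is finished
                break
            self.computer.execute(self.parse(raw))
        return self.computer.log
```

In this toy version the parser is the model-specific part; wrapping only the executor would leave that parsing, and the stopping logic, to the caller.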

Related issue: #1095

Description

This PR adds agent-framework-cua, a new integration package that provides CuaAgentMiddleware.

Key components:

CuaAgentMiddleware - Middleware that intercepts chat requests and delegates to Cua's ComputerAgent
- Completely bypasses the Agent Framework chat client by setting context.terminate = True
- All model inference is handled by Cua's ComputerAgent (supports 100+ models)
- Handles message format conversion between Agent Framework and Cua
- Supports human-in-the-loop approval workflows (require_approval, approval_interval)
- Transforms Cua results back to Agent Framework ChatResponse format
Type definitions - CuaModelId, CuaProviderType, CuaOSType, etc. for type safety
Examples:
- basic_example.py - Claude Sonnet 4.5 with Linux Docker
- composite_agent_example.py - UI-Tars + GPT-4o composite agent
Package structure - Follows existing integration patterns (agent-framework-redis, agent-framework-mem0)
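
As a rough illustration of how the two approval options could interact, the sketch below gates a sequence of steps behind a callback. The semantics are an assumption made for this example (approval is requested before step 0 and then every approval_interval steps); run_with_approval and the approve callback are hypothetical names, not the middleware's actual API.

```python
from typing import Callable, Sequence


def run_with_approval(
    steps: Sequence[str],
    approve: Callable[[int, str], bool],
    require_approval: bool = True,
    approval_interval: int = 5,
) -> list[str]:
    """Execute steps in order, pausing for approval at a fixed interval.

    Assumed semantics: when require_approval is set, the callback is asked
    before steps 0, N, 2N, ... and a rejection stops the run.
    """
    if approval_interval < 1:
        raise ValueError("approval_interval must be at least 1")

    executed: list[str] = []
    for index, step in enumerate(steps):
        needs_check = require_approval and index % approval_interval == 0
        if needs_check and not approve(index, step):
            break  # a human declined; stop before running this step
        executed.append(step)
    return executed
```

A caller might pass a callback that prompts on the console, or one that posts the pending step to a review queue and waits for a decision.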

Architecture:

Agent Framework → CuaAgentMiddleware → Cua ComputerAgent
                      ↓                      ↓
                 terminate=True    Model + Computer Loop
                                           ↓
                                       Results
                                           ↓
Agent Framework ← CuaAgentMiddleware ← Cua ComputerAgent

The chat client becomes a no-op since CuaAgentMiddleware terminates middleware execution and returns the response directly from Cua.
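
The short-circuit can be sketched without the real framework. Everything below is a stand-in: StubChatContext, StubChatClient, fake_cua_run, and run_pipeline are hypothetical, and the way the pipeline honors the terminate flag is an assumption for illustration rather than Agent Framework's actual implementation.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class StubChatContext:
    """Hypothetical stand-in for the chat middleware context."""
    messages: list[str]
    result: Optional[str] = None
    terminate: bool = False


class StubChatClient:
    """Dummy chat client that counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    async def get_response(self, context: StubChatContext) -> None:
        self.calls += 1
        context.result = "response from chat client"


async def fake_cua_run(messages: list[str]) -> str:
    # Placeholder for delegating the whole agent loop to Cua.
    return f"cua handled: {messages[-1]}"


NextHandler = Callable[[StubChatContext], Awaitable[None]]


async def cua_style_middleware(context: StubChatContext, next_handler: NextHandler) -> None:
    # Produce the result here and mark the context as terminated;
    # next_handler is deliberately never called.
    context.result = await fake_cua_run(context.messages)
    context.terminate = True


async def run_pipeline(context: StubChatContext, middleware, client: StubChatClient) -> Optional[str]:
    async def next_handler(ctx: StubChatContext) -> None:
        if not ctx.terminate:
            await client.get_response(ctx)

    await middleware(context, next_handler)
    return context.result
```

Running the pipeline with this middleware returns the Cua-side result while the dummy client's call counter stays at zero, which is the no-op behavior described above.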

Technical notes:

- Requires Python ≥3.12 (due to the cua-agent dependency)
- Uses a dummy chat_client, since the middleware terminates execution before reaching it
- Fixed attribute usage in the middleware: ChatMessage.content is replaced with ChatMessage.text / ChatMessage.contents
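
The message conversion mentioned in these notes could look roughly like the sketch below. The stub classes only mimic the text/contents attributes named above, and the role/content dict shape on the Cua side is an assumption; to_cua_messages and from_cua_output are hypothetical helper names.

```python
from dataclasses import dataclass, field


@dataclass
class StubTextContent:
    """Hypothetical stand-in for a text content item."""
    text: str


@dataclass
class StubChatMessage:
    """Hypothetical stand-in exposing .contents and a derived .text."""
    role: str
    contents: list = field(default_factory=list)

    @property
    def text(self) -> str:
        # Stub behavior: join the text items with a space; non-text items are skipped.
        return " ".join(c.text for c in self.contents if isinstance(c, StubTextContent))


def to_cua_messages(messages: list[StubChatMessage]) -> list[dict]:
    """Flatten messages into role/content dicts (assumed Cua input shape)."""
    return [{"role": m.role, "content": m.text} for m in messages if m.text]


def from_cua_output(text: str) -> StubChatMessage:
    """Wrap the agent's final text as an assistant message."""
    return StubChatMessage(role="assistant", contents=[StubTextContent(text)])
```

Reading .text rather than a single .content attribute keeps the conversion working when a message carries several content items.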

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the Contribution Guidelines
All unit tests pass, and I have added new tests where possible
Is this a breaking change? No

f-trycua · 2025-10-09T21:14:11Z

@microsoft-github-policy-service agree company="Cua AI, Inc."

f-trycua · 2025-10-10T18:29:12Z

I've also been thinking about how to support .NET with this integration. Since Agent Framework already has built-in MCP support (see samples), we could create a Python MCP server that wraps Cua's ComputerAgent.

The flow would be:

.NET Agent → MCP Client → stdio → Python MCP Server → Cua ComputerAgent (100+ models)

Usage from C#:

// Connect to Cua MCP server
await using var mcpClient = await McpClient.CreateAsync(new StdioClientTransport(new()
{
    Command = "python",
    Arguments = ["-m", "cua.mcp.server"],
}));

var agent = chatClient.CreateAIAgent(
    instructions: "You are a desktop automation assistant.",
    tools: [.. (await mcpClient.ListToolsAsync()).Cast<AITool>()]
);

await agent.RunAsync("Open Firefox and search for 'Python tutorials'");

This approach would:

✅ Reuse existing MCP infrastructure (no new .NET bindings needed)
✅ Give .NET agents access to all 100+ Cua models
✅ Work cross-language via the MCP protocol

We have a pending PR for MCP server support on the Cua side (trycua/cua#427). Once that's merged, I can add C# samples and documentation in a follow-up PR or update this one. Thoughts?
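
For discussion, the Python side of that flow might handle requests roughly as sketched below. This is not the server from trycua/cua#427: it uses only the standard library, skips the MCP initialize handshake, and the tool name run_computer_task and the run_task_stub helper are hypothetical. Only the JSON-RPC envelope and the tools/list and tools/call method names follow the MCP protocol.

```python
import json

# Hypothetical tool description exposed to MCP clients.
TOOLS = [
    {
        "name": "run_computer_task",
        "description": "Run a desktop automation task (stubbed here).",
        "inputSchema": {
            "type": "object",
            "properties": {"task": {"type": "string"}},
            "required": ["task"],
        },
    }
]


def run_task_stub(task: str) -> str:
    # Placeholder for handing the task to Cua's ComputerAgent.
    return f"(stub) would run: {task}"


def _error(request_id, code: int, message: str) -> dict:
    return {"jsonrpc": "2.0", "id": request_id, "error": {"code": code, "message": message}}


def handle_request(request: dict) -> dict:
    """Answer one JSON-RPC request for the two tool methods."""
    request_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        params = request.get("params", {})
        if params.get("name") != "run_computer_task":
            return _error(request_id, -32602, "Unknown tool")
        text = run_task_stub(params.get("arguments", {}).get("task", ""))
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        return _error(request_id, -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


def handle_line(line: str) -> str:
    """Decode one newline-delimited message and encode the reply."""
    return json.dumps(handle_request(json.loads(line)))
```

A stdio server would read lines from standard input, pass each one to handle_line, and write the reply followed by a newline; a production server would more likely build on an existing MCP SDK than hand-roll this.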

Add CuaAgentMiddleware for Cua integration

3ce4a3c

markwallace-microsoft added the documentation and python labels Oct 9, 2025

f-trycua mentioned this pull request Oct 9, 2025

Single Agent: Computer Use Integration #1095

Open

markwallace-microsoft requested review from ekzhu and victordibia October 9, 2025 08:45

Merge branch 'main' into feature/cua-integration

f219f11

f-trycua marked this pull request as ready for review October 9, 2025 21:14
