@f-trycua f-trycua commented Oct 9, 2025

Motivation and Context

This PR implements the integration between Microsoft Agent Framework and Cua as discussed in issue #1095.

Why is this needed?

  • Provides Agent Framework with 100+ model configurations (OpenAI, Anthropic, OpenCUA, InternVL, UI-Tars, GLM, etc.) without duplicating model-specific parsers
  • Enables desktop automation capabilities across Windows, macOS, and Linux through Cua's virtualization infrastructure
  • Supports composite agents (e.g., "UI-Tars+GPT-4o") combining grounding and planning models
  • Leverages Cua's existing computer-use infrastructure instead of reimplementing it
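To illustrate the composite naming, here is a minimal sketch of splitting an identifier such as "UI-Tars+GPT-4o" into its two parts. `split_composite_model` and `ModelSpec` are hypothetical names used only for illustration, not part of Cua or Agent Framework, and the grounding-first order is an assumption taken from the order of the words in the bullet above.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    """Hypothetical container for the two roles in a composite agent."""
    grounding: Optional[str]  # None when a single model handles both roles
    planning: str


def split_composite_model(model_id: str) -> ModelSpec:
    """Split "grounding+planning" into its parts (assumed order)."""
    parts = [part.strip() for part in model_id.split("+")]
    if len(parts) == 1 and parts[0]:
        # A plain model id: one model does both grounding and planning.
        return ModelSpec(grounding=None, planning=parts[0])
    if len(parts) == 2 and all(parts):
        return ModelSpec(grounding=parts[0], planning=parts[1])
    raise ValueError(f"Unsupported model id: {model_id!r}")
```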

Implementation approach:
Following @eavanvalkenburg's guidance in #1095, this uses the ChatMiddleware pattern rather than implementing Cua as a Tool. This delegates the entire agent loop to Cua while maintaining Agent Framework's orchestration and human-in-the-loop capabilities.

Why wrap ComputerAgent instead of just Computer?

  • ComputerAgent provides the complete agent loop (model inference → parsing → computer actions → multi-step execution) with support for 100+ model configurations
  • Computer is just the low-level tool for executing actions (click, type, screenshot, etc.)
  • By wrapping ComputerAgent, we get all of Cua's model support for free without reimplementing provider-agnostic parsers for OpenCUA, InternVL, UI-Tars, GLM, etc.
  • This architectural choice means Agent Framework benefits from Cua's ongoing model additions automatically
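The split between the two layers can be sketched with stand-ins. This is a toy illustration only: `FakeComputer`, `run_agent_loop`, and `scripted_model` are hypothetical names, not Cua's API, and the "model" is a scripted function rather than a real inference call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakeComputer:
    """Stand-in for the low-level tool: it only executes actions."""
    log: list[str] = field(default_factory=list)

    def execute(self, action: str) -> str:
        self.log.append(action)
        return f"done: {action}"


# A "model" maps the history so far to the next action, or None when finished.
Model = Callable[[list[str]], Optional[str]]


def run_agent_loop(model: Model, computer: FakeComputer, task: str,
                   max_steps: int = 10) -> list[str]:
    """Stand-in for the agent layer: inference, then action, repeated."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(history)                    # inference + parsing
        if action is None:                         # model signals completion
            break
        history.append(computer.execute(action))   # computer action
    return history


def scripted_model(history: list[str]) -> Optional[str]:
    """Replays a fixed script in place of real model inference."""
    script = ["screenshot", "click(10, 20)", "type('hello')"]
    step = len(history) - 1  # number of actions already taken
    return script[step] if step < len(script) else None
```

Wrapping only the computer would leave the caller to write `run_agent_loop` and every model-specific parser; wrapping the agent layer hands over the whole loop.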

Related issue: #1095

Description

This PR adds agent-framework-cua, a new integration package that provides CuaAgentMiddleware.

Key components:

  1. CuaAgentMiddleware - Middleware that intercepts chat requests and delegates to Cua's ComputerAgent

    • Completely bypasses the Agent Framework chat client by setting context.terminate = True
    • All model inference is handled by Cua's ComputerAgent (supports 100+ models)
    • Handles message format conversion between Agent Framework and Cua
    • Supports human-in-the-loop approval workflows (require_approval, approval_interval)
    • Transforms Cua results back to Agent Framework ChatResponse format
  2. Type definitions - CuaModelId, CuaProviderType, CuaOSType, etc. for type safety

  3. Examples:

    • basic_example.py - Claude Sonnet 4.5 with Linux Docker
    • composite_agent_example.py - UI-Tars + GPT-4o composite agent
  4. Package structure - Follows existing integration patterns (agent-framework-redis, agent-framework-mem0)
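As a rough illustration of the approval options listed under item 1, here is a minimal sketch of gating actions on human approval. The semantics are an assumption: it treats `approval_interval` as "ask before every N-th action" and stops on the first rejection. `needs_approval` and `run_with_approval` are hypothetical helpers, not the middleware's actual code.

```python
from typing import Callable


def needs_approval(step: int, require_approval: bool,
                   approval_interval: int) -> bool:
    """Decide whether to pause before the action at 0-based index `step`."""
    if not require_approval:
        return False
    if approval_interval < 1:
        raise ValueError("approval_interval must be >= 1")
    # Assumed semantics: ask before step 0, N, 2N, ...
    return step % approval_interval == 0


def run_with_approval(actions: list[str], approve: Callable[[str], bool],
                      require_approval: bool = True,
                      approval_interval: int = 1) -> list[str]:
    """Execute actions in order, stopping at the first rejected checkpoint."""
    executed = []
    for step, action in enumerate(actions):
        if needs_approval(step, require_approval, approval_interval):
            if not approve(action):
                break  # the human rejected this action; stop the run
        executed.append(action)  # placeholder for actually running the action
    return executed
```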

Architecture:

Agent Framework → CuaAgentMiddleware → Cua ComputerAgent
                      ↓                      ↓
                 terminate=True    Model + Computer Loop
                                           ↓
                                       Results
                                           ↓
Agent Framework ← CuaAgentMiddleware ← Cua ComputerAgent

The chat client becomes a no-op since CuaAgentMiddleware terminates middleware execution and returns the response directly from Cua.
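The termination behaviour can be sketched with plain asyncio stand-ins. Nothing here is Agent Framework's real API: `Context`, `run_pipeline`, and `delegating_middleware` are hypothetical names, and the string result stands in for a ChatResponse built from Cua's output.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Context:
    """Stand-in for the chat context passed through the middleware chain."""
    messages: list[str]
    result: Optional[str] = None
    terminate: bool = False


Next = Callable[[Context], Awaitable[None]]
Middleware = Callable[[Context, Next], Awaitable[None]]

client_calls: list[str] = []  # records whether the chat client ever ran


async def dummy_chat_client(context: Context) -> None:
    client_calls.append("called")
    context.result = "response from the chat client"


async def delegating_middleware(context: Context, call_next: Next) -> None:
    # Stand-in for handing the whole request to Cua's ComputerAgent.
    context.result = f"Cua handled: {context.messages[-1]}"
    context.terminate = True  # tell the pipeline to stop here
    await call_next(context)  # a no-op once terminate is set


async def run_pipeline(context: Context, middlewares: list[Middleware],
                       chat_client: Next) -> Context:
    async def step(index: int) -> None:
        if context.terminate:
            return  # skip remaining middleware and the chat client
        if index == len(middlewares):
            await chat_client(context)
            return

        async def call_next(ctx: Context) -> None:
            await step(index + 1)

        await middlewares[index](context, call_next)

    await step(0)
    return context
```

Running `asyncio.run(run_pipeline(Context(["open the browser"]), [delegating_middleware], dummy_chat_client))` leaves `client_calls` empty, which is why a dummy client is sufficient in this setup.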

Technical notes:

  • Requires Python ≥3.12 (due to the cua-agent dependency)
  • Uses a dummy chat_client, since the middleware terminates execution before reaching it
  • Fixed attribute usage in the middleware: ChatMessage.content → ChatMessage.text/contents
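The attribute fix in the last note can be illustrated with stand-in types. This is a sketch under assumptions: the `ChatMessage` and `TextContent` classes below are simplified stand-ins (not imported from Agent Framework), the way `text` joins its parts is assumed, and the role/content dict is an assumed target shape rather than Cua's confirmed message format.

```python
from dataclasses import dataclass, field


@dataclass
class TextContent:
    """Stand-in for a text content item."""
    text: str


@dataclass
class ChatMessage:
    """Stand-in message: it exposes `contents` and `text`, but no `content`."""
    role: str
    contents: list = field(default_factory=list)

    @property
    def text(self) -> str:
        # Assumed behaviour: concatenate the text parts, ignore other items.
        return "".join(c.text for c in self.contents
                       if isinstance(c, TextContent))


def to_cua_messages(messages: list[ChatMessage]) -> list[dict]:
    """Convert messages to plain dicts (hypothetical helper)."""
    # Read `.text`; `.content` does not exist on these messages.
    return [{"role": m.role, "content": m.text} for m in messages]
```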

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? No

f-trycua commented Oct 9, 2025

@microsoft-github-policy-service agree company="Cua AI, Inc."

@f-trycua f-trycua marked this pull request as ready for review October 9, 2025 21:14
f-trycua commented

I've also been thinking about how to support .NET with this integration. Since Agent Framework already has built-in MCP support (see samples), we could create a Python MCP server that wraps Cua's ComputerAgent.

The flow would be:

.NET Agent → MCP Client → stdio → Python MCP Server → Cua ComputerAgent (100+ models)
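On the Python side, the request handling could look roughly like the sketch below. It is stdlib-only and deliberately incomplete: it handles just the `tools/list` and `tools/call` JSON-RPC methods, omits the MCP initialize handshake, and stubs out the agent call. The tool name `run_task` is hypothetical; a real server would more likely be built on an MCP SDK and would not be this code.

```python
import json


def run_task(task: str) -> str:
    """Hypothetical stub for delegating a task to Cua's ComputerAgent."""
    return f"(stub) would run: {task}"


TOOLS = [{
    "name": "run_task",  # hypothetical tool name
    "description": "Run a desktop automation task.",
    "inputSchema": {
        "type": "object",
        "properties": {"task": {"type": "string"}},
        "required": ["task"],
    },
}]


def _error(request_id, code: int, message: str) -> str:
    return json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "error": {"code": code, "message": message}})


def handle_request(line: str) -> str:
    """Handle one newline-delimited JSON-RPC request and return the reply."""
    request = json.loads(line)
    request_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        params = request.get("params", {})
        if params.get("name") != "run_task":
            return _error(request_id, -32602, "Unknown tool")
        text = run_task(params.get("arguments", {}).get("task", ""))
        result = {"content": [{"type": "text", "text": text}],
                  "isError": False}
    else:
        return _error(request_id, -32601, "Method not found")
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "result": result})
```

A stdio server would read one line at a time from stdin, pass it to `handle_request`, and write the reply to stdout.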

Usage from C#:

// Connect to Cua MCP server
await using var mcpClient = await McpClient.CreateAsync(new StdioClientTransport(new()
{
    Command = "python",
    Arguments = ["-m", "cua.mcp.server"],
}));

var agent = chatClient.CreateAIAgent(
    instructions: "You are a desktop automation assistant.",
    tools: [.. (await mcpClient.ListToolsAsync()).Cast<AITool>()]
);

await agent.RunAsync("Open Firefox and search for 'Python tutorials'");

This approach would:

  • ✅ Reuse existing MCP infrastructure (no new .NET bindings needed)
  • ✅ Give .NET agents access to all 100+ Cua models
  • ✅ Work cross-language via MCP

We have a pending PR for MCP server support on the Cua side (trycua/cua#427). Once that's merged, I can add C# samples and documentation in a follow-up PR or update this one. Thoughts?

Labels: documentation, python
