
Single Agent: Computer Use Integration #1095

@f-trycua

Computer Use Integration

Hi @markwallace-microsoft, @giles17 and @peterychang! I'd like to help with the Computer Use implementation mentioned in issue #377.

I maintain Cua, which handles cross-platform desktop automation in local virtual environments (Windows Sandbox, macOS VMs via Apple's Virtualization.framework, Linux via Docker). It supports 100+ VLM configurations (OpenAI Computer Use Preview, Anthropic Claude, OpenCUA, InternVL, GLM, UI-TARS, local Ollama) and provides primitives for mouse/keyboard control, screenshots, and shell commands.

(Demo video: windows-sandbox.mp4)
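
For reference, creating a Cua session looks roughly like this (a sketch only; the constructor arguments are assumptions echoing the tool example below, not Cua's exact signature):

import asyncio
from computer import Computer  # Cua's Python package

async def main():
    # Hypothetical arguments: provider/name mirror the proposed
    # ComputerUseTool below; check Cua's docs for the real signature.
    async with Computer(provider="windows", name="windows-sandbox") as computer:
        screenshot = await computer.screenshot()

asyncio.run(main())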

Proposed Integration Approach:

I see Computer Use could fit into the Agent Framework in a few ways (an end-to-end usage sketch follows the list):

  1. As a Tool (like MCP tools in Python; see #595, "Add MCP tool support to Foundry chat client"):

    from agent_framework import ChatAgent
    from agent_framework.openai import OpenAIChatClient
    from agent_framework.tools import ComputerUseTool
    
    computer = ComputerUseTool(provider="windows", name="windows-sandbox")
    agent = ChatAgent(
        chat_client=OpenAIChatClient(),
        tools=[computer],
        instructions="You are a desktop automation assistant."
    )
  2. As a Chat Client extension (following the web search pattern in Python; see #395, "Web search file search tools"):

    from agent_framework.openai import OpenAIChatClient
    
    client = OpenAIChatClient(tools=["computer_use"])
    agent = ChatAgent(chat_client=client)
  3. In the lab package (for experimentation before core integration):

    from agent_framework.lab.cua import CuaComputerTool
    from agent_framework import ChatAgent
    
    computer = CuaComputerTool()
    agent = ChatAgent(tools=[computer], ...)
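
Whichever shape lands, calling an agent with the tool attached would look roughly the same. A sketch assuming option 1's proposed ComputerUseTool and the framework's async run API:

import asyncio
from agent_framework import ChatAgent
from agent_framework.openai import OpenAIChatClient
from agent_framework.tools import ComputerUseTool  # proposed above, not yet real

async def main():
    # The model plans each step; the tool executes clicks and keystrokes
    # inside the sandboxed VM and returns screenshots as observations.
    computer = ComputerUseTool(provider="windows", name="windows-sandbox")
    agent = ChatAgent(
        chat_client=OpenAIChatClient(),
        tools=[computer],
        instructions="You are a desktop automation assistant.",
    )
    result = await agent.run("Open Notepad and type 'hello world'.")
    print(result)

asyncio.run(main())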

Technical Details:

The Cua Computer class provides methods that map well to the Agent Framework tool pattern:

# Screenshot capture
screenshot = await computer.screenshot()

# Mouse actions
await computer.left_click(x, y)
await computer.double_click(x, y)
await computer.move_cursor(x, y)

# Keyboard actions
await computer.type_text("Hello")
await computer.press_key("enter")
await computer.hotkey("command", "c")

# Shell commands
result = await computer.run_command("ls -la")

The API design should align well: Cua is async-first and uses a tool-based architecture whose methods could be wrapped as AIFunctions.
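
As a rough illustration of that wrapping (a sketch only: the ai_function decorator name and the schema-from-signature behavior are assumptions about the framework, not confirmed API):

from agent_framework import ai_function  # decorator name is an assumption

def make_click_tool(computer):
    # Wraps one Cua primitive as a framework tool; the framework is assumed
    # to derive the tool schema from the signature and docstring.
    @ai_function
    async def left_click(x: int, y: int) -> str:
        """Left-click at the given screen coordinates (pixels)."""
        await computer.left_click(x, y)
        return f"clicked ({x}, {y})"

    return left_click

Each Cua primitive (type_text, press_key, run_command, ...) would get the same treatment, and the resulting functions go straight into ChatAgent(tools=[...]).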

Questions:

  1. Which integration pattern would align best with the existing built-in tools architecture?
  2. Should this follow the OpenAI Responses API pattern, or be generic enough to support multiple providers? (A generic-interface sketch follows these questions.)
  3. Would you prefer I start with:
    • A proof-of-concept PR showing the integration
    • An ADR (Architectural Decision Record) first
    • A lab package for experimentation
  4. Are there specific compatibility requirements I should be aware of?
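
To make question 2 concrete, one possible provider-agnostic surface could look like this (every name below is an assumption for discussion, not an existing interface):

from typing import Protocol

class ComputerProvider(Protocol):
    # Minimal surface any backend (Cua, OpenAI computer-use-preview,
    # Anthropic computer use, ...) would need to implement.
    async def screenshot(self) -> bytes: ...
    async def left_click(self, x: int, y: int) -> None: ...
    async def type_text(self, text: str) -> None: ...
    async def press_key(self, key: str) -> None: ...
    async def run_command(self, command: str) -> str: ...

A Responses-API-specific path could then be one implementation of this protocol rather than the integration's only shape.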

Happy to work on this if it's useful. Let me know what makes sense for next steps.

