Computer Use Integration
Hi @markwallace-microsoft, @giles17 and @peterychang! I'd like to help with the Computer Use implementation mentioned in issue #377.
I maintain Cua, which handles cross-platform desktop automation in local virtual environments (Windows Sandbox, macOS VMs via Apple's Virtualization.framework, Linux on Docker). It supports 100+ VLM configurations (OpenAI Computer Use Preview, Anthropic Claude, OpenCUA, InternVL, GLM, UI-TARS, local Ollama) and provides primitives for mouse/keyboard control, screenshots, and shell commands.
Demo video: windows-sandbox.mp4
Proposed Integration Approach:
I see a few ways Computer Use could fit into the Agent Framework:
- As a Tool (like MCP tools in Python: Add MCP tool support to Foundry chat client #595):

  from agent_framework import ChatAgent
  from agent_framework.openai import OpenAIChatClient
  from agent_framework.tools import ComputerUseTool

  computer = ComputerUseTool(provider="windows", name="windows-sandbox")
  agent = ChatAgent(
      chat_client=OpenAIChatClient(),
      tools=[computer],
      instructions="You are a desktop automation assistant.",
  )
- As a Chat Client extension (following the pattern of web search in Python: Web search file search tools #395):

  from agent_framework import ChatAgent
  from agent_framework.openai import OpenAIChatClient

  client = OpenAIChatClient(tools=["computer_use"])
  agent = ChatAgent(chat_client=client)
- In the lab package (for experimentation before core integration):

  from agent_framework import ChatAgent
  from agent_framework.lab.cua import CuaComputerTool

  computer = CuaComputerTool()
  agent = ChatAgent(tools=[computer], ...)
Technical Details:
The Cua Computer class provides methods that map well to the Agent Framework tool pattern:
# Screenshot capture
screenshot = await computer.screenshot()
# Mouse actions
await computer.left_click(x, y)
await computer.double_click(x, y)
await computer.move_cursor(x, y)
# Keyboard actions
await computer.type_text("Hello")
await computer.press_key("enter")
await computer.hotkey("command", "c")
# Shell commands
result = await computer.run_command("ls -la")
The API design should align well: Cua is async-first, and its tool-based architecture means each primitive could be wrapped as an AIFunction, as in the sketch below.
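For concreteness, here is a minimal sketch of that wrapping. The Computer() construction and the assumption that ChatAgent accepts plain async callables as tools are illustrative, not final API; the Cua method calls follow the snippets above.

# Sketch only: exposing Cua primitives as async tool functions.
# Assumptions (not final API): Computer() default construction is
# simplified, and agent_framework wraps async callables as AIFunctions.
from typing import Annotated

from agent_framework import ChatAgent
from agent_framework.openai import OpenAIChatClient
from computer import Computer  # Cua's Computer class

computer = Computer()  # illustrative; see docs.trycua.com for real setup

async def take_screenshot() -> bytes:
    """Capture the current screen of the sandboxed desktop."""
    return await computer.screenshot()

async def click(
    x: Annotated[int, "X coordinate in screen pixels"],
    y: Annotated[int, "Y coordinate in screen pixels"],
) -> str:
    """Left-click at the given screen coordinates."""
    await computer.left_click(x, y)
    return f"clicked ({x}, {y})"

async def type_text(text: Annotated[str, "Text to type"]) -> str:
    """Type text into the currently focused window."""
    await computer.type_text(text)
    return f"typed {len(text)} characters"

agent = ChatAgent(
    chat_client=OpenAIChatClient(),
    tools=[take_screenshot, click, type_text],
    instructions="You are a desktop automation assistant.",
)

Each function would surface to the model as a described tool, letting the agent run the usual screenshot, reason, act loop that computer-use models expect.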
Questions:
- Which integration pattern would align best with the existing built-in tools architecture?
- Should this follow the OpenAI Responses API pattern or be more generic to support multiple providers?
- Would you prefer I start with:
  - A proof-of-concept PR showing the integration
  - An ADR (Architectural Decision Record) first
  - A lab package for experimentation
- Are there specific compatibility requirements I should be aware of?
Happy to work on this if it's useful. Let me know what makes sense for next steps.
Additional Context:
- Cua GitHub: https://github.com/trycua/cua
- Cua Documentation: https://docs.trycua.com