
Single Agent: Computer Use Integration #1095

@f-trycua

Computer Use Integration

Hi @markwallace-microsoft, @giles17 and @peterychang! I'd like to help with the Computer Use implementation mentioned in issue #377.

I maintain Cua, which handles cross-platform desktop automation in local virtual environments (Windows Sandbox, macOS VMs via Apple's Virtualization.framework, Linux via Docker). It supports 100+ VLM configurations (OpenAI Computer Use Preview, Anthropic Claude, OpenCUA, InternVL, GLM, UI-TARS, local Ollama) and provides primitives for mouse/keyboard control, screenshots, and shell commands.

(Demo video: windows-sandbox.mp4)
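
For reference, creating a Cua session looks roughly like this (a sketch only; the constructor arguments are assumptions echoing the tool example below, not Cua's exact signature):

import asyncio
from computer import Computer  # Cua's Python package

async def main():
    # Hypothetical arguments: provider/name mirror the proposed
    # ComputerUseTool below; check Cua's docs for the real signature.
    async with Computer(provider="windows", name="windows-sandbox") as computer:
        screenshot = await computer.screenshot()

asyncio.run(main())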

Proposed Integration Approach:

I see Computer Use could fit into the Agent Framework in a few ways (an end-to-end usage sketch follows the list):

  1. As a Tool (like MCP tools in Python; see #595, "Add MCP tool support to Foundry chat client"):

    from agent_framework import ChatAgent
    from agent_framework.openai import OpenAIChatClient
    from agent_framework.tools import ComputerUseTool
    
    computer = ComputerUseTool(provider="windows", name="windows-sandbox")
    agent = ChatAgent(
        chat_client=OpenAIChatClient(),
        tools=[computer],
        instructions="You are a desktop automation assistant."
    )
  2. As a Chat Client extension (following the web search pattern in Python; see #395, "Web search file search tools"):

    from agent_framework.openai import OpenAIChatClient
    
    client = OpenAIChatClient(tools=["computer_use"])
    agent = ChatAgent(chat_client=client)
  3. In the lab package (for experimentation before core integration):

    from agent_framework.lab.cua import CuaComputerTool
    from agent_framework import ChatAgent
    
    computer = CuaComputerTool()
    agent = ChatAgent(tools=[computer], ...)
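
Whichever shape lands, calling an agent with the tool attached would look roughly the same. A sketch assuming option 1's proposed ComputerUseTool and the framework's async run API:

import asyncio
from agent_framework import ChatAgent
from agent_framework.openai import OpenAIChatClient
from agent_framework.tools import ComputerUseTool  # proposed above, not yet real

async def main():
    # The model plans each step; the tool executes clicks and keystrokes
    # inside the sandboxed VM and returns screenshots as observations.
    computer = ComputerUseTool(provider="windows", name="windows-sandbox")
    agent = ChatAgent(
        chat_client=OpenAIChatClient(),
        tools=[computer],
        instructions="You are a desktop automation assistant.",
    )
    result = await agent.run("Open Notepad and type 'hello world'.")
    print(result)

asyncio.run(main())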

Technical Details:

The Cua Computer class provides methods that map well to the Agent Framework tool pattern:

# Screenshot capture
screenshot = await computer.screenshot()

# Mouse actions
await computer.left_click(x, y)
await computer.double_click(x, y)
await computer.move_cursor(x, y)

# Keyboard actions
await computer.type_text("Hello")
await computer.press_key("enter")
await computer.hotkey("command", "c")

# Shell commands
result = await computer.run_command("ls -la")

The API design should align well: Cua is async-first and uses a tool-based architecture whose methods could be wrapped as AIFunctions.
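
As a rough illustration of that wrapping (a sketch only: the ai_function decorator name and the schema-from-signature behavior are assumptions about the framework, not confirmed API):

from agent_framework import ai_function  # decorator name is an assumption

def make_click_tool(computer):
    # Wraps one Cua primitive as a framework tool; the framework is assumed
    # to derive the tool schema from the signature and docstring.
    @ai_function
    async def left_click(x: int, y: int) -> str:
        """Left-click at the given screen coordinates (pixels)."""
        await computer.left_click(x, y)
        return f"clicked ({x}, {y})"

    return left_click

Each Cua primitive (type_text, press_key, run_command, ...) would get the same treatment, and the resulting functions go straight into ChatAgent(tools=[...]).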

Questions:

  1. Which integration pattern would align best with the existing built-in tools architecture?
  2. Should this follow the OpenAI Responses API pattern, or be generic enough to support multiple providers? (A generic-interface sketch follows these questions.)
  3. Would you prefer I start with:
    • A proof-of-concept PR showing the integration
    • An ADR (Architectural Decision Record) first
    • A lab package for experimentation
  4. Are there specific compatibility requirements I should be aware of?
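
To make question 2 concrete, one possible provider-agnostic surface could look like this (every name below is an assumption for discussion, not an existing interface):

from typing import Protocol

class ComputerProvider(Protocol):
    # Minimal surface any backend (Cua, OpenAI computer-use-preview,
    # Anthropic computer use, ...) would need to implement.
    async def screenshot(self) -> bytes: ...
    async def left_click(self, x: int, y: int) -> None: ...
    async def type_text(self, text: str) -> None: ...
    async def press_key(self, key: str) -> None: ...
    async def run_command(self, command: str) -> str: ...

A Responses-API-specific path could then be one implementation of this protocol rather than the integration's only shape.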

Happy to work on this if it's useful. Let me know what makes sense for next steps.

