Feature/browser automation vision 6915 by Dev0907 · Pull Request #6959 · aden-hive/hive

Dev0907 · 2026-04-05T19:31:51Z

This PR adds a computer automation layer to the Hive codebase, implementing Playwright-based headless Chrome browser control for agent workflows. This enables agents to perform reliable, deterministic browser automation tasks including navigation, form interactions, screenshot capture, and JavaScript evaluation. Previously, agents could reason about browser tasks but lacked the capability to execute them reliably without manual intervention.

The implementation provides:

Headless Chrome browser control using Playwright
Isolated browser contexts per agent profile for safe concurrent execution
Comprehensive browser action tools (navigation, clicks, typing, scrolling, screenshots, JS evaluation)
Structured action execution for step-by-step browser automation
Screenshot capture functionality to support vision model workflows
MCP tool integration for seamless agent access
This resolves the core issue of agents being unable to operate browser UIs deterministically and enables complex web automation workflows.

Related Issues
Fixes #6915: Browser automation for using Vision language models
Changes Made

Implement BeelineBridge class with Playwright-based headless Chrome automation
Add browser session management with isolated contexts per agent profile
Create comprehensive MCP tools for browser actions (lifecycle, navigation, interactions, inspection, advanced)
Enable screenshot capture for vision model integration
Update GCU tools module for agent workflow integration
Update GCU documentation to reflect Playwright implementation and new capabilities

Testing performed to verify the changes:

Code passes linting checks with ruff check (no errors or warnings)
Existing test suite validates API compatibility and tool registration
Mock-based testing ensures reliability for browser automation tools
Manual verification of implementation against problem statement requirements
Documentation updates reviewed for accuracy and completeness

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Summary by CodeRabbit

Release Notes

New Features
- Introduced Playwright-based headless browser automation with full screenshot capture support for vision model integration.
- Added automated browser context isolation per profile for deterministic, structured action execution.
Documentation
- Updated operational guidance and selection criteria for headless automation scenarios requiring computer vision capabilities.
Refactor
- Simplified browser automation tool APIs with streamlined parameters and consistent error handling patterns.

- Implement BeelineBridge with headless Chrome control - Add structured browser actions: navigation, interactions, inspection - Register MCP tools for deterministic browser automation - Enable screenshot capture for vision workflows - Update GCU tools for agent integration Resolves aden-hive#6915: Browser automation for using Vision language models

- Document Playwright implementation and vision workflow support - Update system prompt best practices for screenshot usage - Clarify isolated browser contexts and deterministic execution

github-actions · 2026-04-05T19:32:03Z

PR Requirements Warning

This PR does not meet the contribution requirements.
If the issue is not fixed within ~24 hours, it may be automatically closed.

PR Author: @Dev0907
Found issues: #6915 (assignees: none)
Problem: The PR author must be assigned to the linked issue.

To fix:

Assign yourself (@Dev0907) to one of the linked issues
Re-open this PR

Exception: To bypass this requirement, you can:

Add the micro-fix label or include micro-fix in your PR title for trivial fixes
Add the documentation label or include doc/docs in your PR title for documentation changes

Micro-fix requirements (must meet ALL):

Qualifies	Disqualifies
< 20 lines changed	Any functional bug fix
Typos & Documentation & Linting	Refactoring for "clean code"
No logic/API/DB changes	New features (even tiny ones)

Why is this required? See #472 for details.

coderabbitai · 2026-04-05T19:32:07Z

📝 Walkthrough

Walkthrough

Replaces the BeelineBridge WebSocket/CDP browser automation architecture with an in-process Playwright-based system. Refactors all GCU browser tools to accept profile identifiers for context isolation, removes advanced parameters and telemetry instrumentation, and updates documentation to describe deterministic headless-Chrome workflows with structured action execution and screenshot capture for vision integration.

Changes

Cohort / File(s)	Summary
Documentation `core/framework/agents/queen/reference/gcu_guide.md`	Updated GCU guide selection criteria to include vision scenarios, narrowed tool coverage from "all 31 browser tools" to "all browser automation tools," and replaced manual tab isolation guidance with automatic isolated browser context per subagent. Added `browser_screenshot` recommendation for vision workflows.
Core Bridge Refactor `tools/src/gcu/browser/bridge.py`	Replaced WebSocket/CDP extension bridge with in-process Playwright Chromium instance. Added `BrowserConfig` for headless/viewport/user-agent configuration, refactored to manage per-profile `BrowserContext`s and `Page` objects, converted all operations from CDP protocol calls to Playwright method calls (navigation, interactions, evaluation, screenshot capture), and removed module-level singleton lifecycle (`init_bridge`, `get_bridge`, `start`, `stop`).
Advanced Tools Simplification `tools/src/gcu/browser/tools/advanced.py`	Removed 6 MCP-registered tools (`browser_wait`, `browser_get_text`, `browser_get_attribute`, `browser_resize`, `browser_upload`, `browser_dialog`). Retained only `browser_evaluate(profile, script)` which now derives tab via `bridge.get_current_tab(profile)` and simplified error handling.
Inspection Tools Reduction `tools/src/gcu/browser/tools/inspection.py`	Removed screenshot normalization/annotation (Pillow overlays), shadow DOM queries, accessibility snapshots, HTML/console introspection, and coordinate conversion logic. Kept minimal trio: `browser_get_text`, `browser_screenshot`, and `browser_evaluate`, each accepting `profile` parameter and deriving active tab from bridge state.
Interaction Tools Refactoring `tools/src/gcu/browser/tools/interactions.py`	Consolidated 9+ parameterized interaction tools into 5 simplified async functions: `browser_click(profile, selector)`, `browser_type(profile, selector, text)`, `browser_press_key(profile, key)`, `browser_scroll(profile, direction, amount)`, `browser_select_option(profile, selector, values)`. Removed timing, telemetry, and advanced parameters (`button`, `delay_ms`, `timeout_ms`).
Lifecycle Tools Simplification `tools/src/gcu/browser/tools/lifecycle.py`	Removed global context management (`_contexts`), eliminated `browser_setup()` and `browser_status()` tools, simplified remaining `browser_start(profile)` and `browser_stop(profile)` to delegate directly to `bridge.create_context()` and `bridge.destroy_context()` with minimal error handling.
Navigation Tools Streamlining `tools/src/gcu/browser/tools/navigation.py`	Renamed `browser_navigate` to `browser_open(profile, url, wait_until)`, refactored `browser_go_back`, `browser_go_forward`, `browser_reload` to accept only `profile` parameter, removed tab lookup/connectivity checks and telemetry, simplified error handling.
Tab Tools Consolidation `tools/src/gcu/browser/tools/tabs.py`	Removed 4 MCP-registered tab tools (`browser_open`, `browser_focus`, `browser_close_all`, `browser_close_finished`). Introduced 2 new functions: `browser_tabs(profile)` returns tab list, `browser_close_tab(tab_id, profile)` closes a specific tab; removed context/telemetry plumbing.

Sequence Diagram

sequenceDiagram
    participant Agent as GCU Agent
    participant Tool as Browser Tool<br/>(e.g., browser_click)
    participant Bridge as BeelineBridge<br/>(in-process)
    participant PW as Playwright API
    participant Browser as Chromium Browser

    Agent->>Tool: call browser_click(profile="work", selector=".btn")
    Tool->>Bridge: get_current_tab(profile)
    Bridge-->>Tool: tab_id=42
    Tool->>Bridge: click(tab_id=42, selector=".btn")
    Bridge->>PW: page.click(selector)
    PW->>Browser: send CDP click action
    Browser-->>PW: action completed
    PW-->>Bridge: success
    Bridge-->>Tool: {"ok": True}
    Tool-->>Agent: {"ok": True}

    Agent->>Tool: call browser_screenshot(profile="work")
    Tool->>Bridge: get_current_tab(profile)
    Bridge-->>Tool: tab_id=42
    Tool->>Bridge: screenshot(tab_id=42, full_page=False)
    Bridge->>PW: page.screenshot()
    PW->>Browser: capture screen
    Browser-->>PW: base64 image data
    PW-->>Bridge: base64 string
    Bridge-->>Tool: {"screenshot": "<base64>"}
    Tool-->>Agent: {"screenshot": "<base64>"}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

Release v0.9.0 — Browser Extension, Queen Memory v2 & Graph Executor Refactor #6905: Introduces the initial WebSocket-based BeelineBridge architecture that this PR replaces with an in-process Playwright implementation, making these PRs sequential architectural changes to the same bridge system.

Poem

🐰 Hop, hop! The bridge takes a leap,
From sockets dancing deep in the CDP weep,
Now Playwright springs in-process and fleet,
Profile-based contexts, a deterministic beat! 🎯
Browser automation—simple, clean, complete!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'Feature/browser automation vision 6915' is a branch name rather than a descriptive summary; it lacks clarity about the actual implementation (Playwright-based headless Chrome with MCP tools).	Consider renaming to a more descriptive title like 'Add Playwright-based browser automation with MCP tools for headless Chrome control' to clearly communicate the main change.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	The implementation comprehensively addresses all stated objectives: Playwright-based headless Chrome (BeelineBridge), isolated contexts per profile, MCP tools for lifecycle/navigation/interactions/inspection, screenshot capture, structured action execution, and test coverage.
Out of Scope Changes check	✅ Passed	All changes directly support browser automation: GCU guide documentation updates, BeelineBridge implementation, MCP tool modules, and removal of obsolete CDP-based infrastructure are all in scope.
Docstring Coverage	✅ Passed	Docstring coverage is 96.61% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

core/framework/agents/queen/reference/gcu_guide.md (1)
54-58: ⚠️ Potential issue | 🟡 Minor

Update the examples to the new browser_open contract.

tools/src/gcu/browser/tools/navigation.py Lines 14-36 now use browser_open to navigate the current tab and return the final URL; it no longer opens a new tab or returns targetId. These prompt examples still teach agents to capture and propagate target_id, which no longer exists.

Also applies to: 160-164
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/agents/queen/reference/gcu_guide.md` around lines 54 - 58,
Update the workflow and examples to match the new browser_open contract: remove
references to returning or propagating targetId/target_id and instead show that
browser_open(url=TARGET_URL) navigates the current tab and returns the final
URL; adjust the sequence (browser_start, browser_open, browser_snapshot,
[task-specific steps]) so examples capture and pass the returned URL value (not
a target id), and update the other example block mentioned (lines ~160-164) the
same way; look for references to browser_open, browser_start, browser_snapshot,
targetId, and target_id when making the edits.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/src/gcu/browser/bridge.py`:
- Around line 130-145: When removing contexts or tabs update the _current_tabs
mapping so it doesn't leave dead tab IDs: in destroy_context (and the analogous
close_tab handler) after deleting from _tabs and _contexts check if
_current_tabs.get(profile) references a removed tab and either set it to another
existing tab ID for that profile (pick any tab in _tabs whose page.context
matches the remaining context or same profile) or delete the _current_tabs entry
to clear focus; ensure you reference and update _tabs, _contexts, _current_tabs,
get_current_tab, destroy_context and close_tab so subsequent calls to
get_current_tab(profile) return a valid tab or None.
- Around line 98-109: When handling an existing profile in browser_start, don't
always allocate a new tabId; if pages exist and pages[0] is already referenced
in _current_tabs return that existing tab id instead of creating a fresh alias.
Modify the branch in browser_start (the block referencing self._contexts,
context.pages, self._next_tab_id, self._tabs, and context.new_page) to: check if
pages and pages[0] is present in self._current_tabs.values (or if you maintain a
profile->tab mapping, check self._current_tabs[profile]) and return that tab id;
otherwise create a new tab id, assign it into both self._tabs and
self._current_tabs (e.g. self._current_tabs[profile] = tab_id) before returning
{"groupId": id(context), "tabId": tab_id}.
- Around line 58-76: The connect() method is racy because multiple concurrent
callers can pass the initial is_connected check and concurrently start
Playwright/Chromium, so add a per-bridge asyncio.Lock (e.g., self._connect_lock
= asyncio.Lock()) and wrap the startup path in an async with self._connect_lock:
block that re-checks self.is_connected, then initializes self._playwright and
self._browser and sets self.is_connected=True only after successful launch;
ensure you reset/cleanup _playwright/_browser and leave is_connected False if
launch raises so subsequent callers can retry (use the existing _playwright and
_browser symbols and the is_connected flag).
- Around line 356-372: The drag() implementation incorrectly uses
page.drag_and_drop() with a synthesized CSS selector for absolute coordinates;
instead locate the source element via the provided selector (use
page.locator(selector) or page.query_selector and call bounding_box()), compute
the start coordinates (center of bounding box), then perform coordinate-based
mouse actions with page.mouse.move(start_x, start_y), page.mouse.down(),
page.mouse.move(target_x, target_y, steps=...), and page.mouse.up(); update the
error handling around PlaywrightError to remain consistent and return {"ok":
False, "error": str(e)} on failure.
- Around line 449-461: The current wait_for_text method interpolates the raw
text into the JS predicate, enabling injection and breaking on
quotes/backslashes; change the call in wait_for_text to pass the text as an
argument to page.wait_for_function instead of string interpolation (e.g., use a
predicate like "text => document.body.innerText.includes(text)" and pass
arg=text), keep the existing timeout handling, and ensure you reference the same
method name wait_for_text and variable page when making the replacement.

In `@tools/src/gcu/browser/tools/advanced.py`:
- Around line 38-40: register_advanced_tools currently re-registers
browser_evaluate and thus never exposes the wait primitives; change
register_advanced_tools to register a browser_wait tool that wraps
BeelineBridge.wait_for_selector and BeelineBridge.wait_for_text (or a single
handler that dispatches to those methods) instead of re-registering
browser_evaluate, and ensure the tool name matches the agents' expectation
("browser_wait"); update the registration in register_advanced_tools and the
exported handler function name so it doesn't conflict with the existing
browser_evaluate registration in inspection.py and so callers of browser_wait
will invoke the BeelineBridge wait_for_selector/wait_for_text behavior.

In `@tools/src/gcu/browser/tools/inspection.py`:
- Around line 86-90: register_inspection_tools currently registers
browser_get_text, browser_screenshot, and browser_evaluate but omits
browser_snapshot, which removes the structured DOM/accessibility snapshot from
the MCP API; update register_inspection_tools to also register browser_snapshot
(i.e., call mcp.tool()(browser_snapshot) alongside the other tools), ensuring
the browser_snapshot function referenced in tools/src/gcu/browser/bridge.py is
imported/available in the module so the MCP exposes the same snapshot primitive
used by the guide and snapshot() implementation.

In `@tools/src/gcu/browser/tools/tabs.py`:
- Around line 53-56: The MCP registration in register_tab_tools only exposes
browser_tabs and browser_close_tab, blocking multi-tab workflows; update
register_tab_tools to also register the tab-creation and tab-activation tools by
adding mcp.tool()(browser_create_tab) and mcp.tool()(browser_activate_tab) (or
whatever the existing tool wrappers are named) so the MCP surface matches
BeelineBridge.create_tab and BeelineBridge.activate_tab; then update
tools/src/gcu/browser/tools/navigation.py to call the newly-registered
browser_create_tab/browser_activate_tab tools (instead of only browser_open)
when the agent intends to open a new tab or switch tabs, ensuring the MCP API
supports creating and activating tabs.
- Around line 33-50: Validate tab ownership before calling bridge.close_tab: use
get_bridge() to retrieve the bridge and check its mapping/registry (e.g.,
bridge.tabs, bridge.get_tab_owner or similar) to assert that the provided tab_id
belongs to the supplied profile; if it does, call await bridge.close_tab(tab_id)
and return the result, otherwise return {"ok": False, "error": "tab does not
belong to profile"} (or raise an appropriate error). Update browser_close_tab to
perform this ownership check and avoid delegating directly to bridge.close_tab
without verifying profile ownership.

---

Outside diff comments:
In `@core/framework/agents/queen/reference/gcu_guide.md`:
- Around line 54-58: Update the workflow and examples to match the new
browser_open contract: remove references to returning or propagating
targetId/target_id and instead show that browser_open(url=TARGET_URL) navigates
the current tab and returns the final URL; adjust the sequence (browser_start,
browser_open, browser_snapshot, [task-specific steps]) so examples capture and
pass the returned URL value (not a target id), and update the other example
block mentioned (lines ~160-164) the same way; look for references to
browser_open, browser_start, browser_snapshot, targetId, and target_id when
making the edits.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76438fd2-c647-4891-90df-0996ab6d1a97

📥 Commits

Reviewing files that changed from the base of the PR and between aaa5d66 and 0ed1037.

📒 Files selected for processing (8)

core/framework/agents/queen/reference/gcu_guide.md
tools/src/gcu/browser/bridge.py
tools/src/gcu/browser/tools/advanced.py
tools/src/gcu/browser/tools/inspection.py
tools/src/gcu/browser/tools/interactions.py
tools/src/gcu/browser/tools/lifecycle.py
tools/src/gcu/browser/tools/navigation.py
tools/src/gcu/browser/tools/tabs.py

coderabbitai · 2026-04-05T19:47:04Z