
Browser Automation

Controlling a Chromium browser via CDP for web navigation, interaction, and data extraction

Overview

Aleph includes a browser automation system that lets the AI agent navigate the web, interact with page elements, extract content, and capture screenshots. The system controls a Chromium instance over the Chrome DevTools Protocol (CDP) and uses an ARIA accessibility tree to give the agent a structured view of each page.

Source locations:

  • Runtime: src/browser/runtime.rs
  • Actions: src/browser/actions.rs
  • Snapshots: src/browser/snapshot.rs
  • Types: src/browser/types.rs
  • Tool wrapper: src/builtin_tools/browser.rs
  • Discovery: src/browser/discovery.rs

Architecture

┌─────────────────────────────────────────────┐
│                 BrowserTool                 │
│            (AlephTool interface)            │
│                                             │
│  Arc<Mutex<Option<BrowserRuntime>>>         │
│  Optional ApprovalPolicy                    │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│               BrowserRuntime                │
│            (CDP transport layer)            │
│                                             │
│  Browser (chromiumoxide)                    │
│  HashMap<TabId, Page>                       │
│  CDP event loop (tokio task)                │
└──────────────────┬──────────────────────────┘
                   │ Chrome DevTools Protocol
                   ▼
┌─────────────────────────────────────────────┐
│              Chromium Instance              │
│          (headless or headed mode)          │
└─────────────────────────────────────────────┘

The BrowserTool is the AlephTool implementation that the agent interacts with. It wraps a BrowserRuntime in Arc<Mutex<Option<...>>> so the tool can be cloned (required by the AlephTool trait) while sharing a single browser instance. The BrowserRuntime manages the actual Chromium process via the chromiumoxide crate.
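The sharing pattern described above can be illustrated with a minimal sketch. It assumes a std::sync::Mutex and a placeholder BrowserRuntime; the real tool may well use an async mutex, and the method names here are hypothetical.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the real BrowserRuntime; only the sharing
// pattern matters in this sketch.
struct BrowserRuntime;

// Cloning the tool clones the Arc, so every clone points at the same slot.
#[derive(Clone)]
struct BrowserTool {
    runtime: Arc<Mutex<Option<BrowserRuntime>>>,
}

impl BrowserTool {
    fn new() -> Self {
        // `None` means "no browser has been started yet".
        BrowserTool { runtime: Arc::new(Mutex::new(None)) }
    }

    fn start(&self) {
        let mut slot = self.runtime.lock().unwrap();
        if slot.is_none() {
            *slot = Some(BrowserRuntime);
        }
    }

    fn stop(&self) {
        // Taking the value drops the runtime and leaves `None` behind.
        let _ = self.runtime.lock().unwrap().take();
    }

    fn is_running(&self) -> bool {
        self.runtime.lock().unwrap().is_some()
    }
}
```

Starting the browser through one clone makes it visible through every other clone, which is the property the Arc<Mutex<Option<...>>> wrapper is there to provide.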

Launch Modes

Aleph supports three ways to obtain a browser instance:

| Mode | Description | Configuration |
|---|---|---|
| Auto | Automatically discover a Chromium binary on the system | LaunchMode::Auto |
| Binary | Use a specific browser executable path | LaunchMode::Binary { path } |
| Connect | Connect to an existing browser via WebSocket | LaunchMode::Connect { endpoint } |

Browser Configuration

pub struct BrowserConfig {
    pub mode: LaunchMode,           // Auto, Binary, or Connect
    pub headless: bool,             // Headless mode (default: false)
    pub cdp_port: u16,              // CDP port (default: 9222)
    pub user_data_dir: Option<String>, // Custom profile directory
    pub extra_args: Vec<String>,    // Extra Chromium command-line args
}
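As a rough illustration of how the launch modes and the documented defaults fit together, the sketch below defines LaunchMode and a Default implementation for BrowserConfig. The variant field types (PathBuf, String), the choice of Auto as the default mode, and the connect_config helper are assumptions made for this example.

```rust
use std::path::PathBuf;

// Variant names follow the launch-mode table; the field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum LaunchMode {
    Auto,
    Binary { path: PathBuf },
    Connect { endpoint: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct BrowserConfig {
    pub mode: LaunchMode,
    pub headless: bool,
    pub cdp_port: u16,
    pub user_data_dir: Option<String>,
    pub extra_args: Vec<String>,
}

impl Default for BrowserConfig {
    fn default() -> Self {
        BrowserConfig {
            mode: LaunchMode::Auto, // assumed default mode
            headless: false,        // documented default
            cdp_port: 9222,         // documented default
            user_data_dir: None,
            extra_args: Vec::new(),
        }
    }
}

// Hypothetical helper: attach to an already running browser and keep
// every other field at its default.
pub fn connect_config(endpoint: &str) -> BrowserConfig {
    BrowserConfig {
        mode: LaunchMode::Connect { endpoint: endpoint.to_string() },
        ..BrowserConfig::default()
    }
}
```

Struct update syntax (..BrowserConfig::default()) keeps call sites short when only one or two fields differ from the defaults.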

Chromium Discovery

In Auto mode, Aleph searches for Chromium in well-known locations. The find_chromium() function checks platform-specific paths to locate a suitable browser binary.
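A discovery routine of this kind might look like the sketch below. The candidate paths are common install locations chosen for illustration, not Aleph's actual search list, and the find_chromium() signature shown here is an assumption.

```rust
use std::path::{Path, PathBuf};

// Illustrative candidates only; the real search list may differ.
const CANDIDATES: &[&str] = &[
    "/usr/bin/chromium",
    "/usr/bin/chromium-browser",
    "/usr/bin/google-chrome",
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "/Applications/Chromium.app/Contents/MacOS/Chromium",
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",
];

// Return the first candidate that exists as a regular file.
pub fn first_existing<P: AsRef<Path>>(candidates: &[P]) -> Option<PathBuf> {
    for candidate in candidates {
        let path: &Path = candidate.as_ref();
        if path.is_file() {
            return Some(path.to_path_buf());
        }
    }
    None
}

// Assumed signature: None means no known browser binary was found.
pub fn find_chromium() -> Option<PathBuf> {
    first_existing(CANDIDATES)
}
```

Keeping the candidate list separate from the lookup makes the search order easy to test and to extend per platform.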

The Browser Tool

The browser tool exposes all browser operations through a single tool with an action parameter. This design gives the agent a consistent interface without needing to remember dozens of separate tool names.
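One way to picture the single-tool design is a dispatcher that maps the action string onto an enum. The sketch below uses the action names from the reference tables in this document; the BrowserAction type and its methods are hypothetical.

```rust
// Action names come from the reference tables; the enum itself is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BrowserAction {
    Start, Stop,
    OpenTab, CloseTab, ListTabs, Navigate,
    Click, Type, Fill, Scroll, Hover,
    Screenshot, Snapshot, Evaluate,
}

impl BrowserAction {
    // Map the "action" parameter to a variant; unknown names yield None.
    pub fn parse(name: &str) -> Option<Self> {
        let action = match name {
            "start" => Self::Start,
            "stop" => Self::Stop,
            "open_tab" => Self::OpenTab,
            "close_tab" => Self::CloseTab,
            "list_tabs" => Self::ListTabs,
            "navigate" => Self::Navigate,
            "click" => Self::Click,
            "type" => Self::Type,
            "fill" => Self::Fill,
            "scroll" => Self::Scroll,
            "hover" => Self::Hover,
            "screenshot" => Self::Screenshot,
            "snapshot" => Self::Snapshot,
            "evaluate" => Self::Evaluate,
            _ => return None,
        };
        Some(action)
    }

    // Per the parameter tables, every action except these four takes a tab_id.
    pub fn requires_tab_id(self) -> bool {
        !matches!(self, Self::Start | Self::Stop | Self::OpenTab | Self::ListTabs)
    }
}
```

With this shape, adding an action means adding one variant and one match arm rather than registering a new tool.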

Typical Workflow

The agent follows this workflow to interact with a web page:

start -> open_tab -> snapshot -> click/type/fill -> screenshot -> stop
  1. Start the browser (headed or headless)
  2. Open a tab to a URL
  3. Take a snapshot to get the ARIA accessibility tree with ref_ids
  4. Interact with elements using ref_ids from the snapshot
  5. Screenshot to verify the result visually
  6. Stop the browser when done

Actions Reference

Lifecycle

| Action | Description | Parameters |
|---|---|---|
| start | Launch a browser instance | Optional: headless (bool) |
| stop | Shut down the browser | None |

{ "action": "start" }
{ "action": "start", "headless": true }
{ "action": "stop" }

Tab Management

| Action | Description | Parameters |
|---|---|---|
| open_tab | Open a new tab | url (required) |
| close_tab | Close an existing tab | tab_id (required) |
| list_tabs | List all open tabs | None |
| navigate | Navigate a tab to a URL | tab_id, url (both required) |

{ "action": "open_tab", "url": "https://example.com" }
{ "action": "list_tabs" }
{ "action": "navigate", "tab_id": "ABC123", "url": "https://other.com" }
{ "action": "close_tab", "tab_id": "ABC123" }

Element Interaction

| Action | Description | Parameters |
|---|---|---|
| click | Click an element | tab_id, ref_id or selector |
| type | Append text to an element | tab_id, ref_id or selector, text |
| fill | Replace element value | tab_id, ref_id or selector, text |
| scroll | Scroll page or element | tab_id, direction (up/down/left/right) |
| hover | Hover over an element | tab_id, ref_id or selector |

{ "action": "click", "tab_id": "ABC123", "ref_id": "e42" }
{ "action": "type", "tab_id": "ABC123", "ref_id": "e7", "text": "search query" }
{ "action": "fill", "tab_id": "ABC123", "selector": "input#email", "text": "user@example.com" }
{ "action": "scroll", "tab_id": "ABC123", "direction": "down" }
{ "action": "hover", "tab_id": "ABC123", "ref_id": "e15" }

Observation

| Action | Description | Parameters |
|---|---|---|
| screenshot | Capture a tab as PNG | tab_id, optional full_page |
| snapshot | Get ARIA accessibility tree | tab_id |
| evaluate | Run JavaScript in the tab | tab_id, js |

{ "action": "screenshot", "tab_id": "ABC123", "full_page": true }
{ "action": "snapshot", "tab_id": "ABC123" }
{ "action": "evaluate", "tab_id": "ABC123", "js": "document.title" }

Element Targeting

Elements can be targeted using three methods, in order of priority:

1. ARIA Ref ID (Preferred)

Each element in an ARIA snapshot has a unique ref_id. This is the most reliable targeting method because it references elements by their accessibility tree position:

{ "action": "click", "tab_id": "...", "ref_id": "e42" }

The agent obtains ref_ids by first taking a snapshot:

{ "action": "snapshot", "tab_id": "..." }

Which returns:

{
  "elements": [
    {
      "ref_id": "e1",
      "role": "heading",
      "name": "Welcome",
      "bounds": { "x": 100, "y": 50, "width": 300, "height": 40 }
    },
    {
      "ref_id": "e42",
      "role": "button",
      "name": "Submit",
      "state": ["focused"],
      "bounds": { "x": 200, "y": 400, "width": 120, "height": 36 }
    }
  ],
  "page_title": "My Page",
  "page_url": "https://example.com",
  "focused_ref": "e42"
}

2. CSS Selector

When a ref_id is not available, use a CSS selector:

{ "action": "click", "tab_id": "...", "selector": "button.submit" }
{ "action": "fill", "tab_id": "...", "selector": "#login-form input[type='email']", "text": "user@example.com" }

The selector is resolved with document.querySelector(), which returns the first matching element, and the element's viewport position is then read with getBoundingClientRect().
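To make the resolution step concrete, the sketch below builds a script that looks up a selector and returns a point inside the element. Targeting the center of the bounding rectangle is an assumption, and both function names are hypothetical.

```rust
// Escape a Rust string as a double-quoted JavaScript string literal so that
// selectors containing quotes or backslashes cannot break out of the script.
fn js_string_literal(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\u{2028}' => out.push_str("\\u2028"),
            '\u{2029}' => out.push_str("\\u2029"),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}

// Build a script that evaluates to the center of the first matching element,
// or to null when nothing matches. Using the center is an assumption.
pub fn selector_center_script(selector: &str) -> String {
    format!(
        "(() => {{ const el = document.querySelector({}); if (!el) return null; \
         const r = el.getBoundingClientRect(); \
         return {{ x: r.left + r.width / 2, y: r.top + r.height / 2 }}; }})()",
        js_string_literal(selector)
    )
}
```

Returning null for a missing element lets the caller report a clear "selector not found" result instead of failing inside the page.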

3. Coordinates (Internal)

Actions are ultimately resolved to viewport coordinates (x, y) internally. The Coordinates variant exists for programmatic use but is not typically used by the agent directly.

Action Implementation

All browser actions work by resolving the target to viewport coordinates and then executing JavaScript in the page context:

Click

Resolves the target to (x, y), then calls document.elementFromPoint(x, y).click().

Type

Focuses the target element, then appends text to its value property and dispatches an input event with bubbles: true.

Fill

Similar to type, but replaces the entire value instead of appending. Dispatches both input and change events to ensure frameworks (React, Vue, etc.) detect the change.
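The difference between type and fill can be shown with a toy model of an input field. FakeInput is a hypothetical stand-in for a DOM element, and firing input before change is an assumed ordering; the text above only says that both events are dispatched.

```rust
// Toy stand-in for an <input> element: tracks its value and the events fired.
#[derive(Debug, Default)]
pub struct FakeInput {
    pub value: String,
    pub events: Vec<&'static str>,
}

impl FakeInput {
    // `type`: append to the existing value and fire `input`.
    pub fn type_text(&mut self, text: &str) {
        self.value.push_str(text);
        self.events.push("input");
    }

    // `fill`: replace the whole value and fire `input` followed by `change`
    // (the order is an assumption of this sketch).
    pub fn fill(&mut self, text: &str) {
        self.value = text.to_string();
        self.events.push("input");
        self.events.push("change");
    }
}
```

In this model, typing "foo" and then "bar" leaves "foobar" in the field, while a later fill("baz") discards the earlier text entirely.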

Scroll

Uses scrollBy() with behavior: 'smooth' and a fixed delta of 300px in the specified direction.
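A minimal sketch of the direction handling follows, assuming the script calls window.scrollBy with an options object. The ScrollDirection type and function names are hypothetical; the 300px delta and smooth behavior come from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScrollDirection {
    Up,
    Down,
    Left,
    Right,
}

// Fixed scroll distance described in this section.
const SCROLL_DELTA_PX: i32 = 300;

impl ScrollDirection {
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "up" => Some(Self::Up),
            "down" => Some(Self::Down),
            "left" => Some(Self::Left),
            "right" => Some(Self::Right),
            _ => None,
        }
    }

    // (dx, dy) in CSS pixels; negative values scroll up or left.
    pub fn delta(self) -> (i32, i32) {
        match self {
            Self::Up => (0, -SCROLL_DELTA_PX),
            Self::Down => (0, SCROLL_DELTA_PX),
            Self::Left => (-SCROLL_DELTA_PX, 0),
            Self::Right => (SCROLL_DELTA_PX, 0),
        }
    }
}

// Build the script for a page-level scroll in the given direction.
pub fn scroll_script(direction: ScrollDirection) -> String {
    let (dx, dy) = direction.delta();
    format!(
        "window.scrollBy({{ left: {}, top: {}, behavior: 'smooth' }})",
        dx, dy
    )
}
```

Because smooth scrolling is animated, a caller that snapshots immediately afterwards may see the page mid-scroll; waiting briefly before the next observation is a reasonable precaution.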

Hover

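Dispatches mouseenter and mouseover events at the target coordinates to trigger JavaScript hover listeners. If these are synthetic events dispatched from page script, they do not activate CSS :hover styles, which respond only to real pointer input.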

Screenshots

Screenshots are captured as PNG (default) or JPEG and returned as Base64-encoded data:

pub struct ScreenshotResult {
    pub data_base64: String,  // Base64-encoded image
    pub width: u32,           // Image width in pixels
    pub height: u32,          // Image height in pixels
    pub format: String,       // "png" or "jpeg"
}

Options

| Option | Type | Default | Description |
|---|---|---|---|
| full_page | boolean | false | Capture entire scrollable page |
| format | string | "png" | Image format: png, jpeg, webp |
| quality | integer | 80 | JPEG quality (1-100) |
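The options table above could be modeled roughly as follows. ScreenshotOptions, ImageFormat, and validate are hypothetical names; the defaults and the 1-100 quality range are taken from the table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageFormat {
    Png,
    Jpeg,
    Webp,
}

impl ImageFormat {
    // Accepts the format names listed in the options table.
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "png" => Some(Self::Png),
            "jpeg" => Some(Self::Jpeg),
            "webp" => Some(Self::Webp),
            _ => None,
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScreenshotOptions {
    pub full_page: bool,
    pub format: ImageFormat,
    pub quality: u8,
}

impl Default for ScreenshotOptions {
    fn default() -> Self {
        // Defaults from the table: viewport only, PNG, quality 80.
        ScreenshotOptions { full_page: false, format: ImageFormat::Png, quality: 80 }
    }
}

impl ScreenshotOptions {
    // Reject quality values outside the documented 1-100 range.
    pub fn validate(&self) -> Result<(), String> {
        if (1..=100).contains(&self.quality) {
            Ok(())
        } else {
            Err(format!("quality must be between 1 and 100, got {}", self.quality))
        }
    }
}
```

Validating before the request reaches the browser keeps malformed options from surfacing as opaque protocol errors.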

ARIA Accessibility Snapshot

The accessibility snapshot provides a structured representation of the page:

pub struct AriaSnapshot {
    pub elements: Vec<AriaElement>,   // Flat list of elements
    pub page_title: Option<String>,   // Document title
    pub page_url: Option<String>,     // Current URL
    pub focused_ref: Option<String>,  // Currently focused element
}

pub struct AriaElement {
    pub ref_id: String,               // Unique reference ID
    pub role: String,                 // ARIA role (button, textbox, link, etc.)
    pub name: Option<String>,         // Accessible name
    pub value: Option<String>,        // Current value (for inputs)
    pub state: Vec<String>,           // States (focused, checked, disabled, etc.)
    pub bounds: Option<ElementRect>,  // Viewport position and size
    pub children: Vec<AriaElement>,   // Nested elements
}

The snapshot is the primary way the agent understands page structure. By reading the ARIA tree, the agent can identify interactive elements, understand their purpose, and target them by ref_id.
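Because AriaElement carries a children field, looking up a ref_id may require descending into nested elements. The sketch below uses a trimmed copy of the struct (only the fields needed here); the two helper functions are hypothetical and not part of the documented API.

```rust
// Trimmed copy of AriaElement with only the fields this example needs.
#[derive(Debug, Clone, Default)]
pub struct AriaElement {
    pub ref_id: String,
    pub role: String,
    pub name: Option<String>,
    pub children: Vec<AriaElement>,
}

// Depth-first search for an element with the given ref_id.
pub fn find_by_ref<'a>(elements: &'a [AriaElement], ref_id: &str) -> Option<&'a AriaElement> {
    for el in elements {
        if el.ref_id == ref_id {
            return Some(el);
        }
        if let Some(found) = find_by_ref(&el.children, ref_id) {
            return Some(found);
        }
    }
    None
}

// Collect the ref_ids of every element with the given role, in document order.
pub fn refs_with_role<'a>(elements: &'a [AriaElement], role: &str, out: &mut Vec<&'a str>) {
    for el in elements {
        if el.role == role {
            out.push(el.ref_id.as_str());
        }
        refs_with_role(&el.children, role, out);
    }
}
```

A helper like refs_with_role is one way an agent-side component could list, say, all buttons or links on a page before deciding what to click.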

Approval Policy

The browser tool supports an approval policy for gating sensitive actions. When configured, mutating actions require approval before execution:

| Action | Requires Approval |
|---|---|
| start, stop | No |
| list_tabs, screenshot, snapshot, scroll, hover | No (read-only) |
| open_tab, navigate | Yes (navigation) |
| click | Yes (mutation) |
| type, fill | Yes (input) |
| evaluate | Yes (JavaScript execution) |

let policy = Arc::new(MyApprovalPolicy::new());
let tool = BrowserTool::new().with_approval_policy(policy);
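The approval table above can be expressed as a small classification function. This is a sketch only: Gate and gate_for are hypothetical names, and actions the table does not list (close_tab, for example) are reported as Unlisted rather than guessed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Gate {
    // The table says no approval is needed.
    NotRequired,
    // Approval is needed; the string is the category from the table.
    Required(&'static str),
    // The table does not mention this action.
    Unlisted,
}

pub fn gate_for(action: &str) -> Gate {
    match action {
        "start" | "stop" => Gate::NotRequired,
        "list_tabs" | "screenshot" | "snapshot" | "scroll" | "hover" => Gate::NotRequired,
        "open_tab" | "navigate" => Gate::Required("navigation"),
        "click" => Gate::Required("mutation"),
        "type" | "fill" => Gate::Required("input"),
        "evaluate" => Gate::Required("JavaScript execution"),
        _ => Gate::Unlisted,
    }
}
```

Treating Unlisted as its own case forces the caller to decide explicitly, and a cautious deployment would likely require approval for anything that falls into it.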

When a policy denies an action:

{
  "success": false,
  "message": "Action denied by approval policy: navigation to untrusted domain"
}

When a policy requires user confirmation:

{
  "success": false,
  "message": "Approval required: Confirm navigation to https://example.com",
  "data": {
    "approval_required": true,
    "prompt": "Confirm navigation to https://example.com"
  }
}

Configuration

Configure the browser tool in Aleph's configuration:

{
  "browser": {
    "headless": false,
    "cdp_port": 9222,
    "user_data_dir": "~/.aleph/browser-profile",
    "extra_args": [
      "--disable-blink-features=AutomationControlled",
      "--disable-infobars",
      "--no-first-run"
    ]
  }
}

The flags --disable-blink-features=AutomationControlled and --disable-infobars are added automatically to reduce detection by anti-bot systems.

Graceful Degradation

When no browser is running, action calls return a structured failure response with a descriptive message instead of raising an error:

{
  "success": false,
  "message": "Browser is not running. Use action 'start' to launch a browser first."
}

This allows the agent to detect that a browser is needed and start one without error-handling boilerplate.
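A minimal sketch of this check follows, assuming the runtime lives in a Mutex<Option<...>> slot as described in the Architecture section. ToolResult, Runtime, and list_tabs are simplified, hypothetical stand-ins; the message text is the one shown above.

```rust
use std::sync::Mutex;

// Simplified result shape mirroring the JSON response shown above.
#[derive(Debug, PartialEq)]
pub struct ToolResult {
    pub success: bool,
    pub message: String,
}

// Hypothetical runtime stand-in that only tracks open tab ids.
pub struct Runtime {
    pub tabs: Vec<String>,
}

// An empty slot produces an ordinary failure result rather than an Err or a panic.
pub fn list_tabs(slot: &Mutex<Option<Runtime>>) -> ToolResult {
    let guard = slot.lock().unwrap();
    match guard.as_ref() {
        None => ToolResult {
            success: false,
            message: "Browser is not running. Use action 'start' to launch a browser first."
                .to_string(),
        },
        Some(runtime) => ToolResult {
            success: true,
            message: format!("{} tab(s) open", runtime.tabs.len()),
        },
    }
}
```

Because the "not running" case is just another result value, the agent can read the message and issue a start action without any special error path.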

Security Considerations

  • Process isolation: The browser runs as a separate Chromium process with its own sandbox
  • Headless mode: Use headless: true in production to prevent visual interference
  • Approval policies: Gate sensitive actions (navigation, JavaScript execution) through the approval system
  • User data directory: Use a dedicated browser profile to isolate cookies and storage from the user's personal browser
  • JavaScript evaluation: The evaluate action can execute arbitrary JS in page context. This is powerful, so gate it with approval policies in security-sensitive deployments
  • Anti-automation flags: Default Chromium flags reduce bot detection, but some sites may still block automated access

Example: Search and Extract

A typical agent workflow using the browser tool:

Agent: I'll search for the latest Rust release notes.

1. browser(action="start", headless=true)
   -> "Browser started successfully."

2. browser(action="open_tab", url="https://blog.rust-lang.org/")
   -> { "tab_id": "TAB_001" }

3. browser(action="snapshot", tab_id="TAB_001")
   -> { elements: [
        { ref_id: "e1", role: "heading", name: "Rust Blog" },
        { ref_id: "e5", role: "link", name: "Rust 1.84.0" },
        ...
      ]}

4. browser(action="click", tab_id="TAB_001", ref_id="e5")
   -> "Clicked."

5. browser(action="snapshot", tab_id="TAB_001")
   -> { elements: [{ ref_id: "e1", role: "heading", name: "Announcing Rust 1.84.0" }, ...] }

6. browser(action="evaluate", tab_id="TAB_001", js="document.body.innerText")
   -> "Announcing Rust 1.84.0 ..."

7. browser(action="stop")
   -> "Browser stopped."

The agent uses snapshots to understand page structure, clicks links by ref_id, and extracts content via JavaScript evaluation.
