Web-Browser-Agent: Multimodal Autonomous Web Navigation via Visual Grounding
Overview
An autonomous web navigation agent that bridges the grounding gap between vision-language model perception and browser automation execution. The core innovation is a dual-channel Set-of-Mark (SoM) grounding system — injected JavaScript simultaneously renders numbered red tags as visual overlays (for the VLM to see in screenshots) and sets data-agent-id DOM attributes (for Playwright to locate elements programmatically) — eliminating the ambiguity where a VLM might describe an element but the automation layer cannot find it. A closed-loop Verify-Act-Verify self-correction mechanism uses pixel-level screenshot differencing to detect “ghost clicks” (actions that had no visible effect), injecting error signals into conversation history for prompt-based self-correction. Using Claude 3.5 Sonnet at temperature 0.0 with zero fine-tuning (pure prompt engineering), the agent achieves 85% step-level success rate across a multi-category web benchmark at ~$0.27/task.
Dual-Channel Set-of-Mark Grounding
Before each step, ~70 lines of JavaScript are injected via `page.evaluate()`. The script:
- Selects interactive elements via comprehensive CSS selectors — standard elements (`button`, `a`, `input`, `textarea`, `select`), ARIA roles (`[role="button"]`, `[role="link"]`, `[role="textbox"]`, `[role="menuitem"]`, `[role="tab"]`, `[role="checkbox"]`, `[role="combobox"]`), event-based (`[onclick]`, `[tabindex]:not([tabindex="-1"])`), and semantic (`label[for]`, `summary`)
- Filters for visibility — elements must have `width > 5` and `height > 5`, must not be hidden/`display: none`/`opacity: 0`, and must lie within viewport bounds
- Tags each element via two channels simultaneously:
  - DOM attribute: sets `data-agent-id=N` on the element (Playwright uses `page.locator('[data-agent-id="N"]')` to find it)
  - Visual overlay: creates an absolutely-positioned `<div>` with a red border and yellow semi-transparent background, displaying number N at `z-index: 2147483647`
This dual-channel approach is the core architectural insight: the VLM sees numbered red tags in screenshots and outputs `"element_id": 5`; Playwright locates the exact DOM element via `[data-agent-id="5"]`. No XPath, no text matching, no coordinate regression — just a shared integer namespace bridging vision and DOM.
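The dual-channel tagging can be sketched as follows. This is a heavily abridged illustration, not the real ~70-line script: the selector list is truncated, the visibility filter is reduced to a size check, and the helper names (`SOM_JS`, `tag_page`, `locator_for`) are hypothetical.

```python
# Abridged sketch of the dual-channel Set-of-Mark injection.
SOM_JS = """
(() => {
  const els = document.querySelectorAll('button,a,input,textarea,select,[role="button"]');
  let n = 0;
  for (const el of els) {
    const r = el.getBoundingClientRect();
    if (r.width <= 5 || r.height <= 5) continue;   // visibility filter (abridged)
    el.setAttribute('data-agent-id', String(n));   // channel 1: DOM attribute
    const tag = document.createElement('div');     // channel 2: visual overlay
    tag.textContent = String(n);
    tag.style.cssText = `position:absolute;left:${r.left + scrollX}px;top:${r.top + scrollY}px;` +
      'border:2px solid red;background:rgba(255,255,0,.4);z-index:2147483647;font-size:12px;';
    document.body.appendChild(tag);
    n += 1;
  }
  return n;  // number of tagged elements
})()
"""

async def tag_page(page):
    """Inject SoM tags; returns the count of tagged interactive elements."""
    return await page.evaluate(SOM_JS)

def locator_for(page, element_id: int):
    """Resolve the VLM's chosen integer back to the exact DOM element."""
    return page.locator(f'[data-agent-id="{element_id}"]').first
```

The shared integer `N` is the only contract between the two channels: whatever number the VLM reads off the red tag resolves, via the attribute selector, to the same element the overlay was drawn on.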
Observe-Reason-Act-Verify Loop
ENTRY → Navigate to URL
→ Inject SoM JavaScript (tag + overlay interactive elements)
→ Wait 300ms for overlay render
→ Capture JPEG screenshot (1024px max, quality 70) → base64 encode
→ Build sliding-window history (text-only summaries of past steps)
→ VLM inference: [history] + [screenshot + action prompt] → JSON action
→ If action == "done": mark success, EXIT
→ Execute action via Playwright
→ If element not found (hallucination): inject error signal, CONTINUE
→ Wait 1.5s for page effects to settle
→ Capture verification screenshot
→ Compute pixel diff (256×256 grayscale, threshold 0.01)
→ If diff < threshold AND URL unchanged (ghost click): inject error signal, CONTINUE
→ Mark step success, append to history → LOOP (up to max_steps=15)
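The loop above can be condensed into a single coroutine. In this sketch, `policy`, `observe`, `execute`, and `is_ghost` are injected callables with hypothetical signatures (standing in for VLM inference, SoM injection + screenshot, Playwright execution, and pixel-diff verification), so the control flow can be exercised without a browser.

```python
import asyncio

async def run_episode(page, policy, observe, execute, is_ghost,
                      max_steps=15, settle_s=1.5):
    """Observe-Reason-Act-Verify skeleton with error-signal injection."""
    history = []
    for _ in range(max_steps):
        before = await observe(page)              # inject SoM, capture screenshot
        action = await policy(history, before)    # VLM inference -> JSON action
        if action.get("action") == "done":
            return True, history                  # task complete
        ok, error = await execute(page, action)   # Playwright execution
        if not ok:                                # hallucinated element ID
            history.append({"role": "user", "content": error})
            continue
        await asyncio.sleep(settle_s)             # let page effects settle
        after = await observe(page)               # verification screenshot
        if await is_ghost(before, after):         # pixel diff + URL check
            history.append({"role": "user", "content":
                "System Error: Action had no visible effect. "
                "Try a different element or approach."})
            continue
        history.append({"role": "assistant", "content": str(action)})
    return False, history                         # max_steps exhausted
```

Note that both failure branches `continue` rather than abort: the injected error message becomes part of the history the policy sees on the next iteration, which is the entire self-correction mechanism.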
Action Space
Nine discrete actions forming a complete web interaction vocabulary:
| Action | Parameters | Implementation |
|---|---|---|
| `click` | `element_id` | `page.locator('[data-agent-id="N"]').first.click()` |
| `type` | `element_id` (opt), `text` | Click to focus → `page.keyboard.type(text, delay=50)` |
| `press_enter` | — | `page.keyboard.press("Enter")` |
| `scroll_down` | — | `window.scrollBy(0, 500)` via JS |
| `scroll_up` | — | `window.scrollBy(0, -500)` via JS |
| `wait` | — | `asyncio.sleep(2)` |
| `navigate` | `text` (URL) | `page.goto(text)` + wait for `domcontentloaded` |
| `back` | — | `page.go_back()` |
| `done` | — | Signals task completion, terminates loop |
The `type` action simulates human typing with a 50ms per-character delay. The `click` action uses the SoM-assigned `data-agent-id` CSS selector — if `loc.count() == 0`, the element is flagged as hallucinated and an error signal is injected.
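A minimal dispatch over this vocabulary might look like the following. The Playwright calls mirror the table above, but the function name, error strings, and handler wiring are assumptions for illustration.

```python
import asyncio

async def execute_action(page, action: dict):
    """Dispatch one JSON action to Playwright; returns (ok, error_message)."""
    name = action["action"]
    if name == "click":
        loc = page.locator(f'[data-agent-id="{action["element_id"]}"]')
        if await loc.count() == 0:  # hallucinated element ID
            return False, f"Element ID {action['element_id']} does not exist."
        await loc.first.click()
    elif name == "type":
        if action.get("element_id") is not None:  # optional focus click first
            await page.locator(f'[data-agent-id="{action["element_id"]}"]').first.click()
        await page.keyboard.type(action["text"], delay=50)
    elif name == "press_enter":
        await page.keyboard.press("Enter")
    elif name in ("scroll_down", "scroll_up"):
        dy = 500 if name == "scroll_down" else -500
        await page.evaluate(f"window.scrollBy(0, {dy})")
    elif name == "wait":
        await asyncio.sleep(2)
    elif name == "navigate":
        await page.goto(action["text"], wait_until="domcontentloaded")
    elif name == "back":
        await page.go_back()
    return True, None
```

Returning `(False, message)` instead of raising keeps hallucinated-element handling on the same error-injection path as ghost clicks.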
Ghost Click Detection and Self-Correction
The closed-loop verification distinguishes this agent from open-loop systems that assume every action succeeds:
- Pre/post screenshots are both resized to 256×256 grayscale
- Pixel-wise absolute difference: `np.abs(arr_before - arr_after)`, with the mean normalized by dividing by 255
- Ghost click threshold: if `pixel_diff < 0.01` AND the URL is unchanged → the action had no visible effect (clicked a disabled button, an overlay intercepted the event, the element wasn't interactive)
- Self-correction via error injection: rather than explicit replanning logic, error signals are injected as `user` messages in conversation history — `"System Error: Action had no visible effect. The element may be disabled or the click missed. Try a different element or approach."` — leveraging the VLM's in-context learning to naturally adjust its next action
This pixel-diff approach is computationally trivial (no neural network) yet effective at catching the most common failure mode in web agents.
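The check reduces to a few lines of NumPy. This sketch assumes the screenshots have already been resized to 256×256 grayscale `uint8` arrays (the resize itself would use Pillow); the function name is hypothetical.

```python
import numpy as np

def is_ghost_click(before: np.ndarray, after: np.ndarray,
                   url_before: str, url_after: str,
                   threshold: float = 0.01) -> bool:
    """Ghost-click check on pre/post 256x256 grayscale screenshots.
    Mean absolute pixel difference normalized by 255; a near-zero diff
    with an unchanged URL means the action had no visible effect."""
    # Cast to int16 so the subtraction of uint8 arrays cannot wrap around.
    diff = np.abs(before.astype(np.int16) - after.astype(np.int16)).mean() / 255.0
    return bool(diff < threshold and url_before == url_after)
```

The URL condition matters: a navigation to a visually similar page (e.g. paginated results) changes the URL but may barely change the pixels, and should not be flagged as a ghost click.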
Error Taxonomy and Recovery
| Error Type | Detection | Recovery |
|---|---|---|
| Hallucination | locator('[data-agent-id=N]').count() == 0 | Inject: “Element ID N does not exist. Choose a different element.” |
| Ghost Click | pixel_diff < 0.01 AND URL unchanged | Inject: “Action had no visible effect… Try a different approach.” |
| JSON Parse Error | `json.JSONDecodeError` (4-level fallback: raw JSON → fenced `json` code block → bare code fence → brace extraction) | Default to safe `wait` action |
| API Rate Limit | anthropic.RateLimitError | Exponential backoff: 2s → 4s → 8s (3 retries) |
| API Server Error | APIStatusError with status >= 500 | Same exponential backoff |
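The backoff schedule in the last two rows can be sketched generically. The real agent would pass `anthropic.RateLimitError` and a status-checked `APIStatusError` as the retryable types; here a generic `Exception` tuple keeps the sketch dependency-free, and the `sleep` parameter is injectable so it can be tested without waiting.

```python
import asyncio

async def with_backoff(call, retries=3, base_delay=2.0,
                       retryable=(Exception,), sleep=asyncio.sleep):
    """Retry `call()` with exponential backoff: 2s -> 4s -> 8s by default."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except retryable:
            if attempt == retries:
                raise                      # retries exhausted, propagate
            await sleep(base_delay * (2 ** attempt))
```
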
Token-Aware Sliding Window History
The context manager balances information retention against token costs:
- Token budget: 100,000 tokens (half of Claude’s 200K context window)
- Window size: Last 3 step-pairs (user + assistant messages) kept in full
- Older steps: condensed into a summary — `"Previous actions summary: [Step 1: ...first 100 chars...] [Step 2: ...]"`
- Screenshots: only the current step's screenshot is sent to the VLM; all previous steps are text-only summaries, dramatically reducing token costs (each image costs thousands of tokens)
This design choice trades away the ability to “look back” at previous page states in exchange for ~5× lower per-step API costs.
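A sketch of the window construction, under the assumption (hypothetical names and message shapes) that each past step is a `(user_text, assistant_text)` pair and the current screenshot is attached outside this function:

```python
def build_history(steps, window=3, summary_chars=100):
    """Keep the last `window` step-pairs in full; condense older steps
    into a single text-only summary message."""
    old, recent = steps[:-window], steps[-window:]
    messages = []
    if old:
        parts = [f"[Step {i + 1}: {assistant[:summary_chars]}]"
                 for i, (_, assistant) in enumerate(old)]
        messages.append({"role": "user",
                         "content": "Previous actions summary: " + " ".join(parts)})
    for user_text, assistant_text in recent:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    return messages
```
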
VLM Configuration
| Parameter | Value |
|---|---|
| Model | Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) |
| Temperature | 0.0 (deterministic) |
| Max output tokens | 1,024 |
| Input | Multimodal: base64 JPEG screenshot + text prompt |
| Fine-tuning | None — pure prompt engineering |
| System prompt | Instructs VLM to prioritize dismissing cookie banners/modals, output only valid JSON, use element IDs from red overlay tags |
The system prompt delegates complex UI challenges (popup detection, cookie consent handling) to the VLM’s visual reasoning rather than building separate detection modules. The action prompt specifies the 9-action vocabulary and enforces structured JSON output: {"thought": "...", "action": "...", "element_id": N, "text": "..."}.
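Even at temperature 0.0, models occasionally wrap JSON in prose or code fences, which is what the 4-level fallback parser in the error taxonomy guards against. A sketch (function name hypothetical; fallback order matches the table: raw → fenced `json` block → bare fence → brace extraction):

```python
import json
import re

def parse_action(raw: str) -> dict:
    """Parse the VLM reply into an action dict; default to a safe wait."""
    candidates = [raw]
    m = re.search(r"```json\s*(.*?)```", raw, re.DOTALL)   # fenced json block
    if m:
        candidates.append(m.group(1))
    m = re.search(r"```\s*(.*?)```", raw, re.DOTALL)       # bare code fence
    if m:
        candidates.append(m.group(1))
    m = re.search(r"\{.*\}", raw, re.DOTALL)               # outermost braces
    if m:
        candidates.append(m.group(0))
    for cand in candidates:
        try:
            return json.loads(cand)
        except json.JSONDecodeError:
            continue
    return {"action": "wait"}  # safe default when nothing parses
```

Falling back to `wait` rather than raising keeps a single malformed reply from killing the episode; the next observation simply re-prompts the model.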
Results
| Metric | Score |
|---|---|
| Task Success Rate | 67% |
| Step-Level Success Rate | 85% |
| Total benchmark cost | $2.45 (~$0.27/task) |
Evaluated across 3 benchmark tasks spanning the Mind2Web taxonomy (retrieval, navigation, form-filling) with 3 runs per task (9 total runs, up to 15 steps each). Metrics track hallucination count, ghost click count, per-task cost, and per-step success rate. Per-step screenshots and full JSON run logs are saved as artifacts for post-hoc analysis.
Tech Stack
Python (3.10+), Playwright (async Chromium automation), Anthropic Claude API (3.5 Sonnet), Pillow, NumPy, asyncio, Rich (terminal UI), PyYAML