What is Computer Use?
Computer use is a capability that lets AI agents perceive a computer screen and take real actions — clicking, typing, scrolling, navigating — just like a human would. Instead of calling structured APIs, the agent sees what's on screen and interacts with it directly. This unlocks automation of any application, including ones with no API at all.
The Core Mechanism: Screenshot Loop
Computer use runs a perception-action loop:
- Take a screenshot of the current screen state
- Send the screenshot to a vision model (Claude, GPT-4o, etc.) along with the task
- Model decides what action to take — click a button, type text, scroll down
- Execute the action via mouse/keyboard control APIs
- Take another screenshot to verify the action worked
- Repeat until the task is complete
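The loop above can be sketched in a few lines of Python. Everything here is illustrative: `Action`, `run_agent_loop`, and the injected callables are hypothetical names, and a real agent would plug in actual screenshot capture, a vision-model client, and OS-level input control.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    kind: str                      # e.g. "click", "type", "scroll", "done"
    payload: dict = field(default_factory=dict)

def run_agent_loop(
    take_screenshot: Callable[[], bytes],    # step 1: capture current screen state
    decide: Callable[[bytes, str], Action],  # steps 2-3: vision model picks an action
    execute: Callable[[Action], None],       # step 4: mouse/keyboard control
    task: str,
    max_steps: int = 25,
) -> bool:
    """Run the perception-action loop until the model signals completion."""
    for _ in range(max_steps):
        shot = take_screenshot()
        action = decide(shot, task)
        if action.kind == "done":            # explicit success criterion
            return True
        execute(action)
        # step 5: the next iteration's screenshot verifies this action worked
    return False                             # step budget exhausted without finishing
```

A step budget like `max_steps` is worth having from day one: it is the simplest guard against an agent that loops forever without making progress.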
Simple example
Task: "Search for the quarterly report on the company intranet and download it."
The agent takes a screenshot → sees a browser with the intranet login → types credentials → takes a screenshot → sees the search bar → types "quarterly report" → takes a screenshot → sees the results → clicks the PDF link → the file downloads. All without any API — the agent just used the UI.
Action Types
Computer use agents can take most actions a human can:
Mouse actions
- Left click, right click, double click
- Click and drag
- Scroll up/down, left/right
- Hover over elements
- Click at coordinates (x, y)
Keyboard actions
- Type text into fields
- Keyboard shortcuts (Ctrl+C, Ctrl+V, etc.)
- Special keys (Enter, Tab, Escape, Arrow keys)
- Function keys (F1–F12)
Shell / file actions (extended)
- Run bash commands
- Read and write files
- Launch applications
- Take and analyse screenshots
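One common way to wire these action types up is a small dispatch table mapping the model's action names to concrete handlers. This sketch is hypothetical: real handlers would call a library such as pyautogui or an OS input API, so here they only record what they would do.

```python
# Executed actions are recorded here instead of moving a real cursor.
log = []

# Hypothetical action names -> handler functions. A production agent would
# replace each lambda with a real mouse/keyboard call.
HANDLERS = {
    "left_click": lambda p: log.append(f"click at ({p['x']}, {p['y']})"),
    "type_text":  lambda p: log.append(f"type {p['text']!r}"),
    "key":        lambda p: log.append(f"press {p['key']}"),
    "scroll":     lambda p: log.append(f"scroll {p['dy']} units"),
}

def execute(action: dict) -> None:
    """Look up the handler for an action and run it; fail loudly on unknowns."""
    handler = HANDLERS.get(action["kind"])
    if handler is None:
        raise ValueError(f"unsupported action: {action['kind']}")
    handler(action.get("payload", {}))

execute({"kind": "left_click", "payload": {"x": 120, "y": 340}})
execute({"kind": "type_text", "payload": {"text": "quarterly report"}})
```

Failing loudly on unknown action kinds matters: silently ignoring a model's action is one of the ways agents end up stuck without noticing.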
Computer Use vs Standard Tool Calling
These are fundamentally different ways for an AI to interact with external systems:
| Dimension | Tool Calling (Function Calling) | Computer Use |
|---|---|---|
| Interface | Structured API / JSON function calls | Visual UI (any interface) |
| Requires API | Yes — application must expose endpoints | No — anything visible on screen works |
| Speed | Milliseconds per call | 2–10 seconds per action |
| Reliability | High — structured outputs | Lower — UI can change; vision can fail |
| Cost | Low — fast API calls | High — vision model call per screenshot |
| Use case | Systems with good APIs | Legacy systems, closed UIs, automation without APIs |
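To make the contrast concrete, here is the earlier intranet-search task expressed both ways, as plain data. The function name, field names, and coordinates are invented for illustration:

```python
# Tool calling: one structured call — only possible if the application
# exposes a matching API endpoint.
tool_call = {
    "name": "search_documents",
    "arguments": {"query": "quarterly report", "download": True},
}

# Computer use: a sequence of UI actions driven by screenshots — works
# against any interface, but each step costs a vision-model call and seconds.
ui_actions = [
    {"kind": "type_text",  "payload": {"text": "quarterly report"}},
    {"kind": "key",        "payload": {"key": "Enter"}},
    {"kind": "left_click", "payload": {"x": 412, "y": 305}},  # the PDF link
]
```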
What Computer Use Unlocks
The key insight: every application becomes automatable, including:
- Legacy enterprise software — SAP, Oracle, custom ERP systems with no modern API, only a thick-client desktop app
- Government and regulatory portals — Web portals with no API; require form filling and manual document upload
- SaaS tools without automation support — Any web app that hasn't built API endpoints for your workflow
- Desktop-only applications — Local software that runs only on Windows/macOS and has no integration surface
- Complex multi-step workflows — Tasks that require switching between multiple applications and copy-pasting data between them
Current Limitations (2025)
- Slow: 2–10 seconds per action; a 20-step task takes 40–200 seconds. Not suitable for latency-sensitive workflows.
- UI fragility: If the UI changes (new design, different layout, A/B test variant), the agent can get confused. Robust agents need error detection.
- Expensive: Every screenshot is a vision model API call. A complex task with 30 actions might cost $0.30–3.00 in vision model tokens alone.
- Security risk: Agents with computer access can do real harm. Sandboxing and approval gates are essential — see the Sandboxing page.
- Error recovery is imperfect: Agents sometimes get stuck in loops or fail to detect that an action didn't work as expected.
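The latency and cost figures above imply roughly 2–10 seconds and $0.01–0.10 in vision tokens per action (the per-action cost is back-derived from the 30-action estimate, so treat it as a rough planning number). A back-of-envelope estimator:

```python
def estimate_task(
    n_actions: int,
    sec_per_action: tuple = (2.0, 10.0),    # latency range from above
    usd_per_action: tuple = (0.01, 0.10),   # assumed vision-token cost per screenshot
) -> dict:
    """Rough latency and cost range for an n-action computer use task."""
    return {
        "seconds": (n_actions * sec_per_action[0], n_actions * sec_per_action[1]),
        "usd": (round(n_actions * usd_per_action[0], 2),
                round(n_actions * usd_per_action[1], 2)),
    }

estimate_task(30)  # 60-300 seconds, roughly $0.30-$3.00 in vision tokens
```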
Error Detection and Recovery
Reliable computer use agents need explicit error detection:
- After each action, compare the new screenshot to what was expected
- Check for error messages, popups, or unexpected state changes
- Implement a maximum retry count before aborting and reporting failure
- Define explicit success criteria — the agent should know when the task is done
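These checks fit naturally into a retry wrapper around each action. `act_with_retry` and `verify` are hypothetical names; `verify` stands in for whatever screenshot comparison or error-popup check the agent uses:

```python
from typing import Callable

def act_with_retry(
    execute: Callable[[object], None],
    action: object,
    verify: Callable[[], bool],   # e.g. diff the new screenshot against expectations
    max_retries: int = 3,
) -> bool:
    """Execute an action and confirm it worked; retry a bounded number of times."""
    for _ in range(max_retries):
        execute(action)
        if verify():              # expected state reached
            return True
    return False                  # give up; caller aborts and reports failure
```

Returning `False` instead of retrying forever is the point: the agent should surface a failure it cannot recover from rather than burn time and vision-model tokens in a loop.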
Checklist: Do You Understand This?
- Describe the 6-step perception-action loop that computer use agents run.
- Name three action types available to computer use agents.
- What is the key advantage of computer use over tool calling?
- What are the five main current limitations of computer use?
- Why is sandboxing essential for any production computer use agent?