What is Computer Use?
Computer use is a capability that lets AI agents perceive a computer screen and take real actions — clicking, typing, scrolling, navigating — just like a human would. Instead of calling structured APIs, the agent sees what's on screen and interacts with it directly. This unlocks automation of any application, including ones with no API at all.
The Core Mechanism: Screenshot Loop
Computer use runs a perception-action loop:
- Take a screenshot of the current screen state
- Send the screenshot to a vision model (Claude, GPT-4o, etc.) along with the task
- Model decides what action to take — click a button, type text, scroll down
- Execute the action via mouse/keyboard control APIs
- Take another screenshot to verify the action worked
- Repeat until the task is complete
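The loop above can be sketched in a few lines of Python. Everything here is illustrative: `Action`, `run_agent_loop`, and the injected callables are hypothetical names, and a real agent would plug in actual screenshot capture, a vision-model client, and OS-level input control.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    kind: str                      # e.g. "click", "type", "scroll", "done"
    payload: dict = field(default_factory=dict)

def run_agent_loop(
    take_screenshot: Callable[[], bytes],    # step 1: capture current screen state
    decide: Callable[[bytes, str], Action],  # steps 2-3: vision model picks an action
    execute: Callable[[Action], None],       # step 4: mouse/keyboard control
    task: str,
    max_steps: int = 25,
) -> bool:
    """Run the perception-action loop until the model signals completion."""
    for _ in range(max_steps):
        shot = take_screenshot()
        action = decide(shot, task)
        if action.kind == "done":            # explicit success criterion
            return True
        execute(action)
        # step 5: the next iteration's screenshot verifies this action worked
    return False                             # step budget exhausted without finishing
```

A step budget like `max_steps` is worth having from day one: it is the simplest guard against an agent that loops forever without making progress.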
Simple example
Task: "Search for the quarterly report on the company intranet and download it."
The agent takes a screenshot → sees a browser with the intranet login → types credentials → takes a screenshot → sees the search bar → types "quarterly report" → takes a screenshot → sees the results → clicks the PDF link → the file downloads. All without any API — the agent just used the UI.
Action Types
Computer use agents can take most actions a human can:
Mouse actions
- Left click, right click, double click
- Click and drag
- Scroll up/down, left/right
- Hover over elements
- Click at coordinates (x, y)
Keyboard actions
- Type text into fields
- Keyboard shortcuts (Ctrl+C, Ctrl+V, etc.)
- Special keys (Enter, Tab, Escape, Arrow keys)
- Function keys (F1–F12)
Shell / file actions (extended)
- Run bash commands
- Read and write files
- Launch applications
- Take and analyse screenshots
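One common way to wire these action types up is a small dispatch table mapping the model's action names to concrete handlers. This sketch is hypothetical: real handlers would call a library such as pyautogui or an OS input API, so here they only record what they would do.

```python
# Executed actions are recorded here instead of moving a real cursor.
log = []

# Hypothetical action names -> handler functions. A production agent would
# replace each lambda with a real mouse/keyboard call.
HANDLERS = {
    "left_click": lambda p: log.append(f"click at ({p['x']}, {p['y']})"),
    "type_text":  lambda p: log.append(f"type {p['text']!r}"),
    "key":        lambda p: log.append(f"press {p['key']}"),
    "scroll":     lambda p: log.append(f"scroll {p['dy']} units"),
}

def execute(action: dict) -> None:
    """Look up the handler for an action and run it; fail loudly on unknowns."""
    handler = HANDLERS.get(action["kind"])
    if handler is None:
        raise ValueError(f"unsupported action: {action['kind']}")
    handler(action.get("payload", {}))

execute({"kind": "left_click", "payload": {"x": 120, "y": 340}})
execute({"kind": "type_text", "payload": {"text": "quarterly report"}})
```

Failing loudly on unknown action kinds matters: silently ignoring a model's action is one of the ways agents end up stuck without noticing.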
Computer Use vs Standard Tool Calling
These are fundamentally different ways for an AI to interact with external systems:
| Dimension | Tool Calling (Function Calling) | Computer Use |
|---|---|---|
| Interface | Structured API / JSON function calls | Visual UI (any interface) |
| Requires API | Yes — application must expose endpoints | No — anything visible on screen works |
| Speed | Milliseconds per call | 2–10 seconds per action |
| Reliability | High — structured outputs | Lower — UI can change; vision can fail |
| Cost | Low — fast API calls | High — vision model call per screenshot |
| Use case | Systems with good APIs | Legacy systems, closed UIs, automation without APIs |
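To make the contrast concrete, here is the earlier intranet-search task expressed both ways, as plain data. The function name, field names, and coordinates are invented for illustration:

```python
# Tool calling: one structured call — only possible if the application
# exposes a matching API endpoint.
tool_call = {
    "name": "search_documents",
    "arguments": {"query": "quarterly report", "download": True},
}

# Computer use: a sequence of UI actions driven by screenshots — works
# against any interface, but each step costs a vision-model call and seconds.
ui_actions = [
    {"kind": "type_text",  "payload": {"text": "quarterly report"}},
    {"kind": "key",        "payload": {"key": "Enter"}},
    {"kind": "left_click", "payload": {"x": 412, "y": 305}},  # the PDF link
]
```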
What Computer Use Unlocks
The key insight: every application becomes automatable, including:
- Legacy enterprise software — SAP, Oracle, custom ERP systems with no modern API, only a thick-client desktop app
- Government and regulatory portals — Web portals with no API; require form filling and manual document upload
- SaaS tools without automation support — Any web app that hasn't built API endpoints for your workflow
- Desktop-only applications — Local software that runs only on Windows/macOS and has no integration surface
- Complex multi-step workflows — Tasks that require switching between multiple applications and copy-pasting data between them
Current Limitations (2025)
- Slow: 2–10 seconds per action; a 20-step task takes 40–200 seconds. Not suitable for latency-sensitive workflows.
- UI fragility: If the UI changes (new design, different layout, A/B test variant), the agent can get confused. Robust agents need error detection.
- Expensive: Every screenshot is a vision model API call. A complex task with 30 actions might cost $0.30–3.00 in vision model tokens alone.
- Security risk: Agents with computer access can do real harm. Sandboxing and approval gates are essential — see the Sandboxing page.
- Error recovery is imperfect: Agents sometimes get stuck in loops or fail to detect that an action didn't work as expected.
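The latency and cost figures above imply roughly 2–10 seconds and $0.01–0.10 in vision tokens per action (the per-action cost is back-derived from the 30-action estimate, so treat it as a rough planning number). A back-of-envelope estimator:

```python
def estimate_task(
    n_actions: int,
    sec_per_action: tuple = (2.0, 10.0),    # latency range from above
    usd_per_action: tuple = (0.01, 0.10),   # assumed vision-token cost per screenshot
) -> dict:
    """Rough latency and cost range for an n-action computer use task."""
    return {
        "seconds": (n_actions * sec_per_action[0], n_actions * sec_per_action[1]),
        "usd": (round(n_actions * usd_per_action[0], 2),
                round(n_actions * usd_per_action[1], 2)),
    }

estimate_task(30)  # 60-300 seconds, roughly $0.30-$3.00 in vision tokens
```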
Error Detection and Recovery
Reliable computer use agents need explicit error detection:
- After each action, compare the new screenshot to what was expected
- Check for error messages, popups, or unexpected state changes
- Implement a maximum retry count before aborting and reporting failure
- Define explicit success criteria — the agent should know when the task is done
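These checks fit naturally into a retry wrapper around each action. `act_with_retry` and `verify` are hypothetical names; `verify` stands in for whatever screenshot comparison or error-popup check the agent uses:

```python
from typing import Callable

def act_with_retry(
    execute: Callable[[object], None],
    action: object,
    verify: Callable[[], bool],   # e.g. diff the new screenshot against expectations
    max_retries: int = 3,
) -> bool:
    """Execute an action and confirm it worked; retry a bounded number of times."""
    for _ in range(max_retries):
        execute(action)
        if verify():              # expected state reached
            return True
    return False                  # give up; caller aborts and reports failure
```

Returning `False` instead of retrying forever is the point: the agent should surface a failure it cannot recover from rather than burn time and vision-model tokens in a loop.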
Checklist: Do You Understand This?
- Describe the 6-step perception-action loop that computer use agents run.
- Name three action types available to computer use agents.
- What is the key advantage of computer use over tool calling?
- What are the five main current limitations of computer use?
- Why is sandboxing essential for any production computer use agent?