Intermediate

How Computer Use Works

Computer use is built on three tools — computer, bash, and text_editor — and a loop where Claude alternates between observing the screen and taking actions. Understanding this loop and how to wire it up is the foundation for building any computer use application.

The Three Tools

When you enable computer use in the API, you pass three tool definitions to Claude:

computer — the primary interface tool. Claude uses this to take screenshots and send mouse/keyboard actions. The tool takes an action field (screenshot, left_click, right_click, double_click, type, key, scroll, mouse_move) and coordinate or text parameters as needed.
bash — run shell commands. Claude uses this to execute scripts, navigate the file system, start/stop processes, and read command output. Faster and more reliable than clicking through a terminal GUI.
text_editor — view and edit files directly without needing to open a GUI editor. Claude can create, read, write, and search files using this tool.

The Vision-Action Loop

Screenshot

Base64 image

→

Claude Sees Screen

Vision model processes

→

Decides Action

click / type / key / scroll

→

Execute Action

pyautogui / bash

→

New Screenshot

Observe result

Vision-action loop — repeats until task is complete

The computer use loop in pseudocode:

while task not complete:
    1. Call Claude API with:
       - Current task description in system prompt
       - Conversation history (all previous actions + results)
       - Computer/bash/text_editor tool definitions

    2. Claude returns one of:
       a. tool_use: a specific action to take (click, type, screenshot, bash command)
       b. end_turn: task complete, final text response

    3. If tool_use:
       a. Execute the action on the desktop
       b. If action was "screenshot": return image to Claude as tool_result
       c. If action was bash/text_editor: return command output as tool_result
       d. Append action + result to conversation history
       e. Loop back to step 1

    4. If end_turn: stop, return Claude's response

The key insight: each iteration sends the full conversation history to Claude. Claude sees every action it has taken and every result it has received — it knows where it is in the task.

Action Types in the API

A screenshot action looks like this in Claude's response:

{
  "type": "tool_use",
  "name": "computer",
  "input": {
    "action": "screenshot"
  }
}

# Your code takes a screenshot and returns:
{
  "type": "tool_result",
  "tool_use_id": "...",
  "content": [
    {
      "type": "image",
      "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": "<base64-encoded-screenshot>"
      }
    }
  ]
}

A click action:

{
  "type": "tool_use",
  "name": "computer",
  "input": {
    "action": "left_click",
    "coordinate": [640, 400]   // x, y pixels from top-left
  }
}

How Claude Reasons About the Screen

Claude processes each screenshot as a vision input. It identifies UI elements by their visual appearance — buttons, text fields, menus, dialog boxes, and text content. It uses the element positions to calculate coordinates for click actions.

Claude reasons step-by-step about the task: "I need to click the Login button, which appears to be the blue button in the top-right of the screenshot at approximately (1150, 45)." This reasoning is often visible in the text content before the tool call if you look at the full API response.

A common pattern: Claude first takes a screenshot to understand the current state before acting. Do not be surprised if the first tool call in a session is always screenshot — Claude is orienting itself.

Latency and Token Cost

Computer use is significantly slower and more expensive than text-only interactions:

Latency: Each loop iteration involves an API call that processes a screenshot image. Expect 3–8 seconds per action step depending on screenshot size and model load.
Token cost: Screenshots are large — a 1920×1080 screenshot costs ~1,500–2,000 input tokens. A 20-step task may consume 40,000–80,000 tokens in image content alone.
Context growth: The conversation history grows with every step — after 30+ steps, the context gets large and costs increase per iteration.

Optimise by reducing screenshot resolution (1024×768 is sufficient for most tasks), and by having Claude take screenshots only when needed rather than after every action.

Checklist: Do You Understand This?

Three tools: computer (screenshot + mouse/keyboard), bash (shell commands), text_editor (file read/write)
Loop: screenshot → Claude decides → execute action → return result → repeat
Screenshots returned as base64-encoded images in tool_result with type: "image"
Claude reasons about screen visually — coordinates from what it sees in the screenshot
Cost: ~1,500–2,000 tokens per screenshot; 20-step tasks can consume 40k–80k tokens in images alone