Advanced

Building a Computer Use Pipeline

A complete computer use pipeline has three layers: the Claude API call layer (sending screenshots, receiving actions), the action execution layer (running mouse/keyboard events), and the display layer (the virtual or real screen). This page shows a working end-to-end Python implementation.

API Setup: Beta Header

Computer use requires the anthropic-beta header in every API call. Pass it via the SDK:

import anthropic
import base64
from PIL import ImageGrab  # or use mss for Linux

client = anthropic.Anthropic()

# Computer use beta header required
BETA_HEADER = "computer-use-2024-10-22"

Tool Definitions

Define the three computer use tools. Pass these in every API call:

TOOLS = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
        "display_number": 1  # X display number for Linux
    },
    {
        "type": "bash_20241022",
        "name": "bash"
    },
    {
        "type": "text_editor_20241022",
        "name": "str_replace_editor"
    }
]

The Action Execution Layer

This layer translates Claude's action outputs into actual system interactions. A basic implementation using pyautogui:

import pyautogui
import subprocess
import json

def take_screenshot() -> str:
    """Capture screen and return as base64."""
    screenshot = ImageGrab.grab()
    screenshot = screenshot.resize((1024, 768))
    import io
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.standard_b64encode(buffer.getvalue()).decode()

def execute_computer_action(action: dict) -> str:
    """Execute a computer tool action."""
    action_type = action["action"]

    if action_type == "screenshot":
        return take_screenshot()

    elif action_type == "left_click":
        x, y = action["coordinate"]
        pyautogui.click(x, y)
        return "Clicked"

    elif action_type == "double_click":
        x, y = action["coordinate"]
        pyautogui.doubleClick(x, y)
        return "Double-clicked"

    elif action_type == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
        return "Typed text"

    elif action_type == "key":
        pyautogui.hotkey(*action["key"].split("+"))
        return f"Pressed {action['key']}"

    elif action_type == "scroll":
        x, y = action["coordinate"]
        direction = action.get("direction", "down")
        amount = action.get("amount", 3)
        pyautogui.scroll(-amount if direction == "down" else amount, x=x, y=y)
        return f"Scrolled {direction}"

    elif action_type == "mouse_move":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)
        return "Moved mouse"

    return f"Unknown action: {action_type}"

def execute_bash(command: str) -> str:
    """Execute a bash command and return output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    output = result.stdout
    if result.returncode != 0:
        output += f"\nSTDERR: {result.stderr}"
    return output or "(no output)"

The Main Loop

def run_computer_use(task: str, max_iterations: int = 50) -> str:
    messages = [{"role": "user", "content": task}]

    for _ in range(max_iterations):
        response = client.beta.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
            betas=[BETA_HEADER]
        )

        # Add Claude's response to history
        messages.append({"role": "assistant", "content": response.content})

        # Task complete
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Task completed."

        # Execute tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                if block.name == "computer":
                    result = execute_computer_action(block.input)
                    # Screenshot result is an image
                    if block.input.get("action") == "screenshot":
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": [{"type": "image", "source": {
                                "type": "base64",
                                "media_type": "image/png",
                                "data": result
                            }}]
                        })
                    else:
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result
                        })
                elif block.name == "bash":
                    result = execute_bash(block.input["command"])
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })

        messages.append({"role": "user", "content": tool_results})

    return "Max iterations reached."

# Example usage
result = run_computer_use(
    "Open a browser, navigate to weather.com, and tell me today's weather in London."
)
print(result)

Error Handling: Stuck States and Task Completion

Stuck state detection: If the same screenshot is returned 3+ times with no visual change, Claude is likely stuck. Inject a message: "The screen does not seem to be changing. Try a different approach or report that you are stuck."
Unexpected UI: Dialogs, permission prompts, and error messages can appear unexpectedly. Claude usually handles these — ensure the max iterations is high enough for recovery attempts.
Task completion detection: Rely on Claude's end_turn stop reason. Provide a clear success condition in the task prompt: "When you have extracted the data, output it as JSON and stop."
Timeout per action: Set a timeout on bash commands to prevent hanging processes from blocking the loop.

Checklist: Do You Understand This?

Beta header: anthropic-beta: computer-use-2024-10-22 required on every API call
Three tools: computer_20241022, bash_20241022, text_editor_20241022
Screenshot results: image tool_result with base64 PNG — not text
Action execution layer: translates Claude's action outputs to system calls (pyautogui, subprocess)
Error handling: detect stuck states by comparing successive screenshots; set bash command timeouts