Advanced
Building a Computer Use Pipeline
A complete computer use pipeline has three layers: the Claude API call layer (sending screenshots, receiving actions), the action execution layer (running mouse/keyboard events), and the display layer (the virtual or real screen). This page shows a working end-to-end Python implementation.
API Setup: Beta Header
Computer use requires the anthropic-beta header in every API call. Pass it via the SDK:
import anthropic
import base64
from PIL import ImageGrab # or use mss for Linux
client = anthropic.Anthropic()
# Computer use beta header required
BETA_HEADER = "computer-use-2024-10-22"Tool Definitions
Define the three computer use tools. Pass these in every API call:
TOOLS = [
{
"type": "computer_20241022",
"name": "computer",
"display_width_px": 1024,
"display_height_px": 768,
"display_number": 1 # X display number for Linux
},
{
"type": "bash_20241022",
"name": "bash"
},
{
"type": "text_editor_20241022",
"name": "str_replace_editor"
}
]The Action Execution Layer
This layer translates Claude's action outputs into actual system interactions. A basic implementation using pyautogui:
import pyautogui
import subprocess
import json
def take_screenshot() -> str:
"""Capture screen and return as base64."""
screenshot = ImageGrab.grab()
screenshot = screenshot.resize((1024, 768))
import io
buffer = io.BytesIO()
screenshot.save(buffer, format="PNG")
return base64.standard_b64encode(buffer.getvalue()).decode()
def execute_computer_action(action: dict) -> str:
"""Execute a computer tool action."""
action_type = action["action"]
if action_type == "screenshot":
return take_screenshot()
elif action_type == "left_click":
x, y = action["coordinate"]
pyautogui.click(x, y)
return "Clicked"
elif action_type == "double_click":
x, y = action["coordinate"]
pyautogui.doubleClick(x, y)
return "Double-clicked"
elif action_type == "type":
pyautogui.typewrite(action["text"], interval=0.02)
return "Typed text"
elif action_type == "key":
pyautogui.hotkey(*action["key"].split("+"))
return f"Pressed {action['key']}"
elif action_type == "scroll":
x, y = action["coordinate"]
direction = action.get("direction", "down")
amount = action.get("amount", 3)
pyautogui.scroll(-amount if direction == "down" else amount, x=x, y=y)
return f"Scrolled {direction}"
elif action_type == "mouse_move":
x, y = action["coordinate"]
pyautogui.moveTo(x, y)
return "Moved mouse"
return f"Unknown action: {action_type}"
def execute_bash(command: str) -> str:
"""Execute a bash command and return output."""
result = subprocess.run(
command, shell=True, capture_output=True, text=True, timeout=30
)
output = result.stdout
if result.returncode != 0:
output += f"\nSTDERR: {result.stderr}"
return output or "(no output)"
The Main Loop
def run_computer_use(task: str, max_iterations: int = 50) -> str:
messages = [{"role": "user", "content": task}]
for _ in range(max_iterations):
response = client.beta.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
tools=TOOLS,
messages=messages,
betas=[BETA_HEADER]
)
# Add Claude's response to history
messages.append({"role": "assistant", "content": response.content})
# Task complete
if response.stop_reason == "end_turn":
for block in response.content:
if hasattr(block, "text"):
return block.text
return "Task completed."
# Execute tool calls
tool_results = []
for block in response.content:
if block.type == "tool_use":
if block.name == "computer":
result = execute_computer_action(block.input)
# Screenshot result is an image
if block.input.get("action") == "screenshot":
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": [{"type": "image", "source": {
"type": "base64",
"media_type": "image/png",
"data": result
}}]
})
else:
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result
})
elif block.name == "bash":
result = execute_bash(block.input["command"])
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result
})
messages.append({"role": "user", "content": tool_results})
return "Max iterations reached."
# Example usage
result = run_computer_use(
"Open a browser, navigate to weather.com, and tell me today's weather in London."
)
print(result)Error Handling: Stuck States and Task Completion
- Stuck state detection: If the same screenshot is returned 3+ times with no visual change, Claude is likely stuck. Inject a message: "The screen does not seem to be changing. Try a different approach or report that you are stuck."
- Unexpected UI: Dialogs, permission prompts, and error messages can appear unexpectedly. Claude usually handles these — ensure the max iterations is high enough for recovery attempts.
- Task completion detection: Rely on Claude's
end_turnstop reason. Provide a clear success condition in the task prompt: "When you have extracted the data, output it as JSON and stop." - Timeout per action: Set a timeout on bash commands to prevent hanging processes from blocking the loop.
Checklist: Do You Understand This?
- Beta header:
anthropic-beta: computer-use-2024-10-22required on every API call - Three tools:
computer_20241022,bash_20241022,text_editor_20241022 - Screenshot results: image tool_result with base64 PNG — not text
- Action execution layer: translates Claude's action outputs to system calls (pyautogui, subprocess)
- Error handling: detect stuck states by comparing successive screenshots; set bash command timeouts