🧠 All Things AI
Intermediate

Tool Calling Patterns

Tool calling (also called function calling) is the mechanism by which an LLM requests the execution of a function and incorporates the result into its reasoning. It is what makes an agent capable of taking actions beyond text generation — querying databases, calling APIs, reading files, running code. This page covers how tool calling works at the protocol level, how to design tools the LLM uses reliably, how to handle parallel and sequential calls, and how to manage errors.

How Tool Calling Works

The flow is a structured round-trip between your application and the LLM:

1. Define tools in the API call: You send the LLM a list of available tools, each described with a name, description, and JSON Schema for its parameters. The LLM never sees actual function code — only the schema and description.
2. LLM decides to call a tool: Instead of outputting a text response, the LLM returns a structured tool call: the tool name and the argument values it wants to pass. The LLM has stopped generating — it is waiting for the result.
3. Your application executes the tool: Your code receives the tool call, validates the arguments, executes the real function (API call, DB query, etc.), and captures the result or error.
4. Inject the result back: You append the tool result to the conversation (as a special tool-result message) and call the LLM again. The LLM continues reasoning with the result in context — deciding whether to call another tool or produce a final answer.
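The four steps above can be sketched as a loop. This is a minimal illustration, not any specific SDK: `call_llm` stands in for a real chat API, and the message shapes and field names (`type`, `tool_call_id`, and so on) are assumptions for the sketch.

```python
import json

# Step 1 (define tools) is represented by this registry; in a real call you
# would also send each tool's schema to the API.
TOOLS = {"get_weather": lambda args: {"temp_c": 18, "city": args["city"]}}

def run_agent(messages, call_llm, max_turns=5):
    for _ in range(max_turns):
        response = call_llm(messages)          # step 2: model may request a tool
        if response["type"] != "tool_call":
            return response["text"]            # no tool call: final text answer
        # step 3: execute the requested tool with the model's arguments
        result = TOOLS[response["name"]](response["arguments"])
        messages.append({                      # step 4: inject result, loop again
            "role": "tool",
            "tool_call_id": response["id"],
            "content": json.dumps(result),
        })
    raise RuntimeError("tool loop did not terminate")
```

The `max_turns` cap matters in practice: without it, a model that keeps requesting tools would loop indefinitely.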

Tool Schema Design

The quality of the tool description is the single biggest factor in whether the LLM calls the tool correctly. The LLM cannot read your code β€” it relies entirely on the schema and description to understand what the tool does, when to call it, and what arguments to provide.

Anatomy of a Good Tool Definition

| Field | Purpose | Common mistake |
| --- | --- | --- |
| name | Identifies the tool. Used by the LLM to select it. | Vague names like tool1 or helper. Use verbs: search_web, get_order_status, send_email |
| description | Tells the LLM when to use this tool vs others, and what it returns. | Too short. Include: what it does, when to use it, what it returns, and its limitations |
| parameters | JSON Schema defining each argument: type, description, constraints, required vs optional. | Missing description on individual parameters — the LLM then guesses what to put in each field |
| required array | Marks which parameters the LLM must always provide. | Marking everything required when some params have sensible defaults — creates unnecessary friction |

Example: Good vs Poor Schema

Poor schema
{
  "name": "search",
  "description": "Search for things",
  "parameters": {
    "type": "object",
    "properties": {
      "q": { "type": "string" }
    },
    "required": ["q"]
  }
}
It gives no indication of what it searches, what it returns, or when to use it versus other tools, and the parameter q has no description.
Good schema
{
  "name": "search_product_catalog",
  "description": "Search the product catalog by keyword. Returns up to 10 matching products with name, SKU, price, and stock status. Use this when the user asks about product availability or pricing. Do NOT use for order history.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search keywords, e.g. 'blue running shoes size 10'"
      },
      "max_results": {
        "type": "integer",
        "description": "Max products to return (1-10). Default: 5",
        "default": 5
      }
    },
    "required": ["query"]
  }
}

Schema Design Rules

One tool, one purpose. Tools that do two things (search AND summarise) cause the LLM to pass incorrect arguments. Split into two tools.
Describe disambiguation explicitly. If you have get_order and get_invoice, the description of each must explain when to use it vs the other.
Use enums for constrained values. If a parameter has a fixed set of valid values, define it as an enum — the LLM will only use those values rather than inventing strings.
Describe what the tool returns, not just what it accepts. The LLM uses the description to know what information it will have available after calling the tool.
Keep the tool list small. With 2–5 tools the LLM almost never picks the wrong one. With 15+ tools, tool selection errors increase substantially. If you need many tools, use tool search or dynamic tool loading (see below).
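As an illustration of the enum rule, a parameter can be defined like this (JSON Schema expressed as a Python dict; the status values are made up for the example):

```python
# Constraining `status` to an enum stops the model from inventing values
# like "despatched" or "in transit".
status_param = {
    "type": "string",
    "enum": ["pending", "shipped", "delivered", "cancelled"],
    "description": "Order status to filter by.",
}

def is_valid_status(value):
    # The same list doubles as a server-side validation check.
    return value in status_param["enum"]
```

Reusing the enum list for application-side validation keeps the schema and the check from drifting apart.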

Parallel vs Sequential Tool Calls

Modern LLMs (GPT-4o, Claude 3.x, Gemini 1.5) can request multiple tool calls in a single response — before seeing any of the results. This is called parallel tool calling.

| Mode | When LLM uses it | Your application must | Best for |
| --- | --- | --- | --- |
| Parallel | Multiple independent tool calls can be issued before seeing results | Execute all calls concurrently; collect all results; inject all results before next LLM call | Independent lookups (search + get user + get order simultaneously) |
| Sequential | Each call depends on the result of the previous | Execute one tool, inject result, LLM decides next call | Dependent steps (search → read top result → summarise that result) |
Parallelism note: OpenAI exposes parallel_tool_calls: false to force sequential calls. Anthropic supports parallel tool calls natively. If your tools have side effects that conflict (e.g. two writes to the same resource), you should either disable parallelism or implement locking at the tool layer.
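The application side of parallel execution can be sketched as follows, assuming each tool call arrives as a dict with id, name, and arguments (illustrative field names, not a specific provider's format):

```python
from concurrent.futures import ThreadPoolExecutor

def execute_parallel(calls, registry):
    """Run one batch of parallel tool calls concurrently and return one
    result message per call, preserving order. `registry` maps tool name
    to a plain Python function."""
    def run_one(call):
        try:
            result = registry[call["name"]](**call["arguments"])
            return {"tool_call_id": call["id"], "content": str(result),
                    "is_error": False}
        except Exception as exc:
            # A failure becomes an error result, never a crash of the loop.
            return {"tool_call_id": call["id"], "content": str(exc),
                    "is_error": True}
    with ThreadPoolExecutor(max_workers=len(calls) or 1) as pool:
        return list(pool.map(run_one, calls))
```

All results are collected before the next LLM call, matching the table above; this sketch is only safe when the tools in a batch have no conflicting side effects.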

Error Handling

Tool calls will fail. The network will time out. The API will return a 404. The input will fail validation. How you surface errors to the LLM determines whether it can recover gracefully or gets stuck.

Always return a tool result, even on error. Never leave a tool call unanswered — if you omit the tool result message the LLM has no signal about why it cannot proceed. Return an error message in the tool result content, and set is_error: true (Anthropic) or use the role: "tool" message with error content (OpenAI).
Be specific in error messages. “Error: 404” tells the LLM nothing actionable. “Order #12345 not found. The order may belong to a different account, or may have been deleted. Try verifying the order ID with the customer.” gives the LLM enough information to respond sensibly or try a different approach.
Validate arguments before executing. Check the LLM-provided arguments match your expected schema and constraints before passing them to the actual tool. If an argument is invalid, return a validation error immediately β€” this prevents expensive downstream calls with bad inputs and helps the LLM self-correct.
Implement retry budgets. If the LLM calls a tool with invalid arguments and gets an error, it will typically self-correct and retry. But if it keeps retrying the same bad call, you need a per-tool retry limit (e.g. 3 attempts max) to prevent runaway loops.
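Putting these rules together, a defensive executor might look like the following sketch. The field names follow the Anthropic-style is_error convention mentioned above; the validation shown checks only required fields, and the budget of 3 is illustrative.

```python
from collections import Counter

MAX_ATTEMPTS = 3            # per-tool retry budget (illustrative)
_failures = Counter()

def execute_tool(call, registry, schemas):
    name, args = call["name"], call["arguments"]
    # Retry budget: stop a runaway loop of repeated bad calls.
    if _failures[name] >= MAX_ATTEMPTS:
        return {"is_error": True,
                "content": f"Tool '{name}' has failed {MAX_ATTEMPTS} times. "
                           "Stop retrying and explain the problem to the user."}
    # Validate before executing: cheap check, specific error message.
    missing = [p for p in schemas[name].get("required", []) if p not in args]
    if missing:
        _failures[name] += 1
        return {"is_error": True,
                "content": f"Missing required argument(s): {', '.join(missing)}. "
                           "Provide them and call the tool again."}
    try:
        return {"is_error": False, "content": str(registry[name](**args))}
    except Exception as exc:
        _failures[name] += 1
        return {"is_error": True, "content": f"{name} failed: {exc}"}
```

Note that every branch returns a tool result; nothing propagates as an unanswered call.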

Result Injection and Context Management

Tool results accumulate in the context window as the agent runs. For long-running agents with many tool calls, this becomes a context management problem.

Truncate large results. A web search might return 50KB of HTML. Inject only the relevant portion — extract the top-3 most relevant paragraphs, not the full page. The LLM does not need (and is confused by) irrelevant content in tool results.
Pre-summarise large tool outputs. For tools that return large structured data (JSON API responses with 100+ fields), use a second LLM call to summarise the relevant fields before injecting into the main agent context. This is called a “result summarisation step.”
Track what was retrieved. For multi-step agents, maintain a separate “scratchpad” structure that records which tools were called and what key facts were learned. Inject a compressed version of this into each subsequent prompt rather than the full tool-call history.
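The simplest of these, truncation, can be sketched in a few lines (the 2,000-character limit is illustrative):

```python
def truncate_result(text, limit=2000):
    """Cap a tool result before injecting it into context, noting how much
    was dropped so the LLM knows the result is partial."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n[truncated: {len(text) - limit} characters omitted]"
```

The explicit truncation marker matters: without it, the LLM may treat a cut-off result as complete.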

Provider Differences

| Provider | Tool format | Parallel calls | Notes |
| --- | --- | --- | --- |
| OpenAI | tools array with function type + JSON Schema | Yes (default on; disable with parallel_tool_calls: false) | Most mature implementation; tool_choice forces specific tool or disables tool calling |
| Anthropic | tools array with input_schema (JSON Schema) | Yes (native support) | Error results use is_error: true; tool_choice supports auto, any, tool; programmatic tool calling (2025) |
| Gemini | function_declarations inside tools | Yes | Different field naming convention; LiteLLM or LangChain normalises across providers |
Cross-provider compatibility: If you need to switch models or support multiple providers, use LiteLLM or LangChain's tool abstraction layer — they normalise the schema format differences automatically.

Controlling Tool Selection

auto (default): The LLM decides whether to call a tool or respond in text. Use for general agents where the LLM should decide based on context.
required / any: Force the LLM to call at least one tool. Use when you need structured output and want to prevent the LLM from responding in free text.
Named tool: Force the LLM to call a specific tool. Use for structured data extraction — define a tool that matches your desired output schema, force the call, and treat the tool arguments as your structured output.
none: Disable tool calling entirely for this call. Use when you want a text response summarising the tool results collected so far, without risking another tool call.

Advanced Patterns (2025)

Dynamic tool loading / Tool search

When you have hundreds of tools, providing them all in every call is expensive (tokens) and degrades selection quality. Instead, maintain a tool registry with descriptions, embed them, and retrieve the most relevant tools per query. Inject only the top-5 relevant tools for each call. Anthropic introduced a native Tool Search Tool in 2025 that implements this pattern — the LLM can search for tools by description before calling them.
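The retrieval step can be sketched as follows. `embed` is any function mapping text to a vector; a real system would use an embedding model, but it is left pluggable here so the pattern stands alone.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 for a zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_tools(query, registry, embed, k=5):
    """Return the k tools whose descriptions are most similar to the query.
    `registry` is a list of tool definitions with name and description."""
    qv = embed(query)
    ranked = sorted(registry,
                    key=lambda t: cosine(qv, embed(t["description"])),
                    reverse=True)
    return ranked[:k]
```

Only the returned top-k tools are then included in the LLM call, keeping the visible tool list small as recommended above.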

Structured output via forced tool call

LLMs produce unreliable JSON in free-text mode. To get reliable structured output, define a “tool” whose schema matches your desired output format (e.g. extract_customer_info with fields for name, email, intent), force the LLM to call it with tool_choice: required, and treat the arguments as your output. This is more reliable than asking the LLM to “respond in JSON.”
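A sketch of the pattern, using the hypothetical extract_customer_info tool from above; `response` stands in for the API's parsed reply when the tool call is forced:

```python
import json

# The extraction tool's parameter schema IS the desired output schema.
extract_tool = {
    "name": "extract_customer_info",
    "description": "Record the customer's details from their message.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Customer's full name"},
            "email": {"type": "string", "description": "Email address, if given"},
            "intent": {"type": "string",
                       "enum": ["refund", "question", "complaint"]},
        },
        "required": ["name", "intent"],
    },
}

def parse_extraction(response):
    """Treat the forced tool call's arguments as the structured output.
    Providers may deliver arguments as a JSON string or a parsed object."""
    args = response["tool_calls"][0]["arguments"]
    return json.loads(args) if isinstance(args, str) else args
```

Because the arguments are constrained by the schema (including the intent enum), the output needs no brittle free-text JSON parsing.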

Programmatic tool calling (Anthropic, 2025)

Anthropic introduced programmatic tool calling — Claude can invoke tools in a code execution environment rather than individual API round-trips. This enables parallel tool execution from within a single reasoning step, reducing latency for agents with many concurrent tool calls.

Failure Modes

Wrong tool selected

The LLM calls a tool that is not appropriate for the situation — usually because descriptions are too similar or too vague. Fix: add explicit disambiguation language to each description; reduce the number of tools visible per call.

Hallucinated arguments

The LLM invents argument values it was not given (e.g. fabricates an order ID). Fix: validate arguments before executing; add parameter descriptions that clarify where the values should come from ("Use the order ID from the user's message, not from your knowledge").

Ignoring tool results

The LLM calls a tool, receives the result, then answers from its training knowledge anyway — ignoring what the tool returned. Fix: strengthen the system prompt instruction to use tool results; inspect traces to confirm results are being injected correctly.

Executing without validation

Blindly executing the LLM's tool call arguments without checking them first. For write tools (delete, send, update), always validate arguments in your application layer before execution β€” treat LLM-provided arguments as untrusted user input.
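One possible gate, purely illustrative: refuse a destructive call unless its target ID was surfaced earlier in the conversation (the naming convention and policy here are assumptions for the sketch, not a complete defence).

```python
# Illustrative convention: write tools are named with these prefixes.
WRITE_PREFIXES = ("delete_", "update_", "send_")

def allow_write_call(call, known_ids):
    """Permit a write-tool call only if its target ID was previously seen
    (e.g. returned by an earlier read tool). Read tools always pass."""
    if call["name"].startswith(WRITE_PREFIXES):
        return call["arguments"].get("id") in known_ids
    return True
```

This is the same posture as validating any untrusted user input: the model's arguments are checked against state your application controls, not taken at face value.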

Checklist: Do You Understand This?

  • Can you describe the four-step tool calling round-trip (define → LLM calls → execute → inject result)?
  • What are the four fields of a tool definition, and what is the most common mistake in each?
  • Can you explain the difference between parallel and sequential tool calls, and when each is appropriate?
  • How should you handle a tool error β€” what must you always include in the response?
  • Can you name the four tool_choice modes and when to use each?
  • What is the “structured output via forced tool call” pattern and why is it more reliable than asking for JSON?
  • Can you name four failure modes in tool calling and their fixes?
  • How do Anthropic and OpenAI differ in their tool format and error signalling?