Intermediate

Using the Ollama API

Ollama is not just a CLI tool. It runs a local HTTP server on port 11434 and exposes a REST API you can call from any language, framework, or tool. It also serves OpenAI-compatible endpoints under /v1/, so it can act as a drop-in replacement for the OpenAI API in existing client code.

Architecture

Your App (Python / JS / curl / LangChain)
  → Ollama Server (REST API :11434, OpenAI-compat /v1/)
  → Model (loaded model: llama3.2, deepseek-r1, etc.)

The Ollama server sits between your app and the model and streams tokens back over HTTP.

Core Endpoints

Endpoint               Method   What it does
/api/generate          POST     Single-turn completion: prompt in, response out. Streams by default.
/api/chat              POST     Multi-turn chat with message history. Uses OpenAI-style role/content messages.
/api/embeddings        POST     Generate vector embeddings for a string. Use with nomic-embed-text.
/api/tags              GET      List all locally downloaded models.
/api/show              POST     Get details about a specific model (Modelfile, parameters, size).
/api/pull              POST     Pull a model from the library programmatically.
/api/delete            DELETE   Delete a model from disk.
/v1/chat/completions   POST     OpenAI-compatible endpoint, a drop-in for OpenAI clients.
/v1/embeddings         POST     OpenAI-compatible embeddings endpoint.
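
As a quick way to confirm the server is reachable, the /api/tags endpoint from the table can be queried with nothing but the Python standard library. This is a minimal sketch, not an official client: the helper names model_names and list_local_models are made up for illustration, and it assumes the response is a JSON object with a "models" list whose entries carry a "name" field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used on this page


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response.

    Assumes the shape {"models": [{"name": ...}, ...]}.
    """
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """GET /api/tags and return the names of locally downloaded models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))
```

Calling list_local_models() with the server running should return the same model names that `ollama list` shows; if the call raises a connection error, the server is probably not running on that port.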

Quick Examples — curl

# Single-turn generation
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"Why is the sky blue?","stream":false}'

# Multi-turn chat
curl http://localhost:11434/api/chat \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}],"stream":false}'

# Embeddings
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"The quick brown fox"}'

Python — Direct API

# pip install requests
import requests

response = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3.2', 'prompt': 'Explain RAG in 2 sentences', 'stream': False}
)
print(response.json()["response"])
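
The /api/chat endpoint is stateless, so a multi-turn conversation works by resending the full message history on every request. The sketch below shows one way to do that with the standard library. The helpers post_json and chat_turn are hypothetical names, and the code assumes the non-streaming response carries the assistant reply under a "message" key with "role" and "content" fields.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used on this page


def post_json(path: str, payload: dict, base_url: str = OLLAMA_URL) -> dict:
    """POST a JSON body to the Ollama server and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def chat_turn(history: list[dict], user_text: str,
              model: str = "llama3.2", send=post_json) -> str:
    """Append the user message, call /api/chat, and record the reply.

    `history` is mutated in place so the next call sends the whole conversation.
    `send` is injectable so the function can be exercised without a server.
    """
    history.append({"role": "user", "content": user_text})
    reply = send("/api/chat", {"model": model, "messages": history, "stream": False})
    message = reply["message"]  # assumed shape: {"role": "assistant", "content": "..."}
    history.append(message)
    return message["content"]
```

A typical session would start with history = [] and call chat_turn(history, "Hello!") followed by chat_turn(history, "Tell me more"); the second request then includes the first exchange as context.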

OpenAI-Compatible Endpoint

Ollama's /v1/ endpoints accept the same request JSON as the corresponding OpenAI API endpoints and return responses in the same shape. This means code that uses the openai Python library (or the JavaScript SDK) can talk to Ollama with a one-line change to the client setup: point the base URL at your local server and pass any placeholder API key.

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the SDK, ignored by Ollama
)
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
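
To make the "same JSON structure" point concrete, the compatible endpoint can also be called without the SDK. The following is an illustrative sketch using only the standard library; build_chat_body, extract_reply, and chat_completion are hypothetical helper names, and the response shape (choices[0].message.content) is the one the SDK example above already reads.

```python
import json
import urllib.request

OLLAMA_V1_URL = "http://localhost:11434/v1"  # same base URL as the SDK example


def build_chat_body(model: str, user_text: str) -> dict:
    """Build a request body in the OpenAI chat-completions shape."""
    return {"model": model, "messages": [{"role": "user", "content": user_text}]}


def extract_reply(completion: dict) -> str:
    """Pull the assistant text out of a chat-completions style response."""
    return completion["choices"][0]["message"]["content"]


def chat_completion(user_text: str, model: str = "llama3.2",
                    base_url: str = OLLAMA_V1_URL) -> str:
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_body(model, user_text)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder token, mirroring api_key='ollama' in the SDK example.
            "Authorization": "Bearer ollama",
        },
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

Because the body and the response follow the OpenAI shape, the same two helper functions would work unchanged against any other server that implements that format; only the base URL and the token differ.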

JavaScript / Node.js

// npm install ollama
import { Ollama } from 'ollama'

const ollama = new Ollama({ host: 'http://localhost:11434' })
const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Explain embeddings' }]
})
console.log(response.message.content)

LangChain Integration

# pip install langchain-ollama
from langchain_ollama import ChatOllama

llm = ChatOllama(model='llama3.2', temperature=0)
result = llm.invoke('Summarize RAG in 3 bullets')
print(result.content)

LangChain's Ollama integration supports streaming, structured output (JSON mode), tool calling on supported models, and embedding generation via OllamaEmbeddings. Use it directly in RAG pipelines alongside pgvector or Chroma.
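
The retrieval step of such a pipeline can be sketched without any framework, which may help clarify what the vector store is doing. This example is an assumption-laden illustration rather than LangChain's implementation: it calls /api/embeddings directly, assumes the response carries the vector under an "embedding" key, and uses hypothetical helpers (embed, cosine_similarity, rank_documents) with a plain in-memory dict standing in for pgvector or Chroma.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used on this page


def embed(text: str, model: str = "nomic-embed-text",
          base_url: str = OLLAMA_URL) -> list[float]:
    """Request an embedding vector for `text` from /api/embeddings."""
    req = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]  # assumed response key


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_vec: list[float],
                   doc_vecs: dict[str, list[float]]) -> list[str]:
    """Return document ids ordered from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
```

In use, each document would be passed through embed() once and stored, and each incoming question would be embedded and ranked against those stored vectors; the top results are then placed into the chat prompt as context.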

Tools That Work With the Ollama API

  • Open WebUI: Full chat UI, like ChatGPT but running locally. Points at localhost:11434 out of the box and installs quickly with Docker.
  • AnythingLLM: Document Q&A and RAG workflows. Supports Ollama as a local backend. A strong GUI option for local RAG without coding.
  • LiteLLM: API proxy and router. Expose Ollama models as OpenAI-compatible endpoints with logging, cost tracking, and fallbacks.
  • Enchanted (macOS): Native macOS app for Ollama. Menu bar chat, multi-model switching, conversation history.
  • n8n: Automation platform with an Ollama node available. Use local models in AI workflows without data leaving your machine.
  • LangChain / LlamaIndex: Both frameworks have first-class Ollama integrations for building RAG pipelines, agents, and chains locally.

Streaming

By default, the native /api/generate and /api/chat endpoints stream tokens as they are generated, sending one JSON object per line. Set "stream": false to get the complete response as a single JSON object, which is useful when you need the full output before doing anything with it. Streaming is generally preferred for UI applications so the user sees text appearing in real time.
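
Consuming a streamed response mostly comes down to reading the body line by line and parsing each line as JSON. The sketch below illustrates this for /api/generate using the standard library. The helper names iter_tokens and stream_generate are hypothetical, and the code assumes each streamed line is a JSON object with a "response" text fragment and a "done" flag that is true on the final object.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434"  # default local address used on this page


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks.

    Stops after the chunk whose "done" field is true; blank lines are skipped.
    """
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def stream_generate(prompt: str, model: str = "llama3.2",
                    base_url: str = OLLAMA_URL) -> Iterator[str]:
    """POST to /api/generate with streaming left on and yield fragments as they arrive."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        # The HTTP response object is iterable line by line.
        yield from iter_tokens(resp)
```

A command-line caller could loop over stream_generate("Why is the sky blue?") and print each fragment with end="" and flush=True to get the typewriter effect, while "".join(...) over the same generator gives the full text.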

Checklist: Do You Understand This?

  • Can you make a curl request to Ollama's /api/chat endpoint?
  • Do you know what one-line change points the OpenAI Python SDK at a local Ollama instance?
  • Can you use Ollama with LangChain for a local RAG pipeline?
  • Do you know the difference between /api/generate and /api/chat?

Page built: 01 Jun 2026