Using the Ollama API
Ollama is not just a CLI tool. It runs a local HTTP server on port 11434, exposing a REST API you can call from any language, framework, or tool — including as a drop-in replacement for the OpenAI API.
Architecture
Ollama server sits between your app and the model — streams tokens back via HTTP
Core Endpoints
| Endpoint | Method | What it does |
|---|---|---|
| /api/generate | POST | Single-turn completion — prompt in, response out. Streams by default. |
| /api/chat | POST | Multi-turn chat with message history. OpenAI messages format. |
| /api/embeddings | POST | Generate vector embeddings for a string. Use with nomic-embed-text. |
| /api/tags | GET | List all locally downloaded models. |
| /api/show | POST | Get details about a specific model (Modelfile, parameters, size). |
| /api/pull | POST | Pull a model from the library programmatically. |
| /api/delete | DELETE | Delete a model from disk. |
| /v1/chat/completions | POST | OpenAI-compatible endpoint — drop-in for OpenAI clients. |
| /v1/embeddings | POST | OpenAI-compatible embeddings endpoint. |
Quick Examples — curl
Python — Direct API
# pip install requests
import requests
response = requests.post(
'http://localhost:11434/api/generate',
json={'model': 'llama3.2', 'prompt': 'Explain RAG in 2 sentences', 'stream': False}
)
print(response.json()["response"])OpenAI-Compatible Endpoint
Ollama's /v1/ endpoints accept the same JSON structure as the OpenAI API. This means any code that uses the openai Python library (or JavaScript SDK) can talk to Ollama with a one-line change — just point the base URL at your local server:
# pip install openai
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # required but ignored
)
response = client.chat.completions.create(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)JavaScript / Node.js
// npm install ollama
import { Ollama } from 'ollama'
const ollama = new Ollama({ host: 'http://localhost:11434' })
const response = await ollama.chat({
model: 'llama3.2',
messages: [{ role: 'user', content: 'Explain embeddings' }]
})
console.log(response.message.content)LangChain Integration
# pip install langchain-ollama
from langchain_ollama import ChatOllama
llm = ChatOllama(model='llama3.2', temperature=0)
result = llm.invoke('Summarize RAG in 3 bullets')
print(result.content)LangChain's Ollama integration supports streaming, structured output (JSON mode), tool calling on supported models, and embedding generation via OllamaEmbeddings. Use it directly in RAG pipelines alongside pgvector or Chroma.
Tools That Work With the Ollama API
Streaming
By default, Ollama streams tokens as they are generated. Set "stream": false to get the complete response as a single JSON object — useful when you want the full output before doing anything with it. Streaming is generally preferred for UI applications so the user sees text appearing in real time.
Checklist: Do You Understand This?
- Can you make a cURL request to Ollama's
/api/chatendpoint? - Do you know what one-line change points the OpenAI Python SDK at a local Ollama instance?
- Can you use Ollama with LangChain for a local RAG pipeline?
- Do you know the difference between
/api/generateand/api/chat?