Intermediate

Using the Ollama API

Ollama is more than a CLI tool. It runs a local HTTP server on port 11434 and exposes a REST API you can call from any language, framework, or tool. It also provides OpenAI-compatible endpoints, so many OpenAI clients can use it as a drop-in replacement.

Architecture

Your App (Python / JS / curl / LangChain)
  -> Ollama Server (REST API :11434, OpenAI-compat /v1/)
  -> Model (loaded model: llama3.2, deepseek-r1, etc.)

The Ollama server sits between your app and the model and streams tokens back over HTTP.

Core Endpoints

Endpoint	Method	What it does
/api/generate	POST	Single-turn completion — prompt in, response out. Streams by default.
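/api/chat	POST	Multi-turn chat with message history, using OpenAI-style role/content messages. Streams by default.
/api/embeddings	POST	Generate a vector embedding for a string. Use with an embedding model such as nomic-embed-text. Newer Ollama releases also offer /api/embed, which takes an input field instead of prompt.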
/api/tags	GET	List all locally downloaded models.
/api/show	POST	Get details about a specific model (Modelfile, parameters, size).
/api/pull	POST	Pull a model from the library programmatically.
/api/delete	DELETE	Delete a model from disk.
/v1/chat/completions	POST	OpenAI-compatible endpoint — drop-in for OpenAI clients.
/v1/embeddings	POST	OpenAI-compatible embeddings endpoint.
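
As a quick way to confirm the server is reachable, the /api/tags endpoint from the table can be called with nothing but the Python standard library. This is a minimal sketch: the helper names `model_names` and `list_local_models` are made up for illustration, and it assumes the response body is a JSON object with a top-level "models" list whose entries carry a "name" field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    # Assumption: the body looks like {"models": [{"name": "..."}, ...]}
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """GET /api/tags and return the names of locally downloaded models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))


# Usage (needs a running Ollama server):
#   print(list_local_models())
```

Keeping the parsing in its own function means it can be checked against a sample payload without a server running.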

Quick Examples — curl

# Single-turn generation
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"Why is the sky blue?","stream":false}'

# Multi-turn chat
curl http://localhost:11434/api/chat \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}],"stream":false}'

# Embeddings
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"The quick brown fox"}'
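
Embeddings are usually compared rather than read, so here is a rough Python sketch of what one might do with the vectors the embeddings call returns. The function names `embed` and `cosine_similarity` are illustrative, and the code assumes the /api/embeddings response carries the vector under an "embedding" key.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def embed(text: str, model: str = "nomic-embed-text",
          base_url: str = OLLAMA_URL) -> list[float]:
    """POST to /api/embeddings and return the vector for one string."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        # Assumption: the body looks like {"embedding": [0.1, 0.2, ...]}
        return json.load(resp)["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Usage (needs a running Ollama server with nomic-embed-text pulled):
#   score = cosine_similarity(embed("The quick brown fox"), embed("A fast auburn fox"))
```

Scores closer to 1 indicate texts the embedding model places near each other, which is the basis for retrieval in a RAG pipeline.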

Python — Direct API

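# pip install requests
import requests

# Non-streaming request: the server returns one JSON object with the full text.
response = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3.2', 'prompt': 'Explain RAG in 2 sentences', 'stream': False},
    timeout=120,
)
response.raise_for_status()  # surface HTTP errors such as an unknown model name
print(response.json()['response'])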
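
The direct approach extends to /api/chat, where the practical difference from /api/generate is that the caller resends the whole conversation on every turn. The sketch below illustrates that bookkeeping using only the standard library. `chat_payload` and `send_turn` are hypothetical helper names, and the code assumes a non-streaming /api/chat response holds the reply under a "message" key with "role" and "content" fields.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def chat_payload(model: str, history: list[dict], user_text: str) -> dict:
    """Build an /api/chat body: prior messages plus the new user turn."""
    # history + [...] creates a new list, so the caller's history is not mutated
    messages = history + [{"role": "user", "content": user_text}]
    return {"model": model, "messages": messages, "stream": False}


def send_turn(model: str, history: list[dict], user_text: str) -> list[dict]:
    """Send one user turn and return the history extended with the reply."""
    payload = chat_payload(model, history, user_text)
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)["message"]  # assumed shape: {"role": "assistant", "content": "..."}
    return payload["messages"] + [reply]


# Usage (needs a running Ollama server):
#   history = send_turn("llama3.2", [], "My name is Ada.")
#   history = send_turn("llama3.2", history, "What is my name?")
#   print(history[-1]["content"])
```

The server keeps no conversation state between requests in this pattern; the second call only "remembers" the name because the first exchange is included in its messages list.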

OpenAI-Compatible Endpoint

Ollama's /v1/ endpoints accept the same core JSON structure as the OpenAI API. Most code that uses the openai Python library (or the JavaScript SDK) can therefore talk to Ollama with a one-line change: point the base URL at your local server.

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but ignored
)
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
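
To see what the SDK sends and receives, the same call can be sketched as a raw HTTP request. This is an illustration under stated assumptions, not how the openai library is implemented: the helper names are hypothetical, and the response is assumed to follow the OpenAI chat completion shape, with the text at choices[0].message.content.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # same base URL the SDK example uses


def first_choice_text(completion: dict) -> str:
    """Pull the reply text out of an OpenAI-style chat completion body."""
    return completion["choices"][0]["message"]["content"]


def chat_completion(model: str, messages: list[dict],
                    base_url: str = BASE_URL) -> str:
    """POST an OpenAI-style request to /v1/chat/completions and return the text."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder token, mirroring api_key='ollama' in the SDK example
            "Authorization": "Bearer ollama",
        },
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return first_choice_text(json.load(resp))


# Usage (needs a running Ollama server):
#   print(chat_completion("llama3.2", [{"role": "user", "content": "Hello!"}]))
```

Note the difference from the native API: /api/chat returns the reply under message.content, while the /v1/ endpoint wraps it in a choices list.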

JavaScript / Node.js

// npm install ollama
import { Ollama } from 'ollama'

const ollama = new Ollama({ host: 'http://localhost:11434' })
const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Explain embeddings' }]
})
console.log(response.message.content)

LangChain Integration

# pip install langchain-ollama
from langchain_ollama import ChatOllama

llm = ChatOllama(model='llama3.2', temperature=0)
result = llm.invoke('Summarize RAG in 3 bullets')
print(result.content)

LangChain's Ollama integration supports streaming, structured output (JSON mode), tool calling on supported models, and embedding generation via OllamaEmbeddings. Use it directly in RAG pipelines alongside pgvector or Chroma.

Tools That Work With the Ollama API

Open WebUI

Full chat UI, like ChatGPT but running locally. Points at localhost:11434 out of the box. Installs with a single Docker command.

AnythingLLM

Document Q&A and RAG workflows. Supports Ollama as a local backend. Best GUI for local RAG without coding.

LiteLLM

API proxy and router. Expose Ollama models as OpenAI-compatible endpoints with logging, cost tracking, and fallbacks.

Enchanted (macOS)

Native macOS app for Ollama. Menu bar chat, multi-model switching, conversation history.

n8n

Automation platform. Ollama node available — use local models in AI workflows without data leaving your machine.

LangChain / LlamaIndex

Both frameworks have first-class Ollama integrations for building RAG pipelines, agents, and chains locally.

Streaming

By default, Ollama streams tokens as they are generated. Set "stream": false to get the complete response as a single JSON object — useful when you want the full output before doing anything with it. Streaming is generally preferred for UI applications so the user sees text appearing in real time.
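
When streaming is left on, the response body arrives as newline-delimited JSON: one small object per line, each carrying a fragment of text, with the last one marked as done. A minimal sketch of consuming that stream from /api/generate follows. The names `iter_tokens` and `stream_generate` are made up for this example, and the code assumes each line has a "response" string and a "done" boolean.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434"


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON /api/generate stream."""
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final object marks the end of the stream


def stream_generate(model: str, prompt: str,
                    base_url: str = OLLAMA_URL) -> Iterator[str]:
    """POST to /api/generate without "stream": false and yield text as it arrives."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        yield from iter_tokens(resp)  # the response object iterates line by line


# Usage (needs a running Ollama server):
#   for piece in stream_generate("llama3.2", "Why is the sky blue?"):
#       print(piece, end="", flush=True)
```

For /api/chat the same loop applies, except that each line is expected to carry its fragment under message.content rather than response.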

Checklist: Do You Understand This?

Can you make a curl request to Ollama's /api/chat endpoint?
Do you know what one-line change points the OpenAI Python SDK at a local Ollama instance?
Can you use Ollama with LangChain for a local RAG pipeline?
Do you know the difference between /api/generate and /api/chat?