OpenAI API & Azure OpenAI
OpenAI's platform is the most widely used AI API in production. Understanding its model families, API surfaces, pricing tiers, and enterprise options (Azure OpenAI) is essential for any builder working in the AI ecosystem.
Model Families
| Family | Models | Purpose | Input cost (1M tokens) |
|---|---|---|---|
| GPT-4o mini | gpt-4o-mini | Fast, cheap, good for simple tasks and classification | $0.15 |
| GPT-4o | gpt-4o, gpt-4o-2024-11-20 | Balanced flagship — multimodal, fast, broad capability | $2.50 |
| GPT-5 | gpt-5 | Frontier general model — 400K context, lowest hallucination rate | $15.00 |
| o4-mini | o4-mini | Cost-efficient reasoning — maths, code, logic | $1.10 |
| o3 | o3 | Full reasoning with tool use — hardest tasks | $10.00 |
| Whisper | whisper-1 | Speech-to-text transcription | $0.006/min |
| TTS | tts-1, tts-1-hd | Text-to-speech generation | $15–30/1M chars |
| DALL-E 3 | dall-e-3 | Image generation (being replaced by GPT Image 1) | $0.04–0.12/image |
API Surface: Chat Completions vs Responses API
Chat Completions API
The original and most widely used endpoint. Stateless — every call is independent. You manage conversation history yourself by appending messages.
POST /v1/chat/completions

```json
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain RAG in one paragraph."}
  ],
  "temperature": 0.7
}
```

Responses API (2025)
Newer API surface replacing the Assistants API. Supports multi-turn conversations, built-in tool use, file search, and web browsing as first-class primitives. Simpler than the Assistants API but with more capability than raw Chat Completions.
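Multi-turn state in the Responses API can be chained server-side instead of resending the whole transcript. A minimal sketch of the request body, assuming the field names in the public `/v1/responses` reference (`resp_123` is a placeholder ID; built-in tools such as file search would also be declared in this body):

```python
def build_responses_request(model, user_input, previous_response_id=None):
    """Build a /v1/responses payload. Passing previous_response_id lets
    the server carry the conversation state from the earlier turn."""
    body = {
        "model": model,
        "input": user_input,
    }
    if previous_response_id is not None:
        body["previous_response_id"] = previous_response_id
    return body

first = build_responses_request("gpt-4o", "Explain RAG in one paragraph.")
follow_up = build_responses_request("gpt-4o", "Shorter, please.",
                                    previous_response_id="resp_123")
```

Contrast with Chat Completions, where the same follow-up would require resending both earlier messages in the `messages` array.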
Tool Use (Function Calling)
Tool use (function calling) lets you define functions the model can invoke:
```json
{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get weather for a city",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
      }
    }
  }]
}
```

When the model decides to call a tool, it returns a structured JSON call in the response. Your code executes the function and returns the results; the model incorporates them in its next response.
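The execute-and-return half of that round trip can be sketched as follows. The `get_weather` stub and the dispatch dict are illustrative plumbing; the shape of each tool call (an `id` plus a `function` object whose `arguments` field is a JSON *string*) follows the Chat Completions response format:

```python
import json

def get_weather(city):
    # Stub: a real application would call an actual weather service here.
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}  # dispatch table: tool name -> local function

def handle_tool_calls(tool_calls):
    """Execute each tool call from a model response and build the
    role="tool" messages to append before the next model turn."""
    results = []
    for call in tool_calls:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results

# Abridged tool_calls payload as it might appear in a model response:
msgs = handle_tool_calls([{
    "id": "call_1",
    "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
}])
```

The resulting `msgs` are appended to the conversation history, and the whole list is sent back so the model can compose its final answer.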
Rate Limits and Handling 429s
OpenAI rate limits are applied per tier across two dimensions:
- RPM (Requests Per Minute) — number of API calls
- TPM (Tokens Per Minute) — total tokens processed
Tier limits increase with monthly spend. Production best practices:
- Implement exponential backoff with jitter on 429 errors
- Use the `Retry-After` header when provided
- Distribute load across multiple API keys / org accounts if hitting limits
- Prefer batch API for non-time-sensitive workloads (no rate limits apply)
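The backoff guidance above can be sketched as a small delay helper. The `retry_after` parameter stands for a parsed `Retry-After` header value; full jitter (a uniform draw up to the exponential ceiling) is one common strategy, not the only one:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None):
    """Seconds to sleep before retrying a 429.

    Honour the server's Retry-After header when present; otherwise use
    exponential backoff (base * 2^attempt) capped at `cap`, with full
    jitter to avoid synchronised retry storms across clients.
    """
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A retry loop would call this after each 429, incrementing `attempt` until a maximum retry count is reached.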
Prompt Caching
OpenAI automatically caches repeated prefixes of prompts. If your system prompt or static context is the same across calls, you pay 50% less for the cached portion. Effective for:
- Long system prompts used across all calls
- Few-shot examples prepended to every request
- Static document context shared across many queries
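Because caching matches exact prompt prefixes, the practical rule is: keep everything static byte-identical and at the front, and put the per-request content last. A sketch of that ordering (the helper name is ours, not an SDK function):

```python
def build_messages(static_system_prompt, few_shot_examples, user_query):
    """Order messages so the static prefix (system prompt + few-shot
    examples) is identical across calls; only the final user message
    varies, maximising the portion the cache can serve."""
    return (
        [{"role": "system", "content": static_system_prompt}]
        + few_shot_examples
        + [{"role": "user", "content": user_query}]
    )

shots = [{"role": "user", "content": "2+2?"},
         {"role": "assistant", "content": "4"}]
a = build_messages("You are terse.", shots, "First question")
b = build_messages("You are terse.", shots, "Second question")
```

Inserting anything request-specific (a timestamp, a user ID) near the top of the prompt breaks the shared prefix and defeats the cache.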
Azure OpenAI
Microsoft's Azure OpenAI Service provides access to OpenAI models (GPT-4o, GPT-5, o3, DALL-E, Whisper) via Azure's cloud infrastructure with enterprise guarantees:
Reasons to use Azure OpenAI
- Your organisation already uses Azure and Azure AD
- HIPAA, SOC2, FedRAMP compliance needed
- Data must stay in a specific Azure region
- Private endpoint / VNet isolation required
- Existing Azure enterprise agreement pricing
Tradeoffs vs direct OpenAI API
- New models arrive later (days to weeks lag after OpenAI release)
- More complex provisioning (deployments per model per region)
- Rate limits are per deployment, harder to scale quickly
- Some newer API features not yet available on Azure
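The per-deployment provisioning shows up directly in the request URL: Azure addresses a named *deployment* you created, not a model name, and requires an `api-version` query parameter. A sketch of the URL shape (resource and deployment names are placeholders; check the current GA `api-version` for your region):

```python
def azure_chat_url(resource, deployment, api_version="2024-10-21"):
    """Chat Completions endpoint for an Azure OpenAI deployment.

    Unlike api.openai.com, the model is selected by the deployment
    name you provisioned, and api-version is mandatory.
    """
    return (f"https://{resource}.openai.azure.com/openai/deployments/"
            f"{deployment}/chat/completions?api-version={api_version}")

url = azure_chat_url("my-resource", "my-gpt4o-deployment")
```

Authentication also differs: Azure uses an `api-key` header or Azure AD tokens rather than the `Authorization: Bearer` scheme.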
Multimodal API
GPT-4o, GPT-5, and o3/o4-mini accept images, audio, and files in the same API call:
```json
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
    ]
  }]
}
```

PDFs can be sent via the Files API and referenced in calls. The Realtime API enables low-latency audio input/output for voice applications.
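Producing the base64 data URL shown in the image example is a common stumbling point; a minimal sketch (the byte string here is a stand-in for real image data):

```python
import base64

def image_part(image_bytes, mime="image/jpeg"):
    """Wrap raw image bytes as the data-URL content part used by the
    multimodal Chat Completions message format."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}

content = [
    {"type": "text", "text": "What's in this image?"},
    image_part(b"\xff\xd8\xff\xe0placeholder-jpeg-bytes"),
]
```

For large images, sending a plain HTTPS URL in `image_url` instead of inlining base64 keeps request payloads small.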
Batch API
For non-time-sensitive workloads (data labelling, offline analysis, bulk processing), the Batch API offers:
- 50% cost reduction on all models
- Results returned within 24 hours
- No rate limit concerns
- Submit up to 50,000 requests per batch file
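Batch submissions are uploaded as a JSONL file: one JSON object per line, each carrying a `custom_id` (to match results back to inputs), the HTTP method, the target endpoint, and the request body. A sketch of building that file content:

```python
import json

def batch_lines(prompts, model="gpt-4o-mini"):
    """Serialise prompts into Batch API JSONL: one request per line."""
    lines = []
    for i, user_msg in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",          # echoed back in the results file
            "method": "POST",
            "url": "/v1/chat/completions",    # endpoint each request targets
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": user_msg}],
            },
        }))
    return "\n".join(lines)

jsonl = batch_lines(["Label this review: great!", "Label this review: awful."])
```

The file is then uploaded via the Files API with `purpose="batch"` and referenced when creating the batch job; results arrive as a matching JSONL file keyed by `custom_id`.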
Checklist: Do You Understand This?
- What is the difference between Chat Completions API and the Responses API?
- How does tool use (function calling) work at a high level?
- What are the two rate limit dimensions you need to manage?
- What is the key advantage of prompt caching and how do you maximise it?
- Name three reasons to choose Azure OpenAI over direct OpenAI API access.
- When should you use the Batch API?