🧠 All Things AI
Intermediate

OpenAI API & Azure OpenAI

OpenAI's platform is the most widely used AI API in production. Understanding its model families, API surfaces, pricing tiers, and enterprise options (Azure OpenAI) is essential for any builder working in the AI ecosystem.

Model Families

Family      | Models                    | Purpose                                                     | Input cost (per 1M tokens unless noted)
GPT-4o mini | gpt-4o-mini               | Fast, cheap, good for simple tasks and classification       | $0.15
GPT-4o      | gpt-4o, gpt-4o-2024-11-20 | Balanced flagship — multimodal, fast, broad capability      | $2.50
GPT-5       | gpt-5                     | Frontier general model — 400K context, lowest hallucination | $15.00
o4-mini     | o4-mini                   | Cost-efficient reasoning — maths, code, logic               | $1.10
o3          | o3                        | Full reasoning with tool use — hardest tasks                | $10.00
Whisper     | whisper-1                 | Speech-to-text transcription                                | $0.006/min
TTS         | tts-1, tts-1-hd           | Text-to-speech generation                                   | $15–30 per 1M characters
DALL-E 3    | dall-e-3                  | Image generation (being replaced by GPT Image 1)            | $0.04–0.12/image

API Surface: Chat Completions vs Responses API

Chat Completions API

The original and most widely used endpoint. Stateless — every call is independent. You manage conversation history yourself by appending messages.

POST /v1/chat/completions
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain RAG in one paragraph."}
  ],
  "temperature": 0.7
}
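Because the endpoint is stateless, multi-turn conversation means maintaining the `messages` list yourself and resending it in full on every call. A minimal sketch of that bookkeeping (the HTTP call itself is stubbed out; the payload fields follow the request shape above):

```python
# Maintain conversation history for the stateless Chat Completions API.
# The actual network call is stubbed; in real code you would POST the
# payload to /v1/chat/completions with your API key.

def build_payload(history, model="gpt-4o", temperature=0.7):
    """Assemble the request body from the running message history."""
    return {"model": model, "messages": list(history), "temperature": temperature}

history = [{"role": "system", "content": "You are a helpful assistant."}]

# Turn 1: append the user message, send, then append the assistant reply.
history.append({"role": "user", "content": "Explain RAG in one paragraph."})
payload = build_payload(history)
assistant_reply = "RAG combines retrieval with generation..."  # would come from the API
history.append({"role": "assistant", "content": assistant_reply})

# Turn 2: the full history goes back up with every call.
history.append({"role": "user", "content": "Now in one sentence."})
payload = build_payload(history)
print(len(payload["messages"]))  # 4: system, user, assistant, user
```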

Responses API (2025)

Newer API surface replacing the Assistants API. Supports multi-turn conversations, built-in tool use, file search, and web browsing as first-class primitives. Simpler than the Assistants API but with more capability than raw Chat Completions.
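In contrast to Chat Completions, the server keeps conversation state, so follow-up turns chain to a prior response rather than resending the whole history. A sketch of what the request bodies might look like; the field names (`input`, `previous_response_id`) and the response id are illustrative of the documented shape, so verify against the current API reference:

```python
# Hypothetical sketch of Responses API payloads. The response id below is a
# placeholder standing in for the id returned by the first call.

first_call = {
    "model": "gpt-4o",
    "input": "Explain RAG in one paragraph.",
}

# Follow-up turn: reference the previous response instead of resending history.
follow_up = {
    "model": "gpt-4o",
    "input": "Now compress that to one sentence.",
    "previous_response_id": "resp_abc123",  # placeholder id
}
```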

Tool Use (Function Calling)

Tool use (function calling) lets you define functions the model can invoke:

{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get weather for a city",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
      }
    }
  }]
}

When the model decides to call a tool, it returns a structured JSON call in the response. Your code executes the function and returns results; the model incorporates them in its next response.
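That round trip can be sketched as follows. The model call itself is simulated; the `tool_calls` shape mirrors the Chat Completions response format, and the weather function is a stub for illustration:

```python
import json

# Local implementation the model is allowed to invoke.
def get_weather(city: str) -> dict:
    # In practice this would call a real weather service; stubbed here.
    return {"city": city, "temp_c": 18}

TOOLS = {"get_weather": get_weather}

# Simulated assistant message containing a structured tool call.
model_message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
    }],
}

# Execute each requested tool and append results as `tool` messages,
# which go back to the model on the next call.
tool_messages = []
for call in model_message["tool_calls"]:
    fn = TOOLS[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
    result = fn(**args)
    tool_messages.append({
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(result),
    })
```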

Rate Limits and Handling 429s

OpenAI rate limits are applied per tier across two dimensions:

  • RPM (Requests Per Minute) — number of API calls
  • TPM (Tokens Per Minute) — total tokens processed

Tier limits increase with monthly spend. Production best practices:

  • Implement exponential backoff with jitter on 429 errors
  • Use the Retry-After header when provided
  • Distribute load across multiple API keys / org accounts if hitting limits
  • Prefer the Batch API for non-time-sensitive workloads (its queue sits outside the standard rate limits)
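The backoff recommendation above can be sketched as a "full jitter" strategy: wait a random amount between zero and an exponentially growing cap. The base delay and cap below are illustrative choices, not official values:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: a random delay between 0 and
    min(cap, base * 2**attempt) before retrying a 429."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# A retry loop would look roughly like:
#   for attempt in range(max_retries):
#       resp = call_api()
#       if resp.status_code != 429:
#           break
#       # Honour Retry-After when the server provides it, else back off.
#       delay = float(resp.headers.get("Retry-After", backoff_delay(attempt)))
#       time.sleep(delay)

for attempt in range(5):
    d = backoff_delay(attempt)
    assert 0 <= d <= min(60.0, 2 ** attempt)
```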

Prompt Caching

OpenAI automatically caches repeated prompt prefixes. If your system prompt or static context is identical across calls, the cached portion of input tokens is billed at a 50% discount. Effective for:

  • Long system prompts used across all calls
  • Few-shot examples prepended to every request
  • Static document context shared across many queries
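Since caching keys on the prompt prefix, the practical rule is: static content first, per-request content last. A sketch of structuring requests that way (the system prompt and few-shot examples are placeholders):

```python
# Structure every request as [static prefix] + [dynamic suffix] so repeated
# calls share the same cacheable prefix. All content strings are placeholders.

STATIC_SYSTEM = {"role": "system", "content": "You are a support bot. (long, unchanging instructions)"}
FEW_SHOT = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]

def build_messages(user_query: str) -> list:
    # Identical prefix on every call -> eligible for the cached-input discount.
    return [STATIC_SYSTEM, *FEW_SHOT, {"role": "user", "content": user_query}]

a = build_messages("How do I reset my password?")
b = build_messages("Where is my invoice?")
assert a[:-1] == b[:-1]  # shared static prefix across requests
```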

Azure OpenAI

Microsoft's Azure OpenAI Service provides access to OpenAI models (GPT-4o, GPT-5, o3, DALL-E, Whisper) via Azure's cloud infrastructure with enterprise guarantees:

Reasons to use Azure OpenAI

  • Your organisation already uses Azure and Azure AD
  • HIPAA, SOC2, FedRAMP compliance needed
  • Data must stay in a specific Azure region
  • Private endpoint / VNet isolation required
  • Existing Azure enterprise agreement pricing

Tradeoffs vs direct OpenAI API

  • New models arrive later (days to weeks lag after OpenAI release)
  • More complex provisioning (deployments per model per region)
  • Rate limits are per deployment, harder to scale quickly
  • Some newer API features not yet available on Azure
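The calling convention also differs from the direct API: requests target a named deployment on your Azure resource's endpoint and authenticate with an `api-key` header rather than a bearer token. A sketch of the URL construction, where resource name, deployment name, and API version are all placeholders:

```python
# Azure OpenAI routes requests to a *deployment* you create per model per
# region, rather than to a global model name. All values are placeholders.

def azure_chat_url(resource: str, deployment: str, api_version: str) -> str:
    return (
        f"https://{resource}.openai.azure.com"
        f"/openai/deployments/{deployment}/chat/completions"
        f"?api-version={api_version}"
    )

url = azure_chat_url("my-resource", "gpt-4o-prod", "2024-10-21")
headers = {"api-key": "<AZURE_OPENAI_KEY>"}  # instead of Authorization: Bearer
```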

Multimodal API

GPT-4o, GPT-5, and o3/o4-mini accept images, audio, and files in the same API call:

{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
    ]
  }]
}

PDFs can be sent via the Files API and referenced in calls. The Realtime API enables low-latency audio input/output for voice applications.
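Building the base64 data URL for the image part shown above can be sketched as a small helper (the JPEG bytes here are fake, for illustration only):

```python
import base64

def image_part(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Encode raw image bytes into the image_url content-part shape."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

part = image_part(b"\xff\xd8\xff\xe0fake-jpeg-bytes")  # placeholder bytes
message = {
    "role": "user",
    "content": [{"type": "text", "text": "What's in this image?"}, part],
}
```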

Batch API

For non-time-sensitive workloads (data labelling, offline analysis, bulk processing), the Batch API offers:

  • 50% cost reduction on all models
  • Results returned within 24 hours
  • No rate limit concerns
  • Submit up to 50,000 requests per batch file
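Batch input is a JSONL file, one request per line, each tagged with a `custom_id` so results can be matched back. A sketch of building that file's contents (ids and prompts are placeholders; the per-line envelope reflects the documented batch format, so check the current reference before relying on it):

```python
import json

# Each line of the batch input file is an independent request envelope.
requests = [
    {
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": text}],
        },
    }
    for i, text in enumerate(["Label this: great product!", "Label this: arrived broken"])
]

jsonl = "\n".join(json.dumps(r) for r in requests)
# The resulting file would be uploaded via the Files API (purpose="batch")
# and referenced when creating the batch job.
```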

Checklist: Do You Understand This?

  • What is the difference between Chat Completions API and the Responses API?
  • How does tool use (function calling) work at a high level?
  • What are the two rate limit dimensions you need to manage?
  • What is the key advantage of prompt caching and how do you maximise it?
  • Name three reasons to choose Azure OpenAI over direct OpenAI API access.
  • When should you use the Batch API?