🧠 All Things AI
Intermediate

OpenAI API & Azure OpenAI

OpenAI's platform is the most widely used AI API in production. Understanding its model families, API surfaces, pricing tiers, and enterprise options (Azure OpenAI) is essential for any builder working in the AI ecosystem.

Model Families

Family      | Models                    | Purpose                                                     | Input cost (per 1M tokens unless noted)
GPT-4o mini | gpt-4o-mini               | Fast, cheap, good for simple tasks and classification       | $0.15
GPT-4o      | gpt-4o, gpt-4o-2024-11-20 | Balanced flagship — multimodal, fast, broad capability      | $2.50
GPT-5       | gpt-5                     | Frontier general model — 400K context, lowest hallucination | $15.00
o4-mini     | o4-mini                   | Cost-efficient reasoning — maths, code, logic               | $1.10
o3          | o3                        | Full reasoning with tool use — hardest tasks                | $10.00
Whisper     | whisper-1                 | Speech-to-text transcription                                | $0.006/min
TTS         | tts-1, tts-1-hd           | Text-to-speech generation                                   | $15–30 per 1M characters
DALL-E 3    | dall-e-3                  | Image generation (being replaced by GPT Image 1)            | $0.04–0.12/image

API Surface: Chat Completions vs Responses API

Chat Completions API

The original and most widely used endpoint. Stateless — every call is independent. You manage conversation history yourself by appending messages.

POST /v1/chat/completions
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain RAG in one paragraph."}
  ],
  "temperature": 0.7
}
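Because the endpoint is stateless, multi-turn conversation means maintaining the `messages` list yourself and resending it in full on every call. A minimal sketch of that bookkeeping (the HTTP call itself is stubbed out; the payload fields follow the request shape above):

```python
# Maintain conversation history for the stateless Chat Completions API.
# The actual network call is stubbed; in real code you would POST the
# payload to /v1/chat/completions with your API key.

def build_payload(history, model="gpt-4o", temperature=0.7):
    """Assemble the request body from the running message history."""
    return {"model": model, "messages": list(history), "temperature": temperature}

history = [{"role": "system", "content": "You are a helpful assistant."}]

# Turn 1: append the user message, send, then append the assistant reply.
history.append({"role": "user", "content": "Explain RAG in one paragraph."})
payload = build_payload(history)
assistant_reply = "RAG combines retrieval with generation..."  # would come from the API
history.append({"role": "assistant", "content": assistant_reply})

# Turn 2: the full history goes back up with every call.
history.append({"role": "user", "content": "Now in one sentence."})
payload = build_payload(history)
print(len(payload["messages"]))  # 4: system, user, assistant, user
```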

Responses API (2025)

Newer API surface replacing the Assistants API. Supports multi-turn conversations, built-in tool use, file search, and web browsing as first-class primitives. Simpler than the Assistants API but with more capability than raw Chat Completions.
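In contrast to Chat Completions, the server keeps conversation state, so follow-up turns chain to a prior response rather than resending the whole history. A sketch of what the request bodies might look like; the field names (`input`, `previous_response_id`) and the response id are illustrative of the documented shape, so verify against the current API reference:

```python
# Hypothetical sketch of Responses API payloads. The response id below is a
# placeholder standing in for the id returned by the first call.

first_call = {
    "model": "gpt-4o",
    "input": "Explain RAG in one paragraph.",
}

# Follow-up turn: reference the previous response instead of resending history.
follow_up = {
    "model": "gpt-4o",
    "input": "Now compress that to one sentence.",
    "previous_response_id": "resp_abc123",  # placeholder id
}
```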

Tool Use (Function Calling)

Tool use (function calling) lets you define functions the model can invoke:

{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get weather for a city",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
      }
    }
  }]
}

When the model decides to call a tool, it returns a structured JSON call in the response. Your code executes the function and returns results; the model incorporates them in its next response.
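That round trip can be sketched as follows. The model call itself is simulated; the `tool_calls` shape mirrors the Chat Completions response format, and the weather function is a stub for illustration:

```python
import json

# Local implementation the model is allowed to invoke.
def get_weather(city: str) -> dict:
    # In practice this would call a real weather service; stubbed here.
    return {"city": city, "temp_c": 18}

TOOLS = {"get_weather": get_weather}

# Simulated assistant message containing a structured tool call.
model_message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
    }],
}

# Execute each requested tool and append results as `tool` messages,
# which go back to the model on the next call.
tool_messages = []
for call in model_message["tool_calls"]:
    fn = TOOLS[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
    result = fn(**args)
    tool_messages.append({
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(result),
    })
```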

Rate Limits and Handling 429s

OpenAI rate limits are applied per tier across two dimensions:

  • RPM (Requests Per Minute) — number of API calls
  • TPM (Tokens Per Minute) — total tokens processed

Tier limits increase with monthly spend. Production best practices:

  • Implement exponential backoff with jitter on 429 errors
  • Use the Retry-After header when provided
  • Distribute load across multiple API keys / org accounts if hitting limits
  • Prefer the Batch API for non-time-sensitive workloads (its queue sits outside the standard rate limits)
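The backoff recommendation above can be sketched as a "full jitter" strategy: wait a random amount between zero and an exponentially growing cap. The base delay and cap below are illustrative choices, not official values:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: a random delay between 0 and
    min(cap, base * 2**attempt) before retrying a 429."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# A retry loop would look roughly like:
#   for attempt in range(max_retries):
#       resp = call_api()
#       if resp.status_code != 429:
#           break
#       # Honour Retry-After when the server provides it, else back off.
#       delay = float(resp.headers.get("Retry-After", backoff_delay(attempt)))
#       time.sleep(delay)

for attempt in range(5):
    d = backoff_delay(attempt)
    assert 0 <= d <= min(60.0, 2 ** attempt)
```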

Prompt Caching

OpenAI automatically caches repeated prompt prefixes. If your system prompt or static context is identical across calls, the cached portion of input tokens is billed at a 50% discount. Effective for:

  • Long system prompts used across all calls
  • Few-shot examples prepended to every request
  • Static document context shared across many queries
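Since caching keys on the prompt prefix, the practical rule is: static content first, per-request content last. A sketch of structuring requests that way (the system prompt and few-shot examples are placeholders):

```python
# Structure every request as [static prefix] + [dynamic suffix] so repeated
# calls share the same cacheable prefix. All content strings are placeholders.

STATIC_SYSTEM = {"role": "system", "content": "You are a support bot. (long, unchanging instructions)"}
FEW_SHOT = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]

def build_messages(user_query: str) -> list:
    # Identical prefix on every call -> eligible for the cached-input discount.
    return [STATIC_SYSTEM, *FEW_SHOT, {"role": "user", "content": user_query}]

a = build_messages("How do I reset my password?")
b = build_messages("Where is my invoice?")
assert a[:-1] == b[:-1]  # shared static prefix across requests
```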

Azure OpenAI

Microsoft's Azure OpenAI Service provides access to OpenAI models (GPT-4o, GPT-5, o3, DALL-E, Whisper) via Azure's cloud infrastructure with enterprise guarantees:

Reasons to use Azure OpenAI

  • Your organisation already uses Azure and Azure AD
  • HIPAA, SOC2, FedRAMP compliance needed
  • Data must stay in a specific Azure region
  • Private endpoint / VNet isolation required
  • Existing Azure enterprise agreement pricing

Tradeoffs vs direct OpenAI API

  • New models arrive later (days to weeks lag after OpenAI release)
  • More complex provisioning (deployments per model per region)
  • Rate limits are per deployment, harder to scale quickly
  • Some newer API features not yet available on Azure
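The calling convention also differs from the direct API: requests target a named deployment on your Azure resource's endpoint and authenticate with an `api-key` header rather than a bearer token. A sketch of the URL construction, where resource name, deployment name, and API version are all placeholders:

```python
# Azure OpenAI routes requests to a *deployment* you create per model per
# region, rather than to a global model name. All values are placeholders.

def azure_chat_url(resource: str, deployment: str, api_version: str) -> str:
    return (
        f"https://{resource}.openai.azure.com"
        f"/openai/deployments/{deployment}/chat/completions"
        f"?api-version={api_version}"
    )

url = azure_chat_url("my-resource", "gpt-4o-prod", "2024-10-21")
headers = {"api-key": "<AZURE_OPENAI_KEY>"}  # instead of Authorization: Bearer
```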

Multimodal API

GPT-4o, GPT-5, and o3/o4-mini accept images, audio, and files in the same API call:

{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
    ]
  }]
}

PDFs can be sent via the Files API and referenced in calls. The Realtime API enables low-latency audio input/output for voice applications.
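Building the base64 data URL for the image part shown above can be sketched as a small helper (the JPEG bytes here are fake, for illustration only):

```python
import base64

def image_part(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Encode raw image bytes into the image_url content-part shape."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

part = image_part(b"\xff\xd8\xff\xe0fake-jpeg-bytes")  # placeholder bytes
message = {
    "role": "user",
    "content": [{"type": "text", "text": "What's in this image?"}, part],
}
```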

Batch API

For non-time-sensitive workloads (data labelling, offline analysis, bulk processing), the Batch API offers:

  • 50% cost reduction on all models
  • Results returned within 24 hours
  • No rate limit concerns
  • Submit up to 50,000 requests per batch file
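Batch input is a JSONL file, one request per line, each tagged with a `custom_id` so results can be matched back. A sketch of building that file's contents (ids and prompts are placeholders; the per-line envelope reflects the documented batch format, so check the current reference before relying on it):

```python
import json

# Each line of the batch input file is an independent request envelope.
requests = [
    {
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": text}],
        },
    }
    for i, text in enumerate(["Label this: great product!", "Label this: arrived broken"])
]

jsonl = "\n".join(json.dumps(r) for r in requests)
# The resulting file would be uploaded via the Files API (purpose="batch")
# and referenced when creating the batch job.
```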

Checklist: Do You Understand This?

  • What is the difference between Chat Completions API and the Responses API?
  • How does tool use (function calling) work at a high level?
  • What are the two rate limit dimensions you need to manage?
  • What is the key advantage of prompt caching and how do you maximise it?
  • Name three reasons to choose Azure OpenAI over direct OpenAI API access.
  • When should you use the Batch API?