Image Generation
OpenAI's image generation has undergone a fundamental architectural shift. What began as DALL-E — a separate model you called via a tool — has evolved into native image generation built directly into GPT-4o. This integration changes both the quality ceiling and the interaction model for creating images with AI.
DALL-E 3 (Legacy)
DALL-E 3 was OpenAI's previous image generation model, available in ChatGPT and via the API. It supported three resolutions: 1024×1024 (square), 1792×1024 (landscape), and 1024×1792 (portrait). It offered two style presets: "natural" (more photographic) and "vivid" (more saturated and stylised).
DALL-E 3's primary weakness was text rendering — words appearing in generated images were frequently misspelled, poorly formed, or illegible. Faces and hands were inconsistent, and complex multi-element compositions often lost coherence at the edges.
DALL-E 3 was deprecated and shut down on May 2, 2026. New work should use GPT-4o native image generation.
GPT-4o Native Image Generation (Current)
Launched in March 2025, GPT-4o native image generation represents a fundamental architectural change: image generation is no longer a separate model that GPT calls. Instead, it is an intrinsic capability of GPT-4o itself — the same model that understands your text also generates the image, with full comprehension of the prompt's meaning rather than a simplified text representation passed to a separate system.
This integration produces several measurable improvements over DALL-E 3:
Key Improvements
- Accurate text rendering: Words, labels, and signs in images are correctly spelled and legible
- Near-photorealistic faces: Consistent, realistic facial anatomy with proper proportions
- Correct hand anatomy: The longstanding "extra fingers" problem is largely resolved
- Realistic lighting physics: Shadows, reflections, and light sources behave correctly
- Complex multi-element prompts: Maintains coherence across all elements of a detailed scene
Iterative Refinement
Because image generation is native to the conversation model, you can refine images through natural dialogue: "Make the background darker", "Change the shirt to blue", "Add a coffee cup to the table." Each edit builds on the previous image without starting over — a workflow that was awkward with DALL-E 3's separate tool architecture.
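The refinement loop above can be sketched in code. This is an illustrative outline, not an authoritative implementation: it assumes the `openai` Python SDK's `images.generate` and `images.edit` calls and the `gpt-image-1` model name, and the exact parameters may differ from the current API reference.

```python
# Sketch of iterative refinement: generate once, then apply a sequence of
# edit instructions, each building on the previous image. API calls are
# commented out so the scaffold runs without credentials.
import base64
import io


def refinement_steps(initial_prompt: str, edits: list[str]) -> list[str]:
    """Return the ordered prompts a refinement session would send:
    the initial generation prompt followed by each edit instruction."""
    return [initial_prompt] + edits


# Usage (requires OPENAI_API_KEY in the environment):
# from openai import OpenAI
# client = OpenAI()
# prompts = refinement_steps(
#     "A wooden table by a window, soft morning light",
#     ["Make the background darker", "Add a coffee cup to the table"])
# result = client.images.generate(model="gpt-image-1", prompt=prompts[0])
# for edit in prompts[1:]:
#     result = client.images.edit(
#         model="gpt-image-1",
#         image=io.BytesIO(base64.b64decode(result.data[0].b64_json)),
#         prompt=edit)
```

Each edit call receives the previous output as its input image, which is what lets a change like "make the background darker" preserve the rest of the scene.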
Availability
GPT-4o native image generation is available in ChatGPT on Plus and Pro plans. Free plan users receive limited image generation credits. Via the API, it is exposed through the Images API as the gpt-image-1 model rather than through the chat completions endpoint.
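A minimal sketch of an API call, assuming the `openai` Python SDK (1.x) and the `gpt-image-1` model; the helper names here are our own, and the response shape (base64 payload in `data[0].b64_json`) should be checked against the current API reference.

```python
# Build the request parameters for an Images API generation call and
# decode the base64 image the API returns. The network call itself is
# commented out so the helpers run without an API key.
import base64


def build_image_request(prompt: str, size: str = "1024x1024") -> dict:
    """Assemble keyword arguments for client.images.generate()."""
    return {"model": "gpt-image-1", "prompt": prompt, "size": size}


def save_image(b64_data: str, path: str) -> None:
    """Decode a base64 image payload and write it to disk."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_data))


# Usage (requires OPENAI_API_KEY in the environment):
# from openai import OpenAI
# client = OpenAI()
# response = client.images.generate(**build_image_request(
#     "A street sign that reads 'Market Street', photorealistic"))
# save_image(response.data[0].b64_json, "sign.png")
```

Keeping the request assembly in a small helper makes it easy to swap sizes or prompts programmatically, for example when batch-generating social media assets.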
Use Cases
- Blog and article illustrations: Generate relevant images with readable embedded text (headings, captions, labels)
- UI mockups: Create rough interface layouts, wireframe-style visuals, or polished UI screenshots
- Diagrams and infographics: Architecture diagrams, process flows, and labelled charts where text legibility matters
- Social media assets: Branded visuals, quote cards, event announcements with correctly rendered text
- Product visualisation: Concept renders, packaging mockups, scenario illustrations
Limitations
Despite significant improvements, limitations remain:
- Real people policy: OpenAI's policy prohibits generating photorealistic images of real, identifiable individuals without consent. Public figures and celebrities are restricted.
- Complex geometry: Scenes involving precise spatial relationships (architectural blueprints, technical engineering drawings) can still have proportional errors.
- Consistency across generations: Characters and objects may look slightly different between separate generation calls — not suitable for storyboards requiring strict visual consistency without additional tooling.
Checklist
- What is the fundamental architectural difference between DALL-E 3 and GPT-4o native image generation?
- Name two specific quality improvements in GPT-4o native images over DALL-E 3.
- How does iterative refinement work in the ChatGPT image generation experience?
- When was DALL-E 3 deprecated, and what replaced it?
- What policy restriction affects generating images of real people?