What's New in AI
A curated timeline of the milestones that actually changed how AI is built and used — from early 2024 through mid-2026. Not every model release; the ones that shifted the industry or changed what practitioners could do.
Last updated: May 2026
2026 (so far)
Devstral Small — agentic coding at $0.07/1M input
Mistral releases Devstral Small — an agentic coding specialist at $0.07/$0.28 per 1M tokens. Outperforms Codestral Small on SWE-bench agentic tasks, making economically viable coding agents accessible for high-volume loops.
Qwen3 — thinking-mode toggle across open-weight models
Alibaba releases Qwen3 family (0.6B–235B MoE, Apache 2.0). All models support a thinking/non-thinking mode toggle per request. Qwen3 Coder 480B MoE targets software engineering. Strong multilingual support (29 languages). Extends Chinese AI labs' influence on open-weight ecosystem.
GPT-5.5 and DeepSeek V4 Preview
GPT-5.5 (April 23) and DeepSeek V4-Pro (April 24, 1.6T total params, MIT license) arrived within 24 hours of each other — emblematic of 2026's pace. V4-Pro's permissive licensing continued the open-weight momentum DeepSeek started in January 2025.
Claude 4 family: Haiku 4.5, Sonnet 4.6, Opus 4.7
Anthropic's fourth generation. Opus 4.7 launched April 16 — the most capable Claude to date. Sonnet 4.6 became the go-to production model: mid-tier cost, frontier-class reasoning for most tasks. The 4.x family raised the bar on agentic and multi-step tasks.
Graphify — codebase knowledge graphs go open-source
An MIT-licensed tool that turns any codebase into a queryable knowledge graph using local Tree-sitter parsing (no source code sent to external servers). 71.5× token reduction on real codebases. Crossed 22,000 GitHub stars in under ten days.
Claude Code Routines — scheduled AI automation
Anthropic added three trigger types to Claude Code: scheduled (cron-like), API-triggered, and GitHub event-triggered. AI-assisted automation moved from manual prompting to persistent background workflows.
Grok 4.3 — xAI flagship with 1M context window
xAI releases Grok 4.3 at $1.25/$2.50/1M tokens with a 1M token context window — one of the largest of any commercial model. $150/month free developer credits available. Live X/Twitter search integration remains Grok's unique differentiator.
Llama 4 Scout & Maverick — open-weight multimodal MoE
Meta releases Llama 4 Scout (109B MoE, 10M context) and Maverick (400B MoE, 17B active). Apache 2.0 licensed, natively multimodal, matching GPT-4o on major benchmarks at a fraction of inference cost. Scout fits a single H100. The open vs closed gap narrowed significantly.
Gemini 3.1 Pro leads scientific reasoning
94.3% on GPQA Diamond — a PhD-level science benchmark. 77.1% on ARC-AGI-2. Google's Gemini family regained benchmark leadership in the first quarter, intensifying the rotation between labs at the top of the leaderboards.
2025
GPT-5 integrated into ChatGPT
OpenAI's most capable model yet, with major improvements in multi-step reasoning, scientific problem-solving, and multi-modal understanding. Shipped directly into ChatGPT, making frontier capability accessible to all subscribers.
LazyGraphRAG — knowledge graphs become affordable
Microsoft's LazyGraphRAG reduced knowledge graph indexing cost from $20–500 to under $5 by deferring community summaries to query time. Graph RAG moved from a research technique to a practical production tool.
GPT-4.1 family — OpenAI's mid-range tier
GPT-4.1, Mini, and Nano gave developers a coherent OpenAI lineup matching the tiered structure Anthropic popularised. The cost gap between tiers widened: GPT-4.1 Nano at sub-$0.10/1M tokens vs GPT-4.1 at $2.00/1M.
Meta Llama 4 — natively multimodal MoE
Llama 4 Scout and Maverick used Mixture-of-Experts architecture (activating only a fraction of parameters per token) to deliver competitive performance at dramatically lower inference cost. Maverick beat GPT-4o on multiple benchmarks. Behemoth (2T total params) signalled the raw scale Meta was willing to deploy.
Gemini 2.5 Pro leads benchmarks
Google's Gemini 2.5 Pro with 'thinking budget' took top positions on coding and reasoning leaderboards. The 'thinking budget' feature let developers trade cost against reasoning depth — a new knob for production optimization.
Claude 3.7 Sonnet — hybrid reasoning
Anthropic's first model with an explicit 'extended thinking' mode — letting users toggle the reasoning depth. Topped SWE-bench for autonomous software engineering tasks. The practitioner shift toward reasoning models accelerated.
DeepSeek R1 — the 'DeepSeek shock'
A Chinese lab released an open-weight reasoning model trained for approximately $6M — matching OpenAI o1 on coding and maths. It became the #1 free app on the US iOS App Store within days. The narrative that frontier AI required billions in training compute collapsed overnight.
2024
OpenAI o3, Gemini 2.0, DeepSeek V3
A landmark month: o3 achieved near-human scores on ARC-AGI; Gemini 2.0 Flash matched o1 performance at lower cost; DeepSeek V3 quietly shipped as a top-tier open-weight model. The frontier moved dramatically in 30 days.
Anthropic open-sources MCP
The Model Context Protocol defined how AI assistants connect to external tools and data sources — files, databases, APIs, services. Within weeks, hundreds of MCP servers appeared. MCP became the de facto standard for AI tool integration in 2025.
Llama 3.2 adds vision and mobile
Meta added visual understanding to Llama and released models small enough to run on smartphones. On-device AI became practical for the first time on consumer hardware.
OpenAI o1 — reasoning models arrive
o1 allocated compute to 'thinking before answering' — generating an internal chain of reasoning invisible to users. It scored above PhD level on maths and science benchmarks. Test-time compute emerged as a new design dimension.
Claude 3.5 Sonnet surpasses Opus
A mid-tier model beating the flagship at lower cost. This broke the assumption that 'best quality = most expensive.' Cost-performance optimisation became a primary engineering concern from this point forward.
GPT-4o released
Faster, cheaper, and natively multimodal — audio, text, and images in a single model. Real-time voice mode with natural interruptions previewed a new tier of human-AI interaction. Speed and cost dropped 2× vs GPT-4 Turbo.
Meta releases Llama 3
Llama 3 70B matched proprietary models on most benchmarks. The era of 'open-weight models can compete with closed models' began in earnest. Developers gained a credible alternative to API-only providers.
Anthropic launches Claude 3 (Haiku, Sonnet, Opus)
The first coherent model family with clear capability/cost/speed tiers. Opus topped GPT-4 on multiple benchmarks. The tiered release became the template every lab copied in 2024–2025.
OpenAI debuts Sora
Sora produced photorealistic video from text prompts at a quality that shocked the industry. It signalled that generative AI's next frontier was temporal media, not just static images.