🧠 All Things AI
Intermediate

Google AI Studio & Vertex AI

Google provides AI models through two distinct platforms: AI Studio (fast, free tier, prototyping) and Vertex AI (enterprise, production, compliance). Both serve the Gemini model family, which leads on context window size and multimodal capability.

AI Studio vs Vertex AI

Google AI Studio (aistudio.google.com)

  • Free tier with generous daily limits
  • Simple API key authentication
  • Instant access — no provisioning
  • Best for prototyping, exploration, and development
  • Data used to improve Google models (unless you opt out)
  • Google Gemini SDK / OpenAI-compatible endpoint available
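
The API-key flow can be sketched as a plain REST call. The endpoint below is the public `generateContent` endpoint; the model name and the `GEMINI_API_KEY` environment variable are assumptions for illustration:

```python
import json
import os
import urllib.request

# AI Studio uses a plain API key -- no GCP project or IAM needed.
API_KEY = os.environ.get("GEMINI_API_KEY", "YOUR_API_KEY")
MODEL = "gemini-2.5-flash"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL}:generateContent?key={API_KEY}"
)

def build_request(prompt: str) -> dict:
    """Build the minimal generateContent payload: one user turn."""
    return {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}

def ask(prompt: str) -> str:
    """Send the prompt and return the first candidate's text."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["candidates"][0]["content"]["parts"][0]["text"]
```

The same key also works against the OpenAI-compatible endpoint mentioned above, but the native payload shape is the one shown here.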

Vertex AI

  • Enterprise managed platform; pay-per-use
  • GCP IAM authentication + service accounts
  • Data not used for model training
  • SOC2, HIPAA, ISO 27001 compliance
  • VPC private endpoints, data residency controls
  • Supports all Gemini models + fine-tuning + model garden (3P models)

Gemini Model Family

Model                        Context      Input ($/1M)                    Strengths
Gemini 2.5 Flash             1M tokens    $0.15 (short) / $0.40 (long)    Best price/performance; fast; 1M context
Gemini 2.5 Pro               1M tokens    $1.25 (short) / $2.50 (long)    Top coding + maths benchmarks; thinking mode; multimodal
Gemini 1.5 Flash (legacy)    1M tokens    $0.075                          Stable, fast, very cheap for high-volume tasks

Note: "short" pricing applies to prompts below the model's long-context threshold and "long" pricing above it. The threshold is per model: 200K tokens for Gemini 2.5 Pro, 128K for the 1.5 generation. For 1M-token use cases, plan for the higher rate.
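
The tiered billing can be made concrete with a small calculator. The rates below come from the table above; the threshold is passed in as a parameter since the cutoff varies by model, and output-token cost is omitted for simplicity:

```python
def input_cost(tokens: int, short_rate: float, long_rate: float,
               threshold: int) -> float:
    """Input cost in USD for one call: the entire prompt is billed
    at the short or long rate depending on whether its length
    crosses the model's long-context threshold."""
    rate = short_rate if tokens <= threshold else long_rate
    return tokens * rate / 1_000_000

# Gemini 2.5 Pro rates from the table ($1.25 short / $2.50 long),
# with the model's documented long-context cutoff as the threshold:
cost = input_cost(1_000_000, 1.25, 2.50, threshold=200_000)  # -> 2.5
```

Note the cliff: a prompt that crosses the threshold is billed entirely at the long rate, not just the tokens past the boundary.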

1M Token Context in Practice

Gemini 2.5's 1M token context window is among the largest offered by a major commercial API. What this enables:

  • Entire codebase analysis — Load 50K+ lines of code in a single call and ask architectural questions across the full context
  • Large document processing — A 1M-token context fits roughly 700K words of text (well over 1,000 typical pages); entire books or document collections fit in one call
  • Multi-document reasoning — Feed hundreds of PDFs simultaneously and ask questions that require synthesising across them
  • Long video analysis — Up to ~1 hour of video can be processed as frames in the context
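
Whether a workload fits can be estimated with the common ~4 characters-per-token rule of thumb. This is a rough heuristic, not the model's actual tokenizer, and real counts vary with language and content:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 chars/token heuristic."""
    return int(len(text) / chars_per_token)

def fits_in_context(texts: list[str], window: int = 1_000_000) -> bool:
    """True if the combined rough estimate fits the context window.

    Leave headroom in practice: the prompt, system instructions,
    and the model's output all share the same window.
    """
    return sum(estimate_tokens(t) for t in texts) <= window

# 50K lines of code at ~40 chars/line is roughly 500K tokens --
# comfortably inside a 1M window, with room left for questions.
```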

Cost consideration: A single 1M-token call to Gemini 2.5 Pro costs approximately $2.50 (input) + output tokens. For repeated queries over the same large document, consider context caching (see below) or RAG for documents queried many times.

Multimodal Inputs

Gemini natively handles multiple input types in a single API call:

  • Images — PNG, JPEG, GIF, WebP; up to 3,600 images per request
  • Video — MP4, MOV, AVI up to 1GB; frames sampled automatically
  • Audio — WAV, MP3, FLAC; models transcribe and reason over audio
  • Documents — PDF, text, HTML, CSV; processed natively
  • Code — All programming languages; strong code understanding

Mixing modalities in one call is first-class: e.g., "Here is a diagram [image], the related code [code], and a voice note explaining the context [audio] — what should I change?"
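
Mixed modalities are just entries in the same parts array. The sketch below builds an image-plus-text request using the REST API's inline_data base64 encoding; the PNG bytes would come from your own file:

```python
import base64

def image_part(png_bytes: bytes) -> dict:
    """Wrap raw image bytes as an inline_data part (base64-encoded).

    Inline data suits small files; larger media (e.g. long video)
    is typically uploaded separately and referenced instead.
    """
    return {
        "inline_data": {
            "mime_type": "image/png",
            "data": base64.b64encode(png_bytes).decode("ascii"),
        }
    }

def multimodal_request(prompt: str, png_bytes: bytes) -> dict:
    """One user turn mixing an image part and a text part."""
    return {
        "contents": [
            {"role": "user",
             "parts": [image_part(png_bytes), {"text": prompt}]}
        ]
    }
```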

Grounding with Google Search

Gemini can ground its responses in Google Search results in real time:

{
  "tools": [{"google_search": {}}],
  "contents": [{"role": "user", "parts": [{"text": "What happened in AI this week?"}]}]
}

When grounding is enabled, Gemini automatically issues search queries, retrieves results, and incorporates them into the response with citations. This is distinct from RAG — no vector database needed; it uses Google's live search index.

Deep Think (Thinking Mode)

Gemini 2.5 Pro supports a thinking mode analogous to OpenAI's o-series:

{
  "model": "gemini-2.5-pro",
  "generation_config": {
    "thinking_config": {
      "thinking_budget": 8192
    }
  }
}

The thinking budget controls how many tokens the model spends reasoning before answering. Gemini 2.5 Pro with thinking enabled performs competitively with o3 and DeepSeek-R1 on maths and coding benchmarks.
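
A helper that assembles the payload shown above might look like the following. This is a sketch of the raw request body, not the official SDK, using the same field names as the snippet:

```python
def thinking_request(model: str, prompt: str, budget: int) -> dict:
    """generateContent payload with an explicit thinking budget.

    budget caps the tokens the model may spend on internal
    reasoning before answering; larger budgets help on hard
    maths/coding problems at the cost of latency and billed
    thinking tokens.
    """
    if budget < 0:
        raise ValueError("thinking_budget must be >= 0")
    return {
        "model": model,
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generation_config": {
            "thinking_config": {"thinking_budget": budget}
        },
    }
```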

Context Caching

Gemini supports explicit context caching — you store large context on Google's servers and reference it by cache ID in subsequent calls:

  • Cache contents for 1 hour to 1 month
  • Cached tokens are billed at roughly a quarter of the fresh input-token rate (a separate per-token-hour storage fee applies while the cache lives)
  • Minimum 32K tokens to cache
  • Huge savings when the same large document is queried many times
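
Whether caching pays off can be sketched as a simple comparison, assuming the ~4× input-token discount and ignoring the separate cache storage fee (which favours shorter cache lifetimes and is not modeled here):

```python
def fresh_cost(doc_tokens: int, queries: int, rate_per_m: float) -> float:
    """Cost of resending the full document with every query."""
    return doc_tokens * queries * rate_per_m / 1_000_000

def cached_cost(doc_tokens: int, queries: int, rate_per_m: float,
                discount: float = 4.0) -> float:
    """Same queries reading the document from an explicit cache."""
    return doc_tokens * queries * (rate_per_m / discount) / 1_000_000

# 500K-token document, 20 queries, $1.25/1M input rate:
# fresh  -> 500_000 * 20 * 1.25 / 1e6 = $12.50
# cached -> $12.50 / 4                = $3.125
```

The more often the same large context is re-queried, the more the discount dominates the storage fee; for one-off queries, caching buys nothing.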

Vertex AI Production Setup

For production Vertex AI deployment:

  1. Enable the Vertex AI API in your GCP project
  2. Create a service account with roles/aiplatform.user
  3. Use Application Default Credentials or Workload Identity for authentication
  4. Select a regional endpoint (us-central1, europe-west4, etc.) for data residency
  5. Set up billing alerts — long-context calls can be expensive at scale
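
Steps 2-4 above can be sketched in Python. The google-auth library and the endpoint URL format are real; the project and region values are placeholders:

```python
# pip install google-auth   (Application Default Credentials)

def vertex_url(project: str, region: str, model: str) -> str:
    """Regional generateContent endpoint -- the region in the
    hostname must match the one in the path (data residency)."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/"
        f"publishers/google/models/{model}:generateContent"
    )

def bearer_headers() -> dict:
    """OAuth headers via Application Default Credentials (ADC).

    Locally this picks up `gcloud auth application-default login`;
    on GCP it uses the attached service account or Workload
    Identity -- no API keys involved.
    """
    import google.auth
    import google.auth.transport.requests

    creds, _project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    creds.refresh(google.auth.transport.requests.Request())
    return {
        "Authorization": f"Bearer {creds.token}",
        "Content-Type": "application/json",
    }
```

Unlike the AI Studio key-in-URL style, every Vertex request is authorised through IAM, which is what makes the compliance and audit controls above possible.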

Checklist: Do You Understand This?

  • What is the key difference between AI Studio and Vertex AI in terms of data and compliance?
  • What types of use cases justify the 1M token context window?
  • How does Grounding with Search differ from RAG?
  • What is context caching and when is it worth using?
  • How do you enable thinking mode (Deep Think) in the Gemini API?
  • What GCP authentication mechanism should you use for production Vertex AI?