🧠 All Things AI
Intermediate

Google AI Studio & Vertex AI

Google provides AI models through two distinct platforms: AI Studio (fast, free tier, prototyping) and Vertex AI (enterprise, production, compliance). Both serve the Gemini model family, which leads on context window size and multimodal capability.

AI Studio vs Vertex AI

Google AI Studio (aistudio.google.com)

  • Free tier with generous daily limits
  • Simple API key authentication
  • Instant access — no provisioning
  • Best for prototyping, exploration, and development
  • Data used to improve Google models (unless you opt out)
  • Google Gemini SDK / OpenAI-compatible endpoint available
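
The API-key flow can be sketched as a plain REST call. The endpoint below is the public `generateContent` endpoint; the model name and the `GEMINI_API_KEY` environment variable are assumptions for illustration:

```python
import json
import os
import urllib.request

# AI Studio uses a plain API key -- no GCP project or IAM needed.
API_KEY = os.environ.get("GEMINI_API_KEY", "YOUR_API_KEY")
MODEL = "gemini-2.5-flash"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL}:generateContent?key={API_KEY}"
)

def build_request(prompt: str) -> dict:
    """Build the minimal generateContent payload: one user turn."""
    return {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}

def ask(prompt: str) -> str:
    """Send the prompt and return the first candidate's text."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["candidates"][0]["content"]["parts"][0]["text"]
```

The same key also works against the OpenAI-compatible endpoint mentioned above, but the native payload shape is the one shown here.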

Vertex AI

  • Enterprise managed platform; pay-per-use
  • GCP IAM authentication + service accounts
  • Data not used for model training
  • SOC2, HIPAA, ISO 27001 compliance
  • VPC private endpoints, data residency controls
  • Supports all Gemini models + fine-tuning + model garden (3P models)

Gemini Model Family

Model                        Context      Input ($/1M)                    Strengths
Gemini 2.5 Flash             1M tokens    $0.15 (short) / $0.40 (long)    Best price/performance; fast; 1M context
Gemini 2.5 Pro               1M tokens    $1.25 (short) / $2.50 (long)    Top coding + maths benchmarks; thinking mode; multimodal
Gemini 1.5 Flash (legacy)    1M tokens    $0.075                          Stable, fast, very cheap for high-volume tasks

Note: "short" pricing applies to prompts below the model's long-context threshold and "long" pricing above it. The threshold is per model: 200K tokens for Gemini 2.5 Pro, 128K for the 1.5 generation. For 1M-token use cases, plan for the higher rate.
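
The tiered billing can be made concrete with a small calculator. The rates below come from the table above; the threshold is passed in as a parameter since the cutoff varies by model, and output-token cost is omitted for simplicity:

```python
def input_cost(tokens: int, short_rate: float, long_rate: float,
               threshold: int) -> float:
    """Input cost in USD for one call: the entire prompt is billed
    at the short or long rate depending on whether its length
    crosses the model's long-context threshold."""
    rate = short_rate if tokens <= threshold else long_rate
    return tokens * rate / 1_000_000

# Gemini 2.5 Pro rates from the table ($1.25 short / $2.50 long),
# with the model's documented long-context cutoff as the threshold:
cost = input_cost(1_000_000, 1.25, 2.50, threshold=200_000)  # -> 2.5
```

Note the cliff: a prompt that crosses the threshold is billed entirely at the long rate, not just the tokens past the boundary.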

1M Token Context in Practice

Gemini 2.5's 1M token context window is among the largest offered by a major commercial API. What this enables:

  • Entire codebase analysis — Load 50K+ lines of code in a single call and ask architectural questions across the full context
  • Large document processing — A 1M-token context fits roughly 700K words of text (well over 1,000 typical pages); entire books or document collections fit in one call
  • Multi-document reasoning — Feed hundreds of PDFs simultaneously and ask questions that require synthesising across them
  • Long video analysis — Up to ~1 hour of video can be processed as frames in the context
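
Whether a workload fits can be estimated with the common ~4 characters-per-token rule of thumb. This is a rough heuristic, not the model's actual tokenizer, and real counts vary with language and content:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 chars/token heuristic."""
    return int(len(text) / chars_per_token)

def fits_in_context(texts: list[str], window: int = 1_000_000) -> bool:
    """True if the combined rough estimate fits the context window.

    Leave headroom in practice: the prompt, system instructions,
    and the model's output all share the same window.
    """
    return sum(estimate_tokens(t) for t in texts) <= window

# 50K lines of code at ~40 chars/line is roughly 500K tokens --
# comfortably inside a 1M window, with room left for questions.
```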

Cost consideration: A single 1M-token call to Gemini 2.5 Pro costs approximately $2.50 (input) + output tokens. For repeated queries over the same large document, consider context caching (see below) or RAG for documents queried many times.

Multimodal Inputs

Gemini natively handles multiple input types in a single API call:

  • Images — PNG, JPEG, GIF, WebP; up to 3,600 images per request
  • Video — MP4, MOV, AVI up to 1GB; frames sampled automatically
  • Audio — WAV, MP3, FLAC; models transcribe and reason over audio
  • Documents — PDF, text, HTML, CSV; processed natively
  • Code — All programming languages; strong code understanding

Mixing modalities in one call is first-class: e.g., "Here is a diagram [image], the related code [code], and a voice note explaining the context [audio] — what should I change?"
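
Mixed modalities are just entries in the same parts array. The sketch below builds an image-plus-text request using the REST API's inline_data base64 encoding; the PNG bytes would come from your own file:

```python
import base64

def image_part(png_bytes: bytes) -> dict:
    """Wrap raw image bytes as an inline_data part (base64-encoded).

    Inline data suits small files; larger media (e.g. long video)
    is typically uploaded separately and referenced instead.
    """
    return {
        "inline_data": {
            "mime_type": "image/png",
            "data": base64.b64encode(png_bytes).decode("ascii"),
        }
    }

def multimodal_request(prompt: str, png_bytes: bytes) -> dict:
    """One user turn mixing an image part and a text part."""
    return {
        "contents": [
            {"role": "user",
             "parts": [image_part(png_bytes), {"text": prompt}]}
        ]
    }
```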

Grounding with Google Search

Gemini can ground its responses in Google Search results in real time:

{
  "tools": [{"google_search": {}}],
  "contents": [{"role": "user", "parts": [{"text": "What happened in AI this week?"}]}]
}

When grounding is enabled, Gemini automatically issues search queries, retrieves results, and incorporates them into the response with citations. This is distinct from RAG — no vector database needed; it uses Google's live search index.

Deep Think (Thinking Mode)

Gemini 2.5 Pro supports a thinking mode analogous to OpenAI's o-series:

{
  "model": "gemini-2.5-pro",
  "generation_config": {
    "thinking_config": {
      "thinking_budget": 8192
    }
  }
}

The thinking budget controls how many tokens the model spends reasoning before answering. Gemini 2.5 Pro with thinking enabled performs competitively with o3 and DeepSeek-R1 on maths and coding benchmarks.
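
A helper that assembles the payload shown above might look like the following. This is a sketch of the raw request body, not the official SDK, using the same field names as the snippet:

```python
def thinking_request(model: str, prompt: str, budget: int) -> dict:
    """generateContent payload with an explicit thinking budget.

    budget caps the tokens the model may spend on internal
    reasoning before answering; larger budgets help on hard
    maths/coding problems at the cost of latency and billed
    thinking tokens.
    """
    if budget < 0:
        raise ValueError("thinking_budget must be >= 0")
    return {
        "model": model,
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generation_config": {
            "thinking_config": {"thinking_budget": budget}
        },
    }
```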

Context Caching

Gemini supports explicit context caching — you store large context on Google's servers and reference it by cache ID in subsequent calls:

  • Cache contents for 1 hour to 1 month
  • Cached tokens are billed at roughly a quarter of the fresh input-token rate (a separate per-token-hour storage fee applies while the cache lives)
  • Minimum 32K tokens to cache
  • Huge savings when the same large document is queried many times
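
Whether caching pays off can be sketched as a simple comparison, assuming the ~4× input-token discount and ignoring the separate cache storage fee (which favours shorter cache lifetimes and is not modeled here):

```python
def fresh_cost(doc_tokens: int, queries: int, rate_per_m: float) -> float:
    """Cost of resending the full document with every query."""
    return doc_tokens * queries * rate_per_m / 1_000_000

def cached_cost(doc_tokens: int, queries: int, rate_per_m: float,
                discount: float = 4.0) -> float:
    """Same queries reading the document from an explicit cache."""
    return doc_tokens * queries * (rate_per_m / discount) / 1_000_000

# 500K-token document, 20 queries, $1.25/1M input rate:
# fresh  -> 500_000 * 20 * 1.25 / 1e6 = $12.50
# cached -> $12.50 / 4                = $3.125
```

The more often the same large context is re-queried, the more the discount dominates the storage fee; for one-off queries, caching buys nothing.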

Vertex AI Production Setup

For production Vertex AI deployment:

  1. Enable the Vertex AI API in your GCP project
  2. Create a service account with roles/aiplatform.user
  3. Use Application Default Credentials or Workload Identity for authentication
  4. Select a regional endpoint (us-central1, europe-west4, etc.) for data residency
  5. Set up billing alerts — long-context calls can be expensive at scale
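
Steps 2-4 above can be sketched in Python. The google-auth library and the endpoint URL format are real; the project and region values are placeholders:

```python
# pip install google-auth   (Application Default Credentials)

def vertex_url(project: str, region: str, model: str) -> str:
    """Regional generateContent endpoint -- the region in the
    hostname must match the one in the path (data residency)."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/"
        f"publishers/google/models/{model}:generateContent"
    )

def bearer_headers() -> dict:
    """OAuth headers via Application Default Credentials (ADC).

    Locally this picks up `gcloud auth application-default login`;
    on GCP it uses the attached service account or Workload
    Identity -- no API keys involved.
    """
    import google.auth
    import google.auth.transport.requests

    creds, _project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    creds.refresh(google.auth.transport.requests.Request())
    return {
        "Authorization": f"Bearer {creds.token}",
        "Content-Type": "application/json",
    }
```

Unlike the AI Studio key-in-URL style, every Vertex request is authorised through IAM, which is what makes the compliance and audit controls above possible.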

Checklist: Do You Understand This?

  • What is the key difference between AI Studio and Vertex AI in terms of data and compliance?
  • What types of use cases justify the 1M token context window?
  • How does Grounding with Search differ from RAG?
  • What is context caching and when is it worth using?
  • How do you enable thinking mode (Deep Think) in the Gemini API?
  • What GCP authentication mechanism should you use for production Vertex AI?