Hosted Fine-Tuning Services
Not every team needs to manage GPU infrastructure to fine-tune a model. Hosted fine-tuning services handle the hardware and training pipeline — you provide the data and get back a fine-tuned model endpoint. This page covers the major options and when each makes sense.
OpenAI Fine-Tuning
OpenAI supports fine-tuning for GPT-4o mini and GPT-4o. This is the simplest path if you're already using OpenAI and want to customise model behaviour.
Data Format
OpenAI fine-tuning uses JSONL format — one JSON object per line — with conversation turns (a record is shown expanded here for readability):
{"messages": [
{"role": "system", "content": "You are a helpful product assistant."},
{"role": "user", "content": "What is your return policy?"},
{"role": "assistant", "content": "Our return policy allows returns within 30 days..."}
]}
{"messages": [...]}

Process
- Upload training JSONL via Files API
- Create a fine-tuning job specifying base model, training file, epochs, learning rate
- Monitor job status via API or dashboard
- Fine-tuned model is assigned a model ID and accessible via standard Chat Completions
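The steps above can be sketched with `requests` against the public REST endpoints. A minimal sketch, assuming a local `train.jsonl`; the snapshot model name and epoch count are illustrative choices, not requirements:

```python
import os
import requests

API = "https://api.openai.com/v1"

def job_payload(training_file_id: str, n_epochs: int = 3) -> dict:
    """Request body for the fine-tuning job (snapshot name is an example)."""
    return {
        "model": "gpt-4o-mini-2024-07-18",
        "training_file": training_file_id,
        "hyperparameters": {"n_epochs": n_epochs},
    }

def create_finetune(jsonl_path: str) -> str:
    """Upload the training JSONL, start a job, and return the job id."""
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # 1. Upload training data via the Files API.
    with open(jsonl_path, "rb") as f:
        up = requests.post(f"{API}/files", headers=headers,
                           data={"purpose": "fine-tune"}, files={"file": f})
    file_id = up.json()["id"]
    # 2. Create the fine-tuning job; afterwards, poll its status until it is
    #    terminal (succeeded / failed / cancelled) via GET /fine_tuning/jobs/{id}.
    job = requests.post(f"{API}/fine_tuning/jobs", headers=headers,
                        json=job_payload(file_id))
    return job.json()["id"]
```

Once the job succeeds, the returned fine-tuned model ID can be passed as `model` in ordinary Chat Completions calls.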
Pricing
- Training: $0.025 per 1,000 training tokens for GPT-4o mini
- Inference: 2–4× higher than base model (fine-tuned endpoint pricing)
- 1 epoch over 10,000 examples of 500 tokens each = 5M training tokens = ~$125 at the GPT-4o mini rate
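The arithmetic above generalises to a one-line estimate:

```python
def training_cost(examples: int, tokens_per_example: int, epochs: int,
                  price_per_1k_tokens: float) -> float:
    """Estimated training cost: total trained tokens times the per-token price."""
    total_tokens = examples * tokens_per_example * epochs
    return total_tokens / 1000 * price_per_1k_tokens

# 10,000 examples x 500 tokens x 1 epoch at $0.025 per 1K tokens:
print(training_cost(10_000, 500, 1, 0.025))  # 125.0
```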
Limitations
- You cannot see or export the fine-tuned weights — they stay on OpenAI's infrastructure
- Fine-tuned models are not automatically updated when base models update
- Data sent to OpenAI; privacy implications apply
Google Vertex AI Supervised Tuning
Vertex AI supports supervised fine-tuning for Gemini models:
- Dataset stored in Google Cloud Storage (GCS)
- Supports Gemini Flash and Pro models
- Managed training pipeline — no GPU provisioning needed
- Fine-tuned models deployed to Vertex AI endpoints
- Training data stays within your GCP project/region
Particularly useful for teams already on GCP wanting to fine-tune Gemini for specific enterprise tasks with data residency requirements.
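For teams starting from a dataset already in GCS, launching a tuning job looks roughly like the sketch below, using the `vertexai` SDK (part of `google-cloud-aiplatform`). The project ID, region, bucket, and Gemini version string are placeholders; check the current Vertex AI docs for supported model versions:

```python
def dataset_uri(bucket: str, path: str) -> str:
    """GCS URI for the training JSONL, in the form Vertex expects."""
    return f"gs://{bucket}/{path}"

def launch_tuning_job(project: str, bucket: str) -> str:
    # Imported here so the sketch stays importable without the SDK installed.
    import vertexai
    from vertexai.tuning import sft

    vertexai.init(project=project, location="us-central1")
    job = sft.train(
        source_model="gemini-1.5-flash-002",  # example base-model version
        train_dataset=dataset_uri(bucket, "train.jsonl"),
    )
    return job.resource_name
```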
Together.ai Fine-Tuning
Together.ai offers fine-tuning for open-weight models (LLaMA, Mistral, Qwen, etc.) at competitive prices:
- Models available: LLaMA 3.1 (8B, 70B), Mistral 7B, Qwen 2.5
- Training cost: ~$0.002–0.01 per 1K training tokens (significantly cheaper than OpenAI)
- Fine-tuned model deployed as a private Together.ai endpoint
- Can also download the fine-tuned weights (unlike OpenAI)
- JSONL format similar to OpenAI
Best choice when: you want to fine-tune an open-weight model without managing GPUs, at lower cost than OpenAI, and with the option to export weights for self-hosting.
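Job creation follows the same upload-then-create pattern as OpenAI. A hedged sketch over Together's REST API; the endpoint path and field names are assumptions modelled on Together's OpenAI-style interface, and the model string and file ID are placeholders — verify against the current API reference:

```python
import os
import requests

def together_job_body(model: str, training_file_id: str, n_epochs: int = 3) -> dict:
    """Job body; field names assumed to mirror OpenAI's (verify before use)."""
    return {"model": model, "training_file": training_file_id, "n_epochs": n_epochs}

def create_together_finetune(training_file_id: str) -> dict:
    resp = requests.post(
        "https://api.together.xyz/v1/fine-tunes",  # assumed endpoint path
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json=together_job_body("meta-llama/Meta-Llama-3.1-8B-Instruct",
                               training_file_id),
    )
    return resp.json()
```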
Hugging Face AutoTrain
AutoTrain provides a no-code/low-code interface for fine-tuning on Hugging Face infrastructure:
- Supports text classification, summarisation, question answering, causal LM
- Upload dataset via UI; select model; configure hyperparameters visually
- Fine-tuned model saved directly to your Hugging Face repository
- Can be deployed immediately via HF Inference Endpoints
- Pay per compute hour (A10G, A100 GPU options)
Axolotl: Self-Managed on Cloud GPUs
For teams wanting more control without full infrastructure ownership, running Axolotl on a rented cloud GPU (Lambda Labs, Vast.ai, RunPod) is a practical middle ground:
- Rent a single A100 80GB at ~$1–3/hour
- Run QLoRA fine-tuning via Axolotl YAML config
- Full control over model, data, and training parameters
- Save fine-tuned weights to your own storage
- Cost for a typical fine-tuning run: $10–50 total
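A QLoRA run like the one above is driven by a single YAML file. A hedged config sketch — the model, dataset path, and hyperparameters are illustrative; start from Axolotl's own example configs for your model family:

```yaml
# qlora.yml — illustrative Axolotl config for a single-GPU QLoRA run
base_model: meta-llama/Meta-Llama-3.1-8B
load_in_4bit: true            # QLoRA: 4-bit quantised base weights
adapter: qlora
datasets:
  - path: data/train.jsonl
    type: alpaca              # dataset format; match your data
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0002
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true      # apply LoRA to all linear layers
output_dir: ./outputs/qlora-run
```

Typically launched with `accelerate launch -m axolotl.cli.train qlora.yml` (the exact command form may vary by Axolotl version).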
Data Preparation: Common Pitfalls
Fine-tuning fails because of bad data more often than for any other reason:
- Duplicates: Deduplication is essential; duplicated examples overweight certain patterns
- Inconsistent quality: Mix of good and bad examples confuses the model
- Wrong format: Missing system prompts, wrong role names, malformed JSON
- Too narrow: Examples only cover a slice of your actual production inputs
- Label leakage: Training examples contain information that won't be available in production (e.g., the answer embedded in the question)
- Insufficient test split: If you use all data for training, you can't evaluate properly
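Several of these pitfalls are mechanical enough to check in code. A minimal sketch (standard library only) that validates OpenAI-style chat records, drops malformed JSON, removes exact duplicates, and holds out a test split; the 10% split fraction and role names are assumptions matching the format shown earlier:

```python
import json
import random

VALID_ROLES = {"system", "user", "assistant"}

def is_valid(record: dict) -> bool:
    """Well-formed: a non-empty 'messages' list of {role, content} pairs."""
    msgs = record.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    return all(isinstance(m, dict)
               and m.get("role") in VALID_ROLES
               and isinstance(m.get("content"), str) for m in msgs)

def prepare(lines, test_fraction=0.1, seed=0):
    """Parse, validate, deduplicate, and split JSONL lines into (train, test)."""
    records, seen = [], set()
    for line in lines:
        try:
            rec = json.loads(line)          # catches malformed JSON
        except json.JSONDecodeError:
            continue
        if not is_valid(rec):               # catches wrong roles / structure
            continue
        key = json.dumps(rec, sort_keys=True)
        if key in seen:                     # catches exact duplicates
            continue
        seen.add(key)
        records.append(rec)
    random.Random(seed).shuffle(records)
    n_test = max(1, int(len(records) * test_fraction))
    return records[n_test:], records[:n_test]
```

This only removes exact duplicates; near-duplicate detection and quality filtering still need a dedicated pass.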
Service Comparison
| Service | Base models | Export weights? | Relative cost | Best for |
|---|---|---|---|---|
| OpenAI | GPT-4o mini, GPT-4o | No | Highest | Existing OpenAI users; easiest integration |
| Google Vertex AI | Gemini models | No | High | GCP-native teams; Gemini customisation |
| Together.ai | Llama, Mistral, Qwen | Yes | Medium | Open-weight fine-tuning; cost-conscious |
| HF AutoTrain | Any HF model | Yes | Medium | HF ecosystem users; broadest model selection |
| Axolotl + rented GPU | Any open-weight | Yes (you own everything) | Lowest | Full control; sensitive data; advanced users |
Deployment After Fine-Tuning
Your options after fine-tuning depend on whether you can export the weights:
- No weight export (OpenAI, Vertex): Use the provider's endpoint directly; model stays in their infrastructure
- Weights exported: Upload to Hugging Face private repo → deploy via HF Inference Endpoints, or self-host via vLLM/Ollama after GGUF conversion
- Convert to GGUF for Ollama: Use llama.cpp convert scripts to quantise exported weights for local serving
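Once exported weights are converted to GGUF, Ollama loads them via a Modelfile. A sketch that generates a minimal one; the GGUF path, model name, and system prompt are placeholders:

```python
def make_modelfile(gguf_path: str, system_prompt: str) -> str:
    """Minimal Ollama Modelfile: base weights plus a default system prompt."""
    return (
        f"FROM {gguf_path}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )

MODELFILE = make_modelfile("./my-model-q4_k_m.gguf",
                           "You are a helpful product assistant.")
# Write MODELFILE to a file named "Modelfile", then register and run:
#   ollama create my-model -f Modelfile
#   ollama run my-model
```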
Checklist: Do You Understand This?
- What JSONL format does OpenAI fine-tuning expect?
- What is the key advantage of Together.ai over OpenAI for fine-tuning?
- Name three common data preparation mistakes that cause fine-tuning to fail.
- What is the "Axolotl + rented GPU" approach and when does it make sense?
- After fine-tuning with weight export, how would you serve the model locally via Ollama?