Hosted Fine-Tuning Services
Not every team needs to manage GPU infrastructure to fine-tune a model. Hosted fine-tuning services handle the hardware and training pipeline — you provide the data and get back a fine-tuned model endpoint. This page covers the major options and when each makes sense.
OpenAI Fine-Tuning
OpenAI supports fine-tuning for GPT-4o mini and GPT-4o. This is the simplest path if you're already using OpenAI and want to customise model behaviour.
Data Format
OpenAI fine-tuning uses JSONL format — one JSON object per line — with conversation turns (a record is shown expanded here for readability):
{"messages": [
{"role": "system", "content": "You are a helpful product assistant."},
{"role": "user", "content": "What is your return policy?"},
{"role": "assistant", "content": "Our return policy allows returns within 30 days..."}
]}
{"messages": [...]}

Process
- Upload training JSONL via Files API
- Create a fine-tuning job specifying base model, training file, epochs, learning rate
- Monitor job status via API or dashboard
- Fine-tuned model is assigned a model ID and accessible via standard Chat Completions
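The steps above can be sketched with `requests` against the public REST endpoints. A minimal sketch, assuming a local `train.jsonl`; the snapshot model name and epoch count are illustrative choices, not requirements:

```python
import os
import requests

API = "https://api.openai.com/v1"

def job_payload(training_file_id: str, n_epochs: int = 3) -> dict:
    """Request body for the fine-tuning job (snapshot name is an example)."""
    return {
        "model": "gpt-4o-mini-2024-07-18",
        "training_file": training_file_id,
        "hyperparameters": {"n_epochs": n_epochs},
    }

def create_finetune(jsonl_path: str) -> str:
    """Upload the training JSONL, start a job, and return the job id."""
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # 1. Upload training data via the Files API.
    with open(jsonl_path, "rb") as f:
        up = requests.post(f"{API}/files", headers=headers,
                           data={"purpose": "fine-tune"}, files={"file": f})
    file_id = up.json()["id"]
    # 2. Create the fine-tuning job; afterwards, poll its status until it is
    #    terminal (succeeded / failed / cancelled) via GET /fine_tuning/jobs/{id}.
    job = requests.post(f"{API}/fine_tuning/jobs", headers=headers,
                        json=job_payload(file_id))
    return job.json()["id"]
```

Once the job succeeds, the returned fine-tuned model ID can be passed as `model` in ordinary Chat Completions calls.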
Pricing
- Training: $0.025 per 1,000 training tokens for GPT-4o mini
- Inference: 2–4× higher than base model (fine-tuned endpoint pricing)
- 1 epoch over 10,000 examples of 500 tokens each = 5M training tokens = ~$125 at the GPT-4o mini rate
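The arithmetic above generalises to a one-line estimate:

```python
def training_cost(examples: int, tokens_per_example: int, epochs: int,
                  price_per_1k_tokens: float) -> float:
    """Estimated training cost: total trained tokens times the per-token price."""
    total_tokens = examples * tokens_per_example * epochs
    return total_tokens / 1000 * price_per_1k_tokens

# 10,000 examples x 500 tokens x 1 epoch at $0.025 per 1K tokens:
print(training_cost(10_000, 500, 1, 0.025))  # 125.0
```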
Limitations
- You cannot see or export the fine-tuned weights — they stay on OpenAI's infrastructure
- Fine-tuned models are not automatically updated when base models update
- Data sent to OpenAI; privacy implications apply
Google Vertex AI Supervised Tuning
Vertex AI supports supervised fine-tuning for Gemini models:
- Dataset stored in Google Cloud Storage (GCS)
- Supports Gemini Flash and Pro models
- Managed training pipeline — no GPU provisioning needed
- Fine-tuned models deployed to Vertex AI endpoints
- Training data stays within your GCP project/region
Particularly useful for teams already on GCP wanting to fine-tune Gemini for specific enterprise tasks with data residency requirements.
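For teams starting from a dataset already in GCS, launching a tuning job looks roughly like the sketch below, using the `vertexai` SDK (part of `google-cloud-aiplatform`). The project ID, region, bucket, and Gemini version string are placeholders; check the current Vertex AI docs for supported model versions:

```python
def dataset_uri(bucket: str, path: str) -> str:
    """GCS URI for the training JSONL, in the form Vertex expects."""
    return f"gs://{bucket}/{path}"

def launch_tuning_job(project: str, bucket: str) -> str:
    # Imported here so the sketch stays importable without the SDK installed.
    import vertexai
    from vertexai.tuning import sft

    vertexai.init(project=project, location="us-central1")
    job = sft.train(
        source_model="gemini-1.5-flash-002",  # example base-model version
        train_dataset=dataset_uri(bucket, "train.jsonl"),
    )
    return job.resource_name
```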
Together.ai Fine-Tuning
Together.ai offers fine-tuning for open-weight models (LLaMA, Mistral, Qwen, etc.) at competitive prices:
- Models available: LLaMA 3.1 (8B, 70B), Mistral 7B, Qwen 2.5
- Training cost: ~$0.002–0.01 per 1K training tokens (significantly cheaper than OpenAI)
- Fine-tuned model deployed as a private Together.ai endpoint
- Can also download the fine-tuned weights (unlike OpenAI)
- JSONL format similar to OpenAI
Best choice when: you want to fine-tune an open-weight model without managing GPUs, at lower cost than OpenAI, and with the option to export weights for self-hosting.
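Job creation follows the same upload-then-create pattern as OpenAI. A hedged sketch over Together's REST API; the endpoint path and field names are assumptions modelled on Together's OpenAI-style interface, and the model string and file ID are placeholders — verify against the current API reference:

```python
import os
import requests

def together_job_body(model: str, training_file_id: str, n_epochs: int = 3) -> dict:
    """Job body; field names assumed to mirror OpenAI's (verify before use)."""
    return {"model": model, "training_file": training_file_id, "n_epochs": n_epochs}

def create_together_finetune(training_file_id: str) -> dict:
    resp = requests.post(
        "https://api.together.xyz/v1/fine-tunes",  # assumed endpoint path
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json=together_job_body("meta-llama/Meta-Llama-3.1-8B-Instruct",
                               training_file_id),
    )
    return resp.json()
```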
Hugging Face AutoTrain
AutoTrain provides a no-code/low-code interface for fine-tuning on Hugging Face infrastructure:
- Supports text classification, summarisation, question answering, causal LM
- Upload dataset via UI; select model; configure hyperparameters visually
- Fine-tuned model saved directly to your Hugging Face repository
- Can be deployed immediately via HF Inference Endpoints
- Pay per compute hour (A10G, A100 GPU options)
Axolotl: Self-Managed on Cloud GPUs
For teams wanting more control without full infrastructure ownership, running Axolotl on a rented cloud GPU (Lambda Labs, Vast.ai, RunPod) is a practical middle ground:
- Rent a single A100 80GB at ~$1–3/hour
- Run QLoRA fine-tuning via Axolotl YAML config
- Full control over model, data, and training parameters
- Save fine-tuned weights to your own storage
- Cost for a typical fine-tuning run: $10–50 total
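A QLoRA run like the one above is driven by a single YAML file. A hedged config sketch — the model, dataset path, and hyperparameters are illustrative; start from Axolotl's own example configs for your model family:

```yaml
# qlora.yml — illustrative Axolotl config for a single-GPU QLoRA run
base_model: meta-llama/Meta-Llama-3.1-8B
load_in_4bit: true            # QLoRA: 4-bit quantised base weights
adapter: qlora
datasets:
  - path: data/train.jsonl
    type: alpaca              # dataset format; match your data
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0002
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true      # apply LoRA to all linear layers
output_dir: ./outputs/qlora-run
```

Typically launched with `accelerate launch -m axolotl.cli.train qlora.yml` (the exact command form may vary by Axolotl version).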
Data Preparation: Common Pitfalls
Fine-tuning fails because of bad data more often than for any other reason:
- Duplicates: Deduplication is essential; duplicated examples overweight certain patterns
- Inconsistent quality: Mix of good and bad examples confuses the model
- Wrong format: Missing system prompts, wrong role names, malformed JSON
- Too narrow: Examples only cover a slice of your actual production inputs
- Label leakage: Training examples contain information that won't be available in production (e.g., the answer embedded in the question)
- Insufficient test split: If you use all data for training, you can't evaluate properly
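Several of these pitfalls are mechanical enough to check in code. A minimal sketch (standard library only) that validates OpenAI-style chat records, drops malformed JSON, removes exact duplicates, and holds out a test split; the 10% split fraction and role names are assumptions matching the format shown earlier:

```python
import json
import random

VALID_ROLES = {"system", "user", "assistant"}

def is_valid(record: dict) -> bool:
    """Well-formed: a non-empty 'messages' list of {role, content} pairs."""
    msgs = record.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    return all(isinstance(m, dict)
               and m.get("role") in VALID_ROLES
               and isinstance(m.get("content"), str) for m in msgs)

def prepare(lines, test_fraction=0.1, seed=0):
    """Parse, validate, deduplicate, and split JSONL lines into (train, test)."""
    records, seen = [], set()
    for line in lines:
        try:
            rec = json.loads(line)          # catches malformed JSON
        except json.JSONDecodeError:
            continue
        if not is_valid(rec):               # catches wrong roles / structure
            continue
        key = json.dumps(rec, sort_keys=True)
        if key in seen:                     # catches exact duplicates
            continue
        seen.add(key)
        records.append(rec)
    random.Random(seed).shuffle(records)
    n_test = max(1, int(len(records) * test_fraction))
    return records[n_test:], records[:n_test]
```

This only removes exact duplicates; near-duplicate detection and quality filtering still need a dedicated pass.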
Service Comparison
| Service | Base models | Export weights? | Relative cost | Best for |
|---|---|---|---|---|
| OpenAI | GPT-4o mini, GPT-4o | No | Highest | Existing OpenAI users; easiest integration |
| Google Vertex AI | Gemini models | No | High | GCP-native teams; Gemini customisation |
| Together.ai | Llama, Mistral, Qwen | Yes | Medium | Open-weight fine-tuning; cost-conscious |
| HF AutoTrain | Any HF model | Yes | Medium | HF ecosystem users; broadest model selection |
| Axolotl + rented GPU | Any open-weight | Yes (you own everything) | Lowest | Full control; sensitive data; advanced users |
Deployment After Fine-Tuning
Your options after fine-tuning depend on whether you can export the weights:
- No weight export (OpenAI, Vertex): Use the provider's endpoint directly; model stays in their infrastructure
- Weights exported: Upload to Hugging Face private repo → deploy via HF Inference Endpoints, or self-host via vLLM/Ollama after GGUF conversion
- Convert to GGUF for Ollama: Use llama.cpp convert scripts to quantise exported weights for local serving
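Once exported weights are converted to GGUF, Ollama loads them via a Modelfile. A sketch that generates a minimal one; the GGUF path, model name, and system prompt are placeholders:

```python
def make_modelfile(gguf_path: str, system_prompt: str) -> str:
    """Minimal Ollama Modelfile: base weights plus a default system prompt."""
    return (
        f"FROM {gguf_path}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )

MODELFILE = make_modelfile("./my-model-q4_k_m.gguf",
                           "You are a helpful product assistant.")
# Write MODELFILE to a file named "Modelfile", then register and run:
#   ollama create my-model -f Modelfile
#   ollama run my-model
```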
Checklist: Do You Understand This?
- What JSONL format does OpenAI fine-tuning expect?
- What is the key advantage of Together.ai over OpenAI for fine-tuning?
- Name three common data preparation mistakes that cause fine-tuning to fail.
- What is the "Axolotl + rented GPU" approach and when does it make sense?
- After fine-tuning with weight export, how would you serve the model locally via Ollama?