Inference API & Endpoints
Hugging Face offers several ways to run models without managing your own GPU infrastructure — from a free serverless API for prototyping to dedicated production endpoints that autoscale with your traffic.
From Free to Production
Free Inference API
The free Inference API lets you run many public models on the Hub over HTTP, with no setup and no GPU required. It is a shared, rate-limited resource and is not intended for production. Great for:
- Testing whether a model does what you need before committing to it
- Prototyping and demos where low traffic is fine
- Educational use and experimentation
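```python
import requests

API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
headers = {"Authorization": "Bearer hf_your_token_here"}  # replace with your own token

# Send the text to summarize as the "inputs" field of a JSON body
response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Large text to summarize goes here..."},
)
response.raise_for_status()  # fail loudly on auth or rate-limit errors

# Summarization models return a list with one dict per input
print(response.json()[0]["summary_text"])
```

Rate limits vary by model and your account tier. Free accounts get a reasonable daily quota for experimentation.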
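Because the free API is shared and rate-limited, a client will sometimes need to retry. The sketch below shows one way to do that with exponential backoff. It assumes that rate limiting is signalled with HTTP 429 and a model that is still loading with HTTP 503; those status codes and the helper name `call_with_retry` are illustrative assumptions, not something this guide specifies.

```python
import time

RETRYABLE = {429, 503}  # assumption: rate-limited / model still loading


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable that returns an object with a
    .status_code attribute, e.g. lambda: requests.post(API_URL, ...).
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in RETRYABLE:
            return response
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    return response  # still failing after max_attempts; let the caller decide
```

Passing the request as a callable keeps the helper independent of any particular HTTP library, and the injectable `sleep` makes it easy to test without waiting.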
Serverless Inference Endpoints
A step up from the free API — you deploy a specific model on Hugging Face infrastructure and pay only for what you use (per token or per second). The endpoint scales to zero when idle. Good for:
- Low-to-medium traffic production apps that don't need consistent latency
- Cost-optimized deployments where cold starts are acceptable
- Private model deployments (your fine-tuned models)
Dedicated Inference Endpoints
Dedicated Endpoints give you a private, always-on GPU instance running a single model. You choose the hardware tier, region, and scaling behaviour. The endpoint gets a private HTTPS URL.
| Tier | GPU | Cost | Good for |
|---|---|---|---|
| CPU Medium | None — CPU only | ~$0.06/hr | Embedding models, classifiers |
| T4 Small | NVIDIA T4 (16 GB) | ~$0.60/hr | 7B models, real-time inference |
| A10G Small | NVIDIA A10G (24 GB) | ~$1.05/hr | 13B models, image generation |
| A10G Large | 4× A10G (96 GB) | ~$3.80/hr | Larger models, higher throughput |
| A100 Large | NVIDIA A100 (80 GB) | ~$3.15/hr | 70B models, high-quality LLMs |
| H100 | NVIDIA H100 (80 GB) | Custom | Frontier inference, maximum speed |
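Since a dedicated endpoint is billed while it is running, the hourly figures translate directly into a monthly bill. The short calculation below is a rough sketch that uses the approximate rates from the table and assumes the endpoint runs around the clock (about 730 hours in an average month); the function name `monthly_cost` is just for illustration.

```python
HOURS_PER_MONTH = 730  # average month: 365 days * 24 h / 12

# Approximate hourly rates (USD) copied from the table above
hourly_usd = {
    "CPU Medium": 0.06,
    "T4 Small": 0.60,
    "A10G Small": 1.05,
    "A10G Large": 3.80,
    "A100 Large": 3.15,
}


def monthly_cost(rate_per_hour, hours=HOURS_PER_MONTH):
    """Estimated cost in USD of running one instance for `hours` hours."""
    return round(rate_per_hour * hours, 2)


for tier, rate in hourly_usd.items():
    print(f"{tier:<11} ~${monthly_cost(rate):,.2f}/month")
```

On these numbers an always-on T4 Small comes to roughly $438 a month, which is the figure to compare against pay-per-use pricing when traffic is low.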
Dedicated Endpoints that serve text generation models through TGI expose an OpenAI-compatible chat completions API, so you can point the OpenAI Python client at your endpoint URL. For RAG pipelines, TEI embedding endpoints pair well with a vector database; see the vector databases guide.
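As a sketch of what pointing the OpenAI client at an endpoint can look like, the helper below builds the base URL and sends one chat message. It assumes the `openai` package is installed and that the endpoint serves the `/v1/chat/completions` route; the function names and the example URL are hypothetical placeholders.

```python
def openai_base_url(endpoint_url: str) -> str:
    """Append the /v1/ prefix that the OpenAI client expects."""
    return endpoint_url.rstrip("/") + "/v1/"


def ask(endpoint_url: str, hf_token: str, prompt: str) -> str:
    """Send a single user message and return the model's reply text."""
    # Imported here so openai_base_url works without the package installed.
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=openai_base_url(endpoint_url), api_key=hf_token)
    completion = client.chat.completions.create(
        model="tgi",  # placeholder model name, as in the curl example in this guide
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    return completion.choices[0].message.content


# Hypothetical usage (replace the URL and token with your own):
# print(ask("https://your-endpoint.example", "hf_your_token_here", "Hello"))
```

The Hugging Face token is passed where the OpenAI client expects an API key, so existing OpenAI-based code typically needs only the base URL and key changed.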
Text Generation Inference (TGI)
TGI is Hugging Face's high-throughput LLM serving toolkit. It's what powers Dedicated Endpoints for text generation models — and you can also self-host it via Docker for your own infrastructure.
Key features:
- Continuous batching: adds incoming requests to the running batch at each decoding step instead of waiting for the whole batch to finish, which keeps the GPU busy
- PagedAttention: memory-efficient KV cache management, the technique introduced by vLLM
- Tensor parallelism: spreads a model across multiple GPUs
- Speculative decoding: a small draft model proposes candidate tokens, which the main model then verifies
- Structured outputs: grammar-constrained generation for reliable JSON output
- OpenAI-compatible API
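To see why continuous batching helps, here is a toy simulation. It is not TGI's scheduler, only an illustration: each number is how many tokens a request still has to generate, and one step generates one token for every request currently in the batch. The function names are made up for this example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each request in the batch
    steps = 0
    while queue or active:
        # Fill any free slots with waiting requests
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished ones leave the batch
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


requests_tokens = [2, 8, 3, 5]  # output lengths of four queued requests
print(static_batching_steps(requests_tokens, batch_size=2))      # 13
print(continuous_batching_steps(requests_tokens, batch_size=2))  # 10
```

With static batching the short request sits idle while its batch-mate finishes; with continuous batching its slot is handed to the next request, so the same work completes in fewer steps.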
```shell
# Self-host TGI via Docker
docker run --gpus all -p 8080:80 \
  -v $HOME/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3

# Query the server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello"}]}'
```

Text Embeddings Inference (TEI)
TEI is TGI's companion for embedding models. It serves models like BAAI/bge-large-en-v1.5 or sentence-transformers/all-MiniLM-L6-v2 at high throughput — useful for RAG pipelines that need to embed millions of documents.
```shell
# Self-host TEI via Docker
docker run --gpus all -p 8080:80 \
  -v $HOME/.cache/huggingface:/data \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5
```
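Once a TEI container is running, you can request embeddings over HTTP. The sketch below posts to TEI's `/embed` route using only the standard library and compares two vectors with cosine similarity. It assumes TEI is reachable on `localhost:8080` as in the Docker command above (use a different host port if TGI already occupies 8080); the helper names are illustrative.

```python
import json
import math
import urllib.request

TEI_URL = "http://localhost:8080/embed"  # assumption: local TEI container on port 8080


def embed(texts, url=TEI_URL):
    """POST a list of strings to TEI and return one embedding per string."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Usage, with the server running:
# query_vec, doc_vec = embed(["What is TGI?", "TGI serves LLMs at high throughput."])
# print(f"similarity: {cosine_similarity(query_vec, doc_vec):.3f}")
```

In a RAG pipeline the document vectors would be computed once and stored in a vector database, with only the query embedded at request time.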
When to Use Each
Free Inference API:
- Testing a model quickly
- Educational projects
- Infrequent, low-volume use
Serverless Inference Endpoints:
- Production with variable/bursty traffic
- Cost-sensitive with acceptable cold starts
- Private fine-tuned model serving
Dedicated Inference Endpoints:
- Consistent low-latency SLA required
- High-throughput production serving
- Regulatory requirement for dedicated resources
Self-hosted TGI / TEI:
- Data sovereignty: the model never leaves your infra
- Maximum cost control at scale
- Custom modifications to serving logic
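The guidance in this section can be read as a rough decision procedure. The function below is one possible encoding of it, not an official rule: the three flags and the order in which they are checked are simplifying assumptions that map the criteria onto the four options covered in this guide.

```python
def suggest_deployment(production: bool,
                       needs_steady_latency: bool = False,
                       must_stay_on_own_infra: bool = False) -> str:
    """Return a starting-point recommendation; real decisions weigh more factors."""
    if must_stay_on_own_infra:
        # Data sovereignty or custom serving logic: run TGI/TEI yourself
        return "self-hosted TGI/TEI"
    if not production:
        # Quick tests, education, infrequent use
        return "free Inference API"
    if needs_steady_latency:
        # Latency SLAs or sustained high throughput
        return "Dedicated Inference Endpoints"
    # Bursty or cost-sensitive production traffic that tolerates cold starts
    return "Serverless Inference Endpoints"


print(suggest_deployment(production=False))
print(suggest_deployment(production=True, needs_steady_latency=True))
```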
Checklist: Do You Understand This?
- Can you call the free Inference API with a Python requests call?
- Do you know when to use Serverless vs Dedicated Endpoints?
- Can you explain what TGI is and why it's faster than naive inference?
- Do you know the difference between TGI (text generation) and TEI (embeddings)?