Intermediate

Inference API & Endpoints

Hugging Face offers several ways to run models without managing your own GPU infrastructure — from a free serverless API for prototyping to dedicated production endpoints that autoscale with your traffic.

From Free to Production

The options run from free prototyping (shared, rate-limited, no SLA) to production (dedicated, autoscaling, private): Free Inference API, Serverless Endpoints, Dedicated Endpoints, TGI self-hosted.

Free Inference API

The free Inference API lets you run many public models on the Hub over HTTP, with no setup and no GPU of your own. It is a shared, rate-limited resource and is not intended for production. It is well suited to:

Testing whether a model does what you need before committing to it
Prototyping and demos where low traffic is fine
Educational use and experimentation

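import requests

API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
headers = {"Authorization": "Bearer hf_your_token_here"}  # replace with your own access token

# Send the text to summarize in the "inputs" field
response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Large text to summarize goes here..."},
    timeout=60,
)
response.raise_for_status()  # fail loudly on rate limits or other HTTP errors

# The summarization task returns a list with one result per input
print(response.json()[0]["summary_text"])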

Rate limits vary by model and account tier. Free accounts get a daily quota sized for experimentation rather than sustained traffic, so check your account's current limits before building on it.
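Because the API is shared, a request can fail transiently. Here is a minimal sketch of retrying with exponential backoff, assuming the service signals a rate limit with HTTP 429 and a model that is still loading with HTTP 503 (check the current API documentation for the exact codes). `post_with_retry` and its parameters are hypothetical names. The `send` argument is any callable that returns an object with a `status_code`, such as a thin wrapper around `requests.post`.

```python
import time

RETRYABLE = {429, 503}  # assumed: rate-limited / model still loading


def post_with_retry(send, payload, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send(payload) until it returns a non-retryable status.

    send  -- callable taking the JSON payload and returning an object
             with a .status_code attribute (e.g. a requests.Response)
    sleep -- injectable so the backoff can be skipped in tests
    """
    for attempt in range(max_attempts):
        response = send(payload)
        if response.status_code not in RETRYABLE:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return response  # still failing after the last attempt


# Usage with the request from the example above:
# send = lambda p: requests.post(API_URL, headers=headers, json=p, timeout=60)
# response = post_with_retry(send, {"inputs": "Large text to summarize goes here..."})
```

Passing `send` in as a callable keeps the retry logic independent of the HTTP library and easy to exercise without network access.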

Serverless Inference Endpoints

A step up from the free API — you deploy a specific model on Hugging Face infrastructure and pay only for what you use (per token or per second). The endpoint scales to zero when idle. Good for:

Low-to-medium traffic production apps that don't need consistent latency
Cost-optimized deployments where cold starts are acceptable
Private model deployments (your fine-tuned models)

Dedicated Inference Endpoints

Dedicated Endpoints give you a private, always-on GPU instance running a single model. You choose the hardware tier, region, and scaling behaviour. The endpoint gets a private HTTPS URL.

Tier	GPU	Cost	Good for
CPU Medium	None — CPU only	~$0.06/hr	Embedding models, classifiers
T4 Small	NVIDIA T4 (16 GB)	~$0.60/hr	7B models, real-time inference
A10G Small	NVIDIA A10G (24 GB)	~$1.05/hr	13B models, image generation
A10G Large	4× A10G (96 GB)	~$3.80/hr	Larger models, higher throughput
A100 Large	NVIDIA A100 (80 GB)	~$3.15/hr	70B models, high-quality LLMs
H100	NVIDIA H100 (80 GB)	Custom	Frontier inference, maximum speed
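To turn the hourly rates above into a budget figure, multiply by the hours the endpoint is running. This is a rough sketch using the approximate prices from the table. The 730 hours per month figure (365 × 24 / 12) and the function name are assumptions for illustration, and a real bill depends on current pricing and replica count.

```python
HOURS_PER_MONTH = 730  # assumed average: 365 * 24 / 12

# Approximate hourly rates copied from the table above (USD)
HOURLY_RATE = {
    "CPU Medium": 0.06,
    "T4 Small": 0.60,
    "A10G Small": 1.05,
    "A10G Large": 3.80,
    "A100 Large": 3.15,
}


def monthly_cost(tier, hours_per_day=24.0, replicas=1):
    """Estimated monthly cost for a tier running hours_per_day on each replica."""
    running_hours = HOURS_PER_MONTH * (hours_per_day / 24.0)
    return round(HOURLY_RATE[tier] * running_hours * replicas, 2)


print(monthly_cost("T4 Small"))                   # always on
print(monthly_cost("T4 Small", hours_per_day=8))  # running 8 hours a day
```

Comparing the always-on figure with a partial-day figure gives a quick sense of how much idle time costs on a given tier.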

Dedicated Endpoints that serve text generation models through TGI expose an OpenAI-compatible API, so you can point the OpenAI Python client at your endpoint URL. For RAG pipelines, TEI embedding endpoints pair well with a vector database (see the vector databases guide).
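Because the API follows the OpenAI chat-completions shape, a request is just a JSON POST with a bearer token. Here is a minimal standard-library sketch, assuming a TGI-backed endpoint that serves `/v1/chat/completions`. The endpoint URL and token are placeholders, and `build_chat_request` and `extract_reply` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(endpoint_url, token, user_message, max_tokens=128):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": "tgi",  # placeholder model name
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=endpoint_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style response body."""
    return response_json["choices"][0]["message"]["content"]


# Sending it (requires a running endpoint):
# req = build_chat_request("https://<your-endpoint-url>", "hf_your_token_here", "Hello")
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(extract_reply(json.load(resp)))
```

The OpenAI Python client should produce an equivalent request when its `base_url` is set to the endpoint URL followed by `/v1` and its `api_key` to your Hugging Face token.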

Text Generation Inference (TGI)

TGI is Hugging Face's high-throughput LLM serving toolkit. It's what powers Dedicated Endpoints for text generation models — and you can also self-host it via Docker for your own infrastructure.

Key features:

Continuous batching: new requests join the running batch as soon as a slot frees up, instead of waiting for the whole batch to finish, which keeps the GPU busy
PagedAttention: memory-efficient KV cache management (the technique introduced by vLLM)
Tensor parallelism: split a model across multiple GPUs
Speculative decoding: generate candidate tokens cheaply (for example with a small draft model), then verify them with the main model
Structured outputs: grammar-constrained generation for reliable JSON output
OpenAI-compatible API
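To see why continuous batching helps, compare it with static batching, where a batch is held until its longest request finishes. The toy simulation below counts decoding steps for both strategies. It illustrates the scheduling idea only and is not TGI's actual scheduler. The function names and request lengths are made up for the example.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request is done."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request."""
    waiting = list(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill free slots before every decoding step
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one decoding step produces one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 2, 8, 2, 2]  # tokens to generate per request (made up)
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

With static batching, short requests sit idle next to long ones. With continuous batching, both slots stay occupied, so the 24 tokens finish in the minimum of 12 steps.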

# Self-host TGI via Docker
# HF_TOKEN is only needed for gated or private models
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3

# Query the server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello"}]}'

Text Embeddings Inference (TEI)

TEI is TGI's companion for embedding models. It serves models like BAAI/bge-large-en-v1.5 or sentence-transformers/all-MiniLM-L6-v2 at high throughput — useful for RAG pipelines that need to embed millions of documents.

# Self-host TEI via Docker
docker run --gpus all -p 8080:80 \
  -v $HOME/.cache/huggingface:/data \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5
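Once the container is up, embeddings come back as plain lists of floats that can be compared directly. This is a small sketch, assuming TEI's `/embed` route, which accepts `{"inputs": [...]}` and returns one vector per input. The helper names are hypothetical, and the localhost URL matches the Docker command above.

```python
import json
import math
import urllib.request


def build_embed_request(texts, base_url="http://localhost:8080"):
    """Build (but do not send) a POST to TEI's /embed route."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embed",
        data=json.dumps({"inputs": list(texts)}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# With the server running:
# req = build_embed_request(["What is TGI?", "TGI serves LLMs."])
# with urllib.request.urlopen(req, timeout=30) as resp:
#     query_vec, doc_vec = json.load(resp)
# print(cosine_similarity(query_vec, doc_vec))
```

In a RAG pipeline the same vectors would typically be written to a vector database rather than compared by hand.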

When to Use Each

Free Inference API

Testing a model quickly
Educational projects
Infrequent, low-volume use

Serverless Endpoints

Production with variable/bursty traffic
Cost-sensitive with acceptable cold starts
Private fine-tuned model serving

Dedicated Endpoints

Consistent low-latency SLA required
High-throughput production serving
Regulatory requirement for dedicated resources

TGI / TEI self-hosted

Data sovereignty — model never leaves your infra
Maximum cost control at scale
Custom modifications to serving logic
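The criteria above can be read as a simple decision order. The sketch below encodes one possible reading. The function name, the parameter names, and the order of the checks are assumptions made for illustration, not an official rule.

```python
def recommend_option(production, data_must_stay_in_house=False,
                     needs_consistent_latency=False):
    """Map the criteria in this section to one of the four options."""
    if data_must_stay_in_house:
        return "TGI / TEI self-hosted"
    if not production:
        return "Free Inference API"
    if needs_consistent_latency:
        return "Dedicated Endpoints"
    return "Serverless Endpoints"  # variable traffic, cold starts acceptable


print(recommend_option(production=False))
print(recommend_option(production=True, needs_consistent_latency=True))
```

Real decisions usually weigh cost and traffic volume as well, so treat the function as a starting checklist rather than a verdict.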

Checklist: Do You Understand This?

Can you call the free Inference API with a Python requests call?
Do you know when to use Serverless vs Dedicated Endpoints?
Can you explain what TGI is and why it's faster than naive inference?
Do you know the difference between TGI (text generation) and TEI (embeddings)?