Intermediate

Inference API & Endpoints

Hugging Face offers several ways to run models without managing your own GPU infrastructure — from a free serverless API for prototyping to dedicated production endpoints that autoscale with your traffic.

From Free to Production

[Figure: a spectrum running from free prototyping (shared, rate-limited, no SLA) to production (dedicated, autoscaling, private). The options, in order: Free Inference API, Serverless Endpoints, Dedicated Endpoints, TGI self-hosted.]

Free Inference API

The free Inference API lets you run many public models on the Hub over HTTP, with no setup and no GPU required. It's a shared, rate-limited resource and is not intended for production. Great for:

  • Testing whether a model does what you need before committing to it
  • Prototyping and demos where low traffic is fine
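  • Educational use and experimentation

import requests

# Summarization model hosted on the Hub; swap in another public model ID as needed.
API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
# Replace the placeholder with your own access token.
headers = {"Authorization": "Bearer hf_your_token_here"}

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Large text to summarize goes here..."},
    timeout=60,
)
response.raise_for_status()  # surface HTTP errors instead of failing on the JSON below

# The summarization response is a list with one dict per input.
print(response.json()[0]["summary_text"])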

Rate limits vary by model and account tier. The free-tier quota is sized for experimentation, not for sustained traffic.
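Because the free API is shared and rate-limited, a client usually has to tolerate transient failures. The sketch below shows one way to retry with backoff. It assumes the API signals a rate limit with HTTP 429 and a model that is still loading with HTTP 503, possibly with an `estimated_time` field in the JSON body; check both assumptions against the current API documentation. `call_with_retry` and the injected `send` callable are hypothetical names used only for illustration.

```python
import time
from typing import Any, Callable, Tuple

# Assumption: 429 means rate limited, 503 means the model is still loading.
RETRYABLE_STATUSES = {429, 503}


def call_with_retry(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `send` until it succeeds, hits a non-retryable status, or runs out of attempts.

    `send` is a hypothetical zero-argument callable returning (status_code, parsed_json).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE_STATUSES or attempt == max_attempts - 1:
            raise RuntimeError(f"request failed with HTTP {status}: {body}")
        # Prefer the server's wait hint when present, otherwise back off exponentially.
        hint = body.get("estimated_time") if isinstance(body, dict) else None
        delay = float(hint) if hint is not None else base_delay * (2 ** attempt)
        sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

With `requests`, `send` could wrap the `requests.post(...)` call from the example above and return `(response.status_code, response.json())`.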

Serverless Inference Endpoints

A step up from the free API — you deploy a specific model on Hugging Face infrastructure and pay only for what you use (per token or per second). The endpoint scales to zero when idle. Good for:

  • Low-to-medium traffic production apps that don't need consistent latency
  • Cost-optimized deployments where cold starts are acceptable
  • Private model deployments (your fine-tuned models)

Dedicated Inference Endpoints

Dedicated Endpoints give you a private, always-on instance (CPU or GPU) running a single model. You choose the hardware tier, region, and scaling behaviour. The endpoint gets a private HTTPS URL.

Tier       | GPU                 | Cost (approx.) | Good for
CPU Medium | None (CPU only)     | ~$0.06/hr      | Embedding models, classifiers
T4 Small   | NVIDIA T4 (16 GB)   | ~$0.60/hr      | 7B models, real-time inference
A10G Small | NVIDIA A10G (24 GB) | ~$1.05/hr      | 13B models, image generation
A10G Large | 4× A10G (96 GB)     | ~$3.80/hr      | Larger models, higher throughput
A100 Large | NVIDIA A100 (80 GB) | ~$3.15/hr      | 70B models, high-quality LLMs
H100       | NVIDIA H100 (80 GB) | Custom         | Frontier inference, maximum speed
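Hourly prices are easier to compare once they are converted to a monthly figure. The sketch below does that arithmetic with the approximate prices from the table. The 730-hour month, the `replicas` parameter, and the `active_fraction` parameter (the share of time an endpoint is actually running) are assumptions added for illustration, and the estimate ignores cold starts, storage, and any per-token billing.

```python
HOURS_PER_MONTH = 730  # average month: 365 * 24 / 12

# Approximate hourly prices in USD, copied from the table above.
HOURLY_PRICE_USD = {
    "CPU Medium": 0.06,
    "T4 Small": 0.60,
    "A10G Small": 1.05,
    "A10G Large": 3.80,
    "A100 Large": 3.15,
}


def monthly_cost(tier: str, replicas: int = 1, active_fraction: float = 1.0) -> float:
    """Estimate monthly spend for `replicas` instances running `active_fraction` of the time."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    hourly = HOURLY_PRICE_USD[tier]
    return round(hourly * HOURS_PER_MONTH * replicas * active_fraction, 2)


print(monthly_cost("T4 Small"))                        # one always-on T4
print(monthly_cost("A100 Large", replicas=2))          # two always-on A100 replicas
print(monthly_cost("A10G Small", active_fraction=0.2)) # running 20% of the time
```

An always-on T4 Small at ~$0.60/hr works out to roughly $438 per month under these assumptions, which is the kind of number to weigh against pay-per-use pricing for low-traffic workloads.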

Dedicated Endpoints that serve text generation models through TGI expose an OpenAI-compatible API, so you can point the OpenAI Python client at your endpoint URL. For RAG pipelines, TEI embedding endpoints pair well with a vector database (see the vector databases guide).
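To make the OpenAI-compatible interface concrete, here is a minimal sketch that builds a chat completion request using only the Python standard library, so it stays self-contained. The helper names (`build_chat_request`, `send_chat`, `extract_reply`), the endpoint URL, and the token are placeholders, and the `/v1/chat/completions` route and `"model": "tgi"` value are the ones used for TGI elsewhere in this guide.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, user_message: str,
                       model: str = "tgi", max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /v1/chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def send_chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply (performs a network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

Calling `send_chat(build_chat_request("https://your-endpoint.example", "hf_your_token_here", "Hello"))` would query a real endpoint once the placeholder URL and token are replaced. The OpenAI Python client achieves the same thing by passing the endpoint's `/v1` URL as `base_url` and the Hugging Face token as `api_key`.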

Text Generation Inference (TGI)

TGI is Hugging Face's high-throughput LLM serving toolkit. It's what powers Dedicated Endpoints for text generation models — and you can also self-host it via Docker for your own infrastructure.

Key features:

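  • Continuous batching — admits new requests into the running batch as earlier ones finish, instead of waiting for the whole batch to complete, which keeps the GPU busy
  • PagedAttention — memory-efficient KV cache management (the technique introduced by vLLM)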
  • Tensor parallelism — spread a model across multiple GPUs
  • Speculative decoding — use a small draft model to generate candidate tokens, verified by the main model
  • Structured outputs — grammar-constrained generation for reliable JSON output
  • OpenAI-compatible API
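Continuous batching is the feature that most directly explains the throughput gain over naive serving, and a toy model makes the effect visible. The sketch below counts decode steps for a queue of requests under two policies: static batching, where a batch must finish completely before the next one starts, and continuous batching, where a freed slot is handed to the next request right away. This is an illustration under simplified assumptions (one token per step per slot, no prefill cost, made-up output lengths) and not a description of TGI's actual scheduler.

```python
import heapq
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Decode steps when each batch must fully finish before the next one starts."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # The whole batch waits for its longest sequence.
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Decode steps when a freed slot is given to the next queued request immediately."""
    slot_free_at = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slot_free_at)
    finish = 0
    for length in lengths:
        start = heapq.heappop(slot_free_at)  # earliest available slot
        end = start + length
        heapq.heappush(slot_free_at, end)
        finish = max(finish, end)
    return finish


output_lengths = [2, 8, 3, 7, 4, 6]  # tokens to generate per request (made-up numbers)
print(static_batching_steps(output_lengths, batch_size=2))      # 21
print(continuous_batching_steps(output_lengths, batch_size=2))  # 18
```

In this example the short requests no longer hold a slot idle while a long neighbour finishes, so the same work completes in 18 steps instead of 21. The gap grows as output lengths become more uneven.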
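# Self-host TGI via Docker
# HF_TOKEN is needed for gated models such as this one; --shm-size gives
# the container shared memory for multi-GPU communication.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v $HOME/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3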

# Query the server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello"}]}'

Text Embeddings Inference (TEI)

TEI is TGI's companion for embedding models. It serves models like BAAI/bge-large-en-v1.5 or sentence-transformers/all-MiniLM-L6-v2 at high throughput — useful for RAG pipelines that need to embed millions of documents.

# Self-host TEI via Docker
docker run --gpus all -p 8080:80 \
  -v $HOME/.cache/huggingface:/data \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5
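Once a TEI server is running, a client can embed documents in batches and compare the resulting vectors. The sketch below assumes the server from the Docker command above is reachable at `http://localhost:8080` and uses TEI's `/embed` route with an `inputs` list. The helper names and the batch size of 32 are illustrative choices, so check the server's configured batch limit before relying on them.

```python
import json
import math
import urllib.request
from typing import Iterator, List, Sequence


def chunk_texts(texts: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive batches so large corpora are not sent in a single request."""
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def build_embed_request(base_url: str, texts: List[str]) -> urllib.request.Request:
    """Build a POST request for the /embed route with a batch of input texts."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embed",
        data=json.dumps({"inputs": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url: str, texts: List[str], timeout: float = 60.0) -> List[List[float]]:
    """Send one batch to the server and return one vector per text (network call)."""
    request = build_embed_request(base_url, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

A typical loop would call `embed("http://localhost:8080", batch)` for each batch from `chunk_texts(documents)` and store the vectors in a vector database for retrieval.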

When to Use Each

Free Inference API
  • Testing a model quickly
  • Educational projects
  • Infrequent, low-volume use

Serverless Endpoints
  • Production with variable/bursty traffic
  • Cost-sensitive with acceptable cold starts
  • Private fine-tuned model serving

Dedicated Endpoints
  • Consistent low-latency SLA required
  • High-throughput production serving
  • Regulatory requirement for dedicated resources

TGI / TEI self-hosted
  • Data sovereignty: your data and model stay inside your own infrastructure
  • Maximum cost control at scale
  • Custom modifications to serving logic
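The guidance above can be read as a small decision procedure. The sketch below encodes one possible reading of it. The `Requirements` fields and the order in which the checks are applied are assumptions made for illustration, and a real decision would also weigh cost, team capacity, and compliance details.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical summary of a workload's constraints."""
    production: bool = False
    must_stay_on_own_infra: bool = False
    needs_consistent_latency: bool = False
    cold_starts_acceptable: bool = True


def recommend(req: Requirements) -> str:
    """Map requirements to one of the four options, checking the strictest constraint first."""
    if req.must_stay_on_own_infra:
        return "TGI / TEI self-hosted"
    if not req.production:
        return "Free Inference API"
    if req.needs_consistent_latency or not req.cold_starts_acceptable:
        return "Dedicated Endpoints"
    return "Serverless Endpoints"


print(recommend(Requirements()))
print(recommend(Requirements(production=True)))
print(recommend(Requirements(production=True, needs_consistent_latency=True)))
print(recommend(Requirements(must_stay_on_own_infra=True)))
```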

Checklist: Do You Understand This?

  • Can you call the free Inference API with a Python requests call?
  • Do you know when to use Serverless vs Dedicated Endpoints?
  • Can you explain what TGI is and why it's faster than naive inference?
  • Do you know the difference between TGI (text generation) and TEI (embeddings)?

Page built: 01 Jun 2026