Intermediate

Transformers Library

The Hugging Face Transformers library is one of the most widely used machine learning libraries, with well over 100k GitHub stars and broad adoption across industry and research labs. It provides a unified Python API to load, run, fine-tune, and deploy models across text, vision, audio, and multimodal tasks.

Install

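pip install transformers            # library only; a backend such as PyTorch must be installed separately
pip install transformers torch      # library + PyTorch (GPU support depends on the torch build for your platform)
pip install "transformers[torch]"   # library + PyTorch via the optional extra; the quotes stop shells like zsh from expanding the brackets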

Pipeline API — 3-Line Inference

The pipeline() function is the highest-level interface. It bundles tokenization, model loading, forward pass, and output decoding into a single call. Most common tasks are supported.

Input text → Tokenizer (text → token IDs) → Model (forward pass) → Post-process (IDs → text / labels) → Output

What pipeline() does under the hood in 3 user-facing lines
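To make the stages concrete, here is a minimal toy sketch of the same flow in plain Python. It is not how the library implements pipeline(); the vocabulary, the weights, and the function names (tokenize, forward, postprocess, toy_pipeline) are all invented for illustration.

```python
import math

# Hypothetical toy vocabulary and per-token weights, invented for this sketch only.
VOCAB = {"<unk>": 0, "this": 1, "product": 2, "is": 3, "amazing": 4, "terrible": 5}
LABELS = ["NEGATIVE", "POSITIVE"]
# For each token ID: (contribution to NEGATIVE, contribution to POSITIVE)
WEIGHTS = {
    0: (0.0, 0.0), 1: (0.0, 0.0), 2: (0.0, 0.0), 3: (0.0, 0.0),
    4: (-2.0, 2.0),   # "amazing" pushes toward POSITIVE
    5: (2.0, -2.0),   # "terrible" pushes toward NEGATIVE
}

def tokenize(text):
    """Stage 1: text -> token IDs (lowercase, drop simple punctuation, look up)."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in words]

def forward(token_ids):
    """Stage 2: token IDs -> one raw score (logit) per label."""
    logits = [0.0, 0.0]
    for token_id in token_ids:
        for i, weight in enumerate(WEIGHTS[token_id]):
            logits[i] += weight
    return logits

def postprocess(logits):
    """Stage 3: logits -> best label and its softmax probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]

def toy_pipeline(text):
    """Chain the three stages, as a single call would."""
    return postprocess(forward(tokenize(text)))

print(toy_pipeline("This product is amazing!"))
```

The real pipeline() also handles model download, batching, and device placement, which this sketch leaves out.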

from transformers import pipeline

# Text generation (this checkpoint is gated: accept its license on the Hub and log in first)
gen = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")
print(gen("The key to good code is")[0]["generated_text"])

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
print(summarizer("Long article text here...", max_length=130)[0]["summary_text"])

# Sentiment analysis (uses default model if none specified)
classifier = pipeline("sentiment-analysis")
print(classifier("This product is amazing!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Zero-shot classification — no fine-tuning needed
zsc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zsc("I want to book a flight", candidate_labels=["travel", "food", "sports"]))

# Speech to text
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
print(asr("audio.mp3")["text"])

# Image classification
img_clf = pipeline("image-classification", model="google/vit-base-patch16-224")
print(img_clf("cat.jpg"))
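Pipeline results are plain Python lists and dicts, so they can be consumed without any library-specific types. The sketch below assumes the zero-shot result shape (a dict with "sequence", "labels", and "scores" keys); the helper name top_label, the min_score threshold, and the sample scores are made up for illustration and are not real model output.

```python
def top_label(zero_shot_result, min_score=0.5):
    """Return the highest-scoring label, or None if it falls below min_score."""
    labels = zero_shot_result["labels"]
    scores = zero_shot_result["scores"]
    best_score, best_label = max(zip(scores, labels))
    return best_label if best_score >= min_score else None

# Hand-written stand-in for a zero-shot result; the scores are invented.
sample = {
    "sequence": "I want to book a flight",
    "labels": ["travel", "sports", "food"],
    "scores": [0.97, 0.02, 0.01],
}

print(top_label(sample))                   # travel
print(top_label(sample, min_score=0.99))   # None
```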

Supported Tasks

Text: text-generation, text2text-generation, summarization, translation, fill-mask, text-classification, token-classification, question-answering, zero-shot-classification

Vision and multimodal: image-classification, object-detection, image-segmentation, image-to-text, visual-question-answering

Audio: automatic-speech-recognition, audio-classification, text-to-speech

Text-to-image generation is handled by the separate Diffusers library rather than by a Transformers pipeline.

AutoClasses — Model-Agnostic Loading

For more control than pipeline(), use AutoClasses. They detect the model architecture automatically from the Hub config and load the right class:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision: about half the memory of float32
    device_map="auto",           # spread across available GPUs/CPU
)

# Move inputs to wherever the model's first layers were placed instead of assuming "cuda"
inputs = tokenizer("Explain RAG briefly:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
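The two loading options above can be reasoned about with simple arithmetic. The sketch below first estimates weight memory per dtype, then shows a deliberately simplified picture of spreading layers across devices in order, overflowing to the next device when one fills up. The real placement logic lives in the Accelerate library and handles many more cases; the function names, the round 7 billion parameter count, the 400 MB layer size, and the device budgets are all assumptions chosen for illustration.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone (no activations or KV cache), in GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def place_layers(layer_sizes_mb, device_budgets_mb):
    """Greedy in-order placement: fill each device in turn, then move to the next."""
    devices = list(device_budgets_mb.items())
    placement = {}
    idx, used = 0, 0
    for layer, size in enumerate(layer_sizes_mb):
        # Skip ahead while the current device cannot hold this layer.
        while idx < len(devices) and used + size > devices[idx][1]:
            idx, used = idx + 1, 0
        if idx == len(devices):
            raise MemoryError(f"layer {layer} does not fit on any remaining device")
        placement[layer] = devices[idx][0]
        used += size
    return placement

# Assumed round figure of 7 billion parameters.
print(weight_memory_gb(7_000_000_000, "float32"))  # 28.0
print(weight_memory_gb(7_000_000_000, "float16"))  # 14.0

# Hypothetical model: 32 layers of 400 MB each, one 8 GB GPU plus CPU RAM.
placement = place_layers([400] * 32, {"cuda:0": 8000, "cpu": 32000})
on_gpu = sum(1 for device in placement.values() if device == "cuda:0")
print(on_gpu, len(placement) - on_gpu)  # 20 12
```

Layers that land on the CPU still run, but much more slowly, which is why fitting the whole model on the GPU in half precision is usually the first thing to try.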

Trainer — Fine-Tuning

The Trainer class handles the full fine-tuning loop: batching, gradient accumulation, mixed precision, checkpointing, evaluation, and distributed training. Combine with the datasets library to fine-tune on any Hub dataset:

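from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)
from datasets import load_dataset

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Trainer expects token IDs, not raw text, so tokenize the dataset first
dataset = load_dataset("imdb")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",   # named evaluation_strategy in older releases
    fp16=True,               # mixed precision; requires a CUDA GPU
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  data_collator=DataCollatorWithPadding(tokenizer))  # pads each batch dynamically
trainer.train()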
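Gradient accumulation, one of the features listed above, is easy to check on a toy problem. The sketch below uses a one-parameter model y = w * x with mean squared error and shows that averaging the gradients of equal-sized micro-batches gives the same gradient as one large batch. It is plain Python for illustration, not Trainer's internal code; the function names and data are hypothetical. In Trainer, the corresponding setting is the gradient_accumulation_steps argument of TrainingArguments.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error with respect to w, for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of equal-sized micro-batches before one optimizer step."""
    starts = range(0, len(xs), micro_batch_size)
    grads = [
        mse_grad(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        for i in starts
    ]
    return sum(grads) / len(grads)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]   # toy data generated from y = 2x
w = 0.0

full = mse_grad(w, xs, ys)                               # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # 4 micro-batches of 2
print(full, accum)
```

Both values agree, which is why accumulating over 4 micro-batches of 2 behaves like a batch of 8 while only holding 2 examples in memory at a time.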

PEFT — Parameter-Efficient Fine-Tuning

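Full fine-tuning of a 7B model typically needs 80+ GB of GPU memory, because gradients and optimizer states must be stored alongside the weights. PEFT (Parameter-Efficient Fine-Tuning) techniques such as LoRA freeze the base model and train only a small set of adapter weights; QLoRA additionally keeps the frozen base model in 4-bit precision, which makes fine-tuning feasible on consumer hardware: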

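from peft import get_peft_model, LoraConfig, TaskType

# `model` is a causal LM, e.g. the one loaded in the AutoClasses section
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,           # rank: lower means fewer trainable parameters
    lora_alpha=32,  # scaling factor applied to the adapter update
    lora_dropout=0.1,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output format (exact counts depend on the base model, r, and target modules):
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06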
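The trainable-parameter count follows directly from the adapter shapes. For a weight matrix of shape d_out x d_in, LoRA adds two small matrices, A (r x d_in) and B (d_out x r), so each adapted matrix contributes r * (d_in + d_out) trainable weights. The sketch below works this out under assumed dimensions: 32 layers, hidden size 4096, and adapters on two square 4096 x 4096 projections per layer. These numbers and the function names are illustrative assumptions; models with differently shaped projections give different totals.

```python
def lora_params_per_matrix(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, so the adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

def lora_trainable_params(hidden_size, num_layers, r, matrices_per_layer=2):
    """Total adapter weights, assuming square hidden_size x hidden_size target matrices."""
    per_matrix = lora_params_per_matrix(hidden_size, hidden_size, r)
    return num_layers * matrices_per_layer * per_matrix

for r in (8, 16):
    total = lora_trainable_params(hidden_size=4096, num_layers=32, r=r)
    print(f"r={r}: {total:,} trainable parameters")
# r=8: 4,194,304 trainable parameters
# r=16: 8,388,608 trainable parameters
```

Under these assumptions, the 4,194,304 figure in the example output corresponds to r=8, and doubling the rank to 16 doubles the trainable count. Either way the adapters remain a fraction of a percent of a model with several billion parameters.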

Checklist: Do You Understand This?

Can you use pipeline() to run summarization, classification, and speech recognition?
Do you know when to use pipeline() vs AutoClasses + manual inference?
Can you set up a basic fine-tuning run with Trainer?
Do you understand what LoRA does and why it's used for fine-tuning large models?
Do you know what device_map="auto" does?