Hugging Face Datasets
The datasets library gives you one-line access to 500k+ public datasets — and a fast, memory-efficient way to work with large data without loading it all into RAM at once.
Install
pip install datasets
Loading a Dataset
from datasets import load_dataset
# Load the IMDb sentiment dataset
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 25000})
#     test: Dataset({features: ['text', 'label'], num_rows: 25000})
#     unsupervised: Dataset({features: ['text', 'label'], num_rows: 50000})
# })
# Load a specific split
train_data = load_dataset("imdb", split="train")
# Load with a subset / config
squad = load_dataset("rajpurkar/squad", "plain_text")
# Load just the first 1000 examples (faster for prototyping)
small = load_dataset("imdb", split="train[:1000]")
Navigating Dataset Objects
dataset = load_dataset("imdb")
train = dataset["train"]
# Shape
print(len(train)) # 25000
print(train.features) # {'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}
# Single example
print(train[0]) # {'text': '...', 'label': 0}
# Slice
print(train[:3]["text"]) # first 3 texts
# Column
print(train["label"][:10]) # first 10 labels
Transforming Data with map()
map() applies a function to every example, or to every batch when batched=True, and is the primary way to preprocess data (tokenization, formatting). Selecting rows is done with the separate filter() method.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)
# Apply to all examples (cached to disk — won't re-run on reload)
tokenized = train.map(tokenize, batched=True)
# Filter to only positive examples
positives = train.filter(lambda x: x["label"] == 1)
# Rename / remove columns
cleaned = train.rename_column("text", "input_text").remove_columns(["label"])
Streaming: Large Datasets Without RAM
Some datasets are hundreds of GB. Loading them into RAM would fail on most machines. Streaming downloads and processes data on-the-fly, one batch at a time:
# Stream a 100GB+ dataset without loading it into RAM
dataset = load_dataset("cerebras/SlimPajama-627B", streaming=True, split="train")
for batch in dataset.iter(batch_size=32):
    # Process 32 examples at a time
    # batch["text"] is a list of 32 strings
    pass
# Works with map() and filter() too
streamed = dataset.map(tokenize, batched=True).filter(lambda x: len(x["text"]) > 100)
Streaming: process 100GB+ datasets on a laptop; the data is never fully loaded into RAM
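The idea behind streaming can be sketched with plain Python generators. This is a conceptual illustration only, not how the library implements it: read_examples is a hypothetical stand-in for a remote source, and iter_batches imitates the shape of the batches that iter(batch_size=...) yields.

```python
from itertools import islice
from typing import Iterable, Iterator

def read_examples(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a remote source: yields one example at a time."""
    for i in range(n):
        yield {"text": f"document {i}"}

def iter_batches(examples: Iterable[dict], batch_size: int) -> Iterator[dict]:
    """Group a lazy stream into column-oriented batches of at most batch_size."""
    it = iter(examples)
    while True:
        chunk = list(islice(it, batch_size))  # pulls only batch_size items
        if not chunk:
            return
        yield {"text": [ex["text"] for ex in chunk]}

# Only one batch is ever held in memory at a time
batches = list(iter_batches(read_examples(10), batch_size=4))
batch_sizes = [len(b["text"]) for b in batches]
print(batch_sizes)
```

Because each batch is built on demand, peak memory depends on the batch size rather than on the total size of the dataset.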
Arrow Format — Why It's Fast
Datasets uses the Apache Arrow format for on-disk storage. Arrow is a columnar memory format designed for zero-copy reads: memory-mapped data can be accessed directly, without first deserializing it or copying it into Python objects. Benefits:
- Roughly 10–50× faster than loading from CSV/JSON for large datasets; the exact speedup depends on the data, storage, and access pattern
- Memory mapping — only the data you actually access is loaded from disk
- Processed datasets are cached to ~/.cache/huggingface/datasets/, so map() only runs once per dataset/function pair
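Memory mapping itself can be demonstrated with the standard library. The sketch below is an assumption-laden illustration, not Arrow's actual file layout: it stores a hypothetical column of fixed-width integers in a temporary file and reads single values by offset without loading the whole file.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # one little-endian 64-bit integer per record

# Write a hypothetical "column" of 100,000 integers to disk
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    for value in range(100_000):
        f.write(RECORD.pack(value * 2))

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    def read_value(index: int) -> int:
        # Decode directly from the mapped buffer; the OS pages in
        # only the region that is actually touched
        return RECORD.unpack_from(mm, index * RECORD.size)[0]

    first = read_value(0)
    middle = read_value(50_000)
    last = read_value(99_999)

print(first, middle, last)
```

Random access by offset is cheap here because every record has a fixed width. Columnar formats rely on the same property for fixed-size types.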
Integration with Transformers Trainer
The Datasets library is the recommended way to supply data to Transformers' Trainer. They're designed to work together:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length"),
    batched=True,
)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
trainer = Trainer(
    model=model,
    # fp16=True needs a supported GPU; remove it when training on CPU
    args=TrainingArguments(output_dir="./results", num_train_epochs=3, fp16=True),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
Pushing Your Own Dataset
from datasets import Dataset
import pandas as pd
# Create from a pandas DataFrame
df = pd.read_csv("my_data.csv")
dataset = Dataset.from_pandas(df)
# Push to the Hub (public or private); requires being logged in with a Hub access token
dataset.push_to_hub("your-username/my-dataset", private=True)
Checklist: Do You Understand This?
- Can you load a dataset from the Hub and inspect its structure?
- Do you know how to use map() to tokenize a dataset for fine-tuning?
- Can you explain when to use streaming and why it matters for large datasets?
- Do you understand what Arrow format is and why it's faster than CSV/JSON?
- Can you connect a tokenized dataset directly to a Transformers Trainer?