
Hugging Face Datasets

The datasets library gives you one-line access to 500k+ public datasets — and a fast, memory-efficient way to work with large data without loading it all into RAM at once.

Install

pip install datasets

Loading a Dataset

from datasets import load_dataset

# Load the IMDb sentiment dataset
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 25000})
#     test: Dataset({features: ['text', 'label'], num_rows: 25000})
# })

# Load a specific split
train_data = load_dataset("imdb", split="train")

# Load with a subset / config
squad = load_dataset("rajpurkar/squad", "plain_text")

# Load just the first 1000 examples (faster for prototyping)
small = load_dataset("imdb", split="train[:1000]")

dataset = load_dataset("imdb")
train = dataset["train"]

# Shape
print(len(train))           # 25000
print(train.features)       # {'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}

# Single example
print(train[0])             # {'text': '...', 'label': 0}

# Slice
print(train[:3]["text"])    # first 3 texts

# Column
print(train["label"][:10])  # first 10 labels

Transforming Data with map()

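map() applies a function to every example (or to every batch when batched=True) and is the primary way to preprocess data, for example tokenization and formatting. filter() and the column helpers shown after it cover row selection and cleanup: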

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # With batched=True, batch["text"] is a list of strings
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

# Apply to all examples (the result is cached to disk, so it won't re-run on reload)
tokenized = train.map(tokenize, batched=True)

# Filter to only positive examples
positives = train.filter(lambda x: x["label"] == 1)

# Rename / remove columns
cleaned = train.rename_column("text", "input_text").remove_columns(["label"])

Streaming — Large Datasets Without RAM

Some datasets are hundreds of GB. Downloading and storing them in full is impractical on most machines, let alone loading them into RAM. Streaming downloads and processes data on the fly, a few examples at a time:

# Stream a 100GB+ dataset without loading it into RAM
dataset = load_dataset("cerebras/SlimPajama-627B", streaming=True, split="train")

for batch in dataset.iter(batch_size=32):
    # Process 32 examples at a time
    # batch["text"] is a list of 32 strings
    pass

# Works with map() and filter() too
streamed = dataset.map(tokenize, batched=True).filter(lambda x: len(x["text"]) > 100)

Figure: the streaming loop. Hub (dataset stored remotely) → Stream (download 32 examples) → map() (tokenize in-flight) → Trainer (train on batch) → next batch (repeat). Streaming lets a laptop process 100GB+ datasets because the data is never fully loaded into RAM.

Arrow Format — Why It's Fast

Datasets stores data on disk in the Apache Arrow format. Arrow is a columnar memory format designed for zero-copy reads: the on-disk layout matches the in-memory layout, so a file can be memory-mapped and read without a parsing or deserialization step. Benefits:

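Loading can be 10–50× faster than parsing CSV/JSON for large datasets, though the exact speedup depends on the data and hardware
Memory mapping: only the data you actually access is read from disk
Processed datasets are cached under ~/.cache/huggingface/datasets/ by default, so an identical map() call on the same dataset reuses the cached result instead of running again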

Integration with Transformers Trainer

The Datasets library is the recommended way to supply data to Transformers' Trainer. They're designed to work together:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

tokenized = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length"),
    batched=True
)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    # fp16=True needs a CUDA GPU; drop it when training on CPU
    args=TrainingArguments(output_dir="./results", num_train_epochs=3, fp16=True),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
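
One choice in the snippet above worth a second look is padding="max_length", which pads every example to the model's maximum length regardless of how short it is. The arithmetic below uses made-up token counts and a hypothetical batch size of 4. It compares fixed padding with padding each batch only to its own longest sequence (dynamic padding, which Transformers offers through DataCollatorWithPadding):

```python
# Made-up token counts for eight examples (illustrative, not measured)
lengths = [12, 40, 35, 500, 18, 22, 64, 30]
max_length = 512
batch_size = 4

# Fixed padding: every example occupies max_length positions
fixed_total = len(lengths) * max_length

# Dynamic padding: each batch is padded to its own longest example
batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
dynamic_total = sum(max(b) * len(b) for b in batches)

print(fixed_total, dynamic_total)  # 4096 2256
```

How much this saves in practice depends on the length distribution of the real data. Fixed padding is simpler and remains a reasonable default for short experiments.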

Pushing Your Own Dataset

from datasets import Dataset
import pandas as pd

# Create from a pandas DataFrame
df = pd.read_csv("my_data.csv")
dataset = Dataset.from_pandas(df)

# Push to the Hub (public or private); requires logging in first, e.g. with huggingface_hub.login()
dataset.push_to_hub("your-username/my-dataset", private=True)

Checklist: Do You Understand This?

Can you load a dataset from the Hub and inspect its structure?
Do you know how to use map() to tokenize a dataset for fine-tuning?
Can you explain when to use streaming and why it matters for large datasets?
Do you understand what Arrow format is and why it's faster than CSV/JSON?
Can you connect a tokenized dataset directly to a Transformers Trainer?