Data Cleaning with AI
Data cleaning is where most data work actually happens — and where AI provides genuine leverage. AI can write the pandas or SQL code to fix specific problems, explain what each cleaning step does, and help you build a systematic pipeline. The key is being specific about the problem: vague requests produce vague cleaning that misses the actual issues in your data.
The Data Cleaning Prompt Formula
Every effective data cleaning prompt includes four elements: the role, the task, the context (what the data contains and what is wrong with it), and the output format.
You are a data engineer. Clean the following dataset.
Context: This is a CSV of [DESCRIBE WHAT THE DATA REPRESENTS]. It has [N] columns and [M] rows.
Known issues:
- [SPECIFIC ISSUE 1 — e.g., "the date column has mixed formats: some are MM/DD/YYYY, some are ISO 8601"]
- [SPECIFIC ISSUE 2 — e.g., "the amount column has some values as strings with a $ prefix"]
- [SPECIFIC ISSUE 3 — e.g., "there are ~5% missing values in the email column"]
Tools: Python with pandas. Target: a clean DataFrame ready for [DOWNSTREAM USE — analysis / ML / export].
Output: A single function that takes the raw DataFrame and returns a cleaned DataFrame. Include a comment for each transformation explaining what it fixes.
Listing known issues explicitly produces targeted cleaning code. Generic "clean this data" prompts produce generic code that misses your specific problems.
Common Cleaning Tasks and Prompts
Missing Values
Write pandas code to handle missing values in this DataFrame. For each column:
- [column_a]: drop the row if missing (it is required)
- [column_b]: fill with the column median (it is a numeric metric)
- [column_c]: fill with "Unknown" (it is a categorical label)
- [column_d]: forward-fill using the previous row value (it is a time-series field)
After each fill, print the count of remaining nulls to confirm it worked.
Be explicit about the strategy for each column — median/mean/mode/forward-fill have different implications for downstream analysis.
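The per-column strategies above translate directly into pandas. A minimal sketch, with illustrative column names matching the prompt template (your real columns will differ):

```python
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply one explicit missing-value strategy per column."""
    df = df.copy()
    # Required field: drop the whole row if it is missing.
    df = df.dropna(subset=["column_a"])
    # Numeric metric: fill with the column median.
    df["column_b"] = df["column_b"].fillna(df["column_b"].median())
    # Categorical label: fill with an explicit "Unknown" marker.
    df["column_c"] = df["column_c"].fillna("Unknown")
    # Time-series field: carry the previous row's value forward.
    df["column_d"] = df["column_d"].ffill()
    # Confirm each fill worked.
    print(df.isna().sum())
    return df
```

Note that `ffill` leaves a NaN in place if the very first row is missing, so the final null count is still worth checking.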
Type Standardisation
Standardise these columns to consistent types:
- date_col: convert to pandas datetime, handling both "MM/DD/YYYY" and ISO 8601 formats
- amount_col: remove "$" and "," characters, convert to float64
- phone_col: strip all non-digit characters, store as string (not int — leading zeros matter)
- status_col: lowercase, strip whitespace, normalise all case variants ("Active", "ACTIVE") to the canonical value "active"
For any value that cannot be converted, set it to NaN with a note of the original value in a separate _original column for audit.
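A sketch of what the generated code might look like, using the illustrative column names from the prompt. The audit column is shown for `amount_col` only; the same pattern applies to the other columns:

```python
import pandas as pd

def standardise_types(df: pd.DataFrame) -> pd.DataFrame:
    """Standardise column types; unparseable values become NaN/NaT."""
    df = df.copy()
    # Dates: try ISO 8601 first, then fall back to MM/DD/YYYY for the rest.
    iso = pd.to_datetime(df["date_col"], format="%Y-%m-%d", errors="coerce")
    us = pd.to_datetime(df["date_col"], format="%m/%d/%Y", errors="coerce")
    df["date_col"] = iso.fillna(us)
    # Amounts: keep the raw value for audit, strip "$" and ",", cast to float.
    df["amount_col_original"] = df["amount_col"]
    stripped = df["amount_col"].astype(str).str.replace(r"[$,]", "", regex=True)
    df["amount_col"] = pd.to_numeric(stripped, errors="coerce").astype("float64")
    # Phones: digits only, kept as string so leading zeros survive.
    df["phone_col"] = df["phone_col"].astype(str).str.replace(r"\D", "", regex=True)
    # Status: strip and lowercase so "Active" / "ACTIVE " collapse to "active".
    df["status_col"] = df["status_col"].str.strip().str.lower()
    return df
```

Parsing each date format separately with `errors="coerce"` and merging the results avoids pandas guessing formats row by row.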
Deduplication
Write deduplication code for this DataFrame.
Exact duplicates: rows where [column_a] and [column_b] are identical — keep the row with the most recent [timestamp_col].
Fuzzy duplicates in [name_col]: rows where names are similar but not identical (e.g., "John Smith" vs "J. Smith" vs "JOHN SMITH") — use fuzzy string matching to group them. For each group, keep the row with the most complete data.
Before deduplication: print the count of exact and fuzzy duplicates found.
After deduplication: print the count of rows removed.
Fuzzy deduplication uses libraries like thefuzz or rapidfuzz. Always confirm the groups found before auto-dropping — fuzzy matching has false positives.
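A sketch of the two-stage approach. For portability this example uses the standard library's difflib as a stand-in for thefuzz/rapidfuzz (those libraries are faster and a better choice in practice); column names are illustrative, and the similarity threshold of 0.8 is an assumption to tune:

```python
import difflib
import pandas as pd

def dedupe(df: pd.DataFrame) -> pd.DataFrame:
    """Remove exact duplicates; report fuzzy name groups for manual review."""
    # Exact duplicates: keep the most recent row per (column_a, column_b) pair.
    before = len(df)
    df = (df.sort_values("timestamp_col")
            .drop_duplicates(subset=["column_a", "column_b"], keep="last"))
    print(f"exact duplicates removed: {before - len(df)}")
    # Fuzzy duplicates: group names whose similarity ratio clears a threshold.
    # These groups are printed for review, not auto-dropped.
    groups = []
    for name in df["name_col"].str.lower():
        for group in groups:
            if difflib.SequenceMatcher(None, name, group[0]).ratio() >= 0.8:
                group.append(name)
                break
        else:
            groups.append([name])
    print("fuzzy groups to review:", [g for g in groups if len(g) > 1])
    return df
```

Only the exact duplicates are dropped automatically; the fuzzy groups are surfaced for a human decision, as recommended above.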
Outlier Detection
Detect outliers in [column_name] using the IQR method.
Output:
1. A summary: count of outliers, min/max of non-outlier range, and the 5 most extreme outlier values
2. A boolean mask column "is_outlier" on the DataFrame (do not drop them — flag them)
3. A plot of the distribution before and after flagging (matplotlib)
Context: [describe what this column represents and whether extreme values could be legitimate — e.g., "transaction amount in USD — values over $10,000 are unusual but possible for enterprise customers"]
Flag outliers rather than dropping them — review before deciding. Legitimate extreme values exist in most real datasets.
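The IQR method in pandas, flagging rather than dropping (the plotting step is omitted here for brevity; the `k=1.5` multiplier is the conventional default):

```python
import pandas as pd

def flag_outliers_iqr(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] without dropping them."""
    df = df.copy()
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    df["is_outlier"] = (df[col] < lower) | (df[col] > upper)
    # Summary: count, non-outlier range, and the most extreme flagged values.
    print(f"outliers: {df['is_outlier'].sum()}, non-outlier range: [{lower}, {upper}]")
    extreme = df.loc[df["is_outlier"], col]
    order = extreme.abs().sort_values(ascending=False).index
    print("most extreme:", extreme.reindex(order).head(5).tolist())
    return df
```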
AI vs Rule-Based Cleaning
| Task | Use AI to | Use rules/code for |
|---|---|---|
| Type conversion | Generate the pandas type conversion code | The actual conversion in production — deterministic is safer |
| Free-text normalisation | Classify, categorise, or normalise unstructured text values | High-volume production pipelines (AI classification is slow and costly at scale) |
| Anomaly investigation | Explain why a value looks anomalous; suggest whether to flag or drop | The final decision — only you know what's legitimate in your domain |
| Pipeline code | Generate the initial cleaning function | Running it — test on a sample before running on the full dataset |
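The last row of the table, running generated code on a sample first, can be wrapped in a small guard. A sketch under the assumption that `clean_fn` is whatever hypothetical cleaning function the AI produced:

```python
import pandas as pd

def run_on_sample_first(clean_fn, df: pd.DataFrame, n: int = 1000) -> pd.DataFrame:
    """Sanity-check a cleaning function on a sample before the full dataset."""
    sample = df.sample(n=min(n, len(df)), random_state=0)
    cleaned = clean_fn(sample)
    # Basic invariants: no columns silently lost, no rows silently invented.
    assert set(df.columns) <= set(cleaned.columns), "cleaning dropped columns"
    assert len(cleaned) <= len(sample), "cleaning added rows"
    return clean_fn(df)
```

The invariants here are deliberately minimal; add checks that matter for your data, such as value ranges or expected dtypes.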
Checklist: Do You Understand This?
- What are the four elements of an effective data cleaning prompt?
- Why should you flag outliers rather than dropping them — what could go wrong if you drop immediately?
- When deduplicating, what should you do before auto-dropping fuzzy matches — and why?
- For a phone number column, why should you store as string rather than integer after cleaning?
- When is AI classification for free-text normalisation not appropriate for production, and what should you use instead?
- What should you always do before running AI-generated cleaning code on a full dataset?