🧠 All Things AI
Beginner

Data Quality with AI

Traditional data quality relies on fixed rules: "column X must not be null," "value must be in this list." These rules are brittle — they cannot adapt to context, catch novel anomalies, or explain why a value looks wrong. AI complements rule-based quality systems by handling the cases that rules cannot express. This page covers how to use AI for generating validation rules, detecting anomalies, and building quality pipelines that work at both the rule layer and the reasoning layer.

The Two-Layer Data Quality Model

Layer 1: Rule-based checks (run always)

  • Schema validation: correct column types and names
  • Not-null checks on required columns
  • Range checks: values within expected bounds
  • Referential integrity: foreign keys exist
  • Format validation: email regex, ISO dates, phone patterns
  • Tools: Great Expectations, Pandera, dbt tests
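The layer-1 checks above can all be expressed directly in code. Here is a minimal sketch in plain Python to make the idea concrete — tools like Great Expectations and Pandera wrap the same logic in richer APIs. The table and rules are hypothetical examples:

```python
import re

# Hypothetical rules for an example "orders" table.
RULES = {
    "order_id": {"required": True},
    "status":   {"required": True, "in_set": {"active", "inactive", "pending"}},
    "amount":   {"required": True, "min": 0, "max": 1_000_000},
    "email":    {"required": False, "regex": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
}

def check_row(row: dict) -> list[str]:
    """Return a list of rule violations for one record."""
    errors = []
    for col, rule in RULES.items():
        value = row.get(col)
        if value is None:
            if rule.get("required"):
                errors.append(f"{col}: must not be null")
            continue
        if "in_set" in rule and value not in rule["in_set"]:
            errors.append(f"{col}: {value!r} not in allowed set")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{col}: {value} below minimum {rule['min']}")
        if "max" in rule and value >= rule["max"]:
            errors.append(f"{col}: {value} at or above maximum {rule['max']}")
        if "regex" in rule and not re.match(rule["regex"], str(value)):
            errors.append(f"{col}: bad format")
    return errors
```

Every check here is a fixed rule: it fires or it does not, with no awareness of context — which is exactly the limitation layer 2 addresses.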

Layer 2: AI-based checks (for complex cases)

  • Context-aware anomaly detection: value is within range but unexpected given other columns
  • Semantic validation: does this address/product name/company name look real?
  • Pattern change detection: statistical shift in distributions over time
  • Free-text quality: is this a valid product description, not spam?
  • Cross-column coherence: is this age realistic given this birth year?

Generating Validation Rules with AI

Prompt to generate Great Expectations checks:

Generate a Great Expectations validation suite for the following table.

Table: [TABLE NAME]

Description: [WHAT THIS TABLE STORES AND HOW IT IS USED]

Schema:

[PASTE DDL OR COLUMN LIST WITH TYPES]

Business rules I know:

- [RULE 1 — e.g., "status must be one of: active, inactive, pending"]

- [RULE 2 — e.g., "amount must be positive and less than 1,000,000"]

- [RULE 3 — e.g., "created_at must be after 2020-01-01"]

Generate: expect_column_values_to_not_be_null, expect_column_values_to_be_in_set, expect_column_values_to_be_between, and any other relevant expectations.

Include a comment on each expectation explaining what business rule it enforces.
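Given that prompt, the model produces a suite you can review and load. Great Expectations suites serialize to JSON; a generated result might look like this (the table name, columns, and notes are illustrative):

```json
{
  "expectation_suite_name": "orders_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "order_id"},
      "meta": {"notes": "Primary key — every order must have an id"}
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {"column": "status", "value_set": ["active", "inactive", "pending"]},
      "meta": {"notes": "Business rule: status is a closed enum"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "amount", "min_value": 0, "max_value": 1000000},
      "meta": {"notes": "Business rule: amount positive and under 1,000,000"}
    }
  ]
}
```

Review every generated expectation against your actual business rules before adding it to the pipeline — the model only knows what your prompt told it.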

Prompt to generate dbt test YAML:

Generate dbt test YAML for this model. Include:

- not_null tests for required columns

- unique tests for primary key columns

- accepted_values tests for enum columns (with the valid values listed)

- relationships tests for foreign key columns

Model name: [MODEL NAME]

Columns: [PASTE COLUMN DEFINITIONS]

Foreign keys: [DESCRIBE RELATIONSHIPS]
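The output of that prompt is a standard dbt schema YAML block. A generated result might look like this (model and column names are placeholders):

```yaml
version: 2

models:
  - name: orders            # [MODEL NAME]
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - not_null
          - accepted_values:
              values: ['active', 'inactive', 'pending']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: id
```

These tests run with `dbt test` as part of your normal transformation pipeline, so failures surface alongside your build.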

AI-Powered Anomaly Detection

For anomalies that rule-based checks cannot catch, use AI to reason about whether values are plausible given context. Current frontier models handle this well when the prompt is clear and specific, and when it describes concretely what normal data looks like.

Contextual anomaly detection prompt:

Analyse the following records and identify any that look anomalous. For each anomaly:

- State which field looks wrong

- Explain why it looks anomalous given the other fields in the same row

- Rate your confidence: HIGH / MEDIUM / LOW

- Suggest whether to: flag for review / auto-reject / accept with warning

Context: [DESCRIBE WHAT THIS DATA IS AND WHAT IS NORMAL]

[PASTE RECORDS — as JSON or CSV, with headers]

Treat the confidence ratings as a triage filter, not a verdict: never auto-reject on the AI's assessment alone. Have a human review HIGH and MEDIUM flags; log LOW flags rather than acting on them.
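A thin wrapper can assemble this prompt from live records before sending it to whichever LLM API you use. A minimal sketch (the context string and records are hypothetical; the actual model call is elided):

```python
import json

ANOMALY_TEMPLATE = """Analyse the following records and identify any that look anomalous. For each anomaly:
- State which field looks wrong
- Explain why it looks anomalous given the other fields in the same row
- Rate your confidence: HIGH / MEDIUM / LOW
- Suggest whether to: flag for review / auto-reject / accept with warning

Context: {context}

Records:
{records}"""

def build_anomaly_prompt(context: str, records: list[dict]) -> str:
    """Fill the contextual-anomaly template with a batch of records as JSON."""
    return ANOMALY_TEMPLATE.format(
        context=context,
        records=json.dumps(records, indent=2, default=str),
    )

# The returned string goes to your LLM client of choice, e.g. (pseudocode):
#   response = client.chat(model=..., prompt=build_anomaly_prompt(ctx, batch))
```

Batching several records into one call amortizes prompt overhead and gives the model peer rows to compare against.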

Building a Data Quality Pipeline

  1. Profile your data first. Ask AI: "Given this schema and sample data, what quality issues are most likely to appear?" Use this to write targeted rules, not generic checks.
  2. Generate rule-based checks. Use the prompts above to generate Great Expectations or dbt tests. Review and correct the generated rules before running.
  3. Set failure thresholds, not pass/fail. A column having 2% nulls is different from 40%. Configure thresholds that trigger alerts, not just binary pass/fail.
  4. Add AI anomaly checks for high-value data. Run AI contextual checks on records flagged by rules but not clearly bad, or on high-stakes records (large transactions, new customer signups).
  5. Log and learn. Track which quality checks fail most often. Ask AI to suggest new rules from historical failure patterns: "These records failed quality checks in the past month: [paste]. Suggest new rules to catch them earlier."
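Step 3 above — thresholds instead of binary pass/fail — can be sketched in a few lines. The cutoffs here are arbitrary illustrations; tune them per column:

```python
def null_rate(records: list[dict], column: str) -> float:
    """Fraction of records where `column` is missing or None."""
    if not records:
        return 0.0
    nulls = sum(1 for r in records if r.get(column) is None)
    return nulls / len(records)

# Hypothetical graded thresholds: warn early, fail only on severe degradation.
THRESHOLDS = [(0.40, "FAIL"), (0.10, "ALERT"), (0.02, "WARN")]

def grade(rate: float) -> str:
    """Map a failure rate onto a severity level instead of a binary verdict."""
    for cutoff, level in THRESHOLDS:
        if rate >= cutoff:
            return level
    return "OK"
```

A 2% null rate produces a WARN you can watch over time; a 40% rate fails the run — the distinction a binary check throws away.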

Tools and Integration

Rule-based quality tools

  • Great Expectations: Python-native, generates expectation suites, HTML reports
  • Pandera: pandas integration, schema validation with type-level and value-level checks
  • dbt tests: SQL-level tests built into your transformation pipeline
  • Soda Core: YAML-based quality checks, integrates with Airflow and dbt

When AI quality checks make sense

  • High-value records where false positives are acceptable (flag for review, not reject)
  • Unstructured or semi-structured data that rules cannot express
  • Small batch sizes where per-record AI cost is acceptable
  • Generating new rules from historical failures — use AI to write the rules, rules to enforce them

AI quality checks at scale

AI per-record quality checking is too slow and expensive for large batch pipelines. Use AI to generate rules and investigate anomalies, not to evaluate every row in production. Apply AI checks on sampled or flagged subsets, not full table scans.
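One way to keep AI checks affordable is to review only rule-flagged records plus a small random sample of the rest, so the AI can also catch anomalies the rules missed. A sketch with an arbitrary 1% sampling rate:

```python
import random

def select_for_ai_review(records: list[dict],
                         flagged_ids: set,
                         sample_rate: float = 0.01,
                         seed: int = 42) -> list[dict]:
    """All rule-flagged records go to AI review, plus a random sample
    of unflagged ones to catch what the rules missed."""
    rng = random.Random(seed)
    flagged = [r for r in records if r["id"] in flagged_ids]
    rest = [r for r in records if r["id"] not in flagged_ids]
    k = max(1, int(len(rest) * sample_rate)) if rest else 0
    return flagged + rng.sample(rest, k)
```

On a million-row table this sends thousands of records to the model instead of a million — a cost difference of roughly two orders of magnitude.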

Checklist: Do You Understand This?

  • What is the two-layer data quality model — what does each layer handle and why are both needed?
  • What should you include in a prompt to generate a Great Expectations suite — beyond just the schema?
  • What is contextual anomaly detection, and why can't rule-based checks handle it?
  • Why should you set failure thresholds (e.g., 2% null rate) rather than binary pass/fail for quality checks?
  • When is per-record AI quality checking cost-appropriate, and when is it not?
  • How can historical quality failures be used to improve your validation rules?