TDD for data. Is this a thing?

Test-Driven Development (TDD) has been a staple of software engineering for decades. Kent Beck popularized it in the late 1990s as part of Extreme Programming (XP) and set it out in his influential book “Test-Driven Development: By Example” in 2003.

The mantra of “Red → Green → Refactor” has shaped how teams build reliable, maintainable applications. But when it comes to data engineering, analytics, and machine learning pipelines, TDD is far less common. Is TDD even possible for data work? And if so, what approaches are teams actually using today?

A Quick Refresher on TDD

Classic TDD Process:

Write a failing test (Red) before writing any production code.
Write the minimal code needed to make the test pass (Green).
Refactor while keeping tests green.
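
As a minimal sketch of one such cycle, assume a hypothetical helper called normalize_country that cleans up country names. The function name and its rules are illustrative assumptions, not part of any library. The test is written first and only passes once the function exists.

```python
# Step 1 (Red): the test is written first. Called before normalize_country
# is defined, it fails with a NameError.
def test_normalize_country():
    assert normalize_country("  canada ") == "Canada"
    assert normalize_country("UNITED STATES") == "United States"
    assert normalize_country("") is None


# Step 2 (Green): the smallest implementation that satisfies the test.
# Step 3 (Refactor) would tidy this up while the test stays green.
def normalize_country(raw):
    """Trim whitespace and title-case a country name; empty input becomes None."""
    cleaned = raw.strip()
    if not cleaned:
        return None
    return cleaned.title()


test_normalize_country()  # raises AssertionError if any expectation is broken
```

In a real project the test would live in its own file and be run by a test runner such as pytest; it is called directly here only to keep the sketch self-contained.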

The aim is:

Early defect detection.
Clear definition of “done”.
Incremental, safe changes.

In application development, these tests typically verify business logic and API contracts. But in the data world, things are different.

Why TDD is Harder for Data

Data projects face challenges that make TDD less straightforward:

Data is variable: It changes over time, often in ways you can’t fully predict.
Schema drift: Upstream systems alter field names or types without notice.
Uncontrollable sources: APIs, vendor feeds, or partner data may arrive late or malformed.
Non-deterministic transformations: Aggregations, joins, and machine learning can produce different results due to timing or randomness.

The fundamental question is:

How do you write the test first if you don’t yet know the data?

The Case for Data TDD

While traditional TDD is rare in data work, a modified approach is gaining traction:

Define expected data behavior before building transformations.
Write data quality tests that fail initially.
Only then implement transformations until tests pass.
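
The three steps above can be sketched in plain Python. Everything here is a hypothetical illustration: the order rows, the check_orders expectations, and the transform_orders function are assumptions made for the example, and a real pipeline would more likely express them in dbt, Great Expectations, or similar tooling.

```python
# Hypothetical raw input containing the problems the expectations forbid.
RAW_ORDERS = [
    {"order_id": 1, "amount": "19.99", "country": "canada"},
    {"order_id": 1, "amount": "19.99", "country": "canada"},  # duplicate
    {"order_id": 2, "amount": None, "country": "Mexico"},     # missing amount
    {"order_id": 3, "amount": "5.00", "country": "Mexico"},
]


def check_orders(rows):
    """Step 1: expected data behavior. Returns violated expectations (empty = green)."""
    failures = []
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("order_id is not unique")
    if any(r["amount"] is None for r in rows):
        failures.append("amount contains nulls")
    if any(not isinstance(r["amount"], float)
           for r in rows if r["amount"] is not None):
        failures.append("amount is not a float")
    return failures


def transform_stub(rows):
    """Step 2 (Red): a pass-through transformation, so the checks fail."""
    return list(rows)


def transform_orders(rows):
    """Step 3 (Green): drop null amounts, cast to float, de-duplicate by order_id."""
    seen, out = set(), []
    for r in rows:
        if r["amount"] is None or r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({**r, "amount": float(r["amount"])})
    return out


print(check_orders(transform_stub(RAW_ORDERS)))    # three failures listed
print(check_orders(transform_orders(RAW_ORDERS)))  # []
```

Returning a list of failures rather than raising on the first one is a design choice: it shows every broken expectation in a single run, which is usually more helpful when the data itself is the thing under test.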

TDD in ETL/ELT Development

Teams using dbt or SQL-based pipelines can:

Write schema tests (not_null, unique) before creating transformations.
Define expected row counts, value ranges, and referential integrity rules up front.

For example, a dbt test might look like:

[Image: YAML snippet defining an accepted_values test on the country_name field, with the values 'United States', 'Canada', and 'Mexico' and severity set to warn.]

This test raises a warning if the country name is not part of that enumerated set. For more information on dbt, please read:

Data Build Tool (dbt)
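
The checks described in this section (not_null, unique, accepted values, referential integrity) all follow the same pattern: a query selects the rows that break a rule, and the test passes when no rows come back. The sketch below imitates that pattern outside dbt, using Python's built-in sqlite3 module. The tables, columns, and sample rows are hypothetical, and this is not how dbt itself is invoked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, country_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Canada'), (2, 'Mexico'), (3, 'Atlantis');
    INSERT INTO orders VALUES (10, 1), (11, 2), (11, 2), (12, 99);
""")

# Each test is a query that selects the rows violating a rule.
TESTS = {
    "not_null: orders.order_id":
        "SELECT * FROM orders WHERE order_id IS NULL",
    "unique: orders.order_id":
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1",
    "accepted_values: customers.country_name":
        "SELECT * FROM customers WHERE country_name NOT IN "
        "('United States', 'Canada', 'Mexico')",
    "relationships: orders.customer_id -> customers.customer_id":
        "SELECT o.* FROM orders o LEFT JOIN customers c "
        "ON o.customer_id = c.customer_id WHERE c.customer_id IS NULL",
}


def run_tests(connection):
    """Return {test name: number of failing rows}; zero means the test passes."""
    return {name: len(connection.execute(sql).fetchall())
            for name, sql in TESTS.items()}


results = run_tests(conn)
for name, failing in results.items():
    status = "PASS" if failing == 0 else "FAIL"
    print(status, name, f"({failing} failing rows)")
```

With this sample data only the not_null test passes: order 11 is duplicated, 'Atlantis' is not an accepted country, and order 12 points at a customer that does not exist.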

TDD for Data APIs

If your pipeline exposes a data API:

Write API contract tests first (e.g., “GET /orders?date=2025-08-14 returns orders in a JSON schema”).
Implement the ingestion/processing only after the test suite exists.

This approach depends on a stable set of synthetic data that populates the test database. Start with synthetic test data that covers edge cases (nulls, duplicates, out-of-range values), then build transformations that make the tests pass against this mock data before connecting to real sources.
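
A contract test of this kind might be sketched as follows, using only the standard library. The get_orders handler, the ORDER_CONTRACT field list, and the synthetic rows are all hypothetical stand-ins; a real service would typically sit behind an HTTP framework and validate against a formal JSON Schema or a Pact contract.

```python
import json

# Synthetic rows covering edge cases: a duplicate, a null, an out-of-range value.
SYNTHETIC_ORDERS = [
    {"order_id": 1, "date": "2025-08-14", "amount": 25.0},
    {"order_id": 1, "date": "2025-08-14", "amount": 25.0},  # duplicate
    {"order_id": 2, "date": "2025-08-14", "amount": None},  # null
    {"order_id": 3, "date": "2025-08-14", "amount": -4.0},  # out of range
    {"order_id": 4, "date": "2025-08-13", "amount": 9.5},   # different day
]

# The contract: required field -> expected Python type after JSON decoding.
ORDER_CONTRACT = {"order_id": int, "date": str, "amount": float}


def contract_violations(body):
    """Check a JSON response body against ORDER_CONTRACT; return problems found."""
    problems, seen = [], set()
    for order in json.loads(body):
        oid = order.get("order_id")
        for field, expected in ORDER_CONTRACT.items():
            if not isinstance(order.get(field), expected):
                problems.append(f"order {oid}: bad {field}")
        amount = order.get("amount")
        if isinstance(amount, float) and amount < 0:
            problems.append(f"order {oid}: negative amount")
        if oid in seen:
            problems.append(f"order {oid}: duplicate")
        seen.add(oid)
    return problems


def get_orders(date, rows):
    """Hypothetical handler for GET /orders?date=...: filter, clean, serialize."""
    result, seen = [], set()
    for row in rows:
        valid = row["amount"] is not None and row["amount"] >= 0
        if row["date"] == date and valid and row["order_id"] not in seen:
            seen.add(row["order_id"])
            result.append(row)
    return json.dumps(result)


# Red: serving the raw synthetic rows violates the contract.
print(contract_violations(json.dumps(SYNTHETIC_ORDERS)))
# Green: the handler's cleaned response satisfies it.
print(contract_violations(get_orders("2025-08-14", SYNTHETIC_ORDERS)))  # []
```

Because the synthetic rows are fixed, the test is repeatable: the same edge cases are exercised on every run, regardless of what the real upstream source is doing that day.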

Popular Tools & Practices for Data TDD

dbt tests: Schema & relationship tests in SQL
Great Expectations: Declarative data validation
pytest + pandas: Custom data validation logic
Soda Core: Data quality monitoring
Faker / Synthetic data generators: Mock datasets for TDD
Contract testing (e.g., Pact): Data API validation

See also: Soda vs Great Expectations

Is Data TDD Popular?

Right now, TDD for data is pretty niche, but growing:

Analytics engineering teams using dbt often follow a TDD-like pattern.
Regulated industries (finance, healthcare) use test-first approaches for compliance.
Data mesh adopters write contract tests between data products.
Most machine learning teams don’t follow strict TDD, but use test-first ideas for feature validation.

However, in traditional data engineering, TDD is still rare; most testing happens after pipeline implementation.

In my view, dbt is making all of this simpler, and I expect to see more adoption and better tooling emerge in the future.

Benefits of TDD in Data Work

Clear acceptance criteria for pipelines before coding.
Fewer production surprises from schema drift or bad data.
Confidence in refactoring SQL and transformations.
Faster onboarding — new engineers know what “good” data looks like.

Limitations & Gotchas

Test writing takes time — cost/benefit must be justified.
Real-world data is messy — tests can be brittle if too strict.
Upstream data sources may break tests often (false alarms).
For streaming data, asserting exact expected values is tricky.
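
One common way to soften the brittleness described above is to assert on thresholds and invariants rather than exact values. The sketch below is an illustration under stated assumptions: the check_batch function, the 10% row-count tolerance, and the 2% null-rate ceiling are arbitrary choices for the example, not recommended defaults. For streaming data, the same kind of check could be applied per window.

```python
def check_batch(rows, expected_rows, row_tolerance=0.10, max_null_rate=0.02):
    """Validate a batch with thresholds instead of exact expected values."""
    failures = []
    low = expected_rows * (1 - row_tolerance)
    high = expected_rows * (1 + row_tolerance)
    if not low <= len(rows) <= high:
        failures.append(f"row count {len(rows)} outside [{low:.0f}, {high:.0f}]")
    nulls = sum(1 for r in rows if r.get("amount") is None)
    null_rate = nulls / len(rows) if rows else 0.0
    if null_rate > max_null_rate:
        failures.append(f"null rate {null_rate:.1%} exceeds {max_null_rate:.1%}")
    return failures


# A batch close to the expected size with one null amount passes.
good_batch = [{"amount": 1.0}] * 99 + [{"amount": None}]
print(check_batch(good_batch, expected_rows=105))  # []

# A batch that is far too small and has many nulls fails both checks.
bad_batch = [{"amount": 1.0}] * 45 + [{"amount": None}] * 5
print(check_batch(bad_batch, expected_rows=105))
```

The trade-off is that looser thresholds raise fewer false alarms but also let smaller regressions through, so the tolerances deserve the same review as the pipeline code itself.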

Final Thoughts

TDD for data is possible, but requires adaptation:

Instead of testing code behavior alone, you test data behavior.
You often start with data quality assertions and schema contracts rather than unit tests.
Adoption is strongest in analytics engineering and data mesh product teams, weaker in traditional ETL.

If your organisation values trustworthy, maintainable, and evolvable data products, adopting a TDD mindset, even partially, can significantly reduce defects and improve stakeholder confidence. Demonstrate value with examples to win hearts and minds.

References

dbt Testing Docs: https://docs.getdbt.com/docs/build/tests
Great Expectations: https://greatexpectations.io/
Soda Core: https://docs.soda.io/
Meszaros, G. (2007). xUnit Test Patterns: Refactoring Test Code.
Fowler, M. Test-Driven Development: https://martinfowler.com/bliki/TestDrivenDevelopment.html
