Mastering SEO: Essential Tips for Boosting Your Website

I only recently discovered this pattern for data model design. If you’ve heard of the centipede schema previously, then it’s probably as a warning rather than a recommendation. This rare dimensional modeling pattern represents what happens when the normalization philosophy of snowflake schemas is taken to its logical (or perhaps illogical) extreme.

Named for its many legs (of normalized tables) extending from the central fact table, the centipede schemas failures teaches valuable lessons about the limits of normalization in analytical databases.

The term was coined in The Data Warehouse Toolkit, Ralph Kimball and Margy Ross.

What Is a Centipede Schema?

A centipede schema is an extremely normalized dimensional model. Dimension tables are decomposed into many levels of normalized sub-tables, creating a structure with numerous branching paths radiating from the fact table. If a snowflake schema has two or three levels of normalization, a centipede might have five, six, or more.

Imagine a product dimension that doesn’t just split into product and category, but into:

product
subcategory
category
department
division
product line
brand
manufacturer
manufacturer country
manufacturer regio

Each in its own table, each requiring a separate join to reconstruct the full dimensional context. When diagrammed, these many normalized branches extending from the fact table resemble the legs of a centipede. It’s normalization pursued with such vigor that practicality gets left behind.

Database Theory Run Amok

Some data modelers, trained (or not) in relational database theory and normalization principles, apply those same principles rigidly to data warehousing. If third normal form is good for transactional databases, surely it’s good for analytical databases too, right?

This thinking leads to normalizing everything that can possibly be normalized. Every attribute that depends on something other than the primary key gets its own table. The result is technically correct from a normalization perspective, but practically unusable for analytics.

Centipede schemas evolve accidentally over time. You start with a reasonable snowflake schema with two levels of normalization. Then someone needs to add an attribute, and it makes sense to create another table. Then another. And another. Before long, you’re twelve tables deep wondering how you got here. But it’s still technically correct…so we are all good.

This gradual accretion of normalized tables is how most centipedes are born, not through deliberate design but through the accumulated weight of incremental decisions that each seemed reasonable at the time.

The problems with centipede schemas are numerous and severe enough that they’re almost never deliberately implemented.

Query Complexity

Writing queries against a centipede schema is nightmarish. A simple question like “show me sales by product category and customer region” might require joining ten or twelve tables. Even experienced SQL developers struggle to keep track of all the join paths and conditions.

The cognitive load of understanding how to navigate the schema becomes a significant barrier to self-service analytics. Business users can’t write their own queries, and even technical users spend more time figuring out joins than analyzing data.

Performance Degradation

Every join has a cost, and centipede schemas rack up those costs quickly. Query optimizers can only do so much. When you’re joining through six levels of geography tables just to get a customer’s country, performance suffers dramatically.

In large-scale data warehouses, these performance problems compound. What might be acceptable with thousands of rows becomes a wait to the heat death of the universe with billions. The database spends more time joining tables than actually aggregating and analyzing data.

Maintenance Nightmare

With dozens or hundreds of small normalized tables, maintenance becomes a serious challenge. Understanding the schema requires extensive documentation (and how many data modellers do you truly know thsat are diligent with documentation – let’s be optimistic and say 10%). Making changes requires touching many tables. Testing becomes complex because every query involves many moving parts.

When something breaks, troubleshooting is difficult. Is the problem in the join logic? In one of the many tables? In how the tables relate to each other? The debugging surface area is enormous.

BI Tool Incompatibility

Business intelligence tools are designed for star schemas or modest snowflakes. They’re not equipped to handle deeply nested normalization. Visual query builders become unusable. Report generation slows to a crawl.

Users end up bypassing the BI tool entirely, writing raw SQL or giving up on self-service analytics altogether. The tools you invested in to democratize data become ineffective.

Optimization Impossibility

With so many join paths, query optimization becomes extremely difficult. Indexing strategies that help one query pattern hurt another. Materialized views to improve performance multiply rapidly because of all the possible join combinations.

Database administrators spend disproportionate time tuning a schema that fundamentally resists optimization because its structure works against how analytical queries actually work.

Despite these obvious practical problems, it’s worth understanding why someone might think centipede schemas are a good idea.

It Has Some Good Foundational Ideas

From a pure database theory perspective, centipede schemas achieve normalization nirvana. There’s no redundancy. Every fact appears exactly once. Updates are clean and affect only the relevant row in the relevant table.
The strict normalization enforces referential integrity thoroughly. You can’t have inconsistent data because there’s only one place for each piece of data to live. Foreign key constraints ensure everything links correctly.
In theory, the storage savings from extreme normalization are maximized. You’re not repeating “United Kingdom” a million times, it appears once in the country table. An oblique form of run length encoding we see in Parquet made real.
If your source systems are themselves highly normalized, a centipede schema might mirror that structure, potentially simplifying ETL logic.

The problem is that these theoretical benefits don’t translate to real-world analytical value, and the practical costs are crushing.

But…There May Be A Solution

If you are using Snowflake, then there is a feature that may address all these issues. That feature is Join Elimination

(Redundant/Unnecessary Join Elimination through the RELY constraint property).

Lets look at an example:

-- Store table
CREATE TABLE stores (
    store_id INT PRIMARY KEY,
    store_name VARCHAR(100),
    city_id INT
);

-- City table
CREATE TABLE cities (
    city_id INT PRIMARY KEY,
    city_name VARCHAR(100),
    state_id INT
);

-- State table
CREATE TABLE states (
    state_id INT PRIMARY KEY,
    state_name VARCHAR(100),
    country_id INT
);

-- Country table
CREATE TABLE countries (
    country_id INT PRIMARY KEY,
    country_name VARCHAR(100)
);

-- Store table
CREATE TABLE stores (
    store_id INT PRIMARY KEY,
    store_name VARCHAR(100),
    city_id INT
);

-- City table
CREATE TABLE cities (
    city_id INT PRIMARY KEY,
    city_name VARCHAR(100),
    state_id INT
);

-- State table
CREATE TABLE states (
    state_id INT PRIMARY KEY,
    state_name VARCHAR(100),
    country_id INT
);

-- Country table
CREATE TABLE countries (
    country_id INT PRIMARY KEY,
    country_name VARCHAR(100)
);

Without Join Elimination

If you write a query that only needs store and city information:

SELECT 
    s.store_name,
    c.city_name
FROM stores s
JOIN cities c ON s.city_id = c.city_id
JOIN states st ON c.state_id = st.state_id  -- Not needed!
JOIN countries co ON st.country_id = co.country_id  -- Not needed!
WHERE s.store_id = 123;

SELECT 
    s.store_name,
    c.city_name
FROM stores s
JOIN cities c ON s.city_id = c.city_id
JOIN states st ON c.state_id = st.state_id  -- Not needed!
JOIN countries co ON st.country_id = co.country_id  -- Not needed!
WHERE s.store_id = 123;

A naive optimizer would execute all four joins, even though the query doesn’t use any columns from states or countries.

With Join Elimination

A sophisticated optimizer recognizes that states and countries aren’t needed and eliminates them:

-- Optimizer transforms to:
SELECT 
    s.store_name,
    c.city_name
FROM stores s
JOIN cities c ON s.city_id = c.city_id
WHERE s.store_id = 123;

-- Optimizer transforms to:
SELECT 
    s.store_name,
    c.city_name
FROM stores s
JOIN cities c ON s.city_id = c.city_id
WHERE s.store_id = 123;

It works only when the primary and foreign keys are defined. This can be an issue in Snowflake as constraints are not enforced and many developers don’t encode them, see this post that talks to that point:

Snowflake’s Indexing Methods

The Lesson

The centipede schema teaches a crucial lesson: theoretical purity in data modeling doesn’t automatically translate to practical value. The principles that govern transactional database design don’t necessarily apply to analytical databases.

Normalization is a tool, not a goal. It should be applied judiciously where it provides genuine benefits, not pursued as an end in itself. In analytical databases, denormalization is often the right answer because it aligns with how the data will actually be used.

No respected data warehouse methodology recommends centipede schemas. Kimball’s dimensional modeling explicitly favors denormalization. Inmon’s Corporate Information Factory uses normalization in the data warehouse layer but denormalizes in data marts. Data vault, despite its normalization, structures itself differently than traditional dimensions.

The consensus is clear: extreme normalization in analytical databases creates more problems than it solves. The centipede schema stands as a monument to what happens when good principles are applied in the wrong context.

If you encounter the term “centipede schema” in practice, it’s almost certainly being used as a warning label. “Don’t let this turn into a centipede schema” is data modeling code for “we’re normalizing too much, and we need to stop before it becomes unmanageable.” The centipede schema isn’t so much a design pattern as it is an anti-pattern—an example of what not to do. Its value lies not in implementation but in illustration: this is where excessive normalization leads, and these are the reasons to avoid going there. Sometimes the best designs aren’t the most elegant or theoretically pure. They’re the ones that work—the ones people can actually use to answer business questions. That’s a lesson the centipede schema teaches well, even if accidentally.

Centipede Schema

What Is a Centipede Schema?

Database Theory Run Amok

Query Complexity

Performance Degradation

Maintenance Nightmare

BI Tool Incompatibility

Optimization Impossibility

It Has Some Good Foundational Ideas

But…There May Be A Solution

Without Join Elimination

With Join Elimination

The Lesson

Log-Structured Merge Trees

FinOps Maturity Curve for Data

Normal Forms

Anchor Modeling

Model Context Protocol

What Is a Centipede Schema?

Database Theory Run Amok

Query Complexity

Performance Degradation

Maintenance Nightmare

BI Tool Incompatibility

Optimization Impossibility

It Has Some Good Foundational Ideas

But…There May Be A Solution

Without Join Elimination

With Join Elimination

The Lesson

Share this:

Related Posts

Trending now

Discover more from Data Lingua. Where Data Engineering Meets Agentic Business Strategy