The Catalog Wars

Outside the data world, “catalog” sounds vaguely 19th-century. Inside the data world, catalogs are the latest battleground in the open-table-format wars, and the choice of catalog increasingly determines what your data architecture can do. The shape of the fight: Databricks vs Snowflake vs the open-source camp, with Iceberg sitting in the middle as the format everyone wants to claim. It’s worth understanding, because the choice has real consequences and most teams are making it without realising they are.

What a catalog actually is

In Iceberg-world, a catalog is the service that maps each table name to a pointer to that table’s current metadata file. When a query engine wants to read finance.revenue, it asks the catalog “where’s the current metadata for finance.revenue,” the catalog responds with an object-store location (an S3 path, say), and the engine reads that metadata file and follows it through the manifests to find the data files.
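To make that job concrete, here is a minimal in-memory sketch of the name-to-pointer mapping. Everything in it is hypothetical: ToyCatalog, CommitConflict, and the example-bucket paths are illustration names, not any real catalog’s API. The one detail it adds beyond the lookup is that a commit only succeeds if the pointer is still where the writer last saw it, which is roughly how catalogs keep concurrent writers from overwriting each other.

```python
from dataclasses import dataclass, field
from typing import Optional


class CommitConflict(Exception):
    """Raised when another writer moved the pointer first."""


@dataclass
class ToyCatalog:
    # Table identifier -> location of the table's current metadata file.
    pointers: dict[str, str] = field(default_factory=dict)

    def load_table(self, name: str) -> str:
        """Answer the engine's question: where is the current metadata?"""
        return self.pointers[name]

    def commit(self, name: str, expected: Optional[str], new: str) -> None:
        """Swap the pointer only if it still equals what the writer read."""
        current = self.pointers.get(name)
        if current != expected:
            raise CommitConflict(f"{name}: expected {expected}, found {current}")
        self.pointers[name] = new


catalog = ToyCatalog()
v1 = "s3://example-bucket/finance/revenue/metadata/v1.metadata.json"
v2 = "s3://example-bucket/finance/revenue/metadata/v2.metadata.json"
v3 = "s3://example-bucket/finance/revenue/metadata/v3.metadata.json"

catalog.commit("finance.revenue", expected=None, new=v1)  # table created
catalog.commit("finance.revenue", expected=v1, new=v2)    # a writer adds a snapshot

try:
    # A second writer that still believes v1 is current is rejected and must retry.
    catalog.commit("finance.revenue", expected=v1, new=v3)
    stale_writer_rejected = False
except CommitConflict:
    stale_writer_rejected = True

current_location = catalog.load_table("finance.revenue")
```

A real catalog does this swap atomically in a database or service rather than in a Python dict, but the contract is the same: one name, one current pointer, and a guarded way to move it.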

That’s the literal job. What makes catalog choice strategic is everything that gets bundled around that job – access control, audit, governance, lineage, semantic enrichment. The catalog becomes the place where you implement table-level policy. Whoever owns the catalog owns the governance layer of your data stack.

The contenders

Unity Catalog (Databricks). Originally Databricks-only, opened up in 2024. Iceberg support has been expanding. The strong story is governance – Unity has the most mature column-level access control, row-level filters, and audit story of any of the open catalogs. The weak story is vendor association – even as “open,” Unity is shepherded by Databricks, and the most polished integration is with Databricks. Open in the sense that you can run it, less open in the sense of community governance.

Polaris (Snowflake). Snowflake’s answer, donated to the Apache Software Foundation in 2024 and now Apache Polaris. Iceberg-native by design. Strong story: it’s Apache-governed (not Snowflake-controlled), and Snowflake’s ecosystem reach gives it real adoption momentum. Weak story: governance features are catching up to Unity rather than leading.

Apache Gravitino. Open-source from the start, originally from Datastrato. Multi-format (not just Iceberg), multi-engine, federated metadata model. Strong story: cleanest open-source story, unifies catalogs from different sources. Weak story: less mature than Unity or Polaris, smaller commercial backing, and the federation model is theoretically clean but operationally complex.

Project Nessie. Originally Dremio’s. Git-like semantics for table state – branches, commits, rollbacks. Conceptually delightful for analytics engineers who want CI/CD for data. Adoption smaller than the others, but real. Worth knowing about.
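For a feel of what Git-like table state means, here is a toy sketch, not Nessie’s actual API or data model. ToyVersionedCatalog and its method names are invented for illustration. Each commit is a snapshot of the whole name-to-metadata mapping, and a branch is just a label pointing at one commit.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionedCatalog:
    # Each commit is a snapshot of table name -> metadata location.
    commits: list[dict[str, str]] = field(default_factory=lambda: [{}])
    # Each branch name points at an index into `commits`.
    branches: dict[str, int] = field(default_factory=lambda: {"main": 0})

    def create_branch(self, name: str, source: str = "main") -> None:
        """Start a new branch at the commit `source` currently points to."""
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, table: str, metadata_location: str) -> None:
        """Record a new snapshot on `branch` without touching other branches."""
        state = dict(self.commits[self.branches[branch]])
        state[table] = metadata_location
        self.commits.append(state)
        self.branches[branch] = len(self.commits) - 1

    def read(self, branch: str, table: str) -> str:
        """Resolve a table as seen from a given branch."""
        return self.commits[self.branches[branch]][table]

    def assign(self, branch: str, source: str) -> None:
        """Point `branch` at the commit `source` is on (a crude publish step)."""
        self.branches[branch] = self.branches[source]


cat = ToyVersionedCatalog()
cat.commit("main", "finance.revenue", "s3://example-bucket/revenue/v1.metadata.json")

cat.create_branch("etl")
cat.commit("etl", "finance.revenue", "s3://example-bucket/revenue/v2.metadata.json")

# Readers on main are isolated from the in-progress work on etl.
main_before_publish = cat.read("main", "finance.revenue")

# Once the branch has passed its checks, publish by repointing main.
cat.assign("main", "etl")
main_after_publish = cat.read("main", "finance.revenue")
```

The CI/CD appeal is in those last lines: the pipeline writes to a branch, validation runs against the branch, and production readers only see the change when main is moved. A rollback is the same repointing move in the other direction.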

AWS Glue Data Catalog. The default if you live in AWS. Iceberg-aware. Solid. Less feature-rich on governance than the dedicated catalogs above, but more deeply integrated with the AWS analytics services. A real choice for AWS-native shops.

What the wars are about

Three things, mostly.

First, who owns the governance layer. The catalog is where access control lives. If your catalog is Databricks-flavoured, your security model is Databricks-shaped. Same for Snowflake, same for AWS. The catalog is the lock-in point even when the underlying storage format is open.

Second, whose query engine reads tables most efficiently. Snowflake reading tables through a Snowflake-managed catalog is going to be a more polished experience than Trino reading the same tables through the same catalog, at least at first. The vendors are competing on integration depth even when claiming neutrality.

Third, the meta-platform play. Whoever owns the catalog owns the metadata. Metadata is what the AI-on-data layer is going to need (more on this in later posts). The catalog is the strategic asset in any “agentic data” pitch, which is why every cloud vendor is investing in it.

The Iceberg REST spec

The thing that makes the wars less existential than they could be is the Iceberg REST catalog specification. It defines a common HTTP API for talking to a catalog – listing the tables that exist, loading a table’s metadata, committing changes. Any catalog that implements the spec can be talked to by any engine that speaks it.
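As a rough sketch of what that common API looks like on the wire, the helpers below build the method and path for those three operations. The route shapes reflect my reading of the REST catalog’s OpenAPI definition and are worth checking against the spec version your catalog implements. The helper names and the “prod” prefix are hypothetical, and the sketch assumes a prefix is always present, which the spec does not require.

```python
from urllib.parse import quote

# Multi-level namespaces are joined with the 0x1F unit separator in URL paths
# (the default in the spec as I understand it; treat this as an assumption).
NAMESPACE_SEPARATOR = "\x1f"


def _namespace(levels: list[str]) -> str:
    """Encode a namespace such as ["finance", "emea"] for use in a path."""
    return quote(NAMESPACE_SEPARATOR.join(levels), safe="")


def _table_path(prefix: str, namespace: list[str], table: str) -> str:
    return f"/v1/{prefix}/namespaces/{_namespace(namespace)}/tables/{quote(table, safe='')}"


def list_tables(prefix: str, namespace: list[str]) -> tuple[str, str]:
    """What tables exist in this namespace?"""
    return ("GET", f"/v1/{prefix}/namespaces/{_namespace(namespace)}/tables")


def load_table(prefix: str, namespace: list[str], table: str) -> tuple[str, str]:
    """Fetch the table's current metadata and its metadata location."""
    return ("GET", _table_path(prefix, namespace, table))


def commit_table(prefix: str, namespace: list[str], table: str) -> tuple[str, str]:
    """Commit changes; the request body carries requirements and updates."""
    return ("POST", _table_path(prefix, namespace, table))


listing = list_tables("prod", ["finance"])
loading = load_table("prod", ["finance"], "revenue")
committing = commit_table("prod", ["finance", "emea"], "revenue")
```

Loading and committing share a path and differ only in the HTTP method. An engine that can issue these requests does not need to know whose catalog is answering them.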

This matters because it means you don’t have to pick one catalog forever. You can migrate. You can mix catalogs for different domains. You can swap engines without swapping catalogs. The lock-in is real but not absolute, provided you stick to the REST API and don’t lean on catalog-specific extensions.
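One way to keep yourself honest about that boundary is to treat the catalog connection as plain configuration and flag anything vendor-specific. The sketch below is an assumption-heavy illustration: the property names mirror the kind of settings REST catalog clients such as PyIceberg accept (type, uri, warehouse), PORTABLE_KEYS is a hypothetical team policy rather than anything the spec defines, and the URLs and the vendor-x key are made up.

```python
from typing import Optional

# Keys this hypothetical team treats as portable across REST catalogs.
PORTABLE_KEYS = {"type", "uri", "warehouse", "credential", "token"}


def rest_catalog(uri: str, warehouse: str,
                 extra: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Build connection properties for a catalog reached over the REST API."""
    properties = {"type": "rest", "uri": uri, "warehouse": warehouse}
    properties.update(extra or {})
    return properties


def non_portable_keys(properties: dict[str, str]) -> list[str]:
    """List settings that would not carry over to a different catalog."""
    return sorted(set(properties) - PORTABLE_KEYS)


current = rest_catalog("https://catalog-a.example.com/api/catalog", "analytics")
candidate = rest_catalog("https://catalog-b.example.com/iceberg", "analytics")

# If the boundary has held, swapping catalogs changes the endpoint and nothing else.
changed = sorted(key for key in current if current[key] != candidate[key])

# A configuration that leans on a (made-up) vendor extension gets flagged.
leaky = rest_catalog(
    "https://catalog-a.example.com/api/catalog",
    "analytics",
    extra={"vendor-x.row-filter-mode": "strict"},
)
flagged = non_portable_keys(leaky)
```

In practice credentials and data migration make a swap more work than a one-line change, but a check like this shows early how much of your setup depends on one vendor’s extras.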

The honest recommendation

If you’re primarily on Databricks, Unity. The integration depth is real and the governance is mature.

If you’re primarily on Snowflake and serious about Iceberg, Polaris. Apache governance, native Snowflake integration, momentum.

If you’re primarily on AWS, Glue. It’s already there, it’s good enough, and the integration with the AWS analytics services is the path of least resistance.

If you’re explicitly multi-cloud or multi-engine and you want to avoid lock-in, Gravitino. Accept that it’s less mature and you’ll be doing more integration work.

If you’re a smaller team that wants Git-like data ops semantics, Nessie. Niche but real.

The catalog wars will keep going. The Iceberg REST spec means they don’t have to be existential. Pick the catalog that fits your current architecture, design with the REST API as the boundary, and keep the option of swapping later. You probably will, eventually.

Next week – PII discovery, the security problem that nobody’s really solved despite a decade of trying.