DuckLake

There are some great names in software engineering and this is most definitely one of them. The lakehouse architecture promised to combine the best of data warehouses and data lakes. In practice, it delivered quite a lot of complexity. We found ourselves managing snapshot files, dealing with compaction strategies, and debugging obscure metadata oddities (I talk about how new table formats are helping in this space below).

DuckLake, announced by the DuckDB team in 2025, takes a fundamentally different approach. Instead of inventing new metadata formats and catalog protocols, it uses something that already works: a SQL database. How retro. Metadata lives in PostgreSQL, MySQL, SQLite, or DuckDB itself. Data lives in standard Parquet files. That’s the entire architecture. The result is lakehouse capabilities – ACID transactions, time travel, schema evolution, concurrent writes – without the operational burden of managing custom metadata systems.

This simplicity isn’t merely aesthetic. Organizations already know how to operate SQL databases. They have expertise, tooling, monitoring, and backup strategies learned over the past few decades. Treating metadata as database tables rather than custom file formats means leveraging decades of database engineering rather than reimplementing it poorly. DuckLake demonstrates that sophisticated data lake features don’t require sophisticated metadata systems.

The Core Insight – SQL Databases Are Good at Managing State

Obviously…

Modern lakehouse formats emerged because managing state through files is hard. When multiple writers need to update data lakes concurrently, coordinating through file systems becomes complex quickly. You need optimistic concurrency control, conflict resolution, and atomic operations, all implemented through careful file manipulation and locking protocols that are fragile and difficult to debug.

DuckLake’s insight is that SQL databases already solve these problems elegantly. Transactions, concurrency control, isolation levels, and atomic commits are fundamental database capabilities. Instead of reimplementing these features through custom metadata files, use an actual database for metadata. The catalog database provides all the coordination, consistency, and concurrency guarantees that lakehouse formats require.

The data itself remains in Parquet files for all the good reasons: columnar storage is efficient for analytics, Parquet is an open standard with broad tool support, and object storage is cheap and scalable. DuckLake doesn’t change the data layer, but it does change how metadata is managed, trading custom file formats for standard SQL tables.

This architectural choice reaps benefits because metadata queries are just SQL queries. Understanding what tables exist, what schemas they have, or what snapshots are available doesn’t require learning new APIs or tools. Anyone who knows SQL can inspect the catalog. Backup and recovery strategies for metadata use standard database practices. High availability, replication, and disaster recovery all leverage existing database capabilities.
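As a hedged sketch of what that looks like in practice: when a DuckLake is attached as my_lake, recent DuckDB releases expose the metadata catalog as a separately attached database (named along the lines of __ducklake_metadata_my_lake), and its tables follow DuckLake’s published catalog schema. The exact names below are assumptions that may vary by version, but the point stands – inspection is just SQL:

-- List snapshots straight out of the catalog database
SELECT snapshot_id, snapshot_time
FROM __ducklake_metadata_my_lake.ducklake_snapshot
ORDER BY snapshot_id;

-- See which Parquet files currently back each table
SELECT t.table_name, f.path
FROM __ducklake_metadata_my_lake.ducklake_data_file f
JOIN __ducklake_metadata_my_lake.ducklake_table t
    ON f.table_id = t.table_id;

No special API, no manifest parsing – the same queries work from psql if the catalog lives in PostgreSQL.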

Getting Started – It’s Remarkably Simple

Installing DuckLake requires only the DuckDB extension. No separate services to deploy, no cluster configuration, no complex setup.

INSTALL ducklake;
LOAD ducklake;

-- Create a DuckLake using local DuckDB as catalog
ATTACH 'ducklake:my_data.ducklake' AS my_lake;
USE my_lake;

CREATE TABLE users (
    id INTEGER,
    name VARCHAR,
    email VARCHAR
);

INSERT INTO users VALUES 
    (1, 'Alice', 'alice@example.com'),
    (2, 'Bob', 'bob@example.com');

SELECT * FROM users;

That’s a complete working example. The ATTACH statement creates the DuckLake catalog database file and sets up the necessary schema. Creating tables and inserting data works exactly like any database. Behind the scenes, DuckLake writes metadata to the catalog database and data to Parquet files, but you don’t have to think about this separation.

For production scenarios with multiple users, swap the local DuckDB catalog for PostgreSQL.

INSTALL postgres;
LOAD postgres;

-- Use PostgreSQL as catalog for multi-user access
ATTACH 'ducklake:postgres://user:pass@localhost/catalog' AS prod_lake (
    DATA_PATH 's3://my-bucket/lake-data/'
);

USE prod_lake;

Multiple DuckDB clients can now connect concurrently to the same catalog database, reading and writing the same DuckLake tables. PostgreSQL’s transaction system coordinates access, preventing conflicts and maintaining consistency. This “multiplayer DuckDB” capability addresses one of DuckDB’s core limitations, single-writer file locking, by moving coordination to a database designed for concurrent access.
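To illustrate, here is a sketch of a second, independent client – possibly on another machine – joining the same lake. It assumes the same connection string and bucket as above, and a users table like the one created earlier:

-- In a second DuckDB process
INSTALL ducklake; LOAD ducklake;
INSTALL postgres; LOAD postgres;

ATTACH 'ducklake:postgres://user:pass@localhost/catalog' AS prod_lake (
    DATA_PATH 's3://my-bucket/lake-data/'
);
USE prod_lake;

-- Both clients can write; PostgreSQL's transactions arbitrate
INSERT INTO users VALUES (3, 'Carol', 'carol@example.com');

Each client writes its own Parquet files to the shared DATA_PATH; only the metadata commit goes through the catalog, so the coordination point stays small and cheap.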

Time Travel – History Without Complexity

Time travel in DuckLake comes naturally from how snapshots work. Every operation that modifies data creates a new snapshot in the catalog. The data files themselves remain immutable; snapshots just point to different sets of files. Querying historical data means querying an older snapshot.

-- Create and modify data
CREATE TABLE products (id INTEGER, name VARCHAR, price DECIMAL);

INSERT INTO products VALUES (1, 'Widget', 19.99);
-- Snapshot 1

UPDATE products SET price = 24.99 WHERE id = 1;
-- Snapshot 2

INSERT INTO products VALUES (2, 'Gadget', 29.99);
-- Snapshot 3

-- Query current state
SELECT * FROM products;

-- Query as it was at snapshot 1
SELECT * FROM products AT (VERSION => 1);

-- See what changed between snapshots
SELECT * FROM table_changes('products', 1, 3);

The AT clause specifies which snapshot to query. The table_changes function shows insertions, deletions, and updates between snapshots. This isn’t just for auditing or debugging. Time travel enables reproducing analyses with the exact data that was current when the analysis first ran, critical for scientific reproducibility and regulatory compliance.

The implementation is straightforward because snapshots are rows in catalog tables. Querying historical data is joining to the appropriate snapshot metadata and reading the Parquet files that were current at that snapshot. No complex snapshot file management or garbage collection of intermediate states. The catalog database tracks everything through normal tables and foreign keys.

What You Give Up and What You Gain

DuckLake makes deliberate trade-offs that clarify its intended use cases. Like other lakehouse formats, DuckLake does not support indexes, primary keys, foreign keys, or unique and check constraints. These database features require infrastructure that doesn’t exist when data lives in Parquet files. DuckLake is for analytical workloads where full table scans are common and transactional guarantees matter more than point lookups.

The dependency on a catalog database means you’re running a database server. For single-user scenarios with DuckDB or SQLite catalogs, this is trivial. For production scenarios with PostgreSQL, it means PostgreSQL administration. The catalog database becomes critical infrastructure that must be available for DuckLake access. This is a trade-off: operational simplicity in metadata management at the cost of depending on database availability. If you already have dedicated infrastructure and DBAs, this is less of a concern, but it’s something to consider very carefully.

What you gain is substantial, however.

  • Multi-table transactions work naturally because the catalog database provides transaction boundaries.
  • Atomic operations across multiple tables don’t require custom coordination protocols (these are some of the most challenging runtime tasks a data engineer faces).
  • Concurrent writers don’t step on each other because the database handles locking.
  • Schema evolution is transactional, preventing partially-applied changes.
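The multi-table case is worth seeing concretely. This sketch assumes hypothetical orders and inventory tables; the point is that it’s just standard SQL, and either both changes commit or neither does:

BEGIN TRANSACTION;

INSERT INTO orders VALUES (101, 'Widget', 2);
UPDATE inventory SET stock = stock - 2 WHERE product = 'Widget';

COMMIT;
-- Both changes become visible atomically, as a single new snapshot

A reader querying mid-transaction never sees the order without the matching stock decrement – the kind of guarantee that takes real engineering effort in purely file-based formats.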

These capabilities, which require complex engineering in file-based systems, come for free from the database.

The small files problem that plagues data lakes is mitigated through data inlining: small updates can be buffered in the catalog database itself before being flushed to Parquet files. This happens transparently, reducing the proliferation of tiny files that degrades performance in file-based systems. When small files do accumulate, compaction is simpler because the catalog database coordinates the operation cleanly.

Where Does DuckLake Fit?

DuckLake makes sense for organizations wanting lakehouse capabilities without lakehouse complexity. If you need ACID transactions on data lake storage but don’t want to operate complex metadata systems, DuckLake provides a simpler path. If you’re already using DuckDB for analytics and need multi-user capabilities, DuckLake extends DuckDB naturally.

The sweet spot is analytical workloads that benefit from both the economics of object storage and the reliability of transactional databases. Data science teams working collaboratively on shared datasets benefit from concurrent access without stepping on each other. Analytics pipelines benefit from atomic multi-table operations. Organizations with SQL expertise but limited experience operating specialized data infrastructure benefit from leveraging familiar database skills.

DuckLake is less appropriate when you need the absolute maximum performance of specialized systems. PostgreSQL or MySQL as a catalog introduces latency that purely file-based systems avoid. If your access patterns are purely append-only with no updates or deletes, simpler approaches without lakehouse complexity might suffice. If you need features specific to Delta Lake or Iceberg like their specific ecosystem integrations, those remain better choices.

The competitive positioning is clear. DuckLake competes with Delta Lake plus Unity Catalog or Iceberg plus Polaris Catalog—the full lakehouse stack including both the table format and catalog system. For organizations willing to adopt those stacks, they provide mature, production-tested solutions with extensive tooling. For organizations finding that complexity excessive, DuckLake offers a radically simpler alternative.

Broader Implications

DuckLake represents a design philosophy that questions whether data systems must be as complex as they’ve become (might be talking myself out of a job here…). The data infrastructure industry has trended toward specialized formats, protocols, and systems for every use case. Each adds operational burden, requires specialized expertise, and introduces potential failure modes.

The DuckDB team’s approach with DuckLake suggests an alternative: leverage existing, well-understood systems wherever possible. SQL databases are mature, well-understood, and operated by millions of organizations. Using them for metadata management recognizes that the problem metadata management solves is fundamentally a database problem.

This philosophy extends beyond DuckLake. The entire DuckDB project embodies simplicity, from its zero-configuration embedded architecture to its ability to query files directly without loading them. DuckLake continues this philosophy into the domain of lakehouses, demonstrating that sophisticated capabilities don’t require sophisticated infrastructure when you make the right architectural choices.
