Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124

Data Lakes promised everything. Store all your data in one place, in any format, ready for any workload. The reality was horrific. Data Lakes became data swamps, filled with inconsistent data, failed jobs leaving partial writes, no way to roll back mistakes, and queries that returned different results depending on when you ran them. Schema-on-read? Good luck with that.
Delta Lake emerged not as a revolutionary new idea, but as a pragmatic solution to the very real problems that plagued production data lakes. Understanding why Delta Lake matters requires understanding what was broken in the first place.
A Data Lake organises data in the same way my mum organises applications on her Windows desktop (sorry mum).
The Data Lake concept was elegant. Just dump raw data into cheap object storage like S3, store it in open formats like Parquet, and query it with tools like Spark, Presto, or Hive. No expensive proprietary databases, no schema-on-write constraints, just raw flexibility. But this flexibility came with serious costs. When a Spark job writing Parquet files to S3 failed halfway through, you got partial data. Some files were written, some weren’t. Which ones? There was no atomicity, no way to know which files were part of a consistent dataset. We would discover corruption days later when queries started returning nonsensical results. Oh for a transaction.
Time travel didn’t exist. If you made a mistake and overwrote critical data, it was gone. Forever. If you wanted to reproduce yesterday’s analysis with yesterday’s data, you’d better have kept backups. Most teams didn’t have comprehensive backup strategies because backing up petabytes of Data Lake storage was hilariously expensive.
Schema evolution was a nightmare. Adding a column meant either rewriting all your data or dealing with schema mismatches across different tools. Remember, this was all before Iceberg solved a lot of these challenges. Changing data types was even worse. Teams would maintain elaborate versioning schemes and write defensive code to handle multiple schema versions simultaneously. It took weeks to plan changes.
Consistency problems plagued multi-user environments. One reader would query while another writer was updating files. The reader might see a mix of old and new data, or worse, fail entirely because files disappeared mid-query. Coordinating readers and writers required careful job scheduling and often led to underutilized resources as teams avoided conflicts.
Performance degraded over time as small file problems accumulated. Thousands of tiny Parquet files meant slow queries and expensive listing operations. Compaction was manual, error-prone, and disruptive to ongoing operations. Run length encoding is optimized for large files, not thousands of tiny ones. Some teams ran weekend-long compaction jobs that blocked all other access to the data.
Simple data manipulation operations were incredibly expensive. Need to update a single row? You had to rewrite the entire partition. Delete rows based on GDPR requests? Rewrite everything. What should have been simple database operations became expensive batch jobs that could take hours or days.
These weren’t edge cases. They were daily pain points that engineering teams dealt with through elaborate workarounds, complex orchestration, and careful timing of jobs. Every team had horror stories about lost data, inconsistent results, and late nights when something went wrong. Which it often did.
Delta Lake, open-sourced by Databricks in 2019, addressed some of these problems systematically. It’s not a new storage format or a proprietary system. It’s a transactional layer on top of Parquet files that brings database-like guarantees to data lakes. The genius of Delta Lake is that it doesn’t fight against the data lake architecture but instead augments it with the reliability features that were missing.
Delta Lake’s transaction log is its foundation. Every operation, whether writing, updating, or deleting data, is recorded as a transaction in a JSON log stored alongside the data files. Operations either complete fully or not at all. No more partial writes, no more corrupt datasets from failed jobs.
The transaction log itself is simple. It’s a sequence of JSON files in a special directory that record every change to the table. Each JSON file describes what changed in that transaction: which Parquet files were added, which were removed, what the schema was, what statistics were captured. Readers consult this log to determine which files constitute the current consistent view of the table.
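To make the replay idea concrete, here’s a minimal sketch in plain Python. It is a toy model, not the Delta Lake API: the `COMMITS` list and the file names are made up, and real commit files carry more than the `add` and `remove` actions shown here (schema, protocol, statistics and so on).

```python
import json

# Hypothetical commits: each string stands in for one newline-delimited
# JSON commit file. Real commits contain more action types than these.
COMMITS = [
    '{"add": {"path": "part-000.parquet"}}\n{"add": {"path": "part-001.parquet"}}',
    '{"add": {"path": "part-002.parquet"}}',
    '{"remove": {"path": "part-000.parquet"}}\n{"add": {"path": "part-003.parquet"}}',
]


def active_files(commits):
    """Replay commits in order and return the set of live data files."""
    live = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


print(sorted(active_files(COMMITS)))
```

A reader never lists the data directory. It only trusts whatever the replayed log says is live, which is why stray files from a crashed job are invisible.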
This simple mechanism solves the atomicity problem completely. When a write job crashes, the transaction never completes, the log never gets a new entry, and readers never see partial data. It just works. The elegance is that it needs no distributed locking and no complex consensus protocol. The transaction log uses optimistic concurrency control with atomic file operations, relying on the put-if-absent or atomic rename guarantees of the underlying store. Where a store has lacked such a guarantee, as S3 historically did for concurrent writers from multiple clusters, a small external commit coordinator fills the gap.
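The "atomic file operation" part can be sketched with nothing more than an exclusive file create. This is an illustration on a local filesystem, standing in for whatever put-if-absent primitive the real store offers; `try_commit` is a hypothetical helper, not a Delta Lake function.

```python
import json
import os
import tempfile


def try_commit(log_dir, version, actions):
    """Attempt to publish `actions` as commit number `version`.

    Mode "x" creates the file only if it does not already exist, so two
    writers racing for the same version cannot both succeed.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:
            f.write("\n".join(json.dumps(a) for a in actions))
        return True
    except FileExistsError:
        return False


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": {"path": "a.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "b.parquet"}}])
print(first, second)  # True False
```

The losing writer has not corrupted anything. Its data files are simply never referenced, and it can re-read the log and try again at the next version.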
Because every transaction is logged with a timestamp and version number, Delta Lake naturally supports time travel. Want to query the table as it existed yesterday? Just read the transaction log up to yesterday and use those Parquet files. The data files themselves don’t move or change; the transaction log simply tells you which ones to consider for any given point in time.
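As a rough sketch of that lookup, assuming a simplified log where each entry is a hypothetical `(version, timestamp, added, removed)` tuple rather than real commit files:

```python
# Hypothetical log: (version, commit timestamp, files added, files removed).
LOG = [
    (0, 1000, ["a.parquet", "b.parquet"], []),
    (1, 2000, ["c.parquet"], []),
    (2, 3000, ["d.parquet"], ["a.parquet"]),
]


def files_as_of_version(log, version):
    """Replay commits 0..version and return the live files."""
    live = set()
    for v, _ts, added, removed in log:
        if v > version:
            break
        live.update(added)
        live.difference_update(removed)
    return live


def files_as_of_timestamp(log, ts):
    """Pick the latest commit at or before `ts`, then replay up to it."""
    eligible = [v for v, commit_ts, _a, _r in log if commit_ts <= ts]
    if not eligible:
        raise ValueError("table did not exist at that time")
    return files_as_of_version(log, max(eligible))


print(sorted(files_as_of_version(LOG, 1)))
print(sorted(files_as_of_timestamp(LOG, 3500)))
```

Note that `a.parquet` is still part of version 1 even though version 2 removed it. Time travel only works for as long as the old files are kept on storage.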
This capability transforms how teams work with data. Reproducibility becomes trivial. You can rerun last week’s analysis with last week’s data, even though the table has changed substantially since then. This is critical for debugging, auditing, and complying with regulations that require demonstrating what data you had at specific times.
Rollback becomes instantaneous. Made a mistake in your ETL job? Accidentally deleted critical rows? Restore to a previous version in seconds. There’s no need to restore from backups or replay logs. The old data is still there; you just need to point the table metadata at an earlier version. This turns data disasters into minor inconveniences.
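One way to picture a restore is as a new commit that undoes the difference between the current file set and the target one. The sketch below assumes exactly that model and assumes the old files have not yet been cleaned up; `restore_actions` and the file names are hypothetical.

```python
def restore_actions(current_files, target_files):
    """Actions for a new commit that makes the table match `target_files`.

    Nothing is copied: files still on storage are simply re-added to the log.
    """
    actions = [{"remove": {"path": p}} for p in sorted(current_files - target_files)]
    actions += [{"add": {"path": p}} for p in sorted(target_files - current_files)]
    return actions


# Hypothetical state: version 5 is bad, version 3 was good.
v3 = {"a.parquet", "b.parquet"}
v5 = {"b.parquet", "c.parquet"}
commit = restore_actions(current_files=v5, target_files=v3)
print(commit)
```

Because the restore is itself just another logged commit in this model, you can undo the undo as well.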
The implications for experimentation are profound. Data scientists can try risky transformations knowing they can instantly revert if something goes wrong. Teams can run parallel experiments on the same table at different versions without creating copies. A/B testing different data processing approaches becomes straightforward.
Delta Lake tracks schema in the transaction log and supports both adding columns and evolving data types safely. When you add a new column, Delta Lake handles missing values in old data transparently, automatically providing null values for records written before the column existed. This means schema evolution doesn’t require rewriting your entire dataset.
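The null-filling behaviour is easy to sketch. This is a toy illustration with dicts standing in for stored records, and the column names are invented:

```python
def read_with_schema(record, schema):
    """Project a stored record onto the current schema, filling gaps with None."""
    return {column: record.get(column) for column in schema}


# Hypothetical table: "country" was added after the first record was written.
current_schema = ["id", "name", "country"]
old_record = {"id": 1, "name": "Ada"}
new_record = {"id": 2, "name": "Grace", "country": "US"}

rows = [read_with_schema(r, current_schema) for r in (old_record, new_record)]
print(rows)
```

The old record is never touched on disk. The missing column is resolved at read time.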
Schema enforcement works in the opposite direction, preventing bad data from entering your lake. Try to write data that doesn’t match the schema, and Delta Lake rejects it before any files are written. This eliminates an entire class of data quality issues where corrupted or misformatted data silently pollutes your lake and causes failures weeks later when someone finally queries that partition.
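Here’s a minimal sketch of the "check first, write second" idea. The `SCHEMA` dict, the `SchemaMismatch` exception and the in-memory `table` list are all assumptions for illustration; a real engine checks types against table metadata rather than Python classes.

```python
SCHEMA = {"id": int, "amount": float}  # hypothetical table schema


class SchemaMismatch(Exception):
    pass


def validate_batch(rows, schema):
    """Raise before anything is written if any row does not fit the schema."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise SchemaMismatch(f"row {i}: columns {sorted(row)} != {sorted(schema)}")
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise SchemaMismatch(f"row {i}: {column!r} is not {expected.__name__}")


def write_batch(table, rows, schema):
    validate_batch(rows, schema)  # all-or-nothing: check first, then append
    table.extend(rows)


table = []
write_batch(table, [{"id": 1, "amount": 9.5}], SCHEMA)
try:
    write_batch(table, [{"id": 2, "amount": 1.0}, {"id": "3", "amount": 2.0}], SCHEMA)
except SchemaMismatch as err:
    print("rejected:", err)
print(len(table))  # 1
```

The second batch contains one good row and one bad row, and neither lands. That is the point: a bad batch is rejected whole rather than half-written.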
The combination of schema evolution and enforcement provides the flexibility teams need while maintaining data integrity. You can adapt your schema as requirements change without elaborate migration procedures, but you’re protected from accidental schema drift or malformed data. It’s the best of both worlds: database-like guarantees with data lake flexibility.
One of Delta Lake’s most powerful features is supporting UPDATE and DELETE operations efficiently. While Parquet files themselves are immutable, Delta Lake’s transaction log tracks which files make up the current table. Behind the scenes, Delta Lake rewrites only the files that contain affected rows, adding new files with the updated data and marking the old files as removed in the transaction log.
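The copy-on-write pattern described here can be sketched as follows. It is a simplified model where a dict maps file names to rows, and the `rewritten-` naming scheme is invented for the example:

```python
def delete_where(files, predicate):
    """Return (new_files, actions) after deleting rows matching `predicate`.

    `files` maps a file name to its list of rows. Files with no matching
    rows are left alone; affected files are replaced by rewritten copies.
    """
    new_files, actions = {}, []
    for name, rows in files.items():
        kept = [r for r in rows if not predicate(r)]
        if len(kept) == len(rows):
            new_files[name] = rows  # untouched, no rewrite
            continue
        actions.append({"remove": {"path": name}})
        if kept:
            rewritten = f"rewritten-{name}"  # hypothetical naming scheme
            new_files[rewritten] = kept
            actions.append({"add": {"path": rewritten}})
    return new_files, actions


files = {
    "f1.parquet": [{"user": "ann"}, {"user": "bob"}],
    "f2.parquet": [{"user": "cat"}],
}
files_after, log_actions = delete_where(files, lambda r: r["user"] == "bob")
print(log_actions)
```

Only `f1.parquet` is rewritten. `f2.parquet` never appears in the commit, which is where the savings over a full partition rewrite come from.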
This makes operations like GDPR compliance straightforward. When a user requests deletion of their data, you can run a simple DELETE statement that executes in minutes rather than rewriting terabytes of data. Error correction becomes practical. Found a bug in your data transformation logic? Update the affected rows without touching the rest of the dataset.
The MERGE operation enables sophisticated change data capture patterns. You can ingest a stream of database changes and apply them to your lake with a single statement that handles inserts, updates, and deletes appropriately. This pattern powers real-time data warehousing on data lakes, bringing together the freshness of streaming systems with the analytical power of batch processing.
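The semantics of that single statement boil down to "update if matched, insert if not, delete when told to". A toy version over an in-memory dict, with a made-up change format of `op` plus `row`, looks like this:

```python
def merge_changes(target, changes, key="id"):
    """Apply CDC-style changes to `target`, a dict keyed by primary key.

    Each change has an "op" of "upsert" or "delete" plus the row itself.
    """
    result = dict(target)
    for change in changes:
        row = change["row"]
        if change["op"] == "delete":
            result.pop(row[key], None)
        else:  # matched rows are updated, unmatched rows are inserted
            result[row[key]] = row
    return result


target = {1: {"id": 1, "city": "Leeds"}, 2: {"id": 2, "city": "York"}}
changes = [
    {"op": "upsert", "row": {"id": 2, "city": "Hull"}},  # update
    {"op": "upsert", "row": {"id": 3, "city": "Bath"}},  # insert
    {"op": "delete", "row": {"id": 1}},                  # delete
]
merged = merge_changes(target, changes)
print(merged)
```

In a real table the result would be published as one commit, so readers see either all of the changes in the batch or none of them.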
Delta Lake includes several features that address the small file problem and improve query performance without requiring manual intervention. File statistics are stored in the transaction log for every data file, including min and max values for each column, null counts, and record counts. Query engines use these statistics for data skipping, avoiding reading files that provably don’t contain relevant data.
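Data skipping is just an interval-overlap test per file. A minimal sketch, assuming hypothetical per-file statistics on an `order_date` column stored as ISO date strings (which compare correctly as plain strings):

```python
# Hypothetical per-file statistics, as a log might record them.
FILE_STATS = {
    "f1.parquet": {"min": {"order_date": "2024-01-01"}, "max": {"order_date": "2024-01-31"}},
    "f2.parquet": {"min": {"order_date": "2024-02-01"}, "max": {"order_date": "2024-02-29"}},
    "f3.parquet": {"min": {"order_date": "2024-03-01"}, "max": {"order_date": "2024-03-31"}},
}


def files_to_scan(stats, column, low, high):
    """Keep only files whose [min, max] range overlaps [low, high]."""
    return [
        name
        for name, s in stats.items()
        if s["max"][column] >= low and s["min"][column] <= high
    ]


print(files_to_scan(FILE_STATS, "order_date", "2024-02-10", "2024-02-20"))
```

Two of the three files are ruled out without being opened. The tighter the min/max ranges per file, the more skipping you get, which is what clustering is for.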
Z-ordering is a multi-dimensional clustering technique that co-locates related data to improve query performance when filtering on multiple columns. Unlike traditional partitioning, which only helps with single-column filters, Z-ordering maintains locality across multiple dimensions simultaneously. This dramatically improves performance for complex queries without the maintenance burden of multiple partition schemes.
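The textbook way to build a Z-order is to interleave the bits of each column value into one sort key (a Morton code). The sketch below shows that technique for two small integer columns. It is not Delta Lake’s actual implementation, which has to deal with arbitrary column types and value distributions.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y into a single Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x takes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y takes the odd bit positions
    return z


# Hypothetical rows with two small integer columns.
rows = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(rows, key=lambda r: z_value(*r))
print(ordered[:4])
```

Sorting by this key keeps rows that are close in both columns close together in the file layout, so the min/max range of each file stays narrow on both columns at once.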
Automatic compaction addresses the small file problem by periodically merging small files into larger, more efficient ones. This happens transparently in the background without disrupting ongoing operations. Delta Lake’s optimized write capabilities can also repartition and sort data during writes to create optimal file sizes and layouts from the start.
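Planning a compaction is essentially bin packing. Here’s a rough sketch using a simple first-fit pass; the file sizes, the 128 MB target and the `plan_compaction` helper are all assumptions for illustration, not how any particular engine picks files.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files into bins of at most `target_size`.

    `file_sizes` maps file name to size. Files already at or above the
    target are left alone. Uses a first-fit pass over sizes, largest first.
    """
    small = sorted(
        ((n, s) for n, s in file_sizes.items() if s < target_size),
        key=lambda item: item[1],
        reverse=True,
    )
    bins = []  # each bin: [total_size, [names]]
    for name, size in small:
        for b in bins:
            if b[0] + size <= target_size:
                b[0] += size
                b[1].append(name)
                break
        else:
            bins.append([size, [name]])
    # A bin holding a single file gains nothing from being rewritten.
    return [names for _total, names in bins if len(names) > 1]


sizes = {"a": 128, "b": 10, "c": 60, "d": 50, "e": 30}  # hypothetical, in MB
print(plan_compaction(sizes, target_size=128))
```

Each resulting group would then be rewritten as one file and committed as a single transaction that adds the merged file and removes the originals, so readers are never blocked while it runs.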
Multiple readers and writers can operate on a Delta table simultaneously with guaranteed consistency. Optimistic concurrency control ensures that conflicting writes are detected and handled appropriately. When conflicts occur, transactions can be automatically retried with the latest state, making concurrent access practical without coordination overhead.
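A stripped-down model of optimistic concurrency: remember the version you read, and at commit time check whether anyone who committed since then touched the same files. Real conflict detection is more nuanced than a file-set overlap, so treat the `Table` class and `write_with_retry` below as a hypothetical sketch.

```python
class Table:
    """In-memory stand-in for a table whose log is a list of commits."""

    def __init__(self):
        self.commits = []  # commit i is the set of files it touched

    def version(self):
        return len(self.commits)

    def commit(self, read_version, touched_files):
        """Commit if nothing written since `read_version` overlaps our files."""
        for other in self.commits[read_version:]:
            if other & touched_files:
                return False  # conflict: caller must re-read and retry
        self.commits.append(set(touched_files))
        return True


def write_with_retry(table, plan_write, max_attempts=3):
    """`plan_write(version)` returns the files a writer would touch."""
    for attempt in range(1, max_attempts + 1):
        read_version = table.version()
        touched = plan_write(read_version)
        if table.commit(read_version, touched):
            return attempt
    raise RuntimeError("gave up after repeated conflicts")


racy = Table()
calls = []


def plan(version):
    calls.append(version)
    if len(calls) == 1:
        racy.commit(version, {"hot"})  # a rival sneaks in during our first attempt
    return {"hot"}


attempts = write_with_retry(racy, plan)
print(attempts)  # 2
```

The first attempt loses to the rival and the second succeeds against the fresh state. Writers touching disjoint files never conflict at all in this model, which is why well-partitioned workloads rarely retry.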
A reader querying a table while a writer is updating it sees a consistent snapshot. The reader isn’t blocked, doesn’t need to wait for writes to complete, and the query results are deterministic. This is table stakes for traditional databases but was nearly impossible with raw Parquet files on object storage. Teams can stop carefully scheduling jobs to avoid conflicts and simply run workloads when they make sense.
The obvious question arises: if you want ACID transactions and all these database features, why not just use a database? The answer lies in the unique combination of capabilities that Delta Lake provides.
Scale and cost are the most immediate factors. Data Lakes scale horizontally on cheap object storage that costs pennies per gigabyte per month. Traditional databases require expensive storage tiers and have practical limits on dataset size. When you’re storing petabytes of data, the cost of database storage becomes prohibitive compared with object storage.
Flexibility is equally important. Delta Lake works with the entire Spark ecosystem, plus an expanding list of other engines. You can use notebooks for exploration, streaming jobs for real-time ingestion, batch jobs for heavy transformations, and ML pipelines for modeling, all against the same underlying data. You’re not locked into a specific query engine or processing paradigm. If you can avoid vendor lock-in, you absolutely should.
The open format matters more than many people initially realize. Delta tables are ultimately Parquet files with a JSON transaction log. If you absolutely need to, you can read them with any tool that understands Parquet. There’s no proprietary storage format, no special encoding, no vendor lock-in. This provides insurance against both vendor risk and technology changes.
Separation of compute and storage has become a fundamental architectural principle. Spin up processing power when you need it, shut it down when you don’t. Pay for storage independently of compute. Scale them separately based on your workload characteristics. This elasticity is central to modern cloud architectures and difficult to achieve with traditional database systems.
Delta Lake gives you database guarantees with data lake economics and flexibility. That’s the sweet spot for many modern data workloads. You’re not making a trade-off; you’re getting the best of both worlds.
Delta Lake is the foundation of what has become known as the lakehouse architecture, which combines the best aspects of data lakes and data warehouses. From data lakes, it inherits cheap storage, open formats, schema flexibility, and support for all data types including structured, semi-structured, and unstructured data. From data warehouses, it gains ACID transactions, schema enforcement, time travel, efficient updates and deletes, and query optimization.
This convergence is powerful because it eliminates the traditional pattern of maintaining separate systems for different workloads. You no longer need one system for raw data storage, another for cleaned and transformed data, and yet another for serving analytics. You can build a single platform that handles raw data ingestion and storage, ETL and data transformation, interactive analytics and business intelligence, machine learning and data science, and real-time streaming alongside batch processing.
All of these workloads operate on the same underlying dataset with consistent semantics and transactional guarantees. This eliminates data duplication, simplifies architecture, reduces operational complexity, and most importantly, ensures that everyone is working with the same version of truth.
Real-world usage of Delta Lake reveals patterns that demonstrate its versatility. Streaming and batch processing unite naturally under Delta Lake. You can write streaming data continuously while batch jobs read from the same table, all with full consistency. Streaming jobs can also read from Delta tables, enabling complex stream processing pipelines where one stage’s output becomes another’s input without intermediate storage systems.
Change data capture becomes elegant with Delta Lake’s MERGE operation. You can ingest a stream of database changes and apply them to your lake, maintaining a complete history of how data evolved. This pattern enables real-time data warehousing where your analytical datasets stay current with your operational databases, often with latency measured in seconds rather than hours.
Slowly changing dimensions, historically one of the more painful data warehouse patterns to implement, become straightforward with Delta Lake. Type 2 SCD implementations that track full history with effective dates can be expressed as simple MERGE statements. The combination of efficient updates and full history makes temporal analysis practical at scale.
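The logic such a MERGE has to express is: close the current row when a tracked attribute changes, then insert a new current row. A toy version in plain Python, with invented column names (`valid_from`, `valid_to`, `is_current`) and a single tracked attribute:

```python
def apply_scd2(dimension, updates, today, key="customer_id", tracked="city"):
    """Return a new dimension list with Type 2 history applied.

    Rows carry `valid_from`, `valid_to` (None while current) and `is_current`.
    """
    result = [dict(row) for row in dimension]
    current = {r[key]: r for r in result if r["is_current"]}
    for update in updates:
        existing = current.get(update[key])
        if existing is not None and existing[tracked] == update[tracked]:
            continue  # nothing changed
        if existing is not None:
            existing["valid_to"] = today  # close the old version
            existing["is_current"] = False
        new_row = {**update, "valid_from": today, "valid_to": None, "is_current": True}
        result.append(new_row)
        current[update[key]] = new_row
    return result


dim = [{"customer_id": 1, "city": "Leeds", "valid_from": "2023-01-01",
        "valid_to": None, "is_current": True}]
updates = [{"customer_id": 1, "city": "York"}, {"customer_id": 2, "city": "Bath"}]
history = apply_scd2(dim, updates, today="2024-06-01")
for row in history:
    print(row)
```

Customer 1 ends up with two rows, the closed Leeds row and the current York row, while customer 2 is a plain insert.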
Time series analysis benefits tremendously from Delta Lake’s time travel capability. You can compare the current state of your data with its state at any previous point in time using simple temporal joins. This enables trend analysis, anomaly detection, and forecasting without maintaining separate historical snapshots.
Delta Lake exists within a competitive landscape. Apache Iceberg and Apache Hudi offer similar capabilities, creating what’s been called the table format wars in modern data engineering. Understanding the differences helps in making informed choices.
Apache Iceberg takes a more specification-driven approach with a focus on broader engine support. It works well with Spark, Flink, Trino, Presto, and other engines, making it attractive for organizations with heterogeneous data processing stacks. Iceberg has strong architectural design and benefits from diverse community backing across multiple companies and projects.
Apache Hudi pioneered many concepts that both Delta Lake and Iceberg later adopted, particularly around streaming use cases and record-level updates. It has excellent change data capture support and was designed from the ground up for upsert-heavy workloads. Hudi’s focus on incremental processing and streaming makes it compelling for certain use cases. I’ve only ever used Iceberg and Delta tables, so I can’t really comment on Hudi in particular.
Delta Lake benefits from Databricks’ backing, tight Spark integration, and strong performance characteristics. It’s arguably the most mature and widely adopted of the three formats, with the largest production deployments and most extensive tooling ecosystem. The recent introduction of Universal Format, which allows Delta Tables to be read as Iceberg or Hudi, addresses interoperability concerns.
In practice, all three formats solve similar problems with different design philosophies and trade-offs. The choice often comes down to ecosystem fit, existing infrastructure investments, and team expertise. The encouraging aspect is that competition has driven all three formats to improve rapidly, and they’re converging on many core capabilities.
Delta Lake shines in several scenarios that characterize modern data platforms. If you’re building on Spark, Delta Lake’s integration is seamless and performant, making it a natural choice. When you need reliable data pipelines where failures don’t corrupt your data and operations are atomic, Delta Lake eliminates entire classes of operational problems.
Delta Lake isn’t the right choice for everything, though. Very small datasets measured in gigabytes rather than terabytes often work better in traditional database systems. The operational overhead of data lake infrastructure doesn’t pay off at small scale, and database systems provide richer SQL support, better tooling, and simpler operations.
When workloads require specific database capabilities like Postgres-compatible SQL with its rich function library, foreign key constraints, or complex transaction patterns, you should use an actual database. Fighting against your workload’s natural requirements rarely ends well.
Millisecond latency requirements push you toward systems designed for low-latency access. Data lakes are optimized for throughput, not latency. While Delta Lake is faster than raw Parquet, it’s still serving data from object storage with fundamentally different performance characteristics than memory-based or SSD-based databases.
Multi-engine requirements where you need strong support for Trino, Flink, and other non-Spark engines might be better served by Iceberg, which was designed with engine portability as a core principle.
The table format space continues evolving rapidly. Delta Lake is expanding engine support beyond Spark, with connectors for various query and processing engines improving steadily. This addresses one of the historical criticisms of Delta Lake being too Spark-centric.
Performance optimizations like Photon and liquid clustering promise to make Delta Lake competitive with specialized analytical databases for query performance. Photon is Databricks’ vectorized query engine, which speeds up execution itself, while liquid clustering organizes data on disk to accelerate queries without manual tuning.
Universal Format represents an interesting direction where Delta Lake tables can be read as Iceberg or Hudi, enabling interoperability across different parts of an organization that have standardized on different formats. This pragmatic approach acknowledges that large organizations often have diverse tooling and reduces the friction of format choices.
Streaming improvements continue with focus on lower latency, better throughput, and more streaming-native features. The gap between batch and streaming semantics continues to narrow, making truly unified architectures more practical.
Deeper cloud integration across AWS, Azure, and GCP means better support for cloud-native features like fine-grained access control, encryption, and lifecycle management, making Delta Lake feel more like a native cloud service than a layer on top of object storage.
Delta Lake succeeded because it solved real problems that data engineers faced daily. It didn’t reinvent storage or require abandoning existing infrastructure. It added a thin transactional layer that made data lakes reliable, consistent, and manageable.
The problems Delta Lake addresses aren’t exotic. They’re the mundane operational challenges that plague production data systems: partial writes from failed jobs, inconsistent reads during concurrent access, inability to fix mistakes without expensive rewrites, poor performance from accumulated small files, and lack of support for updates and deletes.
By solving these problems pragmatically with an open format layered on top of Parquet, Delta Lake made data lakes viable for mission-critical workloads. The architectural decision to augment rather than replace existing data lake infrastructure meant that adoption was incremental rather than requiring wholesale platform migration.
In an industry that often gravitates toward complexity and novel approaches, Delta Lake’s value proposition is refreshingly simple: it makes your data lake not suck. For the thousands of data teams struggling with reliability, consistency, and operational pain in their data lakes, that’s more than enough. It’s transformative.