Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124

Memory is cheap, but it ain’t free. In the world of modern data engineering, compression is everywhere. It’s in your Parquet files, your Kafka messages, your database storage engine, your .txt file and your API responses. Yet despite its ubiquity, compression remains one of the least understood aspects of data system design. It just seems to work so let’s not worry about it.
However, choosing the wrong codec can cost you in performance or storage efficiency. Choosing the right one can be the difference between a system that scales and one that collapses under its own weight.
The one constant is change. I once worked on a system that logged everything in excruciating levels of detail (orders, events, etc.), hundreds of times a second. Tiny, kilobyte-sized dumps across a massively distributed estate. Over time, we started to notice some interesting patterns emerge in our Grafana monitoring.

We were hitting utilization limits, but we had already compressed the operational data in situ. What was causing this? Those tiny writes were no longer tiny. They had grown to tens of terabytes and were still growing: full-fat, uncompressed text files that were now impacting the performance of our production estate.

This guide explores the full landscape of compression codecs used in modern data development, from the speed-obsessed to the space-optimized and everything in between.
Compression works by finding and exploiting patterns:
Before diving into specific codecs, it’s important to understand the fundamental trade-off space. Every compression algorithm sits somewhere on a multi-dimensional spectrum:
Compression Speed vs Decompression Speed vs Compression Ratio vs CPU Utilization vs Memory Requirements
There is no free lunch. Fast compression usually means lower ratios. High ratios usually mean slower speeds. Understanding where your workload sits on this spectrum is the first step to making intelligent choices.
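To get a feel for where a workload sits on that spectrum, it helps to measure rather than guess. Below is a minimal sketch that uses only codecs shipped in the Python standard library (zlib, bz2, lzma) on a synthetic, log-like payload. The payload, the `measure` helper and the choice of codecs are assumptions made for illustration; numbers on real data will differ.

```python
import bz2
import lzma
import time
import zlib

# Synthetic, highly repetitive log-like sample (an assumption; real data behaves differently).
sample = b"".join(
    b'{"timestamp":"2024-01-15T10:00:%02d","level":"INFO","user_id":%d}\n' % (i % 60, 12345 + i)
    for i in range(5_000)
)

codecs = {
    "zlib level 1": lambda data: zlib.compress(data, 1),
    "zlib level 9": lambda data: zlib.compress(data, 9),
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

def measure(compress, data):
    """Return (compression ratio, seconds taken) for one codec on one payload."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed

results = {name: measure(fn, sample) for name, fn in codecs.items()}
for name, (ratio, seconds) in results.items():
    print(f"{name:>12}: ratio {ratio:5.1f}x in {seconds * 1000:7.1f} ms")
```

Run it against a sample of your own data before drawing conclusions; the ranking of codecs can change with payload size and content.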
Let’s take an example JSON payload:
{"timestamp":"2024-01-15T10:00:00","level":"INFO","message":"User login successful","user_id":12345}
{"timestamp":"2024-01-15T10:00:01","level":"INFO","message":"User login successful","user_id":12346}
{"timestamp":"2024-01-15T10:00:02","level":"ERROR","message":"User login failed","user_id":12347}
{"timestamp":"2024-01-15T10:00:03","level":"INFO","message":"User login successful","user_id":12348}
{"timestamp":"2024-01-15T10:00:04","level":"INFO","message":"User login successful","user_id":12349}
Original size: ~500 bytes (about 100 bytes per line)
Dictionary:
A = "timestamp"
B = "level"
C = "message"
D = "user_id"
E = "2024-01-15T10:00:0"
F = "INFO"
G = "User login successful"
Compressed representation:
{"A":"E0","B":"F","C":"G","D":12345}
{"A":"E1","B":"F","C":"G","D":12346}
{"A":"E2","B":"ERROR","C":"User login failed","D":12347}
{"A":"E3","B":"F","C":"G","D":12348}
{"A":"E4","B":"F","C":"G","D":12349}
Estimated compressed size: ~220 bytes (plus ~100 bytes for the dictionary) = ~320 bytes
Savings: roughly 30-35%
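The substitution above can be sketched in a few lines of Python. This is an illustration rather than how any production codec stores its dictionary: the `encode` and `decode` helpers are hypothetical, and the single-letter tokens A to G are swapped for control characters so that a token can never collide with text already in the payload.

```python
lines = [
    '{"timestamp":"2024-01-15T10:00:00","level":"INFO","message":"User login successful","user_id":12345}',
    '{"timestamp":"2024-01-15T10:00:01","level":"INFO","message":"User login successful","user_id":12346}',
    '{"timestamp":"2024-01-15T10:00:02","level":"ERROR","message":"User login failed","user_id":12347}',
    '{"timestamp":"2024-01-15T10:00:03","level":"INFO","message":"User login successful","user_id":12348}',
    '{"timestamp":"2024-01-15T10:00:04","level":"INFO","message":"User login successful","user_id":12349}',
]
original = "\n".join(lines)

# Hypothetical token table: control characters stand in for A..G.
phrases = ["timestamp", "level", "message", "user_id",
           "2024-01-15T10:00:0", "INFO", "User login successful"]
dictionary = {phrase: chr(1 + i) for i, phrase in enumerate(phrases)}

def encode(text, table):
    # Replace longer phrases first so a short phrase cannot split a long one.
    for phrase in sorted(table, key=len, reverse=True):
        text = text.replace(phrase, table[phrase])
    return text

def decode(text, table):
    # Tokens never occur in the original text, so this restores it exactly.
    for phrase, token in table.items():
        text = text.replace(token, phrase)
    return text

encoded = encode(original, dictionary)
dictionary_size = sum(len(phrase) + 1 for phrase in dictionary)  # phrase plus its token
print(len(original), "characters before;", len(encoded), "+", dictionary_size, "after")
```

Note that the dictionary has to travel with the data (or be agreed in advance), which is why its size counts against the savings.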
This works great when we have repeating patterns. Let’s try a simpler example first:
Original:
"AAAAABBBBBCCCCCAAAAABBBBB"
RLE Compressed:
"5A5B5C5A5B"
Savings: 25 bytes → 10 bytes (60% compression!)
This technique is heavily used in columnar file formats like Apache Parquet.
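Run-length encoding is small enough to write out in full. The sketch below reproduces the example above; the function names are made up for illustration, and it assumes the input contains no digits (real formats store counts in binary, so they do not have that limitation).

```python
import re
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into '<count><char>'."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

def rle_decode(encoded):
    """Expand '<count><char>' pairs back into runs (assumes no digits in the data)."""
    return "".join(char * int(count) for count, char in re.findall(r"(\d+)(\D)", encoded))

original = "AAAAABBBBBCCCCCAAAAABBBBB"
compressed = rle_encode(original)
print(compressed)  # 5A5B5C5A5B
```

The flip side is worth knowing: on data without runs (say "ABCDE") this scheme doubles the size, which is why formats apply it selectively.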
Back-references (the LZ77 family of techniques) are the most powerful approach for our simple JSON example. The compressor looks backward for matching sequences:
Line 1: {"timestamp":"2024-01-15T10:00:00","level":"INFO","message":"User login successful","user_id":12345}
[Store fully - nothing to reference yet]
Line 2: {"timestamp":"2024-01-15T10:00:01","level":"INFO","message":"User login successful","user_id":12346}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This part is almost identical to line 1!
Instead of storing: {"timestamp":"2024-01-15T10:00:0
Store: <reference to position 0, copy 32 chars>
Then store the difference: 1","level":"INFO"...
Line 1: [Full content stored]
Line 2: ←(back 100 chars, copy 32) + "1" + ←(back 100, copy 65) + "6}"
Line 3: ←(back 200 chars, copy 32) + "2" + ←(back differences...)
Result: Each subsequent line only stores the tiny differences!
Estimated compressed size: ~150-200 bytes
Savings: ~55-60% compression
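For the curious, here is a deliberately naive sketch of the back-reference idea: a greedy tokenizer that emits either a literal character or a (distance, length) pair. The function names, the 4096-character window and the minimum match length of 3 are assumptions for illustration. Real LZ77-family codecs use hash tables instead of this brute-force search and pack their tokens into bits.

```python
def lz77_encode(data, window=4096, min_match=3):
    """Greedy LZ77-style tokenizer: literal characters or (distance, length) pairs."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for start in range(max(0, pos - window), pos):
            length = 0
            # A match may run past `pos` (an overlapping copy), as in real LZ77.
            while pos + length < len(data) and data[start + length] == data[pos + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - start
        if best_len >= min_match:
            tokens.append((best_dist, best_len))
            pos += best_len
        else:
            tokens.append(data[pos])
            pos += 1
    return tokens

def lz77_decode(tokens):
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # copy one character at a time
        else:
            out.append(token)
    return "".join(out)

line1 = '{"timestamp":"2024-01-15T10:00:00","level":"INFO","message":"User login successful","user_id":12345}'
line2 = '{"timestamp":"2024-01-15T10:00:01","level":"INFO","message":"User login successful","user_id":12346}'
data = line1 + line2
tokens = lz77_encode(data)
print(f"{len(data)} characters became {len(tokens)} tokens")
print("tokens covering line 2:", tokens[-5:])
```

The second line collapses into a handful of tokens because almost all of it already exists 100 characters earlier.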
Developed by Google and released in 2011, Snappy represents a philosophy: compress just enough to make I/O faster, but never become the bottleneck yourself. It achieves modest compression ratios of 1.5x to 2x while delivering compression speeds of 250-500 MB/s and decompression speeds exceeding 500 MB/s on modern hardware.
Snappy makes sense when you’re I/O bound but have CPU to spare. It’s the default codec in much of the Hadoop ecosystem, standard in Google’s internal infrastructure, and ubiquitous in Parquet files. The algorithm is also remarkably stable and predictable, making it ideal for production systems where consistency matters.
Where you’ll find it: Hadoop, HBase, Cassandra, Parquet, BigQuery, and countless other big data systems.
If Snappy is fast, LZ4 is blazing. Created by Yann Collet in 2011, LZ4 achieves decompression speeds that often exceed 2 GB/s on modern processors. Its compression ratio is similar to Snappy, but it comes in two flavors: LZ4 (extreme speed) and LZ4HC (high compression, slower but better ratios).
LZ4’s claim to fame is its decompression speed, which makes it perfect for scenarios where data is compressed once but decompressed many times. Gaming engines use it for asset compression. In-memory databases use it to reduce memory pressure without sacrificing access speed.
Where you’ll find it: Redis, Kafka, Hadoop (as an alternative to Snappy), gaming engines, and embedded systems.
The grandfather of fast compression, LZO (Lempel-Ziv-Oberhumer) dates back to 1996. While largely superseded by Snappy and LZ4 in new projects, LZO remains common in legacy Hadoop installations. Its compression ratio and speed are comparable to Snappy, but licensing concerns and the availability of more modern alternatives have reduced its popularity.
Where you’ll find it: Older Hadoop clusters, legacy systems, and embedded devices.
Released by Facebook in 2016, ZSTD represents the modern generation of compression. It’s not just a codec; it’s a family of compression strategies accessible through compression levels 1-22. At level 1, ZSTD performs comparably to Snappy. At level 22, it approaches LZMA compression ratios. This flexibility makes it remarkably versatile.
ZSTD’s killer feature is dictionary compression. When compressing many similar small objects (think JSON documents, log entries, or IoT events), ZSTD can train a dictionary on sample data and achieve compression ratios 2-3x better than without the dictionary. This has made it a favorite for companies like Facebook, Netflix, and Dropbox.
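To show why a shared dictionary matters so much for small objects, here is a sketch using the preset-dictionary support in Python's built-in zlib module as a stand-in. This is not ZSTD: ZSTD trains its dictionary from samples, whereas here the dictionary is simply handed over as bytes. The `shared_dict` contents and helper names are assumptions chosen for illustration; the principle (the compressor can reference bytes it has never been sent) is the same.

```python
import zlib

# Hypothetical shared dictionary: bytes we expect to recur across messages.
shared_dict = (
    b'{"timestamp":"2024-01-15T10:00:00","level":"INFO",'
    b'"message":"User login successful","user_id":'
)

def compress_with_dict(payload, zdict):
    compressor = zlib.compressobj(zdict=zdict)
    return compressor.compress(payload) + compressor.flush()

def decompress_with_dict(blob, zdict):
    # The reader must hold the exact same dictionary as the writer.
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()

message = (
    b'{"timestamp":"2024-01-15T10:00:07","level":"INFO",'
    b'"message":"User login successful","user_id":12352}'
)
plain = zlib.compress(message)
with_dict = compress_with_dict(message, shared_dict)
print(len(message), "bytes raw,", len(plain), "without dictionary,", len(with_dict), "with dictionary")
```

A single small message has almost no internal repetition to exploit, so compressing it alone gains little; the dictionary supplies the history it lacks.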
The adoption curve for ZSTD has been steep. Initially positioned as a gzip replacement, it’s now becoming a default choice for new systems that need flexibility. Compression at level 3 provides Snappy-like speed with better ratios. Levels 6-9 offer an excellent balance for most workloads. Higher levels are reserved for archival and cold storage.
Where you’ll find it: Facebook’s infrastructure, the Linux kernel, HTTP compression (as an alternative to gzip), Kafka, Parquet, and increasingly everywhere else.
The workhorse of the internet since 1992, gzip (based on the DEFLATE algorithm) remains ubiquitous despite being relatively slow by modern standards. It achieves good compression ratios (typically 2.5x to 4x) at moderate speeds. While tools like ZSTD often outperform it, gzip’s universal support and decades of optimization make it a safe default.
The algorithm combines LZ77 compression with Huffman coding, providing solid all-around performance. Decompression is reasonably fast, making gzip suitable for scenarios where data is compressed once and decompressed many times.
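The Huffman half of that pairing is easy to sketch: frequent symbols get short bit codes and rare symbols get long ones. The snippet below only computes code lengths for the characters of a string. The function name is hypothetical, and real DEFLATE applies Huffman coding to the literal, length and distance symbols produced by the LZ77 stage rather than to raw characters.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code over `text`."""
    counts = Counter(text)
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}
    # Heap entries: (total frequency, unique tie-breaker, {symbol: depth so far}).
    heap = [(freq, i, {symbol: 0}) for i, (symbol, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        freq_a, _, depths_a = heapq.heappop(heap)
        freq_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (freq_a + freq_b, tie, merged))
        tie += 1
    return heap[0][2]

text = "User login successful"
lengths = huffman_code_lengths(text)
encoded_bits = sum(lengths[ch] for ch in text)
print(f"{len(text) * 8} bits as 8-bit characters, {encoded_bits} bits Huffman-coded")
```

The code table itself also has to be stored, which is one reason entropy coding pays off on larger inputs rather than on a 21-character string.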
Where you’ll find it: HTTP compression, file archives (.gz, .tar.gz), older data lakes, backup systems, and anywhere universal compatibility matters.
Offering better compression ratios than gzip (3x to 5x) at the cost of significantly slower compression and decompression, bzip2 uses the Burrows-Wheeler transform algorithm. It was once popular for file archives and backups but has largely been displaced by ZSTD and LZMA for high-ratio compression needs.
Where you’ll find it: File archives, older backup systems, and scenarios requiring better compression than gzip with universal tool support.
When storage is at a premium and CPU time is cheap, LZMA (Lempel-Ziv-Markov chain Algorithm) and the XZ format built on its successor, LZMA2, deliver compression ratios of 4x to 6x or higher. These algorithms are slow, both for compression and decompression, but they squeeze data remarkably efficiently.
LZMA is less common in real-time data systems but finds its niche in software distribution (Linux packages, installers), archival systems, and anywhere the data will be compressed once and rarely accessed.
Where you’ll find it: Software distribution (.xz files), archival storage, backup systems, and cold data storage.
Originally developed by Google for web font compression, Brotli has evolved into a general-purpose compressor particularly well-suited for text data. It achieves compression ratios 20-30% better than gzip at comparable speeds and has gained widespread adoption for HTTP compression.
Brotli includes a large static dictionary optimized for web content, making it exceptionally efficient for HTML, CSS, and JavaScript. Modern browsers and web servers all support Brotli, and it’s increasingly the default for web compression.
Where you’ll find it: HTTP responses, web applications, CDNs, and text-heavy data storage.
Modern columnar formats like Parquet, ORC, and Arrow don’t just support general-purpose codecs; they implement specialized compression schemes optimized for columnar data:
Run-Length Encoding (RLE) crushes columns with repeated values. A column of 10 million identical values compresses to just a few bytes.
Dictionary Encoding excels for low-cardinality string columns. Store unique values once, then use integer references everywhere else.
Delta Encoding captures differences between consecutive values, perfect for timestamps, sequential IDs, or sorted numerical data.
These techniques often achieve 10x to 100x compression on appropriate data types before applying general-purpose codecs on top.
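Dictionary and delta encoding can be sketched just as briefly (run-length encoding follows the same pattern). These are conceptual illustrations with made-up function names, not the actual on-disk encodings used by Parquet or ORC, which add bit-packing and page-level details on top.

```python
from itertools import accumulate

def dictionary_encode(values):
    """Store each distinct value once; replace occurrences with small integer codes."""
    codes = {}
    indices = [codes.setdefault(value, len(codes)) for value in values]
    return list(codes), indices

def dictionary_decode(dictionary, indices):
    return [dictionary[i] for i in indices]

def delta_encode(values):
    """Keep the first value, then only the difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    return list(accumulate(deltas))

levels = ["INFO", "INFO", "ERROR", "INFO", "INFO"]
user_ids = [12345, 12346, 12347, 12348, 12349]

level_dict, level_codes = dictionary_encode(levels)
id_deltas = delta_encode(user_ids)
print(level_dict, level_codes)  # ['INFO', 'ERROR'] [0, 0, 1, 0, 0]
print(id_deltas)                # [12345, 1, 1, 1, 1]
```

Notice how the outputs are themselves full of runs and tiny integers, which is exactly what makes stacking RLE, bit-packing or a general-purpose codec on top so effective.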
Where you’ll find it: Parquet, ORC, Apache Arrow, column stores, and analytical databases.
Facebook’s Gorilla algorithm, designed specifically for time-series data, exploits the temporal locality of metrics. It achieves remarkable compression ratios (often 10x or better) on time-series metrics by encoding timestamps as a delta of deltas (the change in the gap between consecutive timestamps) and using XOR-based compression for values.
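The two ideas behind Gorilla can be seen with a few lines of arithmetic. In the published design, timestamps are stored as the change in the gap between points (a delta of deltas), and each float value is XORed with its predecessor. The sketch below only computes those intermediate numbers; the variable-length bit packing that turns them into actual savings is omitted, and the sample data and function names are assumptions.

```python
import struct

def delta_of_delta(timestamps):
    """Change in the interval between consecutive timestamps."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def float_bits(value):
    """Reinterpret a float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]

def xor_with_previous(values):
    bits = [float_bits(v) for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]

# Hypothetical metric scraped roughly every 60 seconds.
timestamps = [1705312800, 1705312860, 1705312920, 1705312981, 1705313040]
values = [21.5, 21.5, 21.5, 22.0]

dod = delta_of_delta(timestamps)
xors = xor_with_previous(values)
print(dod)                               # [0, 1, -2]
print([f"{x:016x}" for x in xors])
```

Regular scrape intervals make most delta-of-delta values zero, and unchanged readings XOR to zero, so both streams can be stored in very few bits per point.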
Where you’ll find it: Time-series databases like Prometheus, InfluxDB, TimescaleDB, and monitoring systems.
When dealing with sparse datasets or set operations, Roaring Bitmaps provide both compression and fast operations. They dynamically choose between array containers, bitmap containers, and run containers based on density, achieving excellent compression while keeping membership tests and set operations such as union and intersection fast.
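The container choice can be sketched as a size comparison. Roaring splits 32-bit integers into chunks by their upper 16 bits and picks a representation per chunk. The sketch below is a simplification with hypothetical names and approximate byte costs (2 bytes per value for an array, a fixed 8192 bytes for a bitmap, 4 bytes per run plus a small header); it is not the logic of any particular Roaring library.

```python
from collections import defaultdict

BITMAP_BYTES = 8192  # 2**16 bits, one per possible low 16-bit value

def count_runs(sorted_values):
    """Number of maximal runs of consecutive integers."""
    return sum(
        1 for i, v in enumerate(sorted_values)
        if i == 0 or v != sorted_values[i - 1] + 1
    )

def choose_containers(values):
    """Return {chunk key: cheapest container kind} under the assumed byte costs."""
    chunks = defaultdict(set)
    for value in values:
        chunks[value >> 16].add(value & 0xFFFF)
    plan = {}
    for high, lows in sorted(chunks.items()):
        ordered = sorted(lows)
        sizes = {
            "array": 2 * len(ordered),
            "bitmap": BITMAP_BYTES,
            "run": 2 + 4 * count_runs(ordered),
        }
        plan[high] = min(sizes, key=sizes.get)
    return plan

sparse = [5, 900, 40000]                               # a few scattered values
dense = list(range(1 << 16, 2 << 16, 2))               # every other value in chunk 1
contiguous = list(range(2 << 16, (2 << 16) + 50000))   # one long run in chunk 2

plan = choose_containers(sparse + dense + contiguous)
print(plan)  # {0: 'array', 1: 'bitmap', 2: 'run'}
```

Under these costs an array wins below 4096 values per chunk, a bitmap wins for dense but irregular data, and a run container wins when values arrive in long consecutive stretches.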
Where you’ll find it: Search engines, analytics databases, bitmap indexes, and set-based computations.
Today’s data architectures rarely use a single codec. Instead, they employ layered compression strategies:
Consider a modern data lakehouse architecture:
Streaming data arrives compressed with LZ4 in Kafka, gets processed in Spark using Snappy-compressed Parquet files, lands in a Delta Lake with ZSTD compression, and eventually moves to archival storage with high-level ZSTD or columnar-specific compression. Each stage optimizes for its specific access patterns and performance requirements.
For columnar analytics: Native columnar compression (RLE, dictionary, delta) combined with general-purpose codecs.
The compression landscape continues evolving rapidly. Several trends are emerging:
Adaptive Compression: Systems that automatically choose codecs based on data characteristics and access patterns.
Hardware Acceleration: Intel’s QPL (Query Processing Library) and other hardware-accelerated compression are making higher-ratio codecs practical for hot paths.
Machine Learning Compression: Neural compression models are beginning to outperform traditional algorithms for specific data types, though they remain experimental.
Context-Aware Compression: Like ZSTD’s dictionary training but more sophisticated, learning compression strategies from data semantics rather than just byte patterns.
Compression is not just a storage optimization; it’s a fundamental architectural decision that affects performance, cost, and scalability across your entire data platform. Understanding the full landscape of available codecs and their trade-offs is essential for building modern data systems that scale efficiently.
The era of “just use gzip” is over. Modern data systems demand nuanced compression strategies that adapt to data lifecycle, access patterns, and performance requirements. Master the compression codec landscape, and you unlock one of the highest-leverage optimizations available in data engineering.