Log-Structured Merge Trees

When you run a modern database like Cassandra, RocksDB, or ScyllaDB, there’s a good chance it’s powered by a Log-Structured Merge Tree (LSM Tree). This data structure is built for one thing above all else:

High-throughput writes at scale

Illustration of the compaction process in a Log-Structured Merge Tree (LSM Tree), showing multiple levels with sorted files being merged over time to create fewer, larger files. — **Data Compaction & LSM Tree**

What Is an LSM Tree?

An LSM Tree is a storage data structure optimized for write-heavy workloads.
Instead of writing updates directly to the main on-disk structure (as a B-tree would), LSM Trees:

Buffer writes in memory (MemTable).
Append to a sequential log for durability.
Periodically flush memory to disk in sorted files (SSTables).
Merge & compact files over time to maintain sorted order and remove old data.

For more information on B-Trees, see this post.

What are B-Trees?

Why Use LSM Trees?

Fast Writes – Writes are sequential, avoiding costly random I/O.
Batch-Friendly – Flushing and merging amortizes disk seeks.
Scalable – Handles massive datasets across distributed nodes.

The trade-off: Reads can be slower than B-trees because data may be scattered across multiple SSTables — but Bloom filters and compaction help mitigate this.

How It Works – Step-by-Step

Incoming writes go to MemTable (in memory).
Also appended to a WAL (Write-Ahead Log) for crash recovery.
When MemTable is full, it’s flushed to disk as a sorted SSTable.
Over time, SSTables are merged and deduplicated.
Compaction also enforces TTL and removes deleted data (tombstones).
Look in MemTable first.
Then search SSTables, using Bloom filters to skip irrelevant files.

Bloom Filters

Example: LSM Tree in Action (Cassandra-Style)

-- Insert high-volume write data 

INSERT INTO sensor_data (sensor_id, timestamp, reading) 
VALUES ('A123', toTimestamp(now()), 42.5); 

-- Query for the latest readings 
SELECT * FROM sensor_data WHERE sensor_id = 'A123' ORDER BY timestamp DESC LIMIT 10;

-- Insert high-volume write data 

INSERT INTO sensor_data (sensor_id, timestamp, reading) 
VALUES ('A123', toTimestamp(now()), 42.5); 

-- Query for the latest readings 
SELECT * FROM sensor_data WHERE sensor_id = 'A123' ORDER BY timestamp DESC LIMIT 10;

Under the hood, Cassandra stores this data in MemTables → SSTables using LSM trees, giving write-heavy IoT ingestion pipelines massive throughput.

Who Uses LSM Trees?

Apache Cassandra – Distributed NoSQL, time-series, and IoT data.
ScyllaDB – Low-latency, Cassandra-compatible database.
RocksDB – Embedded key-value store from Facebook.
LevelDB – Lightweight key-value store from Google.
HBase – Bigtable-inspired, Hadoop-backed store.

Advantages of LSM Trees

Exceptional write performance.
Ideal for time-series and streaming workloads.
Efficient use of SSDs/HDDs through sequential I/O.

Drawbacks

Read amplification, more work to find a single key.
Space amplification from multiple SSTable versions before compaction.
Write stalls during compaction under heavy load.

Final Thoughts

If B-trees are the Swiss Army knife of databases, LSM Trees are the bulldozers, built to push huge volumes of data into persistent storage as fast as possible. For write-heavy workloads like IoT telemetry, messaging systems, and real-time analytics, LSM Trees remain one of the most important data structures in modern systems.

Original LSM Tree Paper (O’Neil et al., 1996)

What Is an LSM Tree?

Why Use LSM Trees?

How It Works – Step-by-Step

Example: LSM Tree in Action (Cassandra-Style)

Who Uses LSM Trees?

Advantages of LSM Trees

Drawbacks

Final Thoughts

Share this:

Related Posts

Semantic Models

DuckLake

Why Iceberg

Discover more from Where Data Engineering Meets Business Strategy