Category: Software

You like DAGs?

In the vast landscape of data engineering and computational systems, Directed Acyclic Graphs (DAGs) quietly orchestrate the flow of tasks, computations, and dependencies. From compiler optimizations to machine learning pipelines, DAGs are everywhere (including a film with Brad Pitt, for…

What Will Replace SQL?

Nothing. Ok, I should probably write a bit more than that. Nothing replaces SQL outright. Instead, three complementary layers are emerging around SQL. It will likely be with us forever, but it’s interesting to understand what alternatives exist and why.…

The Importance of Asking ‘Why’

Let’s look at some hypothetical examples you may have actually witnessed with your engineering teams. We have one server, but probably need two. Let’s use Kubernetes. We need to count the number of messages. Let’s use Redis. We want to…

Data Build Tool (dbt)

Every generation of data tooling has its keystone. In the 1980s, relational databases defined the foundation. In the 2000s, Hadoop represented a seismic shift in scale. Today, in the cloud-first era of the modern data stack, one tool stands out…

Pandas or Polars?

For over a decade (I had to check this as it made me feel old) Pandas has been the go-to Python library for data analysis. Its DataFrame API has shaped how millions of analysts, scientists, and engineers work with tabular…

Apache Iceberg. All Hail the King

The last five years have seen a number of open data table formats vying for position, including Apache Iceberg, Delta Lake, and Apache Hudi. By mid-2025, the winner is clear: Apache Iceberg. This is not just a technical victory…

API Design – Data Products

The term data product has become ubiquitous in modern data organizations, but its meaning often remains fuzzy. Teams talk about building data products, while creating the same old dashboards, reports, and datasets they’ve always built. Is this new Excel spreadsheet…

DataFrames. The Wrong Choice?

DataFrames dominate modern data analysis. If I had a £ for every time I typed… import pandas as pd. Whether it’s Pandas/Polars in Python, Spark DataFrames, or R’s original implementation, the abstraction has become the default for manipulating tabular…

Soda vs Great Expectations

Data contracts are becoming the backbone of modern data architectures. As organisations shift from ad-hoc pipelines to product-oriented data ecosystems, they need guarantees: that data will arrive on time, in the right shape, and with the expected semantics. This is…

Databricks Indexing

Databricks (like Snowflake) doesn’t rely on traditional B-trees, because it’s built on a cloud-native, columnar, distributed file architecture. It avoids B-trees entirely because the cost of maintaining per-row index structures would destroy the scalability benefits of its append-only, distributed Parquet…