Change Data Capture

Change Data Capture (CDC) is one of those concepts that seems deceptively simple:

Just capture the changes

But it quickly becomes complex once you start implementing it at scale. In today’s world of streaming analytics, real-time machine learning, and event-driven microservices, CDC is a foundational building block for keeping data synchronized across systems without bulk reloads or downtime.

What is CDC?

Change Data Capture is the process of identifying and delivering changes made to a data source (typically a database) so that consumers can react to them.
Instead of pulling full tables or large data extracts, CDC continuously streams only the inserts, updates, and deletes.

Core benefits:

  • Efficiency — Avoids scanning entire datasets.
  • Low Latency — Near real-time replication to downstream systems.
  • Decoupling — Producers and consumers can evolve independently.
  • Auditability — Captures a record of changes for compliance.

A Brief History of CDC

CDC’s roots can be traced back to mainframe replication in the 1970s and 1980s, when vendors like IBM developed log-based capture for DB2 and IMS to synchronize data between batch jobs and transaction systems.

In the 1990s:

  • ETL tools such as Informatica and DataStage introduced CDC capabilities to reduce nightly batch times.
  • Database vendors like Oracle, Microsoft, and Sybase added proprietary log mining APIs.

By the 2010s:

  • The streaming revolution (Apache Kafka, Flink, Spark Structured Streaming) made CDC the primary pattern for moving operational data into analytics systems in near-real time.
  • Open source projects like Debezium democratized log-based CDC for a variety of databases.

Today, CDC is embedded in data lakehouse ingestion, cloud ETL pipelines, and event-driven architectures.

CDC Capture Techniques

CDC can be implemented in multiple ways, each with trade-offs.

Table Diff / Timestamp Based

  • Compare current state with last extracted snapshot.
  • Or filter by last_updated > :max_timestamp.
  • Pros: Simple to implement.
  • Cons: Requires extra columns (e.g., updated_at), can miss deletes, not true real-time.

SQL Example:

SELECT * FROM orders WHERE last_updated > '2025-08-01 00:00:00';
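In practice, the query above runs on a schedule, and the extractor must remember the highest timestamp it has seen (the "watermark") between runs. A minimal sketch, using an in-memory SQLite table as a stand-in for the source database (the `orders` schema and column names are illustrative; a real pipeline would persist the watermark in a file or metadata table):

```python
import sqlite3

def extract_changes(conn, watermark):
    """Return rows changed since `watermark`, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, status, last_updated FROM orders WHERE last_updated > ?",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the newest change we saw
    new_watermark = max((r[2] for r in rows), default=watermark)
    return rows, new_watermark

# Demo source table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "NEW", "2025-08-01 10:00:00"), (2, "SHIPPED", "2025-08-02 09:30:00")],
)

rows, wm = extract_changes(conn, "2025-08-01 00:00:00")  # picks up both rows
rows2, wm2 = extract_changes(conn, wm)                   # nothing new yet
```

Note that a row deleted between two runs simply vanishes from the result set — this is the "can miss deletes" weakness listed above.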

Trigger-Based

  • Database triggers write changes to an audit table.
  • Pros: Works even without access to transaction logs.
  • Cons: Adds write latency, increases database load.

Example (PostgreSQL):

CREATE TABLE order_changes (order_id INT, change_type TEXT, changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP);

CREATE OR REPLACE FUNCTION log_order_changes() RETURNS TRIGGER AS $$
BEGIN
    -- NEW is not set for DELETE, so read the key from OLD there
    IF TG_OP = 'DELETE' THEN
        INSERT INTO order_changes(order_id, change_type) VALUES (OLD.order_id, TG_OP);
        RETURN OLD;
    END IF;
    INSERT INTO order_changes(order_id, change_type) VALUES (NEW.order_id, TG_OP);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION log_order_changes();

Log-Based (Preferred)

  • Reads changes from the database transaction log (e.g., MySQL binlog, PostgreSQL WAL, Oracle redo logs).
  • Pros: Low overhead, captures deletes, can be asynchronous.
  • Cons: Requires access to logs, more complex parsing.

Tools: Debezium, Oracle GoldenGate, SQL Server CDC, AWS DMS.


Application-Level Events

  • The app publishes an event when it changes data.
  • Pros: Flexible, integrates with event-driven systems.
  • Cons: Risk of divergence between DB state and event messages if not transactional.

CDC Architectures in Practice

CDC is not just about capturing — it’s about delivering changes reliably.

Common CDC Data Flow

[ Source DB ] --> [ CDC Capture ] --> [ Message Broker / Stream ] --> [ Consumers ]

  • Capture Layer — Reads from triggers or logs.
  • Transport Layer — Kafka, Kinesis, Pulsar, or cloud pub/sub.
  • Consumer Layer — Data warehouse, search index, ML feature store, cache.

Example: Log-Based CDC with Kafka + Debezium

  1. Debezium Connector reads the DB transaction log.
  2. Produces events to a Kafka topic.
  3. Consumers subscribe and process events in real-time.

Kafka Topic Structure:

  • db.orders — events for the orders table.
  • Event payload contains before and after states.
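To make the before/after structure concrete, here is a simplified sketch of a Debezium-style change event, written as a Python dict. Real Debezium messages also carry a schema section and richer source metadata; the field names here ("op", "before", "after", "ts_ms") follow Debezium's conventions, but the values are invented:

```python
# Simplified Debezium-style update event for the db.orders topic
update_event = {
    "op": "u",  # c = create, u = update, d = delete
    "before": {"id": 42, "status": "NEW", "amount": 19.99},
    "after":  {"id": 42, "status": "SHIPPED", "amount": 19.99},
    "source": {"table": "orders", "lsn": 123456},  # log position (illustrative)
    "ts_ms": 1722816000000,
}

# Consumers typically branch on "op" and pick the relevant row state:
# deletes only populate "before", everything else populates "after".
state = update_event["after"] if update_event["op"] != "d" else update_event["before"]
```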

CDC in Modern Data Engineering

CDC is at the heart of modern real-time analytics architectures:

  • Lakehouse ingestion — Load updates directly into Delta Lake, Apache Iceberg, or Hudi.
  • Microservices sync — Keep caches, search indexes, and reporting DBs up to date.
  • Machine Learning — Feed online feature stores like Feast with low latency.
  • Compliance & Auditing — Maintain immutable logs for SOX, HIPAA, GDPR.

Designing a Robust CDC Pipeline

When designing CDC, you need to address:

  • Idempotency — Consumers must handle duplicate events.
  • Ordering Guarantees — Keyed ordering for correctness (e.g., Kafka partitions by primary key).
  • Schema Evolution — CDC must handle added/removed columns without breaking.
  • Error Handling — Dead letter queues for invalid messages.
  • Replayability — Ability to reprocess historical changes from logs.
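The first of these, idempotency, is often the cheapest to get right: if writes to the target are keyed on the primary key (an upsert) rather than blind inserts, duplicate deliveries are harmless. A minimal sketch, using a plain dict as a stand-in for the target table:

```python
# Idempotent CDC apply: processing the same event twice leaves the
# target in the same state, because writes are keyed on the primary key.
target = {}  # stand-in for a target table keyed by primary key

def apply_event(event):
    row = event["after"] or event["before"]
    key = row["id"]
    if event["op"] == "d":
        target.pop(key, None)        # deleting twice is a no-op
    else:
        target[key] = event["after"]  # insert and update both upsert

evt = {"op": "c", "before": None, "after": {"id": 1, "status": "NEW"}}
apply_event(evt)
apply_event(evt)  # duplicate delivery from an at-least-once broker; state unchanged
```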

Python Example — Consuming CDC Events from Kafka

Here’s a simple CDC consumer using confluent-kafka in Python:

from confluent_kafka import Consumer, KafkaError
import json

conf = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'cdc-consumer',
    'auto.offset.reset': 'earliest'
}

consumer = Consumer(conf)
consumer.subscribe(['db.orders'])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            # Partition EOF is informational; anything else is a real error
            if msg.error().code() != KafkaError._PARTITION_EOF:
                print(f"Error: {msg.error()}")
            continue
        event = json.loads(msg.value().decode('utf-8'))
        print("Change event:", event)
except KeyboardInterrupt:
    pass
finally:
    consumer.close()

This script reads Debezium CDC events from the db.orders topic, parses each message as JSON, and prints the resulting change event.


Python Example — Applying CDC Events to a Target Database

import psycopg2

def apply_change(change_event):
    op = change_event['op']  # c = create, u = update, d = delete
    # 'after' holds the new row state; for deletes only 'before' is populated
    data = change_event['after'] if op != 'd' else change_event['before']

    conn = psycopg2.connect("dbname=analytics user=etl password=secret")
    cur = conn.cursor()
    try:
        if op == 'c':
            cur.execute("INSERT INTO orders VALUES (%s, %s, %s)",
                        (data['id'], data['status'], data['amount']))
        elif op == 'u':
            cur.execute("UPDATE orders SET status=%s, amount=%s WHERE id=%s",
                        (data['status'], data['amount'], data['id']))
        elif op == 'd':
            cur.execute("DELETE FROM orders WHERE id=%s", (data['id'],))
        conn.commit()
    finally:
        cur.close()
        conn.close()

CDC and Schema Evolution

CDC complicates schema changes. If a new column is added:

  • Log-based systems like Debezium propagate it automatically (if configured).
  • Consumers must handle unknown fields gracefully.
  • Best practice: Use schema registry (e.g., Confluent Schema Registry with Avro/Protobuf).
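Even with a schema registry, consumers should be "tolerant readers": ignore fields they don't recognize and supply defaults for fields that older producers never sent. A minimal sketch (the field names and the `currency` default are illustrative, not a fixed contract):

```python
# Tolerant reader for CDC row payloads: unknown columns are dropped,
# missing columns get defaults, so schema changes don't break the consumer.
KNOWN_FIELDS = {"id", "status", "amount"}

def normalize(after):
    row = {k: after.get(k) for k in KNOWN_FIELDS}   # keep only known columns
    row["currency"] = after.get("currency", "USD")  # default for a newer column
    return row

old_event = {"id": 1, "status": "NEW", "amount": 10.0}  # pre-migration producer
new_event = {"id": 2, "status": "NEW", "amount": 5.0,
             "currency": "EUR", "discount_code": "X1"}  # columns added later

print(normalize(old_event))  # currency falls back to the default
print(normalize(new_event))  # discount_code is silently dropped
```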

Real-World Pitfalls & Lessons Learned

1. Latency Spikes
If transaction logs grow faster than they can be processed, CDC lags. Use monitoring to detect backlog.

2. Re-ordering Issues
Distributed brokers may deliver messages out of order for different keys. Partition wisely.

3. Large Transactions
One huge batch update can produce millions of CDC events, overwhelming consumers.

4. Schema Drift
Lack of governance can cause downstream breakages. Always integrate CDC into your data governance process.


CDC in the Cloud & SaaS Era

Cloud providers now offer managed CDC pipelines:

  • AWS Database Migration Service (DMS) — CDC to S3, Redshift, Kafka.
  • Azure Data Factory — Mapping Data Flows with CDC.
  • Google Datastream — CDC to BigQuery and Cloud Storage.
  • Fivetran / Hevo / Stitch — SaaS CDC connectors.

These services simplify deployment but can hide operational complexity — you still need to handle schema changes, retries, and ordering.


CDC for Event-Driven Microservices

CDC is not just for analytics — it can drive domain events in microservice architectures:

  • Database as the single source of truth.
  • Changes emitted as events for other services.
  • Enables outbox pattern for guaranteed delivery.

Outbox Pattern:
Write to a dedicated outbox table in the same transaction as the business change, then a CDC process reads and publishes from it.
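A minimal sketch of the write side of the outbox pattern, using sqlite3 so it is self-contained (the `orders`/`outbox` table names and columns are illustrative). The key point is that the business row and the outbox row commit in one transaction, so an event exists if and only if the change does; a separate CDC process or poller then publishes unpublished outbox rows:

```python
import sqlite3
import json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id, status):
    with conn:  # one transaction: both rows commit, or neither does
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, status))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("db.orders",
             json.dumps({"op": "c", "after": {"id": order_id, "status": status}})),
        )

place_order(1, "NEW")

# What the publisher side would pick up and relay to the broker
pending = conn.execute(
    "SELECT topic, payload FROM outbox WHERE published = 0"
).fetchall()
```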


The Future of CDC

We are seeing a shift from CDC as an ETL technique to CDC as the backbone of distributed systems:

  • CDC + CQRS — Command Query Responsibility Segregation with event sourcing.
  • CDC + Streaming Joins — Real-time joins across heterogeneous sources.
  • CDC + ML — Real-time feature updates and online inference.

Summary

CDC has evolved from a niche database replication feature into a critical enabler of:

  • Real-time analytics
  • Event-driven microservices
  • Streaming ETL
  • Regulatory compliance

When designed well, CDC pipelines can power systems that are reactive, scalable, and low-latency — while minimizing the strain on source systems.


  • Choose log-based CDC when possible — it’s the most efficient.
  • Always design for idempotency and replayability.
  • Monitor lag and schema changes proactively.
  • Integrate CDC into your governance and architecture strategy.

References

  1. Debezium Documentation — https://debezium.io/documentation/
  2. Oracle GoldenGate Overview — https://www.oracle.com/middleware/technologies/goldengate.html
  3. IBM InfoSphere Data Replication — https://www.ibm.com/products/data-replication
  4. AWS DMS CDC — https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.html
  5. Fowler, M. — Event Sourcing and the Outbox Pattern — https://martinfowler.com/articles/patterns-of-distributed-systems/outbox.html
