Data Delivery Mechanisms

Every data platform eventually faces the same question:

How do we actually get data from point A to point B?

It’s a deceptively simple question with dozens of possible answers, each with its own trade-offs in terms of complexity, reliability, latency, security, and cost. The mechanism you choose for delivering data shapes everything that comes after:

– how quickly data becomes available

– how hard it is to debug problems

– what failure modes you need to handle

– how much operational overhead you incur.

[Image: a command-line FTP session connected to a server, showing command prompts and user interactions.]

Modern organizations use a patchwork of delivery mechanisms, often accumulated over decades as different teams solved different problems with different technologies. A veritable cottage industry of ETL pipelines.

You might have APIs for real-time data exchange, FTP (1971) and SFTP (1997) for legacy batch transfers, message queues for event streaming, database replication for keeping systems synchronized, and file sharing for ad-hoc data distribution. Each mechanism made sense in its original context, but the resulting complexity creates integration challenges and operational burden.

APIs: Real-Time Request-Response

APIs, particularly REST APIs, have become the default mechanism for delivering data in request-response patterns. A client needs specific data, makes a request to an API endpoint, and receives a response containing that data. This synchronous model aligns naturally with how applications work and how developers think about data access.

The appeal of APIs is their immediacy and simplicity from the client perspective. You need a customer record, you call GET /customers/123, and you get back JSON with customer data. This real-time access means data is always current, never stale. You’re querying the source of truth directly rather than working with copies that might be outdated.

APIs excel for transactional patterns where small amounts of data are requested frequently. Looking up individual records, validating information, or retrieving specific subsets of data all work naturally as API calls. The granular access pattern means you only transfer data that’s actually needed, minimizing bandwidth and processing overhead.

The challenge with APIs for data delivery at scale is that they’re designed for individual requests, not bulk transfer. If you need to synchronize thousands or millions of records, making individual API calls becomes impractical. Even with batch endpoints that return multiple records per request, the overhead of HTTP requests, authentication, and response processing limits throughput compared to bulk transfer mechanisms. Failure handling is a further question: what happens when chunk 6 of 8,976 fails? Do you continue and retry that chunk later, or fail the whole transfer?

Rate limiting and throttling are necessary for API stability, but complicate bulk data delivery. Most APIs limit how many requests you can make per second or minute to prevent abuse and maintain service quality. When you’re trying to transfer large datasets, these limits force you to slow down, space out requests, and implement complex retry logic. What should be a simple data transfer becomes an exercise in working around rate limits.
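
To make the retry problem concrete, here is a minimal sketch of paging through an API with exponential backoff. Everything named here is an assumption for illustration: fetch_page stands in for whatever client call returns one page of records, RateLimited stands in for an HTTP 429 response, and an empty page is assumed to mark the end of the data.

```python
import time
from typing import Callable, Iterator


class RateLimited(Exception):
    """Raised by the (hypothetical) client when the API answers HTTP 429."""


def fetch_all(
    fetch_page: Callable[[int], list],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield records page by page, backing off whenever a page is rate limited."""
    page = 0
    while True:
        for attempt in range(max_retries + 1):
            try:
                records = fetch_page(page)
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up; the caller can resume from this page later
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
        if not records:
            return  # assumption: an empty page marks the end of the dataset
        yield from records
        page += 1
```

Passing sleep in as a parameter keeps the sketch testable without real waiting. A production version would probably also honor any retry hint the server sends and add jitter to the delays.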

Authentication and authorization add overhead to every API request. Even with efficient token-based auth, validating credentials and checking permissions takes time. For high-frequency data access, this overhead matters. Connection pooling and credential caching help, but can’t eliminate the per-request auth cost entirely.

APIs work best for operational data access where applications need current information on demand. They’re less suitable for bulk data synchronization, historical data transfers, or feeding analytical systems that process large volumes. The request-response model that makes APIs convenient for real-time access becomes a limitation for batch-oriented data movement.

GraphQL APIs deserve special mention as an evolution addressing some REST API limitations. GraphQL lets clients specify exactly what data they need in a single request, reducing over-fetching and under-fetching problems. For complex data relationships, GraphQL can be more efficient than making multiple REST calls. However, GraphQL doesn’t fundamentally change the synchronous request-response model or solve the challenges of bulk data transfer.

File Transfer Protocols – The Batch Workhorse

File-based data delivery through protocols like FTP, SFTP, and FTPS remains ubiquitous despite being decades old. The pattern is straightforward: generate a file containing data, transfer it to a destination, and have the recipient process it. This batch-oriented approach handles large volumes efficiently and works reliably even with limited network connectivity.

SFTP, which runs over SSH and is a separate protocol from FTP despite the name, has largely replaced plain FTP for data delivery in security-conscious environments. SFTP encrypts both credentials and data in transit, making it acceptable for sensitive data that FTP wasn’t suitable for. The protocol is mature, well-supported, and understood by operations teams everywhere.

The simplicity of file transfer is both strength and weakness. It’s conceptually simple: files go from one place to another. But that simplicity hides operational complexity. You need mechanisms to detect when files arrive, validate they transferred completely, handle failures and retries, and coordinate processing. File transfer itself is simple; the orchestration around it is where complexity lives.

[Image: screenshot of an SFTP client listing files with their sizes, dates, and permissions, plus a status log showing a directory change. Source: https://www.dart.com/]

File naming conventions become critical in file-based data delivery because names encode metadata about content, timing, and processing requirements. A file named customers_20241120_001.csv tells you it contains customer data from November 20, 2024, and it’s the first file that day. These conventions must be documented and followed consistently or automation breaks down.
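
One way to keep a convention honest is to parse it in code and reject anything that doesn’t match. The sketch below assumes the pattern implied by the example above (dataset, eight-digit date, three-digit sequence, .csv); the pattern and the DeliveryFile type are illustrative, not a standard.

```python
import re
from datetime import date, datetime
from typing import NamedTuple

# Assumed convention: <dataset>_<YYYYMMDD>_<3-digit sequence>.csv
FILENAME_PATTERN = re.compile(
    r"^(?P<dataset>[a-z]+)_(?P<day>\d{8})_(?P<seq>\d{3})\.csv$"
)


class DeliveryFile(NamedTuple):
    dataset: str
    business_date: date
    sequence: int


def parse_delivery_filename(name: str) -> DeliveryFile:
    """Parse a file name, raising ValueError if it breaks the convention."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    # strptime also rejects impossible dates such as 20241340
    day = datetime.strptime(match["day"], "%Y%m%d").date()
    return DeliveryFile(match["dataset"], day, int(match["seq"]))
```

Failing loudly on an unexpected name is usually safer than guessing, because a silently skipped file tends to surface much later as missing data.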

Scheduling and coordination challenges arise because file-based delivery is asynchronous. The sender produces files on their schedule, the recipient processes them on theirs, and these schedules must align somehow. Mismatches lead to delayed processing, missed files, or systems waiting indefinitely for files that aren’t coming.

File size and splitting considerations matter for large datasets. A 100GB file is unwieldy to transfer and process. Splitting into smaller files improves parallelization and failure recovery but complicates coordination. How do you know when all parts have arrived? Do you process parts independently or wait for the complete set? These questions have no universal answer.
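
One common answer to "have all parts arrived?" is a manifest: a small file, sent last, that lists every part and its checksum. The sketch below assumes a hypothetical manifest that has already been loaded into a dict mapping part names to SHA-256 hex digests.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large parts are never loaded fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_or_corrupt(manifest: Dict[str, str], directory: Path) -> List[str]:
    """Return the part names that have not arrived or fail their checksum."""
    problems = []
    for name, expected in sorted(manifest.items()):
        part = directory / name
        if not part.is_file() or sha256_of(part) != expected:
            problems.append(name)
    return problems
```

An empty result means the set is complete and intact, so processing can start. Whether you then process parts independently or as a set is still a design decision the manifest doesn’t make for you.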

Compression is almost always worthwhile for file-based data delivery because network transfer time dominates. Spending CPU time compressing files saves more time in transfer and reduces bandwidth costs. The compression ratio you achieve depends on data characteristics, but 5-10x is common for text data, making compression effort highly worthwhile.
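
If you want to check the ratio for your own data rather than trust a rule of thumb, it takes a few lines to measure. The sample below is synthetic and deliberately repetitive, so treat the number it produces as an illustration of the method, not a benchmark.

```python
import gzip


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Return original size divided by gzip-compressed size."""
    compressed = gzip.compress(raw, compresslevel=level)
    return len(raw) / len(compressed)


# Synthetic CSV with repetitive values; real data will compress differently.
rows = [f"{i},customer_{i % 50},ACTIVE,2024-11-20\n" for i in range(10_000)]
sample = "".join(rows).encode("utf-8")
ratio = compression_ratio(sample)
print(f"gzip ratio for the synthetic sample: {ratio:.1f}x")
```

Running the same function over a representative extract of a real file gives a far better basis for estimating transfer time and bandwidth cost.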

FTPS (FTP over TLS) is the alternative to SFTP: it keeps the FTP protocol and adds TLS encryption. It’s less common than SFTP but appears in environments where FTP infrastructure already exists and adding encryption is easier than switching protocols. FTPS has some technical quirks around active versus passive mode and firewall traversal that make SFTP generally preferable for new implementations.

Message Queues: Event-Driven Delivery

Message queues like Apache Kafka, RabbitMQ, AWS SQS, and Azure Service Bus enable event-driven data delivery where producers publish messages that consumers process asynchronously. This decoupling of producers and consumers provides flexibility that synchronous mechanisms can’t match.

Kafka has become the dominant message queue for data delivery in modern architectures because it provides both message queue semantics and distributed log capabilities. Producers write messages to topics, consumers read from topics, and Kafka persists messages for configurable retention periods. This persistence means consumers can replay historical messages or catch up after downtime.

The publish-subscribe pattern that message queues enable allows multiple consumers to process the same data stream independently. One consumer might feed real-time analytics, another might update a search index, and a third might trigger business workflows. This pattern avoids the fan-out complexity of having producers know about all consumers.

Event streaming through message queues provides near-real-time data delivery with throughput that scales far beyond what APIs can achieve. Kafka clusters can handle millions of messages per second with sub-second latency. This combination of high throughput and low latency makes message queues suitable for use cases that need both qualities.

The trade-off is operational complexity. Running and maintaining message queue infrastructure requires expertise that not all teams have. Kafka in particular has a reputation for operational complexity, though managed services like Confluent Cloud and AWS MSK reduce this burden. The distributed nature of these systems means more components that can fail and more scenarios to handle.

Message ordering guarantees vary by system and configuration. Kafka provides ordering within partitions but not across partitions. RabbitMQ provides ordering in some configurations but not others. Understanding and configuring ordering guarantees correctly is critical for use cases where message sequence matters.
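
Per-partition ordering is usually put to work by choosing a message key: every message with the same key lands in the same partition, so events for one entity stay in sequence. The sketch below simulates that routing in memory. It uses CRC32 purely for illustration (Kafka’s Java client hashes keys with murmur2), and the order keys and payloads are made up.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (key, payload)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. Illustrative hash, not Kafka's own partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events: List[Event], num_partitions: int) -> Dict[int, List[Event]]:
    """Append each event to the partition chosen by its key, preserving arrival order."""
    partitions: Dict[int, List[Event]] = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return dict(partitions)
```

Events for a single key keep their relative order because they share a partition. Events for different keys may interleave differently from how they were produced, which is exactly the cross-partition caveat described above.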

Consumer management and offset tracking require careful handling. Consumers must track which messages they’ve processed to avoid reprocessing or missing messages. Kafka’s consumer group coordination handles this automatically in many cases, but corner cases around rebalancing and failure recovery require understanding.
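
The core idea behind offset tracking can be shown without a broker. The toy below commits an offset only after a whole batch has been processed, which gives at-least-once delivery: a crash mid-batch means that batch is read again on restart. PartitionLog and consume are hypothetical names for this sketch, not part of any Kafka client.

```python
from typing import Callable, List


class PartitionLog:
    """Toy stand-in for one partition plus the committed offset of one consumer group."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = messages
        self.committed = 0  # offset of the next message this group should process

    def read(self, offset: int, max_records: int) -> List[str]:
        return self.messages[offset : offset + max_records]


def consume(log: PartitionLog, handle: Callable[[str], None], batch_size: int = 2) -> None:
    """Process batches and commit only after every message in a batch succeeded."""
    while True:
        batch = log.read(log.committed, batch_size)
        if not batch:
            return  # caught up with the end of the log
        for message in batch:
            handle(message)  # if this raises, the batch stays uncommitted
        log.committed += len(batch)
```

Because an interrupted batch is replayed from its first message, handlers should be idempotent. Committing before processing would flip the trade-off: no duplicates, but messages can be lost on a crash.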

Schema evolution in message-based systems needs attention because producers and consumers evolve independently. Adding fields to messages should be backward compatible. Schema registries such as Confluent Schema Registry help manage evolution by storing schemas and enforcing compatibility rules.

Database Replication: Keeping Systems in Sync

Database replication mechanisms deliver data by keeping multiple database instances synchronized. Changes in the source database propagate to replica databases automatically, providing near-real-time data delivery without application-level integration. This works for keeping read replicas current or feeding data warehouses from operational databases.

Logical replication captures changes from database transaction logs and applies them to other systems. PostgreSQL logical replication, MySQL binlog replication, and Oracle GoldenGate all follow this pattern. The source database doesn’t need to know about replicas; the replication system reads changes from transaction logs and forwards them.

Change Data Capture (CDC) tools like Debezium extend logical replication by streaming database changes to message queues. This enables event-driven architectures where database changes trigger downstream processing. CDC bridges the gap between traditional databases and event streaming architectures.
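
On the consuming side, a CDC stream is a sequence of row-level change events to apply in order. The sketch below follows the general shape of a Debezium change event (an op code plus before and after row images) but strips it down to those three fields. The in-memory dict standing in for a target table and the id key column are assumptions for illustration.

```python
from typing import Any, Dict


def apply_change(table: Dict[Any, dict], event: dict, key: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory copy of a table."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, update: upsert the new row image
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # delete: the last known row image is carried in "before"
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
```

Treating creates and updates as upserts, and ignoring deletes for rows that are already gone, makes replaying the same events harmless. That matters because most delivery paths from a CDC tool are at-least-once.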

The appeal of database replication is that it’s low-touch for applications. You don’t modify application code to publish changes; the database handles it automatically. Every insert, update, and delete is captured and replicated without explicit integration work. This makes replication attractive for feeding analytical systems from operational databases.

The challenge is that database replication works at the level of tables and rows rather than business meaning. Most tools let you choose which tables to replicate (a PostgreSQL publication, for example, names the tables it includes), but filtering by row content or by what a change means to the business is limited and varies by tool. In practice, downstream systems often receive a much broader change stream than they need and must filter it down themselves.

Replication lag is the inevitable delay between changes occurring in the source database and appearing in replicas. This lag might be milliseconds in ideal conditions but can grow to seconds or minutes under load or after failures. Applications using replicas must tolerate this eventual consistency.

Schema changes complicate replication because changes to the source schema must be handled by replication systems and applied to destinations. Some schema changes replicate automatically, others require manual intervention. Planning schema evolution across replicated environments requires coordination.

Conflict resolution becomes necessary in multi-master replication where multiple databases accept writes. When the same record is modified in different databases simultaneously, conflicts must be resolved. Resolution strategies range from last-write-wins to application-specific logic, each with trade-offs.
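
Last-write-wins is the simplest of these strategies to sketch. The version below compares timestamps and falls back to a node identifier on ties, so every replica picks the same winner regardless of the order in which it sees the two writes. The Version type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    updated_at: float  # seconds since epoch, as reported by the writing node
    node_id: str       # tie-breaker so every replica picks the same winner


def last_write_wins(a: Version, b: Version) -> Version:
    """Pick the version with the later timestamp; break ties by node_id."""
    return max(a, b, key=lambda v: (v.updated_at, v.node_id))
```

The trade-off is visible in the code: the losing write is discarded without a trace, and the result is only as trustworthy as the clocks on the writing nodes.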

Object Storage and Data Lakes: Bulk Static Delivery

Cloud object storage like Amazon S3, Azure Blob Storage, and Google Cloud Storage has become a common data delivery mechanism, particularly for large datasets and data lake architectures. Data producers write files to object storage, consumers read them when needed. The storage system mediates delivery without active file transfer.

The advantage of object storage as a delivery mechanism is its simplicity and scalability. Writing data is straightforward: create objects in buckets. Reading is equally simple: fetch objects. The storage system handles durability, availability, and scale without intervention. This hands-off operation is appealing compared to maintaining file servers or coordinating transfers.

Object storage pricing models align well with data delivery because you pay for storage and transfer rather than idle capacity. If you’re using object storage for other purposes, using it for data delivery adds minimal cost. The storage layer you’re already paying for doubles as a delivery mechanism.

Event notifications from object storage enable reactive processing. When new data arrives, the storage system can trigger Lambda functions, send messages to queues, or invoke webhooks. This turns passive storage into an active component of data pipelines without requiring polling or scheduling.

The challenge with object storage as a delivery mechanism is coordination. How do consumers know when new data is available? Do they poll for new objects, rely on notifications, or check manifests? Each approach works but requires implementation and has failure modes. Object storage itself is reliable; coordination around it is where complexity lives.

Data organization in object storage matters significantly because it affects discoverability and performance. Using consistent prefix patterns like year=2024/month=11/day=20/ enables partition pruning in query engines and makes finding data programmatically straightforward. Poor organization makes data delivery through object storage painful.
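
Building keys through one small function is an easy way to keep the prefix pattern consistent across producers. The sketch below generates keys in the year=/month=/day= layout mentioned above; the dataset name and file name are made-up examples.

```python
from datetime import date


def day_prefix(dataset: str, day: date) -> str:
    """Prefix under which all objects for one dataset and day are stored."""
    return f"{dataset}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def object_key(dataset: str, day: date, filename: str) -> str:
    """Full object key for one file within a day partition."""
    return day_prefix(dataset, day) + filename


key = object_key("customers", date(2024, 11, 20), "part-000.parquet")
# key == "customers/year=2024/month=11/day=20/part-000.parquet"
```

Zero-padding the month and day keeps keys sorting in date order when listed, and the same day_prefix can be reused by consumers to list everything delivered for a given day.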

Versioning and lifecycle management features in object storage help manage data delivery workflows. Versioning prevents accidental overwrites, letting you recover previous versions of delivered data. Lifecycle policies can automatically archive or delete old data, managing storage costs without manual intervention.

Access control and encryption in object storage provide security for data delivery. IAM policies control who can write and read data. Encryption at rest and in transit protects sensitive information. These features make object storage acceptable for sensitive data that, with file transfer protocols, might otherwise call for VPNs or dedicated networks.

Direct Database Connections: Shared Access

Sometimes the simplest data delivery mechanism is giving consumers direct access to databases where data lives. This eliminates delivery complexity entirely—there’s no separate delivery mechanism because consumers read data directly from its source. This approach is common for read-heavy analytical workloads.

Read replicas enable shared database access without impacting production workloads. Consumers query replicas that stay synchronized with the primary database through replication. This provides current data without adding load to production systems. Most database platforms support read replicas with minimal configuration.

The simplicity of direct database access is compelling when it works. There’s no separate data pipeline to build, monitor, or maintain. When consumers read from the primary, there’s no synchronization lag because they’re reading from the source of truth, and no storage duplication because there’s only one copy; read replicas give up a little of both in exchange for isolating production. Either way, the operational burden is minimal compared to building explicit delivery mechanisms.

The challenges emerge at scale. Database connections consume resources that limit concurrent users. Query performance can degrade as the number of consumers increases. A slow query from one consumer can impact others. These resource contention issues make direct access less suitable as the number of consumers grows.

Schema coupling creates dependencies where consumers must adapt to schema changes in the source database. When the database owner adds or modifies tables, consumers might break. This coupling makes evolution harder and creates coordination overhead across teams.

Security boundaries become unclear with direct database access. Consumers need credentials and network access to databases, expanding the security perimeter. Fine-grained access control is harder because database permission systems weren’t designed for complex multi-tenant access patterns.

Query engines like Presto, Trino, and Apache Drill provide abstraction layers that enable shared database access without some of the direct access challenges. These engines connect to multiple databases, provide a unified query interface, and can enforce access control. They mediate access rather than providing direct connections.

Webhooks: Event-Driven Push

Webhooks enable push-based data delivery where producers call consumer-provided HTTP endpoints when data is available. This inverts the typical pull-based pattern where consumers request data. Webhooks are popular for event notifications and integrations where real-time delivery matters.

The webhook pattern is simple: when something happens, the producer makes an HTTP POST to a URL the consumer provided, with data about the event in the request body. The consumer processes the webhook request and returns a response indicating success or failure. This synchronous acknowledgment provides immediate feedback.

Webhooks excel for event-driven integrations where latency matters and event volumes are moderate. A payment processor sends webhooks when transactions complete. A CRM sends webhooks when contacts are created or modified. These event-driven patterns enable real-time integration without polling.

The challenges with webhooks appear at scale or when reliability matters. Webhooks are synchronous from the producer’s perspective, so slow or unavailable consumers block the producer. Producers must implement retry logic for failed webhook deliveries, manage timeout policies, and potentially queue webhooks when consumers are down.

Webhook authentication requires careful design because webhooks push data to URLs, which could be spoofed. Consumers must verify webhooks actually came from legitimate producers. This typically involves signature verification using shared secrets or public key cryptography.
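
Shared-secret signature verification fits in a few lines of standard-library Python. This sketch assumes the producer sends a hex-encoded HMAC-SHA256 of the raw request body; the header that carries it and the exact signing scheme differ between providers, so check the provider’s documentation before relying on this shape.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(
        expected.encode("utf-8"), received_signature.encode("utf-8")
    )
```

Two details matter in practice: verify against the raw bytes of the body rather than a re-serialized copy, and use a constant-time comparison such as hmac.compare_digest instead of ==.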

Replay and ordering problems arise because webhooks are independent HTTP requests that can arrive out of order, be duplicated, or be lost. Consumers must handle these scenarios, often by tracking event IDs and timestamps to deduplicate and order events correctly.
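
A minimal sketch of that bookkeeping, assuming each event carries a unique id and a producer-assigned timestamp (both field names are hypothetical):

```python
from typing import Dict, List


def dedupe_and_order(events: List[dict]) -> List[dict]:
    """Drop repeated event IDs, then sort by (timestamp, id)."""
    unique: Dict[str, dict] = {}
    for event in events:
        unique.setdefault(event["id"], event)  # keep the first copy of each ID
    return sorted(unique.values(), key=lambda e: (e["timestamp"], e["id"]))
```

This works on a batch that is already in hand. A long-running consumer would more likely keep a bounded store of recently seen IDs and decide how long to wait for late events, and neither technique can recover an event that was never delivered.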

Webhook discovery and registration processes vary widely. Some systems have a UI for registering webhook endpoints. Others require API calls. Managing webhook configurations across multiple systems becomes operational overhead.

RPC and gRPC: Efficient Service Communication

Remote Procedure Call mechanisms like gRPC provide efficient data delivery for service-to-service communication. RPC makes calling remote functions feel like calling local functions, abstracting away network communication details. gRPC in particular has gained adoption for high-performance service meshes.

gRPC uses Protocol Buffers for efficient binary serialization and HTTP/2 for transport, achieving better performance than REST APIs with JSON. This efficiency matters for high-frequency service communication where serialization and network overhead accumulate.

Streaming support in gRPC enables bidirectional communication and long-lived connections. Clients and servers can stream data to each other concurrently, enabling patterns that are awkward or impossible with request-response APIs. This streaming capability suits use cases like real-time data feeds or progress updates.

Strong typing through Protocol Buffers provides schema validation and code generation. Service definitions specify exactly what data is sent and received. This eliminates ambiguity and enables generating client and server code in multiple languages from the same definition.

The trade-off with gRPC is complexity compared to simple REST APIs. Setting up gRPC requires generating code from proto files, configuring serialization, and managing versioning of proto definitions. The binary protocol makes debugging harder than text-based protocols because you can’t easily inspect messages.

Language support for gRPC is good but not universal. Popular languages have official or community-supported libraries, but less common languages might lack good gRPC support. This can limit where gRPC is practical compared to REST APIs that work everywhere.

Browser support for gRPC is limited, requiring proxies like gRPC-Web that translate between browser-compatible formats and gRPC. This makes gRPC less suitable for client-facing APIs, though it remains excellent for backend service communication.

Data Sharing Platforms: Collaborative Access

Modern data sharing platforms like Snowflake Data Sharing, Delta Sharing, and AWS Data Exchange enable sharing data without copying it. Providers grant access to datasets, consumers query them directly. This emerging pattern simplifies data delivery for multi-party collaborations.

Snowflake’s data sharing lets providers share live data with consumers in different Snowflake accounts. Consumers query shared data as if it were in their own account, but the data physically remains in the provider’s account. This eliminates data copying and synchronization while providing near-real-time access.

Delta Sharing provides an open protocol for sharing data stored in Delta Lake format. It works across cloud platforms and doesn’t require both parties to use the same data platform. This openness makes Delta Sharing more flexible than proprietary sharing mechanisms tied to specific vendors.

The appeal of data sharing platforms is eliminating data movement. Traditional delivery mechanisms copy data from provider to consumer, creating synchronization lag, storage duplication, and version management challenges. Data sharing platforms keep one copy that both parties access.

Access control and governance are built into sharing platforms, providing fine-grained control over who can access what data. Providers can revoke access instantly, audit who accessed what data, and update shared data without consumer coordination.

The limitation is that both parties must have compatible technology stacks. Snowflake sharing requires both sides to use Snowflake. Delta Sharing requires consumers that can read Delta Lake format. This creates adoption friction compared to universal mechanisms like file transfer.

Cost models for data sharing platforms vary. Snowflake charges consumers for compute used to query shared data. Other platforms might charge for data transfer even without copying. Understanding cost implications is important because shared data might be less expensive or more expensive than copying depending on usage patterns.

Choosing the Right Mechanism

Selecting appropriate data delivery mechanisms requires evaluating multiple dimensions simultaneously. There’s no decision tree that produces the right answer, but understanding key trade-offs helps narrow options.

Latency requirements heavily influence mechanism choice. Real-time needs point toward APIs, message queues, or database replication. Batch delivery suits file transfer or object storage. Attempting to use batch mechanisms for real-time needs or real-time mechanisms for bulk batch transfer leads to poor outcomes.

Data volume and frequency interact with latency. Small volumes at high frequency suit APIs or RPC. Large volumes at low frequency suit file transfer or object storage. Large volumes at high frequency require message queues or replication. Volume and frequency together constrain viable mechanisms.

System capabilities and constraints matter significantly. If one system can only produce files, you’re probably using file transfer regardless of other factors. If consumers can’t accept pushed data, pulling mechanisms are required. Working within system limitations is often more practical than choosing ideal mechanisms that aren’t feasible.

Organizational factors like existing infrastructure and team skills influence practical choices. Using mechanisms you already operate and understand reduces risk compared to introducing new technology. The theoretically optimal mechanism that requires learning new tools and building new infrastructure might not be worth it.

Security and compliance requirements can mandate certain mechanisms or rule out others. Highly sensitive data might require encryption and dedicated networks that favor file transfer over APIs. Audit requirements might favor mechanisms with built-in logging over those requiring custom audit solutions.

Operational complexity is a real cost that’s easy to underestimate. Simple mechanisms that you can operate reliably often beat sophisticated mechanisms that introduce failure modes and operational burden. The best mechanism is one that works reliably with your team’s capabilities and available time.

Hybrid Approaches – Combining Mechanisms

Real-world data platforms rarely use a single delivery mechanism. Different use cases have different requirements, and mixing mechanisms appropriately provides flexibility while managing complexity.

Lambda architectures explicitly combine batch and streaming mechanisms. Batch processing handles historical data through bulk transfer mechanisms. Streaming handles real-time data through message queues. Results merge to provide complete views. This pattern acknowledges that different delivery mechanisms suit different time horizons.

Gateway patterns abstract underlying delivery mechanisms behind consistent interfaces. Consumers interact with a gateway that handles communication with various backends using appropriate mechanisms. This insulates consumers from delivery mechanism complexity while enabling optimized delivery per backend.

Hybrid cloud patterns use different mechanisms for different cloud environments. Data might replicate between clouds using object storage but be delivered within clouds using cloud-native services. This matches mechanism capabilities to network characteristics.

The challenge with hybrid approaches is managing multiple mechanisms’ operational complexity. Each mechanism requires monitoring, has failure modes, and needs operational procedures. Adding mechanisms adds overhead, so combining mechanisms should provide clear value rather than happening accidentally through different teams making independent choices.

The Future – Where Data Delivery Is Headed

Data delivery mechanisms continue evolving as technologies mature and requirements change. Several trends are shaping future delivery patterns.

Streaming is increasingly the default for new architectures. The benefits of real-time data delivery and event-driven systems are well understood, and streaming infrastructure has matured. More workloads that traditionally used batch delivery are moving to streaming where the operational complexity is justified. What used to be a complex and expensive solution now comes with brilliant tooling and elegant solutions, which follow on from the work of Jay Kreps and the Kappa Architecture.

Data sharing platforms will likely grow in importance as they mature and standardize. The ability to share data without copying reduces operational burden and simplifies multi-party collaboration. Overcoming the limitations around technology lock-in will accelerate adoption.

Edge computing introduces new delivery patterns where data must move to and from edge locations with intermittent connectivity and limited bandwidth. Mechanisms optimized for edge delivery, with robust offline handling and efficient synchronization, will emerge.

Real-time analytics is pushing delivery mechanisms toward lower latency with higher throughput. The traditional separation between operational and analytical systems is blurring as analytical workloads expect real-time data. Delivery mechanisms that support both transactional and analytical access efficiently will become more important.

Summary

Data delivery mechanisms are the invisible infrastructure enabling data platforms. Choosing appropriate mechanisms requires understanding trade-offs between latency, throughput, reliability, complexity, and cost. No universal best mechanism exists; the right choice depends entirely on specific requirements.

APIs work for real-time request-response. File transfer handles batch workloads. Message queues enable event streaming. Database replication keeps systems synchronized. Object storage provides scalable bulk delivery. Direct database access eliminates delivery complexity. Webhooks enable push-based integration. RPC provides efficient service communication. Data sharing platforms eliminate copying.

Understanding the full landscape enables intentional choices rather than defaulting to familiar patterns. The mechanism you choose shapes system architecture, operational burden, and what’s possible. These choices matter more than they might seem because delivery mechanisms are difficult to change after systems depend on them.

The best data platforms use multiple mechanisms appropriately, matching mechanism to use case. This requires accepting some complexity in exchange for optimized delivery per requirement. The alternative, forcing all delivery through one mechanism, creates either performance problems or unnecessary complexity.

Data delivery is infrastructure that should fade into the background once implemented correctly. When delivery mechanisms work reliably with appropriate performance characteristics, teams can focus on what to do with data rather than how to move it. That’s the goal: making data delivery reliable enough to be boring.
