Why Time Is Hard

Stream processing has a dirty secret. The hard part isn’t the streaming. It’s the time. Specifically, the difference between event time (when the thing happened) and processing time (when your system saw it), and the bugs that hide in the gap between them. Most stream processing failures I’ve seen trace back to someone treating these as the same thing.

The two times

Event time is the timestamp the event carries. A user clicked the button at 10:42:13. A sensor recorded a temperature at 10:42:17. A transaction was authorised at 10:42:20. These are facts about the world, baked into the event by whatever produced it.

Processing time is when your system saw the event. The click hit your pipeline at 10:42:14, perhaps. The sensor data arrived at 10:42:19. The transaction event was in your stream at 10:42:21. These are facts about your system, baked in by whichever broker, queue, or processor handled them.

For local, low-latency, in-order, no-failure systems, the two are basically the same. For real systems on real networks with real failures and real retries, they diverge constantly. An event might arrive 30 seconds late because a producer retried. Or hours late because a mobile app was offline. Or out of order because a network path changed. Processing time tells you when your system noticed. Event time tells you when reality happened.

Almost every aggregation you want to compute – sales per hour, errors per minute, page views per day – is a question about event time. If you compute it on processing time, the numbers are wrong in subtle ways. The Monday-evening dashboard that shows a dip every time someone’s connection drops. The hourly count that’s actually a count of “hour the event was received,” which is correlated with the load on your ingestion pipeline more than with the world.

Windows

If event time is what matters, you need to group events by event time, not processing time. This is what windows are. Tumbling windows (every event lands in exactly one window). Sliding windows (events can land in multiple overlapping windows). Session windows (a window for every burst of activity from a single user). The classic.

Conceptually simple. Operationally, here’s the problem. When can you say a window is “done”? In batch this is trivial – you process all the data, then you know. In stream you never know for sure. An event for the 10:00–11:00 window could arrive at 11:30, or 12:00, or three days later, and unless you know your data perfectly, you can’t rule it out.

So either you wait forever and never emit a result, or you decide at some point to declare the window closed and move on. That decision is the watermark.

Watermarks

A watermark is a promise from the stream processor: “I do not expect to see any more events with event time before T.” When the watermark advances past the end of a window, the window is closed and the result is emitted.

The watermark is necessarily a heuristic. The processor looks at the event times it’s been seeing, applies some configurable allowance for lateness (5 seconds, 5 minutes, whatever you decide), and decides how far behind “now” the watermark should be. Too aggressive and you’ll close windows before late events arrive, dropping them or routing them to a separate late-data path. Too conservative and your dashboards lag.

Most stream processors let you configure both watermark generation and late-data handling. Flink’s API is the most explicit about this. Spark Structured Streaming’s is less so but the underlying concepts are the same.

Where the bugs live

The classic failure modes:

Processing-time windows where event-time was needed. The default in some tools. Your dashboards look fine until you compare them to authoritative numbers and discover they’re consistently off.
Watermarks too aggressive. Late events are silently dropped. Nobody notices until an external audit. The producer of those late events is usually mobile or batch, both of which generate naturally late data.
Watermarks too conservative. Pipeline lags. Latency budget blown. Pressure to make watermarks more aggressive. Goto first bullet.
Event time not actually in the event. Or it’s there but in the wrong timezone, or in milliseconds when you assumed seconds, or as a string that needs parsing. The hard cases are the ones where the field exists and is wrong.
Clock skew between producers. Events arrive with event times that disagree on what “now” means. NTP isn’t as universal as you’d hope.

What helps

The discipline that catches most of this:

Always use event time for business-meaning windows. Processing time only for operational metrics (lag, throughput).
Make event time a required, validated field in every event schema. Reject events without it at the producer.
Use a single timezone (UTC, always) and a single representation (epoch millis is the safest).
Measure how late your events actually arrive in production. Distribution, not average. The 99th percentile is what your watermark needs to accommodate.
Have an explicit late-data path. Don’t drop them silently; route them somewhere they can be inspected.
Test your pipeline with delayed and out-of-order events. The happy path doesn’t prove anything.

Time is hard. It’s hard in distributed systems generally, and it’s especially hard in stream processing because you’re forced to make decisions before you have all the information. The architectures that work are the ones that take this seriously rather than pretending it’s fine.

Next post – backpressure, the polite word for “we’re overwhelmed,” and what to do when your stream pipeline keeps dying at 3am.