Unity, Polaris, Gravitino, Nessie. The catalog wars nobody outside the industry knows about, and why the strategic asset is the metadata, not the table.

Lineage is one of those things every data team agrees they should have and almost nobody has properly. Why column-level lineage is harder than the demos suggest.

Data contracts had a hype cycle. The good ideas survived; the silver-bullet pitch didn't. What's left, and what the term still obscures.

Backpressure is the polite word for 'we're overwhelmed.' What the four strategies are, why Kafka doesn't really do it, and what works in production.

Event time vs processing time, the watermark heuristic, and where the bugs hide. The hard part of stream processing isn't the streaming.

Change data capture is quietly eating batch ETL. What CDC actually does, what Debezium got right, and the architectures it unlocks.

Exactly-once is the messaging guarantee everyone wants and almost nobody has. What it actually means, and why your idempotence is probably wrong.

Everyone talks about the embedding model. Almost nobody talks about the reranker. In any retrieval system that gets to acceptable quality, the reranker is doing more of the work.

The 2022 pitch was that vectors would replace keyword search. By 2024 every serious system was hybrid. Why the old technique refused to die.

The word "semantic" is heavily used, and often misused, in discussions of data models. It is an adjective relating to meaning in language or logic. When we think…

How you split a document before embedding determines what can ever be retrieved. Most RAG failures aren't model failures — they're chunking failures.

An embedding isn't meaning. It's a lossy compression with a similarity-preserving objective. Once you see it that way, every weird vector search behaviour makes sense.