Hybrid Search: Why Pure Vector Lost

The dominant marketing position in 2022 was that vector search would replace keyword search. Lexical retrieval was an artefact of the pre-LLM era. Embeddings captured meaning; BM25 captured surface form; meaning won. By 2024, virtually every serious production retrieval system was hybrid. Pure vector lost. It’s worth understanding why, because the same mistake gets made every few years at a different layer of the stack.

What pure vector got right

Credit where due. Semantic search works for the case it was designed for: paraphrased natural-language queries against natural-language documents. “How do I cancel my subscription” finds a help article titled “Terminating your membership” without anyone hand-curating a synonym dictionary. That was a real step change. The decade of work in lexical retrieval before that had been incremental: better stemmers, better stopword lists, better tokenisers, smarter BM25 parameters. None of it solved the synonym problem. Embeddings did, almost as a side effect.

So when the dedicated vector databases shipped in 2021–22 and the demos were genuinely impressive, the enthusiasm wasn’t silly. The pitch was “semantic search is here, the old way is dead.” First half true. Second half wasn’t.

Where it fell apart

Pure vector search is bad at exact matching. Product codes. Error codes. Account numbers. Version strings. Anything where the surface form is the meaning. Search for ORA-00942 and a pure vector system happily hands you back results about other Oracle errors that share “topic” but not the specific code you needed. The compression argument applies: the embedding has thrown away the surface form, and the surface form was what mattered.

It’s also bad at rare terms. If a word appears once in the corpus, its embedding reflects the model’s prior more than the corpus distribution. BM25 handles both cases trivially because it’s an inverted index over actual terms. Exact lookups hit. Rare terms get scored higher precisely because they’re rare. The old technology kept working in exactly the places the new technology was weakest.
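Both properties are easy to see in a few lines. What follows is a minimal sketch of textbook BM25 over a three-document toy corpus, not any engine's actual implementation; the tokeniser, the corpus and the parameter values (k1=1.2, b=0.75) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep codes such as "ORA-00942" as one token; lowercase everything.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no surface match, no contribution
            # Rare terms (low df) get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "ORA-00942: table or view does not exist",
    "ORA-01017: invalid username or password, logon denied",
    "Common Oracle error codes and how to read them",
]

exact = bm25_scores("ORA-00942", docs)   # term appears in one document
common = bm25_scores("or", docs)         # term appears in two documents
print(exact)
print(common)
```

Only the first document scores above zero for the error code, and the same document scores higher for the rare code than for the common word "or". Nothing in the scoring knows what an Oracle error is; it only knows which strings are where.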

And the new technology was weak in places nobody bothered to advertise. Out-of-domain queries (where the query’s vocabulary or style differs from the text the embedding model was trained on) degrade badly. Multilingual mismatch is a quiet killer. Domain-specific jargon embeds poorly until you fine-tune. Numerical queries (“documents from after 2020 mentioning Q3 revenue”) are not what embeddings are for at all.

How hybrid actually works

The pattern that won is conceptually simple. Run BM25 and vector search in parallel. Each produces a ranked list. Then combine them.

The most common combination method is reciprocal rank fusion, which is so simple you can write it on a napkin:

def rrf(rankings, k=60):
    """
    rankings: list of ranked lists, each a list of doc_ids
    k: smoothing constant (60 is the conventional default)
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

That’s the whole algorithm. No normalisation across incompatible score ranges, no learned weights, and only one knob (k), which most people leave at 60. RRF works embarrassingly well as a baseline. Most teams who go “we need something more sophisticated” later quietly come back to RRF after their learned fusion model becomes a maintenance burden.
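To make the "no normalisation" point concrete, here is a small worked example. The scores are made up, the `to_ranking` helper is hypothetical, and `rrf` is restated so the snippet runs on its own. The point is that only rank positions reach the fusion step, so it does not matter that BM25 scores are unbounded while cosine similarities top out at 1.

```python
def to_ranking(scored):
    """Turn {doc_id: score} into a list of doc_ids, best first."""
    return [doc_id for doc_id, _ in sorted(scored.items(), key=lambda kv: -kv[1])]

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Invented scores on incompatible scales.
bm25 = {"doc_a": 14.2, "doc_b": 9.7, "doc_c": 3.1}
cosine = {"doc_b": 0.83, "doc_d": 0.81, "doc_a": 0.64}

fused = rrf([to_ranking(bm25), to_ranking(cosine)])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")

# Rescaling one retriever's scores changes nothing, because ranks are unchanged.
bm25_scaled = {doc_id: s * 100 for doc_id, s in bm25.items()}
fused_scaled = rrf([to_ranking(bm25_scaled), to_ranking(cosine)])
```

Here doc_b comes out on top (rank 2 and rank 1, so 1/62 + 1/61), just ahead of doc_a (rank 1 and rank 3, so 1/61 + 1/63). Documents that appear in only one list trail well behind.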

The more sophisticated alternative is to pass both candidate sets to a cross-encoder reranker, which reads each (query, document) pair jointly and produces a relevance score for it. That is usually more accurate than RRF alone, but more expensive at query time, because the model runs once per candidate.
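The shape of that reranking step, sketched under assumptions: `rerank` and `toy_overlap_scorer` are hypothetical names, and the scorer is a deliberately crude word-overlap stand-in so the snippet runs without downloading a model. In a real system `score_pair` would wrap a cross-encoder's scoring call.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[Tuple[str, str]],          # (doc_id, document text)
    score_pair: Callable[[str, str], float],    # scores one (query, text) pair
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and keep the best top_n."""
    scored = [(doc_id, score_pair(query, text)) for doc_id, text in candidates]
    scored.sort(key=lambda item: -item[1])
    return scored[:top_n]

def toy_overlap_scorer(query: str, text: str) -> float:
    # Stand-in only: fraction of query words that also appear in the text.
    q_words = set(query.lower().split())
    return len(q_words & set(text.lower().split())) / len(q_words)

candidates = [
    ("kb-1", "terminating your membership"),
    ("kb-2", "how to cancel a subscription from the billing page"),
    ("kb-3", "updating your payment method"),
]
reranked = rerank("cancel my subscription", candidates, toy_overlap_scorer, top_n=2)
print(reranked)
```

The cost is visible in the structure: one scorer call per candidate, which is why rerankers run over a shortlist rather than the whole corpus. The toy scorer also gives kb-1 a zero, which is exactly the synonym blind spot a real cross-encoder is there to fix.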

Where the engines have converged

Every credible search system now does some version of hybrid. The list, briefly:

  • Elasticsearch / OpenSearch – the retriever abstraction in Elasticsearch, and the hybrid query with a search pipeline in OpenSearch, let you compose BM25 and vector queries in one request.
  • Postgres + pgvector – vectors via pgvector, lexical via tsvector, combined in one SQL query.
  • Snowflake – native VECTOR_COSINE_SIMILARITY plus Cortex Search for hybrid.
  • Pinecone – hybrid index mode with sparse-dense fusion.
  • Weaviate – hybrid search built in, configurable alpha for vector/lexical balance.
  • Qdrant – sparse vectors plus dense vectors, fused at query time.
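The alpha-style balance mentioned in the list is easiest to picture as a weighted blend of normalised scores. This is a generic sketch, not any vendor's implementation: the min-max normalisation, the function names and the convention that alpha=1.0 means pure vector are all assumptions made for illustration.

```python
def min_max(scores):
    """Rescale {doc_id: score} into the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def alpha_fusion(vector_scores, lexical_scores, alpha=0.5):
    """Blend two score sets; alpha=1.0 is pure vector, alpha=0.0 is pure lexical."""
    vec, lex = min_max(vector_scores), min_max(lexical_scores)
    fused = {
        # A document missing from one retriever contributes 0 from that side.
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * lex.get(doc_id, 0.0)
        for doc_id in set(vec) | set(lex)
    }
    return sorted(fused.items(), key=lambda kv: -kv[1])

# Invented scores for illustration.
vector_scores = {"doc_a": 0.91, "doc_b": 0.88, "doc_c": 0.60}
lexical_scores = {"doc_b": 11.4, "doc_d": 7.9, "doc_c": 2.3}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, alpha_fusion(vector_scores, lexical_scores, alpha)[0])
```

Unlike RRF, this approach does depend on score magnitudes, so the normalisation step is doing real work and the result is sensitive to outliers in either list.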

The engines that started lexical and bolted on vectors (Elasticsearch, Postgres) carry decades of lexical retrieval maturity. The engines that started pure vector and added BM25 (Pinecone, Weaviate) are still catching up on the lexical side. If you have to choose today, bias toward an engine that takes both retrieval modes seriously.

Why the marketing told a different story

Two reasons. The first is that “keyword search is dead” is a punchier pitch than “we blend two retrieval techniques.” A new category always needs a foil. The second is more uncomfortable: the vendors who built pure vector products had a commercial interest in not telling you their product alone was insufficient. By 2023, the customers who’d shipped pure vector were quietly bolting BM25 back on. The vendors quietly added hybrid modes. The narrative caught up to the engineering about eighteen months later.

This pattern repeats itself. Pure neural beats statistical NLP, until it doesn’t and hybrid wins. Pure ML beats hand-engineered features, until hybrid wins. Pure agentic systems beat handcrafted pipelines, until hybrid wins. The old methods are old because they work. Combining the new with the old is almost always better than replacing one with the other.

The practical configuration

If you’re starting a retrieval system today, the defensible default is:

  1. Index documents in your warehouse with both vector embeddings and lexical (BM25 or full-text).
  2. Run both at query time, retrieve top 50 from each.
  3. Fuse with RRF.
  4. If quality still isn’t there, add a cross-encoder reranker over the fused candidates (at most 100, given 50 from each retriever).
  5. Evaluate on your own data with your own queries. Not MTEB.
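Steps 2 to 4 can be wired together in a few lines. This is a sketch under assumptions: `hybrid_search` and the two toy retrievers are hypothetical, each retriever is taken to be a callable returning ranked doc_ids, and the reranker is an optional callable that reorders a list of doc_ids. Swap in your engine's real lexical and vector queries.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion; returns doc_ids, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

def hybrid_search(query, lexical, vector, reranker=None,
                  per_retriever=50, rerank_depth=100, final_k=10):
    # Step 2: run both retrievers (sequentially here; in production, in parallel).
    lexical_hits = lexical(query, per_retriever)
    vector_hits = vector(query, per_retriever)
    # Step 3: fuse on rank positions only.
    fused = rrf([lexical_hits, vector_hits])
    # Step 4 (optional): rerank the head of the fused list.
    if reranker is not None:
        fused = reranker(query, fused[:rerank_depth]) + fused[rerank_depth:]
    return fused[:final_k]

# Toy retrievers returning canned results, for illustration only.
def toy_lexical(query, k):
    return ["doc_ora_00942", "doc_oracle_faq"][:k]

def toy_vector(query, k):
    return ["doc_ora_01017", "doc_ora_00942", "doc_oracle_faq"][:k]

results = hybrid_search("ORA-00942", toy_lexical, toy_vector, final_k=3)
print(results)
```

In the toy run the exact-match document wins because both retrievers rank it near the top, while the wrong error code that the vector side liked best falls to third.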

That recipe has outperformed every pure-vector setup on every non-toy corpus I’ve worked with. The lexical half also adds very little to the bill, because BM25 is essentially free compared to dense retrieval.
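Don’t take my word for it; measure it. A minimal evaluation harness, assuming you have a handful of real queries labelled with the doc_ids that should come back. The function names, the labelled set and the canned search results below are all hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0 if none was returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(search, labelled_queries, k=10):
    """labelled_queries: list of (query, set_of_relevant_doc_ids)."""
    recalls, rrs = [], []
    for query, relevant in labelled_queries:
        ranked = search(query)
        recalls.append(recall_at_k(ranked, relevant, k))
        rrs.append(reciprocal_rank(ranked, relevant))
    n = len(labelled_queries)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rrs) / n}

# Canned results standing in for a real search function.
canned = {
    "ORA-00942": ["d3", "d1"],
    "cancel subscription": ["d7", "d2", "d9"],
}
labelled = [
    ("ORA-00942", {"d1"}),
    ("cancel subscription", {"d7", "d9"}),
]
metrics = evaluate(lambda q: canned[q], labelled, k=2)
print(metrics)
```

Run the same labelled set through pure vector, pure BM25 and the hybrid pipeline, and compare the numbers. A few dozen queries pulled from your own logs will tell you more than a public leaderboard.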

Pure vector lost. Long live the boring old inverted index, doing the work nobody wants to credit it for.

Next up – the reranker, the bit of the retrieval pipeline nobody talks about that’s doing more of the work than everything else combined.
