What’s a Reranker?

Everyone talks about the embedding model. Hardly anyone talks about the reranker. This is backwards, and the consequences are visible in any retrieval system that’s ever shipped to real users. In most pipelines that reach acceptable quality, the reranker does more of the actual ranking work than the embedding step. Yet teams routinely spend months evaluating embedding APIs and then leave the reranker as a TODO.

What a reranker actually is

A reranker is a second-pass model that takes a candidate set from your first-pass retrieval and re-scores it by reading the query and each candidate together. The first pass returns the top 100 candidates cheaply, using a model that has never seen the query and the document in the same forward pass. The reranker then reads each (query, document) pair properly, end to end, and produces a relevance score for that specific pair.

The technical distinction is bi-encoder versus cross-encoder. Embedding models are bi-encoders: query and document are encoded independently, then compared with a vector similarity such as cosine. Fast at query time, because document embeddings are precomputed, but the model never gets to see both inputs at once. Rerankers are cross-encoders: query and document are concatenated and passed through a transformer that produces a single relevance score. Slow per pair, because you can’t precompute anything, but typically far more accurate at judging relevance.
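To make the shape of that difference concrete, here is a toy sketch in plain Python. Nothing in it is a real model: `encode` is a bag-of-words stand-in for an embedding model, and `pair_score` is a bigram-overlap stand-in for a cross-encoder. Both names are invented for illustration. What it does show is the interface: one side encodes each text alone and can precompute document vectors, the other has to be called once per (query, document) pair but gets to look at both texts together.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


# --- Bi-encoder shape: every text is encoded on its own ---
def encode(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(tokens(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# --- Cross-encoder shape: the scorer reads query and document together ---
def pair_score(query, doc):
    """Toy stand-in for a cross-encoder: share of query bigrams found in doc."""
    q, d = tokens(query), tokens(doc)
    q_bigrams = list(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    if not q_bigrams:
        return 0.0
    return sum(bg in d_bigrams for bg in q_bigrams) / len(q_bigrams)


docs = ["dog bites man", "man bites dog"]
doc_vectors = [encode(d) for d in docs]  # precomputed once, at index time

query = "man bites dog"
q_vec = encode(query)  # one encoding per query
bi_scores = [cosine(q_vec, v) for v in doc_vectors]
cross_scores = [pair_score(query, d) for d in docs]  # one call per pair

print(bi_scores)     # the two documents tie: word order is invisible here
print(cross_scores)  # the pairwise scorer separates them
```

The toy bi-encoder cannot tell the two documents apart because each text was summarised before the query was known. The pairwise scorer can, but only by doing fresh work for every candidate, which is exactly why it sits behind a shortlist.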

The trade-off is the entire story: bi-encoders scale to a billion documents at query time and are mediocre at ranking; cross-encoders are excellent at ranking but only affordable on the hundred or so candidates your first pass shortlisted. The architecture writes itself.

Why this matters more than the embedding

Here’s the counter-intuitive bit. The first-pass retrieval only needs to be good enough to put the right answer somewhere in the top 100. That’s a much easier problem than getting it to position 1. Almost any half-decent embedding model paired with BM25 will manage this on a normal corpus. Recall@100 of 95%+ is the baseline, not the achievement.

The reranker then does the actual ranking work – sorting that top 100 into the top 5 you’ll actually feed to the LLM, or surface to the user. This is where precision is won or lost.

In practical evaluation, the spread between embedding models in recall@100 is small – usually a couple of percentage points between the top contenders. The spread between rerankers in recall@5 or precision@3 is large – often 15–20 percentage points. Switching from OpenAI to Cohere embeddings might move your benchmark by 2 points. Adding a cross-encoder reranker after the same retrieval typically moves it by close to an order of magnitude more. This is not subtle.
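If you want to measure that spread on your own eval set, the two metrics are short enough to write by hand. A minimal sketch, assuming each eval query comes with a ranked list of document ids and a set of ids labelled relevant; the ids and rankings below are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / k


# Hypothetical query with one relevant document, "d7".
relevant = {"d7"}
first_pass = ["d3", "d9", "d1", "d4", "d7", "d2"]  # right answer at position 5
reranked = ["d7", "d3", "d1", "d9", "d4", "d2"]    # right answer at position 1

print(recall_at_k(first_pass, relevant, 5))     # retrieval did its job
print(precision_at_k(first_pass, relevant, 3))  # but nothing useful in the top 3
print(precision_at_k(reranked, relevant, 3))    # the same candidates, reordered
```

Averaging these over the whole eval set, once for the first-pass order and once for the reranked order, gives the before-and-after numbers described above.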

The players

Cohere Rerank – the commercial default. Hosted API, multilingual, fast. The easy-on button for most projects.
BGE-Reranker – open-weights, runs locally, multilingual, competitive quality. The pick when you can’t send queries to a third party.
Jina Reranker – another open-weights option, decent multilingual support, smaller footprint than BGE.
Voyage Rerank – commercial API, good benchmarks, less ubiquitous than Cohere.
LLM-as-reranker – using Claude or GPT with a prompt that asks “score these documents for relevance to this query.” Expensive, but good with almost no tuning, and sometimes worth the cost.

For most teams, Cohere’s hosted API is the quickest way to get a reranker running and measured against your eval set.

Cost shape

Rerankers and embeddings have opposite cost profiles. Embeddings are expensive per index (you re-embed when the corpus changes), cheap per query. Rerankers are cheap per index (nothing to precompute), expensive per query (you run the model on every candidate, every time).

That shape determines architecture. If your corpus changes daily, embedding cost matters. If you have ten million queries per day, reranker cost matters. Most production systems sit somewhere in the middle, and the operational answer is usually: rerank the top 100 with a fast cross-encoder, accept the per-query cost, and treat it as part of the cost of serving good results.

A back-of-envelope: reranking 100 candidates with Cohere Rerank costs roughly $0.001 per query at current pricing. At a million queries a month, that’s $1,000. Not nothing. Not catastrophic either. Compare that with the cost of mediocre results forcing you to stuff the top 50 candidates into a longer LLM context: the LLM tokens will dwarf the reranker bill on most workloads.
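The same arithmetic as a few lines of Python, so you can plug in your own numbers. The reranker figures come from the paragraph above. The LLM side is pure assumption: 500 tokens per candidate and $3 per million input tokens are placeholder values chosen for illustration, not a quote from any vendor.

```python
# From the text above.
RERANK_COST_PER_QUERY = 0.001  # USD, reranking 100 candidates
QUERIES_PER_MONTH = 1_000_000

# ASSUMPTIONS, for illustration only: substitute your own measurements.
TOKENS_PER_CANDIDATE = 500            # hypothetical average chunk size
LLM_PRICE_PER_MILLION_TOKENS = 3.00   # hypothetical USD input-token price


def monthly_rerank_cost(queries, cost_per_query):
    return queries * cost_per_query


def monthly_extra_llm_cost(queries, extra_candidates, tokens_per_candidate,
                           price_per_million_tokens):
    """Cost of the additional context tokens when you skip the reranker."""
    extra_tokens = queries * extra_candidates * tokens_per_candidate
    return extra_tokens / 1_000_000 * price_per_million_tokens


rerank_bill = monthly_rerank_cost(QUERIES_PER_MONTH, RERANK_COST_PER_QUERY)

# Feeding 50 candidates instead of 5 means 45 extra candidates per query.
llm_bill = monthly_extra_llm_cost(
    QUERIES_PER_MONTH, 50 - 5, TOKENS_PER_CANDIDATE, LLM_PRICE_PER_MILLION_TOKENS
)

print(f"reranker:         ${rerank_bill:,.0f} per month")
print(f"extra LLM tokens: ${llm_bill:,.0f} per month")
```

Under those placeholder values the extra context costs far more than the reranker. Your ratio will differ, but the structure of the comparison is the useful part: the reranker is a flat per-query fee, while context stuffing scales with candidates times tokens.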

Where it fits in the pipeline

The canonical pipeline that works in 2026:

			
query
  ->  expand / rewrite (optional)
  ->  BM25 retrieval         --]  fuse with RRF
  ->  vector retrieval       --]  top ~100 candidates
  ->  cross-encoder rerank   -->  top 5
  ->  LLM (or user)

		

Each stage trims aggressively. Retrieval over millions of documents narrows to 100. Reranker narrows 100 to 5. LLM reads 5. The funnel shape is what makes it cheap to run end-to-end. Without the reranker, you have to either feed the LLM 50 candidates and pay LLM token costs, or feed it 5 and risk that the right answer wasn’t in your top 5. Either way you lose.
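The fuse-then-rerank part of that funnel fits in a few lines. A minimal sketch, assuming both retrievers hand back ranked lists of document ids. The function names, the sample documents and the word-overlap scorer are all made up for illustration; in a real system `score_pair` would be a call to a cross-encoder. The constant `k=60` is the value commonly used for reciprocal rank fusion, and it is an assumption here, since nothing above names one.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=100):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


def rerank(query, doc_ids, score_pair, top_k=5):
    """Score every (query, document) pair and keep the best top_k."""
    scored = [(score_pair(query, doc_id), doc_id) for doc_id in doc_ids]
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep fused order
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical corpus and first-pass results.
texts = {
    "a": "billing and invoices",
    "b": "how to reset your password",
    "c": "password policy",
    "d": "office hours",
}
bm25_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]


def overlap_score(query, doc_id):
    """Placeholder for a cross-encoder: counts shared words."""
    return len(set(query.split()) & set(texts[doc_id].split()))


candidates = rrf_fuse([bm25_hits, vector_hits])
top = rerank("reset password", candidates, overlap_score, top_k=2)

print(candidates)
print(top)
```

Note that the fusion step only looks at rank positions, never at the raw BM25 or cosine scores, which is why it can combine two retrievers whose scores live on different scales.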

The thing to actually do this week

If you have a retrieval system that returns the right answer in the top 20 but rarely in the top 3, you don’t need a better embedding model. You need a reranker. Cohere’s API takes an afternoon to wire in. The improvement on your eval set will be larger than anything else you can do in a week, including swapping embedding models, fine-tuning, or upgrading the LLM at the end of the pipeline.

The reranker is the bit nobody talks about because it doesn’t look exciting on a slide. It’s just a transformer reading a pair of texts and outputting a number. There’s no leaderboard drama, no contentious benchmark, no new architecture every quarter. Which is also exactly why it works.

That’s December done. January takes the same thinking up a level – RAG as an architecture, not a feature. What the vendors quietly omit when they ship you a chatbot.