Reranking

Definition

Reranking is a second-stage scoring step that takes an initial candidate set retrieved cheaply (first stage) and rescores it using a more expensive, higher-quality model. The reranker sees both the query and document together, enabling richer interaction than first-stage retrievers.

Why Reranking Exists

First-stage retrievers (BM25, bi-encoders) must score millions of documents quickly, so they rely on simple term statistics or on query and document representations computed independently of each other. This limits their expressiveness. A reranker only needs to score roughly 100–1000 candidates, so it can afford deep query-document interaction.

Standard Pipeline

Query
  │
  ▼
First-stage retrieval (BM25 / bi-encoder ANN)   → top-1000 candidates
  │
  ▼
Reranker (cross-encoder or LLM)                 → rescored top-1000
  │
  ▼
Final ranked list (top-10/20 served to user)
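
The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production design: `retrieve_then_rerank`, `term_overlap`, and `phrase_bonus` are hypothetical names, and the two toy scorers merely stand in for a real first stage (BM25 or ANN search) and a real reranker (cross-encoder or LLM).

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage_score: Scorer,
    rerank_score: Scorer,
    candidates_k: int = 1000,
    final_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap score over the whole corpus, keep only the top candidates_k.
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:candidates_k]
    # Stage 2: expensive score, applied only to the documents that survived stage 1.
    rescored = [(doc, rerank_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


def term_overlap(query: str, doc: str) -> float:
    # Toy first stage (assumption): number of distinct shared terms.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_bonus(query: str, doc: str) -> float:
    # Toy reranker (assumption): term overlap plus a bonus for an exact phrase match.
    return term_overlap(query, doc) + (2.0 if query.lower() in doc.lower() else 0.0)
```

Note that the reranker never sees a document the first stage dropped, so `candidates_k` bounds what the final list can contain.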

Reranker Types

Cross-Encoder (Most Common)

A cross-encoder processes the query and document concatenated as a single sequence and produces a relevance score. Because both texts pass through the model together, it sees the full interaction between query and document tokens.

  • Models: cross-encoder/ms-marco-MiniLM-L-6-v2, Cohere Rerank, bge-reranker-*
  • Latency: roughly 50–200 ms for 100 candidates on a GPU, depending on model size and document length
  • Quality: significantly better than bi-encoder for nuanced relevance

LLM-as-Reranker

Use a large language model (LLM as Judge) to score candidates individually or to rank them listwise. Quality is typically higher than a cross-encoder, but the cost is much higher.

  • Pointwise: “Is this document relevant to the query? Yes/No”
  • Listwise: “Rank these 10 documents by relevance”
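
The two prompting styles can be sketched as follows. This assumes a hypothetical `llm` callable that takes a prompt string and returns the model's text reply; the prompt wording, the `[n]` identifier format, and all function names are illustrative choices, not a specific vendor's API.

```python
import re
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: prompt in, raw text reply out


def pointwise_prompt(query: str, doc: str) -> str:
    return (
        f"Query: {query}\nDocument: {doc}\n"
        "Is this document relevant to the query? Answer Yes or No."
    )


def listwise_prompt(query: str, docs: List[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (
        f"Query: {query}\n{numbered}\n"
        "Rank these documents by relevance, most relevant first. "
        "Answer with identifiers only, e.g. [2] > [1] > [3]."
    )


def parse_listwise(answer: str, n_docs: int) -> List[int]:
    # Convert "[2] > [1]" into zero-based indices, skipping duplicates and
    # out-of-range ids, then append any documents the model left out.
    order: List[int] = []
    for match in re.findall(r"\[(\d+)\]", answer):
        idx = int(match) - 1
        if 0 <= idx < n_docs and idx not in order:
            order.append(idx)
    order += [i for i in range(n_docs) if i not in order]
    return order


def listwise_rerank(query: str, docs: List[str], llm: LLM) -> List[str]:
    answer = llm(listwise_prompt(query, docs))
    return [docs[i] for i in parse_listwise(answer, len(docs))]


def pointwise_rerank(query: str, docs: List[str], llm: LLM) -> List[str]:
    # One LLM call per document; "Yes" documents go first, and the incoming
    # order is preserved within each group.
    relevant = [
        llm(pointwise_prompt(query, d)).strip().lower().startswith("yes")
        for d in docs
    ]
    yes = [d for d, r in zip(docs, relevant) if r]
    no = [d for d, r in zip(docs, relevant) if not r]
    return yes + no
```

The defensive parsing matters in practice because a model may repeat, omit, or invent identifiers; the fallback keeps every candidate in the output rather than silently dropping it.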

ColBERT Late Interaction

ColBERT / Late Interaction sits between bi-encoder speed and cross-encoder quality. It is efficient enough to serve as the first stage in some setups, but it is also used as a reranker.
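
Late interaction scores a pair by matching each query token embedding against its best-matching document token embedding and summing those maxima (the MaxSim operator). The sketch below assumes the per-token embeddings are already computed and normalized; `maxsim_score` and `dot` are hypothetical helper names, and plain lists stand in for tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    # For each query token, take its best similarity to any document token,
    # then sum over query tokens. Raises ValueError if doc_vecs is empty.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Two query tokens, two document tokens, 2-dimensional toy embeddings.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query_vecs, doc_vecs)
```

Document token embeddings can be precomputed offline as with a bi-encoder, while the token-level matching happens at query time, which is why this approach lands between the two in cost.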

Reranking in RAG

In RAG pipelines, reranking is critical: the LLM context window is limited, so only the top 3–5 chunks are included. A reranker narrows 50–100 retrieved chunks down to the best ones before LLM generation.
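
A minimal sketch of that narrowing step, assuming a hypothetical `rerank_score(query, chunk)` function supplied by whichever reranker is in use. The prompt template and the function names `select_context` and `build_prompt` are illustrative only.

```python
from typing import Callable, List


def select_context(
    query: str,
    chunks: List[str],
    rerank_score: Callable[[str, str], float],
    top_k: int = 5,
) -> List[str]:
    # Rescore every retrieved chunk and keep only the best top_k for the prompt.
    ranked = sorted(chunks, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context_chunks: List[str]) -> str:
    # Number the chunks so the generated answer can refer back to them.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```

With 50–100 retrieved chunks and `top_k` set to 3–5, the reranker decides which few chunks the LLM ever sees.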

Tradeoffs

              First-Stage             Reranker
Throughput    Millions of docs/sec    Hundreds of docs/sec
Latency       <10 ms                  50–500 ms
Quality       Good                    Excellent
Cost          Low                     Higher

When Reranking Becomes a System Boundary

From "When Reranking Becomes a System Boundary" (Ravindra Harige):

Retrieval defines eligibility; reranking defines order. If a document is not retrieved, no downstream stage can recover it.

Ranking as a Projection

Reranking does not redo retrieval-time computation (term matches, field contributions, BM25 components). It operates on a compressed, lossy representation of what survived into the candidate set. This is structural, not a failure of implementation.

Compensatory Reranking

The system crosses a boundary when performance gains come from widening the rerank window rather than improving retrieval. At that point the window size is load-bearing (not a latency knob), and reranking has become compensatory.
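
One way to check whether the window has become load-bearing is to measure first-stage recall at several window sizes: if the relevant documents only appear deep in the candidate list, widening the window is doing retrieval's job. The diagnostic below is an assumed illustration, not a method taken from the article; `recall_at_window` and `window_sweep` are hypothetical names and the document ids are made up.

```python
from typing import Dict, List, Set


def recall_at_window(
    first_stage_ranking: List[str], relevant: Set[str], window: int
) -> float:
    # Fraction of relevant documents that the reranker would even get to see.
    if not relevant:
        return 0.0
    return len(set(first_stage_ranking[:window]) & relevant) / len(relevant)


def window_sweep(
    first_stage_ranking: List[str], relevant: Set[str], windows: List[int]
) -> Dict[int, float]:
    return {w: recall_at_window(first_stage_ranking, relevant, w) for w in windows}


# Toy example: both relevant documents sit near the bottom of the first-stage list.
ranking = ["d7", "d2", "d9", "d4", "d1", "d5"]
relevant = {"d1", "d5"}
sweep = window_sweep(ranking, relevant, [2, 5, 6])
```

A sweep in which recall keeps climbing with the window suggests the fix belongs in retrieval rather than in a larger rerank budget.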

Evaluation Split

Stage        Metric       Blindspot
Retrieval    Recall@K     NDCG can improve while recall is weak
Reranking    NDCG, MRR    Metrics improve while user-visible relevance plateaus
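
For reference, the three metrics in the table can be computed per query as sketched below, using their standard definitions. The function names and the `gains` mapping (document id to graded relevance) are assumptions made for illustration; MRR is the mean of `reciprocal_rank` over a query set.

```python
import math
from typing import Dict, List, Set


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    # Share of all relevant documents that appear in the top k.
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    # 1 / position of the first relevant document, or 0 if none is present.
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranking: List[str], gains: Dict[str, float], k: int) -> float:
    # DCG of the actual ranking divided by the DCG of the ideal ordering
    # of all judged documents.
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranking[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Here the ideal ordering includes judged documents that were never retrieved. If NDCG is instead normalized over only the retrieved candidates, a missing relevant document no longer lowers the score, which is one way the retrieval blindspot in the table can arise.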

Retrieval (engineering) and reranking (ML/data science) are owned by different teams with different metrics. Neither dashboard shows the full picture. The gap closes only when someone is accountable for the space between them.