Cross-Encoders, ColBERT, and LLM-Based Re-Rankers

Source: https://medium.com/@aimichael/cross-encoders-colbert-and-llm-based-re-rankers-a-practical-guide-a23570d88548
Author: Michael Ryaboy (Developer Advocate, KDB.AI)

Summary

Practical comparison of three reranking approaches for production search systems, with guidance on when to use each.

The Three Approaches

Cross-Encoders

  • Process each query-document pair jointly through a transformer, so query and document tokens attend to each other
  • State-of-the-art accuracy (MRR@10 > 40 on MS MARCO, reported on a 0-100 scale, i.e. above 0.40)
  • Each document requires a full forward pass → high latency, high GPU cost
  • Use when: high-stakes queries, small result sets, quality matters more than speed
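
A minimal sketch of why cost scales with the candidate count: the reranker calls a pairwise scorer once per document. The names rerank_pairs and toy_overlap_score are hypothetical, and the word-overlap scorer is only a stand-in so the example runs without a model; it is not a cross-encoder.

```python
from typing import Callable, List, Tuple


def rerank_pairs(
    query: str,
    docs: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair and return the top_k by score."""
    # One scorer call per candidate: with a real cross-encoder this is
    # one full forward pass per document, so cost is linear in len(docs).
    scored = [(doc, score_pair(query, doc)) for doc in docs]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer (assumption): fraction of query words in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0
```

In a real system, score_pair would wrap a trained model. With the sentence-transformers library, for example, CrossEncoder.predict accepts a list of (query, document) pairs, so batching the pairs is usually preferable to scoring them one at a time.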

ColBERT (Late Interaction)

  • Pre-compute token-level document embeddings offline
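  • At query time: embed the query tokens, then score each document with MaxSim (for every query token, take its maximum similarity over the stored document tokens, then sum)
  • Middle ground: more nuanced than single-vector similarity, cheaper than a cross-encoder
  • Storage: one vector per document token, so the index can reach tens of gigabytes for large catalogs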
  • Use when: large-scale reranking where cross-encoder latency is prohibitive
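
The MaxSim step and the storage cost can be sketched in plain Python. This is an illustration under assumptions, not ColBERT's actual implementation: embeddings are plain lists of floats, similarity is a dot product, and the figures used in the storage estimate (tokens per document, dimensions, bytes per value) are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take the best-matching
    document token, then sum those maxima over the query tokens."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def index_size_gb(num_docs: int, tokens_per_doc: int, dim: int, bytes_per_value: int) -> float:
    """Rough index size: one dim-sized vector is stored per document token."""
    return num_docs * tokens_per_doc * dim * bytes_per_value / 1e9
```

Under these assumed figures, one million documents at 100 tokens each, 128 dimensions, and 2 bytes per value come to about 25.6 GB, which is consistent with the "tens of gigabytes" range noted above.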

LLM-Based Re-Rankers

  • Apply flexible, dynamic criteria (freshness, authority, custom logic) without retraining
  • Individual requests: “might cost a few cents and add over a second of latency”
  • Use when: rare high-value queries, or criteria that change frequently
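
A hedged sketch of how criteria can change without retraining: the criteria are just text in the prompt. Everything here is hypothetical, including the prompt wording, the reply format, and call_llm, which stands for whatever function sends a prompt to a model and returns its text reply.

```python
import re
from typing import Callable, List


def build_prompt(query: str, docs: List[str], criteria: str) -> str:
    """Assemble a listwise reranking prompt; criteria is free text."""
    lines = [f"Query: {query}", f"Ranking criteria: {criteria}", "Passages:"]
    lines += [f"[{i}] {doc}" for i, doc in enumerate(docs, start=1)]
    lines.append("Return the passage numbers from best to worst, e.g. 2 > 1 > 3.")
    return "\n".join(lines)


def parse_ranking(reply: str, num_docs: int) -> List[int]:
    """Turn a reply like '3 > 1' into zero-based indices, defensively."""
    order: List[int] = []
    for match in re.findall(r"\d+", reply):
        idx = int(match) - 1
        # Ignore out-of-range or repeated numbers in the model's reply.
        if 0 <= idx < num_docs and idx not in order:
            order.append(idx)
    # Passages the model left out keep their original relative order.
    order += [i for i in range(num_docs) if i not in order]
    return order


def llm_rerank(
    query: str, docs: List[str], criteria: str, call_llm: Callable[[str], str]
) -> List[str]:
    """Rerank docs with one LLM call (call_llm is supplied by the caller)."""
    reply = call_llm(build_prompt(query, docs, criteria))
    return [docs[i] for i in parse_ranking(reply, len(docs))]
```

Swapping "prefer recent sources" for "prefer official documentation" is then a one-string change. The defensive parsing matters because a model reply may omit, repeat, or invent passage numbers.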

Layered Pipeline Strategy

BM25 / vector retrieval (thousands)
    ↓
ColBERT refinement (hundreds)
    ↓
Cross-encoder or LLM polishing (top 20-50)
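
The funnel above can be sketched as a list of (scorer, keep) stages, each one re-scoring only what the previous stage kept. run_funnel and the stage scorers are hypothetical names; in practice each scorer would wrap BM25 or vector search, ColBERT, and a cross-encoder or LLM respectively.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def run_funnel(query: str, corpus: Sequence[str], stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Apply each stage in order, keeping the top `keep` candidates each time."""
    candidates = list(corpus)
    for scorer, keep in stages:
        # The expensive scorers only ever see what earlier stages kept.
        candidates = sorted(
            candidates, key=lambda doc: scorer(query, doc), reverse=True
        )[:keep]
    return candidates
```

A call such as run_funnel(query, corpus, [(cheap, 1000), (colbert, 200), (cross_encoder, 20)]) would mirror the diagram, with the three scorer names standing in for real components. The cut-offs are tuning parameters: the most expensive model's cost is bounded by the previous stage's keep value, not by the corpus size.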

Success Metrics

Offline ranking metrics (MRR, NDCG, precision@5) alongside business outcomes (user conversion rates, support ticket reduction)
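
The three ranking metrics can be computed from per-query relevance labels listed in ranked order. This is a minimal sketch with two stated assumptions: NDCG uses linear gain (some definitions use 2^rel - 1), and the ideal ordering is taken from the retrieved list only, not from all relevant documents in the corpus.

```python
import math
from typing import Sequence


def reciprocal_rank(relevances: Sequence[float]) -> float:
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[float]]) -> float:
    """Average reciprocal rank over a set of queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs) if runs else 0.0


def precision_at_k(relevances: Sequence[float], k: int) -> float:
    """Share of the top k results that are relevant."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the best ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

For example, the ranked labels [1, 0, 1, 0, 0] give a precision@5 of 0.4 and a reciprocal rank of 1.0, since the first result is relevant.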