Cross-Encoders, ColBERT, and LLM-Based Re-Rankers

Source: https://medium.com/@aimichael/cross-encoders-colbert-and-llm-based-re-rankers-a-practical-guide-a23570d88548
Author: Michael Ryaboy (Developer Advocate, KDB.AI)

Summary

Practical comparison of three reranking approaches for production search systems, with guidance on when to use each.

ColBERT

Offline: pre-compute token-level document embeddings
Query time: compute query token embeddings, then score them with MaxSim against the stored document token embeddings
Middle ground: more nuanced than single-vector similarity, cheaper than a cross-encoder
Storage: can reach tens of gigabytes for large catalogs
Use when: large-scale reranking where cross-encoder latency is prohibitive
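
The MaxSim step can be illustrated with a small sketch. It assumes the usual late-interaction definition: for each query token, take the highest cosine similarity to any document token, then sum those maxima over the query tokens. The helper names (`maxsim_score`, `rerank`) and the two-dimensional toy vectors are hypothetical; a real system would take its embeddings from a ColBERT model and a vector store.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_tokens:
        return 0.0
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rerank(query_tokens, doc_index):
    """Score every stored document and return (doc_id, score), best first."""
    scored = [(doc_id, maxsim_score(query_tokens, toks))
              for doc_id, toks in doc_index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy "pre-computed" index: document id -> list of token embeddings.
doc_index = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # has a close match for each query token
    "doc_b": [[1.0, 1.0]],              # one token, a partial match for both
}
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
ranking = rerank(query_tokens, doc_index)
```

In this toy index `doc_a` ranks first, because each query token finds an exact match among its tokens. The document embeddings are computed once and stored, which is where the storage cost noted above comes from.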

LLM-Based Re-Rankers

Strength: apply flexible, dynamic criteria (freshness, authority, custom logic) without retraining
Cost: an individual request “might cost a few cents and add over a second of latency”
Use when: rare high-value queries, or criteria that change frequently
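
One way to picture "dynamic criteria without retraining" is that the criteria are plain text in the prompt. The sketch below is an assumption-laden illustration, not the article's implementation: `llm` is a hypothetical caller-supplied function that takes a prompt string and returns the model's text reply, the prompt wording is invented, and `fake_llm` is a stand-in so the example runs without any API.

```python
import re


def build_prompt(query, docs, criteria):
    """Assemble a ranking prompt; the criteria are text, so they can change per request."""
    lines = [
        "Rank the documents below for the query, best first.",
        f"Query: {query}",
        "Criteria: " + "; ".join(criteria),
        "Documents:",
    ]
    lines += [f"[{i}] {doc}" for i, doc in enumerate(docs)]
    lines.append("Answer with document numbers only, comma-separated.")
    return "\n".join(lines)


def parse_ranking(reply, n_docs):
    """Extract valid, non-repeated document indices from the model's reply."""
    seen, order = set(), []
    for match in re.findall(r"\d+", reply):
        i = int(match)
        if i < n_docs and i not in seen:
            seen.add(i)
            order.append(i)
    # Anything the model left out keeps its original relative order at the end.
    order += [i for i in range(n_docs) if i not in seen]
    return order


def llm_rerank(query, docs, criteria, llm):
    """Rerank docs using `llm`, a hypothetical callable: prompt str -> reply str."""
    reply = llm(build_prompt(query, docs, criteria))
    return [docs[i] for i in parse_ranking(reply, len(docs))]


def fake_llm(prompt):
    # Stand-in for a real model call; always prefers document 2, then 0.
    return "2, 0"


docs = ["2019 blog post", "forum thread", "2024 official docs"]
result = llm_rerank(
    "how to configure X",
    docs,
    ["prefer recent content", "prefer official sources"],
    fake_llm,
)
```

Parsing defensively matters here because the reply is free text: out-of-range or repeated indices are ignored, and omitted documents are appended rather than dropped.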

Multi-Stage Pipeline

BM25 / vector retrieval (thousands)
    ↓
ColBERT refinement (hundreds)
    ↓
Cross-encoder or LLM polishing (top 20-50)
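
The funnel above can be sketched as a generic cascade in which each stage rescores the survivors of the previous one and keeps fewer of them. Everything here is a toy under stated assumptions: `cascade`, `cheap_score`, and `precise_score` are hypothetical names, and the two scorers are simple word-overlap stand-ins for a real retriever and a real cross-encoder or LLM.

```python
def cascade(query, candidates, stages):
    """Run candidates through (score_fn, keep) stages, cheapest first."""
    for score_fn, keep in stages:
        ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
        candidates = ranked[:keep]  # only the survivors reach the costlier stage
    return candidates


def cheap_score(query, doc):
    # Stand-in for first-stage retrieval: number of shared words.
    return len(set(query.split()) & set(doc.split()))


def precise_score(query, doc):
    # Stand-in for an expensive reranker: rewards an exact phrase match.
    return cheap_score(query, doc) + (2 if query in doc else 0)


docs = [
    "red shoes for running",
    "running shoes red",
    "blue hat",
    "red running shoes on sale",
]
top = cascade("red running shoes", docs, [(cheap_score, 3), (precise_score, 1)])
```

The point of the design is that the expensive scorer only ever sees the shortlist, so its per-document cost is paid for tens of candidates rather than thousands.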

Evaluation Metrics

Offline ranking metrics: MRR, NDCG, precision@5
Business outcomes: user conversion rates, support ticket reduction
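
The three ranking metrics can be computed from per-query relevance labels listed in ranked order. This is a minimal sketch under assumptions the source does not specify: relevance is a non-negative grade (0 means not relevant), NDCG uses linear gain with a log2 discount, and the ideal ordering is taken from the labels of the returned list only.

```python
import math


def precision_at_k(relevances, k=5):
    """Fraction of the top k results that are relevant."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, r in enumerate(relevances, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over several queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)


def dcg(relevances, k):
    """Discounted cumulative gain over the top k (linear gain, log2 discount)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=5):
    """DCG divided by the DCG of the same labels in ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in ranked order, one list per query (toy data).
runs = [
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
mrr = mean_reciprocal_rank(runs)
p5 = precision_at_k(runs[0], k=5)
ndcg = ndcg_at_k(runs[0], k=5)
```

Conversion rates and support ticket counts come from product analytics rather than from labeled rankings, so they are not covered by this sketch.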