Improving Retrieval with LLM-as-a-Judge
Build reusable relevance evaluation datasets for your data using LLMs as judges — calibrated against human preferences, then scaled to larger collections.
Ground truth dataset components
- Query (from production logs — real queries are best)
- Passage/Document (sampled via “pooling” — top-k from multiple retrieval methods)
- Relevance judgment (graded: 0=irrelevant, 1=relevant, 2=highly relevant)
Minimum viable: 25 queries (TREC standard). Single-technique pooling introduces bias.
Evaluation metrics
- P@k: relevant results in top-k / k (ignores position)
- Recall@k: proportion of relevant docs retrieved in top-k
- nDCG@k: grade + position weighted, normalized to 0–1 (preferred for RAG)
LLM-as-judge methodology
1. Calibration
- Create small labeled dataset (26 queries, 90 triplets for search.vespa.ai)
- Run GPT-4o with calibrated prompt on same pairs
- Check confusion matrix correlation
Result: strong agreement, few disagreements >1 level. Good nDCG@10 correlation between human and GPT labels across 9 retrieval methods.
2. Prompt design
- Clear 0–2 scoring scale with definitions
- Two static demonstration examples
- Step-by-step reasoning instructions
- System role emphasizing relevance assessment
3. Scale up
With validated correlation → expand to full collection:
- search.vespa.ai: 386 queries, 6 retrieval methods, 10,372 unique query-passage pairs
- Label distribution: 4,817 irrelevant / 4,642 relevant / 913 highly relevant
Benefits vs. human labeling
- Cost-effective and fast
- Enables rapid iteration
- Scales beyond expert capacity
Key principle
“The main point: create your own dataset, for your own data and retrieval use case.” Generic collections don’t capture domain-specific relevance.
Related Concepts
- LLM as Judge — the methodology explained and demonstrated
- Judgment Lists — the ground truth dataset built with LLM judges
- Search Evaluation — relevance evaluation framework
- NDCG — nDCG@k as the primary evaluation metric
- Precision and Recall — P@k and Recall@k explained
- Dense Vector Retrieval — one of the 6 retrieval methods benchmarked
People
- Jo Kristian Bergum — Vespa; author