LLM as Judge
Definition
Using a large language model (LLM) to evaluate the relevance of search results — as a cheaper, faster alternative to human annotators for creating Judgment Lists or running automated quality checks.
Why LLM Judges?
Human annotation for Search Evaluation is expensive:
- Typical cost: $5.00 per (query, document) judgment
- For 1,000 queries × 10 candidates each: 10,000 judgments, or about $50,000
- Turnaround time: days to weeks
LLM judgment:
- Cost: ~$0.01 per judgment (100–1000x cheaper)
- Turnaround: minutes
- Scalable: run continuously as part of CI/CD
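The arithmetic behind this comparison can be sketched in a few lines. The per-judgment prices are the illustrative figures from this note (reading them as US dollars is an assumption), and `annotation_cost` is a hypothetical helper, not part of any library.

```python
def annotation_cost(n_queries: int, candidates_per_query: int, cost_per_judgment: float) -> float:
    """Total cost of judging every (query, document) pair exactly once."""
    return n_queries * candidates_per_query * cost_per_judgment

# Illustrative prices taken from this note (assumptions, not vendor quotes).
HUMAN_COST = 5.00
LLM_COST = 0.01

human_total = annotation_cost(1_000, 10, HUMAN_COST)
llm_total = annotation_cost(1_000, 10, LLM_COST)
ratio = human_total / llm_total

print(f"human: ${human_total:,.2f}  llm: ${llm_total:,.2f}  ratio: {ratio:.0f}x")
```

Running several LLM calls per judgment, as is often done to reduce noise, multiplies the LLM total by the number of calls.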
How It Works
Point-wise Scoring
Rate each (query, document) pair independently:
def llm_judge_relevance(query, document, llm):
    # Ask for one graded relevance label (0-3) for a single (query, document) pair.
    prompt = f"""Rate the relevance of this document to the query on a scale of 0-3:
Query: {query}
Document: {document}
0 = Not relevant
1 = Marginally relevant
2 = Relevant
3 = Highly relevant
Return only the integer grade."""
    # Strip whitespace so a reply such as "2\n" still parses as an integer.
    return int(llm.generate(prompt).strip())

Pairwise Comparison (Stronger Signal)
Ask the LLM which of two documents is more relevant:
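def llm_compare(query, doc_a, doc_b, llm):
    # Ask which of two candidate documents better matches the query.
    prompt = f"""Which document is more relevant to this query?
Query: {query}
Document A: {doc_a}
Document B: {doc_b}
Answer: A or B"""
    # Strip whitespace so callers can compare the reply against "A" or "B".
    return llm.generate(prompt).strip()

Pairwise comparison is generally more reliable than absolute scoring.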
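Pairwise verdicts still have to be turned into an ordering before they can feed a ranking metric. One simple option, sketched here as an assumption rather than a recommended method, is to count head-to-head wins. `rank_by_pairwise_wins` and `overlap_judge` are hypothetical names; `overlap_judge` is a toy stand-in for an LLM call so the example runs offline.

```python
from itertools import combinations

def rank_by_pairwise_wins(query, docs, compare):
    """Order docs by how many head-to-head comparisons each one wins.

    `compare(query, doc_a, doc_b)` is a hypothetical callable returning
    "A" or "B"; in practice it would wrap an LLM call.
    """
    wins = [0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        verdict = compare(query, docs[i], docs[j]).strip().upper()
        if verdict == "A":
            wins[i] += 1
        elif verdict == "B":
            wins[j] += 1
        # Any other reply is treated as a tie and awards no win.
    # Sort by wins, descending; sorted() is stable, so ties keep input order.
    order = sorted(range(len(docs)), key=lambda i: -wins[i])
    return [docs[i] for i in order]

def overlap_judge(query, doc_a, doc_b):
    """Toy judge: prefers the document sharing more words with the query."""
    q = set(query.lower().split())
    score_a = len(q & set(doc_a.lower().split()))
    score_b = len(q & set(doc_b.lower().split()))
    return "A" if score_a >= score_b else "B"

docs = [
    "cats and dogs",
    "how to train a puppy",
    "puppy training tips for a new puppy owner",
]
ranked = rank_by_pairwise_wins("train a puppy", docs, overlap_judge)
print(ranked)
```

This needs one comparison per pair, so the number of LLM calls grows quadratically with the candidate list; for long lists, comparing only neighbouring candidates or sampling pairs keeps the cost down.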
Nugget-Based Evaluation
The LLM first identifies “answer nuggets” (key facts needed to answer the query), then checks which retrieved documents contain them.
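A minimal sketch of the two steps, assuming an `llm` object with a `generate(prompt)` method like the one used in the snippets above. The prompts, the function names, and `KeywordStubLLM` are all hypothetical; the stub replaces a real model with substring matching so the example runs offline.

```python
def extract_nuggets(query, llm):
    """Step 1: ask the LLM for the key facts needed to answer the query, one per line."""
    prompt = (
        "List the key facts needed to answer this query, one per line.\n"
        f"Query: {query}"
    )
    reply = llm.generate(prompt)
    # Drop bullet characters and blank lines from the reply.
    cleaned = (line.strip("-* \t") for line in reply.splitlines())
    return [line for line in cleaned if line]

def nugget_coverage(nuggets, document, llm):
    """Step 2: fraction of nuggets that the LLM says the document states."""
    if not nuggets:
        return 0.0
    hits = 0
    for nugget in nuggets:
        prompt = (
            "Does the document state this fact? Answer yes or no.\n"
            f"Fact: {nugget}\nDocument: {document}"
        )
        if llm.generate(prompt).strip().lower().startswith("yes"):
            hits += 1
    return hits / len(nuggets)

class KeywordStubLLM:
    """Toy stand-in for a real model (an assumption for this example only)."""
    def generate(self, prompt):
        if prompt.startswith("List the key facts"):
            return "- Paris is the capital of France\n- Paris is on the Seine"
        fact = prompt.split("Fact: ")[1].split("\n")[0]
        document = prompt.split("Document: ")[1]
        return "yes" if fact.lower() in document.lower() else "no"

llm = KeywordStubLLM()
nuggets = extract_nuggets("What is the capital of France?", llm)
coverage = nugget_coverage(nuggets, "Paris is the capital of France.", llm)
print(nuggets, coverage)
```

The coverage fraction can be used directly as a graded relevance score, or thresholded into a binary label.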
Vespa’s LLM Judge for Retrieval
Vespa’s blog post demonstrates using an LLM to evaluate first-stage retrieval quality:
- For each query, generate an “ideal answer” with the LLM
- Judge retrieved documents: does this document contain information needed for the ideal answer?
- Aggregate the judgments into an approximate system-level NDCG
Key result: LLM judgments correlate well (Spearman’s ρ ≈ 0.85–0.90) with human judgments for factual queries.
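The aggregation step can be sketched as follows. This is a generic NDCG computation over LLM-assigned grades, not code from the Vespa post; the exponential gain `2^grade - 1` is one common convention, and the grades shown are made up for illustration.

```python
import math

def dcg(grades):
    """Discounted cumulative gain using the exponential gain 2^grade - 1."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(grades_in_ranked_order, k=10):
    """NDCG@k for one query, given grades in the order the system returned the documents.

    The ideal ordering is built only from the judged documents, so relevant
    documents that were never retrieved are ignored; the result is an
    approximation of true NDCG.
    """
    ideal = sorted(grades_in_ranked_order, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(grades_in_ranked_order[:k]) / ideal_dcg

# Hypothetical LLM grades (0-3) per query, listed in ranked order.
llm_grades = {
    "q1": [3, 2, 0, 1],
    "q2": [0, 0, 3, 0],
}
per_query = {q: ndcg(grades) for q, grades in llm_grades.items()}
system_ndcg = sum(per_query.values()) / len(per_query)
print(per_query, system_ndcg)
```

With binary judgments (does the document contain the needed information or not), the same code applies with grades restricted to 0 and 1.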
Limitations
- Positional bias: LLMs tend to prefer the first document presented
- Length bias: LLMs often favor longer documents
- Self-citation bias: LLMs may prefer documents stylistically similar to their training data
- Hallucination: the LLM may recall facts that are not in the document and rate the document highly because of them
- Calibration: absolute scores are unreliable; relative pairwise comparisons are better
- Cost at scale: although cheaper than human annotation, LLM judging is still non-trivial at millions of judgments
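Positional bias in particular has a cheap, commonly used mitigation: ask for the comparison in both presentation orders and accept a verdict only when the two runs agree. The sketch below assumes a hypothetical `compare(query, first, second)` judge that returns "A" for the first document shown and "B" for the second; the two toy judges exist only to show the behaviour.

```python
def debiased_compare(query, doc_a, doc_b, compare):
    """Compare in both orders and keep only order-independent verdicts.

    Returns "A", "B", or "tie", expressed in terms of the original doc_a / doc_b.
    """
    forward = compare(query, doc_a, doc_b).strip().upper()
    backward = compare(query, doc_b, doc_a).strip().upper()
    # Map the verdict from the swapped run back onto the original labels.
    backward_unswapped = {"A": "B", "B": "A"}.get(backward)
    if forward in ("A", "B") and forward == backward_unswapped:
        return forward
    return "tie"

def always_first(query, first, second):
    """Toy judge with pure positional bias: always picks the first slot."""
    return "A"

def prefers_longer(query, first, second):
    """Toy judge whose choice does not depend on position."""
    return "A" if len(first) > len(second) else "B"

biased = debiased_compare("q", "short", "a much longer doc", always_first)
stable = debiased_compare("q", "short", "a much longer doc", prefers_longer)
print(biased, stable)
```

The cost is two LLM calls per pair. Note that the second toy judge still exhibits length bias, which order swapping does not address.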
Best Practices
- Use pairwise comparisons, not point-wise scoring
- Run multiple LLM calls per judgment and take the majority vote
- Validate on a small set of human judgments before trusting LLM judges
- Use a capable LLM (GPT-4, Claude) — cheaper models have significantly worse calibration
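Two of these practices, majority voting and validation against human labels, can be sketched together. Everything below is illustrative: the grades are made up, breaking vote ties toward the lower grade is an assumed conservative choice, and Spearman's ρ is hand-rolled to keep the example dependency-free.

```python
from collections import Counter

def majority_vote(grades):
    """Most common grade across repeated LLM calls; ties go to the lowest grade (assumption)."""
    counts = Counter(grades)
    top = max(counts.values())
    return min(g for g, c in counts.items() if c == top)

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on rank-transformed values."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # Undefined when one side is constant; reported as 0 here.
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: three LLM calls per (query, document) pair, one human grade each.
llm_runs = [[3, 3, 2], [0, 1, 0], [2, 2, 2], [1, 0, 1]]
human = [3, 0, 2, 1]

llm_final = [majority_vote(run) for run in llm_runs]
rho = spearman(llm_final, human)
print(llm_final, rho)
```

A low ρ on the validation sample suggests revising the prompt or the model before scaling up the judging run.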
Related Concepts
- Pointwise Relevance Evaluation — score each item independently; simplest LLM judge pattern
- Pairwise Relevance Evaluation — LHS/RHS comparison; stronger signal than pointwise
- Listwise Relevance Evaluation — rank full candidate list; most holistic
- Judgment Lists — what LLM judges help create
- Search Evaluation — where LLM judgments are used
- NDCG — metric computed from LLM-generated grades
- RAG — LLM judge also used for RAG faithfulness/relevancy evaluation
- Agentic Search — LLM verification step is a form of LLM judgment
People
- Jo Kristian Bergum — Vespa “Improving retrieval with LLM-as-a-judge”
- Daniel Tunkelang — traditional human judgment advocate; acknowledges LLM judges
Articles
- LLM-as-a-Judge When to Use Reasoning CoT and Explanations — Aparna Dhinakaran; explanation-first pattern; CoT has mixed evidence
- Using LLMs to Amplify Human Labeling and Improve Dash Search Relevance — Dmitriy Meyerzon; LLM calibrated on human labels → 100x scale-up; DSPy for prompt optimization; context-aware evaluation with tool use
- Search Quality Assurance with AI as a Judge — Tao Ruangyam; Zalando production pipeline; NER-clustered test queries; GPT-4o; ~$250/run for 1,500 segments × 25 results; pre-launch market validation
- Classic ML to Cope with Dumb LLM Judges — Doug Turnbull; per-attribute LLM signals as ML features → decision tree; 96.7% precision on 40% of pairs