Evaluating Search: Using Human Judgments

Source: https://dtunkelang.medium.com/evaluating-search-using-human-judgement-fbb2eeba37d9
Author: Daniel Tunkelang

Summary

Part of Daniel Tunkelang’s “Evaluating Search” series. Argues that human judgment remains the gold standard for search quality evaluation despite being expensive, and explains how to design annotation programs that produce reliable signals.

Why Human Judgment?

Automated metrics (NDCG, MRR) take human relevance judgments as input; without those judgments there is nothing to compute. The question isn’t whether to use human judgment, but how:

  • Directly: annotate each (query, document) pair
  • Indirectly: collect click signals and use them as a proxy
  • Automated: use an LLM-as-a-judge as a proxy for human judgment

Each proxy introduces error. Direct human annotation is the source of ground truth.
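To make the dependency on judgments concrete, here is a minimal sketch of NDCG and reciprocal rank computed from a graded judgment list. It is an illustration, not a method from the article: the function names, the 0–3 grade scale, the exponential gain form, and the choice to treat unjudged documents as grade 0 are all assumptions.

```python
import math


def dcg(grades):
    """Discounted cumulative gain using the common (2^grade - 1) / log2(rank + 1) form."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades, start=1))


def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """NDCG@k for one query.

    judgments maps doc_id -> human grade (assumed 0-3 scale).
    Unjudged documents are treated as grade 0, which is an assumption.
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_doc_ids, judgments, min_grade=1):
    """1 / rank of the first result graded at least min_grade, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if judgments.get(doc_id, 0) >= min_grade:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments for a single query.
judgments = {"d1": 3, "d2": 0, "d3": 1}
perfect = ndcg_at_k(["d1", "d3", "d2"], judgments)
worse = ndcg_at_k(["d2", "d3", "d1"], judgments)
rr = reciprocal_rank(["d2", "d3", "d1"], judgments)
```

Averaging reciprocal_rank over all judged queries gives MRR. Neither function can return anything meaningful until a human (or a proxy for one) has filled in the judgments dictionary.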

Designing a Good Annotation Program

1. Annotation Guidelines

The most important investment. Vague guidelines → inconsistent grades → misleading metrics.

Good guidelines:

  • Domain-specific: “relevant” for legal queries is different from fashion
  • Example-rich: 2–3 concrete examples per grade level
  • Edge case coverage: borderline cases explicitly addressed
  • Actionable: annotator should be able to judge in <30 seconds

2. Annotator Selection

  • Domain expertise required for specialized content
  • Multiple annotators per query (3+ for expensive queries)
  • Regular calibration sessions: re-judge the same queries periodically

3. Inter-Annotator Agreement

Measure agreement with a kappa statistic (Cohen’s for two annotators, Fleiss’ for more) or Krippendorff’s alpha, and use it as an ongoing health metric:

  • If agreement drops: guidelines need clarification
  • If agreement is consistently high: guidelines working well
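As a rough illustration of the agreement check, the sketch below computes unweighted Cohen's kappa for two annotators who graded the same (query, document) pairs. The function name and sample grades are hypothetical. With three or more annotators, or when the distance between ordinal grades matters, Fleiss' kappa, a weighted kappa, or Krippendorff's alpha would be the more appropriate choice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical binary grades from two annotators on four pairs.
identical = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0])
chance_level = cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0])
```

Tracking this value per calibration round, rather than once, is what turns it into the health metric described above.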

4. Sampling Strategy

Judgment lists should cover:

  • Head queries (most traffic) — highest business impact
  • Torso queries — broad coverage
  • Tail queries — edge cases, failure modes
  • Adversarial queries — known hard cases

See Query Types for taxonomy.
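One way to assemble such a list is to bucket queries by cumulative traffic share and sample from each bucket. The sketch below assumes a query log summarized as query -> count. The 30% and 40% cutoffs for head and torso, the function names, and the sample queries are illustrative assumptions, not values from the article.

```python
import random


def bucket_queries(query_counts, head_share=0.3, torso_share=0.4):
    """Split queries into head/torso/tail by cumulative share of traffic.

    The cutoffs are illustrative; the right thresholds depend on the
    shape of the query distribution.
    """
    total = sum(query_counts.values())
    buckets = {"head": [], "torso": [], "tail": []}
    cumulative = 0
    by_traffic = sorted(query_counts.items(), key=lambda kv: kv[1], reverse=True)
    for query, count in by_traffic:
        share_before = cumulative / total
        if share_before < head_share:
            buckets["head"].append(query)
        elif share_before < head_share + torso_share:
            buckets["torso"].append(query)
        else:
            buckets["tail"].append(query)
        cumulative += count
    return buckets


def sample_judgment_queries(query_counts, per_bucket, adversarial=(), seed=0):
    """Draw up to per_bucket queries from each traffic bucket, then add the
    hand-curated adversarial queries unchanged."""
    rng = random.Random(seed)
    buckets = bucket_queries(query_counts)
    sample = {name: rng.sample(queries, min(per_bucket, len(queries)))
              for name, queries in buckets.items()}
    sample["adversarial"] = list(adversarial)
    return sample


# Hypothetical query log.
counts = {"shoes": 60, "red shoes": 25, "blue suede shoes": 5,
          "size 9 wide": 5, "vegan leather boots": 5}
buckets = bucket_queries(counts)
sample = sample_judgment_queries(counts, per_bucket=2, adversarial=["shoes -red"])
```

Adversarial queries are passed in rather than sampled because they come from known failures, not from traffic statistics.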

Human Judgment vs. Click Data

Tunkelang frames the comparison:

Dimension  | Human Judgment              | Click Data
Scale      | Limited (hundreds/day)      | Massive (millions/day)
Quality    | High (with good guidelines) | Noisy (position bias)
Coverage   | Curated                     | Whatever users searched
Cost       | High ($1–5/pair)            | Free
Freshness  | Requires update             | Real-time

Best practice: use both. Calibrate click-based proxies against human judgments periodically.
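A simple calibration check is to take pairs that have both a human grade and a click-based score and measure how well the two rank-correlate. The sketch below uses Spearman correlation, which is one reasonable choice rather than something the article prescribes. The function names and sample numbers are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: human grades and click-through rates for the same four pairs.
human_grades = [0, 1, 2, 3]
click_through = [0.01, 0.05, 0.04, 0.20]
rho = spearman(human_grades, click_through)
```

A correlation that falls between calibration rounds suggests the click proxy has drifted from what human judges consider relevant, for example because position bias is not being corrected.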

Judgment Lists in Agentic Systems

For Agentic Search, judgment is more complex:

  • Not just “is this document relevant?” but “did the agent reach a correct final answer?”
  • Requires multi-step evaluation
  • Per-document NDCG doesn’t capture full-session quality

People