Measuring Search, A Human Approach

Improving search effectively requires online and offline evaluation to work together, not independently.

The limitation of online metrics (A/B testing)

A/B testing captures user interaction patterns effectively but has a critical limitation: it reveals what users do, not why they do it.

Click-through rate (CTR), for example, can reward results that are engaging but less relevant. Users seeking quick answers may click on a satisfactory result and move on, which produces an engagement signal that does not reflect how relevant the result actually was.

Human judgment as counterbalance

Human raters evaluate query-document pairs for relevance. These are focused assessments, unaffected by a result's surface-level appeal. Raters can identify:

  • Why content matters to specific queries
  • Corner cases that log metrics miss

Limitations: raters are imperfect proxies for real users, because they lack knowledge of each user's task and motivation.
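As an illustration of how rater judgments on query-document pairs might be turned into an offline score, here is a minimal sketch. It is not taken from the source: the 0-3 grading scale, the median aggregation, the sample data, and the choice of nDCG as the metric are all assumptions made for the example.

```python
import math
from statistics import median

# Hypothetical data: (query, document) -> grades from three raters, 0 (bad) to 3 (ideal).
ratings = {
    ("laptop stand", "doc_a"): [3, 3, 2],
    ("laptop stand", "doc_b"): [1, 0, 1],
    ("laptop stand", "doc_c"): [2, 3, 2],
}

def pair_grade(grades):
    """Collapse several rater grades into one label (assumed rule: the median)."""
    return median(grades)

def dcg(grades):
    """Discounted cumulative gain: a grade counts for less the lower it is ranked."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(query, ranked_docs, ratings):
    """DCG of the given ranking divided by the DCG of the best possible ordering."""
    grades = [pair_grade(ratings[(query, doc)]) for doc in ranked_docs]
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Compare two hypothetical rankings of the same three documents.
good = ndcg("laptop stand", ["doc_a", "doc_c", "doc_b"], ratings)
poor = ndcg("laptop stand", ["doc_b", "doc_a", "doc_c"], ratings)
```

A score like this depends only on the rated relevance of each result and its position, so it is not influenced by how clickable a result looks. It also inherits the limitation noted above: the grades reflect what raters believe about the query, not what any individual user wanted.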

Launch reviews

Rubinstein advocates launch reviews: team discussions that examine human ratings alongside A/B test results to determine whether an algorithm change achieves its intended improvement or requires recalibration.
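One way to prepare for such a discussion is to line up the online and offline signals for a change and flag where they disagree. The sketch below is only an illustration under stated assumptions: the function name, the two delta inputs, and the labels are hypothetical, and a real review would also weigh statistical significance and qualitative rater feedback rather than the sign of two numbers.

```python
def review_note(ctr_delta: float, relevance_delta: float) -> str:
    """Label a change by whether the online and offline signals agree.

    ctr_delta: treatment CTR minus control CTR from the A/B test (hypothetical input).
    relevance_delta: treatment minus control mean human relevance rating (hypothetical input).
    """
    if ctr_delta > 0 and relevance_delta > 0:
        return "agree: improvement"
    if ctr_delta <= 0 and relevance_delta <= 0:
        return "agree: no improvement"
    if ctr_delta > 0:
        # Engagement rose but raters did not find results more relevant.
        return "disagree: engagement up, rated relevance not up"
    # Raters preferred the change but users did not engage more.
    return "disagree: rated relevance up, engagement not up"

# Hypothetical experiment: clicks increased while rated relevance dropped.
note = review_note(ctr_delta=0.012, relevance_delta=-0.15)
```

The disagreement cases are the ones that most need discussion, since they are where a single metric taken alone would point the team in the wrong direction.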

Human evaluation is indispensable for preventing misguided optimization based on incomplete metrics.
