Measuring Search, A Human Approach
Effective search improvement requires online and offline evaluation working together — not independently.
The limitation of online metrics (A/B testing)
A/B testing captures user interaction patterns effectively but has a critical limitation: it reveals what users do, not why they do it.
Click-through rate (CTR), for example, can reward results that are engaging but less relevant. Users seeking quick answers may click on a satisfactory result and move on, producing engagement signals that look positive but do not reflect relevance.
Human judgment as counterbalance
Human raters evaluate query-document pairs for relevance — focused assessments unaffected by surface-level appeal. They identify:
- Why content matters to specific queries
- Corner cases that log metrics miss
Limitations: raters are imperfect proxies for real users, since they lack knowledge of individual users' tasks and motivations.
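One common way to turn rater judgments on query-document pairs into a single offline score is normalized discounted cumulative gain (NDCG). The source does not name a specific metric, so treat this as an illustrative sketch only: the 0-3 grading scale, the linear gain, the function names, and the sample ratings are all assumptions.

```python
import math
from statistics import mean


def dcg(gains):
    """Discounted cumulative gain: gain / log2(rank + 1), with ranks starting at 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if there is no gain."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def average_ratings(ratings_per_doc):
    """Average several raters' grades for each ranked document."""
    return [mean(grades) for grades in ratings_per_doc]


# Hypothetical data: three raters grade the top 3 results for one query (0 = bad, 3 = perfect).
ratings = [[1, 1, 2], [3, 3, 3], [0, 1, 0]]
gains = average_ratings(ratings)
score = ndcg(gains)  # below 1.0 because the best-rated document is ranked second
```

Averaging grades across raters hides disagreement between them; a real evaluation would likely also track how often raters disagree on the same pair.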
Launch reviews
Rubinstein advocates for launch reviews — team discussions examining human ratings alongside A/B test results — to understand whether algorithm changes achieve intended improvements or require recalibration.
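A launch review needs the two kinds of evidence side by side. The sketch below is one hypothetical way to prepare that comparison, not a procedure described in the source: the `VariantSummary` fields, the 0-3 rating scale, and the sample numbers are assumptions, and no statistical significance testing is performed.

```python
from dataclasses import dataclass


@dataclass
class VariantSummary:
    """Aggregated evidence for one arm of an A/B test (hypothetical structure)."""
    clicks: int
    impressions: int
    mean_rating: float  # mean human relevance grade, assumed 0-3 scale

    @property
    def ctr(self) -> float:
        # Guard against an empty arm to avoid division by zero.
        return self.clicks / self.impressions if self.impressions else 0.0


def review_flag(control: VariantSummary, treatment: VariantSummary) -> str:
    """Report whether online (CTR) and offline (rating) deltas point in opposite directions."""
    ctr_delta = treatment.ctr - control.ctr
    rating_delta = treatment.mean_rating - control.mean_rating
    if ctr_delta * rating_delta < 0:
        return "signals disagree: discuss in launch review"
    return "signals do not conflict"


# Made-up numbers: the treatment earns more clicks but lower rated relevance.
control = VariantSummary(clicks=100, impressions=1000, mean_rating=2.1)
treatment = VariantSummary(clicks=130, impressions=1000, mean_rating=1.8)
flag = review_flag(control, treatment)
```

A flag like this only marks where to look. Deciding why the signals diverge is the part that still requires the team discussion.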
Human evaluation is indispensable for preventing misguided optimization based on incomplete metrics.