Session-based vs. query-based search evals

Traditional query-based evaluation

Build judgment lists from clickstream data: label results as relevant or irrelevant for specific queries, then evaluate rankings with NDCG over the aggregated signals.
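As a rough illustration of the query-based workflow, the sketch below computes NDCG per query from a small judgment list and averages it. The queries, document ids, and rankings are hypothetical, and the gain is linear, which for binary relevant/irrelevant labels matches the common exponential form.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: the gain at each rank is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """NDCG@k for one query. `judgments` maps doc id -> relevance; unjudged docs count as 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgment list: query -> {doc id: 1 if relevant, 0 if irrelevant}
judgment_list = {
    "red shoes": {"d1": 1, "d2": 0, "d3": 1},
    "laptop bag": {"d7": 1, "d8": 0},
}

# Hypothetical rankings returned by the system under test
rankings = {
    "red shoes": ["d2", "d1", "d3"],
    "laptop bag": ["d7", "d8"],
}

per_query_ndcg = {q: ndcg_at_k(rankings[q], judgment_list[q]) for q in judgment_list}
mean_ndcg = sum(per_query_ndcg.values()) / len(per_query_ndcg)
```

The per-query scores are what make this style convenient for debugging: a low score points at one specific query.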

Session-based evaluation

“Replay” individual user interactions directly. Instead of asking “how many queries improved?”, this approach asks “would we have satisfied past users?”, that is, whether the system would have better satisfied each individual historical search-and-click event.
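A minimal sketch of the replay idea, assuming each logged session reduces to one query and one clicked document, and that "satisfied" means the clicked document appears in the top k results. The `Session` class, the `replay` function, and both toy rankers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Session:
    query: str
    clicked_doc_id: str

def replay(sessions, rank_fn, k=10):
    """Fraction of logged sessions whose clicked doc is in rank_fn's top k for that query."""
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions if s.clicked_doc_id in rank_fn(s.query)[:k])
    return hits / len(sessions)

# Hypothetical session log: one entry per historical search-and-click event
sessions = [
    Session("red shoes", "d3"),
    Session("red shoes", "d1"),
    Session("laptop bag", "d7"),
    Session("red shoes", "d3"),
]

# Toy rankers standing in for the current and the proposed system
baseline_results = {"red shoes": ["d1", "d2", "d3"], "laptop bag": ["d8", "d7"]}
candidate_results = {"red shoes": ["d3", "d1", "d2"], "laptop bag": ["d7", "d8"]}

baseline_rate = replay(sessions, lambda q: baseline_results.get(q, []), k=1)
candidate_rate = replay(sessions, lambda q: candidate_results.get(q, []), k=1)
```

Each session counts once, so a frequent query influences the result in proportion to its traffic.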

Advantages

  • Improved sampling accuracy: mirrors probability-based sampling (as in political polling), where every user interaction has an equal probability of selection. Query-based methods introduce sampling bias by pre-selecting specific queries.
  • Time-sensitive features: aggregating clicks across months masks momentary conditions (e.g., temporary sales). Session-based data captures exact feature values present when users clicked.
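To make the sampling point concrete, the sketch below compares a rate computed over all sessions with a per-query average over a hand-picked query set. The outcome data is invented, and the function names are hypothetical; the only point is that the two numbers can diverge sharply when the selected queries do not reflect real traffic.

```python
from collections import defaultdict

def session_level_rate(outcomes):
    """Satisfaction rate where every logged session counts equally."""
    return sum(satisfied for _, satisfied in outcomes) / len(outcomes)

def query_level_rate(outcomes, selected_queries):
    """Mean of per-query satisfaction rates, restricted to pre-selected queries."""
    by_query = defaultdict(list)
    for query, satisfied in outcomes:
        if query in selected_queries:
            by_query[query].append(satisfied)
    if not by_query:
        return 0.0
    rates = [sum(v) / len(v) for v in by_query.values()]
    return sum(rates) / len(rates)

# Invented outcomes: (query, was the user satisfied?) for 100 sessions
outcomes = (
    [("red shoes", True)] * 81 + [("red shoes", False)] * 9
    + [("laptop bag", True)] * 3 + [("laptop bag", False)] * 6
    + [("rare part", False)] * 1
)

all_sessions = session_level_rate(outcomes)                          # 84 of 100 sessions
picked_queries = query_level_rate(outcomes, {"laptop bag", "rare part"})  # mean of 1/3 and 0
```

Here the hand-picked set over-represents low-traffic queries, so it paints a much bleaker picture than the session-level number.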

Disadvantages

  • Clickstream biases (position bias) remain present
  • Harder to debug per query, since any single query may appear in too few sessions to support conclusions
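One simple way to see the per-query data problem is to count sessions per query and flag the sparse ones. This is only a sketch: the `sparse_queries` helper is hypothetical, and the `min_sessions` threshold is an arbitrary assumption, not a recommended value.

```python
from collections import Counter

def sparse_queries(session_queries, min_sessions=30):
    """Return {query: session count} for queries seen in fewer than min_sessions sessions."""
    counts = Counter(session_queries)
    return {query: n for query, n in counts.items() if n < min_sessions}

# Hypothetical log: the query string of each sampled session
session_queries = ["red shoes"] * 40 + ["laptop bag"] * 5 + ["rare part"]

too_sparse = sparse_queries(session_queries, min_sessions=30)
```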

Balanced view

Both derive from historical sessions and offer different insights:

  • Query-based: excellent for debugging specific search problems
  • Session-based: approximates an A/B test by replaying past traffic

Neither represents absolute truth — both are imperfect but useful models of search performance.

People