Session-Based Evaluation

Definition

Session-based evaluation measures search quality at the level of a complete user session — a sequence of queries, clicks, reformulations, and final actions — rather than evaluating each query in isolation.

Motivation: Query-Based Evaluation Misses Context

Standard query-based Search Evaluation treats each query independently:

"iphone case" → NDCG@10 = 0.72  ✓
"iphone 15 case" → NDCG@10 = 0.81  ✓

But both queries may come from the same user, who reformulated because the first result page did not show what they wanted. Query-based evaluation reports that “both queries performed well” even though the user experience was poor: the user had to reformulate.

Session-based evaluation captures this failure:

Session: "iphone case" → no click → "iphone 15 case" → no click → "buy iphone case amazon" → exit site
Session outcome: FAILURE (user abandoned to competitor)
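A minimal sketch of how such a session could be represented and labeled follows. The `Event` and `Session` types, the event kinds, and the rule "success means the session contains a conversion" are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    kind: str        # assumed kinds: "query", "click", "convert", "exit"
    value: str = ""  # query text or clicked document id


@dataclass
class Session:
    events: List[Event] = field(default_factory=list)


def session_outcome(session: Session) -> str:
    """Label a session SUCCESS if it contains a conversion, else FAILURE."""
    if any(e.kind == "convert" for e in session.events):
        return "SUCCESS"
    return "FAILURE"


# The abandoned session from the example above: three queries, no clicks.
example = Session([
    Event("query", "iphone case"),
    Event("query", "iphone 15 case"),
    Event("query", "buy iphone case amazon"),
    Event("exit"),
])
print(session_outcome(example))
```

In practice the success rule would likely be richer (for example, a long-dwell click might count as success on informational queries), but the point is that the label attaches to the whole session rather than to any single query.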

Session Signals

| Signal | Positive | Negative |
|---|---|---|
| Query reformulation | None (neutral at best) | Multiple reformulations = friction |
| Click | Click occurred | No clicks = zero-click failure |
| Dwell time | Long dwell = satisfaction | Short dwell after click = poor result |
| Session depth | N/A | Deep scrolling = can’t find what they want |
| Session outcome | Conversion/success | Abandonment |
| Return to results | N/A | Pogosticking = dissatisfied with result |
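Several of these signals can be derived from a chronological event log. The sketch below assumes each event is a dict with a `kind` of `"query"`, `"click"`, `"back"`, or `"convert"` and an optional `dwell` in seconds; the field names and the 10-second short-dwell threshold are hypothetical and would need tuning for a real product.

```python
from typing import Dict, List

SHORT_DWELL_SECONDS = 10.0  # assumed threshold, not a standard value


def extract_signals(events: List[Dict]) -> Dict[str, int]:
    """Count session-level signals from a chronological list of events."""
    queries = [e for e in events if e["kind"] == "query"]
    clicks = [e for e in events if e["kind"] == "click"]
    short_clicks = [
        e for e in clicks if e.get("dwell", 0.0) < SHORT_DWELL_SECONDS
    ]
    # Pogosticking: a short-dwell click immediately followed by a
    # return to the results page.
    pogosticks = sum(
        1
        for cur, nxt in zip(events, events[1:])
        if cur["kind"] == "click"
        and nxt["kind"] == "back"
        and cur.get("dwell", 0.0) < SHORT_DWELL_SECONDS
    )
    return {
        "reformulations": max(len(queries) - 1, 0),
        "clicks": len(clicks),
        "short_dwell_clicks": len(short_clicks),
        "pogosticks": pogosticks,
        "converted": int(any(e["kind"] == "convert" for e in events)),
    }


log = [
    {"kind": "query"},
    {"kind": "click", "dwell": 3.0},
    {"kind": "back"},
    {"kind": "query"},
    {"kind": "click", "dwell": 45.0},
    {"kind": "convert"},
]
signals = extract_signals(log)
```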

Session Success Metrics

Task Completion Rate

Binary: did the user accomplish their goal within the session? Measuring it requires either a conversion event (purchase, form submit, page visit) or human judgment.

Abandonment Rate

Percentage of sessions with no positive engagement (no click, no conversion).

Reformulation Rate

Average number of query modifications per session. A high value suggests the system misinterpreted the user’s first query.
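The three metrics above can be computed from per-session summaries. This sketch assumes each session is a dict with hypothetical fields `queries` (count), `clicks` (count), and `converted` (bool), and treats a conversion as the task-completion signal.

```python
from typing import Dict, List


def session_metrics(sessions: List[Dict]) -> Dict[str, float]:
    """Aggregate completion, abandonment, and reformulation over sessions."""
    n = len(sessions)
    if n == 0:
        raise ValueError("need at least one session")
    completed = sum(1 for s in sessions if s["converted"])
    # Abandoned: no click and no conversion anywhere in the session.
    abandoned = sum(
        1 for s in sessions if s["clicks"] == 0 and not s["converted"]
    )
    # Every query after the first one counts as a reformulation.
    reformulations = sum(max(s["queries"] - 1, 0) for s in sessions)
    return {
        "task_completion_rate": completed / n,
        "abandonment_rate": abandoned / n,
        "reformulations_per_session": reformulations / n,
    }


sample = [
    {"queries": 1, "clicks": 1, "converted": True},
    {"queries": 3, "clicks": 0, "converted": False},
    {"queries": 2, "clicks": 1, "converted": False},
    {"queries": 1, "clicks": 0, "converted": False},
]
metrics = session_metrics(sample)
```

Note that a session can be neither completed nor abandoned (the third sample session has a click but no conversion), so the two rates need not sum to one.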

Click Residual

The gap between the clicks a set of result pages was expected to receive and the clicks it actually received: clicks that “should” have happened but didn’t.
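The text does not give a formula, so the following is only one possible way to operationalize the idea: take expected clicks from a baseline click-through rate per rank and subtract the observed clicks. The function name, the `baseline_ctr` table, and the rank-based expectation are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def click_residual(
    impressions: List[Tuple[int, bool]],
    baseline_ctr: Dict[int, float],
) -> float:
    """Return expected minus observed clicks.

    impressions: (rank, was_clicked) pairs for each result shown.
    baseline_ctr: assumed historical click-through rate for each rank.
    """
    expected = sum(baseline_ctr.get(rank, 0.0) for rank, _ in impressions)
    observed = sum(1 for _, clicked in impressions if clicked)
    return expected - observed


baseline = {1: 0.5, 2: 0.25}
shown = [(1, False), (2, False), (1, True), (2, False)]
residual = click_residual(shown, baseline)
```

A positive residual under this definition means fewer clicks occurred than the baseline predicts; a value near zero means the results attracted about as many clicks as expected.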

Session-Based vs. Query-Based: When to Use Each

| Scenario | Recommended |
|---|---|
| Ranking algorithm A/B test | Query-based (NDCG) |
| Search UX change | Session-based |
| Measuring user satisfaction | Session-based |
| Component evaluation (retrieval quality) | Query-based |
| Conversational search quality | Session-based (required) |
| E-commerce search business impact | Both |

Agentic Search is inherently session-based — the agent’s multi-turn retrieval is a session. Evaluating an agentic search system requires session-level metrics:

  • Was the final answer correct?
  • How many retrieval turns were needed?
  • Was the retrieval cost within budget?

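Query-level NDCG on its own cannot answer these questions, because it scores a single ranked list rather than the outcome of the whole multi-turn session.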
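The three questions above map onto a small per-session record. The sketch below is a hypothetical schema: the `AgentSession` fields, the cost unit, and the `cost_budget` parameter are assumptions, and correctness is presumed to have been judged elsewhere (for example, against a reference answer).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentSession:
    answer_correct: bool   # was the final answer judged correct?
    retrieval_turns: int   # number of retrieval calls the agent made
    retrieval_cost: float  # cost in any consistent unit (tokens, dollars)


def summarize(
    sessions: List[AgentSession], cost_budget: float
) -> Dict[str, float]:
    """Aggregate session-level metrics for an agentic search system."""
    n = len(sessions)
    if n == 0:
        raise ValueError("need at least one session")
    return {
        "accuracy": sum(s.answer_correct for s in sessions) / n,
        "mean_turns": sum(s.retrieval_turns for s in sessions) / n,
        "within_budget_rate": sum(
            s.retrieval_cost <= cost_budget for s in sessions
        ) / n,
    }


runs = [
    AgentSession(True, 2, 0.5),
    AgentSession(False, 6, 2.0),
    AgentSession(True, 4, 1.0),
    AgentSession(True, 4, 0.5),
]
report = summarize(runs, cost_budget=1.0)
```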

People