Explicit Relevance Evaluation with Probability-Proportional-to-Size Sampling
In explicit relevance evaluation, domain experts manually rate the quality of search results. It needs less up-front investment than implicit methods such as click analysis, and each judgment is a more direct relevance signal.
Why PPTSS over simple random sampling
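Probability-proportional-to-size sampling (PPTSS) selects each query with a probability proportional to its frequency in the query logs. Frequent queries are weighted appropriately while less common ones can still be included, so the sample mirrors actual user traffic patterns. Simple random sampling over distinct queries treats every query alike and therefore over-represents tail queries.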
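A minimal sketch of frequency-weighted sampling, assuming the query log has already been aggregated into a mapping of query string to count. The function name `pps_sample`, the `seed` parameter, and the example counts are hypothetical. It uses the Efraimidis-Spirakis weighted sampling-without-replacement scheme, which favors frequent queries but is only an approximation of strict probability-proportional-to-size inclusion probabilities.

```python
import random


def pps_sample(query_counts, k, seed=0):
    """Draw up to k distinct queries, favoring higher-frequency ones."""
    rng = random.Random(seed)  # fixed seed so a batch can be reproduced
    # Efraimidis-Spirakis: give each query the key u ** (1 / weight),
    # with u uniform in [0, 1); the k largest keys form the sample.
    keyed = [
        (rng.random() ** (1.0 / count), query)
        for query, count in query_counts.items()
        if count > 0
    ]
    keyed.sort(reverse=True)
    return [query for _, query in keyed[:k]]


# Hypothetical aggregated log: query -> number of times it was issued.
example_log = {"iphone": 1000, "usb c cable": 400, "hdmi": 50, "rare thing": 1}
print(pps_sample(example_log, k=2))
```

Head queries are very likely to be selected, but a tail query such as "rare thing" still has a small chance of appearing, which is the behavior described above.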
Practical guidance
Query sample size
Start with 50 queries. This matches common TREC practice, where evaluation tracks have typically used around 50 topics, and it provides sufficient initial data. Further batches can follow.
Query quality validation
Manually review to eliminate:
- Generic or ambiguous terms
- Queries affected by ranking rules
- Traffic unrelated to your inventory
Aim for ~50 viable queries after filtering, which means drawing more than 50 candidates to allow for rejections.
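To end up with about 50 viable queries after review, the initial draw has to be larger than 50. A rough sketch of that arithmetic follows; the function name `candidates_to_draw` and the 20% rejection rate are illustrative assumptions, not figures from this guide.

```python
import math


def candidates_to_draw(target_viable, expected_reject_rate):
    """Estimate how many queries to sample so that roughly
    target_viable remain after manual filtering."""
    if not 0 <= expected_reject_rate < 1:
        raise ValueError("expected_reject_rate must be in [0, 1)")
    return math.ceil(target_viable / (1 - expected_reject_rate))


# Assumed rejection rate of 20%; replace with what your first review shows.
print(candidates_to_draw(50, 0.2))
```

The actual rejection rate is only known after a first review pass, so treat the estimate as a starting point and draw a further batch if too few queries survive.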
Timeline expectations
- Defining information needs: ~1 hour with subject matter experts
- Rating 50 queries × 2 rankers × depth 5 (500 total judgments): ~1h40m at ~5 ratings/minute
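The rating estimate above is simple arithmetic, sketched here so it can be rerun for other batch sizes. The function name `judging_effort` is hypothetical; the inputs are the figures from the list.

```python
def judging_effort(queries, rankers, depth, ratings_per_minute):
    """Return (total judgments, minutes needed) for one rating batch."""
    judgments = queries * rankers * depth
    return judgments, judgments / ratings_per_minute


judgments, minutes = judging_effort(
    queries=50, rankers=2, depth=5, ratings_per_minute=5
)
hours, rem = divmod(round(minutes), 60)
print(f"{judgments} judgments, about {hours}h{rem:02d}m")
# prints: 500 judgments, about 1h40m
```

This is an upper bound if judgments are recorded per query-document pair: a document returned by both rankers for the same query only needs to be rated once.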
Tooling
Tools like Quepid facilitate the explicit evaluation workflow.