Query Sampling

How to construct a representative query set for offline evaluation. The quality of your judgment-based metrics (NDCG, MAP, MRR) is only as good as the queries you evaluate against.

A biased sample produces misleading metrics — optimizing against it can harm real user experience.


The Problem with Naive Sampling

If you sample uniformly from the unique queries in your log, rare queries dominate the sample, because they make up the vast majority of unique queries. Your evaluation then mostly measures performance on the long tail, which may not reflect where most user value comes from.

If you sample by traffic volume (equivalently, uniformly from all query occurrences), head queries dominate, and you miss systematic failures on the tail.

The right approach depends on your evaluation goal.
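
To make the gap concrete, the sketch below compares two views of the same log: the share of unique queries that are rare, and the share of traffic those queries account for. The log is made-up toy data, and `tail_shares` and `rare_threshold` are hypothetical names chosen for illustration.

```python
from collections import Counter


def tail_shares(query_log, rare_threshold=1):
    """Return (share of unique queries that are rare, share of traffic they carry).

    A query counts as "rare" when it occurs at most `rare_threshold` times.
    The threshold is an illustrative assumption, not a standard definition.
    """
    if not query_log:
        raise ValueError("query_log must not be empty")
    counts = Counter(query_log)
    rare = [q for q, c in counts.items() if c <= rare_threshold]
    unique_share = len(rare) / len(counts)
    traffic_share = sum(counts[q] for q in rare) / len(query_log)
    return unique_share, traffic_share


# Toy log (made-up data): one head query, two mid-volume queries, 20 singletons.
toy_log = (
    ["shoes"] * 60
    + ["red shoes"] * 10
    + ["blue shoes"] * 10
    + [f"rare query {i}" for i in range(20)]
)
unique_share, traffic_share = tail_shares(toy_log)
print(f"rare share of unique queries: {unique_share:.2f}")
print(f"rare share of traffic:        {traffic_share:.2f}")
```

On this toy log, rare queries are 20 of the 23 unique queries but only 20% of traffic, so a sample over unique queries would be mostly tail while a traffic-weighted sample would be mostly head.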


Sampling Methods

Random Sampling

Draw queries uniformly at random from the set of unique queries in the log. (Drawing uniformly from all occurrences instead weights each query by its traffic, which is the PPS approach described later in this section.)

  • Pro: simple; every unique query has the same chance of being selected
  • Con: over-represents the long tail (which has many unique queries but little traffic); expensive to label if most sampled queries are rare

Best for: auditing coverage, discovering unknown failure modes.
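
A minimal sketch of uniform sampling over unique queries, assuming the log is available as an in-memory list of query strings. The function name `sample_unique_queries` and its parameters are hypothetical.

```python
import random


def sample_unique_queries(query_log, k, seed=0):
    """Draw up to k distinct queries, each unique query being equally likely."""
    # Sort the deduplicated queries so the same seed gives the same sample
    # across runs (set iteration order is not stable for strings).
    unique = sorted(set(query_log))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))
```

Fixing the seed keeps the eval set reproducible, which matters if the sample is re-drawn after the labeling pipeline changes.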


Stratified Sampling

Divide the query space into strata (e.g., by frequency tier, query type, domain) and sample independently from each stratum.

Example strata:

  • Head (top 1% of queries by volume)
  • Torso (1-20%)
  • Tail (bottom 80%)
  • By query type: navigational, transactional, informational

Sample a fixed number from each stratum regardless of stratum size.

  • Pro: guarantees representation of all segments; prevents any one tier from dominating
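  • Con: requires defining meaningful strata; because every stratum gets the same sample size regardless of its traffic, an unweighted average over the whole set does not reflect real usage, so per-stratum metrics need to be reported separately or reweighted by traffic share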

Best for: balanced evaluation across query tiers; catching failures in head queries that a uniform sample would dilute.
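
The frequency-tier variant could be sketched as follows. It assumes tiers are defined by rank among unique queries (top 1%, next 19%, remaining 80%), which is one reading of the example strata above. `frequency_tier` and `stratified_sample` are hypothetical helper names.

```python
import random
from collections import Counter


def frequency_tier(rank, n_unique):
    """Map a 0-based volume rank to a tier using the 1% and 20% cut points."""
    position = rank + 1
    if position * 100 <= n_unique:  # within the top 1% of unique queries
        return "head"
    if position * 5 <= n_unique:    # within the top 20% of unique queries
        return "torso"
    return "tail"


def stratified_sample(query_log, per_stratum, seed=0):
    """Sample the same number of queries from each frequency tier."""
    counts = Counter(query_log)
    # Rank by descending volume; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda q: (-counts[q], q))
    strata = {"head": [], "torso": [], "tail": []}
    for rank, query in enumerate(ranked):
        strata[frequency_tier(rank, len(ranked))].append(query)
    rng = random.Random(seed)
    # A stratum smaller than per_stratum is taken in full.
    return {
        tier: rng.sample(queries, min(per_stratum, len(queries)))
        for tier, queries in strata.items()
    }
```

With small logs the head stratum can hold fewer queries than `per_stratum`, or none at all, so it is worth checking stratum sizes before fixing the per-stratum budget.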


Probability-Proportional-to-Size (PPS) Sampling

Sample queries with probability proportional to their traffic volume (query frequency).

  • High-traffic queries are more likely to be sampled → more annotation effort goes where user impact is highest
  • Rare queries still appear, just proportionally less

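Formally, a query $q$ with log frequency $f(q)$ is drawn with probability

$$P(q) = \frac{f(q)}{\sum_{q' \in Q} f(q')}$$

where $Q$ is the set of unique queries in the log.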

  • Pro: annotation effort is allocated where it matters most (by traffic impact); avoids wasting labels on rare queries
  • Con: long-tail failures are underrepresented; rare but high-value query types may be missed

Best for: efficient use of annotation budget when you want your eval to reflect user-weighted impact; standard choice for production eval sets.
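
One simple way to realize PPS is to draw with replacement, weighting each unique query by its count. The sketch below takes that approach; it is an illustrative choice rather than the only PPS design, and `pps_sample` is a hypothetical name.

```python
import random
from collections import Counter


def pps_sample(query_log, n_draws, seed=0):
    """Draw n_draws queries with replacement, weighted by query frequency.

    Returns a Counter mapping each selected query to the number of times
    it was drawn.
    """
    counts = Counter(query_log)
    queries = sorted(counts)  # fixed order so the seed is reproducible
    weights = [counts[q] for q in queries]
    rng = random.Random(seed)
    draws = rng.choices(queries, weights=weights, k=n_draws)
    return Counter(draws)
```

A query drawn several times only needs to be labeled once; keeping the draw count as its weight when averaging a metric preserves the traffic-weighted interpretation. The number of distinct queries to label can therefore be smaller than `n_draws`.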


Comparison

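| Method     | Head coverage | Tail coverage | Annotation efficiency | Best for                    |
|------------|---------------|---------------|-----------------------|-----------------------------|
| Random     | Low           | High          | Low                   | Discovery, coverage audits  |
| Stratified | Guaranteed    | Guaranteed    | Medium                | Balanced regression testing |
| PPS        | High          | Proportional  | High                  | Production quality tracking |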

Practical Guidance

  • Use PPS as your default production eval set — it reflects where users actually live
  • Maintain a separate stratified set for regression testing — ensures a bad tail change doesn’t go undetected
  • Refresh sampling periodically — query distributions shift with product changes, seasonality, and growth
  • Label a minimum of 100-300 queries before trusting NDCG trends; 1000+ for stable estimates
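
To see why set size matters, the sketch below computes a mean per-query score together with an approximate 95% confidence half-width. It assumes per-query scores (for example NDCG) are already computed and roughly independent, and it uses a normal approximation. `mean_with_ci` is a hypothetical helper, and the scores in the usage lines are made-up.

```python
import math
import statistics


def mean_with_ci(per_query_scores, z=1.96):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    n = len(per_query_scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate spread")
    mean = statistics.fmean(per_query_scores)
    # Standard error of the mean shrinks with the square root of n.
    stderr = statistics.stdev(per_query_scores) / math.sqrt(n)
    return mean, z * stderr


# Made-up scores with the same spread at two set sizes.
small_mean, small_half = mean_with_ci([0.2, 0.8] * 50)    # 100 queries
large_mean, large_half = mean_with_ci([0.2, 0.8] * 500)   # 1000 queries
print(f"n=100:  {small_mean:.3f} +/- {small_half:.3f}")
print(f"n=1000: {large_mean:.3f} +/- {large_half:.3f}")
```

Because the half-width shrinks with the square root of the number of queries, moving from 100 to 1000 labeled queries narrows the interval by roughly a factor of three, which is why small metric differences are hard to trust on small sets.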