Common Pitfalls of Onsite Search Experimentation
Andreas Wagner (searchHub) identifies two major A/B testing traps in onsite search.
Pitfall 1: Test Scope Too Broad
Small configuration changes (e.g., adding vector search for long-tail queries) touch only a small share of traffic, so they rarely move metrics averaged across all traffic. Solution: segment by the affected queries (e.g., queries longer than 5 words) and test only within that segment. A positive result in aggregate doesn’t mean a positive result for every segment.
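As a minimal sketch of this segmentation, the snippet below compares conversion per variant only on queries longer than 5 words. The event layout (query, variant, converted), the helper names, and the sample data are all hypothetical and chosen purely for illustration; a real setup would read from the search logs.

```python
from collections import defaultdict

def is_affected(query: str, min_words: int = 6) -> bool:
    """True for queries longer than 5 words (the example segment; an assumption)."""
    return len(query.split()) >= min_words

def conversion_by_variant(events, only_affected=True):
    """Conversion rate per variant, optionally restricted to the affected segment."""
    totals = defaultdict(lambda: [0, 0])  # variant -> [conversions, searches]
    for query, variant, converted in events:
        if only_affected and not is_affected(query):
            continue  # skip traffic the change cannot influence
        totals[variant][0] += int(converted)
        totals[variant][1] += 1
    return {variant: conv / n for variant, (conv, n) in totals.items()}

# Hypothetical sample: two short-head searches, two long-tail searches.
events = [
    ("red shoes", "control", True),
    ("red shoes", "treatment", True),
    ("waterproof hiking boots for wide feet women", "control", False),
    ("waterproof hiking boots for wide feet women", "treatment", True),
]

segment_rates = conversion_by_variant(events)                      # affected segment only
overall_rates = conversion_by_variant(events, only_affected=False)  # all traffic
```

In this toy data the gap between variants is larger inside the segment than across all traffic, which is the dilution the pitfall describes.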
Pitfall 2: Session-Level Randomization + Late-Stage KPIs = Carry-Over Effect
Using session-level randomization with ARPU (Average Revenue Per User) as the primary metric is dangerous because:
- Users have memory across sessions. A user exposed to treatment in session 1 may return to buy in session 2 (assigned to control).
- This misattributes the positive effect of the treatment to the control group.
Simulation results: when sessions are genuinely independent, the A/B test correctly detects the effect 100% of the time. When sessions are not independent (carry-over), it correctly detects the effect only 90% of the time and underestimates the effect size by ~30% (it measures 1.4% instead of 2%).
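The carry-over mechanism can be illustrated with a toy simulation. This is not the simulation referenced above and does not reproduce its numbers; it is a sketch under stated assumptions: every user has exactly two sessions, the metric is per-session conversion rate rather than ARPU, and once a user has seen the treatment the lift persists into later sessions regardless of assignment. The function name, base rate, and lift are hypothetical.

```python
import random

VARIANTS = ["control", "treatment"]

def simulate(n_users, unit, carry_over, base=0.10, lift=0.05, seed=0):
    """Measured conversion-rate difference (treatment minus control) per session.

    unit: "session" re-randomizes every session, "user" fixes one variant per user.
    carry_over: if True, a user who has seen the treatment keeps the lift afterwards.
    """
    rng = random.Random(seed)
    conversions = {v: 0 for v in VARIANTS}
    sessions = {v: 0 for v in VARIANTS}
    for _ in range(n_users):
        user_variant = rng.choice(VARIANTS)
        exposed = False
        for _session in range(2):  # assumption: two sessions per user
            variant = user_variant if unit == "user" else rng.choice(VARIANTS)
            if variant == "treatment":
                exposed = True
            lifted = exposed if carry_over else variant == "treatment"
            p = base + lift if lifted else base
            sessions[variant] += 1
            conversions[variant] += int(rng.random() < p)
    return (conversions["treatment"] / sessions["treatment"]
            - conversions["control"] / sessions["control"])

session_diff = simulate(100_000, unit="session", carry_over=True)
user_diff = simulate(100_000, unit="user", carry_over=True)
print(f"session-level: {session_diff:.4f}, user-level: {user_diff:.4f}")
```

Under these assumptions, half of the second sessions assigned to control belong to users already exposed to the treatment, so the control rate is inflated and session-level randomization recovers only about three quarters of the true lift in expectation, while user-level randomization recovers all of it.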
Solution
- User-level randomization: each user always sees the same variant across all sessions
- Guardrail metrics: add earlier-funnel KPIs (e.g., Average Added Basket Value Per User) alongside ARPU to detect assignment persistence issues
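One common way to implement user-level randomization is to hash a stable user identifier (for example, a cookie ID) into a bucket, so the same user always lands in the same variant without storing any assignment. The sketch below assumes such an identifier is available; the function name, the experiment key, and the 50/50 split are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to a variant for a given experiment."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# The same user gets the same variant in every session.
first_session = assign_variant("cookie-123", "vector-search-test")
later_session = assign_variant("cookie-123", "vector-search-test")
```

Because the assignment depends only on the identifier and the experiment key, it persists across sessions as long as the cookie does, which is also where the approach breaks down (cleared cookies, multiple devices).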
Key Rule
If sessions might not be independent, randomize at user level. Cookie-based user-level randomization is imperfect but far better than session-level for purchase funnels.