Airbnb — ML-Powered Experiences Ranking

Problem

Experiences (tours, classes, activities hosted by locals) launched with heuristic ranking. The team needed to rank results by likelihood of booking — accounting for the specific guest’s preferences, trip context, and experience quality signals — not just static popularity.

Four-Stage Progression

Each stage built on the previous and was measured separately via A/B test.

Stage 1 — Baseline GBDT (+13% bookings)

Gradient Boosted Decision Trees, binary classification (booked/not). 50,000 labeled examples. Features: duration, price, review count, booking velocity, occupancy.

Stage 2 — Personalization (+7.9%)

Two new dimensions:

  • Context from booked homes: trip dates, distance from accommodation to experience
  • User click history: Category Intensity = weighted sum of clicks with recency decay; time-of-day preferences

Offline pre-computation: rankings pre-computed for 1M+ active users daily to keep latency manageable.

Stage 3 — Online Scoring (+5.1%)

Switched to real-time inference to include query-time features unavailable offline:

  • Distance to entered location
  • Guest count
  • Browser language
  • Origin-destination travel patterns

2M+ training examples, 90 features total.

Stage 4 — Business Rules (+2.2%)

Up-weighted training examples to promote business goals:

  • Quality: 5-star rebooking experiences weighted 1.5x
  • Emerging hits: new experiences with rapid booking velocity +14% boost
  • Category diversity: penalize showing all experiences in same category +2.3%

Cumulative total: ~28% booking improvement across all four stages.

Key Lessons

  • Incremental ML adoption works: each stage had clear scope, clear metric, and was independently validated
  • Personalization requires deliberately designed features — generic click signals need domain transformation (Category Intensity, not raw clicks)
  • Offline pre-computation is a practical first step for personalization; accept staleness in exchange for latency
  • Business rules embedded in training data weighting are more robust than score overrides — they generalize across queries rather than requiring per-query configuration
  • Diversity must be explicitly rewarded, not assumed to emerge from relevance signals

What to Steal

  • Category Intensity feature formula: weighted click sum with recency decay — applicable to any personalized ranking
  • Stage gating pattern: validate each ML addition independently before adding the next layer
  • Business rule via training data weighting, not post-hoc score adjustments