Search Architecture

End-to-end design of a production search system, covering ingestion, indexing, retrieval, ranking, and serving layers.

Canonical Pipeline

Query Input
    ↓
Query Understanding (normalization, intent, segmentation)
    ↓
Retrieval (BM25 / ANN / Hybrid)
    ↓
First-Stage Ranking (fast scoring)
    ↓
Reranking (cross-encoder / LTR model)
    ↓
Post-processing (dedup, diversity, personalization)
    ↓
Results Served
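
The stages above can be sketched as a chain of functions, each narrowing the candidate set before a more expensive stage runs. This is a toy illustration only: term overlap stands in for BM25 / ANN retrieval, a length-normalized overlap score stands in for a cross-encoder, and every function name and cutoff (retrieve_k, rank_k, final_k) is a hypothetical choice, not part of any system described here.

```python
from typing import List, Tuple

Doc = Tuple[str, str]  # (doc_id, text)


def understand(query: str) -> List[str]:
    """Query understanding: toy normalization (lowercase, whitespace split)."""
    return query.lower().split()


def retrieve(terms: List[str], corpus: List[Doc], k: int) -> List[Doc]:
    """Retrieval: keep docs sharing at least one term (stand-in for BM25 / ANN)."""
    term_set = set(terms)
    hits = [d for d in corpus if term_set & set(d[1].lower().split())]
    return hits[:k]


def first_stage(terms: List[str], docs: List[Doc], k: int) -> List[Doc]:
    """First-stage ranking: cheap score = number of distinct query terms matched."""
    term_set = set(terms)
    ranked = sorted(
        docs,
        key=lambda d: len(term_set & set(d[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def rerank(terms: List[str], docs: List[Doc], k: int) -> List[Doc]:
    """Reranking: stand-in for a cross-encoder / LTR model (overlap / doc length)."""
    term_set = set(terms)

    def score(d: Doc) -> float:
        words = d[1].lower().split()
        return len(term_set & set(words)) / max(len(words), 1)

    return sorted(docs, key=score, reverse=True)[:k]


def postprocess(docs: List[Doc]) -> List[Doc]:
    """Post-processing: drop documents whose normalized text was already seen."""
    seen, out = set(), []
    for d in docs:
        key = d[1].lower()
        if key not in seen:
            seen.add(key)
            out.append(d)
    return out


def search(query: str, corpus: List[Doc],
           retrieve_k: int = 100, rank_k: int = 20, final_k: int = 5) -> List[Doc]:
    """Run the stages in order, shrinking the candidate set at each step."""
    terms = understand(query)
    candidates = retrieve(terms, corpus, retrieve_k)
    candidates = first_stage(terms, candidates, rank_k)
    candidates = rerank(terms, candidates, final_k)
    return postprocess(candidates)
```

The shrinking cutoffs are the point of the design: the cheap stages see many candidates and the expensive reranker sees few.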

Ingestion Side

Raw Content
    ↓
Parsing / Normalization
    ↓
Chunking (if needed)
    ↓
Embedding (if dense)
    ↓
Index (inverted / ANN / both)
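
A minimal sketch of the ingestion path for the sparse (inverted index) case, assuming whitespace tokenization and fixed-size word windows with overlap. The function names, the chunk size, and the overlap are illustrative assumptions. The embedding step is omitted because it depends on the chosen model.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Posting = Tuple[str, int]  # (doc_id, chunk_number)


def normalize(raw: str) -> str:
    """Parsing / normalization: lowercase and collapse runs of whitespace."""
    return " ".join(raw.lower().split())


def chunk(text: str, size: int = 5, overlap: int = 1) -> List[str]:
    """Chunking: split into windows of `size` words sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def build_inverted_index(docs: Dict[str, str],
                         size: int = 5,
                         overlap: int = 1) -> Dict[str, Set[Posting]]:
    """Index: map each term to the set of (doc_id, chunk_number) containing it."""
    index: Dict[str, Set[Posting]] = defaultdict(set)
    for doc_id, raw in docs.items():
        for i, piece in enumerate(chunk(normalize(raw), size, overlap)):
            for term in piece.split():
                index[term].add((doc_id, i))
    return index
```

Overlapping windows mean a term near a chunk boundary appears in both neighboring chunks, which trades a slightly larger index for fewer misses at the boundaries.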

Key Design Decisions

Decision              Options
Retrieval strategy    BM25, dense, hybrid, sparse
Index type            Inverted, HNSW, IVF, Flat
Reranker              Cross-encoder, ColBERT, LLM-based
Fusion                RRF, linear combination, learned
Personalization       User embeddings, contextual signals
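
Of the fusion options, reciprocal rank fusion (RRF) is the simplest to illustrate: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below assumes k = 60, a commonly used default, and breaks ties by document ID so the output is deterministic. The function name and the tie-breaking rule are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc IDs into one list using RRF.

    Each list contributes 1 / (k + rank) for a document at 1-based `rank`;
    documents missing from a list contribute nothing for that list.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc ID breaks ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_results = ["a", "b", "c"]   # hypothetical lexical ranking
dense_results = ["c", "a", "d"]  # hypothetical ANN ranking
fused = reciprocal_rank_fusion([bm25_results, dense_results])
```

Because RRF uses only ranks, it needs no score calibration between BM25 and dense similarity, which is the main reason to prefer it over a linear combination when the score scales differ.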

Latency Budget

Multi-stage pipelines must budget latency across stages:

  • Retrieval: ~50ms
  • Reranking (top 100): ~50–100ms
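  • Total P99: <200ms end to end for most interactive UX, which leaves roughly 50ms for the remaining stages and network overhead when reranking uses its full 100ms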
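
One way to enforce such a budget is to track elapsed time per request and skip or shrink an expensive stage when the remaining budget cannot cover it. The sketch below is a minimal, assumed design: the LatencyBudget class, its method names, and the injectable clock are hypothetical, and the stage figures simply restate the numbers listed above.

```python
import time
from typing import Callable, Dict

# Per-stage allowances in milliseconds, taken from the list above.
STAGE_BUDGET_MS: Dict[str, float] = {"retrieval": 50.0, "reranking": 100.0}
TOTAL_BUDGET_MS = 200.0


class LatencyBudget:
    """Tracks how much of a per-request latency budget is left."""

    def __init__(self, total_ms: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock          # injectable so tests can use a fake clock
        self._start = clock()
        self.total_ms = total_ms

    def elapsed_ms(self) -> float:
        return (self._clock() - self._start) * 1000.0

    def remaining_ms(self) -> float:
        return max(0.0, self.total_ms - self.elapsed_ms())

    def can_afford(self, stage_ms: float) -> bool:
        """True if the remaining budget covers a stage's allowance."""
        return self.remaining_ms() >= stage_ms
```

A serving loop could call can_afford(STAGE_BUDGET_MS["reranking"]) after retrieval and fall back to the first-stage ordering when it returns False, which degrades relevance instead of breaching the P99 target.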

Industry Implementations

  • Canva: two-phase search pipeline (Part I: retrieval, Part II: ranking)

  • Carousell: migrated from keyword search to dense vector retrieval with Elasticsearch

  • Slack: full-text + entity-aware search at scale

  • Netflix: federated graph-based search — see Knowledge Graph Search

  • Uber Eats: geosharding (H3 hex grid), document layout optimization (60% latency cut), ETD range indexing — see Optimizing Search at Uber Eats

  • Zalando: layered architecture (Base Search → NER → Catalog API → Search API); self-DoS via facet aggregation — see The Day Our Own Queries DoSed Us - Zalando Search