Search Architecture
End-to-end design of a production search system, covering ingestion, indexing, retrieval, ranking, and serving layers.
Canonical Pipeline
Query Input
↓
Query Understanding (normalization, intent, segmentation)
↓
Retrieval (BM25 / ANN / Hybrid)
↓
First-Stage Ranking (fast scoring)
↓
Reranking (cross-encoder / LTR model)
↓
Post-processing (dedup, diversity, personalization)
↓
Results Served
Ingestion Side
Raw Content
↓
Parsing / Normalization
↓
Chunking (if needed)
↓
Embedding (if dense)
↓
Index (inverted / ANN / both)
Key Design Decisions
| Decision | Options |
|---|---|
| Retrieval strategy | BM25, dense, hybrid, sparse |
| Index type | Inverted, HNSW, IVF, Flat |
| Reranker | Cross-encoder, ColBERT, LLM-based |
| Fusion | RRF, linear combination, learned |
| Personalization | User embeddings, contextual signals |
Latency Budget
Multi-stage pipelines must budget latency across stages:
- Retrieval: ~50ms
- Reranking (top 100): ~50–100ms
- Total P99: <200ms for most UX
Industry Implementations
-
Canva: two-phase search pipeline (Part I: retrieval, Part II: ranking)
-
Carousell: migrated from keyword to dense vector with Elasticsearch
-
Slack: full-text + entity-aware search at scale
-
Netflix: federated graph-based search — see Knowledge Graph Search
-
Uber Eats: geosharding (H3 hex grid), document layout optimization (60% latency cut), ETD range indexing — see Optimizing Search at Uber Eats
-
Zalando: layered architecture (Base Search → NER → Catalog API → Search API); self-DoS via facet aggregation — see The Day Our Own Queries DoSed Us - Zalando Search