Zalando - Self-DoS via Facet Aggregation

Problem

On a normal Sunday, Zalando’s Elasticsearch cluster became sluggish and then unresponsive. Queries took seconds instead of milliseconds, and users saw “0 results found” on filter pages. The system that feeds both catalog search and the AI assistant (Zalando Assistant) was effectively down. This was not just a technical failure: campaigns went dark, and the AI assistant could not fetch products.

Architecture

Base Search (Elasticsearch)
    ↓ lexical + vector retrieval
NER Query Builder
    ↓ entity recognition → implicit filters; probes Base Search for product counts
Catalog API
    ↓ fan-out, A/B tests, caching, redirect decisions
Search API
    ↓ Algorithm Gateway (ML re-ranking) + Promotions bidding

Each layer has its own cache. ES coordinator nodes provide an additional caching layer.

Facet queries (brand, size, color, price buckets) are issued separately from result queries on every search, so each search carries an aggregation-heavy request by design.
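To make the difference between the two request types concrete, here is a minimal sketch of what a result query and a facet query might look like as Elasticsearch request bodies. The field names (`title`, `brand`, `size`, `color`, `price`), the bucket sizes, and the helper function names are assumptions for illustration; the source does not show Zalando's actual queries.

```python
from typing import Any, Dict


def build_result_query(text: str, page_size: int = 24) -> Dict[str, Any]:
    """Document retrieval: asks for hits only, no aggregations."""
    return {
        "size": page_size,
        "query": {"match": {"title": text}},  # "title" is a hypothetical field
    }


def build_facet_query(text: str) -> Dict[str, Any]:
    """Facet request: asks for zero hits and one aggregation per facet."""
    return {
        "size": 0,  # no documents returned, only aggregation buckets
        "query": {"match": {"title": text}},
        "aggs": {
            # Hypothetical facet fields; every shard computes partial
            # buckets that the coordinator node then has to merge.
            "brand": {"terms": {"field": "brand", "size": 50}},
            "size": {"terms": {"field": "size", "size": 50}},
            "color": {"terms": {"field": "color", "size": 50}},
            "price": {"histogram": {"field": "price", "interval": 25}},
        },
    }
```

The two bodies share the same `query` clause, which is why the facet request is easy to overlook as a separate load class: it looks like the same search, but the work it triggers is of a different kind.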

Root Cause

Pathological interaction between load and facet queries:

Facet aggregation queries are structurally different from document retrieval: they put disproportionate pressure on ES coordinator nodes, which have to merge the partial aggregation results from every shard
Under high load, caches across the tiers missed at the same time
Cache expiry storms: when cache TTLs aligned, entries expired together, and the resulting misses flooded ES with expensive aggregation queries all at once
The NER system’s habit of probing Base Search for product counts (to decide whether a filter is safe to apply) added hidden query load that amplified the cascade

The coordinator nodes became the bottleneck — not the data nodes, not the network.
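The effect of aligned TTLs can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of Zalando's system: all keys are assumed to be written at the same moment (for example after a cache flush), and the key count, TTL, and jitter window are made-up numbers.

```python
import random
from collections import Counter


def peak_expiries(num_keys: int, ttl_s: int, jitter_s: int, seed: int = 0) -> int:
    """Return the largest number of keys that expire within the same second.

    Every key is assumed to be written at t=0, the worst case for aligned
    TTLs. With jitter_s > 0 each key gets a random extra lifetime of
    0..jitter_s seconds.
    """
    rng = random.Random(seed)
    expiry_counts = Counter(
        ttl_s + (rng.randint(0, jitter_s) if jitter_s else 0)
        for _ in range(num_keys)
    )
    return max(expiry_counts.values())


# Hypothetical numbers: 10,000 cached facet responses with a 5-minute TTL.
aligned = peak_expiries(num_keys=10_000, ttl_s=300, jitter_s=0)
jittered = peak_expiries(num_keys=10_000, ttl_s=300, jitter_s=60)
print(f"aligned TTLs:  {aligned} keys expire in the worst second")
print(f"jittered TTLs: {jittered} keys expire in the worst second")
```

Without jitter every key expires in the same second, so every subsequent request for those keys becomes an aggregation query against ES at once. With a 60-second jitter window the same expiries are spread over roughly a minute.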

Key Lessons

Facet queries are a hidden stress test — they’re aggregation-heavy and bypass the result cache
In monitoring, track fan-out and aggregation load separately from result-retrieval load
Cache stampede protection (jitter on TTL, distributed lock on expensive recomputes) is critical at scale
A layered architecture means “search is slow” can have many root causes — thorough per-layer instrumentation is required to trace the actual bottleneck
The AI assistant’s dependency on search made the blast radius much larger than a pure UX failure
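The two stampede protections named above, TTL jitter and a lock around expensive recomputes, can be sketched together in a few lines. This is an in-process illustration only: `threading.Lock` stands in for the distributed lock a multi-node cache tier would need, and the class name, the `jitter_fraction` parameter, and the cache key format are all hypothetical.

```python
import random
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class SingleFlightCache:
    """Cache with jittered TTLs where only one caller recomputes a missing key."""

    def __init__(self, base_ttl_s: float, jitter_fraction: float = 0.1) -> None:
        self._base_ttl_s = base_ttl_s
        self._jitter_fraction = jitter_fraction
        self._entries: Dict[str, Tuple[float, object]] = {}  # key -> (expiry, value)
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects _key_locks

    def _jittered_ttl(self) -> float:
        # Spread expiries over +/- jitter_fraction of the base TTL.
        spread = self._base_ttl_s * self._jitter_fraction
        return self._base_ttl_s + random.uniform(-spread, spread)

    def _fresh(self, key: str) -> Optional[Tuple[float, object]]:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key: str, compute: Callable[[], object]) -> object:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another caller may have refilled the key while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = compute()  # the expensive aggregation query would go here
            self._entries[key] = (time.monotonic() + self._jittered_ttl(), value)
            return value
```

With this shape, concurrent misses on the same key collapse into a single call to `compute`; the other callers wait on the per-key lock and then read the freshly stored value. A production version would also need a lock timeout and a policy for what waiters see if the recompute fails, which this sketch leaves out.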

What to Steal

Treat facet/aggregation queries as a distinct load class in capacity planning
Add cache stampede protection (probabilistic early expiry or lock-based recompute) wherever aggregation queries share a cache tier
Monitor coordinator node saturation separately from data node saturation in Elasticsearch
Instrument NER and other query-expansion systems for the hidden query load they generate
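Probabilistic early expiry, mentioned above as one stampede protection, can be sketched as a single decision function. The idea is that each reader occasionally treats a still-valid entry as expired, with a probability that rises as the real expiry approaches, so one reader refreshes the entry before everyone misses at once. The function name, parameters, and the `beta` default below are assumptions for illustration, not something the source specifies.

```python
import math
import random
from typing import Callable


def should_refresh_early(
    now_s: float,
    expiry_s: float,
    recompute_cost_s: float,
    beta: float = 1.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether this reader should recompute the entry before it expires.

    recompute_cost_s is how long the last recompute took; expensive entries
    (such as facet aggregations) get refreshed earlier. beta > 1 makes early
    refresh more eager, beta < 1 less eager.
    """
    u = 1.0 - rng()  # in (0, 1], so math.log never sees 0
    head_start_s = -recompute_cost_s * beta * math.log(u)  # always >= 0
    return now_s + head_start_s >= expiry_s


# Usage: an entry that expires at t=100 and took 2 s to compute last time.
if should_refresh_early(now_s=97.0, expiry_s=100.0, recompute_cost_s=2.0):
    pass  # recompute and store the entry with a new expiry
```

Unlike the lock-based approach, this needs no coordination between readers, which makes it attractive when several services share one cache tier. The trade-off is that it is probabilistic: occasionally two readers refresh the same entry, and occasionally none does before expiry.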
