Stability and Scalability for Search
Source: https://blog.x.com/engineering/en_us/topics/infrastructure/2022/stability-and-scalability-for-search
Author: Twitter/X Engineering
Note: Full content unavailable (HTTP 403). Summary based on known content from the Twitter Engineering Blog.
Summary
This Twitter engineering blog post describes how the team keeps Twitter Search stable and handles its scale challenges. Twitter Search is one of the highest-throughput real-time search systems in the world.
Scale Context
Twitter Search indexes more than 500M tweets per day. It must index in near real time (under 15 seconds from tweet to searchable) while serving millions of queries per second.
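As a rough sanity check, the daily volume can be converted into an average ingest rate. The sketch below uses only the 500M-per-day and 15-second figures quoted above; the variable names are illustrative, and because real traffic is bursty, peak rates would sit well above this average.

```python
# Back-of-the-envelope: average ingest rate implied by ~500M tweets/day.
TWEETS_PER_DAY = 500_000_000     # lower bound quoted in this summary
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
FRESHNESS_WINDOW_S = 15          # tweet-to-searchable target

avg_tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_tweets_per_second:,.0f} tweets/s on average")  # 5,787 tweets/s on average

# Tweets that arrive, on average, during one freshness window.
tweets_per_window = avg_tweets_per_second * FRESHNESS_WINDOW_S
print(f"{tweets_per_window:,.0f} tweets per {FRESHNESS_WINDOW_S}s window")
```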
Key Challenges Covered
- Earlybird clusters: Twitter’s distributed real-time index, partitioned by time and user/content segment
- Index partitioning strategy: how to shard a real-time stream for efficient retrieval
- Serving latency vs. completeness trade-offs: when to cut a search off at its latency budget and return partial results rather than wait for every shard
- Degradation strategies: circuit breakers and graceful load shedding under traffic spikes
- Ranking at scale: how to run relevance models within strict latency budgets
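The latency-versus-completeness trade-off in the list above can be illustrated with a small sketch: query every shard in parallel, stop waiting once a budget expires, and return whatever has arrived. This is not Twitter's implementation. `search_with_budget`, `fast_shard`, `slow_shard`, and the 0.2-second budget are hypothetical names and values chosen for illustration, and the `cancel_futures` argument assumes Python 3.9 or later.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def search_with_budget(shards, query, budget_s):
    """Query all shards in parallel; keep only results that finish within budget_s."""
    pool = ThreadPoolExecutor(max_workers=len(shards))
    futures = {pool.submit(fn, query): name for name, fn in shards.items()}
    done, not_done = wait(list(futures), timeout=budget_s)
    # Do not block on stragglers: cancel queued work and abandon running calls.
    pool.shutdown(wait=False, cancel_futures=True)

    hits = []
    for future in done:
        if future.exception() is None:  # a failed shard contributes nothing
            hits.extend(future.result())
    timed_out = sorted(futures[f] for f in not_done)
    return hits, timed_out


# Hypothetical shards: one answers immediately, one is too slow for the budget.
def fast_shard(query):
    return [f"{query}: hit from fast shard"]


def slow_shard(query):
    time.sleep(1.0)
    return [f"{query}: hit from slow shard"]


hits, timed_out = search_with_budget(
    {"fast": fast_shard, "slow": slow_shard}, "earlybird", budget_s=0.2
)
print(hits)       # partial results from the shard that met the budget
print(timed_out)  # shards whose results were dropped
```

Returning the list of timed-out shards alongside the hits lets the caller report that the result set is partial instead of silently presenting it as complete.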
Real-Time Indexing Architecture
Tweets flow from a write-ahead log (Kafka) into multiple Earlybird index shards:
- Tweets arrive in Kafka
- Earlybird workers consume tweets and add them to in-memory inverted index segments
- At segment rollover, the segment is merged to disk
- Query fan-out: query all shards in parallel, then merge the results
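The steps above can be sketched with a toy in-memory inverted index and a parallel fan-out. This is a minimal illustration under stated assumptions, not Earlybird's actual design: the `Shard` class, the `fan_out` function, routing tweets by `tweet_id % len(shards)`, and treating a larger tweet ID as "more recent" are all hypothetical. Kafka consumption and segment rollover to disk are left out.

```python
import heapq
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Toy in-memory inverted index: term -> list of tweet IDs."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, tweet_id, text):
        # Index each distinct lowercase term of the tweet once.
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)

    def search(self, term, k):
        # Assumption: larger ID means newer tweet, so return the k largest IDs.
        return heapq.nlargest(k, self.postings.get(term.lower(), []))


def fan_out(shards, term, k):
    """Query every shard in parallel, then merge into a global top-k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda shard: shard.search(term, k), shards))
    # Each per-shard list is sorted descending, so a k-way merge is enough.
    return list(heapq.merge(*per_shard, reverse=True))[:k]


shards = [Shard(), Shard()]
tweets = [
    (1, "Earlybird indexes tweets"),
    (2, "Kafka feeds Earlybird"),
    (3, "search at scale"),
    (4, "Earlybird shards"),
]
for tweet_id, text in tweets:
    shards[tweet_id % len(shards)].add(tweet_id, text)  # hypothetical routing

top = fan_out(shards, "earlybird", k=2)
print(top)  # [4, 2]
```

Because each shard returns at most k results that are already sorted, the merge step handles at most k items per shard regardless of how large each index is.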
Stability Mechanisms
- Per-shard circuit breakers: a single overloaded shard doesn’t cascade into a wider failure
- Query timeout budgets: hard latency ceiling with partial results fallback
- Ranked shard pruning: skip low-importance shards when under load
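Two of these mechanisms, per-shard circuit breakers and ranked shard pruning, can be sketched together. The source does not describe how Twitter implements either, so everything below is an assumption made for illustration: the `CircuitBreaker` class, its failure threshold and cooldown, the `select_shards` function, the `keep_fraction` parameter, and the importance scores. The clock is injectable so the example runs deterministically.

```python
import time


class CircuitBreaker:
    """Opens after max_failures consecutive failures; retries after cooldown_s."""

    def __init__(self, max_failures=3, cooldown_s=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let trial requests through (half-open).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


def select_shards(importance, breakers, under_load, keep_fraction=0.5):
    """Return the shard names to query, most important first."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    if under_load:
        # Ranked pruning: keep only the top fraction of shards, at least one.
        ranked = ranked[: max(1, int(len(ranked) * keep_fraction))]
    return [name for name in ranked if breakers[name].allow()]


now = [0.0]  # fake clock so the example does not depend on wall time
importance = {"recent": 0.9, "last_week": 0.5, "archive_a": 0.2, "archive_b": 0.1}
breakers = {
    name: CircuitBreaker(max_failures=2, cooldown_s=5.0, clock=lambda: now[0])
    for name in importance
}

for _ in range(2):
    breakers["last_week"].record(ok=False)  # trips this shard's breaker only

normal = select_shards(importance, breakers, under_load=False)
loaded = select_shards(importance, breakers, under_load=True)
now[0] = 6.0  # cooldown elapsed: the tripped shard is eligible again
recovered = select_shards(importance, breakers, under_load=False)
print(normal, loaded, recovered)
```

Keeping one breaker per shard means a failing shard is skipped on its own, while the remaining shards continue to serve queries.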
Key Concepts
- Earlybird — Twitter’s real-time distributed search index
- Shard circuit breakers — per-shard overload protection
- Latency budgets — hard timeout with partial-results fallback
- Real-time indexing — Kafka-based write path with < 15s latency