Stability and Scalability for Search

Source: https://blog.x.com/engineering/en_us/topics/infrastructure/2022/stability-and-scalability-for-search Author: Twitter/X Engineering

Note: Full content unavailable (HTTP 403). Summary based on known content from the Twitter Engineering Blog.

Summary

Twitter’s engineering blog post on how they maintain stability and handle scale challenges for Twitter Search — one of the highest-throughput real-time search systems in the world.

Scale Context

Twitter Search indexes roughly 500M+ tweets per day, with requirements for near-real-time indexing (< 15 seconds from tweet to searchable) and serving millions of queries per second.

Key Challenges Covered

  • Earlybird clusters: Twitter’s distributed real-time index, partitioned by time and user/content segment
  • Index partitioning strategy: how to shard a real-time stream for efficient retrieval
  • Serving latency vs. completeness trade-offs: when to truncate search at latency budget vs. returning fewer results
  • Degradation strategies: circuit breakers, shedding load gracefully under traffic spikes
  • Ranking at scale: how to run relevance models with strict latency budgets

Real-Time Indexing Architecture

Twitter uses a write-ahead log (Kafka) → multiple Earlybird index shards:

  1. Tweets arrive in Kafka
  2. Earlybird workers consume tweets and add to in-memory inverted index segments
  3. At segment rollover, merge to disk
  4. Query fan-out: query all shards in parallel, merge results

Stability Mechanisms

  • Per-shard circuit breakers: a single overloaded shard doesn’t cascade
  • Query timeout budgets: hard latency ceiling with partial results fallback
  • Ranked shard pruning: skip low-importance shards when under load

Key Concepts

  • Earlybird — Twitter’s real-time distributed search index
  • Shard circuit breakers — per-shard overload protection
  • Latency budgets — hard timeout with partial-results fallback
  • Real-time indexing — Kafka-based write path with < 15s latency