Mirror, Mirror: All About Search Suggestions

A deep technical dive into autosuggest data structures and implementations, from naive list to production-grade FSTs.

Data Structures Compared

1. Sorted List

Simple O(n) prefix scan. Works for tiny datasets, scales badly.

2. Radix Tree (Adaptive)

Prefix lookups in O(k) where k = query length. Memory-efficient — common prefixes stored once. Java: concurrent-trees library.

Problem: No weighted ordering — responses sorted alphabetically, not by relevance.

3. Pruning Radix Trie

Stores maximum weight of subtree at each internal node. Enables early exit: skip entire subtrees whose max weight is below the k-th best result found so far. Result: sorted-by-weight top-k in O(k × depth) instead of retrieving all matches first.

Created by Wolf Garbe (SymSpell author). Java implementation: JPruningRadixTrie.

4. Weighted FST (Lucene WFST)

WFSTCompletionLookup in Lucene’s suggest package. Immutable but serializable; slightly faster than pruning radix trie for longer prefixes.

5. Elasticsearch

completion suggester: pure prefix, fast, supports contexts (category filter)
search_as_you_type field type: regular search with block-max WAND optimization; supports analyzers, synonyms, LTR plugin, re-scoring

Typo Handling

FuzzySuggester (Lucene): Levenshtein automata; max edit distance 2
Phonetic analyzer: sounds-like matching (ColognePhonetic); good for names
SymSpell: fastest known approach for spelling correction; standalone implementations

Ranking Weights

Weight sources: search logs (successful queries), purchase data, recency signals, personalization. LTR + rescoring stage after initial retrieval is the production-grade approach.

Key Takeaway

Two-stage pipeline: fast pruning radix trie or WFST for top-N retrieval → ML rescoring (XGBoost via Vespa or LTR plugin via Elasticsearch) for personalized/recency-aware ranking.

Awesome Search KG

Explorer

Mirror Mirror - All About Search Suggestions