Autosuggest Retrieval Data Structures & Algorithms

Source: https://medium.com/@dtunkelang/autosuggest-retrieval-data-structures-algorithms Author: Giovanni Fernandez-Kincade

Summary

A deep dive into the data structures and algorithms used to retrieve autocomplete candidates given a prefix (or fuzzy match). Part 1 of a two-part series; Part 2 covers ranking.

The Core Problem

Given a user typing “sho”, return the top candidate completions (e.g., “shoes”, “shorts”, “shopping bags”) in under 50ms. The challenge is both algorithmic (fast prefix lookup) and scale (billions of query candidates).

Data Structure 1: Trie

The classic prefix tree:

Each node is a character; paths spell out strings
Prefix lookup: O(|prefix|) to find all completions
Problem: memory-intensive for large vocabularies; doesn’t support fuzzy matching

Data Structure 2: FST (Finite State Transducer)

Compressed trie that maps strings to values:

Much more memory-efficient than naive trie
Supports exact prefix lookup and constrained fuzzy matching
Used in Lucene/Elasticsearch for suggest implementations
Trade-off: complex to build incrementally; batch construction is standard

Data Structure 3: DFA (Deterministic Finite Automaton)

Used for fuzzy matching (edit-distance-bounded prefix lookup):

Levenshtein DFA accepts all strings within edit distance k of the prefix
Intersect Levenshtein DFA with the suggestion trie to get fuzzy completions
Enables “sho” to match “shoe” (deletion) or “shos” to match “shoes” (substitution)

Data Structure 4: N-gram Index

Index all character n-grams from suggestion strings:

“shoes” → {“sho”, “hoe”, “oes”, “shoe”, “hoes”, “shoes”, …}
Lookup by prefix n-gram, union results
Fast for small n; degrades for large vocabularies
Good for infix matching (“oes” finds “shoes”)

Data Structure 5: Edge N-grams

Elasticsearch’s built-in approach:

Index prefixes at each edge: “shoes” → [“s”, “sh”, “sho”, “shoe”, “shoes”]
Standard inverted index lookup for the current prefix
Simple and fast; no fuzzy support; larger index size

Memory vs. Computation Trade-offs

Structure	Memory	Speed	Fuzzy	Build Complexity
Trie	High	Fast	No	Low
FST	Low	Fast	Limited	High
Levenshtein DFA	Medium	Medium	Yes	Medium
N-gram index	High	Fast	Yes	Low
Edge N-grams	Medium	Fast	No	Low

Key Concepts

Trie — character-level prefix tree for exact prefix lookup
FST — compressed trie with efficient memory use
Levenshtein DFA — fuzzy prefix matching via edit-distance automaton
N-gram index — character n-gram indexing for flexible matching
Edge N-grams — simple prefix indexing in inverted index

People

Giovanni Fernandez-Kincade

Awesome Search KG

Explorer

Autosuggest Retrieval Data Structures & Algorithms

Autosuggest Retrieval Data Structures & Algorithms

Summary

The Core Problem

Data Structure 1: Trie

Data Structure 2: FST (Finite State Transducer)

Data Structure 3: DFA (Deterministic Finite Automaton)

Data Structure 4: N-gram Index

Data Structure 5: Edge N-grams

Memory vs. Computation Trade-offs

Key Concepts

People

Graph View

Table of Contents

Backlinks

Awesome Search KG

Explorer

Autosuggest Retrieval Data Structures & Algorithms

Autosuggest Retrieval Data Structures & Algorithms

Summary

The Core Problem

Data Structure 1: Trie

Data Structure 2: FST (Finite State Transducer)

Data Structure 3: DFA (Deterministic Finite Automaton)

Data Structure 4: N-gram Index

Data Structure 5: Edge N-grams

Memory vs. Computation Trade-offs

Key Concepts

Related Concepts

Related Articles

People

Graph View

Table of Contents

Backlinks