LLM-Powered Query Extraction for Auto-Complete and Auto-Suggest
Author: David Albrecht
Dataset: WANDS (43k Wayfair product descriptions), processed with GPT-4.1-nano for <$5
Problem
Three autocomplete/autosuggest challenges:
- Cold-start: no search activity to draw from
- Query-document mismatch: users search for what they want, not what matches document text
- Short phrases: traditional NLP (SpaCy) produces short, underwhelming phrases
LLM as Query Extractor
Use an LLM to generate 10 search queries per product description. The LLM uses the document creatively — not just extracting existing words.
Example input:
Product Name: trestle coffee table
Description: it is an elegant french design...
Example output:
['French design coffee table', 'Elegant rectangular coffee table',
'Designer living room table', 'Nautical map bar furniture']These are novel, user-like phrases impossible to extract with traditional NLP.
Cost: parallel batch with GPT-4.1-nano, several hours for 43k products.
Auto-Complete
Prefix index built from extracted phrases. At query time, return all phrases starting with prefix (sorted shortest-first).
autocomplete('bed')
# ['bed riser blocks', 'bed security pad', 'bedroom decor rug', ...]Re-ranking by semantic similarity is a suggested improvement.
Auto-Suggest
Embedding similarity over all extracted phrases. Acts as a semantic fallback when auto-complete has no matches.
autosuggest('nautical furniture')
# [('furniture with nautical elements', 0.963), ('nautical coastal furniture for living room', 0.887), ...]Implemented with sentence-transformers/all-MiniLM-L6-v2, normalized cosine similarity.
Key Insight
Rather than waiting for user behavior to build suggestion corpora, use the document collection itself to simulate diverse, realistic queries. Solves cold-start while producing richer phrases than statistical NLP methods.
Practical Notes
- Online learner (contextual bandit) suggested for managing which completions/suggestions perform over time
- Click data / conversion filtering can clean up low-quality LLM phrases
- For production: use search engine’s native prefix capabilities, not this toy prefix index