What AI Engineers Should Know about Search
Doug Turnbull provides a 58-point primer for AI engineers entering the search domain, covering lexical search fundamentals (BM25, tokenization, phrase search), relevance evaluation and judgment methodology, query understanding pipelines, and Learning to Rank. The primer is written to bridge the gap between LLM/RAG practitioners and classical IR knowledge.
Core Argument
Two things matter most before any algorithmic sophistication:
- Set up a solid evaluation framework — without it, you can’t know if changes help
- How you assemble the retrieval pipeline matters more than your choice of vector DB
Relevance Evaluation (Points 1–11)
- Labels are called judgments — historically a human “judge” grades (query, result) pairs as relevant or not
- Three labeling approaches, each with biases:
  - Human evaluators: non-expert fatigue, position preference
  - Clickstream data: only the top-N results shown to users get labeled; a click does not always mean the result was relevant
  - LLM judge: cheaper and faster (see LLM as Judge)
- Position Bias / presentation bias: users click what they see; exploit/explore mindset needed to overcome it
- Click Models convert raw clicks into calibrated judgments
- Metrics: NDCG, ERR, MAP, Precision and Recall
- Precision/recall tradeoff: casting a wider net retrieves more of the relevant results but also more noise
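As a rough illustration of how judgments feed a metric, the sketch below computes NDCG and precision@k from a ranked list of graded results. It assumes the common exponential gain (2^grade - 1) and builds the ideal ordering from the grades in the list itself; real evaluations often build the ideal from every judged document for the query. The function names are illustrative, not from the article.

```python
import math

def dcg(grades, k=None):
    """Discounted cumulative gain over grades listed in ranked order."""
    grades = grades[:k] if k is not None else grades
    # Ranks are 1-based; each gain is discounted by log2(rank + 1).
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades, start=1))

def ndcg(grades, k=None):
    """DCG divided by the DCG of the best possible ordering of the same grades."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

def precision_at_k(grades, k):
    """Fraction of the top-k results with a non-zero grade."""
    return sum(1 for g in grades[:k] if g > 0) / k

# Judgments for one query, in the order the engine returned the results.
ranked_grades = [0, 3, 1, 0, 2]
print(ndcg(ranked_grades, k=5), precision_at_k(ranked_grades, k=3))
```

Averaging such per-query scores over a judgment list gives one number to compare before and after a ranking change.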
BM25 and Lexical Scoring (Points 15–29)
- BM25 = TF×IDF with saturation curves for term frequency and field length normalization
- IDF: rarer term → higher weight; each system uses its own non-linear IDF curve
- Parameters k1 (TF saturation speed) and b (field-length penalty) are tunable
- BM25F: per-field term statistics
- Lexical scoring requires care: term rarity in one field ≠ rarity in another
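To make the moving parts concrete, here is a minimal single-field BM25 sketch. It assumes one widely used formulation (a smoothed logarithmic IDF and the classic (k1 + 1) numerator); as noted above, each system uses its own IDF curve, so scores will not match any particular engine exactly. The function name and defaults are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenized docs (lists of tokens) against a list of query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: number of docs that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rarer terms (low df) get a higher IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls how strongly longer-than-average docs are penalized.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            # The tf ratio saturates: repeated terms add less and less.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = [["red", "shoes"], ["red", "red", "hat"], ["blue", "jacket"]]
print(bm25_scores(["red", "shoes"], docs))
```

Setting b=0 turns off length normalization, and a larger k1 lets repeated terms keep adding to the score for longer before saturating.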
Tokenization and Phrase Search (Points 30–36)
- Tokenization choices (n-gram, wordpiece, stemming, punctuation) significantly affect recall
- Lucene token streams are graphs: synonyms and multi-word equivalents (USA → United States of America) are branches at the same position
- Phrase search encodes term positions; phrase queries are also graphs
- Collocations (statistically co-occurring bigrams like “Palo Alto”) as “poor man’s entities”
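The idea that phrase search relies on stored term positions can be sketched with a toy positional index. This assumes whitespace tokenization with lowercasing and exact adjacency; it ignores the graph-shaped token streams (synonyms, multi-word equivalents) that Lucene supports. The helper names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for whitespace-tokenized docs."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return sorted doc ids where the phrase terms sit at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only docs containing every term can possibly match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            # Term i of the phrase must appear exactly i positions after the first.
            if all(start + i + 1 in positions for i, positions in enumerate(later)):
                hits.append(doc_id)
                break
    return hits

docs = ["Palo Alto is in California", "Alto Palo is not a city", "visit palo alto"]
index = build_positional_index(docs)
print(phrase_match(index, "palo alto"))
```

The second document contains both terms but in the wrong order, so a plain AND query would match it while the phrase query does not.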
Query Understanding (Points 37–41)
- Query Understanding covers: category classification, entity extraction, vector mapping
- Query Relaxation: try strict query first; relax (AND → OR or broader terms) on empty/poor results
- Relevance tiers: high-precision matches boosted, lower-precision as fallback
- Filter queries: exclude known irrelevant results explicitly
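Query relaxation and relevance tiers can be sketched together: run the strict query first, and only if it returns too few results fall back to a looser match, keeping strict hits on top. The toy matcher and the `min_results` threshold below are assumptions for illustration; a real system would issue these as separate queries to its search engine.

```python
def search(docs, terms, operator):
    """Toy matcher: ids of docs containing all (AND) or any (OR) of the terms."""
    match = all if operator == "AND" else any
    return [i for i, doc in enumerate(docs)
            if match(t in doc.lower().split() for t in terms)]

def relaxed_search(docs, query, min_results=1):
    """Try AND first; relax to OR when the strict query returns too little."""
    terms = query.lower().split()
    strict = search(docs, terms, "AND")
    if len(strict) >= min_results:
        return "strict", strict
    # Tiering: high-precision AND matches stay on top, OR-only matches follow.
    fallback = [i for i in search(docs, terms, "OR") if i not in strict]
    return "relaxed", strict + fallback

docs = ["red running shoes", "blue running jacket", "red wool hat"]
print(relaxed_search(docs, "red shoes"))
print(relaxed_search(docs, "red jacket"))
```

Returning which tier fired makes it easy to log how often relaxation happens, which is itself a useful signal of coverage gaps.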
Learning to Rank (Points 42–53)
- Learning to Rank applies ML to ranking; different from standard classification/regression
- SVMRank: pairwise approach; a binary SVM trained on (relevant, irrelevant) result pairs for the same query
- LambdaMART (see Learning to Rank): listwise optimization of NDCG via gradient boosting; old but reliable
- Multiple Reranking stages: cheap model for top-1K, expensive cross-encoder for final top-N
- Good LTR features are orthogonal: BM25, embeddings, freshness, popularity all measure different things
- Ranking ≠ similarity: docs that are highly similar to the query can still be spammy or stale
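One way to see why ranking differs from ordinary classification is the pairwise transform behind SVMRank-style methods: judgments are grouped by query, and each pair of differently graded documents becomes one training example whose features are the difference of the two feature vectors. The sketch below shows only this data preparation step, under the assumption that features are plain numeric lists; the function name and tuple layout are illustrative.

```python
from itertools import combinations

def pairwise_examples(judgments):
    """Turn (query_id, grade, features) rows into (feature_diff, label) pairs.

    label is +1 when the first doc of the pair has the higher grade, else -1.
    """
    by_query = {}
    for qid, grade, features in judgments:
        by_query.setdefault(qid, []).append((grade, features))
    examples = []
    for docs in by_query.values():
        # Pairs are only formed within a query, never across queries.
        for (g1, f1), (g2, f2) in combinations(docs, 2):
            if g1 == g2:
                continue  # ties carry no ordering information
            diff = [a - b for a, b in zip(f1, f2)]
            examples.append((diff, 1 if g1 > g2 else -1))
    return examples

# Features here are hypothetical: [bm25_score, freshness].
judgments = [
    ("q1", 1, [2.0, 0.5]),
    ("q1", 0, [1.0, 0.25]),
    ("q2", 1, [0.5, 0.125]),
    ("q1", 1, [3.0, 0.0]),
]
for diff, label in pairwise_examples(judgments):
    print(diff, label)
```

Any binary linear classifier trained on these difference vectors learns weights that can then score single documents, which is what lets a pairwise objective produce a ranking function.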
Signals and Feedback (Points 53–56)
- Signals collection: memorize engaging results per query; boost them when the same query recurs
- Query feedback: learn category/embedding affinity from clicked results (relevance feedback)
- Tools: SearchArray, BM25S
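A minimal sketch of signals collection, assuming clicks are the engagement signal and that a log-damped click count is simply added to the base relevance score. The `SignalStore` class, the additive boost, and the `weight` parameter are all hypothetical choices for illustration; raw click counts also inherit the position bias discussed earlier, so a production system would typically debias them first.

```python
import math
from collections import Counter

class SignalStore:
    """Remembers which results were clicked for each (normalized) query."""

    def __init__(self):
        self.clicks = Counter()

    @staticmethod
    def _normalize(query):
        return query.strip().lower()

    def record_click(self, query, doc_id):
        self.clicks[(self._normalize(query), doc_id)] += 1

    def boost(self, query, scored_docs, weight=1.0):
        """scored_docs: list of (doc_id, base_score). Returns a re-sorted list."""
        q = self._normalize(query)
        # log1p damps the boost so a few heavily clicked docs do not dominate.
        boosted = [(doc_id, score + weight * math.log1p(self.clicks[(q, doc_id)]))
                   for doc_id, score in scored_docs]
        return sorted(boosted, key=lambda pair: pair[1], reverse=True)

store = SignalStore()
for _ in range(3):
    store.record_click("running shoes", "d2")
print(store.boost("Running Shoes", [("d1", 1.0), ("d2", 0.5)]))
```

Queries with no recorded clicks pass through with their original order, so the boost only affects queries the system has seen before.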
Resources Mentioned
- Introduction to Information Retrieval — Manning et al.; foundational IR textbook
- AI Powered Search (Manning) — chapters on signals, relevance tiers, presentation bias
- Relevant Search (Manning) — relevance tiers chapter
- Click Models for Web Search — free book on click-to-judgment conversion
- Awesome Search — curated search/IR resource list
Related Concepts
- BM25 — core lexical scoring algorithm
- Judgment Lists — output of relevance evaluation
- Learning to Rank — ML-based ranking optimization
- Query Understanding — pre-retrieval intent parsing
- Click Models — clickstream → judgment conversion
- Position Bias — presentation bias in click data
- Collocations — statistical phrase discovery
- Pointwise Relevance Evaluation — per-item grading
- Pairwise Relevance Evaluation — LHS/RHS comparison
- Listwise Relevance Evaluation — full-list ranking
Related Articles
- Classic ML to Cope with Dumb LLM Judges — Doug Turnbull; LLM judge evaluation in depth
- Getting Started on Search Relevance for the Understaffed Search Team — broader relevance program setup
- Semantic Search Without Embeddings — Doug Turnbull; co-occurrence embeddings via session data
People
- Doug Turnbull — author; softwaredoug.com