What AI Engineers Should Know about Search

Doug Turnbull provides a 58-point primer for AI engineers entering the search domain, covering lexical search fundamentals (BM25, tokenization, phrase search), relevance evaluation and judgment methodology, query understanding pipelines, and Learning to Rank. It is written to bridge the gap between LLM/RAG practitioners and classical IR knowledge.

Core Argument

Two things matter most before any algorithmic sophistication:

Set up a solid evaluation framework — without it, you can’t know if changes help
How you assemble the retrieval pipeline matters more than your choice of vector DB

Relevance Evaluation (Points 1–11)

Labels are called judgments — historically a human “judge” grades (query, result) pairs as relevant or not
Three labeling approaches, each with biases:
- Human evaluators: non-expert fatigue, position preference
- Clickstream data: labels only the top-N results that were shown; a click does not always mean the result was relevant
- LLM judge: cheaper, faster — see LLM as Judge
Position Bias / presentation bias: users can only click what they are shown, so clicks favor results that already rank well; overcoming it takes an explore/exploit mindset
Click Models convert raw clicks into calibrated judgments
Metrics: NDCG, ERR, MAP, Precision and Recall
Precision/recall tradeoff: casting a wider net retrieves more relevant results but also more noise
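
To make the metrics above concrete, here is a minimal sketch of NDCG computed from a judgment list. The exponential gain (2^grade - 1) and log2 discount are one common formulation, not necessarily the one the article assumes; the document ids and grades are hypothetical. As a simplification, the ideal ordering is built only from the grades passed in, which matches the full definition only when every judged document appears in the ranking.

```python
import math

def dcg(grades, k=None):
    """Discounted cumulative gain over grades listed in ranked order."""
    grades = grades[:k] if k is not None else grades
    # Exponential gain with a log2 position discount (rank 0 -> log2(2) = 1).
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(grades, k=None):
    """DCG divided by the DCG of the ideal (grade-sorted) ordering."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

# Hypothetical judgments for one query: doc id -> grade (0 = irrelevant, 2 = best).
judgments = {"doc_a": 2, "doc_b": 0, "doc_c": 1}
ranking = ["doc_b", "doc_a", "doc_c"]  # order returned by the engine
grades = [judgments.get(doc, 0) for doc in ranking]
score = ndcg(grades)
```

Because the irrelevant document is ranked first, `score` lands well below 1.0; swapping it to the bottom would restore a perfect score.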

BM25 and Lexical Scoring (Points 15–29)

BM25 = TF×IDF with a saturation curve on term frequency and normalization by field length
IDF: rarer term → higher weight; each system uses its own non-linear IDF curve
Parameters k1 (TF saturation speed) and b (field-length penalty) are tunable
BM25F: per-field term statistics
Lexical scoring requires care: term rarity in one field ≠ rarity in another
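
The roles of k1 and b are easier to see in code. The following is a sketch of one common BM25 formulation for a single term in a single field; the IDF curve shown is only one of several in use, and the function name and default parameter values are illustrative assumptions rather than any particular engine's API.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one query term against one field of one document."""
    # IDF: rarer terms (low doc_freq) receive a higher weight.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly longer-than-average fields are penalized.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    saturated_tf = tf / (tf + k1 * length_norm)
    return idf * saturated_tf

# Doubling term frequency yields diminishing gains (saturation).
scores_by_tf = [bm25_term_score(tf, 100, 100, 10, 1000) for tf in (1, 2, 4, 8)]
```

Since `doc_freq`, `avg_doc_len`, and `num_docs` are statistics of a specific field, the same term can score very differently in a title than in a body, which is the caution the last point above raises.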

Tokenization and Phrase Search (Points 30–36)

Tokenization choices (n-gram, wordpiece, stemming, punctuation) significantly affect recall
Lucene token streams are graphs: synonyms and multi-word equivalents (USA → United States of America) are branches at the same position
Phrase search encodes term positions; phrase queries are also graphs
Collocations (statistically co-occurring bigrams like “Palo Alto”) can serve as “poor man’s entities”
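
As an illustration of how phrase search depends on term positions, here is a toy positional index. It is a sketch only: the tokenizer is a lowercase whitespace split (an assumption), and it handles a linear sequence of terms, not the synonym graphs described above.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: lowercase + whitespace split; real analyzers do much more.
    return text.lower().split()

def build_positional_index(docs):
    """Map term -> doc id -> list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return doc ids where the phrase terms occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return set()
    matches = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Every later term must appear exactly `offset` positions after the first.
            if all(start + offset in index[t].get(doc_id, ())
                   for offset, t in enumerate(terms[1:], 1)):
                matches.add(doc_id)
                break
    return matches

docs = {
    1: "Palo Alto is in California",
    2: "Alto saxophones in Palo Pinto county",
}
index = build_positional_index(docs)
hits = phrase_search(index, "palo alto")
```

Both documents contain "palo" and "alto", but only document 1 has them adjacent and in order, so only it matches the phrase.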

Query Understanding (Points 37–41)

Query Understanding covers: category classification, entity extraction, vector mapping
Query Relaxation: try strict query first; relax (AND → OR or broader terms) on empty/poor results
Relevance tiers: high-precision matches boosted, lower-precision as fallback
Filter queries: exclude known irrelevant results explicitly
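
Query relaxation and relevance tiers can be sketched together: run the strict (AND) query first, and fall back to a relaxed (OR) query only when too few results come back, keeping strict hits on top. The matcher, function names, and `min_results` threshold below are hypothetical stand-ins for a real search engine call.

```python
def search(docs, terms, require_all):
    """Toy matcher: AND requires every term, OR requires at least one."""
    match = all if require_all else any
    return [doc_id for doc_id, text in docs.items()
            if match(t in text.lower().split() for t in terms)]

def search_with_relaxation(docs, query, min_results=1):
    """Try the strict query first; relax AND -> OR if results are too sparse."""
    terms = query.lower().split()
    strict = search(docs, terms, require_all=True)
    if len(strict) >= min_results:
        return strict, "strict"
    relaxed = search(docs, terms, require_all=False)
    # Strict hits stay first as the higher-precision tier.
    return strict + [d for d in relaxed if d not in strict], "relaxed"

docs = {
    "a": "red leather sofa",
    "b": "blue leather chair",
    "c": "red wool rug",
}
strict_result = search_with_relaxation(docs, "red leather")
relaxed_result = search_with_relaxation(docs, "green leather")
```

Here "red leather" is answered by the strict query alone, while "green leather" has no AND matches and falls back to documents matching either term.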

Learning to Rank (Points 42–53)

Learning to Rank applies ML to ranking; different from standard classification/regression
SVMRank: a pairwise approach; a binary SVM trained on (relevant, irrelevant) result pairs for the same query
LambdaMART (see Learning to Rank): listwise optimization of NDCG via gradient boosting; old but reliable
Multiple Reranking stages: cheap model for top-1K, expensive cross-encoder for final top-N
Good LTR features are orthogonal: BM25, embeddings, freshness, popularity all measure different things
Ranking ≠ similarity: high similarity docs can still be spammy or stale
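
What sets ranking apart from ordinary classification shows up in how the training data is built. Below is a sketch of the pairwise transformation behind approaches like SVMRank: within each query, every pair of results with different grades becomes one example whose features are the difference of the two feature vectors. The feature layout ([BM25 score, embedding similarity]) and the data are hypothetical, and the classifier itself is left out; any linear binary classifier could be trained on the output.

```python
from itertools import combinations

def pairwise_examples(judged):
    """Turn per-query graded feature vectors into pairwise difference examples.

    judged: {query: [(features, grade), ...]}
    Returns a list of (feature_difference, label), label +1 or -1.
    """
    examples = []
    for query, docs in judged.items():
        # Pairs are only formed within a query, never across queries.
        for (feat_a, grade_a), (feat_b, grade_b) in combinations(docs, 2):
            if grade_a == grade_b:
                continue  # ties carry no ordering preference
            diff = [a - b for a, b in zip(feat_a, feat_b)]
            examples.append((diff, 1 if grade_a > grade_b else -1))
    return examples

# Hypothetical features per result: [bm25_score, embedding_similarity].
judged = {
    "q1": [([2.0, 0.9], 1), ([1.0, 0.2], 0), ([0.5, 0.8], 1)],
    "q2": [([3.0, 0.1], 0), ([1.0, 0.7], 1)],
}
pairs = pairwise_examples(judged)
```

In "q2" the irrelevant result has the higher BM25 score, so the resulting example pushes a model to weigh the second feature as well, which is one reason features that measure different things are valuable.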

Signals and Feedback (Points 53–56)

Signals collection: memorize which results users engage with for each query; boost those results when the query is issued again
Query feedback: learn category/embedding affinity from clicked results (relevance feedback)
Tools: SearchArray, BM25S
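
A minimal sketch of signals boosting, assuming clicks are the engagement signal and a simple additive boost. The class name, `weight` value, and query normalization are illustrative assumptions; raw click counts also carry the position bias discussed earlier, so a production system would likely de-bias them first.

```python
from collections import defaultdict

class SignalStore:
    """Remembers which results users engaged with for each query."""

    def __init__(self):
        # query -> doc id -> click count
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, doc_id):
        self.clicks[query.strip().lower()][doc_id] += 1

    def boost(self, query, ranked, weight=0.1):
        """Re-sort (doc_id, score) pairs after adding weight * clicks for this query."""
        seen = self.clicks.get(query.strip().lower(), {})
        boosted = [(doc, score + weight * seen.get(doc, 0)) for doc, score in ranked]
        return sorted(boosted, key=lambda pair: pair[1], reverse=True)

store = SignalStore()
for _ in range(3):
    store.record_click("Sofa", "d2")

base_ranking = [("d1", 1.0), ("d2", 0.9)]
reranked = store.boost("sofa", base_ranking)
```

After three recorded clicks, "d2" overtakes "d1" for the query "sofa", while queries with no recorded signals keep their original order.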

Resources Mentioned

Introduction to Information Retrieval — Manning et al.; foundational IR textbook
AI Powered Search (Manning) — chapters on signals, relevance tiers, presentation bias
Relevant Search (Manning) — relevance tiers chapter
Click Models for Web Search — free book on click-to-judgment conversion
Awesome Search — curated search/IR resource list

Related Concepts

BM25 — core lexical scoring algorithm
Judgment Lists — output of relevance evaluation
Learning to Rank — ML-based ranking optimization
Query Understanding — pre-retrieval intent parsing
Click Models — clickstream → judgment conversion
Position Bias — presentation bias in click data
Collocations — statistical phrase discovery
Pointwise Relevance Evaluation — per-item grading
Pairwise Relevance Evaluation — LHS/RHS comparison
Listwise Relevance Evaluation — full-list ranking

Related Articles

Classic ML to Cope with Dumb LLM Judges — Doug Turnbull; LLM judge evaluation in depth
Getting Started on Search Relevance for the Understaffed Search Team — broader relevance program setup
Semantic Search Without Embeddings — Doug Turnbull; co-occurrence embeddings via session data

People

Doug Turnbull — author; softwaredoug.com
