What AI Engineers Should Know about Search

Doug Turnbull provides a 58-point primer for AI engineers entering the search domain. It covers lexical search fundamentals (BM25, tokenization, phrase search), relevance evaluation and judgment methodology, query understanding pipelines, and Learning to Rank, and is written to bridge the gap between LLM/RAG practitioners and classical IR knowledge.


Core Argument

Two things matter most before any algorithmic sophistication:

  1. Set up a solid evaluation framework; without it, you cannot tell whether a change helps
  2. How you assemble the retrieval pipeline matters more than your choice of vector DB

Relevance Evaluation (Points 1–11)

  • Labels are called judgments — historically a human “judge” grades (query, result) pairs as relevant or not
  • Three labeling approaches, each with biases:
    • Human evaluators: may lack domain expertise, get fatigued, and favor results shown in higher positions
    • Clickstream data: only labels the top-N results that were shown; a click does not always mean the result was relevant
    • LLM judge: cheaper, faster — see LLM as Judge
  • Position Bias / presentation bias: users can only click what they are shown; overcoming it takes an explore/exploit mindset (occasionally surfacing results that would not otherwise rank)
  • Click Models convert raw clicks into calibrated judgments
  • Metrics: NDCG, ERR, MAP, Precision and Recall
  • Recall tradeoff: casting a wider net retrieves more of the relevant results, but also more noise
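
As an illustration of how graded judgments turn into a metric, here is a minimal sketch of NDCG. It assumes the common exponential gain (2^grade - 1) with a log2 position discount, and it builds the ideal ordering only from the grades passed in (a simplification; a fuller implementation would use every judged document for the query). The function names are illustrative, not taken from the source.

```python
import math

def dcg(grades):
    """Discounted cumulative gain for grades listed in ranked order."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 is undiscounted.
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades, k=10):
    """DCG of the ranking divided by the DCG of the best possible ordering."""
    # Assumption: the ideal ordering is derived from the same list of grades.
    ideal = sorted(grades, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(grades[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Judgments for one query, in the order the engine returned the results.
print(ndcg([3, 2, 0, 1]))   # close to 1: only the last two results are swapped
print(ndcg([0, 1, 2, 3]))   # lower: the best results are ranked last
```

Running this over a fixed set of judged queries before and after a change is one simple way to act on the "evaluation first" point above.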

BM25 and Lexical Scoring (Points 15–29)

  • BM25 = TF×IDF with saturation curves for term frequency and field length normalization
  • IDF: rarer term → higher weight; each system uses its own non-linear IDF curve
  • Parameters k1 (TF saturation speed) and b (field-length penalty) are tunable
  • BM25F: a BM25 variant for multi-field documents that accounts for per-field term statistics
  • Lexical scoring requires care: term rarity in one field ≠ rarity in another
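
To make the saturation and length-normalization behavior concrete, here is a minimal sketch of the textbook BM25 score for a single term in a single field. The IDF shown is the Lucene-style `ln(1 + (N - df + 0.5) / (df + 0.5))`; as noted above, other systems use different curves, and the defaults k1=1.2 and b=0.75 are conventional values rather than recommendations from the source.

```python
import math

def bm25_term_score(tf, df, num_docs, field_len, avg_field_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document field."""
    # Rarer terms (low document frequency) get a higher weight.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly fields longer than average are penalized.
    length_norm = k1 * (1 - b + b * field_len / avg_field_len)
    # tf saturates: each extra occurrence adds less than the one before.
    return idf * tf * (k1 + 1) / (tf + length_norm)

# Saturation: going from 1 to 2 occurrences helps more than going from 10 to 11.
for tf in (1, 2, 10, 11):
    print(tf, round(bm25_term_score(tf, df=10, num_docs=1000,
                                    field_len=100, avg_field_len=100), 3))
```

Because `df`, `num_docs`, and `avg_field_len` are computed per field, the same term can score very differently in a title than in a body, which is the caution in the last bullet.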

Tokenization and Phrase Search (Points 30–36)

  • Tokenization choices (n-gram, wordpiece, stemming, punctuation) significantly affect recall
  • Lucene token streams are graphs: synonyms and multi-word equivalents (USA → United States of America) are branches at the same position
  • Phrase search relies on term positions stored in the index; phrase queries are also graphs
  • Collocations (statistically co-occurring bigrams like “Palo Alto”) as “poor man’s entities”
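
The role of term positions in phrase search can be sketched with a toy positional index. This is an illustration only: it assumes whitespace tokenization and lowercasing, handles linear phrases rather than token graphs, and is not how Lucene implements it. The function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map token -> doc_id -> list of positions where the token occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_match(index, phrase):
    """Return ids of documents containing the terms at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, {}) for term in terms]
    # Only documents containing every term can match the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            # The i-th term must appear exactly i positions after the first.
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                hits.add(doc_id)
                break
    return hits

docs = {1: "Palo Alto networks", 2: "alto palo", 3: "visit Palo Alto today"}
index = build_positional_index(docs)
print(phrase_match(index, "palo alto"))
```

Document 2 contains both terms but in the wrong order, so only positions (not term presence alone) separate it from documents 1 and 3.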

Query Understanding (Points 37–41)

  • Query Understanding covers: category classification, entity extraction, vector mapping
  • Query Relaxation: try strict query first; relax (AND → OR or broader terms) on empty/poor results
  • Relevance tiers: high-precision matches boosted, lower-precision as fallback
  • Filter queries: exclude known irrelevant results explicitly
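
Query relaxation can be sketched as a loop over progressively looser match modes. The example below is a toy in-memory version with hypothetical names; a real system would issue the strict and relaxed queries to the search engine, and the `min_results` threshold is an assumption standing in for whatever "empty/poor results" means in a given application.

```python
def search_with_relaxation(query_terms, docs, min_results=1):
    """Try a strict AND match first; fall back to OR if too few results."""
    terms = [t.lower() for t in query_terms]
    tokenized = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

    def matches(doc_tokens, mode):
        hits = [term in doc_tokens for term in terms]
        return all(hits) if mode == "AND" else any(hits)

    for mode in ("AND", "OR"):
        results = sorted(d for d, toks in tokenized.items() if matches(toks, mode))
        if len(results) >= min_results:
            return mode, results
    # Nothing matched even the relaxed query.
    return "OR", []

docs = {1: "red running shoes", 2: "blue running shorts", 3: "red hat"}
print(search_with_relaxation(["red", "shoes"], docs))    # strict match succeeds
print(search_with_relaxation(["red", "shorts"], docs))   # falls back to OR
```

Returning the mode alongside the results makes it easy to treat strict matches as a higher relevance tier than relaxed ones.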

Learning to Rank (Points 42–53)

  • Learning to Rank applies ML to ranking; different from standard classification/regression
  • SVMRank: a pairwise approach; trains a binary SVM on feature differences between relevant and irrelevant results for the same query
  • LambdaMART (see Learning to Rank): listwise optimization of NDCG via gradient boosting; old but reliable
  • Multiple Reranking stages: cheap model for top-1K, expensive cross-encoder for final top-N
  • Good LTR features are orthogonal: BM25, embeddings, freshness, popularity all measure different things
  • Ranking ≠ similarity: high similarity docs can still be spammy or stale
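
Two of the ideas above can be sketched in a few lines: the pairwise transform that turns ranking into binary classification (the idea behind SVMRank), and a two-stage reranking pipeline. Both are simplified illustrations with hypothetical names; the classifier itself is omitted, and the scoring functions passed to `multi_stage_rank` stand in for a cheap model and an expensive cross-encoder.

```python
from itertools import combinations

def pairwise_transform(features, grades):
    """Turn one query's graded results into (feature difference, label) pairs."""
    diffs, labels = [], []
    for i, j in combinations(range(len(grades)), 2):
        if grades[i] == grades[j]:
            continue  # equally graded pairs carry no preference
        diffs.append([a - b for a, b in zip(features[i], features[j])])
        # +1 if the first document should rank above the second, else -1.
        labels.append(1 if grades[i] > grades[j] else -1)
    return diffs, labels

def multi_stage_rank(candidates, cheap_score, expensive_score, first_k=1000, final_n=10):
    """Score everything cheaply, then rerank only the survivors expensively."""
    stage_one = sorted(candidates, key=cheap_score, reverse=True)[:first_k]
    return sorted(stage_one, key=expensive_score, reverse=True)[:final_n]

# Features per result: [bm25, embedding similarity]; grades are judgments.
features = [[2.0, 0.25], [1.0, 0.5], [0.5, 0.5]]
grades = [1, 0, 0]
print(pairwise_transform(features, grades))
```

The resulting pairs can be fed to any binary linear classifier; the learned weight vector then scores individual documents. Pairs must come from the same query, since judgments are not comparable across queries.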

Signals and Feedback (Points 53–56)

  • Signals collection: remember which results users engaged with for each query, and boost those results when the query is issued again
  • Query feedback: learn category/embedding affinity from clicked results (relevance feedback)
  • Tools: SearchArray, BM25S
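
A minimal sketch of signal-based boosting follows. The `SignalStore` class, the additive boost, and the normalization of queries by lowercasing are all assumptions made for illustration; raw click counts also inherit the position bias discussed earlier, so a production system would likely debias them first.

```python
from collections import Counter, defaultdict

class SignalStore:
    """Remembers clicked results per query and boosts them on later searches."""

    def __init__(self):
        self.clicks = defaultdict(Counter)

    @staticmethod
    def _normalize(query):
        return query.strip().lower()

    def record_click(self, query, doc_id):
        self.clicks[self._normalize(query)][doc_id] += 1

    def rerank(self, query, scored_results, boost=0.1):
        """scored_results is a list of (doc_id, base_score) pairs."""
        counts = self.clicks.get(self._normalize(query), Counter())
        # Hypothetical boost: add a fixed amount per recorded click.
        boosted = [(doc_id, score + boost * counts[doc_id])
                   for doc_id, score in scored_results]
        return sorted(boosted, key=lambda pair: pair[1], reverse=True)

store = SignalStore()
for _ in range(3):
    store.record_click("running shoes", "doc_b")
print(store.rerank("Running Shoes", [("doc_a", 1.0), ("doc_b", 0.8)]))
```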

Resources Mentioned


People