Semantic Search Without Embeddings

Reframing Semantic Search

Semantic search requires three things (not two):

1. Representation: a shared space for queries and content
2. Similarity function: how near or far items are
3. Match criteria (often forgotten): whether an item qualifies as a match at all

Embeddings excel at representation and similarity but do poorly at match criteria. There’s no “magic threshold”: a similarity of 0.8 doesn’t mean “match” across all domains and query types.

Taxonomy-Based Alternative

A managed taxonomy (hierarchical vocabulary) solves all three:

Representation: category tree (e.g., Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses)
Similarity: direct match > sibling > cousin > grandparent
Match criteria: include direct node + parent; exclude grandparent/cousin if desired

Example path: Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses / Novelty Rocking Horses
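The relationships and match criteria above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names `split_path`, `relation`, and `is_match` are hypothetical, the definitions of "sibling" (same parent) and "cousin" (same grandparent) are assumptions, and the default policy of accepting only the direct node and its parent simply mirrors the match criteria listed above.

```python
from typing import List, Sequence


def split_path(path: str) -> List[str]:
    """Split 'A / B / C' into ['A', 'B', 'C']."""
    return [part.strip() for part in path.split("/") if part.strip()]


def relation(query_path: str, item_path: str) -> str:
    """Name how item_path relates to query_path in the category tree."""
    q, d = split_path(query_path), split_path(item_path)
    shared = 0
    for a, b in zip(q, d):
        if a != b:
            break
        shared += 1
    if shared == 0:
        return "other"  # different top-level categories
    up = len(q) - shared    # steps from the query node up to the common ancestor
    down = len(d) - shared  # steps from the common ancestor down to the item node
    names = {
        (0, 0): "direct",
        (1, 0): "parent",
        (1, 1): "sibling",      # assumption: same parent
        (2, 0): "grandparent",
        (2, 2): "cousin",       # assumption: same grandparent
    }
    return names.get((up, down), "other")


def is_match(query_path: str, item_path: str,
             allowed: Sequence[str] = ("direct", "parent")) -> bool:
    """Match criteria: accept only the relationships listed in `allowed`."""
    return relation(query_path, item_path) in allowed


# "Play Tents" and "Outdoor Play / Swings" are made-up categories for illustration.
ROOM = "Baby & Kids / Toddler & Kids Playroom"
QUERY = ROOM + " / Indoor Play / Rocking Horses"
for item in [QUERY,
             ROOM + " / Indoor Play",
             ROOM + " / Indoor Play / Play Tents",
             ROOM + " / Outdoor Play / Swings",
             ROOM]:
    print(relation(QUERY, item), is_match(QUERY, item), item)
```

Unlike a similarity threshold, the `allowed` tuple is an explicit, readable policy: loosening it to include siblings is a one-word change.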

BM25 + Hierarchical Tokenizer

The taxonomy similarity function can be implemented in a standard BM25 index using a hierarchical tokenizer that produces all ancestor paths:

In:  hierarchical_tokenizer("Baby & Kids / ... / Rocking Horses")
Out: ['Baby & Kids',
      'Baby & Kids / Toddler & Kids Playroom',
      'Baby & Kids / Toddler & Kids Playroom / Indoor Play',
      'Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses']
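A tokenizer with this behavior is short to write. The version below is a sketch that assumes paths use "/" as the separator and that category names never contain a slash themselves; the `sep` parameter is an addition for illustration.

```python
from typing import List


def hierarchical_tokenizer(path: str, sep: str = " / ") -> List[str]:
    """Emit one token per ancestor path, from the root down to the full path."""
    parts = [part.strip() for part in path.split("/") if part.strip()]
    return [sep.join(parts[:depth]) for depth in range(1, len(parts) + 1)]


tokens = hierarchical_tokenizer(
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses"
)
for token in tokens:
    print(token)
```

Search engines often ship an equivalent component, so a hand-written tokenizer like this is mainly useful for prototyping or for indexes built in application code.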

Why BM25 naturally gives the right ranking: root nodes have high document frequency (common → low score); leaf nodes have low document frequency (rare → high score). Specific matches bubble up automatically.
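To make the document-frequency argument concrete, here is a small sketch over a made-up six-item catalog. It uses only the IDF part of BM25, in the common form ln(1 + (N - n + 0.5) / (n + 0.5)), and ignores term frequency and length normalization, so it illustrates the ranking effect rather than reproducing a full BM25 scorer. All category paths other than the Rocking Horses path are hypothetical, as are the helper names.

```python
import math
from collections import Counter
from typing import List


def ancestor_tokens(path: str) -> List[str]:
    """Hierarchical tokenization: one token per ancestor path."""
    parts = [part.strip() for part in path.split("/") if part.strip()]
    return [" / ".join(parts[:i]) for i in range(1, len(parts) + 1)]


ROOM = "Baby & Kids / Toddler & Kids Playroom"
CATALOG = [  # one category path per product; invented for illustration
    ROOM + " / Indoor Play / Rocking Horses",
    ROOM + " / Indoor Play / Play Tents",
    ROOM + " / Kids Tables",
    "Baby & Kids / Nursery / Cribs",
    "Furniture / Living Room / Sofas",
    "Furniture / Bedroom / Beds",
]

# Document frequency: how many products carry each ancestor token.
doc_freq = Counter(tok for path in CATALOG for tok in ancestor_tokens(path))


def idf(token: str) -> float:
    """BM25-style inverse document frequency; rarer tokens score higher."""
    n = doc_freq.get(token, 0)
    return math.log(1 + (len(CATALOG) - n + 0.5) / (n + 0.5))


def score(query_path: str, doc_path: str) -> float:
    """Sum the IDF of every ancestor token shared by query and document."""
    doc_tokens = set(ancestor_tokens(doc_path))
    return sum(idf(t) for t in ancestor_tokens(query_path) if t in doc_tokens)


query = CATALOG[0]
ranked = sorted(CATALOG, key=lambda path: score(query, path), reverse=True)
for path in ranked:
    print(f"{score(query, path):.2f}  {path}")
```

In this toy catalog the exact category scores highest, followed by the sibling, then items that share only shallower ancestors, and unrelated categories score zero.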

LLMs + Taxonomies

LLMs supercharge taxonomy maintenance:

Given a taxonomy, LLMs can classify products and queries cheaply and accurately
LLMs can “hallucinate” plausible category paths for a query, as a form of HyDE-style retrieval
Both make the historically expensive management of taxonomies approachable
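One reason a hallucinated path can still be useful is that hierarchical tokenization lets its valid ancestors match even when the leaf does not exist. The sketch below shows that idea under stated assumptions: no LLM is called, `llm_guess` is a hard-coded stand-in for a model's output, and `usable_tokens`, the "Play Tents" category, and the invented leaf are all hypothetical.

```python
from typing import List, Set


def ancestor_tokens(path: str) -> List[str]:
    """Hierarchical tokenization: one token per ancestor path."""
    parts = [part.strip() for part in path.split("/") if part.strip()]
    return [" / ".join(parts[:i]) for i in range(1, len(parts) + 1)]


TAXONOMY_PATHS = [
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses",
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / Play Tents",  # made up
]
KNOWN_TOKENS: Set[str] = {
    tok for path in TAXONOMY_PATHS for tok in ancestor_tokens(path)
}


def usable_tokens(proposed_path: str) -> List[str]:
    """Keep only the ancestor tokens of a proposed path that exist in the taxonomy."""
    return [tok for tok in ancestor_tokens(proposed_path) if tok in KNOWN_TOKENS]


# Stand-in for a path an LLM might propose; the leaf is not in the taxonomy.
llm_guess = (
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / Wooden Rocking Animals"
)
for token in usable_tokens(llm_guess):
    print(token)
```

Here the invented leaf is dropped, but the three real ancestors remain and can still be used as query terms against the index.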

Sweet spot for embeddings: building classifiers into taxonomies — not direct retrieval. Use embeddings to find the best taxonomy node, then search within that node.
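The two-step flow (classify into a node, then search within it) might look like the following sketch. The three-dimensional vectors are hand-picked toy values rather than output from any embedding model, the catalog and all category paths other than Rocking Horses are invented, and `classify` and `search_within` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


PLAY = "Baby & Kids / Toddler & Kids Playroom / Indoor Play"

# Toy "embeddings" for taxonomy nodes (assumed values, for illustration only).
NODE_VECTORS: Dict[str, List[float]] = {
    PLAY + " / Rocking Horses": [0.9, 0.1, 0.0],
    PLAY + " / Play Tents": [0.1, 0.9, 0.0],
    "Furniture / Living Room / Sofas": [0.0, 0.1, 0.9],
}

# (product name, category path) pairs; invented for illustration.
CATALOG: List[Tuple[str, str]] = [
    ("Wooden rocking horse", PLAY + " / Rocking Horses"),
    ("Plush rocking unicorn", PLAY + " / Rocking Horses"),
    ("Pop-up play tent", PLAY + " / Play Tents"),
    ("Three-seat sofa", "Furniture / Living Room / Sofas"),
]


def classify(query_vector: Sequence[float]) -> str:
    """Step 1: use embeddings only to pick the closest taxonomy node."""
    return max(NODE_VECTORS, key=lambda node: cosine(query_vector, NODE_VECTORS[node]))


def search_within(node: str) -> List[str]:
    """Step 2: retrieve by exact category membership (node or its descendants)."""
    return [name for name, path in CATALOG
            if path == node or path.startswith(node + " / ")]


query_vector = [0.8, 0.2, 0.1]  # stand-in for an embedded query
node = classify(query_vector)
print(node)
print(search_within(node))
```

The embedding only has to win an argmax over nodes; whether an item qualifies as a match is then decided by category membership, not by a similarity cutoff.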

Tradeoffs vs. Embeddings

Criterion	Taxonomy	Embeddings
Explainability	High (user vocabulary)	Low (black box)
Cold start	Works if category exists	Random garbage without training data
Exact constraints	Excellent	Poor
Maintenance	Labor-intensive (but LLMs help now)	Data-hungry
Fuzzy/semantic	Limited (vocabulary-dependent)	Excellent

Key Insight

“Don’t apologize for living with a more organized approach to semantic search.”

Embeddings are the wrong lens for problems requiring exacting, precise matching. Taxonomies + LLM classifiers can serve as a powerful, interpretable semantic layer.

People

Doug Turnbull
