Search Problem Archetypes

A diagnostic framework for naming the search problem you actually have before choosing a solution. The premise, drawn from Atita Arora’s “Before You Fix Your Search, Know What’s Actually Broken” and Russell Ackoff’s systems thinking, is that most search problems are not unique: they are instances of a small number of recurring patterns. Before investing in learning to rank (LTR), vector search, or a new model, the most valuable step is to identify which pattern you are in, because the archetype tells you which layer to fix first (data model, ranking, evaluation, or architecture).


Why Diagnosis Comes Before Solution

Teams self-diagnose with comfortable, sophisticated-sounding answers: “we need better ranking”, “we need vector search”, “we need a better model”. But the stated problem is usually a symptom, not a diagnosis. The most common foundation failure: a team has built sophisticated retrieval but never formally defined relevance. There is no evaluation framework, no judgment list, and no shared definition of a good result, so the model optimizes a signal that nobody has verified maps to user satisfaction. This is a measurement / alignment problem, and no ranking sophistication fixes it.

“We fail more often because we solve the wrong problem than because we get the wrong solution to the right problem.” — Russell Ackoff
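As a rough illustration of the missing foundation described above, a judgment list can start as a small table of graded (query, document) pairs, with one agreed metric computed over it. The sketch below uses nDCG@k; the queries, document IDs, and the 0 to 3 grading scale are hypothetical, and the choice of metric is an assumption rather than something the framework prescribes.

```python
import math

# Hypothetical judgment list: (query, doc_id) -> graded relevance (0 = bad, 3 = ideal).
JUDGMENTS = {
    ("red running shoes", "doc_a"): 3,
    ("red running shoes", "doc_b"): 1,
    ("red running shoes", "doc_c"): 0,
    ("red running shoes", "doc_d"): 2,
}


def dcg(grades):
    """Discounted cumulative gain for grades listed in ranked order."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg_at_k(query, ranked_doc_ids, judgments, k=10):
    """nDCG@k for one query; unjudged documents count as grade 0."""
    grades = [judgments.get((query, d), 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(
        (g for (q, _), g in judgments.items() if q == query), reverse=True
    )[:k]
    ideal_dcg = dcg(ideal)
    return dcg(grades) / ideal_dcg if ideal_dcg > 0 else 0.0


good = ndcg_at_k("red running shoes", ["doc_a", "doc_d", "doc_b", "doc_c"], JUDGMENTS)
bad = ndcg_at_k("red running shoes", ["doc_c", "doc_b", "doc_d", "doc_a"], JUDGMENTS)
print(round(good, 3), round(bad, 3))
```

Even a list this small forces the team to agree on what a grade of 3 means, which is the shared definition of relevance the paragraph above says is usually absent.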

The 10 Archetypes

| # | Archetype | Core tension | Typical domains | Fix-first layer |
|---|---|---|---|---|
| 1 | Uniqueness Problem | One-of-a-kind inventory; freshness existential; thin demand signals; vocabulary mismatch | Marketplaces, pre-owned, handmade, classifieds | Data modeling |
| 2 | Complexity Machine | Variant-heavy catalog; mixed known-item + exploratory intent; sponsored vs. organic | Retail, fashion, health, price comparison | Intent routing / ranking |
| 3 | Precision Mandate | Plausible-but-wrong is worse than nothing; provenance & scope correctness first-class | Legal, clinical, tax, regulatory | Precision, lineage, scope |
| 4 | Firehose | Recency vs. relevance tug-of-war; authority explicit; unpredictable volume | News, media, social, live events | Freshness & authority signals |
| 5 | Extraction | Consumer is a machine/agent/pipeline; structured extraction at low latency; schema drift | Financial news→events, feeds, research pipelines | Extraction precision / latency |
| 6 | Media Vault | Non-text catalog (image/video/audio); sparse/missing metadata; conceptual/mood queries | Stock media, DAM, archives, video | Metadata / multimodal layer |
| 7 | Knowledge Graph | Answer composite across documents & entity relationships; taxonomy/ontology quality | Procurement, talent networks, compliance, gastronomy | Ontology / entity modeling |
| 8 | Geospatial | Distance, proximity, bounding box, polygon as first-class relevance | Satellite/aerial, urban planning, mapping, govt | Spatial indexing / architecture |
| 9 | Q-commerce | Real-time relevance under operational constraints (delivery window, availability, geo) | Food delivery, ride-hailing, on-demand, logistics | Supply/operational alignment |
| 10 | Code Search | Inherits KG + precision + firehose + extraction; vocabulary mismatch; version correctness | Coding agents, version control, API docs, code review | Depends on inherited pattern |

How to Know Which One You’re In (selected tells)

  • Uniqueness — standard taxonomies don’t fit (“handcrafted ceramic bowl” ≠ “kitchenware”); an item sold is gone.
  • Complexity Machine — Master/Variant SKUs; a single pipeline tuned for neither known-item nor exploratory intent.
  • Precision Mandate — recall isn’t enough; wrong jurisdiction / wrong revision / wrong rate range is a hard fail.
  • Firehose — most relevant doc is years old while the most recent has no topical reference; query volume spikes on events.
  • Extraction — traditional relevance evals don’t apply; failure to extract has compounding downstream consequences.
  • Media Vault — asset names are random/numeric with no relation to content; lexical layer still doing most of the work.
  • Knowledge Graph — users browse more than search; nested relationships (supplier → product → certification → regulation).
  • Geospatial — bounding box / radius / polygon queries not served by standard search; coordinate-system errors at query time.
  • Q-commerce — supply-side fulfilment confused with query-side sophistication; substitutes treated as edge case.
  • Code Search — query intent ambiguous (file? function? class? call chain?); one concept spelled many ways (conn_retry / retryConnection / retry_on_connection_error).
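The vocabulary mismatch in the Code Search tell can be made concrete with a small identifier tokenizer. This is a minimal sketch, not a description of any particular code-search engine: `split_identifier`, `normalize`, and the `ABBREVIATIONS` table are hypothetical names introduced here, and the abbreviation mapping is an assumption a real system would have to curate or learn.

```python
import re

# Hypothetical abbreviation table; a real system would curate or learn this.
ABBREVIATIONS = {"conn": "connection"}


def split_identifier(name):
    """Split snake_case, camelCase, and PascalCase identifiers into lowercase tokens."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (HTTP in HTTPServer), capitalized or lowercase words, digit runs.
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens]


def normalize(name):
    """Token set with known abbreviations expanded."""
    return {ABBREVIATIONS.get(t, t) for t in split_identifier(name)}


names = ["conn_retry", "retryConnection", "retry_on_connection_error"]
shared = set.intersection(*(normalize(n) for n in names))
print(sorted(shared))
```

Splitting alone only recovers the shared token `retry`; the `conn` / `connection` gap closes only once the abbreviation is mapped, which is why naming conventions alone do not explain the mismatch.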

Operating Principles

  • Same symptom, different meaning. Zero results in e-commerce (Uniqueness / Complexity Machine) is wasted real estate — show substitutes, never a dead end. In a Precision Mandate, zero results is often better than a marginally-similar regulation or precedent. Same metric on the same dashboard, opposite prescription.
  • Systems sit between archetypes and outgrow them. A small marketplace landing a big retail partner suddenly inherits variants, suppliers, and sponsored-products problems — a shift of archetype, so the diagnosis should shift too.
  • The archetype names which layer to fix first, not how. Reranking won’t fix data-modeling gaps; a “better” embedding won’t fix sponsored-vs-organic business misalignment. Sequence matters; most teams skip to solutions before the problem is discovered.
  • Measurable ≠ useful. Just because a metric is measurable doesn’t mean it’s valid for your case.
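The “same symptom, different meaning” principle can be sketched as a policy that branches on archetype before deciding what a zero-result query should return. The enum, the policy table, and the `respond` function are all hypothetical names for illustration; the only thing taken from the text is the prescription itself (substitutes for e-commerce archetypes, an honest empty result under a Precision Mandate).

```python
from enum import Enum


class Archetype(Enum):
    UNIQUENESS = "uniqueness"
    COMPLEXITY_MACHINE = "complexity_machine"
    PRECISION_MANDATE = "precision_mandate"


# Assumed policy: may near matches be shown when there are no exact hits?
FALLBACK_ON_ZERO_RESULTS = {
    Archetype.UNIQUENESS: True,          # show substitutes, never a dead end
    Archetype.COMPLEXITY_MACHINE: True,  # same prescription for retail catalogs
    Archetype.PRECISION_MANDATE: False,  # an empty answer beats a plausible-but-wrong one
}


def respond(archetype, exact_hits, near_matches):
    """Return the results to show and whether they are fallback substitutes."""
    if exact_hits:
        return {"results": list(exact_hits), "fallback": False}
    if FALLBACK_ON_ZERO_RESULTS[archetype] and near_matches:
        return {"results": list(near_matches), "fallback": True}
    return {"results": [], "fallback": False}


print(respond(Archetype.UNIQUENESS, [], ["similar ceramic bowl"]))
print(respond(Archetype.PRECISION_MANDATE, [], ["regulation from another jurisdiction"]))
```

The zero-result rate on the dashboard is identical in both calls; only the archetype decides whether it signals a problem.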

Agentic Retrieval Is Not an Archetype

Agentic retrieval is a consumption pattern that sits on top of whichever archetype the system already belongs to (a financial-research agent → Extraction; a legal-review agent → Precision Mandate). What changes is the evaluation contract: a wrong result a human would catch and ignore becomes a wrong action an agent executes autonomously. Precision, result confidence, and scope correctness become hard requirements, not nice-to-haves.
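One way to picture the stricter evaluation contract is a gate between retrieval and the agent that passes only results meeting hard confidence and scope requirements. This is a sketch under stated assumptions: the `Result` fields, the `jurisdiction` scope attribute, and the 0.9 threshold are hypothetical, and a usable confidence score presumes calibration work that is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    doc_id: str
    confidence: float   # hypothetical calibrated score in [0, 1]
    jurisdiction: str   # hypothetical scope attribute


def results_for_agent(results, required_jurisdiction, min_confidence=0.9):
    """Keep only results an autonomous agent may act on; the list may be empty."""
    return [
        r
        for r in results
        if r.jurisdiction == required_jurisdiction and r.confidence >= min_confidence
    ]


candidates = [
    Result("statute-1", 0.97, "DE"),
    Result("statute-2", 0.95, "FR"),  # confident, but wrong scope
    Result("memo-7", 0.55, "DE"),     # right scope, low confidence
]
print([r.doc_id for r in results_for_agent(candidates, "DE")])
```

A human reader would skim past the two rejected candidates; the gate exists because an agent would otherwise act on them.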


Articles

People

  • Atita Arora — author of the archetype framework
  • Udi Manber — “search is essentially a solved problem” misperception