This Is What Agentic Retrieval Looks Like

Jo Kristian Bergum (Hornet) analyzes 19,279 search calls made by GPT-5 during BrowseComp-Plus benchmark runs to characterize the agentic query workload. The core finding: agents search like power users, writing queries that are completely out of distribution for retrieval infrastructure built and tuned for humans.

Part 2 of a series on agentic retrieval. Part 1: Deep research is a retrieval problem.


The BrowseComp-Plus Setup

  • 830 riddle-like, multi-hop questions
  • GPT-5 has only a BM25 search tool returning top-5 snippets (512 tokens each)
  • Median session: 24 search calls per question; p90: 35; max: 63
  • Oracle accuracy (evidence pre-loaded): 93% correct; with weak BM25: 14%
  • The 79-point gap is where retrieval quality lives

Each search call is conditioned on all prior calls: a miss on turn 3 degrades every subsequent turn.

Agent Query Length vs. Human Queries

Distribution              Median    p90       p99
AOL human queries (2006)  2 terms   5 terms   8 terms
GPT-5 BrowseComp-Plus     10 terms  17 terms

GPT-5’s median query sits past the 99th percentile of human queries. First queries average 19.1 terms; length drops to ~10 by turn 2 and plateaus there.
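A minimal sketch of how such length statistics could be computed from a list of query strings. The whitespace tokenization and the choice of percentile method are assumptions; the source does not say how terms were counted. The names `term_count` and `length_summary` are illustrative only.

```python
import statistics

def term_count(query: str) -> int:
    # Assumption: a "term" is a whitespace-separated token, so an operator
    # such as site:example.org counts as one term.
    return len(query.split())

def length_summary(queries: list[str]) -> dict[str, float]:
    # Needs at least two queries, because statistics.quantiles does.
    lengths = [term_count(q) for q in queries]
    # n=100 yields 99 cut points; index 89 is the 90th percentile.
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {"median": statistics.median(lengths), "p90": cuts[89]}

# Example queries quoted in this document.
EXAMPLE_QUERIES = [
    '"born June" "video game composer" 1971 OR 1970 OR 1969',
    'site:*.org "January 28, 2019" "Blog" art',
    'site:surrey.ac.uk filetype:pdf "student numbers" "2018/19"',
    '"Memphis 13" -football -soccer Wikipedia',
]

summary = length_summary(EXAMPLE_QUERIES)
```

Under whitespace tokenization even the shortest of these examples has 5 terms, which equals the p90 of the AOL human queries in the table.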

GPT-5 Uses Web Search Syntax — Fluently

Operator             % of individual queries  % of sessions
Phrase quotes "..."  65.6%                    98.2%
Four-digit year      45.6%                    95.4%
site: domain filter  6.4%                     48.1%

The agent also writes combinations that humans almost never type:

  • OR across a year range: "born June" "video game composer" 1971 OR 1970 OR 1969
  • Wildcard subdomain: site:*.org "January 28, 2019" "Blog" art
  • Stacked operators: site:surrey.ac.uk filetype:pdf "student numbers" "2018/19"
  • Negation: "Memphis 13" -football -soccer Wikipedia

Phrase quotes appeared in 98% of sessions — the agent treats exact-string matching as a primary strategy.
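Operator usage of this kind can be measured with a few regular expressions. The sketch below is not the author's analysis code: the patterns are assumptions (for example, a "four-digit year" is taken to mean 19xx or 20xx), and `detect_operators` and `usage_rates` are hypothetical names. It reports both the per-query and the per-session share, the two views used in the table above.

```python
import re
from collections import Counter

# Assumed patterns; the source does not specify how operators were detected.
OPERATOR_PATTERNS = {
    "phrase_quotes": re.compile(r'"[^"]+"'),
    "four_digit_year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "site_filter": re.compile(r"\bsite:\S+"),
    "filetype": re.compile(r"\bfiletype:\S+"),
    "boolean_or": re.compile(r"\bOR\b"),
    "negation": re.compile(r"(?:^|\s)-\w+"),
}

def detect_operators(query: str) -> set[str]:
    """Return the names of all operators that appear in one query."""
    return {name for name, pat in OPERATOR_PATTERNS.items() if pat.search(query)}

def usage_rates(sessions: list[list[str]]) -> dict[str, tuple[float, float]]:
    """Return (share of queries, share of sessions) for each operator.

    Assumes at least one session and at least one query overall.
    """
    n_queries = sum(len(session) for session in sessions)
    query_hits: Counter = Counter()
    session_hits: Counter = Counter()
    for session in sessions:
        seen: set[str] = set()
        for query in session:
            ops = detect_operators(query)
            query_hits.update(ops)
            seen |= ops
        # A session counts once per operator, however often it was used.
        session_hits.update(seen)
    return {
        name: (query_hits[name] / n_queries, session_hits[name] / len(sessions))
        for name in OPERATOR_PATTERNS
    }
```

An operator used in only a minority of queries can still reach most sessions, because a single use anywhere in a session is enough to count that session.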

Why Current Retrieval Infrastructure Fails Agents

Retrieval was designed for the human workload:

  • Inverted indexes and BM-WAND pruning degrade past ~7 terms (uniform term weights)
  • Neural retrievers trained on MS MARCO / Natural Questions (short, fluent queries) → Distribution Shift on agent queries
  • Search APIs expose one string input (optimized for human typers)
  • Query operators deprecated because humans stopped using them

None of those choices were wrong for humans. They just don’t match this workload.
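To make the single-string limitation concrete, here is a rough sketch of how an operator-heavy query string could be split into structured fields before it reaches a retrieval backend. This is an illustration, not an API described in the source: `StructuredQuery` and `parse_query` are hypothetical names, and the parser only handles OR between bare terms, not between phrases.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StructuredQuery:
    phrases: list[str] = field(default_factory=list)       # exact-match strings
    terms: list[str] = field(default_factory=list)         # plain keywords
    any_of: list[list[str]] = field(default_factory=list)  # "a OR b OR c" groups
    excluded: list[str] = field(default_factory=list)      # -term negations
    site: Optional[str] = None
    filetype: Optional[str] = None

# A token is either a balanced quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')

def parse_query(query: str) -> StructuredQuery:
    out = StructuredQuery()
    tokens = TOKEN.findall(query)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if len(tok) >= 2 and tok.startswith('"') and tok.endswith('"'):
            out.phrases.append(tok[1:-1])
        elif tok.startswith("site:"):
            out.site = tok[len("site:"):]
        elif tok.startswith("filetype:"):
            out.filetype = tok[len("filetype:"):]
        elif tok.startswith("-") and len(tok) > 1:
            out.excluded.append(tok[1:])
        elif i + 2 < len(tokens) and tokens[i + 1] == "OR":
            # Collect a run such as "1971 OR 1970 OR 1969" into one group.
            group = [tok]
            while i + 2 < len(tokens) and tokens[i + 1] == "OR":
                group.append(tokens[i + 2])
                i += 2
            out.any_of.append(group)
        else:
            out.terms.append(tok)
        i += 1
    return out
```

For example, `parse_query('site:surrey.ac.uk filetype:pdf "student numbers" "2018/19"')` yields a site filter, a filetype filter, and two phrases, which a backend could apply as filters and exact-match constraints instead of scoring every token as a bag-of-words term.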

The Multi-Turn Compounding Problem

Poor retrieval compounds across a session. A missed document on turn 3 means turn 4 reasons over a different context, and the error propagates. Miss rate × session length = accumulated degradation — a problem that single-query NDCG benchmarks are blind to.
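A small worked sketch of the compounding arithmetic. It treats each search call as an independent trial with a fixed miss rate, which is a simplifying assumption: the text above argues that calls are in fact conditioned on earlier ones. The 10% miss rate is a made-up illustration, not a figure from the source; the session lengths of 24 and 35 calls are the median and p90 reported earlier.

```python
def expected_misses(miss_rate: float, calls: int) -> float:
    # "Miss rate x session length": the expected number of missed retrievals.
    return miss_rate * calls

def prob_any_miss(miss_rate: float, calls: int) -> float:
    # Probability that at least one call in the session misses,
    # assuming calls are independent (a simplification).
    return 1.0 - (1.0 - miss_rate) ** calls

HYPOTHETICAL_MISS_RATE = 0.10  # illustrative value, not measured

for calls in (1, 24, 35):  # single query, median session, p90 session
    print(
        calls,
        round(expected_misses(HYPOTHETICAL_MISS_RATE, calls), 2),
        round(prob_any_miss(HYPOTHETICAL_MISS_RATE, calls), 3),
    )
```

Under these assumptions, a miss rate that looks tolerable for one query leaves a median-length session with roughly a 92% chance of at least one miss, which is the kind of effect a single-query metric cannot show.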


  • Agentic Retrieval — the workload studied; multi-turn, long-query, operator-heavy
  • BM25 — the retrieval system used; reveals its limits under agent workload
  • Query Operators — phrase quotes, site:, filetype:, OR, negation; GPT-5 uses them fluently
  • Distribution Shift — neural retrievers trained on MS MARCO fail on long agent queries
  • Agentic Search — broader context; retrieval is the bottleneck
  • BrowseComp-Plus — benchmark used; 830 multi-hop questions, 100K-page corpus

Companies

  • Hornet — author’s company; building retrieval infrastructure for agentic workloads

People