RL-Trained Search Agents

Most agentic search today prompts a general or specialized LLM to use search tools. The frontier underneath it is training the searcher itself with reinforcement learning: the model discovers when to search, what to query, and how to fold results into its reasoning, rewarded only by final-answer correctness rather than by human-labeled trajectories. This is the methodological complement to Purpose-Built Agentic Search Models: one trains the policy of searching, the other ships a product model.

See Reinforcement Learning for Search for the methodology and Search-R1 for the canonical framework.


Supervised tool-use training needs labeled examples of “good” search sequences — expensive to produce and brittle across domains. RL sidesteps this: reward the model for getting the final answer right and let it learn the search strategy autonomously. This is what separates RL search agents from prompted Agentic Search and from single-shot RAG.
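As a rough illustration of "reward the model for getting the final answer right", an outcome reward can be as small as a normalized string comparison. The sketch below is an assumption, not the reward from any specific paper: the function names are hypothetical, and the normalization (lowercasing, stripping punctuation and English articles) follows the convention common in QA evaluation scripts.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: List[str]) -> float:
    """Return 1.0 if the prediction matches any gold answer after normalization, else 0.0.

    Nothing about the search trajectory is scored: only the final answer.
    """
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0


reward = exact_match_reward("The Eiffel Tower.", ["Eiffel Tower"])
```

Because the reward looks only at the final answer, no labeled search sequence is needed; the query strategy is whatever the policy finds that raises this number.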

Key Design Choices

| Component | Choice | Why |
|---|---|---|
| Reward | Outcome-based (e.g. exact match) | Simple, robust, resists reward hacking; process rewards are gameable |
| Loss masking | Mask retrieved tokens from the loss | The model must not be trained on content it retrieved but didn’t generate; without this, training is unstable |
| RL algorithm | PPO (stable) / GRPO (faster convergence) | Trade stability vs. speed |
| Trajectories | None required | The point: no human-labeled reasoning chains |
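To make the PPO / GRPO row concrete: GRPO drops PPO's learned value function and instead samples a group of rollouts for the same question, using each rollout's reward relative to the group as its advantage. A minimal sketch of that normalization follows; the function name, the use of the population standard deviation, and the `eps` value are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own sampling group.

    rewards: outcome rewards for several rollouts of the SAME question.
    eps: small constant (assumed) that avoids division by zero when
         every rollout in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question, scored by exact match (1.0 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage. When every rollout in a group gets the same reward, all advantages are zero and that question contributes no gradient signal.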

Token-level loss masking is the critical technical contribution that makes the loop trainable.
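A minimal sketch of what that masking might look like, in plain Python. It assumes each tag arrives as a single token and that the `<information>` tags themselves are appended by the environment (and are therefore masked too). Real tokenizers usually split tags into several pieces, so a real implementation would more likely track character or token offsets. The function names are hypothetical.

```python
from typing import List


def build_loss_mask(tokens: List[str]) -> List[int]:
    """Return 1 for policy-generated tokens, 0 for retrieved tokens.

    Everything from <information> through </information> (inclusive)
    came from the retriever, not the policy, so it is masked out.
    """
    mask = []
    inside_retrieved = False
    for tok in tokens:
        if tok == "<information>":
            inside_retrieved = True
            mask.append(0)
        elif tok == "</information>":
            mask.append(0)
            inside_retrieved = False
        else:
            mask.append(0 if inside_retrieved else 1)
    return mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over unmasked (policy-generated) tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


tokens = ["<search>", "capital", "of", "France", "</search>",
          "<information>", "Paris", "is", "the", "capital", "</information>",
          "<answer>", "Paris", "</answer>"]
mask = build_loss_mask(tokens)
```

The retrieved passage still sits in the context and conditions later tokens; it just contributes nothing to the gradient.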

Search-R1: The Canonical Framework

Search-R1 (arXiv:2503.09516) trains an LLM to interleave reasoning and retrieval with <think>, <search>, <information> tokens:

  1. Reason about what’s needed → 2. Issue a search query → 3. Incorporate results → 4. Repeat until sufficient → 5. Answer.
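The five steps above can be sketched as a rollout loop. This is an illustration under stated assumptions, not the Search-R1 code: `generate` and `search` are hypothetical callables (a policy that emits one segment ending in `</search>` or `</answer>`, and a retriever returning passages), the `<answer>` tag and the `max_turns` cap are assumed, and the toy stubs exist only to make the example runnable.

```python
import re
from typing import Callable, List


def run_search_rollout(
    prompt: str,
    generate: Callable[[str], str],
    search: Callable[[str], List[str]],
    max_turns: int = 4,
) -> str:
    """Interleave generation and retrieval until the policy answers."""
    context = prompt
    for _ in range(max_turns):
        segment = generate(context)      # steps 1-2: reason, maybe emit a query
        context += segment
        if "<answer>" in segment:        # step 5: the policy decided to answer
            break
        query = re.search(r"<search>(.*?)</search>", segment, flags=re.DOTALL)
        if query is None:                # neither a query nor an answer: stop
            break
        passages = search(query.group(1).strip())
        # Step 3: retrieved text is appended by the environment, not the policy.
        context += "<information>" + "\n".join(passages) + "</information>"
    return context


# Toy stand-ins (hypothetical) so the loop can be exercised end to end.
def toy_generate(context: str) -> str:
    if "<information>" not in context:
        return "<think>I need the capital.</think><search>capital of France</search>"
    return "<think>The passage answers it.</think><answer>Paris</answer>"


def toy_search(query: str) -> List[str]:
    return ["Paris is the capital of France."]


trace = run_search_rollout("Q: What is the capital of France?\n", toy_generate, toy_search)
```

The number of retrieval turns is decided by the policy at run time, bounded here only by `max_turns`.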

Reported gains over baselines: +26% (Qwen2.5-7B), +21% (Qwen2.5-3B), +10% (LLaMA3.2-3B) across 7 QA datasets (NQ, TriviaQA, HotpotQA, 2WikiMultiHopQA, …).

| Aspect | RAG | RL-trained (Search-R1) |
|---|---|---|
| Retrieval turns | 1 | Multiple, dynamic |
| Query source | Fixed user query | RL-generated mid-reasoning |
| Training | Supervised / none | Reinforcement learning |
| Knowledge source | Static index | Live search |

Where It Sits in the Frontier

RL-trained agents and Purpose-Built Agentic Search Models are two bets on making the searcher smart: train the search policy with RL (Search-R1) vs. train a task-specialized model on the rewrite/retrieve/rerank outcome (SID-1). Both contrast with Direct Corpus Interaction, which keeps a general agent but upgrades the interface to the corpus. All three are the live edge of Frontier of Search 2026.