Search-R1

An RL-trained framework that teaches LLMs to interleave multi-turn web search with chain-of-thought reasoning, enabling autonomous, iterative research behaviour rather than single-pass retrieval. Introduced in arXiv paper 2503.09516.


Core Idea

Instead of retrieving once before generating (RAG), Search-R1 trains an LLM to:

  1. Reason about what information it needs
  2. Issue search queries to a live search engine
  3. Incorporate retrieved results into ongoing reasoning
  4. Repeat until sufficient context is gathered
  5. Generate a final answer

The model learns this entire loop, including when and what to search, through reinforcement learning, with no human-labeled reasoning trajectories.
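
The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: `rollout`, `toy_generate` and `toy_search` are hypothetical names, the model and search engine are replaced by stub functions, and the `<answer>` tag used to mark the final answer is assumed here for convenience.

```python
import re
from typing import Callable, Optional, Tuple

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)  # assumed tag


def rollout(question: str,
            generate: Callable[[str], str],
            search: Callable[[str], str],
            max_turns: int = 4) -> Tuple[str, Optional[str]]:
    """Run the reason/search loop; return (trajectory, answer or None)."""
    context = question
    for _ in range(max_turns):
        step = generate(context)      # model reasons, then searches or answers
        context += step
        answer = ANSWER_RE.search(step)
        if answer:
            return context, answer.group(1).strip()
        query = SEARCH_RE.search(step)
        if query is None:
            break                     # neither a search nor an answer: stop
        # Wrap retrieved text so it can be told apart from generated text.
        results = search(query.group(1).strip())
        context += f"<information>{results}</information>"
    return context, None


# Hypothetical stand-ins for the policy model and the search engine.
def toy_generate(context: str) -> str:
    if "<information>" in context:
        return "<think>The passage names the capital.</think><answer>Canberra</answer>"
    return "<think>I need the capital of Australia.</think><search>capital of Australia</search>"


def toy_search(query: str) -> str:
    return "Canberra is the capital of Australia."


trajectory, answer = rollout("What is the capital of Australia?",
                             toy_generate, toy_search)
```

The `max_turns` cap is a design choice of this sketch: it keeps a rollout finite even if the model never emits an answer.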

Training Design

Component        Design choice
Learning method  Reinforcement learning (PPO or GRPO)
Token format     <think>, <search>, <information> special tokens
Loss masking     Retrieved tokens masked; the model trains only on its own generated tokens
Reward signal    Outcome-based (exact match); no process reward
Trajectories     No human labels required
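
An outcome-based exact-match reward could look like the sketch below. The normalisation steps (lowercasing, stripping punctuation and English articles) follow a common QA evaluation convention and are an assumption here, as are the function names; the source only states that the reward is exact match on the final answer.

```python
import re
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: Sequence[str]) -> float:
    """Return 1.0 if the prediction matches any gold answer after normalisation."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0


reward_hit = exact_match_reward("The Eiffel Tower.", ["Eiffel Tower"])
reward_miss = exact_match_reward("Paris", ["London"])
```

Because the reward looks only at the final answer, nothing in it scores the intermediate reasoning or the search queries themselves.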

Token-level loss masking is a critical technical contribution: without it, the model inadvertently optimises over retrieved content it didn’t generate, leading to unstable training.
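
A minimal sketch of the masking idea, assuming the trajectory has already been split into tokens and that retrieved text sits between `<information>` and `</information>` markers. The function names are hypothetical, and masking the two delimiter tokens along with their contents is an assumption of this sketch.

```python
from typing import List


def build_loss_mask(tokens: List[str]) -> List[int]:
    """Return 1 for tokens that count toward the loss, 0 for retrieved tokens."""
    mask = []
    inside = False
    for tok in tokens:
        if tok == "<information>":
            inside = True
        mask.append(0 if inside else 1)
        if tok == "</information>":
            inside = False
    return mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over unmasked positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


tokens = ["<think>", "need", "capital", "</think>",
          "<search>", "capital", "</search>",
          "<information>", "Canberra", "</information>",
          "<answer>", "Canberra", "</answer>"]
mask = build_loss_mask(tokens)

# Large losses on retrieved tokens do not affect the masked average.
losses = [1.0] * 7 + [100.0] * 3 + [1.0] * 3
loss = masked_mean(losses, mask)
```

In a real training setup the same mask would be applied to the per-token policy loss tensor; plain lists are used here to keep the example dependency-free.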

Performance

Evaluated on seven QA datasets (NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle):

Model        Gain over baseline
Qwen2.5-7B   +26%
Qwen2.5-3B   +21%
LLaMA3.2-3B  +10%

Comparison to RAG

Aspect            RAG               Search-R1
Retrieval turns   1                 Multiple (dynamic)
Query source      Fixed user query  RL-generated mid-reasoning
Knowledge source  Static index      Live web
Training          Supervised        Reinforcement learning
Latency           Fast              Slower (external API calls)