Reinforcement Learning for Search

Using reinforcement learning to train language models to perform search — deciding when to query, what to query, and how to use retrieved results — without requiring human-labeled reasoning trajectories.


Supervised approaches to teaching LLMs to use search tools require labeled examples of “good” search sequences — expensive to produce and brittle across domains. RL sidesteps this by rewarding the model for final answer correctness, letting it discover effective search strategies autonomously.

Key Design Choices

Reward Function

  • Outcome-based (e.g., exact match on the final answer): simple to compute and less prone to reward hacking, since only the final answer is scored
  • Process-based (rewarding intermediate steps): more complex to design and more prone to gaming
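
As an illustration of the outcome-based option, here is a minimal sketch of an exact-match reward. The normalization steps (lowercasing, stripping punctuation and English articles) are a common question-answering convention and an assumption here, not something this note specifies. The function names `normalize_answer` and `exact_match_reward` are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any gold answer, else 0.0."""
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0
```

Because the reward looks only at the final answer string, it gives no credit for intermediate searches or reasoning steps. This keeps it cheap to compute, but the signal is sparse.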

Loss Masking

When retrieved content is inserted into the model’s context, those tokens should be masked out of the training loss: the model should not be rewarded or penalised for content it retrieved but did not generate. Without this masking, the policy update also covers text the model never produced, and training tends to become unstable.
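
A minimal sketch of the masking idea, assuming the rollout is already a list of tokens and that retrieved content is wrapped in `<information>` and `</information>` marker tokens. Treating the markers themselves as masked is an assumption, as are the helper names `build_loss_mask` and `masked_mean`. A real trainer would apply the same mask to tensors of per-token losses.

```python
def build_loss_mask(tokens, open_tag="<information>", close_tag="</information>"):
    """Return 1 for model-generated tokens and 0 for retrieved tokens.

    The open and close markers are masked too, on the assumption that the
    environment (not the model) inserts the whole retrieved block.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
        mask.append(0 if inside else 1)
        if tok == close_tag:
            inside = False
    return mask


def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept
```

With this mask, large loss values on retrieved tokens have no effect on the averaged objective.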

RL Algorithms

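  • PPO (Proximal Policy Optimization): estimates advantages with a learned value function (critic); tends to be more stable across architectures
  • GRPO (Group Relative Policy Optimization): drops the critic and uses the mean reward of a group of sampled responses as the baseline; tends to converge faster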
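
To make the difference concrete, the sketch below computes GRPO-style advantages for one prompt: each sampled response's reward is normalized by the mean and standard deviation of its group, so no learned critic is needed. Using the population standard deviation and the `eps` value are assumptions, and implementations differ on both. The function name `group_relative_advantages` is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient. This is one practical consequence of using a binary outcome reward with this scheme.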

Current Implementations

  • Search-R1 — interleaved <think>, <search>, <information> token loop; outcome reward; token-level loss masking
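
The interleaved loop can be sketched as follows. This is an illustrative outline rather than the Search-R1 implementation. `generate` and `retrieve` are hypothetical callables standing in for the policy model and the search engine, and the `<answer>` tag and `max_turns` limit are assumptions added so that the loop terminates.

```python
import re

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(question, generate, retrieve, max_turns=4):
    """Alternate between model generation and retrieval until an answer appears.

    generate(context) -> str : hypothetical policy call returning the next segment
    retrieve(query)   -> str : hypothetical search call returning passage text
    """
    context = question
    for _ in range(max_turns):
        segment = generate(context)
        context += segment
        answer = ANSWER_RE.search(segment)
        if answer:
            return context, answer.group(1).strip()
        query = SEARCH_RE.search(segment)
        if query is None:
            break  # neither a search nor an answer: stop the rollout
        passages = retrieve(query.group(1).strip())
        # Retrieved text is appended by the environment, not generated by the model.
        context += f"<information>{passages}</information>"
    return context, None
```

The returned context is the full trajectory used for training. The `<information>` spans in it are the ones excluded from the loss.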

Relationship to Other Approaches

Approach          | How It Uses Search
RAG               | Single retrieval before generation; no RL
Tool-use / ReAct  | Prompting or RL; often needs supervised trajectories
Search-R1         | Pure RL; no human trajectories; multi-turn

  • Search-R1: primary RL-for-search framework
  • Agentic Search: broader paradigm this training methodology enables
  • RAG: the single-retrieval, non-RL baseline