Reinforcement Learning for Search

Reinforcement learning (RL) can train language models to perform search (deciding when to query, what to query, and how to use retrieved results) without requiring human-labeled reasoning trajectories.

Why RL for Search

Supervised approaches to teaching LLMs to use search tools require labeled examples of “good” search sequences, which are expensive to produce and brittle across domains. RL sidesteps this by rewarding the model only for final-answer correctness, letting it discover effective search strategies on its own.

Key Design Choices

Reward Function

Outcome-based rewards (e.g., exact match on the final answer): simple and robust, with little surface for reward hacking
Process-based rewards (scoring intermediate steps): more complex to design and more prone to gaming
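
To make the outcome-based option concrete, here is a minimal sketch of an exact-match reward. The `<answer>...</answer>` wrapper, the normalization rules, and all function names are assumptions made for illustration; they are not taken from any particular framework.

```python
import re
import string
from typing import List, Optional


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(response: str) -> Optional[str]:
    """Return the last <answer>...</answer> span (assumed output format), if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def exact_match_reward(response: str, gold_answers: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0  # no parseable answer earns no reward
    normalized = normalize_answer(predicted)
    return 1.0 if any(normalized == normalize_answer(g) for g in gold_answers) else 0.0
```

Because the reward looks only at the final answer, nothing in the intermediate reasoning or queries can be optimized against it directly, which is the main reason this style of reward is hard to game.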

Loss Masking

When retrieved content is inserted into the model’s context, those tokens should be masked out of the training loss: the model did not generate them, so it should be neither rewarded nor penalised for them. Without masking, training tends to be unstable.
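
The masking idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: each tag is treated as a single token (real tokenizers usually split tags into several pieces), the per-token losses are given as plain floats, and the names `build_loss_mask` and `masked_mean` are hypothetical.

```python
from typing import List


def build_loss_mask(tokens: List[str],
                    open_tag: str = "<information>",
                    close_tag: str = "</information>") -> List[int]:
    """Return 1 for model-generated tokens and 0 for retrieved tokens.

    Everything from open_tag through close_tag (inclusive) is treated as
    retrieved content and excluded from the loss.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
        mask.append(0 if inside else 1)
        if tok == close_tag:
            inside = False
    return mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept
```

In a real training loop the same mask would multiply the per-token policy-gradient terms, so gradients flow only through tokens the policy actually produced.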

RL Algorithms

PPO (Proximal Policy Optimization) — more stable across architectures
GRPO (Group Relative Policy Optimization) — faster convergence
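
The core difference between the two is how the advantage is estimated: PPO relies on a learned value function (critic), while GRPO samples several responses per prompt and normalizes each reward against its group. A minimal sketch of the group-relative step follows; the function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group.

    `rewards` holds the scalar rewards of several responses sampled for the
    same prompt. `eps` guards against division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question, scored with a 0/1 outcome reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage; when every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.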

Current Implementations

Search-R1 — interleaved <think>, <search>, <information> token loop; outcome reward; token-level loss masking
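
The interleaved loop can be sketched as follows. This is a simplified illustration, not Search-R1's actual code: `generate` and `retrieve` are hypothetical callables standing in for the policy model and the search engine, and the `<answer>` stop tag and `max_turns` limit are assumptions.

```python
import re
from typing import Callable, List

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def rollout(question: str,
            generate: Callable[[str], str],
            retrieve: Callable[[str], List[str]],
            max_turns: int = 4) -> str:
    """Alternate between model generation and retrieval until an answer appears."""
    context = question
    for _ in range(max_turns):
        segment = generate(context)  # model-generated tokens (trained on)
        context += segment
        if "<answer>" in segment:
            break  # the model committed to a final answer
        match = SEARCH_RE.search(segment)
        if match is None:
            break  # neither a query nor an answer: stop the episode
        passages = retrieve(match.group(1).strip())
        # Retrieved tokens: appended to the context but masked from the loss.
        context += "<information>" + "\n".join(passages) + "</information>"
    return context


# Stub policy and retriever, used only to exercise the loop.
def fake_generate(context: str) -> str:
    if "<information>" in context:
        return "<think>I have what I need.</think><answer>Paris</answer>"
    return "<think>I should look this up.</think><search>capital of France</search>"


def fake_retrieve(query: str) -> List[str]:
    return ["Paris is the capital of France."]


trace = rollout("Q: What is the capital of France?\n", fake_generate, fake_retrieve)
```

The returned trace is what the outcome reward scores, with the `<information>` spans excluded from the loss as described under Loss Masking.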

Relationship to Other Approaches

Approach	How It Uses Search
RAG	Single retrieval before generation; no RL
Tool-use / ReAct	Prompting or RL; often needs supervised trajectories
Search-R1	Pure RL; no human trajectories; multi-turn

Related Concepts

Search-R1: primary RL-for-search framework
Agentic Search: broader paradigm this training methodology enables
RAG: the supervised single-turn baseline
