Pairwise Relevance Evaluation

Definition

A relevance judgment method where an evaluator (human or LLM) is shown two candidates side-by-side — conventionally called LHS (left-hand side) and RHS (right-hand side) — and asked which is more relevant to a given query. Produces relative preference labels rather than absolute scores.

Why Pairwise Over Pointwise?

Pointwise scoring (rate this document 0–3) is hard to calibrate consistently: annotators disagree on what a grade of 2 on a 0–3 scale means. Pairwise comparison is cognitively simpler: “which of these two is better?” Humans and LLMs both agree more reliably on relative preferences than on absolute grades.

The LHS/RHS Prompt Format

A canonical pairwise LLM judge prompt:

Which of these products is more relevant to the search query?

Query: entrance table

Product LHS: aleah coffee table
Product RHS: marta coffee table

Only respond 'LHS' or 'RHS' if you are confident.
Otherwise respond 'Neither'.

Forcing a binary choice (LHS or RHS, with no “Neither” option) maximizes recall but lowers precision (~75%). Allowing “Neither” and requiring double-check consistency raises precision to ~91% but cuts recall to ~12%.
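The two prompt variants can be generated from one template. The following is a minimal sketch, not the original author's code: the function names `build_pairwise_prompt` and `parse_label` are hypothetical, and the wording of the forced-choice instruction is an assumption, since only the “Neither” variant is quoted above.

```python
from typing import Optional


def build_pairwise_prompt(query: str, lhs: str, rhs: str,
                          allow_neither: bool = True) -> str:
    """Render the LHS/RHS judge prompt for one pair of products."""
    lines = [
        "Which of these products is more relevant to the search query?",
        "",
        f"Query: {query}",
        "",
        f"Product LHS: {lhs}",
        f"Product RHS: {rhs}",
        "",
    ]
    if allow_neither:
        # Wording taken from the canonical prompt above.
        lines.append("Only respond 'LHS' or 'RHS' if you are confident.")
        lines.append("Otherwise respond 'Neither'.")
    else:
        # Assumed wording for the forced-choice variant.
        lines.append("Respond with exactly 'LHS' or 'RHS'.")
    return "\n".join(lines)


def parse_label(response: str, allow_neither: bool = True) -> Optional[str]:
    """Normalize a raw reply to 'LHS', 'RHS', or 'Neither'; None if unusable."""
    cleaned = response.strip().strip("'\".").upper()
    if cleaned in ("LHS", "RHS"):
        return cleaned
    if cleaned == "NEITHER" and allow_neither:
        return "Neither"
    return None
```

Returning `None` for unparseable replies keeps malformed output separate from a deliberate “Neither”, so the caller can decide whether to retry or discard.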

Double-Check (Swap) Technique

Run the same prompt twice with LHS and RHS swapped. Accept the judgment only when both orderings agree:

Pass 1: LHS=product A, RHS=product B → LHS
Pass 2: LHS=product B, RHS=product A → RHS (consistent — both prefer A)

If the two passes disagree, label the pair “Neither”. This filters out judgments driven by positional bias or low confidence.
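The swap check can be sketched as a small wrapper around any judge. This is an illustrative sketch under assumptions: `judge` stands in for whatever function calls the LLM and returns 'LHS', 'RHS', or 'Neither', and the two toy judges exist only to show the behaviour.

```python
from typing import Callable

# Hypothetical judge signature: (query, lhs, rhs) -> 'LHS' | 'RHS' | 'Neither'
Judge = Callable[[str, str, str], str]


def double_check(judge: Judge, query: str, a: str, b: str) -> str:
    """Judge both orderings; return 'A', 'B', or 'Neither'."""
    first = judge(query, a, b)   # pass 1: A is shown as LHS
    second = judge(query, b, a)  # pass 2: A is shown as RHS
    if first == "LHS" and second == "RHS":
        return "A"  # both passes prefer A
    if first == "RHS" and second == "LHS":
        return "B"  # both passes prefer B
    return "Neither"  # disagreement, or the judge abstained in a pass


def always_left(query: str, lhs: str, rhs: str) -> str:
    """Toy judge with pure positional bias."""
    return "LHS"


def shorter_name(query: str, lhs: str, rhs: str) -> str:
    """Toy judge whose preference depends on content, not position."""
    return "LHS" if len(lhs) <= len(rhs) else "RHS"
```

A judge that always answers 'LHS' never survives the swap, while a judge whose preference follows the content gives the same winner in both orderings.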

Combining with Classical ML

Individual pairwise LLM judges per attribute (name, description, category, classification) can be treated as binary features for a downstream classifier. A decision tree trained on these features reaches higher precision than any single judge, at lower recall:

Feature combination	Precision	Recall
Best single judge (force+double-check)	91.7%	65.2%
5-feature decision tree	96.7%	~40%
Extreme high-precision tree	100%	1.3%

See Classic ML to Cope with Dumb LLM Judges for the full experiment.
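One way to turn per-attribute judgments into classifier input is sketched below. The numeric encoding (+1 for LHS, -1 for RHS, 0 for Neither), the attribute order, and the helper names are assumptions made for illustration; the linked article may encode its features differently.

```python
from typing import Dict, List, Tuple

# Attribute judges named in the text; the ordering is an arbitrary choice.
ATTRIBUTES = ["name", "description", "category", "classification"]

# Assumed encoding of a judge's answer as a number.
ENCODING = {"LHS": 1, "Neither": 0, "RHS": -1}


def encode_pair(judgments: Dict[str, str]) -> List[int]:
    """Map {attribute: label} to a fixed-order feature vector."""
    return [ENCODING[judgments.get(attr, "Neither")] for attr in ATTRIBUTES]


def build_dataset(
    rows: List[Tuple[Dict[str, str], str]]
) -> Tuple[List[List[int]], List[int]]:
    """Split (judgments, human_label) rows into features X and targets y."""
    X = [encode_pair(judgments) for judgments, _ in rows]
    y = [ENCODING[label] for _, label in rows]
    return X, y
```

The resulting `X` and `y` are plain lists, so they can be passed to any classifier that accepts tabular features, for example a decision tree.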

Tradeoffs

Approach	Precision	Recall	Cost
Force LHS/RHS	~75%	~100%	1×
Allow Neither	~85%	~50%	1×
Double-check swap	~91%	~12%	2×
ML ensemble of judges	~97%	~40%	N× per attribute
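The precision and recall columns can be computed from a set of judged pairs as sketched below. This assumes that "recall" here means the share of pairs that receive an LHS or RHS label (coverage), which matches the phrasing "96.7% precision on 40% of pairs" used for the ensemble; the function name and sample labels are hypothetical.

```python
from typing import List, Tuple


def precision_and_coverage(predictions: List[str],
                           truths: List[str]) -> Tuple[float, float]:
    """Precision over decided pairs, and the fraction of pairs decided."""
    if not predictions or len(predictions) != len(truths):
        raise ValueError("predictions and truths must be non-empty and equal length")
    # Only pairs where the judge committed to LHS or RHS count as decided.
    decided = [(p, t) for p, t in zip(predictions, truths) if p != "Neither"]
    correct = sum(1 for p, t in decided if p == t)
    precision = correct / len(decided) if decided else 0.0
    coverage = len(decided) / len(predictions)
    return precision, coverage


# Made-up labels for illustration only.
preds = ["LHS", "Neither", "RHS", "LHS"]
truth = ["LHS", "RHS", "RHS", "RHS"]
precision, coverage = precision_and_coverage(preds, truth)
```

In this toy sample three of four pairs are decided and two of those three are correct, so abstaining more often can raise precision only by lowering coverage.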

Comparison to Other Paradigms

Paradigm	Input	Output	Scalability	Signal strength
Pointwise	(query, doc)	absolute grade	O(n)	weakest
Pairwise	(query, doc_A, doc_B)	preference	O(n²)	strong
Listwise	(query, [doc_1…doc_k])	ranked order	O(k) per query	strongest

Related Concepts

Pointwise Relevance Evaluation — scores items independently; simpler but weaker signal
Listwise Relevance Evaluation — ranks full list; most holistic, best for reranking evaluation
LLM as Judge — the evaluation paradigm pairwise comparison is used within
Judgment Lists — output that pairwise judgments contribute to
Search Evaluation — broader context
Decision Tree — classical ML used to ensemble pairwise signals
Precision-Recall Tradeoff — fundamental tension in the Neither/force choice

Articles

What AI Engineers Should Know about Search — Doug Turnbull; mentions pairwise as an LTR training paradigm (SVMRank)
Classic ML to Cope with Dumb LLM Judges — Doug Turnbull; per-attribute LHS/RHS LLM judges as ML features → decision tree; 96.7% precision on 40% of pairs
Search Quality Assurance with AI as a Judge — Tao Ruangyam; Zalando production pipeline using pairwise-style LLM evaluation at scale

People

Doug Turnbull — explored LHS/RHS prompt variants and ML ensembling of pairwise judges
