Pairwise Relevance Evaluation

Definition

A relevance judgment method where an evaluator (human or LLM) is shown two candidates side-by-side — conventionally called LHS (left-hand side) and RHS (right-hand side) — and asked which is more relevant to a given query. Produces relative preference labels rather than absolute scores.

Why Pairwise Over Pointwise?

Pointwise scoring (rate this document 0–3) is hard to calibrate consistently: annotators disagree on what a grade of “2” on that scale means. Pairwise comparison is cognitively simpler: “which of these two is better?” Humans and LLMs both agree more reliably on relative preferences than on absolute grades.

The LHS/RHS Prompt Format

A canonical pairwise LLM judge prompt:

Which of these products is more relevant to the search query?

Query: entrance table

Product LHS: aleah coffee table
Product RHS: marta coffee table

Only respond 'LHS' or 'RHS' if you are confident.
Otherwise respond 'Neither'.

Forcing a binary choice (LHS or RHS, no “Neither”) maximizes recall but holds precision to ~75%. Allowing “Neither” and requiring double-check consistency raises precision to ~91%, but recall falls to ~12%.
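The two prompt variants can be sketched as a small template plus a response parser. This is a minimal sketch, not the original author's code: the names build_prompt and parse_label are hypothetical, and the wording of the forced-choice instruction is an assumption (only the “Neither” variant appears above).

```python
PROMPT_TEMPLATE = """Which of these products is more relevant to the search query?

Query: {query}

Product LHS: {lhs}
Product RHS: {rhs}

{instruction}"""

# Assumed wording for the forced-choice variant (not taken from the source).
FORCED = "Respond with only 'LHS' or 'RHS'."
# Wording of the variant shown in the prompt above.
ALLOW_NEITHER = (
    "Only respond 'LHS' or 'RHS' if you are confident.\n"
    "Otherwise respond 'Neither'."
)


def build_prompt(query: str, lhs: str, rhs: str, allow_neither: bool = True) -> str:
    """Fill the template with one query and two candidate products."""
    instruction = ALLOW_NEITHER if allow_neither else FORCED
    return PROMPT_TEMPLATE.format(
        query=query, lhs=lhs, rhs=rhs, instruction=instruction
    )


def parse_label(response: str) -> str:
    """Map raw model text to 'LHS' or 'RHS'; anything else becomes 'Neither'."""
    cleaned = response.strip().strip("'\".").upper()
    if cleaned in ("LHS", "RHS"):
        return cleaned
    return "Neither"
```

Treating any unrecognized response as “Neither” is a conservative choice: a judge that rambles instead of answering is counted as abstaining rather than guessed at.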

Double-Check (Swap) Technique

Run the same prompt twice with LHS and RHS swapped. Accept the judgment only when both orderings agree:

  • Pass 1: LHS=product A, RHS=product B → LHS
  • Pass 2: LHS=product B, RHS=product A → RHS (consistent — both prefer A)

If the two passes prefer different products (for example, both answer LHS), or either pass answers “Neither”, label the pair “Neither”. This filters out positional bias and low-confidence judgments.
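The swap check above can be sketched as a wrapper around any judge function. This is an illustrative sketch under stated assumptions: the Judge signature and the two toy judges (always_lhs, keyword_judge) are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge signature: (query, lhs, rhs) -> 'LHS' | 'RHS' | 'Neither'
Judge = Callable[[str, str, str], str]


def double_check(judge: Judge, query: str, a: str, b: str) -> str:
    """Judge both orderings; return 'A', 'B', or 'Neither'."""
    first = judge(query, a, b)   # product A shown as LHS
    second = judge(query, b, a)  # product A shown as RHS
    if first == "LHS" and second == "RHS":
        return "A"  # both passes prefer A
    if first == "RHS" and second == "LHS":
        return "B"  # both passes prefer B
    return "Neither"  # passes disagree, or at least one abstained


def always_lhs(query: str, lhs: str, rhs: str) -> str:
    """Toy judge with pure positional bias."""
    return "LHS"


def keyword_judge(query: str, lhs: str, rhs: str) -> str:
    """Toy judge that prefers whichever product mentions 'console'."""
    if "console" in lhs and "console" not in rhs:
        return "LHS"
    if "console" in rhs and "console" not in lhs:
        return "RHS"
    return "Neither"
```

A purely position-biased judge such as always_lhs picks a different product in each pass, so the wrapper returns “Neither” for every pair it sees.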

Combining with Classical ML

Pairwise LLM judges can be run separately per attribute (name, description, category, classification), and each judge's verdict treated as a binary feature for a downstream classifier. A decision tree trained on these features reaches higher precision than any single judge, at the cost of recall:

Feature combination                       Precision   Recall
Best single judge (force+double-check)    91.7%       65.2%
5-feature decision tree                   96.7%       ~40%
Extreme high-precision tree               100%        1.3%

See Classic ML to Cope with Dumb LLM Judges for the full experiment.
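One way to turn per-attribute verdicts into classifier input is sketched below. The encoding is an assumption, not the layout used in the referenced experiment: each attribute judge contributes two 0/1 columns (prefers A, prefers B), so an abstention shows up as both columns being zero. The names ATTRIBUTES, to_features, and feature_names are hypothetical.

```python
# Attribute judges named in the text; the column layout is illustrative only.
ATTRIBUTES = ("name", "description", "category", "classification")


def to_features(verdicts: dict[str, str]) -> list[int]:
    """Flatten per-attribute verdicts ('A', 'B', 'Neither') into 0/1 columns."""
    row = []
    for attr in ATTRIBUTES:
        verdict = verdicts.get(attr, "Neither")  # missing judge counts as abstaining
        row.append(1 if verdict == "A" else 0)   # <attr>_prefers_a
        row.append(1 if verdict == "B" else 0)   # <attr>_prefers_b
    return row


def feature_names() -> list[str]:
    """Column names in the same order that to_features emits values."""
    return [f"{attr}_prefers_{side}" for attr in ATTRIBUTES for side in ("a", "b")]
```

Rows built this way, paired with ground-truth preferences, can be passed to any decision tree learner; the tree then learns which combinations of judge agreement are trustworthy.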

Tradeoffs

Approach                 Precision   Recall   Cost
Force LHS/RHS            ~75%        ~100%
Allow Neither            ~85%        ~50%
Double-check swap        ~91%        ~12%
ML ensemble of judges    ~97%        ~40%     N× per attribute
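The precision and recall figures above can be computed from judge output roughly as follows. This sketch assumes that recall means the share of pairs that receive a confident label (which is how ~100% recall for the forced-choice row reads) and that precision is accuracy over those labeled pairs only; the function name precision_recall is hypothetical.

```python
def precision_recall(predicted: list[str], truth: list[str]) -> tuple[float, float]:
    """Score judge verdicts against ground-truth preferences.

    predicted holds 'A', 'B', or 'Neither'; truth holds 'A' or 'B'.
    """
    # Only pairs where the judge committed to an answer count toward precision.
    answered = [(p, t) for p, t in zip(predicted, truth) if p != "Neither"]
    correct = sum(1 for p, t in answered if p == t)
    precision = correct / len(answered) if answered else 0.0
    # Assumed definition: recall is the fraction of pairs that got a label.
    recall = len(answered) / len(predicted) if predicted else 0.0
    return precision, recall
```

Under this definition, every abstention lowers recall without touching precision, which is why stricter acceptance rules move the two numbers in opposite directions.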

Comparison to Other Paradigms

Paradigm    Input                      Output           Scalability      Signal strength
Pointwise   (query, doc)               absolute grade   O(n)             weakest
Pairwise    (query, doc_A, doc_B)      preference       O(n²)            strong
Listwise    (query, [doc_1…doc_k])     ranked order     O(k per query)   strongest
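To make the O(n²) cost of the pairwise paradigm concrete, the sketch below counts judge calls for one query with n candidates, assuming every unordered pair is judged. The helper names all_pairs and pairwise_calls are hypothetical.

```python
from itertools import combinations


def all_pairs(docs: list[str]) -> list[tuple[str, str]]:
    """Every unordered pair of candidates for one query."""
    return list(combinations(docs, 2))


def pairwise_calls(num_docs: int, double_check: bool = False) -> int:
    """Judge calls needed to cover all pairs; the swap check doubles the count."""
    pairs = num_docs * (num_docs - 1) // 2  # n choose 2
    return 2 * pairs if double_check else pairs
```

For 10 candidates this is 45 calls, or 90 with the swap check, compared with 10 calls for pointwise grading of the same candidates.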

Articles

People

  • Doug Turnbull — explored LHS/RHS prompt variants and ML ensembling of pairwise judges