Precision and Recall

Definition

Precision: Of the documents retrieved, what fraction are relevant?
Recall: Of all relevant documents, what fraction were retrieved?

Precision = |Retrieved ∩ Relevant| / |Retrieved|
Recall    = |Retrieved ∩ Relevant| / |Relevant|
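
The two formulas map directly onto set operations. The following is a minimal sketch rather than a reference implementation; the function name precision_recall and the document IDs are made up for illustration, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # |Retrieved ∩ Relevant|
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```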

The Precision-Recall Tradeoff

Precision and recall are in tension:

  • Higher recall: retrieve more documents → more relevant ones found, but also more irrelevant ones → lower precision
  • Higher precision: be selective → top results are relevant, but some relevant docs missed → lower recall

This is the fundamental tradeoff in information retrieval. There is no single best operating point; the right balance depends on the use case.
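
One way to see the tension is to walk down a ranked list and recompute both metrics at every cutoff. The sketch below uses a made-up ranking and made-up relevance judgments; the helper name sweep is hypothetical. Recall can only stay flat or rise as more documents are retrieved, while precision tends to fall (it is not strictly monotone).

```python
def sweep(ranking, relevant):
    """Return (k, precision, recall) for every cutoff k in the ranked list."""
    rows = []
    hits = 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        rows.append((k, hits / k, hits / len(relevant)))
    return rows


# Hypothetical data: 6 ranked documents, 3 of which are relevant.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d1", "d3", "d6"}

for k, p, r in sweep(ranking, relevant):
    print(f"k={k}  precision={p:.2f}  recall={r:.2f}")
# k=1  precision=1.00  recall=0.33
# k=2  precision=0.50  recall=0.33
# k=3  precision=0.67  recall=0.67
# k=4  precision=0.50  recall=0.67
# k=5  precision=0.40  recall=0.67
# k=6  precision=0.50  recall=1.00
```

Reaching full recall here requires retrieving all six documents, at which point precision has dropped from 1.00 to 0.50.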

Precision@k (P@k)

Instead of measuring over the full ranked list, measure precision at a cutoff:

P@k = (relevant documents in top k) / k

Examples:

  • P@1: is the first result relevant? (Click-satisfaction proxy)
  • P@5: are most of the first five results relevant?
  • P@10: standard IR evaluation cutoff

Unlike NDCG, P@k ignores document ranking within the cutoff — all positions within k are equally weighted.
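
A minimal sketch of P@k, assuming binary relevance judgments stored as a set of document IDs. The function name precision_at_k and the data are illustrative. Following the formula above, the denominator is always k, even if fewer than k results were returned.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k  # denominator is k, not the number of results returned


# Hypothetical ranking and judgments.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}

print(precision_at_k(ranking, relevant, 1))  # 1.0
print(precision_at_k(ranking, relevant, 5))  # 0.4

# Reordering the top 5 does not change P@5: positions within k are unweighted.
print(precision_at_k(ranking[::-1], relevant, 5))  # 0.4
```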

Recall@k

Fraction of relevant documents found in top k:

Recall@k = (relevant documents in top k) / |all relevant documents|
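
Recall@k differs from P@k only in its denominator. This sketch again assumes binary judgments; the function name recall_at_k and the data are illustrative, and returning 0.0 when there are no relevant documents is an assumed convention.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    hits = len(set(ranking[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranking and judgments; "d6" is relevant but never retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}

print(f"{recall_at_k(ranking, relevant, 2):.2f}")  # 0.33
print(f"{recall_at_k(ranking, relevant, 5):.2f}")  # 0.67
```

Because "d6" is missing from the ranking entirely, no cutoff can push recall above 0.67 in this example.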

High Recall@1000 is the target for first-stage retrieval in a retrieval pipeline:

  • First stage needs to recall nearly all relevant docs
  • Second stage (Cross-Encoder reranker) handles ranking quality
  • A relevant doc missed at stage 1 is lost for good: later stages only rerank what stage 1 returned

F1 Score

Harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 weights precision and recall equally. Use it when both matter and you want a single number.
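
A short sketch of the F1 formula, with the function name f1_score chosen for illustration and 0.0 returned when both inputs are zero (an assumed convention). The second call shows why the harmonic mean is used: it penalizes imbalance where an arithmetic mean would not.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid division by zero
    return 2 * precision * recall / (precision + recall)


print(f"{f1_score(0.5, 0.5):.2f}")  # 0.50
print(f"{f1_score(0.9, 0.1):.2f}")  # 0.18 (the arithmetic mean would be 0.50)
```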

When to Use Each Metric

Metric         Best for
P@1            First result quality (navigational queries)
P@5            Above-the-fold quality
P@10           Standard page quality
Recall@1000    First-stage retrieval coverage
NDCG@10        Overall ranking quality with graded relevance
MRR            Known-item / QA systems
F1             Classification tasks, information extraction

Precision vs. NDCG

P@k treats all positions within k equally: a relevant document at rank 1 counts the same as one at rank k.
NDCG discounts lower ranks: a relevant document at rank 1 is worth more than the same document at rank k.

NDCG is generally preferred for ranked retrieval evaluation.
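
The difference shows up when two rankings contain the same relevant document at different positions. The sketch below assumes the common log2 rank discount for DCG and, as a simplification, builds the ideal ranking by sorting the same list of labels. The function names and relevance labels are illustrative.

```python
import math


def precision_at_k(rels, k):
    """P@k over a list of relevance labels in ranked order (label > 0 is relevant)."""
    return sum(1 for r in rels[:k] if r > 0) / k


def dcg_at_k(rels, k):
    """Discounted cumulative gain, assuming a log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels: the single relevant document is at rank 1 (a) or rank 3 (b).
ranking_a = [1, 0, 0]
ranking_b = [0, 0, 1]

print(precision_at_k(ranking_a, 3) == precision_at_k(ranking_b, 3))  # True
print(ndcg_at_k(ranking_a, 3))  # 1.0
print(ndcg_at_k(ranking_b, 3))  # 0.5
```

P@3 cannot tell the two rankings apart, while NDCG@3 halves the score when the relevant document slips from rank 1 to rank 3.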

People