Compute Mean Reciprocal Rank (MRR) using Pandas

Source: https://softwaredoug.com/blog/2021/04/21/compute-mrr-using-pandas.html
Author: Doug Turnbull

Summary

A practical tutorial by Doug Turnbull showing how to compute MRR efficiently from search evaluation data using Pandas — useful for teams building their own lightweight evaluation tooling.

The Data Model

Start with a DataFrame representing search results:

import pandas as pd
 
# Each row = one result in a ranked list
results_df = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q1', 'q2', 'q2'],
    'doc_id': ['d1', 'd2', 'd3', 'd1', 'd4'],
    'rank': [1, 2, 3, 1, 2],
    'relevant': [0, 1, 0, 1, 0]  # from judgment list
})
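The `relevant` column above is noted as coming from a judgment list. A minimal sketch of how that join might be done, assuming two hypothetical frames that are not in the original post: `ranked` for the search system's output and `judgments` for the judged query/document pairs.

```python
import pandas as pd

# Hypothetical search output: one row per returned document
ranked = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q1', 'q2', 'q2'],
    'doc_id': ['d1', 'd2', 'd3', 'd1', 'd4'],
    'rank': [1, 2, 3, 1, 2],
})

# Hypothetical judgment list: only judged (query, doc) pairs appear
judgments = pd.DataFrame({
    'query_id': ['q1', 'q2'],
    'doc_id': ['d2', 'd1'],
    'relevant': [1, 1],
})

# Left join keeps every ranked result; unjudged pairs become NaN,
# which this sketch treats as "not relevant" (an assumption, not a rule)
results_df = ranked.merge(judgments, on=['query_id', 'doc_id'], how='left')
results_df['relevant'] = results_df['relevant'].fillna(0).astype(int)

print(results_df)
```

Treating unjudged documents as not relevant is a simplification; a team may prefer to drop or flag them instead.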

Computing MRR

def compute_mrr(results_df, k=10):
    # Filter to within cutoff k
    top_k = results_df[results_df['rank'] <= k]
    
    # For each query, find the rank of the first relevant result
    relevant = top_k[top_k['relevant'] == 1]
    first_relevant = relevant.groupby('query_id')['rank'].min()
    
    # Reciprocal rank for each query
    rr = 1.0 / first_relevant
    
    # Queries with no relevant result in top-k get RR=0
    all_queries = results_df['query_id'].unique()
    rr = rr.reindex(all_queries, fill_value=0.0)
    
    return rr.mean()
 
mrr = compute_mrr(results_df, k=10)
print(f"MRR@10: {mrr:.4f}")
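As a sanity check on the function above, the sample data from The Data Model section can be worked by hand. Here $Q$ is the set of queries and $\mathrm{rank}_q$ is the rank of the first relevant result for query $q$; a query with no relevant result in the top $k$ contributes 0 instead of $1/\mathrm{rank}_q$.

```latex
\mathrm{MRR@}k = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}
\qquad\Rightarrow\qquad
\mathrm{MRR@}10 = \frac{1}{2}\left(\frac{1}{2} + \frac{1}{1}\right) = 0.75
```

In the sample data, q1 has its first relevant result at rank 2 and q2 at rank 1, so the script prints `MRR@10: 0.7500`.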

Comparing Two Systems

system_a = pd.read_csv('system_a_results.csv')
system_b = pd.read_csv('system_b_results.csv')
 
mrr_a = compute_mrr(system_a)
mrr_b = compute_mrr(system_b)
 
print(f"System A MRR: {mrr_a:.4f}")
print(f"System B MRR: {mrr_b:.4f}")
print(f"Delta: {mrr_b - mrr_a:+.4f}")
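An aggregate delta hides which queries moved. A minimal sketch of a per-query breakdown, assuming a hypothetical helper `per_query_rr` (not part of the original post) and two tiny inline frames standing in for the CSV files:

```python
import pandas as pd

def per_query_rr(results_df, k=10):
    """Hypothetical helper: reciprocal rank per query, 0.0 if no relevant hit in top k."""
    hits = results_df[(results_df['rank'] <= k) & (results_df['relevant'] == 1)]
    rr = 1.0 / hits.groupby('query_id')['rank'].min()
    return rr.reindex(results_df['query_id'].unique(), fill_value=0.0)

# Stand-ins for system_a_results.csv and system_b_results.csv
system_a = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q2', 'q2'],
    'rank': [1, 2, 1, 2],
    'relevant': [0, 1, 1, 0],
})
system_b = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q2', 'q2'],
    'rank': [1, 2, 1, 2],
    'relevant': [1, 0, 0, 0],
})

# One row per query; the two Series are aligned on query_id
comparison = pd.DataFrame({
    'rr_a': per_query_rr(system_a),
    'rr_b': per_query_rr(system_b),
})
comparison['delta'] = comparison['rr_b'] - comparison['rr_a']

# Largest regressions first
print(comparison.sort_values('delta'))
```

When both systems are evaluated on the same set of queries, the mean of the `delta` column equals the difference between the two MRR values.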

Statistical Significance

MRR differences need significance testing before claiming “improvement”:

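from scipy.stats import wilcoxon

# compute_per_query_rr is not defined in the original; a minimal version mirroring compute_mrr
def compute_per_query_rr(results_df, k=10):
    hits = results_df[(results_df['rank'] <= k) & (results_df['relevant'] == 1)]
    rr = 1.0 / hits.groupby('query_id')['rank'].min()
    return rr.reindex(results_df['query_id'].unique(), fill_value=0.0)

# Per-query reciprocal ranks, aligned so each pair refers to the same query
rr_a = compute_per_query_rr(system_a)
rr_b = compute_per_query_rr(system_b)
rr_a, rr_b = rr_a.align(rr_b, join='inner')

# The default test is two-sided: a small p-value signals a difference, not its direction
stat, p_value = wilcoxon(rr_a, rr_b)
print(f"Wilcoxon test p-value: {p_value:.4f}")
print("Significant difference" if p_value < 0.05 else "Not significant")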

When the Pandas Approach Is Better than Quepid

Custom analysis combining MRR with other signals (CTR, dwell)
Large-scale batch evaluation (>100K results)
Integration into CI/CD pipelines
Scripted comparison between many system variants
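The CI/CD and many-variants cases in the list above might look something like the sketch below. The function names, variant names, and tolerance are assumptions for illustration, and the MRR numbers are placeholders; in practice each would come from `compute_mrr` on that variant's results.

```python
def failing_variants(baseline_mrr, candidate_mrrs, tolerance=0.01):
    """Hypothetical helper: names of variants whose MRR is more than
    `tolerance` below the baseline. The default tolerance is arbitrary."""
    return [
        name for name, mrr in candidate_mrrs.items()
        if mrr < baseline_mrr - tolerance
    ]

def gate_exit_code(baseline_mrr, candidate_mrrs, tolerance=0.01):
    """Return 0 if no variant regressed, 1 otherwise."""
    failures = failing_variants(baseline_mrr, candidate_mrrs, tolerance)
    for name in failures:
        print(f"MRR regression in {name}: "
              f"{candidate_mrrs[name]:.4f} vs baseline {baseline_mrr:.4f}")
    return 1 if failures else 0

# Placeholder numbers, not real measurements
baseline = 0.62
candidates = {"boost_title": 0.64, "synonyms_v2": 0.58}

exit_code = gate_exit_code(baseline, candidates)
print(f"exit code: {exit_code}")
```

A CI job could pass the returned value to `sys.exit`, since a non-zero exit status is the usual way for a script step to mark the job as failed.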

Quepid (UI-based) is better for interactive exploration and team collaboration.

Related Articles

What Is a Judgment List — same author, setting up the judgment data
Flavors of NDCG — same author, companion metric
Session vs Query based Search Evals — same author, broader evaluation framework

Related Concepts

MRR — metric being computed
Search Evaluation — context
Judgment Lists — input data
NDCG — companion metric

People

Doug Turnbull — author; Quepid creator
