Compute Mean Reciprocal Rank (MRR) using Pandas

Source: https://softwaredoug.com/blog/2021/04/21/compute-mrr-using-pandas.html
Author: Doug Turnbull

Summary

A practical tutorial by Doug Turnbull on computing MRR efficiently from search evaluation data with Pandas. It is aimed at teams building their own lightweight evaluation tooling.

The Data Model

Start with a DataFrame representing search results:

import pandas as pd
 
# Each row = one result in a ranked list
results_df = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q1', 'q2', 'q2'],
    'doc_id': ['d1', 'd2', 'd3', 'd1', 'd4'],
    'rank': [1, 2, 3, 1, 2],
    'relevant': [0, 1, 0, 1, 0]  # from judgment list
})
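
The `relevant` column is described as coming from a judgment list, but the join itself is not shown. A minimal sketch of one way to build it, assuming a hypothetical `judgments` DataFrame keyed by `query_id` and `doc_id`, and assuming unjudged documents count as not relevant (both are assumptions for illustration, not part of the original tutorial):

```python
import pandas as pd

# Hypothetical ranked results as returned by the search system (no labels yet)
raw_results = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q1', 'q2', 'q2'],
    'doc_id': ['d1', 'd2', 'd3', 'd1', 'd4'],
    'rank': [1, 2, 3, 1, 2],
})

# Hypothetical judgment list: one row per judged (query, doc) pair
judgments = pd.DataFrame({
    'query_id': ['q1', 'q2', 'q2'],
    'doc_id': ['d2', 'd1', 'd4'],
    'relevant': [1, 1, 0],
})

# Left join keeps every result row; unjudged pairs come back as NaN
labeled = raw_results.merge(judgments, on=['query_id', 'doc_id'], how='left')

# Assumption: treat unjudged documents as not relevant
labeled['relevant'] = labeled['relevant'].fillna(0).astype(int)

print(labeled)
```

Treating unjudged documents as non-relevant is a common default, but it can understate MRR when the judgment list is sparse.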

Computing MRR

def compute_mrr(results_df, k=10):
    # Filter to within cutoff k
    top_k = results_df[results_df['rank'] <= k]
    
    # For each query, find the rank of the first relevant result
    relevant = top_k[top_k['relevant'] == 1]
    first_relevant = relevant.groupby('query_id')['rank'].min()
    
    # Reciprocal rank for each query
    rr = 1.0 / first_relevant
    
    # Queries with no relevant result in top-k get RR=0
    all_queries = results_df['query_id'].unique()
    rr = rr.reindex(all_queries, fill_value=0.0)
    
    return rr.mean()
 
mrr = compute_mrr(results_df, k=10)
print(f"MRR@10: {mrr:.4f}")
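
MRR is the mean over queries of 1/rank of the first relevant result. On the sample data, q1's first relevant result is at rank 2 (RR 0.5) and q2's is at rank 1 (RR 1.0), so the MRR is 0.75. The sketch below repeats the same steps on the sample data extended with a hypothetical query `q3` that has no relevant results, to show why the `reindex` step matters:

```python
import pandas as pd

# Sample data plus a hypothetical query q3 with no relevant results
results_df = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q1', 'q2', 'q2', 'q3', 'q3'],
    'doc_id': ['d1', 'd2', 'd3', 'd1', 'd4', 'd5', 'd6'],
    'rank': [1, 2, 3, 1, 2, 1, 2],
    'relevant': [0, 1, 0, 1, 0, 0, 0],
})

# Rank of the first relevant result per query; q3 is absent here
relevant = results_df[results_df['relevant'] == 1]
first_relevant = relevant.groupby('query_id')['rank'].min()

# Reindexing over all queries restores q3 with a reciprocal rank of 0
rr = (1.0 / first_relevant).reindex(results_df['query_id'].unique(), fill_value=0.0)

print(rr.to_dict())            # {'q1': 0.5, 'q2': 1.0, 'q3': 0.0}
print(f"{rr.mean():.4f}")      # 0.5000
```

Without the reindex, q3 would be dropped and the mean would be taken over two queries instead of three, inflating the score.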

Comparing Two Systems

system_a = pd.read_csv('system_a_results.csv')
system_b = pd.read_csv('system_b_results.csv')
 
mrr_a = compute_mrr(system_a)
mrr_b = compute_mrr(system_b)
 
print(f"System A MRR: {mrr_a:.4f}")
print(f"System B MRR: {mrr_b:.4f}")
print(f"Delta: {mrr_b - mrr_a:+.4f}")
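
A single delta hides which queries moved. As a sketch, the per-query reciprocal ranks can be placed side by side; `per_query_rr` is a hypothetical helper and the two small DataFrames are made-up stand-ins for the CSV files:

```python
import pandas as pd

def per_query_rr(results_df, k=10):
    """Reciprocal rank of the first relevant result per query (0.0 if none in top k)."""
    hits = results_df[(results_df['rank'] <= k) & (results_df['relevant'] == 1)]
    rr = 1.0 / hits.groupby('query_id')['rank'].min()
    return rr.reindex(results_df['query_id'].unique(), fill_value=0.0)

# Hypothetical stand-ins for system_a_results.csv and system_b_results.csv
system_a = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q2', 'q2'],
    'rank': [1, 2, 1, 2],
    'relevant': [0, 1, 1, 0],
})
system_b = pd.DataFrame({
    'query_id': ['q1', 'q1', 'q2', 'q2'],
    'rank': [1, 2, 1, 2],
    'relevant': [1, 0, 0, 1],
})

# Align both systems on query_id and compute the per-query change
comparison = pd.DataFrame({
    'rr_a': per_query_rr(system_a),
    'rr_b': per_query_rr(system_b),
})
comparison['delta'] = comparison['rr_b'] - comparison['rr_a']

# Largest regressions first
print(comparison.sort_values('delta'))
```

In this made-up data the two systems have the same MRR, yet q1 improves and q2 regresses, which the aggregate delta alone would not reveal.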

Statistical Significance

MRR differences need significance testing before claiming “improvement”:

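from scipy.stats import wilcoxon

def compute_per_query_rr(results_df, k=10):
    # Same steps as compute_mrr, but return the per-query Series instead of its mean
    relevant = results_df[(results_df['rank'] <= k) & (results_df['relevant'] == 1)]
    rr = 1.0 / relevant.groupby('query_id')['rank'].min()
    return rr.reindex(results_df['query_id'].unique(), fill_value=0.0)

# Per-query reciprocal ranks, paired on the queries both systems were run on
rr_a = compute_per_query_rr(system_a)
rr_b = compute_per_query_rr(system_b)
rr_a, rr_b = rr_a.align(rr_b, join='inner')

# Two-sided paired test; it is undefined when every per-query difference is zero
stat, p_value = wilcoxon(rr_a, rr_b)
print(f"Wilcoxon test p-value: {p_value:.4f}")
print("Significant difference" if p_value < 0.05 else "Not significant")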

When the Pandas Approach Is Better than Quepid

  • Custom analysis combining MRR with other signals (CTR, dwell time)
  • Large-scale batch evaluation (>100K results)
  • Integration into CI/CD pipelines
  • Scripted comparison between many system variants

Quepid (UI-based) is better for interactive exploration and team collaboration.
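
The CI/CD and many-variants cases listed above can be sketched together: score every variant, then flag any that falls too far below a baseline. The variant names, the `MAX_DROP` tolerance, and the inline data are all hypothetical; a real pipeline would load each variant's results from files.

```python
import pandas as pd

def compute_mrr(results_df, k=10):
    """Mean reciprocal rank over all queries, with RR=0 when nothing relevant is in the top k."""
    hits = results_df[(results_df['rank'] <= k) & (results_df['relevant'] == 1)]
    rr = 1.0 / hits.groupby('query_id')['rank'].min()
    return rr.reindex(results_df['query_id'].unique(), fill_value=0.0).mean()

def frame(rows):
    # Build a small results DataFrame from (query_id, rank, relevant) tuples
    return pd.DataFrame(rows, columns=['query_id', 'rank', 'relevant'])

# Hypothetical system variants with made-up results
variants = {
    'baseline': frame([('q1', 1, 0), ('q1', 2, 1), ('q2', 1, 1), ('q2', 2, 0)]),
    'boost_title': frame([('q1', 1, 1), ('q1', 2, 0), ('q2', 1, 1), ('q2', 2, 0)]),
    'new_analyzer': frame([('q1', 1, 0), ('q1', 2, 0), ('q2', 1, 0), ('q2', 2, 1)]),
}

# One MRR score per variant
scores = pd.Series({name: compute_mrr(df) for name, df in variants.items()}, name='mrr')
print(scores.sort_values(ascending=False))

# Assumed tolerance: flag variants more than 0.05 MRR below the baseline
MAX_DROP = 0.05
regressions = scores[scores < scores['baseline'] - MAX_DROP]

# A CI job could pass this value to sys.exit() to fail the build
exit_code = 1 if not regressions.empty else 0
print(f"Regressions: {list(regressions.index)} (exit code {exit_code})")
```

A fixed tolerance is the simplest possible gate; on small query sets it is worth pairing with a significance test so that noise does not fail builds.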

People