Maximal Marginal Relevance for Keyphrase Extraction

The problem

Keyphrase extractors (TextRank, RAKE, POS-tagging-based methods) often produce redundant results. “Good Product,” “Great Product,” “Nice Product,” and “Excellent Product” can all rank highly while conveying the same information, which wastes limited display space.

Solution 1: Cosine similarity filtering

Remove any phrase whose cosine similarity to an already-kept phrase exceeds a threshold (e.g., 0.9). The threshold has to be tuned by hand, and similar phrases that fall just below the cutoff still get through.
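A minimal sketch of threshold filtering, assuming each phrase already has an embedding vector and the list is sorted by rank. The function names and the toy 2-D vectors are hypothetical, chosen only to illustrate the idea:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_by_threshold(phrases, vectors, threshold=0.9):
    """Walk the ranked list and keep a phrase only if it is not too
    similar to any phrase that has already been kept."""
    kept = []  # indices of kept phrases
    for i in range(len(phrases)):
        if all(cosine(vectors[i], vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return [phrases[i] for i in kept]

# Toy data (made up for illustration), already sorted by rank.
phrases = ["good product", "great product", "fast delivery", "nice product"]
vectors = [(1.0, 0.0), (0.98, 0.2), (0.0, 1.0), (0.8, 0.6)]

kept = filter_by_threshold(phrases, vectors, threshold=0.9)
print(kept)  # ['good product', 'fast delivery', 'nice product']
```

In this toy data, "great product" is dropped (similarity to "good product" is about 0.98), but "nice product" survives at 0.8, which shows how a related phrase below the cutoff slips through.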

Solution 2: MMR re-ranking

MMR score(phrase) = λ × Sim(phrase, document) − (1 − λ) × max over selected s of Sim(phrase, s)

  • λ = 0.5: equal weight on relevance and diversity
  • λ → 1: prioritize relevance
  • λ → 0: prioritize diversity
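To see how λ shifts the trade-off, the score can be evaluated for two hypothetical candidates: A is highly relevant but nearly duplicates an already selected phrase, while B is less relevant but novel. The numbers below are invented for illustration only:

```python
def mmr_score(relevance, max_redundancy, lam):
    """MMR score: lam * relevance - (1 - lam) * max similarity to selected phrases."""
    return lam * relevance - (1 - lam) * max_redundancy

# Hypothetical similarity values, not taken from any real dataset.
candidate_a = {"relevance": 0.9, "max_redundancy": 0.95}  # relevant, redundant
candidate_b = {"relevance": 0.6, "max_redundancy": 0.10}  # less relevant, novel

def preferred(lam):
    """Return which candidate scores higher at the given lambda."""
    score_a = mmr_score(lam=lam, **candidate_a)
    score_b = mmr_score(lam=lam, **candidate_b)
    return "A" if score_a > score_b else "B"

for lam in (0.9, 0.5, 0.1):
    print(lam, preferred(lam))
```

With these values, A wins at λ = 0.9 (relevance dominates), while B wins at λ = 0.5 and λ = 0.1, where the redundancy penalty carries more weight.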

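The algorithm selects keyphrases one at a time, based on both relevance (here, similarity to the document, which plays the role of the query) and novelty: “the degree of dissimilarity between the document being considered and previously selected ones.” In keyphrase extraction, the items being compared are candidate phrases rather than documents.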
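The selection procedure can be sketched as a greedy loop. This is an illustrative sketch, not a reference implementation: it assumes the document and every candidate phrase already have embedding vectors, and the names `mmr_select`, `doc_vec`, and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(doc_vec, phrases, vectors, top_n=3, lam=0.5):
    """Greedily pick up to top_n phrases, each time taking the candidate
    with the highest MMR score against the phrases picked so far."""
    relevance = [cosine(v, doc_vec) for v in vectors]
    selected = []                      # indices, in selection order
    remaining = list(range(len(phrases)))

    def score(i):
        # Redundancy is the highest similarity to any selected phrase
        # (0.0 when nothing has been selected yet).
        redundancy = max(
            (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
        )
        return lam * relevance[i] - (1 - lam) * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [phrases[i] for i in selected]

# Toy 2-D vectors, made up for illustration.
doc_vec = (1.0, 0.4)
phrases = ["good product", "great product", "fast delivery"]
vectors = [(1.0, 0.0), (0.98, 0.2), (0.0, 1.0)]

print(mmr_select(doc_vec, phrases, vectors, top_n=2, lam=0.5))
# ['great product', 'fast delivery']
print(mmr_select(doc_vec, phrases, vectors, top_n=2, lam=1.0))
# ['great product', 'good product']
```

With λ = 1.0 the redundancy term vanishes and the result is a plain relevance ranking, so the two near-duplicate phrases take both slots. At λ = 0.5 the second slot goes to the less relevant but dissimilar phrase.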

Result

The top N keyphrases provide meaningful variety. Similar phrases are ranked far apart, which reduces the clustering problem where redundant terms dominate the results.
