Collocations

Definition

Collocations are word combinations that appear together more frequently than chance and function as a semantic unit: “machine learning”, “credit card”, “New York”, “hot dog”. Treating them as single tokens rather than independent words improves both indexing and retrieval accuracy.

Without phrase awareness:

  • “New York” is tokenized as [“New”, “York”] — “New” matches unrelated documents
  • “hot dog” could match pages about warm canines
  • Query “machine learning jobs” could match documents mentioning machines and learning separately

Collocations are especially important for:

  • Named entities: “San Francisco”, “Apple Inc.”
  • Technical terms: “support vector machine”, “random forest”
  • Product names: compound product names that lose meaning when split

Detection Methods

Statistical Measures

Identify word pairs (bigrams) that co-occur significantly more than expected:

  • PMI (Pointwise Mutual Information): log P(w1,w2) / (P(w1) × P(w2)) — high PMI = strong collocation
  • Chi-squared test: statistical significance of co-occurrence
  • Log-likelihood ratio: robust to low-frequency pairs

Frequency Thresholds

Simple but effective: bigrams appearing > N times in corpus are candidates. Often combined with stop-word filtering to exclude function word pairs.

Dependency Parsing

NLP approach — extract noun phrases and named entities from a corpus using a parser. Higher quality, higher compute.

Collocations in Search Systems

Indexing

  • Detect phrases in documents at index time and add phrase tokens alongside word tokens
  • Shingling: index overlapping n-grams for approximate phrase matching

Query Processing

  • Detect phrases in user queries and treat them as units in Query Understanding
  • Phrase queries in Elasticsearch: {"match_phrase": {"text": "machine learning"}} requires words adjacent and in order

Synonym Integration

Collocations must be handled carefully in Synonyms expansion — “hot dog” → “frankfurter” is valid, but splitting and expanding “hot” and “dog” separately produces garbage.

Relation to Stopwords

Stopwords inside collocations should not be removed — “right to repair”, “to be or not to be” lose meaning without function words.

  • Query Understanding — phrase detection is a preprocessing step
  • BM25 — phrase handling in lexical retrieval
  • Synonyms — collocations need special handling in synonym expansion
  • Stopwords — stopwords inside phrases must be preserved
  • Query Segmentation — related task of segmenting queries into meaningful units
  • Keywords Extraction — related task of identifying important phrases in documents