Hybrid Search: SPLADE Sparse Encoder

Source: https://medium.com/@sowmiyajaganathan/hybrid-search-splade-sparse-encoder-neural-retrieval-models-d092e5f46913
Author: Sowmiya Jaganathan

Summary

Explains how SPLADE works as a sparse encoder for hybrid search, contrasting it with BM25 and dense bi-encoders, with practical implementation guidance for combining SPLADE with dense retrieval.

Traditional hybrid search pairs BM25 (sparse) with a bi-encoder (dense). SPLADE can replace BM25 as a learned sparse component:

Aspect                  BM25            SPLADE          Dense Bi-Encoder
Term expansion          No              Yes (learned)   N/A
Semantic understanding  No              Partial         Yes
Storage                 Inverted index  Inverted index  Vector index
Speed                   Very fast       Fast            Medium
Interpretability        High            Medium          Low

SPLADE bridges BM25 and bi-encoders: it uses an inverted index like BM25 but learns semantic term weights like a neural model.
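
To make the inverted-index point concrete, here is a minimal sketch of how learned term weights could be stored and scored. The function names (build_inverted_index, search) and all term weights are hypothetical, chosen for illustration; a production system would use a search engine's own index rather than a Python dict.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vec, top_k=10):
    """Score documents by the dot product of query and document term weights."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        # Only documents that contain the term are touched
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

# Hypothetical term weights, made up for illustration only
docs = {
    "doc_a": {"splade": 2.0, "sparse": 1.5, "retrieval": 1.0},
    "doc_b": {"dense": 1.8, "embedding": 1.2, "retrieval": 0.9},
}
index = build_inverted_index(docs)
results = search(index, {"sparse": 1.0, "retrieval": 0.5})
```

The lookup structure is the same as for BM25. The difference is that the weights come from a trained model instead of term-frequency statistics.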

SPLADE in a Hybrid Pipeline

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
 
def splade_encode(text, model, tokenizer, max_length=512):
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    
    with torch.no_grad():
        output = model(**tokens)
    
    # MLM logits → log(1 + ReLU(x)) → max-pool over the sequence
    logits = output.logits  # shape: [1, seq_len, vocab_size]
    activated = torch.log1p(torch.relu(logits))
    # Zero out padded positions so they cannot contribute to the max
    activated = activated * tokens["attention_mask"].unsqueeze(-1)
    sparse_vec = torch.max(activated, dim=1).values.squeeze(0)  # [vocab_size]

    # Convert to a sparse dict of {token string: weight}
    non_zero = sparse_vec.nonzero().squeeze(-1).tolist()
    return {tokenizer.decode([tok_id]): sparse_vec[tok_id].item()
            for tok_id in non_zero}
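
The pooling step can be checked by hand on a toy input. The sketch below applies the same log(1 + ReLU(x)) activation and max over the sequence to a made-up 3-token, 4-term logits matrix in plain Python, with no model required. The helper name splade_pool and the numbers are assumptions for illustration.

```python
import math

def splade_pool(logits):
    """logits: list of per-token rows, each a list of vocab-sized scores."""
    vocab_size = len(logits[0])
    pooled = []
    for j in range(vocab_size):
        # log(1 + ReLU(x)) for term j at every position, then max over positions
        pooled.append(max(math.log1p(max(row[j], 0.0)) for row in logits))
    return pooled

# Toy logits: 3 token positions x 4 vocabulary terms (made-up values)
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [-3.0, -0.5, 1.0, 0.0],
    [0.5, -2.0, 0.0, 3.0],
]
weights = splade_pool(toy_logits)

# Keep only the non-zero entries, as the sparse dict above does
sparse = {j: w for j, w in enumerate(weights) if w > 0}
```

Term 1 has only negative logits, so ReLU clamps it to zero at every position and it drops out of the sparse vector. This is where the sparsity comes from.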

Hybrid Fusion with RRF

def reciprocal_rank_fusion(splade_results, dense_results, k=60):
    fused_scores = {}
    
    for rank, (doc_id, _) in enumerate(splade_results):
        fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1/(k + rank + 1)
    
    for rank, (doc_id, _) in enumerate(dense_results):
        fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1/(k + rank + 1)
    
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
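
A small worked example may help show how the ranks combine. The sketch below is a variant of the function above that takes plain lists of document ids (best first) and accepts any number of rankings. The name rrf_fuse and the document ids are hypothetical.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse any number of ranked lists of doc ids (best first) with RRF."""
    fused = {}
    for ranking in ranked_lists:
        # Ranks are 1-based, so the top document contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Hypothetical rankings from the two retrievers
splade_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = rrf_fuse([splade_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here d1 (ranks 1 and 2) edges out d3 (ranks 3 and 1), and both beat d2 and d4, which appear in only one list. RRF uses ranks only, so the raw SPLADE and dense scores never need to be put on a common scale.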

When SPLADE Hybrid Outperforms Pure Dense

SPLADE hybrid is particularly effective when:

  • Queries contain rare domain terms (product codes, technical jargon)
  • Exact term matching matters (SPLADE is a bag-of-words model, so it matches terms rather than ordered phrases)
  • Documents have specific named entities (people, companies, places)
  • Query vocabulary doesn’t overlap with how documents are written (vocabulary mismatch)

Comparison with Standard BM25+Dense Hybrid

The article reports that SPLADE+Dense typically outperforms BM25+Dense:

  • By roughly 5–10% NDCG on MS MARCO
  • By larger margins on domain-specific corpora with specialized vocabulary

The improvement comes from SPLADE’s term expansion: the sparse component now matches semantically related terms, reducing vocabulary mismatch.
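
The effect of term expansion on vocabulary mismatch can be illustrated with a toy comparison. All terms and weights below are invented for the example and are not output from a real SPLADE model. The helper sparse_score is a plain dot product over shared terms.

```python
def sparse_score(query_vec, doc_vec):
    """Dot product over the terms the query and document share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Document written with "vehicle" and "repair" (hypothetical weights)
doc_terms = {"vehicle": 1.4, "repair": 1.1, "manual": 0.8}

# Literal query terms only, as a purely lexical matcher would see them
literal_query = {"car": 1.0, "fix": 1.0}

# The same query after a hypothetical learned expansion adds related terms
expanded_query = {"car": 1.6, "fix": 1.3, "vehicle": 0.9, "repair": 0.7}

literal = sparse_score(literal_query, doc_terms)
expanded = sparse_score(expanded_query, doc_terms)
```

The literal query shares no terms with the document and scores zero, while the expanded query matches on "vehicle" and "repair".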
