Bi-Encoder

Definition

A bi-encoder encodes queries and documents independently, using either two separate encoders or a single shared one, and produces one dense vector embedding for each input. Relevance is computed as the similarity (cosine or dot product) between the query embedding and pre-computed document embeddings.
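As a small illustration of the two similarity functions named above, the sketch below uses plain Python lists in place of real embeddings. The vectors and helper names are made up for the example. It shows that the dot product is sensitive to vector length while cosine is not, and that the two agree once vectors are normalized to unit length.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

# Illustrative vectors (not real embeddings): d_vec points the same way
# as q_vec but is twice as long.
q_vec = [1.0, 2.0, 2.0]
d_vec = [2.0, 4.0, 4.0]

print(dot(q_vec, d_vec))     # 18.0 (grows with vector length)
print(cosine(q_vec, d_vec))  # 1.0 (direction only)

# On unit-length vectors the dot product matches the cosine similarity.
print(dot(normalize(q_vec), normalize(d_vec)))
```

In practice this is why many pipelines normalize embeddings once at indexing time and then use the cheaper dot product at search time.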

How It Works

Query  → Encoder_Q → q_vec ──┐
                               → similarity(q_vec, d_vec) → score
Doc    → Encoder_D → d_vec ──┘

Documents are encoded offline (pre-computed and stored). At search time, only the query is encoded and compared against the index.
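The offline/online split can be sketched as follows. This is a minimal illustration, not a real retrieval system: `toy_encode` is a hypothetical stand-in for a trained encoder (a hashed bag-of-words), the documents are invented, and the brute-force matrix product stands in for an ANN index.

```python
import zlib
import numpy as np

DIM = 256  # embedding size of the toy encoder (arbitrary choice)

def toy_encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for a trained encoder: hashed bag-of-words,
    L2-normalized so that a dot product equals cosine similarity."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Offline: encode every document once and store the vectors.
docs = [
    "bi-encoders embed queries and documents separately",
    "cross-encoders score a query and document jointly",
    "recipes for sourdough bread",
]
doc_index = np.stack([toy_encode(d) for d in docs])  # shape (n_docs, DIM)

# Online: encode only the query, then compare it against the stored index.
def search(query: str, top_k: int = 2):
    q_vec = toy_encode(query)
    scores = doc_index @ q_vec            # one similarity score per document
    order = np.argsort(-scores)[:top_k]   # indices of the highest scores
    return [(docs[i], float(scores[i])) for i in order]

results = search("sourdough bread recipes")
for text, score in results:
    print(f"{score:.3f}  {text}")
```

No document is re-encoded at query time; the per-query cost is one encoder call plus a similarity lookup, which is what makes the approach scale to large collections.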

Key Properties

Property            Value
Query encoding      Online (at search time)
Document encoding   Offline (pre-computed)
Interaction         None; similarity is computed after encoding
Speed               Very fast (dot product / cosine against ANN index)
Quality             Good for general tasks; limited for nuanced relevance

Training

Bi-encoders are typically trained with a contrastive loss (e.g., MultipleNegativesRankingLoss from the sentence-transformers library):

  • Maximize similarity of relevant (query, document) pairs
  • Minimize similarity of irrelevant pairs
  • Positive pairs from click data, relevance judgments, or NLI datasets
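The objective above can be sketched with in-batch negatives: for each query in a batch, its paired document is the positive and every other document in the batch is treated as a negative. The NumPy function below is a simplified illustration of that idea and not the library's implementation; the function name, the `scale` value, and the random data are assumptions made for the example.

```python
import numpy as np

def in_batch_contrastive_loss(q: np.ndarray, d: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over a (batch, batch) similarity matrix.

    q[i] and d[i] form a positive pair; d[j] with j != i serves as a negative.
    `scale` is a temperature-style factor (illustrative value).
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                   # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (the paired document).
    return float(-np.mean(np.diag(log_probs)))

# Synthetic embeddings: each query is a slightly perturbed copy of its document.
rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(4, 8))
aligned_queries = doc_emb + 0.01 * rng.normal(size=(4, 8))
shuffled_queries = aligned_queries[[1, 2, 3, 0]]  # every query paired with the wrong document

loss_aligned = in_batch_contrastive_loss(aligned_queries, doc_emb)
loss_shuffled = in_batch_contrastive_loss(shuffled_queries, doc_emb)
print(loss_aligned, loss_shuffled)
```

The loss is low when each query is most similar to its own document and high when the pairing is wrong, so minimizing it pulls relevant pairs together and pushes irrelevant pairs apart.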

When to Use

Best for:

  • Large-scale first-stage retrieval (millions/billions of docs)
  • Any task where pre-computing document embeddings is feasible
  • Real-time applications requiring low latency

Less suited for:

  • Nuanced relevance where query-document interaction matters
  • Small candidate sets where Cross-Encoder reranking is feasible

Related

  • Cross-Encoder: processes query and document jointly; slower but more accurate
  • ColBERT: uses per-token vectors with late interaction; balances speed and quality
  • Late Interaction: generalization of ColBERT's token-level interaction idea

Normalization Note

Also called: dual encoder, two-tower model, or Siamese network (when the same encoder is used for both queries and documents). All of these names refer to the same architecture.