Announcing the Vespa ColBERT embedder

ColBERT maintains per-token vector representations. Each query token interacts with all document tokens via contextualized embeddings (MaxSim). Unlike single-vector models, it approaches cross-encoder performance in zero-shot settings with better training efficiency and explainability.

Storage Optimization

A novel asymmetric binarization compression reduces storage by 32x. Query vectors keep full precision; document token vectors use binary representation. At int8 precision, 128-dimensional floats compress to 16 bytes.

Ranking Performance

MaxSim requires ~2×M×N×K floating-point ops. Re-ranking 100 documents takes ~23ms on a single CPU thread.

BEIR Benchmark Results (compressed vs uncompressed)

DatasetCompressed nDCG@10Uncompressed nDCG@10
trec-covid0.80030.7939
nfcorpus0.33230.3434
fiqa0.38850.3919

Long Context

ColBERT handles extended documents via text chunking — supports array inputs mapped to multiple tensor dimensions for per-paragraph representations.

Usage

Configured in services.xml, integrates with Vespa’s ranking framework alongside BM25, cross-encoders, and other signals. Supports hybrid search and structured field embeddings.

People