ColPali

Definition

ColPali is a visual document retrieval model that applies Late Interaction (ColBERT-style MaxSim scoring) to document page images. Instead of extracting text and embedding it, ColPali encodes the entire page image into ~1000 patch-level vectors using a vision-language model (PaliGemma backbone). At query time, the MaxSim operator scores each page: every query token vector is matched to its most similar page vector, and those maxima are summed.

Paper: arxiv:2407.01449

How It Works

Page image  → PaliGemma encoder → [v₁, v₂, ..., v₁₀₀₀]  (per-patch vectors)
Text query  → encoder          → [q₁, q₂, ..., qₙ]       (per-token vectors)

Score = Σᵢ max_j (qᵢ · vⱼ)    (MaxSim / maxSimDotProduct)
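
The scoring formula above can be sketched in a few lines of NumPy. This is a minimal illustration on toy 2-dimensional vectors, not the model's or Elasticsearch's implementation; the function name maxsim_dot_product and the toy data are assumptions made for the example.

```python
import numpy as np

def maxsim_dot_product(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: for each query vector, take the best-matching
    document vector (max dot product), then sum those maxima."""
    # sim[i, j] = q_i . v_j, shape (n_query_tokens, n_doc_patches)
    sim = query_vecs @ doc_vecs.T
    return float(sim.max(axis=1).sum())

# Toy data (hypothetical): 2 query token vectors, 3 patch vectors, 2 dimensions.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

score = maxsim_dot_product(query, page)
print(score)  # 0.9 (best match for q1) + 0.8 (best match for q2)
```

Each query token picks its own best patch, so different parts of the query can match different regions of the page.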

Unlike ColBERT (text-only), ColPali operates on visual patches — useful for PDFs, slides, and scanned documents where layout and visual content matter.

Scaling Challenges

ColPali’s ~1000 vectors per page create two bottlenecks at scale:

Disk space — storing thousands of float32 vectors per document is expensive
Query computation — maxSimDotProduct over 1000 vectors per doc is slow vs. bi-encoders
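
A back-of-the-envelope calculation makes both bottlenecks concrete. The sketch below assumes 128 dimensions per vector and a 20-token query; neither number comes from this page, so treat them as illustrative assumptions and substitute your own.

```python
def storage_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector payload size, ignoring index and metadata overhead."""
    return num_vectors * dims * bits_per_dim // 8

VECTORS_PER_PAGE = 1000  # approximate per-page count from the text
DIMS = 128               # assumption: per-vector dimensionality
QUERY_TOKENS = 20        # hypothetical query length

# Disk space: multi-vector page vs. a single-vector (bi-encoder style) page.
multi_vector_page = storage_bytes(VECTORS_PER_PAGE, DIMS, 32)
single_vector_page = storage_bytes(1, DIMS, 32)
corpus_gb = multi_vector_page * 1_000_000 / 1e9  # one million pages

# Query computation: MaxSim needs every query/patch dot product per page.
dot_products_per_page = QUERY_TOKENS * VECTORS_PER_PAGE

print(multi_vector_page, single_vector_page, corpus_gb, dot_products_per_page)
```

Under these assumptions a page costs about 512 KB of float32 vectors (roughly 512 GB per million pages), and scoring one page takes 20,000 dot products where a bi-encoder needs one.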

Optimization Techniques

Bit Vectors

Compress document vectors to 1 bit per dimension (sign quantization). Scoring uses maxSimInvHamming (inverted Hamming distance), which enables fast bitwise SIMD operations at the cost of a slight accuracy loss. Asymmetric variant: keep query vectors at full precision and quantize only the document vectors, which keeps the full storage savings while losing less score quality.

Elasticsearch field: "element_type": "bit" in rank_vectors.
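
To make the symmetric bit-vector scoring concrete, here is a small NumPy sketch of sign quantization followed by a MaxSim over inverted Hamming distance. The normalization (dims - hamming) / dims is an assumption chosen for illustration; the exact formula behind Elasticsearch's maxSimInvHamming may differ, and the function names here are hypothetical.

```python
import numpy as np

def sign_quantize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: 1 where the component is positive, else 0.
    Bits are packed 8 per byte, so 8 dims become a single uint8."""
    return np.packbits(vecs > 0, axis=1)

def maxsim_inv_hamming(query_bits: np.ndarray, doc_bits: np.ndarray, dims: int) -> float:
    """Sum over query vectors of the best (highest) inverted Hamming similarity."""
    # XOR marks differing bits for every (query vector, doc vector) pair.
    xor = query_bits[:, None, :] ^ doc_bits[None, :, :]
    hamming = np.unpackbits(xor, axis=2).sum(axis=2)
    inv = (dims - hamming) / dims  # assumed normalization: 1.0 means identical
    return float(inv.max(axis=1).sum())

DIMS = 8  # toy dimensionality
query = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
                  [-1, -1, -1, -1, 1, 1, 1, 1]], dtype=np.float32)
page = np.array([[1, 1, 1, -1, -1, -1, -1, -1],
                 [-1, -1, -1, -1, 1, 1, 1, 1]], dtype=np.float32)

query_bits = sign_quantize(query)
page_bits = sign_quantize(page)
score = maxsim_inv_hamming(query_bits, page_bits, DIMS)
print(page_bits.nbytes, page.nbytes, score)
```

In this toy case the packed page takes 2 bytes instead of 64, which is the 32x reduction over float32 that 1-bit quantization gives.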

Average Vectors

Compress all page vectors into a single normalized average vector. Index it in HNSW (dense_vector field) for fast candidate retrieval. Combine with BBQ (Better Binary Quantization) to reduce RAM usage. Accuracy drops compared to full MaxSim, but the loss is recoverable via two-stage retrieval.
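
The averaging step itself is tiny. A minimal NumPy sketch, assuming L2 normalization is what "normalized" means here (the helper name average_vector is hypothetical):

```python
import numpy as np

def average_vector(page_vecs: np.ndarray) -> np.ndarray:
    """Collapse (n_patches, dims) into one unit-length vector of shape (dims,)."""
    mean = page_vecs.mean(axis=0)
    norm = np.linalg.norm(mean)
    # Guard against an all-zero mean; otherwise scale to unit length.
    return mean / norm if norm > 0 else mean

# Toy page with two orthogonal patch vectors.
page = np.array([[1.0, 0.0], [0.0, 1.0]])
avg = average_vector(page)
print(avg)
```

The result is a single vector per page, which is what allows indexing in an ordinary single-vector kNN structure.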

Token Pooling

See Token Pooling. Groups semantically similar patch vectors via clustering, replacing each cluster with its mean. Pool factor 3 retains 97.8% of performance while reducing vector count by 66.7%.
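
As a rough illustration of the idea, the sketch below merges the two most similar clusters (by cosine similarity of their centroids) until the vector count has shrunk by the pool factor, then emits each cluster's mean. This greedy merge is a simplified stand-in written for this page, not the clustering procedure used in published token pooling implementations, and pool_tokens is a hypothetical name.

```python
import numpy as np

def pool_tokens(vecs: np.ndarray, pool_factor: int) -> np.ndarray:
    """Reduce (n, dims) vectors to about n / pool_factor cluster means."""
    target = max(len(vecs) // pool_factor, 1)
    clusters = [[i] for i in range(len(vecs))]
    while len(clusters) > target:
        # Unit-length centroids, so the dot product is cosine similarity.
        cents = np.stack([vecs[c].mean(axis=0) for c in clusters])
        cents = cents / np.maximum(np.linalg.norm(cents, axis=1, keepdims=True), 1e-12)
        sim = cents @ cents.T
        np.fill_diagonal(sim, -np.inf)  # never merge a cluster with itself
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        a, b = int(min(a, b)), int(max(a, b))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Replace each cluster with the mean of its member vectors.
    return np.stack([vecs[c].mean(axis=0) for c in clusters])

# Toy page: three patches pointing near [1, 0], three near [0, 1].
patches = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
                    [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]])
pooled = pool_tokens(patches, pool_factor=3)
print(pooled.shape)
```

With pool factor 3, six vectors become two, matching the 66.7% reduction in vector count quoted above.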

Two-Stage Retrieval Pattern

Retrieve — use average vector + HNSW kNN over millions of documents
Rerank — apply full maxSimDotProduct on top-k candidates

This makes ColPali viable for large-scale production workloads in Elasticsearch using the rescorer retriever (introduced in 8.18).
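
The two stages can be sketched in memory with NumPy. This is an illustration of the pattern only: a brute-force cosine search stands in for HNSW kNN, averaging the query vectors for the first stage is an assumption, and none of the names below are Elasticsearch APIs.

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum over query vectors of the max dot product against the page vectors."""
    return float((query_vecs @ doc_vecs.T).max(axis=1).sum())

def unit_mean(vecs: np.ndarray) -> np.ndarray:
    """Average the vectors and scale the result to unit length."""
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def two_stage_search(query_vecs, pages, k):
    """pages: list of (n_patches, dims) arrays. Returns page indices, best first."""
    # Stage 1 (retrieve): one averaged vector per page; brute force replaces HNSW here.
    page_avgs = np.stack([unit_mean(p) for p in pages])
    query_avg = unit_mean(query_vecs)  # assumption: the query is averaged too
    candidates = np.argsort(-(page_avgs @ query_avg))[:k]
    # Stage 2 (rerank): full MaxSim, computed only for the k candidates.
    reranked = sorted(candidates, key=lambda i: maxsim(query_vecs, pages[i]), reverse=True)
    return [int(i) for i in reranked]

query = np.array([[1.0, 0.0], [0.0, 1.0]])
pages = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # matches both query tokens exactly
    np.array([[0.7, 0.7], [0.7, 0.7]]),  # same average direction, weaker per-token match
    np.array([[1.0, 0.0], [1.0, 0.0]]),  # matches only the first query token
]
result = two_stage_search(query, pages, k=2)
print(result)
```

Pages 0 and 1 look identical to the averaged first stage; only the MaxSim rerank separates them, which is why the second stage recovers accuracy lost by averaging.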

ColPali vs. ColBERT

Aspect            ColPali                                  ColBERT
Input modality    Images (document pages)                  Text
Vectors per doc   ~1000 (visual patches)                   ~100 (tokens)
Use case          Visual document search (PDFs, slides)    Text retrieval
Storage pressure  Higher                                   Lower

Related Concepts

Late Interaction — architecture ColPali is built on
ColBERT — text-domain sibling
Token Pooling — compression for ColPali’s multi-vectors
BBQ — quantization to reduce RAM for HNSW index of average vectors
HNSW — used for average-vector first-stage retrieval
Multimodal Embeddings — broader context
Reranking — ColPali used primarily as a reranker

Articles

Late Interaction Models - How to Scale and Optimize in Elasticsearch
