Multimodal Embeddings

Definition

Multimodal embeddings project multiple data modalities (text, images, audio, video) into a shared vector space, enabling cross-modal retrieval: search for images using text queries, or find text documents using image queries.

CLIP Architecture

The dominant approach is CLIP (Contrastive Language-Image Pre-training, OpenAI):

Image → Vision Encoder (ViT) → image embedding ─┐
                                                  ├→ cosine similarity
Text  → Text Encoder (Transformer) → text emb  ──┘

Trained on 400M (image, caption) pairs from the web using contrastive loss — push matching pairs together, pull non-matching apart.

Domain-Specific Fine-tuning

General CLIP models may fail on specialized visual domains:

  • Medical imaging (CT scans, pathology slides)
  • Satellite imagery
  • Fashion/product images with fine-grained attributes
  • Technical diagrams

Shaw Talebi’s work demonstrates fine-tuning CLIP for domain-specific multimodal search:

  1. Collect domain (image, text description) pairs
  2. Fine-tune with contrastive loss
  3. Evaluate on domain-specific retrieval benchmarks

Fine-tuning with sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss
 
model = SentenceTransformer("clip-ViT-B-32")
train_examples = [InputExample(texts=["a dog", image_path])]
train_loss = MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)])

Applications

Use CaseQueryRetrieved
E-commerce visual searchImage of shoeSimilar shoes
Medical report retrievalText symptomsRelated X-rays
Stock photo searchText descriptionMatching images
Document understandingImage of chartText explanations
Fashion similarityOutfit photoSimilar products

Matryoshka + Multimodal

Matryoshka Embeddings (MRL) can be applied to multimodal models, enabling:

  • Fast first-pass search with truncated (smaller) embeddings
  • Re-ranking with full-dimension embeddings
  • Storage savings for large image+text databases

Relation to Text-Only Embeddings

Shares training principles with text Bi-Encoder:

Differs in:

  • Heterogeneous encoders (vision vs. language)
  • Cross-modal alignment challenge
  • Domain gap often more severe

People

  • Shaw Talebi — CLIP fine-tuning for domain-specific multimodal search