Embeddings

A numerical representation of an object (text, image, product, user) as a fixed-length vector of real numbers, where geometric proximity corresponds to semantic similarity. The core primitive of modern semantic search.

"running shoes"  →  [0.21, -0.84, 0.03, 0.67, ...]   (768 dimensions)
"jogging sneakers" →  [0.22, -0.81, 0.05, 0.65, ...]  ← close in space
"tax return"       →  [-0.45, 0.12, -0.78, 0.03, ...] ← far in space

The Core Insight

Meaning can be encoded in direction and distance: two vectors that point in similar directions encode similar meanings. This enables:

  • Similarity search — find items closest in vector space to a query
  • Clustering — group items by meaning
  • Algebra — king − man + woman ≈ queen (Word2Vec era)
  • Transfer — a model trained on general text produces useful representations for specialized domains
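
As a rough illustration of similarity search, the sketch below ranks documents by cosine similarity to a query. It assumes toy 4-dimensional vectors built only from the components displayed in the example at the top of this note (real vectors have hundreds of dimensions), and the helper names `cosine_similarity` and `top_k` are hypothetical. A linear scan like this is only practical for small collections.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=1):
    """Return the k (score, doc_id) pairs most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Assumption: truncated toy vectors, not real model output.
corpus = {
    "jogging sneakers": [0.22, -0.81, 0.05, 0.65],
    "tax return": [-0.45, 0.12, -0.78, 0.03],
}
query = [0.21, -0.84, 0.03, 0.67]  # "running shoes"

results = top_k(query, corpus, k=1)
print(results)
```

With these toy values, "jogging sneakers" ranks first and "tax return" gets a negative score, matching the "close" and "far" labels in the example.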

Brief History

Era        Model                     Representation
2013–2014  Word2Vec, GloVe           Word-level; static
2016       FastText                  Subword-aware
2018       ELMo                      Contextual word embeddings
2018       BERT                      Contextual; CLS token as sentence representation
2019+      Sentence-BERT, E5, BGE    Sentence/passage bi-encoders
2024+      Qwen3, NV-Embed           Instruction-tuned; multi-task; MRL

Dense vs Sparse

The two main families differ in representation structure:

Property        Dense Embeddings                  Sparse Embeddings
Dimensionality  256–3072 (all non-zero)           30k–100k (mostly zero)
Space           Semantic latent space             Vocabulary space
Strengths       Semantic similarity, paraphrase   Exact match, rare terms
Index type      ANN (HNSW, IVF)                   Inverted index
Examples        E5, BGE, OpenAI                   BM25, SPLADE, ELSER

In practice, the best results usually come from Hybrid Search, which combines both.
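
One common way to combine a dense and a sparse result list is Reciprocal Rank Fusion (RRF). This note does not say which fusion method it has in mind, so the sketch below is only one possible instance. The document ids and both hit lists are made up, and `k=60` is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; larger k flattens rank differences."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d2", "d1", "d4"]   # hypothetical ANN results
sparse_hits = ["d1", "d3", "d2"]  # hypothetical BM25 results

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF uses only ranks, so it avoids having to put cosine scores and BM25 scores on a common scale.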

How Embeddings Are Trained

Contrastive learning (most common for retrieval):

  • Positive pairs: (query, relevant document)
  • Negative pairs: (query, irrelevant document)
  • Loss: pull positives together, push negatives apart (InfoNCE / NT-Xent)
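
The loss above can be sketched in a few lines. This is a minimal InfoNCE with in-batch negatives, assuming unit-normalized vectors, so the dot product equals cosine similarity. Treating the other documents in the batch as negatives is an assumption. The function name `info_nce` and the temperature value are illustrative.

```python
import math

def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss. doc_vecs[i] is the positive for query_vecs[i];
    every other document in the batch acts as a negative."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, d) / temperature for d in doc_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax of the positive
    return total / len(query_vecs)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])  # positives match
swapped = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])  # positives mismatched
print(aligned, swapped)
```

The loss is near zero when each query is closest to its own positive, and large when a negative scores higher than the positive.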

Masked language modeling (BERT pretraining):

  • Predict masked tokens — forces contextual understanding

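Knowledge distillation:

  • Train a small student model to reproduce a larger teacher's embeddings or similarity scores

Fine-tuning:

  • Continue training a pretrained model on in-domain pairs to adapt it to a specialized corpus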

Dimensionality

Dims      Models                           Notes
256–384   MiniLM, all-MiniLM               Fast; light memory
768       BERT-base, E5-base               Standard; good quality
1024      E5-large, BGE-large              Higher quality; slower
1536      OpenAI text-embedding-3-small    API-served
3072      OpenAI text-embedding-3-large    Highest quality API

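Matryoshka Embeddings (MRL, Matryoshka Representation Learning) are trained so that a prefix of the vector is itself a usable embedding. Vectors can therefore be truncated to fewer dimensions after generation, and the same model supports multiple dimensionalities.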
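
A minimal sketch of the truncation step, assuming the vector comes from a model trained with a Matryoshka objective (truncating an ordinary embedding this way usually costs much more quality). The helper name `truncate_embedding` and the toy vector are illustrative. Queries and documents must be truncated to the same length.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]          # toy 4-dimensional unit vector
small = truncate_embedding(full, 2)  # 2-dimensional version
print(small)
```

Re-normalizing after truncation keeps cosine and dot-product scores comparable across vectors.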

Embedding Quality vs Cost

At scale, the economics of embedding inference matter:

  • Embedding generation is compute-bound (not memory-bound)
  • RTX 4090 offers better FLOPS/$ than H100 for inference
  • ~$0.01/1M tokens achievable with commodity hardware
  • See Why Are Embeddings So Cheap
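
The arithmetic behind a per-token cost figure can be sketched as follows. The GPU price and throughput are placeholder inputs, not measurements, chosen only to show how a number near $0.01 per 1M tokens could arise. The function name is hypothetical, and the calculation assumes the GPU is fully utilized.

```python
def cost_per_million_tokens(gpu_price_per_hour, tokens_per_second):
    """Dollar cost to embed 1M tokens on one fully utilized GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000

# Placeholder inputs (assumptions): $0.40 per GPU-hour, 11,000 tokens/s.
example = cost_per_million_tokens(gpu_price_per_hour=0.40,
                                  tokens_per_second=11_000)
print(f"${example:.4f} per 1M tokens")
```

Idle time raises the real cost: at 50% utilization the figure doubles.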

Compression via Quantization

Full-precision (float32) embeddings are memory-hungry. Vector Quantization compresses them:

  • SQ8 (int8): 4× smaller, near-lossless
  • Binary / BBQ: 32× smaller, fast Hamming distance
  • Product Quantization: 32–64× smaller
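
The ratios follow from storage per dimension: float32 uses 4 bytes, int8 uses 1 byte (4×), and a single bit is 1/32 of a float32 (32×). The sketch below shows a simple symmetric int8 scheme and plain sign-based binarization. Both are illustrative assumptions: production SQ8 often derives its range from the whole dataset rather than per vector, and this is not Elasticsearch's BBQ algorithm.

```python
def quantize_int8(vec):
    """Symmetric scalar quantization to integers in [-127, 127]."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

def binarize(vec):
    """Pack sign bits into one integer: bit i is 1 if vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

v1 = [0.21, -0.84, 0.03, 0.67]   # toy vectors, not real model output
v2 = [-0.45, 0.12, -0.78, 0.03]

codes, scale = quantize_int8(v1)
restored = dequantize_int8(codes, scale)
distance = hamming(binarize(v1), binarize(v2))
print(codes, distance)
```

The int8 round trip is off by at most half a quantization step per component. The binary codes keep only signs, which is why binary search results are typically rescored with higher-precision vectors.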

See BBQ for Elasticsearch’s implementation.
