Fine-Tuning an Embedding Model for Semantic Search

A practical, hands-on introduction to fine-tuning embedding models for semantic search with the Sentence Transformers library. It demonstrates fine-tuning all-MiniLM-L12-v2 on the closed QA subset of the Databricks Dolly-15k dataset using Multiple Negatives Ranking Loss, and reports a measurable improvement in question-to-context similarity on a held-out test set.


Key Concepts

Sentence Transformers

A Python library for using and training embedding models, widely used for semantic similarity and RAG pipelines. It supports text, images, and other data types, and can load models hosted on Hugging Face.

Loss Functions for Embedding Training

Loss | When to use
Multiple Negatives Ranking (MNR) Loss | Positive pairs of related text (query, answer)
Triplet Loss | Anchor + positive + negative triplets
Contrastive Loss | Positive and negative pairs
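
To make the difference between the three losses concrete, here is a minimal pure-Python sketch that works on plain lists of floats. It is not the Sentence Transformers implementation: the scale of 20, the margin of 0.5, and the use of cosine distance in the triplet and contrastive variants are illustrative assumptions, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(queries, answers, scale=20.0):
    """Multiple Negatives Ranking loss over one batch.

    queries[i] and answers[i] form a positive pair; every other answer in
    the batch serves as a negative for queries[i]. The loss is the mean
    cross-entropy of picking the matching answer. `scale` is an assumed value.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, a) for a in answers]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on cosine distance: the positive must beat the negative by `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def contrastive_loss(u, v, label, margin=0.5):
    """label=1 pulls the pair together; label=0 pushes it apart, up to `margin`."""
    d = 1.0 - cosine(u, v)
    if label == 1:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

The practical difference is in the data each loss needs: MNR loss only requires positive pairs because the rest of the batch supplies the negatives, whereas the other two need explicitly labeled negatives.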

Catastrophic Forgetting

A real risk when training with semantically similar negatives: the model can lose all ability to differentiate between texts, clustering all embeddings together. Mitigation: use clearly distinct negatives; shorter text segments tend to train better.
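
One way to watch for this kind of collapse is to embed a handful of deliberately unrelated texts before and after training and compare their average pairwise cosine similarity. The sketch below is an assumed diagnostic, not something taken from the article; the toy vectors stand in for real embeddings and the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy vectors standing in for embeddings of unrelated texts.
healthy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 0.9, 1.0], [1.0, 1.0, 0.9], [0.9, 1.0, 1.0]]

print(mean_pairwise_cosine(healthy))    # well separated: low value
print(mean_pairwise_cosine(collapsed))  # clustered together: value near 1
```

If the value for unrelated texts climbs toward 1 after fine-tuning, the embeddings are clustering together and the choice of negatives is worth revisiting.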

Practical Experiment

  • Base model: sentence-transformers/all-MiniLM-L12-v2
  • Dataset: databricks/databricks-dolly-15k — closed QA subset (~1,700 question+context pairs); 70/30 train/test split
  • Loss: MultipleNegativesRankingLoss
  • Training: 10 epochs, free Google Colab T4 GPU, ~1,200 training examples
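
The setup above could be reproduced roughly as follows. This is a sketch under assumptions rather than the article's exact code: the batch size, random seed, output path, and helper names are hypothetical, and it uses the classic `model.fit` training API of Sentence Transformers (newer releases also provide a trainer-based API). The heavy imports sit inside the functions so the split helper can be used on its own.

```python
import random

def split_pairs(pairs, train_frac=0.7, seed=42):
    """Shuffle (question, context) pairs and split them into train and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def load_closed_qa_pairs():
    """Return (question, context) pairs from the closed QA category of Dolly-15k."""
    from datasets import load_dataset  # Hugging Face `datasets` package

    rows = load_dataset("databricks/databricks-dolly-15k", split="train")
    return [
        (row["instruction"], row["context"])
        for row in rows
        if row["category"] == "closed_qa"
    ]

def fine_tune(train_pairs, epochs=10, batch_size=32,
              output_path="minilm-dolly-closed-qa"):
    """Fine-tune all-MiniLM-L12-v2 on positive pairs with MultipleNegativesRankingLoss."""
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
    # Each example is one positive pair; other pairs in the batch act as negatives.
    examples = [InputExample(texts=[q, c]) for q, c in train_pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs,
              output_path=output_path)
    return model
```

A typical run would call `split_pairs(load_closed_qa_pairs())` and pass the training portion to `fine_tune`. Because MNR loss draws its negatives from the rest of the batch, the batch size affects how hard the task is, which is one reason to treat the value above as a starting point only.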

Results

Test | Before | After
“What color is the Sky?” → “The sky is blue” cosine sim | 0.608 | 0.692
“What color is the sea?” (wrong answer) cosine sim | 0.704 | 0.634
Average cosine sim (test set, correct contexts) | 0.620 | 0.694

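On the held-out test set, fine-tuning raised the average cosine similarity between questions and their correct contexts from 0.620 to 0.694. In the spot check, similarity to the correct answer rose (0.608 to 0.692) while similarity to the wrong answer fell (0.704 to 0.634).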
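
The test-set metric can be sketched as the mean cosine similarity between each question embedding and the embedding of its own correct context. The helper names below are hypothetical, and the embeddings are assumed to be plain lists of floats produced by whichever model is being evaluated; the last lines only restate the reported averages as a relative change.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_paired_cosine(question_embs, context_embs):
    """Average cosine similarity between each question and its own correct context."""
    sims = [cosine(q, c) for q, c in zip(question_embs, context_embs)]
    return sum(sims) / len(sims)

def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    return (after - before) / before

# Reported averages over correct contexts, before and after fine-tuning.
print(f"{relative_change(0.620, 0.694):+.1%}")  # +11.9%
# Reported similarity for the wrong-answer spot check.
print(f"{relative_change(0.704, 0.634):+.1%}")  # -9.9%
```

Running the same function on embeddings from the base model and from the fine-tuned model, over the same test pairs, gives the before and after figures to compare.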

Alternatives Mentioned

  • LlamaIndex: its own embedding fine-tuning functions
  • FlagEmbeddings / BGE: FlagEmbedding library with fine-tuning support
  • API access: OpenAI ada, other hosted embedding APIs

People