PEFT — Parameter-Efficient Fine-Tuning

Umbrella term for fine-tuning techniques that update only a small fraction of model parameters while keeping most weights frozen. The HuggingFace peft library is the standard implementation. Enables domain adaptation of large models without full fine-tuning cost.

Why PEFT

Full fine-tuning a 7B model requires storing optimizer states for 7B parameters (~84 GB for Adam). PEFT methods update 0.1–1% of parameters, making fine-tuning practical on limited hardware and enabling one base model to serve many tasks via swappable adapters.

Main Methods

MethodHow it worksParams updatedBest for
LoRALow-rank matrices injected alongside frozen weights~0.1–1%LLMs, embedding models — dominant method
QLoRALoRA + 4-bit base quantization~0.1–1%Consumer GPU fine-tuning
Prefix TuningLearnable tokens prepended to every layer’s key/value~0.1%Sequence generation
Prompt TuningLearnable tokens at input only<0.01%Large models (≥11B); less effective on small models
IA³Scale activations with learned vectors~0.01%Few-shot; minimal params
Adapter layersSmall bottleneck MLP inserted between transformer layers~1–3%Original PEFT approach; slower than LoRA

LoRA dominates in practice — best quality/efficiency trade-off across LLMs and embedding models.

In the Search Pipeline

StagePEFT techniquePurpose
Embedding modelLoRA (contrastive)Domain adaptation of bi-encoder
Query understandingQLoRAFine-tune LLM on domain queries
RerankingLoRA on cross-encoderDomain-specific relevance scoring
Answer synthesis (RAG)QLoRADomain vocabulary, format control
Judgment generationQLoRAConsistent relevance labels

Relationship to Domain Adaptation

PEFT is the primary mechanism for domain adaptation of neural search components. Before PEFT, domain adaptation required either:

  • Full fine-tuning (expensive, needs large GPU cluster)
  • Prompt engineering (limited effect on retrieval quality)
  • Training from scratch (prohibitive)

With LoRA/QLoRA, a search team with a single GPU can produce a domain-adapted embedding model or LLM reranker in hours.

  • LoRA — the dominant PEFT method
  • QLoRA — LoRA + 4-bit quantization
  • Embedding Fine-tuning — domain adaptation of embedding models; PEFT is the practical path
  • LLM — primary target for PEFT in search pipelines