Fine Tune LLM on a Custom Dataset with QLoRA
End-to-end tutorial for fine-tuning a large language model (Phi-2) on a domain-specific dataset using QLoRA (Quantized LoRA). Relevant for search practitioners who need to adapt LLMs for query understanding, document generation, or RAG pipelines.
Key Concepts
LoRA (Low-Rank Adaptation): instead of fine-tuning all model weights, freeze them and train two small low-rank matrices whose product approximates the weight update. This produces a small adapter (megabytes rather than gigabytes) that is applied on top of the frozen base model.
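To make the size argument concrete, here is a minimal pure-Python sketch of the LoRA idea. The dimensions are assumptions, not values stated in the tutorial: a hidden size of 2560 and 32 layers for Phi-2, with Q/K/V treated as square matrices and biases ignored. The helper names (`lora_param_count`, `lora_forward`) and the `alpha` scaling argument are illustrative only.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two LoRA factors: B (d_out x r) and A (r x d_in)."""
    return d_out * r + r * d_in


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x).

    W is the frozen base weight; only A and B would receive gradients.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Assumed Phi-2-like dimensions (not given in the tutorial).
hidden, n_layers, rank, n_targets = 2560, 32, 32, 3  # targets: Q, K, V

per_matrix_full = hidden * hidden                         # 6,553,600
per_matrix_lora = lora_param_count(hidden, hidden, rank)  # 163,840
total_lora = per_matrix_lora * n_targets * n_layers       # 15,728,640

print(f"LoRA share of one projection: {per_matrix_lora / per_matrix_full:.1%}")
print(f"Adapter size at 16-bit: {total_lora * 2 / 1e6:.1f} MB")
```

Under these assumptions each adapted projection trains 2.5% of the parameters of the full matrix, and the whole adapter comes to roughly 31 MB at 16-bit precision. In standard LoRA initialization B starts at zero, so the adapted model initially behaves exactly like the base model.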
QLoRA: LoRA plus 4-bit quantization of the frozen base model weights. Storing weights in 4 bits instead of 16 cuts weight memory to roughly a quarter, so a 2.7B-parameter model fits on a single consumer GPU, with minimal accuracy loss.
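A back-of-the-envelope calculation shows where the memory saving comes from. This sketch counts weight storage only; it assumes exactly 2.7 billion parameters and ignores activations, optimizer state, the LoRA adapter, and quantization metadata, so real usage will be higher. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 2_700_000_000  # assumed parameter count for a 2.7B model

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
# 32-bit weights: 10.80 GB
# 16-bit weights: 5.40 GB
#  4-bit weights: 1.35 GB
```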
PEFT (Parameter-Efficient Fine-Tuning): umbrella term for techniques like LoRA/QLoRA that train only a small number of parameters while the rest of the model stays frozen.
Process
- Load base model (Phi-2) in 4-bit via BitsAndBytesConfig
- Apply LoRA adapter config (r=32, target Q/K/V projection layers)
- Format dataset as instruction-response pairs
- Train with SFTTrainer (supervised fine-tuning)
- Evaluate with ROUGE metric against human baseline summaries
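The steps above can be sketched roughly as follows. This is not the tutorial's code: the prompt template, the model ID `microsoft/phi-2`, the `nf4` quantization type, the compute dtype, `lora_alpha`, `lora_dropout`, and the module names `q_proj`/`k_proj`/`v_proj` are all assumptions, and the ROUGE function is a simplified whitespace-token ROUGE-1 rather than a full implementation with stemming. The training call is left out because SFTTrainer arguments differ between trl versions.

```python
from collections import Counter

# Hypothetical prompt template; the tutorial's exact wording is not given.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_example(instruction: str, response: str) -> str:
    """Render one instruction-response pair as a single training string."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(), response=response.strip()
    )


def build_qlora_model(model_id: str = "microsoft/phi-2"):
    """Load the base model in 4-bit and attach a LoRA adapter (r=32).

    Third-party imports stay inside the function so the pure-Python helpers
    in this file work without torch/transformers/peft installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumed quantization type
        bnb_4bit_compute_dtype=torch.float16,  # assumed compute dtype
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=32,                                  # rank from the tutorial
        lora_alpha=32,                         # assumed scaling value
        lora_dropout=0.05,                     # assumed dropout
        target_modules=["q_proj", "k_proj", "v_proj"],  # assumed layer names
        bias="none",
        task_type="CAUSAL_LM",
    )
    # The returned model can be passed to trl's SFTTrainer along with the
    # formatted dataset; check the installed trl version for its arguments.
    return get_peft_model(model, lora_config)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercase tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For evaluation, `rouge1_f1(model_summary, human_summary)` would be averaged over a held-out set and compared between the base and fine-tuned models. A maintained ROUGE package is a better choice for reported numbers than this simplified version.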
When to Use Fine-tuning vs RAG
- Fine-tune for: strict output formatting (JSON), domain-specific terminology, cost reduction (small specialized model instead of GPT-4)
- RAG for: frequently changing knowledge, grounded citations, factual retrieval
Search Relevance
Fine-tuning is directly relevant for: LLM-based query rewriting, intent classification, judgement label generation, and RAG synthesis layers.