QLoRA — Quantized Low-Rank Adaptation

LoRA fine-tuning with the frozen base model loaded in 4-bit precision. Dettmers et al., 2023 (arXiv:2305.14314). Makes it practical to fine-tune large models (65B+ params) on a single consumer GPU.

What It Adds to LoRA

ComponentDescription
NF4 quantizationNormalFloat4 — 4-bit data type optimised for weights that follow a normal distribution; theoretically optimal for normally-distributed model weights
Double quantizationQuantize the quantization constants themselves; saves ~0.37 bits/param
Paged optimizersUse NVIDIA unified memory to page optimizer states to CPU RAM during memory spikes; prevents OOM crashes

The LoRA adapters (A and B matrices) are still trained in bfloat16 — only the frozen base model weights are quantized.

Memory Impact

A 65B parameter model in float16 requires ~130 GB VRAM — impossible on a single GPU. With QLoRA:

  • 65B model in NF4 ≈ 33 GB
  • LoRA adapter (rank 16) ≈ minimal
  • Fits on a single 48GB A100 or 2× consumer 24GB GPUs

Quality: QLoRA fine-tuned 65B is competitive with ChatGPT (GPT-3.5 level) on instruction-following benchmarks.

QLoRA is the practical path for search practitioners who need custom LLM behaviour without enterprise GPU budgets:

  • Query understanding: fine-tune a 7B–13B LLM on domain-specific query→intent examples
  • Judgment generation: consistent relevance labels at scale
  • RAG synthesis: domain vocabulary and answer format
  • Reranker: fine-tune a small cross-encoder for domain-specific ranking

See Fine Tune LLM on a Custom Dataset with QLoRA — end-to-end tutorial (Phi-2 on custom dataset, SFTTrainer).

QLoRA vs. LoRA

LoRAQLoRA
Base model precisionbfloat16 / float16NF4 (4-bit)
VRAM neededLarge~4× less
Training speedFasterSlightly slower (quantization overhead)
QualityBestNear-identical to LoRA
Use whenSufficient VRAMSingle consumer GPU / cost-constrained