GGUF Quantization: A Technical Deep Dive
Author: Michael Hannecke (two-part series)
What is GGUF?
GGUF (GPT-Generated Unified Format): flexible key-value metadata format for local LLM deployment. Evolved from GGML (llama.cpp, 2023) to solve extensibility problems. De facto standard for local inference, supporting 40+ architectures.
Benefits: quantization-native storage, mmap loading (instant “load”), efficient SIMD alignment.
How Block Quantization Works
Quantize: q = round((weight - zero_point) / scale)
Dequantize: weight' = scale × q + zero_point
Weights grouped into blocks (32 or 256 values), each with its own scale. This adapts to local weight distributions.
Type-0 (scale-only): weight = scale × q — symmetric
Type-1 (scale + minimum): weight = scale × q + minimum — asymmetric
Three Quantization Families
Legacy Quants (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0)
- One scale per 32-weight block
- Q8_0: near-lossless (+0.01 PPL), fastest CPU path
K-Quants (Q2_K through Q6_K)
- Super-block structure (256 weights): scales themselves are quantized (double quantization)
- Mixed precision:
_Mvariants apply higher precision to sensitive tensors (embeddings, attention V projections) - Q4_K_M: recommended default — best quality/size balance
I-Quants (IQ1_S through IQ4_NL)
- Non-linear quantization levels (learned, not uniform)
- Importance matrix (imatrix): weights sensitive to inference get more precision
- Essential for aggressive quantization (<4 bits)
Quick Selection Guide
| Scenario | Recommended |
|---|---|
| Plenty of VRAM/RAM | Q8_0 or Q6_K (near-lossless) |
| Best balance | Q4_K_M (safe default) |
| Tight memory | Q4_K_S or IQ4_XS |
| Max CPU speed | Q4_0 (legacy, simple dequant) |
| Edge/mobile | IQ3_M or IQ2_M (with imatrix) |
Importance Matrix
For aggressive quantization (IQ3, IQ2): generate imatrix from calibration data to identify which weights matter most.
./llama-imatrix -m model-f16.gguf -f calibration.txt --chunk 512 -o model-imatrix.dat
./llama-quantize --imatrix model-imatrix.dat model-f16.gguf model-IQ3_M.gguf IQ3_MCalibration: diverse text preferred over narrow domain data (avoids overfitting).
Hardware Recommendations
- Apple Silicon: Q4_K_M or Q5_K_M (Metal optimized)
- NVIDIA GPU: choose largest quant that fits VRAM; Q4_K_M for RTX 4090 34B models
- CPU only: Q4_0 (max speed, AVX-512) or Q4_K_M (quality)
- Edge/mobile: IQ2_M or IQ3_S
Production Deployment (Part 2)
llama-server: OpenAI-compatible endpoints (/v1/chat/completions,/v1/embeddings)llama-cpp-python: Python bindings with CUDA/Metal/ROCm/Vulkan backends- Ollama: simplest approach; can serve GGUF directly from HuggingFace via
ollama run hf.co/... - Docker/Kubernetes deployment patterns covered