GGUF Quantization: A Technical Deep Dive

Author: Michael Hannecke (two-part series)

What is GGUF?

GGUF (GPT-Generated Unified Format): flexible key-value metadata format for local LLM deployment. Evolved from GGML (llama.cpp, 2023) to solve extensibility problems. De facto standard for local inference, supporting 40+ architectures.

Benefits: quantization-native storage, mmap loading (instant “load”), efficient SIMD alignment.

How Block Quantization Works

Quantize:    q = round((weight - zero_point) / scale)
Dequantize:  weight' = scale × q + zero_point

Weights grouped into blocks (32 or 256 values), each with its own scale. This adapts to local weight distributions.

Type-0 (scale-only): weight = scale × q — symmetric
Type-1 (scale + minimum): weight = scale × q + minimum — asymmetric

Three Quantization Families

Legacy Quants (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0)

  • One scale per 32-weight block
  • Q8_0: near-lossless (+0.01 PPL), fastest CPU path

K-Quants (Q2_K through Q6_K)

  • Super-block structure (256 weights): scales themselves are quantized (double quantization)
  • Mixed precision: _M variants apply higher precision to sensitive tensors (embeddings, attention V projections)
  • Q4_K_M: recommended default — best quality/size balance

I-Quants (IQ1_S through IQ4_NL)

  • Non-linear quantization levels (learned, not uniform)
  • Importance matrix (imatrix): weights sensitive to inference get more precision
  • Essential for aggressive quantization (<4 bits)

Quick Selection Guide

ScenarioRecommended
Plenty of VRAM/RAMQ8_0 or Q6_K (near-lossless)
Best balanceQ4_K_M (safe default)
Tight memoryQ4_K_S or IQ4_XS
Max CPU speedQ4_0 (legacy, simple dequant)
Edge/mobileIQ3_M or IQ2_M (with imatrix)

Importance Matrix

For aggressive quantization (IQ3, IQ2): generate imatrix from calibration data to identify which weights matter most.

./llama-imatrix -m model-f16.gguf -f calibration.txt --chunk 512 -o model-imatrix.dat
./llama-quantize --imatrix model-imatrix.dat model-f16.gguf model-IQ3_M.gguf IQ3_M

Calibration: diverse text preferred over narrow domain data (avoids overfitting).

Hardware Recommendations

  • Apple Silicon: Q4_K_M or Q5_K_M (Metal optimized)
  • NVIDIA GPU: choose largest quant that fits VRAM; Q4_K_M for RTX 4090 34B models
  • CPU only: Q4_0 (max speed, AVX-512) or Q4_K_M (quality)
  • Edge/mobile: IQ2_M or IQ3_S

Production Deployment (Part 2)

  • llama-server: OpenAI-compatible endpoints (/v1/chat/completions, /v1/embeddings)
  • llama-cpp-python: Python bindings with CUDA/Metal/ROCm/Vulkan backends
  • Ollama: simplest approach; can serve GGUF directly from HuggingFace via ollama run hf.co/...
  • Docker/Kubernetes deployment patterns covered