Semantic IDs for Recommendation Systems

A hands-on explainer by Janu Verma (Incomplete Distillation, 4 Aug 2025) that builds Semantic IDs from the ground up — from vector quantization through RQ-VAE — and reproduces a TIGER-style generative recommendation pipeline on a real dataset.

Vault framing. This note is anchored to retrieval, not recsys: semantic IDs are the identifier scheme behind Generative Retrieval and its IR-native origin Differentiable Search Index. The article is the accessible reference; the conceptual weight lives in the Semantic IDs and RQ-VAE notes.

Source: https://januverma.substack.com/p/semantic-ids-for-recommendation-systems

What It Covers

The problem with atomic IDs — cold start, long-tail bias, sparsity, and poor cross-dataset generalization when identifiers are opaque numbers.
Vector Quantization (VQ) — mapping continuous embeddings to discrete codewords via nearest-neighbour codebook lookup; codebooks via the Linde–Buzo–Gray (k-means) algorithm.
Residual Quantization (RQ) — staged refinement quantizing successive residuals; two 256-entry stages express 256² = 65,536 vectors. See RQ-VAE.
VQ-VAE → RQ-VAE — encoder / VQ layer / decoder, the straight-through estimator for the non-differentiable lookup, and the reconstruction + codebook + commitment loss.
Application to recommendation — a seq2seq Transformer predicts the next item’s semantic ID from a user session (TIGER, arXiv:2305.05065); semantic IDs also used for ranking at YouTube (arXiv:2306.08121).

The Experiment

Dataset: Amazon Beauty (UCSD). Products converted to text (title, brand, category, price) and embedded with a T5 sentence-transformer → 768-d vectors (32,892 × 768).
Quantization: 3 RQ levels, codebook size 256 → semantic IDs of shape 32,892 × 3.
- MSE 0.000122, improving across levels; perplexity 230.84 (level 1) → 154.12 (level 3).
- Uses RQ-KMeans, a memory-efficient k-means variant.
Result: the author’s seq2seq model reaches NDCG@10 = 0.018, versus a YouTube baseline of 0.038 (RQ-VAE + user-specific tokens, 100k training steps).

Why It’s in This Vault

It bridges the existing Embeddings / Vector Quantization cluster into Generative Retrieval — a retrieval paradigm where a model generates identifiers instead of scoring an ANN index. The recsys framing is incidental; the transferable ideas are semantic IDs and RQ-VAE.

People

Janu Verma — author

Awesome Search KG

Explorer

Semantic IDs for Recommendation Systems

Semantic IDs for Recommendation Systems

What It Covers

The Experiment

Why It’s in This Vault

People

Graph View

Table of Contents

Backlinks

Awesome Search KG

Explorer

Semantic IDs for Recommendation Systems

Semantic IDs for Recommendation Systems

What It Covers

The Experiment

Why It’s in This Vault

Related Concepts

People

Graph View

Table of Contents

Backlinks