Differentiable Search Index

DSI (Tay et al., Google, 2022 — “Transformer Memory as a Differentiable Search Index”) is the seminal formulation of Generative Retrieval for document search. A single Transformer is trained to map a query directly to the identifier of a relevant document, generating the docid autoregressively. The corpus is encoded into the model’s parameters — the “index” is the network itself, hence differentiable.


The Core Idea

Conventional retrieval separates indexing (build an inverted or ANN index) from retrieval (look up the query). DSI collapses both into one model with two training objectives:

  • Indexing: learn document text → docid (memorize the corpus).
  • Retrieval: learn query → docid (answer information needs).

At inference, the model performs constrained beam search over the docid vocabulary to return a ranked list.

Document Identifiers

DSI’s central insight is that how you name documents matters. It explores several schemes, finding that semantically structured identifiers — built by hierarchically clustering document embeddings so related documents share ID prefixes — generalize far better than arbitrary atomic numbers. This is the direct ancestor of modern Semantic IDs and the RQ-VAE code construction used in recommendation.

Significance

DSI launched the generative-retrieval research line (NCI, GenRet, and the recsys offshoot TIGER). It reframed retrieval as a sequence-generation problem and showed that a model can serve as both store and search index — at the cost of corpus updates now meaning model updates, the paradigm’s main open problem.

Articles