Hypothetical Document Embeddings (HyDE)
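HyDE addresses the poor generalization of embedding retrievers to unfamiliar data. An instruction-following LLM generates “hypothetical documents” in response to a query, and their embeddings are used for retrieval in place of the query's own embedding. The generated text does not need to be factually accurate; it only needs to capture the patterns of a relevant document.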
When to use it
- Retrieval performance is poor (low recall)
- Your pipeline queries against large document collections
- Your domain data differs substantially from the retriever’s training data
How it works
- An LLM generates ~5 hypothetical documents that answer the query
- Each hypothetical document is encoded into an embedding vector
- Vectors are averaged into a single consolidated embedding
- That embedding is used for vector similarity search against real documents
Because the averaged embedding is built from document-like text, the search compares documents with documents instead of matching the query string itself against the collection.
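The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Haystack pipeline: `fake_generate` and `toy_embed` are hypothetical stand-ins for a real LLM and a real embedding model, and `hyde_search` is a name chosen here for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]

def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors (step 3)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hyde_search(
    query: str,
    generate: Callable[[str], List[str]],
    embed: Callable[[str], Vector],
    corpus: Sequence[str],
    top_k: int = 3,
) -> List[str]:
    hypothetical = generate(query)                    # step 1: hypothetical documents
    vectors = [embed(text) for text in hypothetical]  # step 2: one embedding per document
    search_vector = mean_vector(vectors)              # step 3: average into one embedding
    # Step 4: rank the real documents by similarity to the averaged embedding.
    scored = [(cosine(search_vector, embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Hypothetical stand-ins, for illustration only.
VOCAB = ["solar", "panel", "tax", "bread"]

def toy_embed(text: str) -> Vector:
    """Counts vocabulary words; a real system would call an embedding model."""
    tokens = [t.strip(".,?!") for t in text.lower().split()]
    return [float(tokens.count(word)) for word in VOCAB]

def fake_generate(query: str) -> List[str]:
    """Returns canned text; a real system would prompt an LLM with the query."""
    return [
        "Solar panel output depends on sunlight.",
        "A solar panel converts light to power.",
    ]

corpus = [
    "Install a solar panel on the roof",
    "Bread needs yeast",
    "Tax forms are due in April",
]
results = hyde_search(
    "How do I get electricity from the sun?", fake_generate, toy_embed, corpus, top_k=1
)
```

In this toy setup the query shares no vocabulary with the corpus, so embedding it directly gives an all-zero vector, while the hypothetical documents land close to the relevant entry. Real embedding models are far less brittle, but the sketch shows why searching with document-like text can help.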
Haystack implementation
Components needed:
- OpenAI generator (configured for multiple outputs)
- Prompt builder
- Output adapter
- Document embedder
- Custom component to compute mean of hypothetical document embeddings
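The last item, the averaging component, could look roughly like the sketch below. It deliberately avoids importing Haystack: `Doc` is a hypothetical stand-in for an embedded document, and the class name and the `hypothetical_embedding` output key are assumptions made here. In Haystack 2.x a custom component is typically a class marked with the `@component` decorator whose `run` method returns a dictionary of outputs, which is the shape mirrored here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Doc:
    """Hypothetical stand-in for an embedded document (text plus vector)."""
    content: str
    embedding: Optional[List[float]] = None

class HypotheticalDocumentAverager:
    """Averages the embeddings of the generated hypothetical documents."""

    def run(self, documents: List[Doc]) -> Dict[str, List[float]]:
        # Skip documents the embedder did not fill in.
        vectors = [d.embedding for d in documents if d.embedding is not None]
        if not vectors:
            raise ValueError("no embedded documents to average")
        if len({len(v) for v in vectors}) != 1:
            raise ValueError("embeddings must share one dimension")
        count = len(vectors)
        # zip(*vectors) groups the values of each dimension together.
        mean = [sum(column) / count for column in zip(*vectors)]
        return {"hypothetical_embedding": mean}

averager = HypotheticalDocumentAverager()
output = averager.run([Doc("first", [1.0, 2.0]), Doc("second", [3.0, 4.0])])
```

Returning a dictionary keeps the output addressable by name, so a downstream retriever can be connected to `hypothetical_embedding` as its query embedding.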