HyDE: Revolutionising Search with Hypothetical Document Embeddings
This December 2022 paper, "Precise Zero-Shot Dense Retrieval without Relevance Labels", introduced HyDE (Hypothetical Document Embeddings), an approach that seeks to address the limitations of traditional keyword-based search by leveraging Large Language Models (LLMs) to create a new framework for searching and retrieving information from vast document repositories.
Zero-shot dense retrieval
This refers to the task of retrieving relevant documents from a large corpus without any labelled data to train the model on what makes a document relevant to a query.
This scenario is challenging because traditional retrieval systems rely heavily on relevance labels to learn how to rank documents.
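To make the setting concrete, here is a minimal sketch of the basic dense retrieval loop: the query and every document are mapped to vectors, and documents are ranked by vector similarity. The encode function below is a toy stand-in (a hashed bag-of-words) so the example runs anywhere; in a real system it would be a neural encoder such as Contriever.

```python
import numpy as np

def encode(text: str) -> np.ndarray:
    """Toy stand-in for a dense encoder: a hashed bag-of-words, L2-normalised.
    A real system would use a neural encoder (e.g. Contriever) here."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

corpus = [
    "The Eiffel Tower is located in Paris, France.",
    "Photosynthesis converts sunlight into chemical energy.",
    "Paris is the capital and largest city of France.",
]

query = "what is the capital of France"

doc_vecs = np.stack([encode(d) for d in corpus])  # offline: embed the corpus once
query_vec = encode(query)                         # online: embed the incoming query

scores = doc_vecs @ query_vec                     # inner-product similarity
for idx in np.argsort(-scores):                   # rank documents, best first
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

HyDE keeps this ranking step unchanged; what it replaces is the query-side embedding, as described below.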
Difficulty of encoding relevance
Without relevance labels, it's challenging for models to learn what content is relevant to a given query, making zero-shot learning particularly difficult in dense retrieval contexts.
Hypothetical Document Embeddings (HyDE)
The core idea behind HyDE is to generate a "hypothetical" document that captures the relevance patterns for a given query, even though this document is not real and may contain false details.
This is achieved in two main steps, sketched in code after the list:
Generating a hypothetical document: Using an instruction-following language model, HyDE takes a query and instructs the model to generate a document that answers it. This generated document captures the essence of what a relevant document should contain, despite not being real and potentially containing inaccuracies.
Encoding and retrieving: The hypothetical document is then encoded into an embedding vector using an unsupervised contrastively learned encoder (e.g., Contriever). This embedding is used to identify similar real documents in the corpus based on vector similarity. The encoder's role is to filter out incorrect details and focus on the core relevance patterns encoded in the hypothetical document's embedding.
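The following sketch puts the two steps together. The prompt wording follows the style of instruction used in the paper ("write a passage to answer the question"), but the OpenAI client, the model name, and the encode callable are illustrative choices rather than the paper's exact setup (which used InstructGPT for generation and Contriever for encoding).

```python
import numpy as np
from openai import OpenAI  # any instruction-following LLM client would work here

client = OpenAI()  # assumes OPENAI_API_KEY is set; the model name below is illustrative

def generate_hypothetical_document(query: str) -> str:
    """Step 1: instruct an LLM to write a passage that answers the query.
    The passage may contain factual errors; only its relevance patterns matter."""
    prompt = f"Please write a passage to answer the question: {query}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def hyde_retrieve(query: str, corpus: list[str], doc_vecs: np.ndarray, encode, k: int = 5):
    """Step 2: embed the hypothetical document with the same encoder used for the
    corpus, then rank real documents by inner-product similarity."""
    hypothetical_doc = generate_hypothetical_document(query)
    hyde_vec = encode(hypothetical_doc)
    scores = doc_vecs @ hyde_vec
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]
```

The important detail is that the hypothetical document is embedded by the same encoder that produced the corpus embeddings, so the similarity search happens in one shared vector space; the generated text itself is discarded after encoding.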
HyDE and Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) combines the power of a retriever and a generator to enhance language models' capability to produce informed and accurate outputs based on retrieved documents.
RAG retrieves documents relevant to a query and then uses these documents to generate a response, effectively augmenting the generative model's knowledge.
HyDE, while not explicitly augmenting generation with retrieval in the same way RAG does, similarly leverages retrieval as a critical component of its process.
However, HyDE's innovation lies in creating a bridge between the query and the corpus through a generated document that captures the essence of relevance, rather than directly using retrieved documents to inform generation.
The essence of the approach is to create an "enhanced query" through the generation of a hypothetical document that captures the intent and relevance patterns of the original query.
This process effectively transforms the initial query into a more detailed, context-rich representation that mirrors the content one would expect to find in relevant documents.
While the paper uses an unsupervised contrastively trained encoder (Contriever) as its embedding model, this component is now commonly replaced with modern, purpose-built transformer-based embedding models.
Even relatively small, low-parameter models such as BERT and its variants produce contextual embeddings directly from text; used as encoders, they generate rich embeddings that capture a document's context and relevance.
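As an illustration, the snippet below uses the sentence-transformers library with a small BERT-derived model as the encoder in place of Contriever; the model name and the example hypothetical document are illustrative choices, not part of the original paper.

```python
from sentence_transformers import SentenceTransformer, util

# Small BERT-derived encoder; any sentence-embedding model could be swapped in.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "The Eiffel Tower is located in Paris, France.",
    "Photosynthesis converts sunlight into chemical energy.",
    "Paris is the capital and largest city of France.",
]
doc_vecs = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# A hypothetical document generated (by an LLM) for the query
# "what is the capital of France"; it acts as the enhanced query.
hypothetical_doc = (
    "The capital of France is Paris, a city on the Seine that serves as the "
    "country's political and cultural centre."
)
hyde_vec = model.encode(hypothetical_doc, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(hyde_vec, doc_vecs)[0]  # cosine similarity to every real document
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```

Because both the corpus and the hypothetical document pass through the same encoder, swapping in a stronger embedding model changes nothing else in the pipeline.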