Questions Are All You Need to Train a Dense Passage Retriever
This April 2023 paper introduces a novel approach called ART (Autoencoding-based Retriever Training) for training dense passage retrieval models without the need for labelled training data.
Dense passage retrieval is a core component of open-domain tasks such as open-domain question answering (OpenQA).
Current state-of-the-art methods require large supervised datasets with custom hard-negative mining and denoising of positive examples.
ART addresses this limitation by training retrieval models using only unpaired inputs and outputs (questions and potential evidence passages), with no annotated question-passage pairs.
ART uses a new passage-retrieval autoencoding scheme:
Given an input question, ART retrieves a set of evidence passages.
The retrieved passages are then used to compute the probability of reconstructing the original question.
Training the retriever based on question reconstruction enables effective unsupervised learning of both passage and question encoders.
ART leverages a generative pre-trained language model (PLM) to compute the question reconstruction likelihood conditioned on the retrieved passages.
The PLM is used in a zero-shot manner, without the need for finetuning, allowing ART to use large PLMs and obtain accurate soft-label estimates of passage relevance.
The retriever is trained to minimize the divergence between the passage likelihood computed by the retriever and the soft-label scores obtained from the PLM.
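Putting these pieces together, the objective has roughly the following form. This is a paraphrased sketch rather than the paper's exact notation: E_Q and E_P denote the question and passage encoders, and τ is a softmax temperature over the PLM scores; the precise prompt and length normalisation follow the paper and may differ from what is shown here.

```latex
% Sketch of the ART objective for one question q with top-K retrieved passages z_1..z_K.
\begin{aligned}
s(q, z_k) &= E_Q(q)^{\top} E_P(z_k)
  &&\text{dual-encoder retrieval score} \\
Q(z_k \mid q) &= \frac{\exp s(q, z_k)}{\sum_{j=1}^{K} \exp s(q, z_j)}
  &&\text{retriever likelihood over the top-}K \\
\log p_{\text{PLM}}(q \mid z_k) &= \sum_{t=1}^{|q|} \log P_{\text{PLM}}\!\left(q_t \mid q_{<t}, z_k\right)
  &&\text{teacher-forced question reconstruction} \\
P(z_k \mid q) &\propto \exp\!\left(\tfrac{1}{\tau}\log p_{\text{PLM}}(q \mid z_k)\right)
  &&\text{PLM soft labels (no gradient)} \\
\mathcal{L} &= \mathrm{KL}\!\left(P(\cdot \mid q) \,\middle\|\, Q(\cdot \mid q)\right)
  &&\text{loss minimised with respect to the retriever}
\end{aligned}
```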
Dual Encoder Retriever
ART uses a dual-encoder model consisting of separate question and passage encoders.
The encoders map questions and passages to a shared latent embedding space.
The retrieval score for a question-passage pair is computed using the dot product of their embeddings.
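A minimal dual-encoder sketch in PyTorch is shown below. Initialising both encoders from bert-base-uncased and pooling with the [CLS] vector are illustrative assumptions, not necessarily the paper's exact setup.

```python
from torch import nn
from transformers import AutoModel

class DualEncoder(nn.Module):
    """Minimal dual-encoder sketch: separate question and passage encoders
    mapping text into a shared embedding space."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.question_encoder = AutoModel.from_pretrained(model_name)
        self.passage_encoder = AutoModel.from_pretrained(model_name)

    @staticmethod
    def _embed(encoder, inputs):
        # Take the [CLS] hidden state as the sequence embedding.
        return encoder(**inputs).last_hidden_state[:, 0]

    def score(self, question_inputs, passage_inputs):
        q = self._embed(self.question_encoder, question_inputs)  # [B, d]
        p = self._embed(self.passage_encoder, passage_inputs)    # [N, d]
        return q @ p.T  # dot-product retrieval scores, shape [B, N]
```

Because the score is a plain dot product, passage embeddings can be pre-computed and indexed, so retrieval reduces to a maximum inner-product search over the corpus.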
Zero-Shot Cross-Attention Scorer
ART uses a large generative PLM to compute the relevance score for a question-passage pair.
The PLM is prompted with the passage as input, and the question tokens are scored using teacher forcing.
The relevance score is approximated by the autoregressive log-likelihood of the question tokens conditioned on the passage.
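A minimal sketch of this scoring step using the Hugging Face transformers API is given below. The paper's best results use an instruction-tuned T0 model; t5-base here is a small stand-in with the same seq2seq interface, and the prompt wording and use of the mean token log-likelihood as the score are illustrative assumptions.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "t5-base"  # stand-in; the paper uses an instruction-tuned T0 model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
plm = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def question_reconstruction_score(passage: str, question: str) -> float:
    """Teacher-forced log-likelihood of the question conditioned on the passage."""
    prompt = f"Passage: {passage} Please write a question based on this passage."
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(question, return_tensors="pt").input_ids
    out = plm(**enc, labels=labels)
    # `out.loss` is the mean token-level negative log-likelihood of the question
    # under teacher forcing; its negation serves as a relevance score.
    return -out.loss.item()
```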
Training Algorithm
At each training step, ART retrieves the top-K passages for each question using the current retriever parameters.
The retriever likelihood is computed using the top-K passages.
The PLM is used to compute relevance scores for the retrieved passages.
The retriever is trained by minimising the KL divergence between the retriever likelihood and the PLM relevance scores.
The evidence passage embeddings are periodically updated to prevent staleness.
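The per-question loss can be sketched as below, assuming the embeddings and PLM scores have already been computed; the function name and the exact temperature handling are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def art_loss(question_emb, passage_embs, plm_log_likelihoods, temperature=1.0):
    """ART loss for one question and its top-K retrieved passages.

    question_emb:        [d]    output of the question encoder
    passage_embs:        [K, d] embeddings of the top-K retrieved passages
    plm_log_likelihoods: [K]    teacher-forced question log-likelihoods from the PLM
    """
    # Retriever likelihood over the retrieved set (dot-product scores).
    retriever_logits = passage_embs @ question_emb       # [K]
    log_q = F.log_softmax(retriever_logits, dim=-1)

    # PLM soft labels, treated as a fixed target (no gradient through the PLM).
    with torch.no_grad():
        p = F.softmax(plm_log_likelihoods / temperature, dim=-1)

    # KL(P_plm || Q_retriever); gradients update only the retriever.
    return F.kl_div(log_q, p, reduction="sum")
```

In a full training loop this step sits inside an outer loop that periodically re-encodes the passage corpus, so that the top-K retrieval does not go stale as the encoders are updated.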
ART achieves state-of-the-art results on multiple QA retrieval benchmarks, outperforming previous unsupervised models and matching the performance of strong supervised models.
ART demonstrates strong generalisation performance on out-of-distribution datasets, even when trained on a mix of answerable and unanswerable questions.
Scaling up the retriever model size further improves the performance of ART.
ART is highly sample-efficient, outperforming BM25 with just 100 questions and DPR with just 1,000 questions.
The training process is not sensitive to the initial retriever parameters.
The choice of the number of retrieved passages during training affects the performance, with 32 passages providing a reasonable middle ground.
The instruction-tuned T0 PLM provides the most accurate relevance scores compared to other PLM training strategies.
In summary, ART introduces a novel unsupervised approach for training dense passage retrieval models using only questions.
By leveraging a generative PLM for question reconstruction and optimising the retriever to match the PLM's relevance scores, ART achieves state-of-the-art performance without the need for labelled training data.
The ART approach introduced in this paper has significant potential for commercial applications, particularly in domains where large amounts of labelled training data are scarce or expensive to obtain.
Many companies have large repositories of internal documents, knowledge bases, and customer support information.
Building an effective search engine to retrieve relevant information can be challenging without labeled training data. ART could be used to train a dense passage retriever using only the questions or queries employees typically ask, making it easier to find relevant information quickly and accurately.
Chatbots and virtual assistants often rely on retrieving relevant information from a knowledge base to answer user queries.
Using ART, companies could train a retrieval model using a large set of user questions without needing to manually annotate the relevance of each passage. This would enable more accurate and efficient question-answering capabilities, improving the user experience and reducing the workload on human support agents.
Online retailers often struggle with providing accurate and relevant search results for product queries, especially when dealing with a large and diverse product catalogue.
By leveraging ART, e-commerce companies could train a dense passage retriever using customer questions and product descriptions, enabling more effective product search and recommendation.
In domains like law and medicine, retrieving relevant information from large collections of documents (e.g., legal cases, medical research papers) is crucial.
ART could be used to train retrieval models using questions posed by legal professionals or medical practitioners, making it easier to find relevant precedents, case studies, or research findings.
News aggregators and media platforms could use ART to train retrieval models using user queries and article headlines or summaries.
This would enable personalized news and media recommendations based on the user's interests and previous search history, improving engagement and user satisfaction.