# Questions Are All You Need to Train a Dense Passage Retriever

This <mark style="color:blue;">**June 2022**</mark> paper (arXiv 2206.10658, later published in TACL 2023) introduces ART (Autoencoding-based Retriever Training), a novel approach for <mark style="color:yellow;">training dense passage retrieval models</mark> without the need for labelled training data.

{% embed url="https://arxiv.org/abs/2206.10658" %}
Questions Are All You Need to Train a Dense Passage Retriever
{% endembed %}

### <mark style="color:purple;">Motivation</mark>

* Dense passage retrieval is a core component of knowledge-intensive tasks such as open-domain question answering (OpenQA).
* Current state-of-the-art methods require large supervised datasets with custom hard-negative mining and denoising of positive examples.
* ART addresses this limitation by training retrieval models <mark style="color:yellow;">using only unpaired inputs and outputs</mark> (questions and potential answer passages).

### <mark style="color:purple;">Methodology</mark>

* ART uses a new passage-retrieval autoencoding scheme:
  * Given an input question, ART retrieves a set of evidence passages.
  * The retrieved passages are then used to compute the probability of reconstructing the original question.
* Training the retriever based on question reconstruction enables effective unsupervised learning of both passage and question encoders.
* ART leverages a generative pre-trained language model (PLM) to compute the question reconstruction likelihood conditioned on the retrieved passages.
* The PLM is used in a zero-shot manner, without the need for finetuning, allowing ART to use large PLMs and obtain accurate soft-label estimates of passage relevance.
* The retriever is trained to minimise the divergence between the passage likelihood computed by the retriever and the soft-label scores obtained from the PLM, as formalised below.
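
Putting these pieces together, the per-question objective can be written as follows. The notation is introduced here for illustration (question encoder E\_Q, passage encoder E\_P, retrieved passages z\_1, …, z\_K); details such as temperature scaling are elided, and the paper's exact symbols may differ:

$$
P_{\text{retr}}(z_i \mid q) = \frac{\exp\left(E_Q(q)^{\top} E_P(z_i)\right)}{\sum_{j=1}^{K} \exp\left(E_Q(q)^{\top} E_P(z_j)\right)}, \qquad P_{\text{PLM}}(z_i \mid q) = \frac{\exp\left(\frac{1}{|q|}\log P\left(q \mid z_i\right)\right)}{\sum_{j=1}^{K} \exp\left(\frac{1}{|q|}\log P\left(q \mid z_j\right)\right)}
$$

$$
\mathcal{L}(q) = \mathrm{KL}\left(P_{\text{PLM}}(\cdot \mid q) \,\Vert\, P_{\text{retr}}(\cdot \mid q)\right)
$$

Because the soft labels come from question reconstruction alone, no gold question-passage pairs appear anywhere in the objective.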

### <mark style="color:purple;">Technical Details</mark>

<mark style="color:green;">**Dual Encoder Retriever**</mark>

* ART uses a dual-encoder model consisting of separate question and passage encoders.
* The encoders map questions and passages to a shared latent embedding space.
* The retrieval score for a question-passage pair is computed as the dot product of their embeddings, as sketched below.
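
A minimal sketch of this scoring scheme, assuming generic encoder modules; the linear layers below are stand-ins for the transformer encoders used in the paper:

```python
# Minimal dual-encoder sketch: two separate encoders map questions and
# passages into a shared embedding space; relevance is their dot product.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, question_encoder: nn.Module, passage_encoder: nn.Module):
        super().__init__()
        self.question_encoder = question_encoder
        self.passage_encoder = passage_encoder

    def score(self, questions: torch.Tensor, passages: torch.Tensor) -> torch.Tensor:
        q = self.question_encoder(questions)   # (num_questions, dim)
        p = self.passage_encoder(passages)     # (num_passages, dim)
        return q @ p.T                         # dot-product retrieval scores

# Toy usage with stand-in linear encoders over 8-dimensional features.
dim = 8
model = DualEncoder(nn.Linear(dim, dim), nn.Linear(dim, dim))
scores = model.score(torch.randn(2, dim), torch.randn(5, dim))
print(scores.shape)  # torch.Size([2, 5]): one score per question-passage pair
```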

<mark style="color:green;">**Zero-Shot Cross-Attention Scorer**</mark>

* ART uses a large generative pre-trained language model (PLM) to compute the relevance score for a question-passage pair.
* The PLM is prompted with the passage as input, and the likelihood of the question tokens is computed using teacher forcing.
* The relevance score is thus approximated by the autoregressive log-likelihood of the question tokens conditioned on the passage, as sketched below.
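
A hedged sketch of the scorer using the Hugging Face `transformers` API. The model name and prompt wording are illustrative assumptions (the paper uses an instruction-tuned T0 model, but the exact prompt may differ); `loss` is the mean cross-entropy over the question tokens, so its negation is a length-normalised log-likelihood:

```python
# Zero-shot cross-attention scorer sketch: the frozen PLM reads the passage
# and scores the question tokens with teacher forcing. Model name and prompt
# wording here are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B").eval()

def relevance_score(passage: str, question: str) -> float:
    """Length-normalised log P(question | passage) under the frozen PLM."""
    prompt = f"Passage: {passage} Please write a question based on this passage."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(question, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing `labels` runs the decoder with teacher forcing; `loss` is
        # the mean token cross-entropy, i.e. a negative mean log-likelihood.
        loss = model(**inputs, labels=labels).loss
    return -loss.item()
```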

<mark style="color:green;">**Training Algorithm**</mark>

* ART retrieves the top-K passages for each question using the current retriever parameters.
* The retriever likelihood is computed over the retrieved top-K passages.
* The PLM is used to compute relevance scores for the retrieved passages.
* The retriever is trained by minimising the KL divergence between the retriever likelihood and the PLM relevance scores.
* The evidence passage embeddings are periodically updated to prevent staleness; one training step under this scheme is sketched below.
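
A minimal sketch of a single training step, assuming the pieces above: `retriever` and `plm_scorer` are placeholders (e.g., the dual encoder and scorer sketched earlier), the top-K passage texts are assumed to come from the periodically refreshed index, and the temperature handling is a simplification:

```python
import torch
import torch.nn.functional as F

def art_training_step(question, topk_passages, retriever, plm_scorer, tau=1.0):
    """One ART update on a question and its top-K retrieved passages."""
    # Re-encode the retrieved passages with the current parameters so
    # gradients reach both encoders; the full index is refreshed periodically.
    q = retriever.encode_question(question)                                # (dim,)
    p = torch.stack([retriever.encode_passage(t) for t in topk_passages])  # (K, dim)

    # Retriever likelihood over the retrieved set: softmax of dot products.
    retriever_log_probs = F.log_softmax(p @ q / tau, dim=-1)

    # Soft labels from the frozen PLM: question log-likelihood given each
    # passage, normalised into a distribution over the top-K (no gradients).
    with torch.no_grad():
        plm_scores = torch.tensor([plm_scorer(t, question) for t in topk_passages])
        plm_probs = F.softmax(plm_scores / tau, dim=-1)

    # KL(PLM soft labels || retriever likelihood): only the retriever learns.
    return F.kl_div(retriever_log_probs, plm_probs, reduction="sum")
```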

### <mark style="color:purple;">Experimental Results</mark>

* ART achieves state-of-the-art results on multiple QA retrieval benchmarks, outperforming previous unsupervised models and matching the performance of strong supervised models.
* ART demonstrates strong generalisation performance on out-of-distribution datasets, even when trained on a mix of answerable and unanswerable questions.
* Scaling up the retriever model size further improves the performance of ART.

### <mark style="color:purple;">Analysis</mark>

* ART is highly sample-efficient, outperforming BM25 with just 100 training questions and DPR with just 1,000.
* The training process is not sensitive to the initial retriever parameters.
* The number of passages retrieved during training affects performance, with 32 providing a reasonable middle ground.
* The instruction-tuned T0 PLM provides the most accurate relevance scores compared with PLMs trained using other strategies.

In summary, ART introduces a novel unsupervised approach for training dense passage retrieval models using only questions.

By leveraging a generative PLM for question reconstruction and optimising the retriever to match the PLM's relevance scores, ART achieves state-of-the-art performance without the need for labelled training data.

### <mark style="color:purple;">Commercial Application</mark>

The ART approach introduced in this paper has significant potential for commercial applications, particularly in domains where large amounts of labelled training data are scarce or expensive to obtain.

#### <mark style="color:green;">Enterprise Search Engines</mark>

Many companies have large repositories of internal documents, knowledge bases, and customer support information.

Building an effective search engine to retrieve relevant information can be challenging without labelled training data. ART could be used to train a dense passage retriever using only the questions or queries employees typically ask, making it easier to find relevant information quickly and accurately.

#### <mark style="color:green;">Chatbots and Virtual Assistants</mark>

Chatbots and virtual assistants often rely on retrieving relevant information from a knowledge base to answer user queries.

Using ART, companies could train a retrieval model using a large set of user questions without needing to manually annotate the relevance of each passage. This would enable more accurate and efficient question-answering capabilities, improving the user experience and reducing the workload on human support agents.

#### <mark style="color:green;">E-commerce Product Search</mark>

Online retailers often struggle with providing accurate and relevant search results for product queries, especially when dealing with a large and diverse product catalogue.

By leveraging ART, e-commerce companies could train a dense passage retriever using customer questions and product descriptions, enabling more effective product search and recommendation.

#### <mark style="color:green;">Legal and Medical Information Retrieval</mark>

In domains like law and medicine, retrieving relevant information from large collections of documents (e.g., legal cases, medical research papers) is crucial.

ART could be used to train retrieval models using questions posed by legal professionals or medical practitioners, making it easier to find relevant precedents, case studies, or research findings.

#### <mark style="color:green;">News and Media Recommendation</mark>

News aggregators and media platforms could use ART to train retrieval models using user queries and article headlines or summaries.

This would enable personalised news and media recommendations based on the user's interests and previous search history, improving engagement and user satisfaction.
