# REALM: Retrieval-Augmented Language Model Pre-Training

This seminal <mark style="color:blue;">**February 2020**</mark> paper on Retrieval-Augmented Language Model Pre-Training (REALM) has been cited nearly 1,500 times.

The authors proposed REALM, a framework that augments language model pre-training with a learned textual knowledge retriever.

Unlike traditional language models that store knowledge implicitly in their parameters, REALM *<mark style="color:yellow;">explicitly incorporates world knowledge by retrieving relevant documents from a large corpus</mark>* (e.g., Wikipedia) during pre-training, fine-tuning, and inference.

The retriever is trained using an unsupervised masked language modelling objective, where the model learns to retrieve documents that improve its ability to predict masked tokens.

The authors address the computational challenge of backpropagating through a retrieval step over millions of documents by structuring the retriever to enable caching and formulating document selection as a <mark style="color:blue;">**Maximum Inner Product Search (MIPS)**</mark> problem.
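In the paper's notation, with input $$x$$, retrieved document $$z$$ from corpus $$\mathcal{Z}$$, and prediction $$y$$, the retrieve-then-predict model can be summarised as follows. This is a compact restatement rather than a full derivation; the two embedding functions are the paper's learned encoders for inputs and documents.

$$
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x),
\qquad
p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')},
\qquad
f(x, z) = \mathrm{Embed}_{\text{input}}(x)^{\top}\, \mathrm{Embed}_{\text{doc}}(z)
$$

Because the relevance score $$f(x, z)$$ is an inner product, finding the documents with the highest $$p(z \mid x)$$ is the same as finding the document embeddings with the largest inner product against the input embedding, which is what makes MIPS applicable.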

They demonstrate the effectiveness of REALM by fine-tuning the pre-trained models on Open-domain Question Answering (Open-QA), achieving state-of-the-art results on three benchmarks (NaturalQuestions-Open, WebQuestions, and CuratedTrec) and outperforming previous methods by 4-16% absolute accuracy.

{% embed url="<https://arxiv.org/abs/2002.08909>" %}
REALM: Retrieval-Augmented Language Model Pre-Training
{% endembed %}

### <mark style="color:purple;">Analysis</mark>

#### <mark style="color:green;">Explicit knowledge retrieval</mark>

REALM incorporates world knowledge into language models by explicitly retrieving relevant documents during pre-training, fine-tuning, and inference.

This allows for a more interpretable and modular representation of knowledge compared to implicitly storing it in the model parameters. The retrieval step exposes the role of world knowledge in the model's predictions, making it easier to understand and analyse.

#### <mark style="color:green;">Unsupervised pre-training</mark>

The authors show how to pre-train the knowledge retriever in an unsupervised manner using masked language modelling as the learning signal.

By backpropagating through the retrieval step, the model learns to retrieve documents that improve its language modelling performance. This unsupervised pre-training approach enables the model to leverage large-scale textual corpora without the need for labeled data.
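To make the learning signal concrete, here is a minimal NumPy sketch of how a set of candidate documents contributes to the marginal likelihood and to the retriever's gradient. It assumes the paper's formulation, in which the gradient of the log-likelihood with respect to each relevance score `f(x, z)` is weighted by `r(z) = [p(y|z,x) / p(y|x) - 1] * p(z|x)`. The function name, the toy scores, and the toy reader probabilities are illustrative assumptions, not values from the paper.

```python
import numpy as np

def retrieval_training_signal(scores, log_p_y_given_z):
    """Compute p(z|x), p(y|x), and r(z) over a candidate set of documents.

    scores:           relevance scores f(x, z) from the retriever, shape (k,)
    log_p_y_given_z:  log p(y | z, x) from the reader for each document, shape (k,)
    """
    scores = np.asarray(scores, dtype=float)
    log_p_y_given_z = np.asarray(log_p_y_given_z, dtype=float)

    # Softmax over the candidates gives p(z | x); subtract the max for stability.
    p_z = np.exp(scores - scores.max())
    p_z /= p_z.sum()

    p_y_given_z = np.exp(log_p_y_given_z)

    # Marginalise over documents: p(y | x) = sum_z p(y | z, x) * p(z | x).
    p_y = float(np.sum(p_y_given_z * p_z))

    # Weight on grad f(x, z): positive when a document predicts y better
    # than the current mixture does, negative when it does worse.
    r = (p_y_given_z / p_y - 1.0) * p_z
    return p_z, p_y, r

# Toy example (made-up numbers): three candidate documents.
scores = [2.0, 1.0, 0.5]
log_p = np.log([0.9, 0.1, 0.3])
p_z, p_y, r = retrieval_training_signal(scores, log_p)
```

In this toy case the first document makes the masked token more likely than the mixture as a whole, so its weight `r[0]` is positive and a gradient step would raise its relevance score; the other two documents receive negative weights. The weights sum to zero across the candidate set, so the update redistributes probability mass among documents rather than inflating every score.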

#### <mark style="color:green;">Computational efficiency</mark>

To address the computational challenge of considering millions of documents during retrieval, the authors structure the retriever to enable caching and formulate document selection as a MIPS problem.

This allows for efficient retrieval during pre-training and inference, making the approach scalable to large knowledge corpora.
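The idea can be illustrated with an exact, brute-force version of MIPS in NumPy. This is only a sketch: at the scale of millions of documents a system would rely on an approximate, sub-linear search index rather than a full matrix product, and the function name and toy vectors below are assumptions made for illustration.

```python
import numpy as np

def top_k_mips(query_vec, doc_matrix, k):
    """Exact maximum inner product search over pre-computed document embeddings.

    query_vec:  shape (d,)        embedding of the input
    doc_matrix: shape (n_docs, d) cached document embeddings
    Returns (indices, scores) of the k best documents, highest score first.
    """
    # One matrix-vector product scores every document against the query.
    scores = doc_matrix @ query_vec
    k = min(k, scores.shape[0])

    # argpartition selects the k largest scores without a full sort;
    # only those k entries are then sorted.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

# Toy document embeddings: computed once, then reused for every query.
doc_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
    [-1.0, -1.0],
])
query = np.array([1.0, 0.5])

indices, top_scores = top_k_mips(query, doc_embeddings, k=2)
```

The caching benefit comes from `doc_embeddings` being independent of the query: the document side is embedded once and only the query embedding changes per example. During pre-training the document encoder itself keeps changing, so a cached index gradually goes stale; the paper handles this by periodically re-embedding and re-indexing the corpus while training continues.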

#### <mark style="color:green;">Improved Open-QA performance</mark>

REALM achieves state-of-the-art results on three popular Open-QA benchmarks, demonstrating the effectiveness of the retrieval-augmented pre-training approach.

By outperforming previous methods that store knowledge implicitly or use heuristic retrieval mechanisms, REALM showcases the benefits of explicitly incorporating world knowledge through a learned retriever.

#### <mark style="color:green;">Interpretability and modularity</mark>

The authors highlight the qualitative benefits of REALM, including improved interpretability and modularity.

By explicitly exposing the role of retrieved documents in the model's predictions, REALM allows for a more transparent and explainable decision-making process. Additionally, the modular nature of the retriever enables potential extensions and adaptations to different knowledge corpora or retrieval mechanisms.

Overall, REALM represented a significant advancement in language model pre-training by demonstrating the effectiveness of retrieval-augmented methods.

The approach offered a promising direction for incorporating large-scale world knowledge into NLP models while maintaining interpretability and modularity. Its strong performance on Open-QA tasks suggested that REALM could be applied to other knowledge-intensive NLP problems, opening up avenues for future research.

