REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang
This seminal February 2020 paper on Retrieval-Augmented Language Model Pre-Training (REALM) has been cited nearly 1,500 times.
The authors proposed REALM, a framework that augments language model pre-training with a learned textual knowledge retriever.
Unlike traditional language models that store knowledge implicitly in their parameters, REALM explicitly incorporates world knowledge by retrieving relevant documents from a large corpus (e.g., Wikipedia) during pre-training, fine-tuning, and inference.
The retriever is trained using an unsupervised masked language modelling objective, where the model learns to retrieve documents that improve its ability to predict masked tokens.
The authors address the computational challenge of backpropagating through a retrieval step over millions of documents by structuring the retriever to enable caching and formulating document selection as a Maximum Inner Product Search (MIPS) problem.
They demonstrate the effectiveness of REALM by fine-tuning the pre-trained models on Open-domain Question Answering (Open-QA) and achieve state-of-the-art results on three popular benchmarks (NaturalQuestions-Open, WebQuestions, and CuratedTrec), outperforming previous methods by a significant margin of 4-16% absolute accuracy.
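At the core of the method is a simple factorisation, following the paper's notation: the model marginalises its prediction y for an input x over documents z drawn from the knowledge corpus Z, and scores each document by an inner product between learned embeddings:

$$
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x),
\qquad
p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')},
\qquad
f(x, z) = \mathrm{Embed}_{\mathrm{input}}(x)^{\top} \mathrm{Embed}_{\mathrm{doc}}(z)
$$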
REALM introduces an approach to incorporating world knowledge into language models by explicitly retrieving relevant documents during pre-training and inference.
This allows for a more interpretable and modular representation of knowledge compared to implicitly storing it in the model parameters. The retrieval step exposes the role of world knowledge in the model's predictions, making it easier to understand and analyse.
The authors show how to pre-train the knowledge retriever in an unsupervised manner using masked language modelling as the learning signal.
By backpropagating through the retrieval step, the model learns to retrieve documents that improve its language modelling performance. This unsupervised pre-training approach enables the model to leverage large-scale textual corpora without the need for labeled data.
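The sketch below illustrates this marginalisation in PyTorch. It is a minimal toy version: random tensors stand in for the BERT-style retriever and reader encoders used in the actual paper, and only the top-k retrieved candidates are scored.

```python
import torch
import torch.nn.functional as F

d_model, k = 8, 4  # toy embedding size and number of retrieved candidates

# Stand-ins for Embed_input(x) and Embed_doc(z); in REALM both come from
# BERT-style Transformer encoders.
query_emb = torch.randn(d_model, requires_grad=True)
doc_embs = torch.randn(k, d_model, requires_grad=True)

# p(z | x): softmax over inner-product relevance scores f(x, z).
scores = doc_embs @ query_emb
log_p_doc = F.log_softmax(scores, dim=0)

# log p(y | z, x): per-document masked-token log-likelihoods from the reader.
# Stand-in values here; in practice the knowledge-augmented encoder produces them.
log_p_y_given_doc = -torch.rand(k)

# Marginal log-likelihood: log p(y | x) = log sum_z p(z | x) p(y | z, x).
log_p_y = torch.logsumexp(log_p_doc + log_p_y_given_doc, dim=0)

# Minimising the negative log-likelihood sends gradient into both encoders,
# rewarding documents that raise the masked-token likelihood.
(-log_p_y).backward()
```

Because p(z | x) is a differentiable function of both embeddings, the retriever receives a useful training signal from nothing more than the masked language modelling loss.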
To address the computational challenge of considering millions of documents during retrieval, the authors structure the retriever so that document embeddings can be precomputed and cached, and cast document selection as a MIPS problem over that cached index.
This allows for efficient retrieval during pre-training and inference, making the approach scalable to large knowledge corpora.
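A rough sketch of the cached-index idea, using exact brute-force MIPS in NumPy. The corpus size, dimensions, and the retrieve_top_k helper are illustrative; a real deployment would substitute an approximate MIPS library and, as in the paper, refresh the cached document embeddings asynchronously during pre-training.

```python
import numpy as np

n_docs, d_model, k = 100_000, 128, 5  # toy corpus; REALM indexes millions of passages

rng = np.random.default_rng(0)

# Precomputed (cached) document embeddings Embed_doc(z) for the whole corpus.
doc_index = rng.standard_normal((n_docs, d_model)).astype(np.float32)

def retrieve_top_k(query_emb: np.ndarray, k: int) -> np.ndarray:
    """Exact MIPS by brute force: return indices of the k documents whose
    embeddings have the largest inner product with the query embedding."""
    scores = doc_index @ query_emb
    top = np.argpartition(-scores, k)[:k]  # unordered top-k in O(n_docs)
    return top[np.argsort(-scores[top])]   # sort the k winners by score

query = rng.standard_normal(d_model).astype(np.float32)
top_docs = retrieve_top_k(query, k)
```

Because the document side of f(x, z) stays fixed between index refreshes, only the query embedding changes at each step, which is what makes the search cacheable and scalable.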
REALM achieves state-of-the-art results on three popular Open-QA benchmarks, demonstrating the effectiveness of the retrieval-augmented pre-training approach.
By outperforming previous methods that store knowledge implicitly or use heuristic retrieval mechanisms, REALM showcases the benefits of explicitly incorporating world knowledge through a learned retriever.
The authors highlight the qualitative benefits of REALM, including improved interpretability and modularity.
By explicitly exposing the role of retrieved documents in the model's predictions, REALM allows for a more transparent and explainable decision-making process. Additionally, the modular nature of the retriever enables potential extensions and adaptations to different knowledge corpora or retrieval mechanisms.
Overall, REALM represented a significant advancement in language model pre-training by demonstrating the effectiveness of retrieval-augmented methods.
The approach offered a promising direction for incorporating large-scale world knowledge into NLP models while maintaining interpretability and modularity, and its strong performance on Open-QA tasks suggested that retrieval-augmented pre-training could extend to other knowledge-intensive NLP problems, opening up avenues for future research.