Active Retrieval Augmented Generation

FLARE!

This widely cited 2023 paper introduces a new approach called Forward-Looking Active Retrieval augmented generation (FLARE) to improve the performance of language models (LMs) in long-form knowledge-intensive generation tasks.

The main idea is to actively decide when and what to retrieve from external knowledge resources throughout the generation process, rather than just retrieving information once based on the initial input.

Key points

Motivation

LMs tend to hallucinate and generate factually inaccurate output, especially in long-form generation tasks.
Single-time retrieval augmented LMs, which retrieve information only once based on the input, are not sufficient for long-form generation tasks that require continually gathering information throughout the generation process.

Active Retrieval Augmented Generation

The authors propose a generalised framework called active retrieval augmented generation.
The framework actively decides when and what to retrieve across the course of the generation, interleaving retrieval and generation steps.
The hypothesis is that LMs should retrieve information only when they lack the required knowledge, which avoids unnecessary or inappropriate retrieval.

Forward-Looking Active REtrieval augmented generation (FLARE)

FLARE is a specific implementation of the active retrieval augmented generation framework.
It iteratively generates a temporary next sentence and checks whether it contains low-probability tokens, indicating a lack of knowledge.
If low-probability tokens are detected, FLARE retrieves relevant documents using the temporary next sentence as a query and regenerates the sentence conditioned on the retrieved documents.
The process continues until the end of the generation is reached.
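
The low-probability check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the `(token, probability)` input format, and the threshold value of 0.4 are all assumptions made for the example. Many APIs return log-probabilities rather than probabilities, so a small conversion helper is included.

```python
import math
from typing import List, Tuple

TokenProbs = List[Tuple[str, float]]  # (token, probability) pairs; assumed format


def logprobs_to_probs(logprobs: List[float]) -> List[float]:
    """Convert natural-log probabilities into plain probabilities."""
    return [math.exp(lp) for lp in logprobs]


def low_confidence_tokens(token_probs: TokenProbs, theta: float) -> List[str]:
    """Return the tokens whose probability falls below the threshold theta."""
    return [tok for tok, p in token_probs if p < theta]


def should_retrieve(token_probs: TokenProbs, theta: float = 0.4) -> bool:
    """Trigger retrieval if at least one token is below theta (0.4 is illustrative)."""
    return len(low_confidence_tokens(token_probs, theta)) > 0


# Hypothetical temporary sentence with made-up probabilities.
sentence = [("Joe", 0.9), ("Biden", 0.95), ("attended", 0.8), ("the", 0.99),
            ("University", 0.6), ("of", 0.97), ("Pennsylvania", 0.2)]
print(should_retrieve(sentence))              # True
print(low_confidence_tokens(sentence, 0.4))   # ['Pennsylvania']
```

A single uncertain token is enough to trigger retrieval here, which matches the "any low-probability token" rule in the description; a stricter or looser rule would only change `should_retrieve`.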

Evaluation

FLARE is evaluated on four tasks/datasets involving long-form generation: multihop QA, commonsense reasoning, long-form QA, and open-domain summarization.
FLARE is applied to the GPT-3.5 (text-davinci-003) model at inference time without additional training.
The results show that FLARE achieves superior or competitive performance compared to single-time and multi-time retrieval baselines across all tasks, demonstrating its effectiveness and generalisability.

In summary, FLARE addresses the limitations of single-time retrieval augmented LMs in long-form generation tasks by actively deciding when and what to retrieve throughout the generation process.

Process to create FLARE

FLARE (Forward-Looking Active REtrieval augmented generation) can be broken down into the following steps:

1. Start with a base language model (e.g., GPT-3.5, text-davinci-003).
2. Iteratively generate the next sentence without conditioning on retrieved documents.
3. Assess the confidence of the generated sentence by checking the probabilities of its tokens.
4. If any token has a probability lower than a threshold θ, trigger retrieval using the generated sentence as a query. Otherwise, accept the generated sentence without retrieval.
5. To formulate the query, either mask out low-confidence tokens (implicit query) or generate questions targeting the low-confidence spans (explicit query).
6. Retrieve relevant documents using the formulated query and an off-the-shelf retriever (e.g., BM25 for Wikipedia, Bing for open web).
7. Prepend the retrieved documents to the user input and regenerate the next sentence.
8. Repeat steps 2-7 until the end of the generation is reached.
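
The loop above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: `flare_generate`, the callable signatures, the `max_sentences` cap, and the default θ of 0.4 are hypothetical, and the temporary sentence itself is used as the query (the simplest variant). The language model and retriever are passed in as plain functions so the control flow can be run with toy stand-ins.

```python
from typing import Callable, List, Tuple

TokenProbs = List[Tuple[str, float]]
# Assumed interface: (user_input, answer_so_far, docs) -> (sentence, token_probs).
# An empty sentence signals that generation has finished.
Generator = Callable[[str, str, List[str]], Tuple[str, TokenProbs]]
Retriever = Callable[[str], List[str]]


def flare_generate(user_input: str, generate_sentence: Generator,
                   retrieve: Retriever, theta: float = 0.4,
                   max_sentences: int = 20) -> str:
    answer: List[str] = []
    for _ in range(max_sentences):
        so_far = " ".join(answer)
        # Generate a temporary next sentence without retrieved documents.
        sentence, probs = generate_sentence(user_input, so_far, [])
        if not sentence:
            break
        # Retrieve only if some token falls below the confidence threshold.
        if any(p < theta for _, p in probs):
            docs = retrieve(sentence)  # temporary sentence used as the query
            # Regenerate the sentence conditioned on the retrieved documents.
            sentence, _ = generate_sentence(user_input, so_far, docs)
            if not sentence:
                break
        answer.append(sentence)
    return " ".join(answer)


# Toy stand-ins (hypothetical) that make the control flow observable.
def toy_lm(user_input: str, so_far: str, docs: List[str]):
    if so_far:                      # one-sentence answer, then stop
        return "", []
    if docs:                        # confident once evidence is available
        return "Paris is the capital of France.", [("Paris", 0.95)]
    return "Lyon is the capital of France.", [("Lyon", 0.2), ("is", 0.9)]


def toy_retriever(query: str) -> List[str]:
    return ["France's capital city is Paris."]


print(flare_generate("What is the capital of France?", toy_lm, toy_retriever))
# Paris is the capital of France.
```

In a real system `generate_sentence` would wrap an LM call that returns token probabilities and would place `docs` ahead of the user input in the prompt, and `retrieve` would wrap BM25 or a search API.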

To emulate FLARE

Choose a base language model that provides access to token probabilities (e.g., GPT-3.5 API).
Use a sentence tokenizer to cut the model's output at sentence boundaries, so that generation proceeds one sentence at a time.
Set a confidence threshold θ to trigger retrieval based on token probabilities.
Implement query formulation methods:
- Implicit query: mask out tokens with probabilities below a threshold β.
- Explicit query: extract low-confidence spans and generate questions using a separate LM (e.g., gpt-3.5-turbo) with a predefined prompt.
Integrate an off-the-shelf retriever (e.g., BM25 for Wikipedia, Bing for open web) to retrieve documents based on the formulated queries.
Format the retrieved documents and prepend them to the user input for regeneration.
Iterate the process until the model reaches the end of the generation or a maximum output length is hit.
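
The query formulation and document formatting steps can be sketched as below. Everything here is illustrative: the function names, the β value of 0.4, the question-generation prompt wording, and the "Search results" layout are assumptions for the example, not the prompts or formats used in the paper. The explicit-query helper only builds the prompt text; sending it to a question-generating LM is left out.

```python
from typing import List, Tuple

TokenProbs = List[Tuple[str, float]]  # (token, probability) pairs; assumed format


def implicit_query(token_probs: TokenProbs, beta: float = 0.4) -> str:
    """Implicit query: drop tokens whose probability is below beta."""
    return " ".join(tok for tok, p in token_probs if p >= beta)


def low_confidence_spans(token_probs: TokenProbs, beta: float = 0.4) -> List[str]:
    """Group consecutive low-probability tokens into spans."""
    spans: List[str] = []
    current: List[str] = []
    for tok, p in token_probs:
        if p < beta:
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def question_prompt(sentence: str, span: str) -> str:
    """Explicit query: hypothetical prompt asking an LM for a question about the span."""
    return (f"Sentence: {sentence}\n"
            f'Write a question whose answer is "{span}".\n'
            "Question:")


def prepend_documents(docs: List[str], user_input: str) -> str:
    """Number the retrieved documents and place them before the user input."""
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs, start=1))
    return f"Search results:\n{numbered}\n\n{user_input}"


# Hypothetical temporary sentence with made-up probabilities.
tokens = [("Joe", 0.9), ("Biden", 0.9), ("attended", 0.8), ("the", 0.95),
          ("University", 0.3), ("of", 0.35), ("Pennsylvania", 0.2),
          ("and", 0.9), ("earned", 0.7), ("a", 0.9), ("law", 0.3), ("degree.", 0.35)]
print(implicit_query(tokens))        # Joe Biden attended the and earned a
print(low_confidence_spans(tokens))  # ['University of Pennsylvania', 'law degree.']
```

The implicit query keeps only the confident context around the uncertain tokens, while the explicit route yields one question per low-confidence span, which means one retrieval call per span.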

Three practical applications of FLARE

Automated Content Creation

Develop an AI-powered content creation tool that generates long-form articles, blog posts, or product descriptions using FLARE.
FLARE actively retrieves relevant information from a knowledge base or the open web to enhance the factual accuracy and richness of the generated content.
The tool can assist content creators, marketers, and writers in producing high-quality, informative content more efficiently.

Intelligent Tutoring System

Create an AI-powered tutoring system that generates detailed explanations and answers to student questions using FLARE.
FLARE enables the system to actively retrieve relevant information from educational resources, such as textbooks, research papers, or online courses, to provide comprehensive and accurate responses.
The tutoring system can adapt to students' needs, offering personalized guidance and support in various subjects.

Virtual Customer Support Agent

Develop an AI-powered customer support agent that handles complex customer inquiries using FLARE.
FLARE allows the agent to actively retrieve information from a company's knowledge base, product manuals, or online resources to provide accurate and detailed responses to customer questions.
The virtual agent can assist customers 24/7, reducing response times and improving customer satisfaction while alleviating the workload of human support staff.

These applications demonstrate how FLARE can be used to enhance the quality and efficiency of AI-powered systems in content generation, education, and customer support.

By actively retrieving relevant information throughout the generation process, FLARE enables these systems to produce more accurate, informative, and context-aware outputs.
