Retrieval-Augmented Generation for Large Language Models: A Survey

This December 2023 paper investigates the concept of Retrieval-Augmented Generation (RAG).

It addresses the critical challenges that LLMs face, such as the generation of incorrect information, reliance on outdated knowledge, and opaque reasoning processes.

The paper highlights the potential of RAG to enhance the accuracy and credibility of LLMs, particularly in knowledge-intensive tasks.

RAG allows for continuous knowledge updates and the integration of domain-specific information, effectively merging the intrinsic knowledge of LLMs with the vast, dynamic repositories of external databases.

The authors present a detailed examination of the progression of RAG paradigms, categorising them into three main stages:

  1. Naïve RAG

  2. Advanced RAG

  3. Modular RAG.

They also scrutinise the tripartite foundation of RAG frameworks, which consists of retrieval, generation, and augmentation techniques.

The Evolution and Mechanisms of RAG

Retrieval-Augmented Generation (RAG) is a process in which a generative AI model augments its output with information retrieved from external data sources.

This process begins with the model querying an external source to obtain relevant information before generating an output.

This methodology not only informs the subsequent generation phase but also grounds responses in evidence, markedly improving output accuracy and relevance.

The dynamic nature of RAG allows for continuous updates from knowledge bases, addressing the issue of hallucinations and making large language models more applicable for real-world use.

A representative instance of the RAG process applied to question answering. It mainly consists of three steps. 1) Indexing: documents are split into chunks, encoded into vectors, and stored in a vector database. 2) Retrieval: the top-k chunks most relevant to the question are retrieved based on semantic similarity. 3) Generation: the original question and the retrieved chunks are fed together into the LLM to generate the final answer.

Core Components of RAG

The RAG framework is built on a tripartite foundation: retrieval, generation, and augmentation techniques.

Each component plays a critical role in the functionality of RAG systems:

  • Retrieval: The initial step involves querying external databases to fetch relevant information, which forms the basis for the generation process.

  • Generation: Leveraging the retrieved data, the LLM generates responses that are not only accurate but also relevant to the query.

  • Augmentation: This phase enhances the generative process by integrating domain-specific information and allowing for continuous knowledge updates.

Retrieval-Augmented Generation (RAG) paradigms

Naïve RAG

Naïve RAG is the earliest methodology that gained prominence shortly after the widespread adoption of ChatGPT.

It follows a traditional process that includes indexing, retrieval, and generation, also known as the "Retrieve-Read" framework.

Indexing: Raw data in various formats (PDF, HTML, Word, Markdown) is cleaned, extracted, and converted into plain text. The text is then segmented into smaller chunks, encoded into vector representations using an embedding model, and stored in a vector database.

Retrieval: When a user query is received, it is encoded into a vector representation using the same encoding model. Similarity scores between the query vector and the vectors of the indexed chunks are computed, and the top K most similar chunks are retrieved.

Generation: The retrieved chunks are integrated with the user query into a prompt, and a large language model generates a response based on the provided context.

However, Naïve RAG faces several challenges, such as retrieval precision and recall issues, generation difficulties (hallucination, irrelevance, toxicity, bias), and augmentation hurdles (disjointed outputs, redundancy, determining relevance).
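The indexing, retrieval, and generation steps described above can be sketched in a few lines. This is a minimal illustration, not the survey's implementation: the bag-of-words `embed` function stands in for a real embedding model, the list returned by `build_index` stands in for a vector database, and the LLM call itself is left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Indexing: encode every chunk once and keep (text, vector) pairs.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    # Retrieval: encode the query with the same function, then rank by similarity.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, contexts: list[str]) -> str:
    # Generation input: retrieved chunks plus the question (the LLM call is omitted).
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n\nContext:\n{context_block}\n\nQuestion: {query}"

chunks = [
    "RAG retrieves external documents before generation.",
    "Fine-tuning updates model weights on new data.",
    "Vector databases store chunk embeddings for similarity search.",
]
index = build_index(chunks)
top = retrieve(index, "How does RAG use external documents?", k=1)
prompt = build_prompt("How does RAG use external documents?", top)
```

Because the toy embedding only matches exact words, it also illustrates the precision and recall problems mentioned above: a query phrased with synonyms would miss the relevant chunk.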

Advanced RAG

Advanced RAG introduces specific improvements to overcome the limitations of Naïve RAG, focusing on enhancing retrieval quality through pre-retrieval and post-retrieval strategies.

Pre-retrieval process: This stage optimises the indexing structure and the original query. Indexing optimisation involves enhancing data granularity, optimising index structures, adding metadata, alignment optimisation, and mixed retrieval. Query optimisation makes the user's question clearer and more suitable for retrieval through query rewriting, transformation, and expansion.

Post-retrieval process: After retrieving relevant context, Advanced RAG focuses on effectively integrating it with the query. The main methods include re-ranking chunks to prioritise the most relevant content and context compression to select essential information and shorten the context to be processed.

Modular RAG

Modular RAG offers enhanced adaptability and versatility by incorporating diverse strategies for improving its components and introducing new modules and patterns.

New Modules: Modular RAG introduces specialised components such as the Search module (adapts to specific scenarios), Memory module (guides retrieval using LLM's memory), Routing module (navigates through diverse data sources), Predict module (generates context directly through LLM), and Task Adapter module (tailors RAG to various downstream tasks).

New Patterns: Modular RAG allows for module substitution or reconfiguration to address specific challenges. Innovations include the Rewrite-Retrieve-Read model (refines retrieval queries), Generate-Read (replaces traditional retrieval with LLM-generated content), Recite-Read (emphasises retrieval from model weights), hybrid retrieval strategies (integrate keyword, semantic, and vector searches), and iterative retrieval flows (Retrieve-Read-Retrieve-Read).

Modular RAG also showcases the benefits of adaptive retrieval through techniques like FLARE and Self-RAG, which evaluate the necessity of retrieval based on different scenarios. The flexible architecture allows for easier integration with other technologies, such as fine-tuning the retriever or generator for better results or engaging in collaborative fine-tuning.

Comparison between the three paradigms of RAG. (Left) Naïve RAG mainly consists of three parts: indexing, retrieval and generation. (Middle) Advanced RAG proposes multiple optimisation strategies around pre-retrieval and post-retrieval, with a process similar to the Naïve RAG, still following a chain-like structure. (Right) Modular RAG inherits and develops from the previous paradigm, showcasing greater flexibility overall. This is evident in the introduction of multiple specific functional modules and the replacement of existing modules. The overall process is not limited to sequential retrieval and generation; it includes methods such as iterative and adaptive retrieval.

RAG versus Fine Tuning

The relationship between Retrieval-Augmented Generation (RAG) and fine-tuning is complementary, and they can be used together to optimise the performance of Large Language Models (LLMs) in different ways.

RAG and fine-tuning have distinct characteristics

RAG provides the model with external knowledge for information retrieval, similar to giving a model a tailored textbook. It excels in dynamic environments by offering real-time knowledge updates and effective use of external knowledge sources with high interpretability. However, it comes with higher latency and ethical considerations regarding data retrieval.

Fine-tuning allows the model to internalise knowledge over time, similar to a student learning and adapting. It is suitable for scenarios requiring replication of specific structures, styles, or formats.

Fine-tuning is more static, requiring retraining for updates but enabling deep customisation of the model's behaviour and style.

Using RAG and fine-tuning together

Complementary optimisation: RAG and fine-tuning can enhance a model's capabilities at different levels. RAG focuses on providing external knowledge, while fine-tuning allows for customisation of the model's behaviour and style.

Improved performance: In evaluations of knowledge-intensive tasks across different topics, RAG consistently outperforms unsupervised fine-tuning, both for existing knowledge encountered during training and entirely new knowledge. However, combining RAG and fine-tuning may lead to optimal performance in some cases.

Iterative process: The optimisation process involving RAG and fine-tuning may require multiple iterations to achieve satisfactory results. Fine-tuning the model can help it better use the information provided by RAG, while RAG can continuously supply updated and relevant external knowledge to the fine-tuned model.

The choice between RAG and fine-tuning, or using them together, depends on the specific needs of the application, such as:

  • Data dynamics: If the application requires real-time knowledge updates, RAG may be more suitable.

  • Customisation: If the application demands deep customisation of the model's behaviour and style, fine-tuning may be necessary.

  • Computational capabilities: Fine-tuning requires significant computational resources, while RAG may have higher latency during inference.

In summary, RAG and fine-tuning are not mutually exclusive and can be used together to optimise LLMs' performance.

The decision to use either or both techniques should be based on the specific requirements of the application, considering factors such as data dynamics, customisation needs, and computational constraints.

Having discussed the evolution and core components of RAG, let's now move to the retrieval process.

Retrieval

The retrieval process plays an important role in efficiently retrieving relevant documents from the data source to enhance the performance of Large Language Models (LLMs).

The retrieval process involves several key aspects, including the retrieval source, retrieval granularity, pre-processing of the retrieval, and selection of the corresponding embedding model.

Retrieval Source

The type of retrieval source and the granularity of retrieval units both affect the final generation results. The main types of retrieval sources are:

  1. Unstructured Data: Text is the most widely used retrieval source, gathered from corpora such as Wikipedia dumps and domain-specific data (for example, from the medical and legal domains).

  2. Semi-structured Data: Data that contains a combination of text and table information, such as PDF. Handling semi-structured data poses challenges due to potential data corruption during splitting processes and the complexity of incorporating tables into semantic similarity searches.

  3. Structured Data: Knowledge graphs (KGs) are typically verified and can provide more precise information. However, building, validating, and maintaining structured databases requires additional effort.

  4. LLMs-Generated Content: Some research focuses on exploiting LLMs' internal knowledge by using LLM-generated contexts for retrieval, aiming to improve model performance and task effectiveness.

Retrieval Granularity

The granularity of the retrieved data is another important factor. In text, retrieval granularity ranges from fine to coarse, including Token, Phrase, Sentence, Proposition, Chunks, and Document.

On Knowledge Graphs (KG), retrieval granularity includes Entity, Triplet, and sub-Graph.

Choosing the appropriate retrieval granularity during inference can be a simple and effective strategy to improve retrieval and downstream task performance.

Pre-processing and Embedding

Before retrieval, the data source undergoes pre-processing steps such as cleaning, extraction, and conversion into a uniform format. The text is then segmented into smaller chunks and encoded into vector representations using an embedding model. These vector representations are stored in a vector database to enable efficient similarity searches during retrieval.

Retrieval Process

When a user query is received, it is encoded into a vector representation using the same embedding model used for the data source.

The retrieval process then computes similarity scores between the query vector and the vectors of the indexed chunks.

The top K most similar chunks are retrieved based on these similarity scores. The retrieved chunks are then integrated with the user query into a prompt for further processing by the LLM.

The retrieval process in RAG can be performed in different ways, such as:

  • Iterative Retrieval: Multiple retrieval rounds are performed to refine the retrieved content.

  • Adaptive Retrieval: The necessity of retrieval is evaluated based on different scenarios.

  • Recursive Retrieval: The output of one retrieval round is used as input for the next round.

  • Multi-time Retrieval: Retrieval is performed multiple times from different sources or with different objectives.

In summary, the retrieval process in RAG works by efficiently searching and retrieving relevant information from various data sources based on the user query.

It involves pre-processing the data, encoding it into vector representations, and computing similarity scores to identify the most relevant chunks. The retrieved content is then integrated with the user query to enhance the performance of LLMs in generating accurate and informative responses.

Indexing optimisation

Indexing optimisation plays a crucial role in the Retrieval-Augmented Generation (RAG) process by improving the quality and efficiency of the retrieval phase.

The goal is to create an index that allows for the retrieval of the most relevant context while minimising noise and processing time.

Here's a summary of indexing optimisation techniques and best practices:

Chunking Strategy

  • Documents are typically split into chunks based on a fixed number of tokens (e.g., 100, 256, 512).

  • Larger chunks capture more context but also generate more noise and require longer processing time.

  • Smaller chunks have less noise but may not fully convey the necessary context.

  • Optimisation techniques include recursive splits and sliding window methods, enabling layered retrieval by merging globally related information across multiple retrieval processes.

  • The Small2Big method uses sentences as the retrieval unit and provides the preceding and following sentences as context to LLMs, striking a balance between semantic completeness and context length.

Best practices are to choose an appropriate chunk size based on the specific task and the trade-off between context and noise. Developers should consider using techniques like recursive splits, sliding windows, or Small2Big to optimise the chunking process and improve retrieval quality.
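The sliding window and Small2Big ideas above can be sketched as follows. This is a simplified illustration under stated assumptions: "tokens" are whitespace-separated words rather than model tokens, and the function names `sliding_window_chunks` and `sentence_window` are hypothetical.

```python
def sliding_window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Fixed-size chunks where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

def sentence_window(sentences: list[str], hit: int, window: int = 1) -> str:
    """Small2Big-style expansion: return the matched sentence plus its neighbours."""
    lo = max(0, hit - window)
    hi = min(len(sentences), hit + window + 1)
    return " ".join(sentences[lo:hi])

tokens = "a b c d e f g h i j".split()
chunks = sliding_window_chunks(tokens, size=4, overlap=2)

sentences = ["S0.", "S1.", "S2.", "S3."]
expanded = sentence_window(sentences, hit=2)  # sentence 2 matched the query
```

The overlap trades index size for continuity: a larger overlap stores more duplicated text but makes it less likely that a relevant passage is cut in half at a chunk boundary.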

Metadata Attachments

  • Chunks can be enriched with metadata information such as page number, file name, author, category, and timestamp.

  • Retrieval can be filtered based on metadata, limiting the scope of the search.

  • Time-aware RAG can be achieved by assigning different weights to document timestamps during retrieval, ensuring the freshness of knowledge and avoiding outdated information.

  • Metadata can also be artificially constructed, such as adding summaries of paragraphs or introducing hypothetical questions (Reverse HyDE).
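Metadata filtering and time-aware weighting can be combined in a single ranking step. The sketch below assumes each chunk already carries a similarity score from an upstream vector search. The exponential half-life decay is one possible weighting scheme chosen for illustration; the text above only states that timestamps receive different weights. The `Chunk` class and `time_aware_rank` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    category: str      # metadata used for filtering
    published: date    # metadata used for recency weighting
    similarity: float  # assumed to come from an upstream vector search

def time_aware_rank(chunks: list[Chunk], category: str, today: date,
                    half_life_days: float = 365.0) -> list[Chunk]:
    """Filter by metadata, then down-weight older chunks with exponential decay."""
    scored = []
    for c in chunks:
        if c.category != category:
            continue  # metadata filter narrows the search scope
        age_days = (today - c.published).days
        recency = 0.5 ** (age_days / half_life_days)  # halves every `half_life_days`
        scored.append((c.similarity * recency, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]

chunks = [
    Chunk("old policy", "hr", date(2020, 1, 1), 0.90),
    Chunk("new policy", "hr", date(2024, 1, 1), 0.80),
    Chunk("pricing", "sales", date(2024, 1, 1), 0.95),
]
ranked = time_aware_rank(chunks, category="hr", today=date(2024, 6, 1))
```

Here the older chunk has the higher raw similarity but ranks below the newer one once recency is applied, and the off-category chunk is excluded entirely.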

Structural Index

Establishing a hierarchical structure for documents can enhance information retrieval.

Hierarchical index structure: Files are arranged in parent-child relationships, with chunks linked to them. Data summaries are stored at each node, aiding in the swift traversal of data and assisting the RAG system in determining which chunks to extract.

Knowledge Graph (KG) index: Using KGs in constructing the hierarchical structure of documents contributes to maintaining consistency. It delineates the connections between different concepts and entities, reducing the potential for hallucinations.

KGP (Knowledge Graph Paragraphs) method: Builds an index between multiple documents using KG, consisting of nodes (representing paragraphs or structures) and edges (indicating semantic/lexical similarity or relationships within the document structure).

By implementing these indexing optimisation techniques and following best practices, RAG systems can improve the quality and efficiency of the retrieval process, ultimately enhancing the overall performance of the system in generating accurate and contextually coherent responses.

Query optimisation and transformation

Query optimisation and transformation play a role in improving the retrieval effectiveness and generating more relevant answers in Retrieval-Augmented Generation (RAG) systems.

The main goal is to address the challenges associated with imprecise, complex, or ambiguous user queries.

Query Expansion

  • Expands a single query into multiple queries to enrich the content and provide further context.

  • Addresses the lack of specific nuances in the original query and ensures optimal relevance of the generated answers.

  • Techniques include using prompt engineering to expand queries via LLMs, which are then executed in parallel. Sub-query techniques make the model generate the sub-questions needed to contextualise and, when combined, fully answer the original question.
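A multi-query flow of this kind can be sketched as below. The `expand` and `search` callables are placeholders: in a real system they would wrap an LLM prompt and a vector-store lookup, while here they are hypothetical stand-ins with hard-coded results. Merging by first occurrence is a simplifying assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def multi_query_retrieve(
    query: str,
    expand: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Run the original and expanded queries in parallel, then merge without duplicates."""
    queries = [query] + expand(query)
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(search, queries))  # results follow the order of `queries`
    merged, seen = [], set()
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

def fake_expand(q: str) -> list[str]:
    # Hypothetical stand-in for an LLM that proposes related queries.
    return [q + " cushioning", q + " durability"]

def fake_search(q: str) -> list[str]:
    # Hypothetical stand-in for a vector-store lookup returning document IDs.
    table = {
        "running shoes": ["d1", "d2"],
        "running shoes cushioning": ["d2", "d3"],
        "running shoes durability": ["d4"],
    }
    return table.get(q, [])

docs = multi_query_retrieve("running shoes", fake_expand, fake_search)
```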

Query Transformation

Query transformation retrieves chunks based on a transformed query instead of the user's original query. Some of the techniques include:

  • Query Rewrite: Uses LLM or specialised smaller language models to rewrite the original queries, making them more suitable for retrieval.

  • HyDE (Hypothetical Document Embeddings): Constructs hypothetical documents (assumed answers to the original query) and focuses on embedding similarity from answer to answer.

  • Step-back Prompting: Abstracts the original query to generate a high-level concept question (step-back question), which is used alongside the original query for retrieval.
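The HyDE idea can be illustrated with a toy example: instead of matching the question against the chunks, the system matches a hypothetical answer against them. In this sketch, word-set Jaccard overlap stands in for embedding similarity and `fake_llm` stands in for an LLM call; both are assumptions made to keep the example self-contained.

```python
import re
from typing import Callable

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hyde_retrieve(query: str, chunks: list[str], write_answer: Callable[[str], str]) -> str:
    """Match chunks against a hypothetical answer rather than the raw question."""
    hypothetical = write_answer(query)  # in practice, an LLM call
    probe = tokens(hypothetical)
    return max(chunks, key=lambda c: jaccard(probe, tokens(c)))

chunks = [
    "Laptop warranty terms and return policy.",
    "Close background programs and free disk space to improve speed.",
]

def fake_llm(question: str) -> str:
    # Hypothetical stand-in for an LLM-written answer to the question.
    return "Free disk space and close background programs."

query = "Why is my laptop slow?"
best = hyde_retrieve(query, chunks, fake_llm)
# For comparison: matching on the raw question picks the chunk that shares the word "laptop".
naive_best = max(chunks, key=lambda c: jaccard(tokens(query), tokens(c)))
```

The raw question shares vocabulary with the warranty chunk, while the hypothetical answer shares vocabulary with the troubleshooting chunk, which is the answer-to-answer matching that HyDE relies on.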

Query Routing

Routing queries to distinct RAG pipelines based on their characteristics makes the system more versatile and adaptable to diverse scenarios. Techniques include:

Metadata Router/Filter: Extracts keywords (entities) from the query and filters chunks based on the keywords and metadata to narrow down the search scope.

Semantic Router: Leverages the semantic information of the query to route it to the most appropriate RAG pipeline.

Hybrid Routing: Combines both semantic and metadata-based methods for enhanced query routing.
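A hybrid router along these lines might look like the following sketch. The pipeline names and keyword sets are invented for illustration, and simple keyword overlap stands in for a semantic router, which would normally compare embeddings.

```python
import re
from typing import Optional

# Hypothetical pipelines, each described by a small keyword set.
PIPELINES = {
    "legal": {"contract", "patent", "copyright", "court"},
    "support": {"laptop", "install", "error", "slow"},
}

def route(query: str, metadata: Optional[dict] = None, default: str = "general") -> str:
    """Hybrid routing: an explicit metadata tag wins, otherwise fall back to keyword overlap."""
    if metadata and metadata.get("domain") in PIPELINES:
        return metadata["domain"]
    words = set(re.findall(r"[a-z]+", query.lower()))
    best = max(PIPELINES, key=lambda name: len(words & PIPELINES[name]))
    # If no keyword matches at all, send the query to the default pipeline.
    return best if words & PIPELINES[best] else default
```

For example, `route("My laptop is slow")` selects the `support` pipeline, while a query with no matching keywords falls through to `general` unless the caller supplies a `domain` tag.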

Examples of applied query optimisation and transformation techniques

Scenario 1: E-commerce Product Search

Imagine an e-commerce platform that uses a RAG system to help users find products based on their queries. A user enters a query: "lightweight running shoes for marathons"

Query Expansion

Multi-Query: The system generates multiple queries to capture different aspects of the original query:

1. "lightweight running shoes for long-distance running"
2. "running shoes with good cushioning for marathons"
3. "durable running shoes for marathon training"

Sub-Query: The system breaks down the original query into sub-questions:

1. "What are the best lightweight running shoes?"
2. "What running shoes are recommended for marathons?"
3. "What are the key features of running shoes for long-distance running?"

Query Transformation

Query Rewrite: The system rewrites the original query to make it more suitable for retrieval:

Original query: "lightweight running shoes for marathons"
Rewritten query: "best running shoes for marathon racing with lightweight design"

Step-back Prompting: The system generates a high-level concept question:

Step-back question: "What are the ideal running shoes for marathon runners?"

Scenario 2: Legal Research Platform

Consider a legal research platform that uses a RAG system to help users find relevant legal documents based on their queries.

A user enters a query: "case law related to intellectual property disputes in the software industry"

Query Expansion

Chain-of-Verification (CoVe): The system generates expanded queries and validates them using an LLM:

1. "court decisions on software patent infringement cases"
2. "legal precedents for copyright disputes in software development"
3. "landmark cases in software trademark litigation"

Query Transformation

HyDE (Hypothetical Document Embeddings): The system constructs hypothetical documents that could potentially answer the original query:

1. "A summary of key legal decisions related to software intellectual property disputes"
2. "An analysis of the impact of patent, copyright, and trademark laws on the software industry"
3. "A timeline of significant legal battles in the software industry involving intellectual property rights"

Scenario 3: Customer Support Assistant

Imagine a customer support chatbot that uses a RAG system to provide answers to user inquiries.

A user enters a query: "How can I troubleshoot slow performance on my laptop?"

Query Expansion

Multi-Query: The system generates multiple queries to address different aspects of the problem:

1. "Steps to diagnose and fix slow laptop performance"
2. "Common causes of slow laptop performance"
3. "Software and hardware solutions for improving laptop speed"

Query Transformation

Query Rewrite: The system rewrites the original query to make it more actionable:

Original query: "How can I troubleshoot slow performance on my laptop?"
Rewritten query: "What are the best methods to identify and resolve issues causing slow laptop performance?"

These examples demonstrate how query optimisation and transformation techniques can be applied in various domains to improve the effectiveness and relevance of the RAG system's responses.

By expanding queries, transforming them, and routing them intelligently, RAG systems can better understand user intent, retrieve the most pertinent information, and provide accurate and helpful answers.

Embedding

Embedding enables the retrieval of relevant documents by calculating the similarity between the embeddings of the question and document chunks.

An embedding is a dense vector representation of a piece of text that captures its semantic meaning.

These vectors are produced by an embedding model (sometimes called an embedding engine), which is typically a pre-trained language model.

How Embedding Works

  1. Text Preprocessing: The input text (question or document chunk) is preprocessed by tokenising it into individual words or subwords.

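  2. Input Representation: The tokenised text is then converted into a numerical representation. In current models each token is mapped to an ID and looked up in a learned embedding table; earlier approaches used one-hot encodings or static word embeddings (e.g., Word2Vec, GloVe).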

  3. Embedding Model: The input representation is passed through an embedding model, which is usually a pre-trained language model like BERT, RoBERTa, or specialised models like AnglE, Voyage, or BGE. These models have been trained on large amounts of text data to capture the semantic relationships between words and sentences.

  4. Dense Vector Representation: The embedding model outputs a dense vector representation (embedding) for the input text. This embedding is a fixed-size, low-dimensional vector that captures the semantic meaning of the text.

  5. Similarity Calculation: The embeddings of the question and document chunks are compared using a similarity metric, such as cosine similarity, to determine the relevance of each chunk to the question.
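For reference, the cosine similarity used in the last step can be written out for a query embedding q and a chunk embedding d of the same dimension n. The three-dimensional vectors in the worked example are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```latex
\[
\operatorname{sim}(q, d)
  = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}
\]

% Worked example with q = (1, 2, 2) and d = (2, 1, 2):
\[
q \cdot d = 2 + 2 + 4 = 8, \qquad
\lVert q \rVert = \lVert d \rVert = \sqrt{9} = 3, \qquad
\operatorname{sim}(q, d) = \frac{8}{3 \cdot 3} \approx 0.889
\]
```

The score lies in [-1, 1] and depends only on the angle between the vectors, not on their lengths.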

Best Practices and Techniques

Mix/Hybrid Retrieval: Combining sparse and dense embedding approaches can capture different relevance features and leverage complementary relevance information. Sparse retrieval models (e.g., BM25) can provide initial search results for training dense retrieval models or handle queries containing rare entities, while dense retrieval models can enhance the semantic understanding of the text.

Fine-tuning Embedding Model: When the context significantly deviates from the pre-training corpus, such as in specialised domains like healthcare or legal practice, fine-tuning the embedding model on a domain-specific dataset becomes essential. This helps align the retriever and generator and improves the model's performance on downstream tasks.

LM-supervised Retriever (LSR): Using the results of a Large Language Model (LLM) as the supervision signal for fine-tuning the retriever can help align the retriever and generator. Techniques like PROMPTAGATOR and LLM-Embedder use LLMs to generate task-specific retrievers or reward signals for fine-tuning.
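For the mix/hybrid retrieval practice above, the results of a sparse and a dense retriever have to be merged somehow. The sketch below uses reciprocal rank fusion, a common fusion method that the text does not name, so treat it as one possible choice rather than the survey's recommendation. The two input rankings are hypothetical, and `k = 60` is a conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))

sparse = ["d3", "d1", "d2"]  # e.g. BM25 order (hypothetical)
dense = ["d1", "d4", "d3"]   # e.g. embedding-similarity order (hypothetical)
fused = reciprocal_rank_fusion([sparse, dense])
```

Rank-based fusion avoids having to put BM25 scores and cosine similarities on a common scale, which is why it is often preferred over a weighted sum of raw scores.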

By understanding the embedding process, using appropriate embedding engines, and applying best practices and techniques, RAG systems can effectively retrieve relevant documents and generate more accurate and informative responses to user queries.

The choice of embedding engine and techniques depends on the specific requirements of the RAG system, the available resources, and the characteristics of the input data.

Adapters

An adapter plays the role of aligning and integrating the retriever and generator components to optimise performance on specific tasks.

Adapters are lightweight, pluggable modules that can be added to pre-trained large language models (LLMs) without modifying their underlying architecture.

They help address challenges such as integrating functionality through APIs, handling computational resource constraints, and improving the multi-task capabilities of LLMs.

Key roles and functions of adapters in RAG systems:

  1. Task-specific alignment: Adapters can be designed to align the retriever and generator components for specific downstream tasks. By fine-tuning the adapter on task-specific data, the RAG system can better handle the requirements and nuances of individual tasks.

  2. Prompt retrieval: Adapters like UPRISE (Universal Prompt Retrieval for Improving Zero-Shot Evaluation) can be trained to automatically retrieve prompts from a pre-built prompt pool that are suitable for a given zero-shot task input. This enhances the multi-task capabilities of LLMs and improves their performance on unseen tasks.

  3. Universal adaptation: Adapters such as AAR (Augmentation-Adapted Retriever) can be designed as universal adapters that accommodate multiple downstream tasks. This allows the RAG system to handle a wide range of tasks without the need for task-specific fine-tuning.

  4. Reward-driven adaptation: Adapters like PRCA (Pluggable Reward-driven Contextual Adapter) can be added to the RAG system to enhance performance on specific tasks by incorporating reward-driven learning. This allows the system to optimise its performance based on task-specific reward signals.

  5. Bridging retriever and generator: Adapters can act as a bridge between the retriever and generator components, transforming the retrieved information into a format that LLMs can work with effectively. For example, BGM (Bridge-Generator Model) trains a bridge Seq2Seq model that sits between the retriever and LLM, allowing it to rerank and dynamically select passages for each query.

  6. Knowledge integration: Adapters can be used to integrate knowledge into white-box models via directive fine-tuning, as demonstrated by PKG (Parametric Knowledge Guiding). In this approach, the retriever module is replaced by a model that generates relevant documents for a query, helping to address the difficulties encountered during the fine-tuning process and enhance model performance.

By incorporating adapters into RAG systems, researchers and practitioners can improve the alignment between retriever and generator components, enhance the multi-task capabilities of LLMs, and optimise performance on specific downstream tasks.

Adapters provide a versatile and customisable way to enhance the performance and capabilities of RAG systems, making them a valuable tool in the development and deployment of these systems.

Once the relevant documents have been retrieved, the next step is to generate accurate and coherent responses. In this section, we will explore the generation process, which involves context curation and LLM fine-tuning.

Generation Process

The generation process involves further processing the retrieved content and adjusting the Large Language Model (LLM) to obtain the best results.

This process is crucial because directly inputting all the retrieved information to the LLM may lead to suboptimal answers due to redundant or overly long contexts. The generation process can be optimised through context curation and LLM fine-tuning.

Context Curation: Context curation involves adjusting the retrieved content to reduce redundancy and focus on the most relevant information.

The main techniques for context curation are:

Reranking

  • Reranking reorders document chunks to prioritise the most pertinent results, effectively reducing the overall document pool.

  • It serves as both an enhancer and a filter, delivering refined inputs for more precise language model processing.

  • Reranking can be performed using rule-based methods that rely on predefined metrics, or model-based approaches such as specialised reranking models or general large language models like GPT.
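The enhance-and-filter role of reranking can be sketched with a pluggable scoring function. The rule-based `overlap_score` below is a hypothetical example; a model-based reranker would supply a different `score` callable with the same shape.

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score: Callable[[str, str], float], keep: int) -> list[str]:
    """Reorder chunks by a relevance score and keep only the best `keep`."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:keep]

def overlap_score(query: str, chunk: str) -> float:
    # Rule-based scorer (hypothetical): fraction of query words found in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0

retrieved = [
    "shipping takes five days",
    "password rules and links",
    "reset your password in settings",
]
top = rerank("reset password", retrieved, overlap_score, keep=2)
```

Only the two best-scoring chunks are passed on, so the LLM sees a shorter and more relevant context than the retriever originally returned.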

Context Selection/Compression

  • Excessive context can introduce noise and diminish the LLM's perception of key information.

  • For example, LLMLingua uses small language models (SLMs) to detect and remove unimportant tokens, transforming the context into a form that is challenging for humans to comprehend but well understood by LLMs. This approach eliminates the need for additional training of LLMs while balancing language integrity and compression ratio.

Augmentation Techniques

The augmentation process focuses on optimising the retrieval process to provide a more comprehensive knowledge base for Large Language Models (LLMs).

The standard practice of a single retrieval step followed by generation can be insufficient for complex problems that require multi-step reasoning. In response to this issue, several optimisation techniques have been developed, including iterative retrieval, recursive retrieval, and adaptive retrieval.

In addition to the most common single-pass retrieval, RAG also includes three types of retrieval augmentation processes. (Left) Iterative retrieval involves alternating between retrieval and generation, allowing for richer and more targeted context from the knowledge base at each step. (Middle) Recursive retrieval involves gradually refining the user query and breaking down the problem into sub-problems, then continuously solving complex problems through retrieval and generation. (Right) Adaptive retrieval focuses on enabling the RAG system to autonomously determine whether external knowledge retrieval is necessary and when to stop retrieval and generation, often using LLM-generated special tokens for control.

Iterative Retrieval

  • Iterative retrieval involves repeatedly searching the knowledge base based on the initial query and the text generated so far.

  • It provides a more comprehensive knowledge base for LLMs by offering additional contextual references through multiple retrieval iterations.

  • The model uses the content needed to address the input task as a contextual basis for retrieving relevant knowledge, which facilitates the generation of improved responses in subsequent iterations.
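The alternation between retrieval and generation can be sketched as a simple loop. The `retrieve` and `generate` callables are placeholders for a real retriever and LLM; the stubs below are hypothetical and only mimic how a draft answer can surface new search terms.

```python
from typing import Callable

def iterative_rag(
    query: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    rounds: int = 3,
) -> tuple[str, list[str]]:
    """Alternate retrieval and generation; each round searches with query + current draft."""
    context: list[str] = []
    draft = ""
    for _ in range(rounds):
        search_text = f"{query} {draft}".strip()
        for doc in retrieve(search_text):
            if doc not in context:
                context.append(doc)  # accumulate new evidence across rounds
        draft = generate(query, context)
    return draft, context

def fake_retrieve(text: str) -> list[str]:
    # Hypothetical retriever: the second document only matches once the draft mentions abstracts.
    docs = []
    if "social media" in text:
        docs.append("abstracts")
    if "abstracts" in text:
        docs.append("methodologies")
    return docs

def fake_generate(query: str, context: list[str]) -> str:
    # Hypothetical generator: summarises whatever context it has been given.
    return "summary of " + ", ".join(context)

answer, context = iterative_rag("social media and mental health",
                                fake_retrieve, fake_generate, rounds=2)
```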

Recursive Retrieval

  • Recursive retrieval involves iteratively refining search queries based on the results obtained from previous searches.

  • It aims to enhance the search experience by gradually converging on the most pertinent information through a feedback loop.

  • Recursive retrieval can be combined with multi-hop retrieval techniques to address specific data scenarios, such as processing and retrieving data from hierarchical structures or graph-structured data sources.

Adaptive Retrieval

  • Adaptive retrieval methods, such as FLARE and Self-RAG, enable LLMs to actively determine the optimal moments and content for retrieval.

  • These methods are part of a broader trend where LLMs employ active judgment in their operations, as seen in model agents like AutoGPT, Toolformer, and Graph-Toolformer.

  • FLARE automates the timing of retrieval by monitoring the confidence of the generation process, as indicated by the probability of generated terms, and activates the retrieval system when that probability falls below a certain threshold.

  • Self-RAG introduces "reflection tokens" that allow the model to introspect its outputs. Retrieval is activated either when the model autonomously decides it is needed or when a predefined threshold is reached.
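The confidence-triggered idea can be illustrated with a small sketch. It is loosely modelled on the threshold mechanism described above and is not the FLARE implementation: the threshold value, the `(sentence, token_probabilities)` draft format, and the `retrieve` and `regenerate` callables are assumptions chosen for the example.

```python
from typing import Callable, List, Tuple


def needs_retrieval(token_probs: List[float], threshold: float) -> bool:
    """True when any generated token's probability falls below the threshold."""
    return any(p < threshold for p in token_probs)


def adaptive_generate(
    drafts: List[Tuple[str, List[float]]],
    retrieve: Callable[[str], List[str]],
    regenerate: Callable[[str, List[str]], str],
    threshold: float = 0.4,  # illustrative value, not taken from any paper
) -> Tuple[List[str], int]:
    """Accept confident draft sentences as-is; retrieve and regenerate the rest."""
    output: List[str] = []
    retrieval_calls = 0
    for sentence, probs in drafts:
        if needs_retrieval(probs, threshold):
            # Low confidence: use the tentative sentence as the retrieval query.
            evidence = retrieve(sentence)
            retrieval_calls += 1
            sentence = regenerate(sentence, evidence)
        output.append(sentence)
    return output, retrieval_calls


# Hypothetical drafts: each pairs a tentative sentence with its token probabilities.
drafts = [
    ("Destination A suits families.", [0.90, 0.80, 0.95]),
    ("The best month to visit is July.", [0.90, 0.20, 0.70]),
]
output, calls = adaptive_generate(
    drafts,
    retrieve=lambda query: ["(retrieved travel guide passage)"],
    regenerate=lambda sentence, evidence: f"{sentence} [grounded in {len(evidence)} passage(s)]",
)
```

Only the second draft contains a low-probability token, so retrieval runs once and the confident sentence passes through untouched. Raising the threshold makes the system retrieve more often at higher cost; lowering it trusts the model's parametric knowledge more.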

By carefully applying iterative retrieval, recursive retrieval, and adaptive retrieval techniques, you can optimise the augmentation process in RAG systems, providing a more comprehensive knowledge base for LLMs and enabling them to handle complex problems that require multi-step reasoning.

The choice of augmentation technique depends on the specific requirements of the task, the complexity of the problem, and the structure of the data sources being used.

Examples of Augmentation Techniques

Here are some scenarios illustrating how augmentation techniques can be applied in practice:

Scenario 1: Iterative Retrieval for Academic Literature Review

Imagine a researcher using a RAG system to conduct a literature review on a specific topic, such as "the impact of social media on mental health."

Initial query: "the impact of social media on mental health"

Iteration 1:
- Retrieve: The system retrieves relevant abstracts and introductions from academic papers.
- Generate: The system generates a summary of the main findings and key concepts.

Iteration 2:
- Retrieve: Based on the generated summary, the system retrieves more specific information, such as methodologies and limitations of the studies.
- Generate: The system updates the summary with more detailed information and identifies gaps in the literature.

Iteration 3:
- Retrieve: The system retrieves recent publications and case studies to fill the identified gaps.
- Generate: The system generates a comprehensive literature review, synthesizing the findings from all iterations.

By employing iterative retrieval, the RAG system progressively builds a more comprehensive knowledge base, allowing for a thorough and up-to-date literature review.

Scenario 2: Recursive Retrieval for Troubleshooting Complex Software Issues

Consider a software developer using a RAG system to troubleshoot a complex issue in a large codebase.

Initial query: "Error: Unable to connect to database on startup"

Recursive Retrieval 1:
- Refine query: "Possible reasons for database connection failure on application startup"
- Retrieve: The system retrieves information on common causes of database connection issues, such as incorrect credentials or network problems.

Recursive Retrieval 2:
- Refine query: "How to diagnose database connection issues in a Spring Boot application"
- Retrieve: The system retrieves specific troubleshooting steps and code snippets relevant to the developer's application framework.

Recursive Retrieval 3:
- Refine query: "Debugging database connection issues in a Docker containerized environment"
- Retrieve: The system retrieves information on how to diagnose and resolve database connection problems specific to the developer's deployment setup.

Generate: The system generates a step-by-step troubleshooting guide tailored to the developer's specific application architecture and deployment environment.

By using recursive retrieval, the RAG system gradually refines the search queries based on the results from previous searches, ultimately providing a targeted and comprehensive solution to the complex troubleshooting problem.

Scenario 3: Adaptive Retrieval for Personalised Travel Recommendations

Imagine a travel booking platform using a RAG system to provide personalised travel recommendations based on user preferences and queries.

Initial query: "Best beach destinations for a family vacation"

Adaptive Retrieval:
- The system assesses the initial query and determines that more information is needed to provide accurate recommendations.
- The system generates follow-up questions to gather more context:
  - "What is your preferred travel month?"
  - "What is your budget range for the vacation?"
  - "Are you interested in any specific activities or attractions?"

- Based on the user's responses, the system adaptively retrieves information on beach destinations that match the specified criteria.
- The system monitors the confidence of the generated recommendations based on the available information.
- If the confidence is low, the system triggers additional retrieval to gather more relevant information, such as user reviews or travel guides.

Generate: The system generates a list of personalised beach destination recommendations, along with a summary of each destination's key features, activities, and accommodations that align with the user's preferences.

By employing adaptive retrieval, the RAG system actively determines the optimal moments and content for retrieval based on the user's input and the confidence of the generated recommendations, ultimately providing more accurate and personalised travel suggestions.

These three scenarios demonstrate how iterative retrieval, recursive retrieval, and adaptive retrieval can be applied in various domains to enhance the performance and output quality of RAG systems.

By leveraging these augmentation techniques, RAG systems can effectively handle complex, multi-step problems and deliver more comprehensive and targeted results.

Evaluating RAG Models

RAG is primarily used for Question Answering (QA) tasks, including single-hop/multi-hop QA, multiple-choice, domain-specific QA, and long-form scenarios.

RAG is also expanding into other tasks such as Information Extraction (IE), dialogue generation, and code search.

The main evaluation targets are retrieval quality and generation quality.

Retrieval quality assesses the effectiveness of the context sourced by the retriever component, while generation quality evaluates the generator's ability to synthesise coherent and relevant answers from the retrieved context.

While there is a range of evaluation benchmarks and tools, the best evaluation tool is human feedback: is the retrieved context relevant, are the answers good, and does the system work for its users?
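Alongside human feedback, retrieval quality is commonly tracked with automatic ranking metrics. The sketch below computes two standard ones, hit rate at k and mean reciprocal rank (MRR). The query and document identifiers and the relevance labels are made up for illustration.

```python
from typing import Dict, List, Set


def hit_rate_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k results."""
    hits = sum(1 for q, docs in ranked.items() if relevant[q] & set(docs[:k]))
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first relevant document (0 when none is retrieved)."""
    total = 0.0
    for q, docs in ranked.items():
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Hypothetical retriever output and human relevance labels.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d5"}}

print(hit_rate_at_k(ranked, relevant, k=2))    # 0.5: only q1 has a relevant doc in its top 2
print(mean_reciprocal_rank(ranked, relevant))  # 0.25: (1/2 + 0) / 2
```

Metrics like these only cover the retrieval side. Generation quality still needs separate checks, whether by human review or by an automated judge.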

Future Directions and Challenges

To make RAG more mainstream and production-ready, several key areas need to be addressed. These challenges can be grouped into three main categories: technical advancements, user accessibility, and scalability.

Robust and Efficient Retrieval

Developing techniques to improve retrieval efficiency and document recall in large knowledge bases is critical.

This could involve optimising indexing structures, employing more sophisticated similarity measures, and leveraging hardware accelerators to speed up the retrieval process.

Additionally, ensuring data security and preventing inadvertent disclosure of document sources or metadata by LLMs is essential for production environments.

Standardised RAG Frameworks

Creating standardised and modular RAG frameworks that are easy to use and integrate with existing systems can greatly accelerate adoption.

These frameworks should provide a set of well-defined APIs and abstractions for different components of the RAG pipeline, such as retrieval, augmentation, and generation. They should also support various data formats, modalities, and downstream tasks out of the box.
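One way such abstractions might look is sketched below using Python's `typing.Protocol`. The interface names (`Retriever`, `Generator`), the method signatures, and the two toy implementations are hypothetical and do not correspond to any existing framework's API.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> List[str]: ...


@runtime_checkable
class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...


class KeywordRetriever:
    """Toy retriever: keeps documents that share at least one word with the query."""

    def __init__(self, docs: List[str]) -> None:
        self.docs = docs

    def retrieve(self, query: str, k: int) -> List[str]:
        words = set(query.lower().split())
        return [d for d in self.docs if words & set(d.lower().split())][:k]


class EchoGenerator:
    """Stub generator standing in for an LLM call."""

    def generate(self, prompt: str) -> str:
        return f"(stub answer) {prompt}"


def run(retriever: Retriever, generator: Generator, question: str, k: int = 2) -> str:
    """Any objects that satisfy the two protocols can be combined here."""
    passages = retriever.retrieve(question, k)
    return generator.generate("\n".join(passages + [question]))


result = run(KeywordRetriever(["cats purr", "dogs bark"]), EchoGenerator(), "why do cats purr")
```

Because the pipeline depends only on the interfaces, a dense retriever or a different language model can be swapped in without touching `run`.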

Low-Code/No-Code Platforms

Developing user-friendly, low-code, or no-code platforms for building and deploying RAG applications can democratise access to this technology.

These platforms should provide intuitive interfaces for configuring RAG pipelines, selecting pre-trained models, and fine-tuning them for specific tasks.

Visual programming paradigms, such as drag-and-drop interfaces and flowchart-based design tools, can make RAG more accessible to non-technical users.

Scalability and Cloud Support

Ensuring that RAG frameworks and platforms can scale to handle large-scale deployments and high-throughput scenarios is critical for production readiness.

Integrating RAG with cloud computing platforms and leveraging their elastic resources, such as serverless computing and auto-scaling, can help meet the demands of real-world applications. Additionally, providing managed RAG services through cloud providers can simplify deployment and maintenance for developers.

Multimodal RAG

Expanding RAG to support multimodal data, such as images, audio, video, and code, can open up new possibilities and use cases.

Developing unified frameworks that can seamlessly handle different modalities and enable cross-modal retrieval and generation can greatly enhance the versatility and applicability of RAG.

This could involve integrating specialised encoders and decoders for each modality, as well as designing novel augmentation strategies that leverage the strengths of each modality.

By leveraging advancements in RAG frameworks, platforms, and multimodal support, and by following the pipeline-building process described in the next section, developers can create powerful and production-ready RAG pipelines that tackle a wide range of real-world problems.

Creating a Process for Generating RAG Pipelines

To create a streamlined process for generating RAG pipelines, developers can follow these steps:

  1. Problem Definition: Clearly define the problem or task that the RAG pipeline will address. Identify the input data format, desired output format, and any specific constraints or requirements.

  2. Data Preparation: Collect, preprocess, and annotate the relevant data for training and evaluation. This may involve web scraping, data cleaning, tokenization, and indexing. Ensure that the data is in a format compatible with the chosen RAG framework.

  3. Retrieval Component Selection: Choose an appropriate retrieval component based on the nature of the data and the task at hand. This could be a dense retriever, a sparse retriever, or a hybrid approach. Configure the retrieval parameters, such as the similarity function, index structure, and query processing techniques.

  4. Augmentation Strategy Design: Determine the augmentation strategy that aligns with the task requirements. This could involve techniques like iterative retrieval, recursive retrieval, or adaptive retrieval. Define the criteria for triggering retrieval and the mechanisms for integrating retrieved information with the generated output.

  5. Generation Model Selection: Select a suitable language model for the generation component. This could be a pre-trained model like GPT or a domain-specific model fine-tuned on relevant data. Configure the model parameters, such as the architecture, tokenization scheme, and decoding strategy.

  6. Pipeline Integration: Integrate the retrieval, augmentation, and generation components into a coherent RAG pipeline. Use the chosen RAG framework's APIs and abstractions to connect the components and define the data flow between them. Implement any necessary pre-processing and post-processing steps.

  7. Training and Fine-tuning: Train or fine-tune the RAG pipeline on the prepared data. This may involve optimising the retrieval component's parameters, fine-tuning the generation model on domain-specific data, or jointly training the entire pipeline end-to-end. Use appropriate loss functions and optimization techniques for each component.

  8. Evaluation and Iteration: Evaluate the performance of the RAG pipeline using relevant metrics and benchmarks. Analyse the results and identify areas for improvement. Iterate on the pipeline design, component selection, and training process until satisfactory performance is achieved.

  9. Deployment and Monitoring: Deploy the RAG pipeline in a production environment, integrating it with the necessary APIs and user interfaces. Monitor the pipeline's performance and gather user feedback. Continuously update and maintain the pipeline to ensure its robustness, efficiency, and effectiveness over time.
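Steps 3 to 6 above (retrieval, augmentation, generation, and integration) can be sketched as a single small class. This is a minimal illustration under stated assumptions: `RagPipeline`, its word-overlap retriever, the prompt template, and the `generate` callable (a placeholder for a real LLM call) are hypothetical names introduced for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class RagPipeline:
    documents: List[str]            # step 2: prepared data
    generate: Callable[[str], str]  # step 5: placeholder for an LLM call (prompt -> answer)
    top_k: int = 2                  # step 3: retrieval parameter

    def retrieve(self, question: str) -> List[str]:
        """Step 3: rank documents by word overlap (stand-in for a dense or sparse retriever)."""
        q = _tokens(question)
        ranked = sorted(self.documents, key=lambda d: len(q & _tokens(d)), reverse=True)
        return ranked[: self.top_k]

    def build_prompt(self, question: str, passages: List[str]) -> str:
        """Step 4: merge the retrieved passages with the question."""
        context = "\n".join(f"- {p}" for p in passages)
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    def answer(self, question: str) -> str:
        """Step 6: connect retrieval, augmentation, and generation."""
        return self.generate(self.build_prompt(question, self.retrieve(question)))


docs = [
    "RAG retrieves passages before generating.",
    "Vector databases store embeddings.",
    "Bananas are yellow.",
]
# The identity function stands in for the model so the assembled prompt can be inspected.
pipeline = RagPipeline(documents=docs, generate=lambda prompt: prompt)
print(pipeline.answer("How does RAG use retrieved passages?"))
```

Keeping each stage as a separate method makes it straightforward to evaluate and replace components independently during the evaluation and iteration step.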

Conclusion

In conclusion, this document provides a detailed overview of the evolution of RAG, from the early stages of Naïve RAG to the more advanced and modular approaches.

The core components of RAG, including retrieval, generation, and augmentation techniques, are thoroughly explained, emphasising their roles in improving the accuracy, relevance, and efficiency of language models.

The document also explores the relationship between RAG and fine-tuning, showcasing how these techniques can be used together to optimise the performance of Large Language Models (LLMs) in various applications.

The document also covers the technical aspects of RAG, such as indexing optimisation, query optimisation and transformation, embedding, and adapters. These sections provide insights into the best practices and techniques for implementing RAG in real-world scenarios, along with illustrative examples and case studies.

The future directions and challenges section highlights the key areas that need to be addressed to make RAG more mainstream and production-ready.

This includes the development of robust and efficient retrieval techniques, standardised RAG frameworks, low-code/no-code platforms, scalability and cloud support, and multimodal RAG. By focusing on these aspects, researchers and practitioners can accelerate the adoption and impact of RAG across various industries.

RAG has the potential to revolutionise the way we interact with and generate content, enabling more accurate, relevant, and contextually aware language models. As the technology continues to evolve, it is expected to have a significant impact on a wide range of applications, such as question answering, information retrieval, dialogue systems, and content creation.

In summary, this document provides a comprehensive and accessible overview of RAG, its mechanisms, and its potential for transforming neural language models.

By understanding and leveraging the power of RAG, researchers and practitioners can develop more advanced and efficient language models that can tackle complex real-world problems and deliver better user experiences.

References

Introduction and Overview of RAG

  • "Large language models struggle to learn long-tail knowledge" [1]

  • "Siren's song in the ai ocean: A survey on hallucination in large language models" [2]

  • "Gar-meets-rag paradigm for zero-shot information retrieval" [3]

  • "Retrieval-augmented generation for knowledge-intensive nlp tasks" [4]

  • "Improving language models by retrieving from trillions of tokens" [5]

  • "Training language models to follow instructions with human feedback" [6]

Evolution and Mechanisms of RAG

  • "Query rewriting for retrieval-augmented large language models" [7]

  • "Advanced rag techniques: an illustrated overview" [8]

  • "Large language model based long-tail query rewriting in taobao search" [9]

  • "Take a step back: Evoking reasoning via abstraction in large language models" [10]

  • "Precise zero-shot dense retrieval without relevance labels" [11]

  • "Enhancing rag pipelines in haystack: Introducing diversityranker and lostinthemiddleranker" [12]

  • "Generate rather than retrieve: Large language models are strong context generators" [13]

  • "Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy" [14]

  • "Knowledgpt: Enhancing large language models with retrieval and storage access on knowledge bases" [15]

  • "Forget rag, the future is rag-fusion" [16]

  • "Lift yourself up: Retrieval-augmented text generation with self memory" [17]

  • "Training data is more valuable than you think: A simple and effective method by retrieving from training data" [18]

  • "From classification to generation: Insights into crosslingual retrieval augmented icl" [19]

  • "Uprise: Universal prompt retrieval for improving zero-shot evaluation" [20]

  • "Promptagator: Few-shot dense retrieval from 8 examples" [21]

  • "Recitation-augmented language models" [22]

  • "Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp" [23]

  • "Active retrieval augmented generation" [24]

  • "Self-rag: Learning to retrieve, generate, and critique through self-reflection" [25]

  • "Bridging the preference gap between retrievers and llms" [26]

  • "Ra-dit: Retrieval-augmented dual instruction tuning" [27]

RAG versus Fine-Tuning

  • "Fine-tuning or retrieval? comparing knowledge injection in llms" [28]

Retrieval

  • "Copy is all you need" [29]

  • "Dense x retrieval: What retrieval granularity should we use?" [30]

  • "Divide & conquer for entailment-aware multi-hop evidence retrieval" [31]

  • "Diversify question generation with retrieval-augmented style transfer" [32]

  • "Prompt-guided retrieval augmentation for non-knowledge-intensive tasks" [33]

  • "Learning to filter context for retrieval-augmented generation" [34]

  • "Retrieval-augmented data augmentation for low-resource domain tasks" [35]

  • "Large language model is not a good few-shot information extractor, but a good reranker for hard samples!" [36]

  • "Retrieval-augmented generative question answering for event argument extraction" [37]

  • "Learning to retrieve in-context examples for large language models" [38]

  • "Recommender systems with generative retrieval" [39]

  • "Language models as semantic indexers" [40]

  • "Context tuning for retrieval augmented generation" [41]

  • "Few-shot learning with retrieval augmented language models" [42]

  • "Raven: In-context learning with retrieval augmented encoder-decoder language models" [43]

  • "Shall we pretrain autoregressive language models with retrieval? a comprehensive study" [44]

  • "Instructretro: Instruction tuning post retrieval-augmented pretraining" [45]

  • "Improving the domain adaptation of retrieval augmented generation (rag) models for open domain question answering" [46]

  • "Augmentation-adapted retriever improves generalization of language models as generic plug-in" [47]

  • "Making retrieval augmented language models robust to irrelevant context" [48]

  • "Understanding retrieval augmentation for long-form question answering" [49]

  • "Chain-of-note: Enhancing robustness in retrieval-augmented language models" [50]

  • "Search-in-the-chain: Towards accurate, credible and traceable large language models for knowledge intensive tasks" [51]

  • "Optimizing retrieval-augmented reader models via token elimination" [52]

  • "Paperqa: Retrieval-augmented generative agent for scientific research" [53]

  • "The power of noise: Redefining retrieval for rag systems" [54]

  • "Iag: Induction-augmented generation framework for answering reasoning questions" [55]

  • "Nomiracl: Knowing when you don't know for robust multilingual retrieval-augmented generation" [56]

  • "Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models" [57]

  • "Self-knowledge guided retrieval augmentation for large language models" [58]

  • "Retrieval-generation synergy augmented large language models" [59]

  • "Retrieval meets long context large language models" [60]

  • "Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions" [61]

  • "Investigating the factual knowledge boundary of large language models with retrieval augmentation" [62]

  • "Raptor: Recursive abstractive processing for tree-organized retrieval" [63]

  • "In-context retrieval-augmented language models" [64]

  • "Retrieve-and-sample: Document-level event argument extraction via hybrid retrieval augmentation" [65]

  • "Zemi: Learning zero-shot semi-parametric language models from multiple tasks" [66]

  • "Corrective retrieval augmented generation" [67]

  • "1-pager: One pass answer generation and evidence retrieval" [68]

  • "Prca: Fitting black-box large language models for retrieval question answering via pluggable reward-driven contextual adapter" [69]

  • "Open-source large language models are strong zero-shot query likelihood models for document ranking" [70]

  • "Recomp: Improving retrieval-augmented lms with compression and selective augmentation" [71]

  • "Replug: Retrieval-augmented black-box language models" [72]

  • "Enhancing llm intelligence with arm-rag: Auxiliary rationale memory for retrieval augmented generation" [73]

  • "Unims-rag: A unified multi-source retrieval-augmented generation for personalized dialogue systems" [74]

  • "Augmented large language models with parametric knowledge guiding" [75]

  • "Structure aware language model pretraining improves dense retrieval on structured data" [76]

  • "Knowledge graph-augmented language models for knowledge-grounded dialogue generation" [77]

  • "Retrieval-generation alignment for end-to-end task-oriented dialogue system" [78]

  • "Dual-feedback knowledge retrieval for task-oriented dialogue systems" [79]

  • "Fabula: Intelligence report generation using retrieval-augmented narrative construction" [80]

  • "Think and retrieval: A hypothesis knowledge graph enhanced medical large language models" [81]

  • "Knowledge-augmented language model verification" [82]

  • "Reasoning on graphs: Faithful and interpretable large language model reasoning" [83]

  • "G-retriever: Retrieval-augmented generation for textual graph understanding and question answering" [84]

  • "Tablegpt: Towards unifying tables, nature language and commands into one gpt" [85]

  • "Iseeq: Information seeking question generation using dynamic meta-information retrieval and knowledge graphs" [86]

  • "Large language models can be easily distracted by irrelevant context" [87]

Indexing Optimization

  • "Evaluating the ideal chunk size for a rag system using llamaindex" [88]

  • "Recursively split by character" [89]

  • "Advanced rag 01: Small-to-big retrieval" [90]

  • "Knowledge graph prompting for multi-document question answering" [91]

Query Optimization and Transformation

  • "Least-to-most prompting enables complex reasoning in large language models" [92]

  • "Chain-of-verification reduces hallucination in large language models" [93]

Embedding

  • "Angle-optimized text embeddings" [94]

  • "Voyage's embedding models" [95]

  • "Flagembedding" [96]

  • "Retrieve anything to augment large language models" [97]

Adapters

  • "Uprise: Universal prompt retrieval for improving zero-shot evaluation" [20]

  • "Augmentation-adapted retriever improves generalization of language models as generic plug-in" [47]

  • "Prca: Fitting black-box large language models for retrieval question answering via pluggable reward-driven contextual adapter" [69]

  • "Bridging the preference gap between retrievers and llms" [26]

  • "Augmented large language models with parametric knowledge guiding" [75]

Generation Process

  • "Lost in the middle: How language models use long contexts" [98]

  • "Chatrec: Towards interactive and explainable llms-augmented recommender system" [99]

  • "Lingua: Addressing scenarios for live interpretation and automatic dubbing" [100]

  • "Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression" [101]

  • "Dense passage retrieval for open-domain question answering" [102]

  • "Large language model is not a good few-shot information extractor, but a good reranker for hard samples!" [103]

  • "Chatlaw: Open-source legal large language model with integrated external knowledge bases" [104]

Augmentation

  • "Making retrieval augmented language models robust to irrelevant context" [105]

  • "Chain of knowledge: A framework for grounding large language models with structured knowledge bases" [106]

  • "Auto-gpt for online decision making: Benchmarks and additional opinions" [107]

  • "Toolformer: Language models can teach themselves to use tools" [108]

  • "Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt" [109]

  • "Webgpt: Browser assisted question-answering with human feedback" [110]

Evaluating RAG Models

  • "Natural questions: a benchmark for question answering research" [111]

  • "Exploring the integration strategies of retriever and large language models" [112]

  • "Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension" [113]

  • "Squad: 100,000+ questions for machine comprehension of text" [114]

  • "Semantic parsing on freebase from question-answer pairs" [115]

  • "When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories" [116]

  • "Ms marco: A human-generated machine reading comprehension dataset" [117]

  • "Hotpotqa: A dataset for diverse, explainable multi-hop question answering" [118]

  • "Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps" [119]

  • "Musique: Multihop questions via single-hop question composition" [120]

  • "Eli5: Long form question answering" [121]

  • "The narrative qa reading comprehension challenge" [122]

  • "A human inspired reading agent with gist memory of very long contexts" [123]

  • "Asqa: Factoid questions meet long-form answers" [124]

  • "Qmsum: A new benchmark for query-based multi-domain meeting summarization" [125]

  • "A dataset of information-seeking questions and answers anchored in research papers" [126]

  • "Covid-qa: A question answering dataset for covid-19" [127]

  • "Cmb: A comprehensive medical benchmark in chinese" [128]

  • "Measuring massive multitask chinese understanding" [129]

  • "Quality: Question answering with long input texts, yes!" [130]

  • "Think you have solved question answering? try arc, the ai2 reasoning challenge" [131]

  • "Commonsenseqa: A question answering challenge targeting commonsense knowledge" [132]

  • "Wizard of wikipedia: Knowledge-powered conversational agents" [133]

  • "Large language models as source planner for personalized knowledge-grounded dialogue" [134], [135]

  • "Long time no see! open-domain conversation with long-term persona memory" [136]

  • "Conditional generation and snapshot learning in neural dialogue systems" [137]

  • "Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering" [138]

  • "Document-level event argument extraction by conditional generation" [139]

  • "Multisentence argument linking" [140]

  • "T-rex: A large scale alignment of natural language with knowledge base triples" [141]

  • "Zero-shot relation extraction via reading comprehension" [142]

  • "Hellaswag: Can a machine really finish your sentence?" [143]

  • "The cot collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning" [144]

  • "Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph" [145]

  • "Measuring massive multitask language understanding" [146]

  • "Pointer sentinel mixture models" [147]

  • "Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies" [148]

  • "Fever: a large-scale dataset for fact extraction and verification" [149]

  • "Explainable automated fact-checking for public health claims" [150]

  • "Neural text generation from structured data with application to the biography domain" [151]

  • "Wikiasp: A dataset for multi-domain aspect-based summarization" [152]

  • "Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization" [153]

  • "Vio-lens: A novel dataset of annotated social network posts leading to different forms of communal violence and its evaluation" [154]

  • "Learning question classifiers" [155]

  • "Recursive deep models for semantic compositionality over a sentiment treebank" [156]

  • "Codesearchnet challenge: Evaluating the state of semantic code search" [157]

  • "Training verifiers to solve math word problems" [158]

  • "The jrc-acquis: A multilingual aligned parallel corpus with 20+ languages" [159]

  • "Ralle: A framework for developing and evaluating retrieval-augmented large language models" [160]

  • "Building production-ready rag applications" [161]

  • "Evaluating rag part i: How to evaluate document retrieval" [162]

  • "Best practices for llm evaluation of rag applications" [163]

  • "Ragas: Automated evaluation of retrieval augmented generation" [164]

  • "Ares: An automated evaluation framework for retrieval-augmented generation systems" [165]

  • "A survey of techniques for maximizing llm performance" [166]

  • "Benchmarking large language models in retrieval-augmented generation" [167]

  • "Recall: A benchmark for llms robustness against external counterfactual knowledge" [168]

  • "Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models" [169]

Future Directions and Challenges

  • "Retrieval meets long context large language models" [170]

  • "Memgpt: Towards llms as operating systems" [171]

  • "Efficient streaming language models with attention sinks" [172]

  • "Raft: Adapting language model to domain specific rag" [173]

  • "Scaling laws for neural language models" [174]

  • "Neurosymbolic language modeling with automaton-augmented retrieval" [175]

Multimodal RAG

  • "Retrieval-augmented multimodal language modeling" [176]

  • "Blip-2: Bootstrapping language image pre-training with frozen image encoders and large language models" [177]

  • "Visualize before you write: Imagination-guided open-ended text generation" [178]

  • "Generating synthetic speech from spoken vocab for speech translation" [179]

  • "Using external off-policy speech-to-text mappings in contextual end-to-end automated speech recognition" [180]

  • "Vid2seq: Large-scale pretraining of a visual language model for dense video captioning" [181]


Copyright Continuum Labs - 2023