# RAFT: Adapting Language Model to Domain Specific RAG

## <mark style="color:purple;">Introduction to "Retrieval Augmented Fine Tuning" (RAFT)</mark>

This <mark style="color:blue;">**March 2024**</mark> paper introduces a novel approach called Retrieval Augmented Fine Tuning (RAFT), which aims to improve the performance of Large Language Models (LLMs) on domain-specific retrieval-augmented generation (RAG) tasks.

The authors present RAFT as a training methodology that enhances a model's ability to answer questions in "open-book" in-domain settings.

### <mark style="color:purple;">Key Points</mark>

1. **Problem Addressed**: The paper tackles the challenge of adapting pre-trained LLMs for specialized domains where accuracy based on a given set of documents is crucial.
2. **Current Limitations**: Existing methods like in-context learning through RAG and supervised fine-tuning have limitations. RAG-based methods don't leverage the learning opportunity in fixed domains, while fine-tuning approaches often fail to account for imperfections in the retrieval process.
3. **RAFT Solution**: The proposed method combines instruction fine-tuning with retrieval augmented generation. It trains the model to:
   * Incorporate domain knowledge
   * Improve in-domain RAG performance
   * Identify and use relevant documents while ignoring distractors
4. **Training Process**: RAFT trains the model to answer questions using relevant documents while also presenting it with distractor documents. This process improves the model's ability to reason and cite relevant information.
5. **Performance**: The authors report that RAFT consistently outperforms supervised fine-tuning (with and without RAG) across multiple datasets, including PubMed, HotpotQA, and Gorilla.
6. **Analogy**: The authors liken their approach to studying for an open-book exam, where the model learns to recognize relevant and irrelevant retrieved documents.

{% embed url="https://arxiv.org/abs/2403.10131" %}
RAFT: Adapting Language Model to Domain Specific RAG
{% endembed %}

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FE4G5eT2SG1OeTHnkIoGR%2Fchrome_3eFl4JGIMm.png?alt=media&#x26;token=a4cf3b23-5cf3-411f-9bc1-87cae9522027" alt=""><figcaption><p>How best to prepare for an exam? (a) Fine-tuning based approaches implement "studying" by either directly "memorising" the input documents or answering practice QA without referencing the documents. (b) Alternatively, in-context retrieval methods fail to leverage the learning opportunity afforded by the fixed domain and are equivalent to taking an open-book exam without studying. In contrast, our approach (c) RAFT leverages fine-tuning with question-answer pairs while referencing the documents in a simulated imperfect retrieval setting, thereby effectively preparing for the open-book exam setting.</p></figcaption></figure>

## <mark style="color:purple;">Detailed Explanation of RAFT Methodology</mark>

### <mark style="color:blue;">Introduction to RAFT</mark>

RAFT (Retrieval Augmented Fine-Tuning) is presented as a training method for Large Language Models (LLMs) specifically designed for domain-specific "open-book" scenarios.

The authors describe it as a way to prepare LLMs for specialised tasks where the model needs to effectively use external information to answer questions.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FkxOwpzcmcGWTIWyeBlMZ%2Fchrome_dtUJ2KKtVn.png?alt=media&#x26;token=05f3817f-91f4-49c1-ab69-2c721ad7ca80" alt=""><figcaption><p>Overview of the RAFT method. The top-left figure depicts the approach of adapting LLMs to read solutions from a set of positive and distractor documents, in contrast to the standard RAG setup where models are trained on retriever outputs, which is a mixture of both memorisation and reading. At test time, all methods follow the standard RAG setting, provided with the top-k retrieved documents in the context.</p></figcaption></figure>

### <mark style="color:purple;">Supervised Fine-Tuning (SFT) - The Traditional Approach</mark>

Before introducing RAFT, the paper explains the traditional <mark style="color:yellow;">Supervised Fine-Tuning (SFT)</mark> approach:

* **Dataset Structure**: SFT uses a dataset (D) containing pairs of Questions (Q) and Answers (A).
* **Training Process**: The model is trained to improve its ability to answer questions based on knowledge gained during pre-training or the SFT phase.
* **Usage Scenarios**:
  1. **0-shot Inference**: Q → A (answering without additional context)
  2. **RAG Inference**: Q + D → A (answering with additional documents provided)
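The two inference modes can be sketched as simple prompt construction. This is a minimal illustration, not code from the paper; the function name and template strings are assumptions.

```python
def build_prompt(question, documents=None):
    """Build either a 0-shot prompt (Q -> A) or a RAG prompt (Q + D -> A)."""
    if not documents:
        # 0-shot inference: the model must answer from parametric knowledge alone.
        return f"Question: {question}\nAnswer:"
    # RAG inference: retrieved documents are prepended as context.
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

zero_shot = build_prompt("What does RAFT stand for?")
rag = build_prompt("What does RAFT stand for?",
                   ["RAFT stands for Retrieval Augmented Fine-Tuning."])
```

The same fine-tuned model is used in both scenarios; only the presence of retrieved context in the prompt differs.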

### <mark style="color:purple;">RAFT Methodology</mark>

RAFT *<mark style="color:yellow;">modifies the traditional SFT approach</mark>* to better prepare models for domain-specific open-book settings:

#### <mark style="color:blue;">Data Preparation</mark>

* Each data point contains:
  * A question (Q)
  * A set of documents (Dk)
  * A Chain-of-Thought style answer (A\*)

#### <mark style="color:blue;">Document Types</mark>

1. <mark style="color:green;">**'Golden' Documents (D\*)**:</mark> Contain the information needed to answer the question.
2. <mark style="color:green;">**'Distractor' Documents (Di)**:</mark> Do not contain answer-relevant information.

#### <mark style="color:blue;">Training Data Structure</mark>

RAFT uses <mark style="color:yellow;">two types</mark> of training data:

1. For P% of questions:
   * Q + D\* + D1 + D2 + ... + Dk → A\* (Question + Golden Document + Distractor Documents → Answer)
2. For (1-P)% of questions:
   * Q + D1 + D2 + ... + Dk → A\* (Question + Only Distractor Documents → Answer)
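This data mix can be sketched in a few lines. The helper below is illustrative, not from the paper; the function name and the default `p_golden` value are assumptions (the paper tunes this fraction per dataset).

```python
import random

def make_raft_example(question, golden_doc, distractors, answer,
                      p_golden=0.8, rng=random):
    """Assemble one RAFT training example.

    With probability p_golden the golden document is kept in the context
    (Q + D* + distractors -> A*); otherwise only distractors are shown
    (Q + distractors -> A*), which forces the model to internalise domain
    knowledge instead of relying purely on copying from context.
    """
    docs = list(distractors)
    if rng.random() < p_golden:
        docs.append(golden_doc)
    rng.shuffle(docs)  # don't let the golden document sit in a fixed position
    return {"question": question, "context": docs, "answer": answer}
```

Shuffling the context prevents the model from learning a positional shortcut for locating the golden document.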

#### <mark style="color:blue;">Training Process</mark>

* The model is fine-tuned using standard SFT techniques on this prepared data.
* By sometimes removing the golden documents, RAFT compels the model to memorise domain knowledge rather than rely solely on copying from context, and to learn to distinguish relevant from irrelevant information.

#### <mark style="color:blue;">Chain-of-Thought Reasoning</mark>

* RAFT incorporates Chain-of-Thought reasoning in the answers.
* This involves creating a full reasoning chain and citing sources from the context.
* Answers include:
  1. Citations from the original context (marked with ##begin\_quote## and ##end\_quote##)
  2. Detailed explanations on how to reach the conclusion based on the citations
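The `##begin_quote##` / `##end_quote##` markers make the cited spans machine-checkable. As a minimal sketch (this helper is not from the paper), one can extract the citations and verify each appears verbatim in the provided context:

```python
import re

# Matches text wrapped in the paper's citation markers; non-greedy so that
# multiple quotes in one answer are captured separately.
QUOTE_RE = re.compile(r"##begin_quote##(.*?)##end_quote##", re.DOTALL)

def extract_citations(cot_answer):
    """Return the verbatim citations embedded in a Chain-of-Thought answer."""
    return [m.strip() for m in QUOTE_RE.findall(cot_answer)]

answer = ("The context states ##begin_quote##RAFT trains with distractor "
          "documents##end_quote##, so the model learns to ignore irrelevant text.")
# extract_citations(answer) -> ["RAFT trains with distractor documents"]
```

Checking that every extracted quote occurs in the source documents is a cheap way to audit the training data (or model outputs) for hallucinated citations.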

### <mark style="color:purple;">Key Concepts and Technical Details</mark>

1. **Open-book Exam Analogy**: RAFT is likened to preparing for an open-book exam, where the model learns to recognise and use relevant information while ignoring distractors.
2. **In-domain RAG**: RAFT is designed to improve the model's performance specifically on the set of documents it's trained on, making it suitable for domain-specific applications.
3. **Retriever Independence**: RAFT is independent of the specific retrieval method used in the RAG pipeline.
4. **Balancing Memorisation and Derivation**: By including both scenarios (with and without golden documents), RAFT aims to balance the model's ability to memorise important information and derive answers from provided context.
5. **Source Citation**: The inclusion of direct quotes from the source documents in the answers helps the model learn to identify and use relevant information accurately.
6. **Flexibility in 'Golden' Documents**: The method allows for multiple documents to be considered 'golden' for more complex questions (e.g., in the HotpotQA dataset).

### <mark style="color:purple;">Expected Outcomes</mark>

The authors suggest that this approach:

* Enhances the model's accuracy in answering questions
* Improves the model's ability to reason and explain its answers
* Increases robustness in handling both relevant and irrelevant information

The subsequent sections of the paper are expected to provide experimental results demonstrating these outcomes across various datasets.

## <mark style="color:purple;">RAFT Evaluation and Results Summary</mark>

### <mark style="color:blue;">Evaluation Methodology</mark>

1. **Datasets Used:**
   * Natural Questions (NQ)
   * Trivia QA
   * HotpotQA
   * HuggingFace, Torch Hub, and TensorFlow Hub (from APIBench)
   * PubMed QA
2. **Baseline Models:**
   * LLaMA2-7B-chat model with 0-shot prompting
   * LLaMA2-7B-chat model with RAG
   * Domain-Specific Finetuning (DSF) with 0-shot prompting
   * Domain-Specific Finetuning with RAG (DSF + RAG)
3. **Evaluation Metrics:** The paper doesn't explicitly state the metrics, but it appears to use accuracy percentages for comparison.

### <mark style="color:blue;">Key Results</mark>

1. **Overall Performance:**
   * <mark style="color:yellow;">RAFT consistently outperformed all baselines across the datasets</mark>.
   * Significant improvements were observed compared to the base Llama-2 model and domain-specific fine-tuning.
2. **Specific Improvements:**
   * Hotpot QA: RAFT showed a 35.25% improvement over the base Llama-2 model.
   * Torch Hub: RAFT demonstrated a 76.35% improvement over the base Llama-2 model.
   * HuggingFace: RAFT outperformed DSF by 31.41%.
3. **Performance on PubMed QA:**
   * For binary yes/no questions, RAFT didn't show significant gains compared to DSF + RAG.
4. **Comparison with GPT-3.5:**
   * RAFT demonstrated significant advantages even when compared to the larger GPT-3.5 model.
5. **Chain-of-Thought (CoT) Impact:**
   * Incorporating CoT significantly improved performance:
     * Hotpot QA: 9.66% improvement
     * HuggingFace: 14.93% improvement
6. **Golden Context Ratio Study:**
   * The optimal proportion of training data including golden documents varied across datasets (40%, 60%, 100%).
   * Surprisingly, including some training data without golden documents (P < 100%) enhanced model performance on RAG tasks.

### Table of Results

<table><thead><tr><th width="195">Model</th><th width="101">PubMed</th><th width="97">HotPot</th><th width="138">HuggingFace</th><th width="109">Torch Hub</th><th>TensorFlow</th></tr></thead><tbody><tr><td>GPT-3.5 + RAG</td><td>71.60</td><td>41.5</td><td>29.08</td><td>60.21</td><td>65.59</td></tr><tr><td>LLaMA2-7B</td><td>56.5</td><td>0.54</td><td>0.22</td><td>0</td><td>0</td></tr><tr><td>LLaMA2-7B + RAG</td><td>58.8</td><td>0.03</td><td>26.43</td><td>8.60</td><td>43.06</td></tr><tr><td>DSF</td><td>59.7</td><td>6.38</td><td>61.06</td><td>84.94</td><td>86.56</td></tr><tr><td>DSF + RAG</td><td>71.6</td><td>4.41</td><td>42.59</td><td>82.80</td><td>60.29</td></tr><tr><td>RAFT (LLaMA2-7B)</td><td>73.30</td><td>35.28</td><td>74.00</td><td>84.95</td><td>86.86</td></tr></tbody></table>

### Key Takeaways

1. RAFT significantly improves RAG performance across various specialized domains.
2. The method enhances both the model's ability to extract information and its robustness towards distractors.
3. Chain-of-Thought reasoning substantially contributes to the model's performance.
4. Including some training data without golden documents can be beneficial for downstream RAG tasks.
5. RAFT outperforms larger models like GPT-3.5 in specific domain tasks.

## <mark style="color:purple;">RAFT Generalization to Top-K RAG</mark>

This section of the paper explores how RAFT (Retrieval Augmented Fine-Tuning) performs when faced with varying numbers of documents during test time, particularly in top-k RAG (Retrieval-Augmented Generation) scenarios. The researchers aim to address a critical challenge in LLM+RAG systems: the model's ability to handle irrelevant information effectively.

### Key Concepts

1. **Top-k RAG**: A technique where the k most relevant documents are retrieved and provided to the model during inference.
2. **Distractor Documents**: Irrelevant documents included alongside relevant ones during training or testing.
3. **Golden Documents**: Highly relevant documents that contain the information needed to answer a query.
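Top-k retrieval itself is simple to sketch: score every document embedding against the query embedding and keep the k highest-scoring ones. The toy vectors and cosine scoring below are illustrative assumptions; a real pipeline would use a trained encoder, and RAFT is agnostic to the retriever used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# top_k([1.0, 0.05], docs, k=2) -> [0, 1]
```

Because the retriever ranks by similarity rather than ground truth, the top-k set routinely mixes golden and distractor documents, which is exactly the situation RAFT trains for.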

### <mark style="color:blue;">The Challenge</mark>

Large Language Models (LLMs) are known to be vulnerable to irrelevant text.

This vulnerability becomes particularly problematic in RAG systems, *<mark style="color:yellow;">where the retrieval process might introduce irrelevant information</mark>*. The goal is to make the model robust enough to discern and disregard irrelevant content while focusing on pertinent information.

### <mark style="color:blue;">RAFT's Approach</mark>

RAFT addresses this challenge by:

1. Training the model with a <mark style="color:yellow;">mix of golden (relevant) and distractor (irrelevant) documents</mark>.
2. Investigating the <mark style="color:yellow;">optimal ratio of distractor documents to include during training</mark>.
3. Assessing how well this training approach <mark style="color:yellow;">generalises to different volumes of documents encountered during testing</mark>.

### <mark style="color:blue;">Experimental Setup</mark>

The researchers conducted two main experiments:

1. **Training with Distractor Documents**:
   * Varied the number of distractor documents during training.
   * Consistently evaluated using the top-3 documents from the retriever.
2. **Generalization to Variable Test-Time Documents**:
   * Trained models with different numbers of distractor documents.
   * Tested these models with varying numbers of documents at test time.

### <mark style="color:blue;">Key Findings</mark>

1. **Importance of Distractor Documents in Training**:
   * Training with only golden documents often resulted in inferior performance.
   * Including distractor documents during training improved the model's ability to handle irrelevant information.
2. **Optimal Number of Training Documents**:
   * For Natural Questions (NQ): <mark style="color:yellow;">Best performance when training with golden document + 3 distractors</mark> (D\* + 3D).
   * For HotpotQA: Best performance when training with golden document + 1 distractor (D\* + 1D).
   * RAFT consistently used 1 golden document + 4 distractor documents in their experiments.
3. **Generalisation to Variable Test-Time Documents**:
   * Models trained with distractor documents showed more resilience to fluctuations in the number of test-time documents.
   * This demonstrates the robustness of the RAFT approach in real-world scenarios where the number of retrieved documents may vary.

### <mark style="color:blue;">Implications</mark>

1. **Improved Robustness**: RAFT's approach of including distractor documents during training enhances the model's ability to handle irrelevant information in real-world RAG applications.
2. **Flexibility**: The method allows for better generalization across different retrieval settings (e.g., top-3, top-5, top-10 RAG).
3. **Optimal Training Strategy**: The findings suggest that there's an optimal balance of golden and distractor documents during training, which may vary depending on the specific task or domain.
4. **Real-world Applicability**: By demonstrating robustness to varying numbers of test-time documents, RAFT shows promise for deployment in practical RAG systems where retrieval results may be inconsistent.

## <mark style="color:purple;">RAFT: Conclusion and Practical Applications</mark>

Retrieval Augmented Fine Tuning (RAFT) represents a significant advancement in training Large Language Models (LLMs) for domain-specific, open-book question answering tasks.

Key aspects of RAFT include:

1. Training with a mix of relevant (golden) and irrelevant (distractor) documents.
2. Structuring the dataset to sometimes exclude golden documents from the context.
3. Generating answers using a chain-of-thought approach with direct quotations from relevant text.

Evaluations on diverse datasets (PubMed, HotpotQA, Gorilla API Bench) demonstrate RAFT's superior performance compared to traditional fine-tuning methods and even larger models like GPT-3.5. RAFT shows particular strength in:

* Improving information extraction from domain-specific documents.
* Enhancing robustness against irrelevant information.
* Generalizing well to varying numbers of retrieved documents during inference.

These capabilities position RAFT as a promising approach for enhancing LLM performance in Retrieval Augmented Generation (RAG) systems across various specialized domains.

### <mark style="color:purple;">Practical Applications and Commercial Ideas</mark>

1. **Enhanced Medical Literature Review**
   * Application: Assist medical researchers in quickly finding relevant information from vast medical literature.
   * Commercial Idea: Develop a subscription-based platform for healthcare professionals and researchers, offering rapid, accurate insights from medical journals and clinical trial data.
2. **Legal Document Analysis**
   * Application: Improve efficiency in legal research and case preparation.
   * Commercial Idea: Create a RAFT-powered legal assistant tool for law firms, offering faster contract analysis, case law research, and legal precedent identification.
3. **Intelligent Technical Support Systems**
   * Application: Enhance customer support in technical fields (e.g., software, electronics).
   * Commercial Idea: Develop an AI-powered technical support platform that can accurately answer complex product-specific queries by referencing vast product documentation and user manuals.
4. **Personalised Educational Assistant**
   * Application: Provide tailored explanations and answers across various academic subjects.
   * Commercial Idea: Create an adaptive learning platform that uses RAFT to offer personalised tutoring and homework assistance, drawing from textbooks and educational resources.
5. **Financial Analysis and Research Tool**
   * Application: Assist in financial research, market analysis, and investment decisions.
   * Commercial Idea: Develop a RAFT-based financial assistant for investment firms and individual investors, offering insights from financial reports, market data, and news articles.
6. **Enhanced Content Management Systems**
   * Application: Improve content creation and curation in large organizations.
   * Commercial Idea: Create an intelligent content management system that can answer queries about internal documents, policies, and procedures, aiding in knowledge management and employee onboarding.
7. **Sophisticated Customer Service Chatbots**
   * Application: Enhance customer service with more accurate and context-aware responses.
   * Commercial Idea: Offer a RAFT-powered chatbot service that can handle complex customer inquiries by referencing extensive product catalogs, FAQs, and policy documents.
8. **Scientific Literature Assistant**
   * Application: Aid researchers in navigating and synthesizing information from scientific papers.
   * Commercial Idea: Develop a research tool for academic institutions and R\&D departments that can answer complex scientific questions by analyzing and synthesising information from multiple research papers.
9. **Intelligent Documentation for Software Development**
   * Application: Improve code documentation and API understanding for developers.
   * Commercial Idea: Create a RAFT-based coding assistant that can answer queries about complex codebases, APIs, and frameworks by referencing extensive documentation and code repositories.
10. **Regulatory Compliance Assistant**
    * Application: Help businesses navigate complex regulatory environments.
    * Commercial Idea: Develop a compliance tool for industries with strict regulations (e.g., finance, healthcare), offering up-to-date guidance on regulatory requirements by analyzing vast amounts of legal and regulatory documents.

These applications leverage RAFT's ability to process domain-specific information accurately and its robustness in handling varying amounts of context, making it valuable across diverse industries and use cases.
