# RAFT: Adapting Language Model to Domain Specific RAG

## <mark style="color:purple;">Introduction to "Retrieval Augmented Fine Tuning" (RAFT)</mark>

This <mark style="color:blue;">**March 2024**</mark> paper introduces a novel approach called Retrieval Augmented Fine Tuning (RAFT), which aims to improve the performance of Large Language Models (LLMs) on domain-specific retrieval-augmented generation (RAG) tasks.

The authors present RAFT as a training methodology that enhances a model's ability to answer questions in "open-book" in-domain settings.

### <mark style="color:purple;">Key Points</mark>

1. **Problem Addressed**: The paper tackles the challenge of adapting pre-trained LLMs for specialized domains where accuracy based on a given set of documents is crucial.
2. **Current Limitations**: Existing methods like in-context learning through RAG and supervised fine-tuning have limitations. RAG-based methods don't leverage the learning opportunity in fixed domains, while fine-tuning approaches often fail to account for imperfections in the retrieval process.
3. **RAFT Solution**: The proposed method combines instruction fine-tuning with retrieval augmented generation. It trains the model to:
   * Incorporate domain knowledge
   * Improve in-domain RAG performance
   * Identify and use relevant documents while ignoring distractors
4. **Training Process**: RAFT trains the model to answer questions using relevant documents while also presenting it with distractor documents. This process improves the model's ability to reason and cite relevant information.
5. **Performance**: The authors report that RAFT consistently outperforms supervised fine-tuning (with and without RAG) across multiple datasets, including PubMed, HotpotQA, and Gorilla.
6. **Analogy**: The authors liken their approach to studying for an open-book exam, where the model learns to recognize relevant and irrelevant retrieved documents.

{% embed url="https://arxiv.org/abs/2403.10131" %}
RAFT: Adapting Language Model to Domain Specific RAG
{% endembed %}

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FE4G5eT2SG1OeTHnkIoGR%2Fchrome_3eFl4JGIMm.png?alt=media&#x26;token=a4cf3b23-5cf3-411f-9bc1-87cae9522027" alt=""><figcaption><p>How best to prepare for an exam? (a) Fine-tuning based approaches implement "studying" by either directly "memorising" the input documents or answering practice QA without referencing the documents. (b) Alternatively, in-context retrieval methods fail to leverage the learning opportunity afforded by the fixed domain and are equivalent to taking an open-book exam without studying. In contrast, our approach (c) RAFT leverages fine-tuning with question-answer pairs while referencing the documents in a simulated imperfect retrieval setting, thereby effectively preparing for the open-book exam setting.</p></figcaption></figure>

## <mark style="color:purple;">Detailed Explanation of RAFT Methodology</mark>

### <mark style="color:blue;">Introduction to RAFT</mark>

RAFT (Retrieval Augmented Fine-Tuning) is presented as a training method for Large Language Models (LLMs) specifically designed for domain-specific "open-book" scenarios.

The authors describe it as a way to prepare LLMs for specialised tasks where the model needs to effectively use external information to answer questions.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FkxOwpzcmcGWTIWyeBlMZ%2Fchrome_dtUJ2KKtVn.png?alt=media&#x26;token=05f3817f-91f4-49c1-ab69-2c721ad7ca80" alt=""><figcaption><p>Overview of the RAFT method. The top-left figure depicts the approach of adapting LLMs to read solutions from a set of positive and distractor documents, in contrast to the standard RAG setup where models are trained on retriever outputs, which is a mixture of both memorisation and reading. At test time, all methods follow the standard RAG setting, provided with the top-k retrieved documents in the context.</p></figcaption></figure>

### <mark style="color:purple;">Supervised Fine-Tuning (SFT) - The Traditional Approach</mark>

Before introducing RAFT, the paper explains the traditional <mark style="color:yellow;">Supervised Fine-Tuning (SFT)</mark> approach:

* **Dataset Structure**: SFT uses a dataset (D) containing pairs of Questions (Q) and Answers (A).
* **Training Process**: The model is trained to improve its ability to answer questions based on knowledge gained during pre-training or the SFT phase.
* **Usage Scenarios**:
  1. **0-shot Inference**: Q → A (answering without additional context)
  2. **RAG Inference**: Q + D → A (answering with additional documents provided)
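The two inference modes can be sketched as simple prompt construction. This is a minimal illustration, not code from the paper; the function name and template strings are assumptions.

```python
def build_prompt(question, documents=None):
    """Build either a 0-shot prompt (Q -> A) or a RAG prompt (Q + D -> A)."""
    if not documents:
        # 0-shot inference: the model must answer from parametric knowledge alone.
        return f"Question: {question}\nAnswer:"
    # RAG inference: retrieved documents are prepended as context.
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

zero_shot = build_prompt("What does RAFT stand for?")
rag = build_prompt("What does RAFT stand for?",
                   ["RAFT stands for Retrieval Augmented Fine-Tuning."])
```

The same fine-tuned model is used in both scenarios; only the presence of retrieved context in the prompt differs.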

### <mark style="color:purple;">RAFT Methodology</mark>

RAFT *<mark style="color:yellow;">modifies the traditional SFT approach</mark>* to better prepare models for domain-specific open-book settings:

#### <mark style="color:blue;">Data Preparation</mark>

* Each data point contains:
  * A question (Q)
  * A set of documents (Dk)
  * A Chain-of-Thought style answer (A\*)

#### <mark style="color:blue;">Document Types</mark>

1. <mark style="color:green;">**'Golden' Documents (D\*)**:</mark> Contain the information needed to answer the question.
2. <mark style="color:green;">**'Distractor' Documents (Di)**:</mark> Do not contain answer-relevant information.

#### <mark style="color:blue;">Training Data Structure</mark>

RAFT uses <mark style="color:yellow;">two types</mark> of training data:

1. For P% of questions:
   * Q + D\* + D1 + D2 + ... + Dk → A\* (Question + Golden Document + Distractor Documents → Answer)
2. For (1-P)% of questions:
   * Q + D1 + D2 + ... + Dk → A\* (Question + Only Distractor Documents → Answer)
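This data mix can be sketched in a few lines. The helper below is illustrative, not from the paper; the function name and the default `p_golden` value are assumptions (the paper tunes this fraction per dataset).

```python
import random

def make_raft_example(question, golden_doc, distractors, answer,
                      p_golden=0.8, rng=random):
    """Assemble one RAFT training example.

    With probability p_golden the golden document is kept in the context
    (Q + D* + distractors -> A*); otherwise only distractors are shown
    (Q + distractors -> A*), which forces the model to internalise domain
    knowledge instead of relying purely on copying from context.
    """
    docs = list(distractors)
    if rng.random() < p_golden:
        docs.append(golden_doc)
    rng.shuffle(docs)  # don't let the golden document sit in a fixed position
    return {"question": question, "context": docs, "answer": answer}
```

Shuffling the context prevents the model from learning a positional shortcut for locating the golden document.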

#### <mark style="color:blue;">Training Process</mark>

* The model is fine-tuned using standard SFT techniques on this prepared data.
* By sometimes removing the golden documents, RAFT compels the model to memorise domain knowledge rather than rely solely on copying from context, and to learn to distinguish relevant from irrelevant information.

#### <mark style="color:blue;">Chain-of-Thought Reasoning</mark>

* RAFT incorporates Chain-of-Thought reasoning in the answers.
* This involves creating a full reasoning chain and citing sources from the context.
* Answers include:
  1. Citations from the original context (marked with ##begin\_quote## and ##end\_quote##)
  2. Detailed explanations on how to reach the conclusion based on the citations
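The `##begin_quote##` / `##end_quote##` markers make the cited spans machine-checkable. As a minimal sketch (this helper is not from the paper), one can extract the citations and verify each appears verbatim in the provided context:

```python
import re

# Matches text wrapped in the paper's citation markers; non-greedy so that
# multiple quotes in one answer are captured separately.
QUOTE_RE = re.compile(r"##begin_quote##(.*?)##end_quote##", re.DOTALL)

def extract_citations(cot_answer):
    """Return the verbatim citations embedded in a Chain-of-Thought answer."""
    return [m.strip() for m in QUOTE_RE.findall(cot_answer)]

answer = ("The context states ##begin_quote##RAFT trains with distractor "
          "documents##end_quote##, so the model learns to ignore irrelevant text.")
# extract_citations(answer) -> ["RAFT trains with distractor documents"]
```

Checking that every extracted quote occurs in the source documents is a cheap way to audit the training data (or model outputs) for hallucinated citations.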

### <mark style="color:purple;">Key Concepts and Technical Details</mark>

1. **Open-book Exam Analogy**: RAFT is likened to preparing for an open-book exam, where the model learns to recognise and use relevant information while ignoring distractors.
2. **In-domain RAG**: RAFT is designed to improve the model's performance specifically on the set of documents it's trained on, making it suitable for domain-specific applications.
3. **Retriever Independence**: RAFT is independent of the specific retrieval method used in the RAG pipeline.
4. **Balancing Memorisation and Derivation**: By including both scenarios (with and without golden documents), RAFT aims to balance the model's ability to memorise important information and derive answers from provided context.
5. **Source Citation**: The inclusion of direct quotes from the source documents in the answers helps the model learn to identify and use relevant information accurately.
6. **Flexibility in 'Golden' Documents**: The method allows for multiple documents to be considered 'golden' for more complex questions (e.g., in the HotpotQA dataset).

### <mark style="color:purple;">Expected Outcomes</mark>

The authors suggest that this approach:

* Enhances the model's accuracy in answering questions
* Improves the model's ability to reason and explain its answers
* Increases robustness in handling both relevant and irrelevant information

The subsequent sections of the paper are expected to provide experimental results demonstrating these outcomes across various datasets.

## <mark style="color:purple;">RAFT Evaluation and Results Summary</mark>

### <mark style="color:blue;">Evaluation Methodology</mark>

1. **Datasets Used:**
   * Natural Questions (NQ)
   * Trivia QA
   * HotpotQA
   * HuggingFace, Torch Hub, and TensorFlow Hub (from APIBench)
   * PubMed QA
2. **Baseline Models:**
   * LLaMA2-7B-chat model with 0-shot prompting
   * LLaMA2-7B-chat model with RAG
   * Domain-Specific Finetuning (DSF) with 0-shot prompting
   * Domain-Specific Finetuning with RAG (DSF + RAG)
3. **Evaluation Metrics:** The paper doesn't explicitly state the metrics, but it appears to use accuracy percentages for comparison.

### <mark style="color:blue;">Key Results</mark>

1. **Overall Performance:**
   * <mark style="color:yellow;">RAFT consistently outperformed all baselines across the datasets</mark>.
   * Significant improvements were observed compared to the base Llama-2 model and domain-specific fine-tuning.
2. **Specific Improvements:**
   * Hotpot QA: RAFT showed a 35.25% improvement over the base Llama-2 model.
   * Torch Hub: RAFT demonstrated a 76.35% improvement over the base Llama-2 model.
   * HuggingFace: RAFT outperformed DSF by 31.41%.
3. **Performance on PubMed QA:**
   * For binary yes/no questions, RAFT didn't show significant gains compared to DSF + RAG.
4. **Comparison with GPT-3.5:**
   * RAFT demonstrated significant advantages even when compared to the larger GPT-3.5 model.
5. **Chain-of-Thought (CoT) Impact:**
   * Incorporating CoT significantly improved performance:
     * Hotpot QA: 9.66% improvement
     * HuggingFace: 14.93% improvement
6. **Golden Context Ratio Study:**
   * The optimal proportion of training data including golden documents varied across datasets (40%, 60%, 100%).
   * Surprisingly, including some training data without golden documents (P < 100%) enhanced model performance on RAG tasks.

### Table of Results

<table><thead><tr><th width="195">Model</th><th width="101">PubMed</th><th width="97">HotPot</th><th width="138">HuggingFace</th><th width="109">Torch Hub</th><th>TensorFlow</th></tr></thead><tbody><tr><td>GPT-3.5 + RAG</td><td>71.60</td><td>41.5</td><td>29.08</td><td>60.21</td><td>65.59</td></tr><tr><td>LLaMA2-7B</td><td>56.5</td><td>0.54</td><td>0.22</td><td>0</td><td>0</td></tr><tr><td>LLaMA2-7B + RAG</td><td>58.8</td><td>0.03</td><td>26.43</td><td>8.60</td><td>43.06</td></tr><tr><td>DSF</td><td>59.7</td><td>6.38</td><td>61.06</td><td>84.94</td><td>86.56</td></tr><tr><td>DSF + RAG</td><td>71.6</td><td>4.41</td><td>42.59</td><td>82.80</td><td>60.29</td></tr><tr><td>RAFT (LLaMA2-7B)</td><td>73.30</td><td>35.28</td><td>74.00</td><td>84.95</td><td>86.86</td></tr></tbody></table>

### Key Takeaways

1. RAFT significantly improves RAG performance across various specialized domains.
2. The method enhances both the model's ability to extract information and its robustness towards distractors.
3. Chain-of-Thought reasoning substantially contributes to the model's performance.
4. Including some training data without golden documents can be beneficial for downstream RAG tasks.
5. RAFT outperforms larger models like GPT-3.5 in specific domain tasks.

## <mark style="color:purple;">RAFT Generalization to Top-K RAG</mark>

This section of the paper explores how RAFT (Retrieval Augmented Fine-Tuning) performs when faced with varying numbers of documents during test time, particularly in top-k RAG (Retrieval-Augmented Generation) scenarios. The researchers aim to address a critical challenge in LLM+RAG systems: the model's ability to handle irrelevant information effectively.

### Key Concepts

1. **Top-k RAG**: A technique where the k most relevant documents are retrieved and provided to the model during inference.
2. **Distractor Documents**: Irrelevant documents included alongside relevant ones during training or testing.
3. **Golden Documents**: Highly relevant documents that contain the information needed to answer a query.
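Top-k retrieval itself is simple to sketch: score every document embedding against the query embedding and keep the k highest-scoring ones. The toy vectors and cosine scoring below are illustrative assumptions; a real pipeline would use a trained encoder, and RAFT is agnostic to the retriever used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# top_k([1.0, 0.05], docs, k=2) -> [0, 1]
```

Because the retriever ranks by similarity rather than ground truth, the top-k set routinely mixes golden and distractor documents, which is exactly the situation RAFT trains for.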

### <mark style="color:blue;">The Challenge</mark>

Large Language Models (LLMs) are known to be vulnerable to irrelevant text.

This vulnerability becomes particularly problematic in RAG systems, *<mark style="color:yellow;">where the retrieval process might introduce irrelevant information</mark>*. The goal is to make the model robust enough to discern and disregard irrelevant content while focusing on pertinent information.

### <mark style="color:blue;">RAFT's Approach</mark>

RAFT addresses this challenge by:

1. Training the model with a <mark style="color:yellow;">mix of golden (relevant) and distractor (irrelevant) documents</mark>.
2. Investigating the <mark style="color:yellow;">optimal ratio of distractor documents to include during training</mark>.
3. Assessing how well this training approach <mark style="color:yellow;">generalises to different volumes of documents encountered during testing</mark>.

### <mark style="color:blue;">Experimental Setup</mark>

The researchers conducted two main experiments:

1. **Training with Distractor Documents**:
   * Varied the number of distractor documents during training.
   * Consistently evaluated using the top-3 documents from the retriever.
2. **Generalization to Variable Test-Time Documents**:
   * Trained models with different numbers of distractor documents.
   * Tested these models with varying numbers of documents at test time.

### <mark style="color:blue;">Key Findings</mark>

1. **Importance of Distractor Documents in Training**:
   * Training with only golden documents often resulted in inferior performance.
   * Including distractor documents during training improved the model's ability to handle irrelevant information.
2. **Optimal Number of Training Documents**:
   * For Natural Questions (NQ): <mark style="color:yellow;">Best performance when training with golden document + 3 distractors</mark> (D\* + 3D).
   * For HotpotQA: Best performance when training with golden document + 1 distractor (D\* + 1D).
   * RAFT consistently used 1 golden document + 4 distractor documents in their experiments.
3. **Generalisation to Variable Test-Time Documents**:
   * Models trained with distractor documents showed more resilience to fluctuations in the number of test-time documents.
   * This demonstrates the robustness of the RAFT approach in real-world scenarios where the number of retrieved documents may vary.

### <mark style="color:blue;">Implications</mark>

1. **Improved Robustness**: RAFT's approach of including distractor documents during training enhances the model's ability to handle irrelevant information in real-world RAG applications.
2. **Flexibility**: The method allows for better generalization across different retrieval settings (e.g., top-3, top-5, top-10 RAG).
3. **Optimal Training Strategy**: The findings suggest that there's an optimal balance of golden and distractor documents during training, which may vary depending on the specific task or domain.
4. **Real-world Applicability**: By demonstrating robustness to varying numbers of test-time documents, RAFT shows promise for deployment in practical RAG systems where retrieval results may be inconsistent.

## <mark style="color:purple;">RAFT: Conclusion and Practical Applications</mark>

Retrieval Augmented Fine Tuning (RAFT) represents a significant advancement in training Large Language Models (LLMs) for domain-specific, open-book question answering tasks.

Key aspects of RAFT include:

1. Training with a mix of relevant (golden) and irrelevant (distractor) documents.
2. Structuring the dataset to sometimes exclude golden documents from the context.
3. Generating answers using a chain-of-thought approach with direct quotations from relevant text.

Evaluations on diverse datasets (PubMed, HotpotQA, Gorilla API Bench) demonstrate RAFT's superior performance compared to traditional fine-tuning methods and even larger models like GPT-3.5. RAFT shows particular strength in:

* Improving information extraction from domain-specific documents.
* Enhancing robustness against irrelevant information.
* Generalizing well to varying numbers of retrieved documents during inference.

These capabilities position RAFT as a promising approach for enhancing LLM performance in Retrieval Augmented Generation (RAG) systems across various specialized domains.

### <mark style="color:purple;">Practical Applications and Commercial Ideas</mark>

1. **Enhanced Medical Literature Review**
   * Application: Assist medical researchers in quickly finding relevant information from vast medical literature.
   * Commercial Idea: Develop a subscription-based platform for healthcare professionals and researchers, offering rapid, accurate insights from medical journals and clinical trial data.
2. **Legal Document Analysis**
   * Application: Improve efficiency in legal research and case preparation.
   * Commercial Idea: Create a RAFT-powered legal assistant tool for law firms, offering faster contract analysis, case law research, and legal precedent identification.
3. **Intelligent Technical Support Systems**
   * Application: Enhance customer support in technical fields (e.g., software, electronics).
   * Commercial Idea: Develop an AI-powered technical support platform that can accurately answer complex product-specific queries by referencing vast product documentation and user manuals.
4. **Personalised Educational Assistant**
   * Application: Provide tailored explanations and answers across various academic subjects.
   * Commercial Idea: Create an adaptive learning platform that uses RAFT to offer personalised tutoring and homework assistance, drawing from textbooks and educational resources.
5. **Financial Analysis and Research Tool**
   * Application: Assist in financial research, market analysis, and investment decisions.
   * Commercial Idea: Develop a RAFT-based financial assistant for investment firms and individual investors, offering insights from financial reports, market data, and news articles.
6. **Enhanced Content Management Systems**
   * Application: Improve content creation and curation in large organizations.
   * Commercial Idea: Create an intelligent content management system that can answer queries about internal documents, policies, and procedures, aiding in knowledge management and employee onboarding.
7. **Sophisticated Customer Service Chatbots**
   * Application: Enhance customer service with more accurate and context-aware responses.
   * Commercial Idea: Offer a RAFT-powered chatbot service that can handle complex customer inquiries by referencing extensive product catalogs, FAQs, and policy documents.
8. **Scientific Literature Assistant**
   * Application: Aid researchers in navigating and synthesizing information from scientific papers.
   * Commercial Idea: Develop a research tool for academic institutions and R\&D departments that can answer complex scientific questions by analyzing and synthesising information from multiple research papers.
9. **Intelligent Documentation for Software Development**
   * Application: Improve code documentation and API understanding for developers.
   * Commercial Idea: Create a RAFT-based coding assistant that can answer queries about complex codebases, APIs, and frameworks by referencing extensive documentation and code repositories.
10. **Regulatory Compliance Assistant**
    * Application: Help businesses navigate complex regulatory environments.
    * Commercial Idea: Develop a compliance tool for industries with strict regulations (e.g., finance, healthcare), offering up-to-date guidance on regulatory requirements by analyzing vast amounts of legal and regulatory documents.

These applications leverage RAFT's ability to process domain-specific information accurately and its robustness in handling varying amounts of context, making it valuable across diverse industries and use cases.
