# Fine-Tuning Llama for Multi-Stage Text Retrieval

This <mark style="color:blue;">**October 2023**</mark> paper explores the use of large language models (LLMs) in multi-stage text retrieval, specifically focusing on fine-tuning the LLaMA model as both a retriever (RepLLaMA) and a reranker (RankLLaMA).

The authors conduct experiments on the MS MARCO passage and document ranking datasets, as well as the BEIR benchmark, to evaluate the effectiveness of their approach.

{% embed url="https://arxiv.org/abs/2310.08319" %}
Fine-Tuning Llama for Multi-Stage Text Retrieval
{% endembed %}

### <mark style="color:purple;">Introduction and Motivation</mark>

The authors highlight the importance of text retrieval in various open-domain language comprehension tasks and in enhancing the effectiveness of LLMs through retrieval-augmented generation (RAG).

They argue that *<mark style="color:yellow;">**fine-tuning state-of-the-art LLMs as retrievers and rerankers can yield better effectiveness than previous smaller models**</mark>* and allows LLMs to be used optimally within multi-stage retrieval pipelines.

### <mark style="color:green;">Method</mark>

<mark style="color:blue;">**RepLLaMA**</mark><mark style="color:blue;">:</mark> A dense retriever model that follows the bi-encoder architecture, initialised with the LLaMA model. It encodes queries and documents into vector representations and computes relevance scores using dot products.
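The bi-encoder scoring step can be sketched as follows. Note that `encode` here is a hypothetical stand-in for RepLLaMA's LLaMA-based encoder (which, per the paper, derives a text embedding from the model's hidden states); only the dot-product scoring mirrors the described method.

```python
import hashlib

def encode(text, dim=8):
    """Hypothetical stand-in for the LLaMA-based encoder: maps text to a
    fixed-size vector (here derived deterministically from a hash)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def relevance(query_vec, doc_vec):
    """Relevance score as the dot product of query and document embeddings."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

query_vec = encode("how do dense retrievers work?")
doc_vec = encode("Dense retrievers embed queries and passages into vectors.")
score = relevance(query_vec, doc_vec)
```

In a real bi-encoder, the document vectors are precomputed and indexed, so query-time retrieval reduces to a nearest-neighbour search over dot products.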

<mark style="color:blue;">**RankLLaMA:**</mark> A pointwise reranker model that takes a query and a candidate document together as input and generates a relevance score. It is also initialised with the LLaMA model.
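Pointwise reranking itself is simple to sketch; `overlap_score` below is a toy stand-in for RankLLaMA's learned scorer, which in the paper feeds the concatenated query and candidate document through the fine-tuned model.

```python
def rerank(query, candidates, score_fn):
    """Pointwise reranking: score each (query, candidate) pair independently,
    then sort candidates by descending relevance score."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)

def overlap_score(query, doc):
    """Toy stand-in for the learned scorer: word-overlap count."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = ["a survey of sparse retrieval", "fine-tuning llama for retrieval"]
ranked = rerank("fine-tuning llama", candidates, overlap_score)
# ranked[0] is "fine-tuning llama for retrieval"
```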

Both models are *<mark style="color:yellow;">**optimised using contrastive loss,**</mark>* with hard negatives sampled from top-ranking results of existing retrieval systems or the RepLLaMA model.
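The contrastive objective can be sketched as an InfoNCE-style loss over one positive document and its hard negatives. This is a minimal sketch: in training, the scores come from the model, not the placeholder numbers used here.

```python
import math

def contrastive_loss(pos_score, neg_scores):
    """InfoNCE-style loss: negative log-probability of the positive document
    under a softmax over the positive and hard-negative scores."""
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

# A well-separated positive yields a small loss...
easy = contrastive_loss(5.0, [0.2, -0.1, 0.4])
# ...while a positive indistinguishable from the negatives yields a larger one.
hard = contrastive_loss(0.3, [0.2, -0.1, 0.4])
```

Minimising this loss pushes the positive's score above the hard negatives', which is why sampling negatives from strong existing retrievers makes the objective more discriminative than random negatives would.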

### <mark style="color:green;">Experiments</mark>

<mark style="color:blue;">**Passage Retrieval**</mark>

The authors train RepLLaMA and RankLLaMA on the MS MARCO passage ranking dataset and evaluate their effectiveness on the development split and TREC DL19/DL20 passage ranking test collections. They also assess the zero-shot effectiveness of the models on the BEIR benchmark.

<mark style="color:blue;">**Document Retrieval**</mark>

The authors train the models on the MS MARCO document ranking dataset, which presents the challenge of handling long input sequences. They compare their approach with existing methods that rely on document segmentation and rank score aggregation.
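The segmentation-and-aggregation baseline the paper compares against can be sketched as below (a minimal sketch of MaxP-style aggregation with assumed window and stride sizes; RepLLaMA and RankLLaMA instead process the long document directly).

```python
def segment(tokens, window=512, stride=256):
    """Split a long token sequence into overlapping fixed-size passages."""
    passages, start = [], 0
    while True:
        passages.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already covers the tail of the document
        start += stride
    return passages

def aggregate_maxp(segment_scores):
    """MaxP aggregation: the document's score is its best segment's score."""
    return max(segment_scores)

doc = list(range(1000))  # stand-in for a tokenised long document
passages = segment(doc, window=512, stride=256)
```

Scoring every segment multiplies retrieval cost and can miss relevance signals that span segment boundaries, which is the motivation for models that accept the full document as one input.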

### <mark style="color:green;">Results</mark>

<mark style="color:blue;">**In-Domain Evaluation:**</mark> RepLLaMA and RankLLaMA outperform existing methods on the MS MARCO passage and document ranking datasets, achieving state-of-the-art effectiveness.

<mark style="color:blue;">**Zero-Shot Evaluation:**</mark> Both models demonstrate superior zero-shot effectiveness on the BEIR datasets, surpassing existing dense retrievers with billions of parameters.

### <mark style="color:green;">Ablation Study and Analysis</mark>

<mark style="color:blue;">**Full Fine-Tuning vs. LoRA:**</mark> The authors compare the effectiveness of full fine-tuning and LoRA (a parameter-efficient method) when training RepLLaMA. They find that LoRA generalises better on independent human judgments, despite full fine-tuning achieving higher scores on the training set.

<mark style="color:blue;">**Input Sequence Length:**</mark> The authors investigate the effects of varying the maximum training and inference input lengths on RankLLaMA's effectiveness for document reranking. They observe that effectiveness improves as the input length increases, but the gains plateau beyond a certain length, suggesting a point of diminishing returns.

### <mark style="color:green;">Related Work</mark>

The authors discuss the advancements in pre-trained language models, particularly the development of large decoder-only models like GPT and LLaMA.

They review the evolution of multi-stage text retrieval pipelines, highlighting the impact of pre-trained language models on retrievers and rerankers.

The authors compare their approach with recent methods that prompt LLMs for text reranking in a generative manner, emphasising the advantages of their fully optimised and efficient multi-stage retrieval system.

### <mark style="color:purple;">Conclusion</mark>

The study demonstrates the potential of fine-tuning LLMs as dense retrievers and pointwise rerankers, establishing a state-of-the-art multi-stage retrieval system that outperforms pipelines built on smaller models.

The authors underscore the potential of leveraging LLMs for retrieval tasks in the future and express their intention to continue exploring this direction.

Overall, this paper presents a comprehensive investigation into the use of LLMs in multi-stage text retrieval, showcasing the effectiveness of fine-tuning LLaMA as both a retriever and a reranker. The authors' approach achieves state-of-the-art results on various datasets and offers insights into the potential and challenges of leveraging LLMs for retrieval tasks.
