Fine-Tuning Llama for Multi-Stage Text Retrieval
Microsoft Research
Last updated
Copyright Continuum Labs - 2023
This October 2023 paper explores the use of large language models (LLMs) in multi-stage text retrieval, specifically focusing on fine-tuning the LLaMA model as both a retriever (RepLLaMA) and a reranker (RankLLaMA).
The authors conduct experiments on the MS MARCO passage and document ranking datasets, as well as the BEIR benchmark, to evaluate the effectiveness of their approach.
The authors highlight the importance of text retrieval in various open-domain language comprehension tasks and in enhancing the effectiveness of LLMs through retrieval-augmented generation (RAG).
They argue that fine-tuning state-of-the-art LLMs as retrievers and rerankers can be more effective than previous smaller models, and that doing so makes fuller use of LLMs within multi-stage retrieval pipelines.
RepLLaMA: A dense retriever model that follows the bi-encoder architecture, initialised with the LLaMA model. It encodes queries and documents into vector representations and computes relevance scores using dot products.
RankLLaMA: A pointwise reranker model that takes a query and a candidate document together as input and generates a relevance score. It is also initialised with the LLaMA model.
Both models are optimised with a contrastive loss. Hard negatives are sampled from the top-ranking results of existing retrieval systems (for RepLLaMA) or of the RepLLaMA model itself (for RankLLaMA).
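To make the scoring and training objective concrete, here is a minimal sketch of dot-product relevance scoring and an InfoNCE-style contrastive loss for a single query. It is not the authors' implementation: the function names and the toy three-dimensional vectors are made up for illustration, and in practice the vectors would come from the fine-tuned LLaMA encoder.

```python
import math

def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(query_vec, positive_vec, negative_vecs):
    """InfoNCE-style contrastive loss for one query (illustrative helper).

    Computes -log( exp(s_pos) / (exp(s_pos) + sum_i exp(s_neg_i)) ),
    where every score is a query-document dot product.
    """
    scores = [dot(query_vec, positive_vec)]
    scores += [dot(query_vec, n) for n in negative_vecs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]

# Toy embeddings (assumed values, not real model outputs).
q = [1.0, 0.0, 1.0]
pos = [0.9, 0.1, 0.8]                       # relevant document
negs = [[0.1, 0.9, 0.0], [-0.5, 0.2, 0.1]]  # hard negatives

loss_good = info_nce_loss(q, pos, negs)               # positive scores highest
loss_bad = info_nce_loss(q, negs[0], [pos, negs[1]])  # wrong document as positive
```

The loss is small when the positive document outscores the negatives and grows when a negative outscores it. The same loss form can be applied to a reranker by replacing the dot products with the scores the reranker produces for each query-document pair.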
Passage Retrieval
The authors train RepLLaMA and RankLLaMA on the MS MARCO passage ranking dataset and evaluate their effectiveness on the development split and TREC DL19/DL20 passage ranking test collections. They also assess the zero-shot effectiveness of the models on the BEIR benchmark.
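The two models are meant to be chained: the retriever narrows the corpus to a short candidate list, and the reranker rescores only those candidates. The sketch below shows that control flow under simplifying assumptions. The helper names, the toy vectors, and `toy_score` (a term-overlap stand-in for a learned pointwise reranker) are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec: Sequence[float],
             doc_vecs: Dict[str, Sequence[float]],
             k: int) -> List[str]:
    """First stage: rank all documents by dot product and keep the top k."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

def rerank(query: str,
           candidates: List[str],
           texts: Dict[str, str],
           score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Second stage: score each (query, document) pair and sort by score."""
    scored = [(doc_id, score_fn(query, texts[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def toy_score(query: str, doc: str) -> float:
    """Stand-in for a pointwise reranker: fraction of query terms found in doc."""
    q_terms = query.lower().split()
    d_terms = set(doc.lower().split())
    return sum(t in d_terms for t in q_terms) / len(q_terms)

texts = {
    "d1": "llama is a large language model",
    "d2": "a llama is a domesticated animal",
    "d3": "dense retrieval encodes text into vectors",
}
# Assumed embeddings, chosen so the first stage ranks d2 above d1.
doc_vecs = {"d1": [0.8, 0.1], "d2": [0.9, 0.3], "d3": [0.1, 0.9]}

candidates = retrieve([1.0, 0.0], doc_vecs, k=2)
reranked = rerank("llama language model", candidates, texts, toy_score)
```

In this toy run the first stage returns `d2` ahead of `d1`, and the second stage reverses that order because it looks at the query and document text together. The expensive scorer only ever sees `k` documents, which is the main reason for splitting retrieval into stages.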
Document Retrieval
The authors train the models on the MS MARCO document ranking dataset, which presents the challenge of handling long input sequences. They compare their approach with existing methods that rely on document segmentation and rank score aggregation.
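For context, the segmentation-and-aggregation approach can be sketched as follows: split a long document into overlapping windows, score each window against the query, and combine the window scores into one document score. This is an illustrative sketch only. Taking the maximum is one common aggregation choice and is an assumption here, as are the window size, stride, and the term-counting `toy_score` function.

```python
from typing import Callable, List

def segment(words: List[str], window: int, stride: int) -> List[List[str]]:
    """Split a word list into overlapping windows of at most `window` words."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    segments = []
    start = 0
    while True:
        segments.append(words[start:start + window])
        if start + window >= len(words):
            break
        start += stride
    return segments

def score_document(query: str,
                   document: str,
                   score_fn: Callable[[str, str], float],
                   window: int = 4,
                   stride: int = 2,
                   aggregate: Callable[[List[float]], float] = max) -> float:
    """Score every segment and aggregate the segment scores (max by default)."""
    passages = [" ".join(seg) for seg in segment(document.split(), window, stride)]
    return aggregate([score_fn(query, p) for p in passages])

def toy_score(query: str, passage: str) -> float:
    """Hypothetical scorer: number of query terms that appear in the passage."""
    terms = set(passage.lower().split())
    return float(sum(t in terms for t in query.lower().split()))

doc = "intro text about nothing relevant then llama retrieval models appear near the end"
windows = segment(doc.split(), window=4, stride=2)
doc_score = score_document("llama retrieval", doc, toy_score)
first_window_only = toy_score("llama retrieval", " ".join(windows[0]))
```

Here the relevant words sit in the middle of the document, so scoring only the first window would miss them while the aggregated score does not. A model that can read the whole document in one pass avoids the segmentation and aggregation steps altogether.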
In-Domain Evaluation: RepLLaMA and RankLLaMA outperform existing methods on the MS MARCO passage and document ranking datasets, achieving state-of-the-art effectiveness.
Zero-Shot Evaluation: Both models demonstrate superior zero-shot effectiveness on the BEIR datasets, surpassing existing dense retrievers that themselves have billions of parameters.
Full Fine-Tuning vs. LoRA: The authors compare the effectiveness of full fine-tuning and LoRA (a parameter-efficient method) when training RepLLaMA. Full fine-tuning achieves higher scores on the training set, but LoRA generalises better on the test collections built from independent human judgments (TREC DL19/DL20), which suggests that full fine-tuning is more prone to overfitting the training distribution.
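As a reminder of what LoRA changes, the sketch below shows the low-rank update for a single weight matrix: the pretrained weight W stays frozen and only two small matrices A and B are trained, giving an effective weight of W + (alpha / r) * B A. The dimensions, rank, and alpha used here are illustrative assumptions, not the paper's settings.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def trainable_params(d_out, d_in, r):
    """Trainable parameter counts for one matrix: full fine-tuning vs. LoRA."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}

random.seed(0)
d_out, d_in, r = 4, 6, 2  # toy sizes
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
b = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero-initialised

w_start = lora_effective_weight(w, a, b, alpha=4.0)
counts = trainable_params(4096, 4096, 8)
```

Because B starts at zero, the effective weight equals the pretrained weight before training begins. For a 4096 x 4096 matrix with rank 8, the trainable parameters drop from 16,777,216 to 65,536, which is consistent with the intuition that a much smaller set of trainable parameters is less able to overfit the training distribution.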
Input Sequence Length: The authors investigate how varying the maximum training and inference input lengths affects RankLLaMA's effectiveness for document reranking. Effectiveness improves as the input length increases, but the gains plateau beyond a certain length, so longer inputs bring diminishing returns.
The authors discuss the advancements in pre-trained language models, particularly the development of large decoder-only models like GPT and LLaMA.
They review the evolution of multi-stage text retrieval pipelines, highlighting the impact of pre-trained language models on retrievers and rerankers.
The authors compare their approach with recent methods that prompt LLMs for text reranking in a generative manner, emphasising the advantages of their fully optimised and efficient multi-stage retrieval system.
The study demonstrates the potential of fine-tuning LLMs as dense retrievers and pointwise rerankers, establishing an effective, state-of-the-art multi-stage retrieval system that outperforms smaller models.
The authors underscore the potential of leveraging LLMs for retrieval tasks in the future and express their intention to continue exploring this direction.
Overall, this paper presents a comprehensive investigation into the use of LLMs in multi-stage text retrieval, showcasing the effectiveness of fine-tuning LLaMA as both a retriever and a reranker. The authors' approach achieves state-of-the-art results on various datasets and offers insights into the potential and challenges of leveraging LLMs for retrieval tasks.