ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
The widely cited paper by Omar Khattab and Matei Zaharia from Stanford University
This widely cited 2020 paper, published at SIGIR 2020, introduced a novel information retrieval (IR) model that addressed the efficiency and effectiveness challenges faced by existing deep learning-based ranking models.
Information retrieval systems are judged on their ability to quickly and cost-effectively search through large document collections to find relevant information.
Recent advances in deep learning, particularly the use of deep language models like BERT (2018), have significantly improved the effectiveness of ranking models.
However, these models are computationally expensive, leading to high latency and resource requirements.
ColBERT (Contextualized Late Interaction over BERT) is a ranking model that addresses these challenges by allowing independent encoding of queries and documents, followed by a computationally cheap interaction step, making it suitable for both re-ranking and end-to-end retrieval scenarios.
Representation-based similarity models compute embeddings for the query and document independently, then calculate a single similarity score between the two vectors.
In contrast, ColBERT computes multiple embeddings for each query and document term, allowing for a more fine-grained interaction.
Query-document interaction models capture word-level interactions between the query and document, typically using an interaction matrix processed by a neural network.
ColBERT also models interactions, but does so in a "late" fashion. It first computes embeddings for query and document terms independently, then calculates interactions using MaxSim operations. This allows for pre-computation of document embeddings, improving efficiency.
All-to-all interaction models, such as BERT-based rankers, consider interactions among all words in the query and document simultaneously, using the Transformer's attention mechanism.
ColBERT also uses BERT, but employs it differently.
Instead of a single, computationally expensive all-to-all interaction, it uses BERT to generate term-level embeddings, then applies cheaper MaxSim operations for interaction. This retains the power of BERT's representations while improving efficiency.
The first step in ColBERT is encoding the query and documents into a bag of fixed-size embeddings.
Queries shorter than a predefined length Nq are padded with BERT's special [mask] tokens. This padding is called query augmentation, and it allows BERT to produce query-based embeddings at the positions corresponding to these masks.
Query augmentation serves as a soft, differentiable mechanism for learning to expand queries with new terms or re-weigh existing terms based on their importance for matching the query.
The padded sequence of input tokens is then passed into BERT's deep transformer architecture, which computes a contextualised representation of each token.
The contextualised output representations are then passed through a linear layer with no activation function.
This layer controls the dimension of ColBERT's embeddings, producing m-dimensional embeddings, where the output size m is typically much smaller than BERT's fixed hidden dimension.
Finally, the output embeddings are normalised so each has an L2 norm equal to one.
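To make the pipeline concrete, here is a minimal sketch of ColBERT-style query encoding using Hugging Face's transformers library. The way the [Q] and [D] markers are registered, the 128-dimensional projection, and the function names are illustrative assumptions rather than the authors' released code.

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Minimal, untrained sketch of a ColBERT-style query encoder (illustrative only).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]", "[D]"]})
bert = BertModel.from_pretrained("bert-base-uncased")
bert.resize_token_embeddings(len(tokenizer))
linear = torch.nn.Linear(bert.config.hidden_size, 128, bias=False)  # output dimension m = 128

def encode_query(query: str, n_q: int = 32) -> torch.Tensor:
    # Prepend the [Q] marker; the tokenizer adds [CLS] in front, so [Q] sits right after it.
    ids = tokenizer(f"[Q] {query}", truncation=True, max_length=n_q)["input_ids"]
    # Query augmentation: pad the sequence with [MASK] tokens up to length n_q.
    ids = ids + [tokenizer.mask_token_id] * (n_q - len(ids))
    hidden = bert(torch.tensor([ids])).last_hidden_state      # (1, n_q, 768)
    emb = linear(hidden)[0]                                    # (n_q, 128)
    return torch.nn.functional.normalize(emb, p=2, dim=-1)     # unit L2 norm per token
```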
Unlike queries, [mask] tokens are not appended to documents.
After passing the input sequence through BERT and the linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols, determined via a predefined list.
This filtering is meant to reduce the number of embeddings per document, as contextualized embeddings of punctuation are hypothesised to be unnecessary for effectiveness.
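A matching sketch for the document side, reusing the tokenizer, bert, and linear objects defined above; the punctuation list here is an assumption, built from Python's string.punctuation rather than the paper's exact skip list.

```python
import string
import torch

# Token ids whose surface form is a punctuation symbol (a stand-in for the predefined list).
PUNCTUATION_IDS = {
    tok_id for tok, tok_id in tokenizer.get_vocab().items() if tok in string.punctuation
}

def encode_document(doc: str, max_len: int = 180) -> torch.Tensor:
    enc = tokenizer(f"[D] {doc}", truncation=True, max_length=max_len, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state                     # (1, seq_len, 768)
    emb = torch.nn.functional.normalize(linear(hidden)[0], p=2, dim=-1)
    # Filter out embeddings whose input token is punctuation.
    keep = torch.tensor([tid not in PUNCTUATION_IDS for tid in enc["input_ids"][0].tolist()])
    return emb[keep]                                            # (num_kept_tokens, 128)
```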
The key idea behind late interaction is to delay the interaction between the query and document embeddings until after they have been generated independently.
This is in contrast to other approaches that perform the interaction during the encoding process itself.
In ColBERT, the late interaction is conducted as a sum of maximum similarity computations.
The similarity function used can be either cosine similarity (implemented efficiently as dot products due to embedding normalization) or squared L2 distance.
Intuitively, this operation finds the best matching document embedding for each query embedding and sums up these maximum similarity scores.
It captures the overall relevance between the query and the document based on the strongest local matches between their contextualised embeddings.
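In code, the sum-of-MaxSim operation is a single matrix product followed by a row-wise max and a sum. A minimal sketch, assuming both inputs are already unit-normalised as produced by the encoders above:

```python
import torch

def colbert_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Sum-of-MaxSim late interaction between one query and one document.

    query_emb: (n_q, dim) L2-normalised query token embeddings
    doc_emb:   (n_d, dim) L2-normalised document token embeddings
    """
    sim = query_emb @ doc_emb.T         # (n_q, n_d) cosine similarities, since inputs are unit vectors
    return sim.max(dim=1).values.sum()  # best document match per query token, summed into S(q, d)
```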
ColBERT is trained end-to-end using pairwise softmax cross-entropy loss.
The model is then optimised to assign higher scores to positive documents compared to negative documents.
It's important to note that the late interaction mechanism itself has no trainable parameters.
The learnable parameters in ColBERT are in the BERT-based encoders and the additional linear layers.
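A sketch of the training objective under these assumptions: each training example supplies the late-interaction score of one positive and one negative document, and the two scores are treated as a two-way softmax classification problem.

```python
import torch
import torch.nn.functional as F

def pairwise_softmax_ce(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise softmax cross-entropy over (d+, d-) scores for a batch of triples."""
    logits = torch.stack([pos_scores, neg_scores], dim=1)        # (batch, 2)
    labels = torch.zeros(pos_scores.size(0), dtype=torch.long)   # the positive document is class 0
    return F.cross_entropy(logits, labels)
```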
Query Encoder (fQ): Transforms a query into a set of fixed-size embeddings, where each embedding is contextualised based on the entire query.
Document Encoder (fD): Similar to the query encoder, it transforms a document into a set of embeddings, with each embedding contextualised within the document's content.
Late Interaction Mechanism: Uses a simple yet powerful approach to compute the relevance score between a query and a document by summing the maximum similarity scores across all pairs of query and document embeddings.
The retrieval process in ColBERT involves two stages: an offline indexing stage and an online querying stage.
Offline Indexing: During indexing, every document in the collection is passed through the document encoder and its embeddings are stored. This process is computationally expensive but needs to be done only once. ColBERT applies several optimisations to speed up indexing:
Parallel encoding of document batches using multiple GPUs.
Padding documents within a batch to the maximum length for efficient batch processing.
Grouping documents by length and processing them in batches of similar lengths (known as length-based bucketing).
Parallelising the text preprocessing (e.g., WordPiece tokenization) across CPU cores.
The document embeddings are then saved to disk using 16-bit or 32-bit floating-point representations.
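A minimal sketch of that storage step; the .npz layout (one concatenated matrix plus per-document lengths) is an assumption for illustration, not the paper's on-disk format.

```python
import numpy as np
import torch

def save_document_embeddings(doc_embs: list, path: str) -> None:
    """Persist per-document embedding matrices as 16-bit floats (illustrative layout)."""
    lengths = np.array([e.shape[0] for e in doc_embs], dtype=np.int64)   # tokens kept per document
    matrix = torch.cat(doc_embs, dim=0).to(torch.float16).cpu().numpy()  # (total_tokens, dim)
    np.savez(path, embeddings=matrix, lengths=lengths)
```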
Online Querying: During online querying, ColBERT can be used in two modes: re-ranking or end-to-end retrieval.
In the re-ranking mode, ColBERT is used to re-rank a small set of candidate documents (e.g., top-k documents from a term-based retrieval model).
In the end-to-end retrieval mode, ColBERT directly retrieves the top-k documents from the entire collection.
This is done using a two-stage approach. In the first stage, an approximate nearest neighbour (ANN) search over the document embeddings produces a set of candidate documents.
The candidate documents from the first stage are then re-ranked using the full late interaction mechanism to obtain the final top-k results.
The use of ANN search in the first stage allows ColBERT to efficiently handle large document collections by avoiding exhaustive scoring of all documents. The re-ranking stage ensures that the final results are based on the full ColBERT relevance scores.
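A compressed sketch of the two-stage flow, reusing the hypothetical colbert_score function above; the FAISS index type, the candidate-set construction, and the k' value are assumptions (the paper uses a quantised index rather than the exact flat index shown here).

```python
import faiss
import numpy as np
import torch

def retrieve_end_to_end(query_emb: torch.Tensor,
                        all_doc_embs: np.ndarray,   # (total_tokens, dim) float32, every document token embedding
                        emb_to_doc: np.ndarray,     # (total_tokens,) document id owning each embedding row
                        doc_embs: dict,             # doc_id -> (n_d, dim) torch tensor for re-ranking
                        k: int = 10, k_prime: int = 1000) -> list:
    """Two-stage retrieval: ANN candidate generation, then full MaxSim re-ranking."""
    index = faiss.IndexFlatIP(all_doc_embs.shape[1])  # exact inner-product index; built once in practice
    index.add(all_doc_embs)
    # Stage 1: fetch the k' nearest document embeddings for every query embedding.
    _, nn_ids = index.search(query_emb.detach().cpu().numpy().astype(np.float32), k_prime)
    candidates = {int(d) for d in emb_to_doc[nn_ids.ravel()]}
    # Stage 2: re-rank the candidate documents with the full sum-of-MaxSim score.
    scored = sorted(((d, float(colbert_score(query_emb, doc_embs[d]))) for d in candidates),
                    key=lambda pair: pair[1], reverse=True)
    return [d for d, _ in scored[:k]]
```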
Overall, the late interaction mechanism and the two-stage retrieval process enable ColBERT to achieve high-quality retrieval results while maintaining computational efficiency, making it suitable for large-scale information retrieval tasks.
The experimental evaluation section of the ColBERT paper is where the authors empirically assess the performance of ColBERT in various retrieval contexts, addressing specific research questions (RQs) to determine the model's efficiency and effectiveness in comparison with both traditional and neural ranking models.
Datasets & Metrics
They used two datasets, MS MARCO and TREC-CAR, which are large enough to train and evaluate deep neural networks.
MS MARCO is used to assess reading comprehension and information retrieval with real-world queries, while TREC-CAR is based on Wikipedia and focuses on complex answer retrieval.
The evaluation metrics include Mean Reciprocal Rank at 10 (MRR@10) and others like Recall@k for different k values.
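As a quick reference, MRR@10 averages the reciprocal rank of the first relevant passage within each query's top 10 results. A small sketch of the computation over assumed ranking lists and relevance judgments:

```python
def mrr_at_10(rankings: list, relevant: list) -> float:
    """rankings: per-query lists of ranked doc ids; relevant: per-query sets of relevant doc ids."""
    total = 0.0
    for ranked_ids, rel_ids in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_ids[:10], start=1):
            if doc_id in rel_ids:
                total += 1.0 / rank   # reciprocal rank of the first relevant hit
                break                 # queries with no relevant hit in the top 10 contribute 0
    return total / len(rankings)
```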
Implementation
The authors implemented ColBERT using PyTorch and the transformers library.
They fine-tuned the BERT models with specific learning rates and batch sizes, adjusting the number of embeddings per query and the dimension of ColBERT embeddings. They experimented with both BERT base and large models, adjusting training iterations based on the dataset.
Hardware & Time Measurements
They measured re-ranking latency using a Tesla V100 GPU and used Titan V GPUs for the end-to-end retrieval and indexing experiments, ensuring a fair assessment of computational efficiency.
Re-ranking Efficiency and Effectiveness (RQ1)
ColBERT's re-ranking performance was compared with other neural rankers like KNRM, Duet, and fastText+ConvKNRM, and variations of BERT-based rankers.
The aim was to understand if ColBERT could bridge the gap between efficient and effective neural models.
The results showed that ColBERT achieved comparable or better effectiveness (measured by MRR@10) than these models while being significantly more efficient in terms of latency and computational resources (FLOPs).
End-to-End Retrieval (RQ2)
Beyond re-ranking, the authors tested if ColBERT could support efficient and effective end-to-end retrieval from a large document collection.
They demonstrated that ColBERT could retrieve top-k documents directly from the MS MARCO collection with high effectiveness (MRR@10) and recall metrics, indicating strong performance in filtering a large collection to find relevant documents.
Component Contribution (RQ3)
The authors conducted an ablation study to evaluate the contribution of different components of ColBERT, such as late interaction and query augmentation, to the overall performance. This study aimed to identify which parts of the model were critical for its effectiveness and efficiency.
Indexing-Related Costs (RQ4)
The study also looked into the costs associated with offline computation and memory overhead for indexing with ColBERT. This aspect is crucial for understanding the practicality of deploying ColBERT in real-world scenarios, especially considering the balance between offline indexing costs and online retrieval efficiency.
In summary, the experimental evaluation demonstrated ColBERT's ability to maintain or exceed the effectiveness of state-of-the-art neural rankers while significantly reducing computational costs, highlighting its potential for practical, large-scale information retrieval applications.
ColBERT represents a significant advancement in the field of information retrieval by successfully addressing the trade-off between efficiency and effectiveness in deep learning-based ranking models.
By introducing a late interaction mechanism and leveraging the power of BERT's contextualised representations, ColBERT achieves remarkable speedups and requires substantially fewer computational resources compared to existing BERT-based models, while maintaining high-quality retrieval performance.
The extensive experimental evaluation demonstrates ColBERT's superiority over non-BERT baselines and its potential for practical deployment in large-scale information retrieval systems.
As the volume of digital information continues to grow, the development of efficient and effective retrieval models like ColBERT will be crucial in enabling users to access relevant information quickly and accurately.
Here's an example of how ColBERT would work in practice, focusing on the query encoding, document encoding, and late interaction stages.
Let's consider a simple query and a set of two documents:
Query: "What is the capital of France?"
Document 1: "Paris is the capital and most populous city of France, with an estimated population of 2,165,423 residents as of 2019 in an area of more than 105 square kilometres."
Document 2: "London is the capital and largest city of England and the United Kingdom, with a population of just under 9 million."
The query "What is the capital of France?" is tokenized into WordPiece tokens: ['what', 'is', 'the', 'capital', 'of', 'france', '?'].
The special token [Q] is prepended to the query right after [CLS], and the query is padded with [mask] tokens to reach the predefined length Nq. Let's assume Nq = 10, so the seven query tokens are padded with three [mask] tokens. Input to BERT: [CLS] [Q] 'what' 'is' 'the' 'capital' 'of' 'france' '?' [mask] [mask] [mask]
The padded sequence is passed through BERT, and the contextualized representations of each token are obtained.
Document 1 is tokenized into WordPiece tokens: ['paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city', 'of', 'france', ',', 'with', 'an', 'estimated', 'population', 'of', '2', ',', '165', ',', '423', 'residents', 'as', 'of', '2019', 'in', 'an', 'area', 'of', 'more', 'than', '105', 'square', 'kilometres', '.']
The special tokens [CLS] and [D] are prepended to the document. Input to BERT: [CLS] [D] 'paris' 'is' 'the' 'capital' 'and' 'most' 'populous' 'city' 'of' 'france' ',' 'with' 'an' 'estimated' 'population' 'of' '2' ',' '165' ',' '423' 'residents' 'as' 'of' '2019' 'in' 'an' 'area' 'of' 'more' 'than' '105' 'square' 'kilometres' '.'
The sequence is passed through BERT, and the contextualized representations of each token are obtained.
The contextualized representations are passed through a linear layer to reduce their dimensionality to m (128).
The resulting embeddings are normalized, and the embeddings corresponding to punctuation are filtered out.
The documents are ranked based on their relevance scores. In this example, Document 1 would likely receive a higher relevance score than Document 2, as it contains more information related to the query "What is the capital of France?".
This example demonstrates how ColBERT encodes queries and documents separately, and then uses late interaction to compute relevance scores based on the maximum similarity between query and document embeddings.
This approach allows for efficient retrieval while leveraging the power of contextualized representations from BERT.
This is done using BERT-based encoders, denoted fQ for the query encoder and fD for the document encoder.
Although a single BERT model is shared between the query and document encoders, ColBERT distinguishes between query and document input sequences by prepending a special token [Q] to queries and [D] to documents.
For the query encoder fQ, a textual query q is first tokenized into BERT-based WordPiece tokens q1, q2, ..., ql.
The special token [Q] is prepended to the query, right after BERT's sequence start token [CLS].
If the query has fewer than a predefined number of tokens Nq, it is padded with BERT's special [mask] tokens up to length Nq.
Because the output embeddings are L2-normalised, the dot-product of any two embeddings is equivalent to their cosine similarity, falling in the [-1, 1] range.
The document encoder follows a similar architecture.
The document d is segmented into its constituent tokens d1, d2, ..., dn, and BERT's start token [CLS] followed by the special token [D] is prepended.
Late interaction is the process of estimating the relevance score between a query q and a document d, denoted S(q, d), using their bags of contextualized embeddings (Eq and Ed) obtained from the query and document encoders.
Specifically, for each query embedding in Eq, the maximum similarity score is computed against all document embeddings in Ed.
Mathematically, the relevance score is calculated as:
S(q, d) = Σ_i max_j (Eq_i · Ed_j)
where i ranges over the |Eq| query embeddings, j ranges over the |Ed| document embeddings, and |Eq| and |Ed| denote the number of embeddings in Eq and Ed, respectively.
Given a training triple ⟨q, d+, d−⟩, where d+ is a positive (relevant) document and d− is a negative (irrelevant) document for query q, ColBERT produces a score for each document individually using the late interaction mechanism.
Offline Indexing: During offline indexing, the document encoder is run on each document in the collection, and the resulting embeddings are stored.
The query encoder is run on the input query to obtain the query embeddings Eq.
The pre-computed document embeddings of the candidate documents are loaded from disk and stacked into a 3D tensor D.
The relevance scores are then computed using late interaction between Eq and D, and the documents are sorted based on their scores.
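In re-ranking mode this can be batched. A sketch of scoring all k candidates at once against the 3D tensor D, under the assumption that shorter documents are zero-padded and masked:

```python
import torch

def rerank_scores(query_emb: torch.Tensor, doc_tensor: torch.Tensor,
                  doc_mask: torch.Tensor) -> torch.Tensor:
    """Batched late interaction over k candidate documents.

    query_emb:  (n_q, dim) query embeddings
    doc_tensor: (k, max_len, dim) candidate document embeddings, zero-padded to max_len
    doc_mask:   (k, max_len) 1 for real tokens, 0 for padding
    """
    sim = torch.einsum("qd,kld->kql", query_emb, doc_tensor)            # (k, n_q, max_len)
    sim = sim.masked_fill(doc_mask[:, None, :] == 0, float("-inf"))     # never match padding
    return sim.max(dim=2).values.sum(dim=1)                             # (k,) relevance scores
```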
In the end-to-end retrieval mode, an approximate nearest neighbour (ANN) search is first performed using the FAISS library. For each query embedding in Eq, the top-k' nearest document embeddings are retrieved from the FAISS index. This stage efficiently narrows the candidate set down to a smaller subset of potentially relevant documents.
The contextualized representations are passed through a linear layer to reduce their dimensionality to m. Let's assume m = 128.
The resulting embeddings are normalized to have an L2 norm of 1, giving the query embeddings Eq = [q1, q2, ..., qNq], where each qi is a 128-dimensional vector.
Document 1's embeddings are Ed1 = [d1, d2, ..., dn], where each di is a 128-dimensional vector and n is the number of non-punctuation tokens in Document 1.
The same process is applied to Document 2, resulting in its embeddings Ed2.
For each query embedding in Eq, the maximum similarity score (e.g., cosine similarity) is computed against all document embeddings in Ed1.
The same process is repeated for the Document 2 embeddings Ed2.
The maximum similarity scores are summed across all query embeddings to obtain the relevance scores S(q, d1) and S(q, d2) for Documents 1 and 2, respectively.
As noted above, the relevance score for Document 1, S(q, d1), would be expected to exceed the relevance score for Document 2, S(q, d2), because Document 1's embeddings match the query terms 'capital' and 'france' far more strongly.
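Tying the earlier sketches together, the example could be scored as below, using the hypothetical encode_query, encode_document, and colbert_score functions defined above. Note that with the untrained projection layer in those sketches the ordering is not meaningful; only a fine-tuned ColBERT model would reliably rank Document 1 first.

```python
query = "What is the capital of France?"
doc1 = "Paris is the capital and most populous city of France, ..."   # full text as given above
doc2 = "London is the capital and largest city of England and the United Kingdom, ..."

q_emb = encode_query(query)                                # (Nq, 128) normalised query embeddings
score1 = colbert_score(q_emb, encode_document(doc1))       # S(q, d1)
score2 = colbert_score(q_emb, encode_document(doc2))       # S(q, d2)
print(f"S(q, d1) = {score1.item():.3f}, S(q, d2) = {score2.item():.3f}")
```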