
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

The widely cited paper by Omar Khattab and Matei Zaharia from Stanford University

This widely cited 2020 paper, published at SIGIR 2020, introduced a novel information retrieval (IR) model that addressed the efficiency and effectiveness challenges faced by existing deep learning-based ranking models.

Information retrieval systems are judged on their ability to search large document collections quickly and cost-effectively and return relevant information.

Recent advances in deep learning, particularly the use of deep language models like BERT (2018), have significantly improved the effectiveness of ranking models.

However, these models are computationally expensive, leading to high latency and resource requirements.

ColBERT (Contextualized Late Interaction over BERT) is a ranking model that addresses these challenges by allowing independent encoding of queries and documents, followed by a computationally cheap interaction step, making it suitable for both re-ranking and end-to-end retrieval scenarios.

Contrasting ColBERT with Other Methods

Representation-based Similarity (DSSM, SNRM)

These methods compute embeddings for the query and document independently, then calculate a single similarity score between the two vectors.

In contrast, ColBERT computes multiple embeddings for each query and document term, allowing for a more fine-grained interaction.

Schematic diagrams illustrating query–document matching paradigms in neural IR.

Query-Document Interaction (e.g., DRMM, KNRM, Conv-KNRM)

These methods model word-level interactions between the query and document, typically using an interaction matrix processed by a neural network.

ColBERT also models interactions, but does so in a "late" fashion. It first computes embeddings for query and document terms independently, then calculates interactions using MaxSim operations. This allows for pre-computation of document embeddings, improving efficiency.

All-to-all Interaction (BERT)

BERT-like models consider interactions among all words in the query and document simultaneously, using the Transformer's attention mechanism.

ColBERT also uses BERT, but employs it differently.

Instead of a single, computationally expensive all-to-all interaction, it uses BERT to generate term-level embeddings, then applies cheaper MaxSim operations for interaction. This retains the power of BERT's representations while improving efficiency.

Source: Vespa. Illustration of regular text embedding models, which encode all the words in the language model's context window into a single vector representation. The query-document similarity expression considers only the single vector produced by the pooling operation. For a practical introduction to and behind-the-scenes look at text embedding models, we can recommend this blog post.
A Conversation on ColBERT

In this YouTube conversation, Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the two influential papers introducing ColBERT (2020) and ColBERT v2 (2022). The following is a summary of the discussion:

Introduction

  • ColBERT is a neural information retrieval model that differs from other dense retrieval methods.

  • Most dense retrieval methods represent queries and documents as single, low-dimensional vectors (embeddings).

  • ColBERT represents queries and documents using multiple vectors, one for each term, allowing for a more expressive representation.

ColBERT Architecture

  • ColBERT uses a dual-encoder architecture, with separate encoders for queries and documents.

  • The query encoder takes a fixed-length query (32 terms, truncated or padded) and produces an embedding for each query term.

  • The document encoder processes the document text and generates an embedding for each document term.

  • The similarity between a query and a document is computed using the MaxSim operation, which calculates the maximum cosine similarity between each query term embedding and all document term embeddings.

Retrieval and Re-ranking

  • ColBERT can be used as both a retriever and a re-ranker.

  • For retrieval, ColBERT first gathers candidate documents by searching the index for the nearest neighbours of each query term embedding, and then re-ranks these candidates using the full MaxSim score.

  • The re-ranking step computes the exact MaxSim score between the query and the retrieved documents, considering all query terms.

Training and Negative Sampling

  • ColBERT is trained using triples (query, positive document, negative document) from datasets like MS MARCO.

  • Negative sampling is crucial for effective training, and ColBERT explores different strategies, such as random negatives, hard negatives (using BM25), and in-batch negatives.

  • Increasing the number of negatives, especially hard negatives, can significantly improve retrieval performance.

ColBERT v2 Improvements

  • ColBERT v2 introduces several improvements over the original model:

a. Distillation from a cross-encoder (e.g., BERT) to enhance the training process.

b. Hybrid data augmentation, combining random sampling and denoised hard negatives.

c. Clustering-based compression of document embeddings using centroids and delta vectors to reduce storage requirements.
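The compression scheme in (c) can be made concrete with a small sketch. This is a toy illustration of the centroid-plus-residual idea, not ColBERT v2's exact implementation: the int8 quantisation and the helper names here are illustrative assumptions (ColBERT v2 quantises residuals far more aggressively, to 1 or 2 bits per dimension).

```python
import torch

def compress(embeddings: torch.Tensor, centroids: torch.Tensor):
    # Assign each embedding to its nearest centroid (dot-product space).
    ids = (embeddings @ centroids.T).argmax(dim=1)           # (n,) centroid ids
    residuals = embeddings - centroids[ids]                  # delta vectors
    scale = residuals.abs().max() / 127                      # shared int8 scale (toy choice)
    return ids, (residuals / scale).round().to(torch.int8), scale

def decompress(ids, q_res, scale, centroids):
    # Reconstruct approximate embeddings: centroid + dequantised residual.
    return centroids[ids] + q_res.float() * scale
```

Storing a centroid id plus a few bits of residual per dimension is far cheaper than storing a full 16- or 32-bit float vector per token, which is what makes multi-vector indexes practical at scale.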

Evaluation and Results

  • ColBERT is evaluated on benchmark datasets like MS MARCO (in-domain) and BEIR (out-of-domain).

  • ColBERT v2 outperforms strong baselines, including single-vector dense retrieval methods such as RocketQA, particularly in out-of-domain settings.

  • The improvements in ColBERT v2 are primarily attributed to the advanced training techniques, such as distillation and hard negative mining.

Future Directions

  • Further improving the storage efficiency of ColBERT, e.g., through centroid pruning and dimensionality reduction.

  • Investigating the reasons behind ColBERT's strong out-of-domain performance compared to single-vector dense retrieval methods.

  • Exploring alternative granularities for the MaxSim operation, such as sentence-level or paragraph-level representations.

ColBERT is a powerful and efficient neural information retrieval model that strikes a balance between the expressiveness of term-level representations and the efficiency of dense retrieval.

Its strong performance, particularly in out-of-domain settings, has made it a popular choice for researchers and practitioners in the field.

Architecture

Query and Document Encoders

The first step in ColBERT is encoding the query and documents into a bag of fixed-size embeddings.

This is done using BERT-based encoders, denoted fQ for the query encoder and fD for the document encoder.

Although a single BERT model is shared between the query and document encoders, ColBERT distinguishes between query and document input sequences by prepending a special token [Q] to queries and [D] to documents.

Query Encoder

For the query encoder fQ, a textual query q is first tokenized into BERT-based WordPiece tokens: q1, q2, ..., ql.

The special token [Q] is prepended to the query, right after BERT's sequence start token [CLS].

If the query has fewer than a predefined number of tokens Nq, it is padded with BERT's special [mask] tokens up to length Nq.

This is called query augmentation, which allows BERT to produce query-based embeddings at the positions corresponding to these masks.

Query augmentation serves as a soft, differentiable mechanism for learning to expand queries with new terms or re-weigh existing terms based on their importance for matching the query.

The padded sequence of input tokens is then passed into BERT's deep transformer architecture, which computes a contextualised representation of each token.

The contextualised output representations are then passed through a linear layer with no activations.

This layer serves to control the dimension of ColBERT's embeddings, producing m-dimensional embeddings for the layer's output size m, which is typically much smaller than BERT's fixed hidden dimension.

Finally, the output embeddings are normalised so each has an L2 norm equal to one.

This ensures that the dot-product of any two embeddings becomes equivalent to their cosine similarity, falling in the [-1, 1] range.
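Putting the query-encoder steps together, here is a simplified sketch assuming a Hugging Face BERT backbone. The output dimension m = 128 and query length Nq = 32 follow the paper; using the "[unused0]" vocabulary slot to stand in for [Q] mirrors common ColBERT implementations but is an assumption here.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
linear = torch.nn.Linear(bert.config.hidden_size, 128, bias=False)  # output dim m = 128

def encode_query(query: str, Nq: int = 32) -> torch.Tensor:
    # Tokenize without special tokens, then build [CLS] [Q] q1 ... ql manually.
    ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    q_marker = tokenizer.convert_tokens_to_ids("[unused0]")  # stands in for [Q] (assumption)
    ids = [tokenizer.cls_token_id, q_marker] + ids
    # Query augmentation: pad with [mask] tokens up to length Nq (or truncate).
    ids = ids[:Nq] + [tokenizer.mask_token_id] * max(0, Nq - len(ids))
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        hidden = bert(input_ids).last_hidden_state       # (1, Nq, 768) contextualised tokens
    emb = linear(hidden)                                 # (1, Nq, 128) via the linear layer
    return torch.nn.functional.normalize(emb, dim=-1)[0]  # each row has unit L2 norm
```

At query time, `encode_query("What is the capital of France?")` returns a 32 × 128 matrix of unit-norm embeddings, one per token position, including the [mask] positions used for query augmentation.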

Document Encoder

The document encoder fD follows a similar architecture.

The document d is segmented into its constituent tokens d1, d2, ..., dm, and BERT's start token [CLS] followed by the special token [D] is prepended.

Unlike queries, [mask] tokens are not appended to documents.

After passing the input sequence through BERT and the linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols, determined via a predefined list.

This filtering is meant to reduce the number of embeddings per document, as contextualized embeddings of punctuation are hypothesised to be unnecessary for effectiveness.
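A small sketch of this filtering step, assuming Python's string.punctuation as the "predefined list" of symbols (the paper does not specify the exact list):

```python
import string

PUNCTUATION = set(string.punctuation)  # assumed stand-in for the paper's predefined list

def filter_punctuation(tokens: list[str], embeddings):
    """Drop embeddings whose WordPiece token is a punctuation symbol."""
    keep = [i for i, tok in enumerate(tokens) if tok not in PUNCTUATION]
    return [tokens[i] for i in keep], embeddings[keep]
```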

Late Interaction

Late interaction is the process of estimating the relevance score between a query q and a document d, denoted Sq,d, using their bags of contextualized embeddings (Eq and Ed) obtained from the query and document encoders.

The key idea behind late interaction is to delay the interaction between the query and document embeddings until after they have been generated independently.

This is in contrast to other approaches that perform the interaction during the encoding process itself.

In ColBERT, the late interaction is conducted as a sum of maximum similarity computations.

Specifically, for each query embedding Eqi in Eq, the maximum similarity score is computed against all document embeddings Edj in Ed.

The similarity function used can be either cosine similarity (implemented efficiently as dot products due to embedding normalization) or squared L2 distance.

Mathematically, the relevance score Sq,d is calculated as:

Sq,d := Σ_{i ∈ [|Eq|]} max_{j ∈ [|Ed|]} Eqi · Edj^T

where |Eq| and |Ed| denote the number of embeddings in Eq and Ed, respectively.

Intuitively, this operation finds the best matching document embedding for each query embedding and sums up these maximum similarity scores.

It captures the overall relevance between the query and the document based on the strongest local matches between their contextualised embeddings.
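In code, the whole late interaction reduces to a matrix multiply, a row-wise max, and a sum. A minimal sketch in PyTorch, assuming both embedding matrices are already L2-normalised so dot products equal cosine similarities:

```python
import torch
import torch.nn.functional as F

def late_interaction(Eq: torch.Tensor, Ed: torch.Tensor) -> torch.Tensor:
    """Eq: (|Eq|, m) query embeddings; Ed: (|Ed|, m) document embeddings, L2-normalised."""
    sim = Eq @ Ed.T                   # (|Eq|, |Ed|) cosine similarities via dot products
    max_sim = sim.max(dim=1).values   # best-matching document embedding per query term
    return max_sim.sum()              # Sq,d: sum of per-term maxima

# Toy usage with random unit-norm embeddings
Eq = F.normalize(torch.randn(32, 128), dim=-1)
Ed = F.normalize(torch.randn(80, 128), dim=-1)
print(late_interaction(Eq, Ed))
```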

ColBERT is trained end-to-end using pairwise softmax cross-entropy loss.

Given a triple (q, d+, d-), where d+ is a positive (relevant) document and d- is a negative (irrelevant) document for query q, ColBERT produces a score for each document individually using the late interaction mechanism.

The model is then optimised to assign higher scores to positive documents compared to negative documents.

It's important to note that the late interaction mechanism itself has no trainable parameters.

The learnable parameters in ColBERT are in the BERT-based encoders and the additional linear layers.
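A minimal sketch of this training objective, assuming a hypothetical `model(query, doc)` that returns the late-interaction score for a single query-document pair:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, query, pos_doc, neg_doc):
    s_pos = model(query, pos_doc)                      # late-interaction score, relevant doc
    s_neg = model(query, neg_doc)                      # late-interaction score, irrelevant doc
    logits = torch.stack([s_pos, s_neg]).unsqueeze(0)  # shape (1, 2)
    target = torch.zeros(1, dtype=torch.long)          # class 0 = the positive document
    loss = F.cross_entropy(logits, target)             # pairwise softmax cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because the loss flows through the MaxSim scores, gradients reach the BERT encoders and the linear projection even though the interaction step itself has no parameters.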

The general architecture of ColBERT given a query q and a document d.

Summary

  • Query Encoder (fQ): Transforms a query into a set of fixed-size embeddings, where each embedding is contextualised based on the entire query.

  • Document Encoder (fD): Similar to the query encoder, it transforms a document into a set of embeddings, with each embedding contextualised within the document's content.

  • Late Interaction Mechanism: Uses a simple yet powerful approach to compute the relevance score between a query and a document by summing the maximum similarity scores across all pairs of query and document embeddings.

Retrieval Process

The retrieval process in ColBERT involves two stages: an offline indexing stage and an online querying stage.

Offline Indexing: During offline indexing, the document encoder fD is run on each document in the collection, and the resulting embeddings are stored.

This process is computationally expensive but needs to be done only once. ColBERT applies several optimisations to speed up indexing:

  1. Parallel encoding of document batches using multiple GPUs.

  2. Padding documents within a batch to the maximum length for efficient batch processing.

  3. Grouping documents by length and processing them in batches of similar lengths (known as length-based bucketing).

  4. Parallelising the text preprocessing (e.g., WordPiece tokenization) across CPU cores.

The document embeddings are then saved to disk using 16-bit or 32-bit floating-point representations.
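Of these optimisations, length-based bucketing (item 3) is easy to illustrate. The sketch below, which uses whitespace token counts as a crude length proxy, is an assumption rather than the paper's exact procedure:

```python
def length_bucketed_batches(docs: list[str], batch_size: int = 128):
    # Sort by approximate token length so each batch pads only to its own maximum.
    docs = sorted(docs, key=lambda d: len(d.split()))
    for i in range(0, len(docs), batch_size):
        yield docs[i : i + batch_size]  # encode each batch, padding to the batch max length
```

Sorting before batching means a batch of short documents is padded to a short maximum rather than to the longest document in the whole collection, which wastes far less GPU compute.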

Online Querying: During online querying, ColBERT can be used in two modes: re-ranking or end-to-end retrieval.

In the re-ranking mode, ColBERT is used to re-rank a small set of candidate documents (e.g., top-k documents from a term-based retrieval model).

The query encoder fQ is run on the input query to obtain the query embeddings Eq.

The pre-computed document embeddings of the candidate documents are loaded from disk and stacked into a 3D tensor D.

The relevance scores are then computed using late interaction between Eq and D, and the documents are sorted based on their scores.
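A sketch of this batched scoring, with the caveat that zero-padding is only a stand-in for the explicit masking a real implementation would use:

```python
import torch

def rerank(Eq: torch.Tensor, candidate_embs: list[torch.Tensor]) -> torch.Tensor:
    # Pad all candidate documents to a common length and stack into D: (docs, max_len, m).
    max_len = max(e.shape[0] for e in candidate_embs)
    D = torch.zeros(len(candidate_embs), max_len, Eq.shape[1])
    for i, e in enumerate(candidate_embs):
        D[i, : e.shape[0]] = e
    # Batched MaxSim: (docs, |Eq|, max_len) -> max over doc terms -> sum over query terms.
    scores = (Eq @ D.transpose(1, 2)).max(dim=2).values.sum(dim=1)
    return scores  # sort descending to rank the candidates
```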

In the end-to-end retrieval mode, ColBERT directly retrieves the top-k documents from the entire collection.

This is done using a two-stage approach:

  1. An approximate nearest neighbor (ANN) search is performed using the FAISS library. For each query embedding in Eq, the top-k' nearest document embeddings are retrieved from the FAISS index. This stage efficiently narrows down the candidate set to a smaller subset of potentially relevant documents.

  2. The candidate documents from the first stage are then re-ranked using the full late interaction mechanism to obtain the final top-k results.

The use of ANN search in the first stage allows ColBERT to efficiently handle large document collections by avoiding exhaustive scoring of all documents. The re-ranking stage ensures that the final results are based on the full ColBERT relevance scores.
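The first stage can be sketched with FAISS as follows; the flat inner-product index and the `emb_to_doc` mapping from indexed embedding rows back to document ids are assumptions for illustration:

```python
import faiss
import numpy as np

def first_stage_retrieve(Eq: np.ndarray, index: faiss.Index,
                         emb_to_doc: np.ndarray, k_prime: int = 1000) -> np.ndarray:
    # Stage 1: for every query embedding, fetch its k' nearest document embeddings.
    _, ids = index.search(Eq.astype(np.float32), k_prime)  # ids: (|Eq|, k')
    # Map embedding rows back to document ids and take the union as the candidate set.
    return np.unique(emb_to_doc[ids.ravel()])
    # Stage 2 (not shown): re-rank these candidates with the full MaxSim score.
```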

Overall, the late interaction mechanism and the two-stage retrieval process enable ColBERT to achieve high-quality retrieval results while maintaining computational efficiency, making it suitable for large-scale information retrieval tasks.

Experimental Evaluation

The experimental evaluation section of the ColBERT paper is where the authors empirically assess the performance of ColBERT in various retrieval contexts, addressing specific research questions (RQs) to determine the model's efficiency and effectiveness in comparison with both traditional and neural ranking models.

Datasets & Metrics

They used two datasets, MS MARCO and TREC-CAR, which are large enough to train and evaluate deep neural networks.

MS MARCO is used to assess reading comprehension and information retrieval with real-world queries, while TREC-CAR is based on Wikipedia and focuses on complex answer retrieval.

The evaluation metrics include Mean Reciprocal Rank at 10 (MRR@10) and others like Recall@k for different k values.
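For reference, MRR@10 is simple to compute. A minimal sketch, assuming each query comes with a ranked list of document ids and a set of relevant ids:

```python
def mrr_at_10(rankings: list[list[int]], relevant: list[set[int]]) -> float:
    # Mean over queries of 1/rank of the first relevant document within the top 10.
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Example: first query hits at rank 2, second misses the top 10 entirely.
print(mrr_at_10([[5, 3, 9], [7, 8]], [{3}, {42}]))  # (0.5 + 0.0) / 2 = 0.25
```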

Implementation

The authors implemented ColBERT using PyTorch and the transformers library.

They fine-tuned the BERT models with specific learning rates and batch sizes, adjusting the number of embeddings per query and the dimension of ColBERT embeddings. They experimented with both BERT base and large models, adjusting training iterations based on the dataset.

Hardware & Time Measurements

They measured re-ranking latency on a Tesla V100 GPU and used Titan V GPUs for the indexing experiments, ensuring a fair assessment of computational efficiency.

Re-ranking Efficiency and Effectiveness (RQ1)

ColBERT's re-ranking performance was compared with other neural rankers like KNRM, Duet, and fastText+ConvKNRM, and variations of BERT-based rankers.

The aim was to understand if ColBERT could bridge the gap between efficient and effective neural models.

The results showed that ColBERT achieved comparable or better effectiveness (measured by MRR@10) than these models while being significantly more efficient in terms of latency and computational resources (FLOPs).

End-to-End Retrieval (RQ2)

Beyond re-ranking, the authors tested if ColBERT could support efficient and effective end-to-end retrieval from a large document collection.

They demonstrated that ColBERT could retrieve top-k documents directly from the MS MARCO collection with high effectiveness (MRR@10) and recall metrics, indicating strong performance in filtering a large collection to find relevant documents.

Component Contribution (RQ3)

The authors conducted an ablation study to evaluate the contribution of different components of ColBERT, such as late interaction and query augmentation, to the overall performance. This study aimed to identify which parts of the model were critical for its effectiveness and efficiency.

Indexing-Related Costs (RQ4)

The study also looked into the costs associated with offline computation and memory overhead for indexing with ColBERT. This aspect is crucial for understanding the practicality of deploying ColBERT in real-world scenarios, especially considering the balance between offline indexing costs and online retrieval efficiency.

In summary, the experimental evaluation demonstrated ColBERT's ability to maintain or exceed the effectiveness of state-of-the-art neural rankers while significantly reducing computational costs, highlighting its potential for practical, large-scale information retrieval applications.

Conclusion

ColBERT represents a significant advancement in the field of information retrieval by successfully addressing the trade-off between efficiency and effectiveness in deep learning-based ranking models.

By introducing a late interaction mechanism and leveraging the power of BERT's contextualised representations, ColBERT achieves remarkable speedups and requires substantially fewer computational resources compared to existing BERT-based models, while maintaining high-quality retrieval performance.

The extensive experimental evaluation demonstrates ColBERT's superiority over non-BERT baselines and its potential for practical deployment in large-scale information retrieval systems.

As the volume of digital information continues to grow, the development of efficient and effective retrieval models like ColBERT will be crucial in enabling users to access relevant information quickly and accurately.

ColBERT in Action

Here's an example of how ColBERT would work in practice, focusing on the query encoding, document encoding, and late interaction stages.

Let's consider a simple query and a set of two documents:

Query: "What is the capital of France?"

Document 1: "Paris is the capital and most populous city of France, with an estimated population of 2,165,423 residents as of 2019 in an area of more than 105 square kilometres."

Document 2: "London is the capital and largest city of England and the United Kingdom, with a population of just under 9 million."

Query Encoding

  1. The query "What is the capital of France?" is tokenized into WordPiece tokens: ['what', 'is', 'the', 'capital', 'of', 'france', '?'].

  2. The special token [Q] is prepended to the query after [CLS], and the sequence is padded with [mask] tokens to reach the predefined length Nq. Let's assume Nq = 10, counting all tokens. Input to BERT: [CLS] [Q] 'what' 'is' 'the' 'capital' 'of' 'france' '?' [mask]

  3. The padded sequence is passed through BERT, and the contextualized representations of each token are obtained.

  4. The contextualized representations are passed through a linear layer to reduce their dimensionality to m. Let's assume m = 128.

  5. The resulting embeddings are normalized to have an L2 norm of 1. Query Embeddings (Eq): [eq1, eq2, ..., eq10], where each eqi is a 128-dimensional vector.

Document Encoding

  1. Document 1 is tokenized into WordPiece tokens: ['paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city', 'of', 'france', ',', 'with', 'an', 'estimated', 'population', 'of', '2', ',', '165', ',', '423', 'residents', 'as', 'of', '2019', 'in', 'an', 'area', 'of', 'more', 'than', '105', 'square', 'kilometres', '.']

  2. The special tokens [CLS] and [D] are prepended to the document. Input to BERT: [CLS] [D] 'paris' 'is' 'the' 'capital' 'and' 'most' 'populous' 'city' 'of' 'france' ',' 'with' 'an' 'estimated' 'population' 'of' '2' ',' '165' ',' '423' 'residents' 'as' 'of' '2019' 'in' 'an' 'area' 'of' 'more' 'than' '105' 'square' 'kilometres' '.'

  3. The sequence is passed through BERT, and the contextualized representations of each token are obtained.

  4. The contextualized representations are passed through a linear layer to reduce their dimensionality to m (128).

  5. The resulting embeddings are normalized, and the embeddings corresponding to punctuation are filtered out.

  6. Document 1 Embeddings (Ed1): [ed1_1, ed1_2, ..., ed1_n], where each ed1_i is a 128-dimensional vector and n is the number of non-punctuation tokens in Document 1.

  7. The same process is applied to Document 2, resulting in its embeddings (Ed2).

Late Interaction

  1. For each query embedding eqi in Eq, the maximum similarity score (e.g., cosine similarity) is computed against all document embeddings ed1_j in Ed1.

  2. The same process is repeated for Document 2 embeddings (Ed2).

  3. The maximum similarity scores are summed across all query embeddings to obtain the relevance scores Sq,d1 and Sq,d2 for Documents 1 and 2, respectively.

  4. Relevance Score for Document 1: Sq,d1 = sum(max(eq1 · ed1_1, eq1 · ed1_2, ...), max(eq2 · ed1_1, eq2 · ed1_2, ...), ...)

  5. Relevance Score for Document 2: Sq,d2 = sum(max(eq1 · ed2_1, eq1 · ed2_2, ...), max(eq2 · ed2_1, eq2 · ed2_2, ...), ...)

  6. The documents are ranked based on their relevance scores. In this example, Document 1 would likely receive a higher relevance score than Document 2, as it contains more information related to the query "What is the capital of France?".
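To make the arithmetic in steps 4-6 concrete, here is a toy rendering with tiny made-up 2-dimensional embeddings in place of real BERT outputs:

```python
import torch
import torch.nn.functional as F

# Two query "terms" and two "terms" per document, as unit vectors (made-up values).
Eq  = F.normalize(torch.tensor([[1.0, 0.0], [0.7, 0.7]]), dim=-1)
Ed1 = F.normalize(torch.tensor([[0.9, 0.1], [0.6, 0.8]]), dim=-1)   # "Paris" document
Ed2 = F.normalize(torch.tensor([[0.0, 1.0], [-0.5, 0.9]]), dim=-1)  # "London" document

for name, Ed in [("Document 1", Ed1), ("Document 2", Ed2)]:
    score = (Eq @ Ed.T).max(dim=1).values.sum()  # MaxSim per query term, then sum
    print(name, round(score.item(), 3))          # Document 1 scores higher, as expected
```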

This example demonstrates how ColBERT encodes queries and documents separately, and then uses late interaction to compute relevance scores based on the maximum similarity between query and document embeddings.

This approach allows for efficient retrieval while leveraging the power of contextualized representations from BERT.
