ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

The widely cited paper by Omar Khattab and Matei Zaharia from Stanford University

This famous and widely cited June 2020 paper introduced a novel information retrieval (IR) model that addressed the efficiency and effectiveness challenges faced by existing deep learning-based ranking models.

Information retrieval systems are ranked on their ability to quickly and cost effectively search through large document collections to find relevant information.

Recent advances in deep learning, particularly the use of deep language models like BERT (2017), have significantly improved the effectiveness of ranking models.

However, these models are computationally expensive, leading to high latency and resource requirements.

ColBERT (Contextualized Late Interaction over BERT) is a ranking model that addresses these challenges by allowing independent encoding of queries and documents, followed by a computationally cheap interaction step, making it suitable for both re-ranking and end-to-end retrieval scenarios.

Contrasting ColBERT with Other Methods

Representation-based Similarity (DSSM, SNRM)

These methods compute embeddings for the query and document independently, then calculate a single similarity score between the two vectors.

In contrast, ColBERT computes multiple embeddings for each query and document term, allowing for a more fine-grained interaction.

Explanation of Fine-Grained Embeddings

What Are Fine-Grained Embeddings?

Fine-grained embeddings refer to the practice of representing text at a detailed level by generating embeddings (dense vector representations) for individual tokens (words or subword units) rather than for entire sentences or documents. In the context of ColBERT (Contextualized Late Interaction over BERT), both queries and documents are encoded into sequences of token-level embeddings using a language model like BERT.

Why Are They Called "Fine-Grained"?

The term "fine-grained" signifies that the embeddings capture detailed, granular information about each token in the text. Instead of summarizing an entire document or query into a single vector (coarse-grained), fine-grained embeddings preserve the contextual nuances of each token within its sequence. This granularity allows the model to consider the specific contributions of individual words when determining relevance.

How Do Fine-Grained Embeddings Work in ColBERT?

Tokenization and Encoding:
- Queries and Documents: Both are tokenized into sequences of tokens using BERT's WordPiece tokenizer.
- Contextualization: Each token is passed through BERT to obtain a contextualized embedding that captures its meaning in context.
Independent Encoding:
- Queries and Documents are Encoded Separately: This allows document embeddings to be precomputed and stored offline.
- Embeddings for Each Token: Instead of a single vector per query or document, ColBERT maintains an embedding for each token.
Late Interaction Mechanism:
- MaxSim Operation: At query time, ColBERT computes the maximum similarity between each query token embedding and all document token embeddings.
- Aggregation: The relevance score is the sum of these maximum similarities across all query tokens.
- Fine-Grained Matching: This process captures detailed interactions between specific query terms and document terms.

Benefits of Fine-Grained Embeddings in ColBERT:

Detailed Interactions: Captures nuanced relationships between query and document terms, improving retrieval effectiveness.
Contextual Understanding: Embeddings consider the context in which tokens appear, allowing for better handling of synonyms and polysemous words.
Efficiency: Independent encoding enables precomputing document embeddings, reducing computational load at query time.

Query-Document Interaction (e.g., DRMM, KNRM, Conv-KNRM)

These methods model word-level interactions between the query and document, typically using an interaction matrix processed by a neural network.

ColBERT also models interactions, but does so in a "late" fashion. It first computes embeddings for query and document terms independently, then calculates interactions using MaxSim operations. This allows for pre-computation of document embeddings, improving efficiency.

All-to-all Interaction (BERT)

BERT-like models consider interactions among all words in the query and document simultaneously, using the Transformer's attention mechanism.

ColBERT also uses BERT, but employs it differently.

Instead of a single, computationally expensive all-to-all interaction, it uses BERT to generate term-level embeddings, then applies cheaper MaxSim operations for interaction. This retains the power of BERT's representations while improving efficiency.

A conversation on Colbert

In this Youtube transcript, Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discus the two influential papers introducing ColBERT (from 2020) and ColBERT v2 (from 2022). This is the summary:

Introduction

ColBERT is a neural information retrieval model that differs from other dense retrieval methods.
Most dense retrieval methods represent queries and documents as single, low-dimensional vectors (embeddings).
ColBERT represents queries and documents using multiple vectors, one for each term, allowing for a more expressive representation.

ColBERT Architecture

ColBERT uses a dual-encoder architecture, with separate encoders for queries and documents.
The query encoder takes a fixed-length query (32 terms, truncated or padded) and produces an embedding for each query term.
The document encoder processes the document text and generates an embedding for each document term.
The similarity between a query and a document is computed using the MaxSim operation, which calculates the maximum cosine similarity between each query term embedding and all document term embeddings.

Retrieval and Re-ranking

ColBERT can be used as both a retriever and a re-ranker.
For retrieval, ColBERT finds the top-k candidate documents for each query term and then re-ranks these candidates using the full MaxSim score.
The re-ranking step computes the exact MaxSim score between the query and the retrieved documents, considering all query terms.

Training and Negative Sampling

ColBERT is trained using triples (query, positive document, negative document) from datasets like MS MARCO.
Negative sampling is crucial for effective training, and ColBERT explores different strategies, such as random negatives, hard negatives (using BM25), and in-batch negatives.
Increasing the number of negatives, especially hard negatives, can significantly improve retrieval performance.

ColBERT v2 Improvements

ColBERT v2 introduces several improvements over the original model:

a. Distillation from a cross-encoder (e.g., BERT) to enhance the training process.

b. Hybrid data augmentation, combining random sampling and denoised hard negatives.

c. Clustering-based compression of document embeddings using centroids and delta vectors to reduce storage requirements.

Evaluation and Results

ColBERT is evaluated on benchmark datasets like MS MARCO (in-domain) and BEIR (out-of-domain).
ColBERT v2 outperforms strong baselines, including dense retrieval methods like BERT and RocketQA, particularly in out-of-domain settings.
The improvements in ColBERT v2 are primarily attributed to the advanced training techniques, such as distillation and hard negative mining.

Future Directions

Further improving the storage efficiency of ColBERT, e.g., through centroid pruning and dimensionality reduction.
Investigating the reasons behind ColBERT's strong out-of-domain performance compared to single-vector dense retrieval methods.
Exploring alternative granularities for the MaxSim operation, such as sentence-level or paragraph-level representations.

ColBERT is a powerful and efficient neural information retrieval model that strikes a balance between the expressiveness of term-level representations and the efficiency of dense retrieval.

Its strong performance, particularly in out-of-domain settings, has made it a popular choice for researchers and practitioners in the field.

Architecture

Query and Document Encoders

The first step in ColBERT is encoding the query and documents into a bag of fixed-size embeddings.

This is done using BERT-based encoders, denoted as $fQ$ for the query encoder and $fD$ for the document encoder.

Although a single BERT model is shared between the query and document encoders, ColBERT distinguishes between query and document input sequences by prepending a special token $[Q]$ to queries and $[D]$ to documents.

Query Encoder

For the query encoder $(fQ)$ , given a textual query $q$ , it is first tokenized into BERT-based WordPiece tokens: $q1, q2, ..., ql$ .

The special token $[Q]$ is prepended to the query, right after BERT's sequence start token [CLS].

If the query has fewer than a predefined number of tokens $Nq$ , it is padded with BERT's special [mask] tokens up to length $Nq$ .

This is called query augmentation, which allows BERT to produce query-based embeddings at the positions corresponding to these masks.

Query augmentation serves as a soft, differentiable mechanism for learning to expand queries with new terms or re-weigh existing terms based on their importance for matching the query.

The padded sequence of input tokens is then passed into BERT's deep transformer architecture, which computes a contextualised representation of each token.

The contextualised output representations are then passed through a linear layer with no activations.

This layer serves to control the dimension of ColBERT's embeddings, producing m-dimensional embeddings for the layer's output size m, which is typically much smaller than BERT's fixed hidden dimension.

Finally, the output embeddings are normalised so each has an L2 norm equal to one.

This ensures that the dot-product of any two embeddings becomes equivalent to their cosine similarity, falling in the $[-1, 1$ ] range.

Document Encoder

The document encoder $(fD)$ follows a similar architecture.

The document $d$ is segmented into its constituent tokens $d1, d2, ..., dm$ , and BERT's start token [CLS] followed by the special token [D] is prepended.

Unlike queries, [mask] tokens are not appended to documents.

After passing the input sequence through BERT and the linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols, determined via a predefined list.

This filtering is meant to reduce the number of embeddings per document, as contextualized embeddings of punctuation are hypothesised to be unnecessary for effectiveness.

Some definitions and questions answered

What Does Dense Retrieval Mean?

Definition:

Dense retrieval refers to information retrieval methods where both queries and documents are represented as dense vectors (embeddings) in a continuous, high-dimensional space. These embeddings capture semantic meanings, allowing for retrieval based on semantic similarity rather than just exact keyword matching.

Key Applications:

Semantic Search: Finding documents that are semantically related to a query, even if they don't share exact keywords.
Question Answering Systems: Retrieving passages that answer a user's question based on meaning rather than keyword overlap.
Recommendation Systems: Suggesting items similar in meaning or context to those the user has interacted with.

How ColBERT Differs from Other Dense Retrieval Methods

While traditional dense retrieval methods encode queries and documents into single dense vectors and compute similarity (e.g., using cosine similarity), ColBERT maintains fine-grained token-level embeddings. This allows it to capture more detailed interactions between query terms and document terms, leading to better retrieval effectiveness. Moreover, ColBERT's architecture enables precomputing document embeddings offline, significantly improving efficiency compared to methods that require joint encoding of query-document pairs.

Understanding Encoding in ColBERT

What Does Encoding Mean in This Context?

In ColBERT, encoding refers to the process of transforming queries and documents from sequences of words into sequences of dense vectors (embeddings) using a language model like BERT. Each token (word or subword unit) in the query or document is mapped to an embedding that captures its contextual meaning within the sequence.

How Does It Work?

Tokenization:
- The text (query or document) is split into tokens using BERT's WordPiece tokenizer.
- This process breaks down words into subword units to handle rare words and improve vocabulary coverage.
Embedding with BERT:
- The tokens are passed through BERT, which produces a contextualized embedding for each token.
- These embeddings capture the meaning of each token considering its context in the sequence.
Linear Projection:
- The high-dimensional embeddings from BERT are projected to a lower-dimensional space using a linear layer.
- This reduces the size of the embeddings, making storage and computation more efficient.
Normalization:
- Each embedding is normalized (e.g., to have a unit L2 norm).
- This allows for efficient similarity computations using dot products equivalent to cosine similarity.

Concept Behind It:

The idea is to represent the semantic content of queries and documents in a numerical form that can be easily compared. By encoding at the token level and maintaining contextual information, ColBERT captures detailed interactions between queries and documents, leading to better retrieval performance.

What is a Dual Encoder and Why is it Used in ColBERT?

Definition:

A dual encoder architecture consists of two separate encoders: one for queries and one for documents. Each encoder processes its input independently to produce embeddings.

Why Use a Dual Encoder?

Efficiency:
- Document embeddings can be precomputed and stored since they don't depend on the query.
- At query time, only the query needs to be encoded, reducing computation.
Scalability:
- Enables handling large document collections because the expensive document encoding is done offline.
Flexibility:
- Allows for independent optimization of query and document encoders if needed.

Why Does It Take a Fixed-Length Query?

Consistency:
- Fixing the query length (e.g., to 32 tokens) ensures that the model handles inputs uniformly.
- Queries shorter than the fixed length are padded, and longer ones are truncated.
Efficiency:
- Simplifies batching and computational processes within the model.

What is a Term in This Context?

A term refers to a token in the tokenized input sequence.
In ColBERT, each term corresponds to a token embedding produced by the encoder.

Why is a Special Token Used? What Does It Do?

Special Tokens in ColBERT:

[CLS]: Standard BERT start-of-sequence token.
[Q]: A special token prepended to queries after [CLS].
[D]: A special token prepended to documents after [CLS].
[MASK]: Used for padding queries up to the fixed length.

Purpose of Special Tokens:

Distinguish Inputs:
- [Q] and [D] indicate whether the sequence is a query or a document.
- Helps the model learn different representations for queries and documents even if they contain similar words.
Provide Context:
- Special tokens influence the contextual embeddings by signalling the type of input.
- They can capture different patterns or features relevant to queries or documents.

Idea Behind It:

By introducing special tokens, the model can condition its embeddings based on whether it's processing a query or a document. This allows for more nuanced representations and can improve retrieval effectiveness.

Query Encoder in ColBERT

Process:

Tokenization:
- The query text is tokenized into WordPiece tokens.
Preparation:
- Prepend [CLS] and [Q] tokens to the tokenized query.
- If the query is shorter than the fixed length (e.g., 32 tokens), append [MASK] tokens to reach the fixed length.
Query Augmentation:
- The use of [MASK] tokens allows the model to learn to expand the query by predicting relevant tokens at these positions during training.
- This provides a soft, differentiable mechanism for query expansion or re-weighting terms.
Encoding:
- The prepared token sequence is passed through BERT to get contextualized embeddings.
Linear Projection and Normalization:
- Embeddings are passed through a linear layer to reduce dimensionality.
- Embeddings are normalized to have unit length.

Why Append [MASK] Tokens to Queries?

Query Augmentation:
- Helps the model learn which additional terms might be relevant for the query.
- Enhances the ability to match documents that may contain relevant information not explicitly mentioned in the query.
Differentiable Mechanism:
- Allows the model to learn this behavior during training without hardcoding additional terms.

Document Encoder in ColBERT

Process:

Tokenization:
- The document text is tokenized into WordPiece tokens.
Preparation:
- Prepend [CLS] and [D] tokens to the tokenized document.
Encoding:
- The prepared token sequence is passed through BERT to get contextualized embeddings.
Linear Projection and Normalization:
- Embeddings are passed through a linear layer to reduce dimensionality.
- Embeddings are normalized to have unit length.
Filtering:
- Embeddings corresponding to punctuation tokens are filtered out.
- This reduces the number of embeddings per document, focusing on meaningful content.

Why Aren't [MASK] Tokens Appended to Documents?

Efficiency:
- Documents are generally longer, and adding [MASK] tokens would increase computational load and storage requirements.
Relevance:
- Unlike queries, documents don't need augmentation for matching purposes. They already contain all the information.
Purpose:
- Query augmentation is about expanding or re-weighting the query terms to better match documents.
- Documents don't require this process since they are matched against the query embeddings.

Similarity Computation Using MaxSim

How is the Similarity Computed?

For Each Query Term Embedding:
- Compute the similarity (e.g., dot product) with each document term embedding.
- Since embeddings are normalized, the dot product equals the cosine similarity.
MaxSim Operation:
- For each query term, take the maximum similarity across all document terms.
- This captures the strongest match for that query term in the document.
Aggregation:
- Sum the maximum similarities for all query terms to get the final relevance score.

Why Use MaxSim?

Captures Strongest Matches:
- Focuses on the best possible match for each query term.
Efficient Computation:
- Reduces the computational complexity by avoiding the need to model all possible interactions.
Pruning-Friendly:
- Supports efficient retrieval using vector similarity indexes, allowing for fast top-k retrieval.

How These Features Work Together to Make ColBERT Effective

Independent Encoding with Contextualization:
- Queries and documents are encoded separately using BERT, capturing rich contextual information.
- This allows for precomputing document embeddings and efficient query-time computation.
Use of Special Tokens and Query Augmentation:
- Special tokens [Q] and [D] help the model distinguish between queries and documents.
- Query augmentation with [MASK] tokens enables the model to expand queries implicitly, improving matching with relevant documents.
Dual-Encoder Architecture:
- Enables scalability by allowing document embeddings to be precomputed and stored.
- The query encoder only needs to process the query at runtime, reducing latency.
MaxSim Similarity Computation:
- Efficiently captures the most significant term matches between queries and documents.
- Supports efficient retrieval using vector similarity search methods.
Dimensionality Reduction and Normalization:
- Reduces storage requirements and computational costs.
- Normalized embeddings facilitate efficient similarity computations.
Filtering in Document Encoder:
- Removing punctuation embeddings reduces the size of document representations without losing meaningful information.
Overall Efficiency and Effectiveness Balance:
- ColBERT maintains high retrieval effectiveness comparable to full BERT-based models.
- Achieves significant improvements in efficiency, making it practical for real-time applications.

Conclusion

ColBERT effectively combines the strengths of deep language models like BERT with efficient retrieval mechanisms.

By independently encoding queries and documents while preserving fine-grained contextual embeddings, and employing an efficient interaction step using MaxSim, ColBERT provides a powerful solution for neural information retrieval. Its design allows for scalable, low-latency retrieval without compromising on the quality of results, making it suitable for a wide range of applications that require both accuracy and speed.

Late Interaction

Late interaction is the process of estimating the relevance score between a query $q$ and a document $d$ , denoted as $Sq,d$ , using their bags of contextualized embeddings ( $Eq$ and $Ed$ ) obtained from the query and document encoders.

The key idea behind late interaction is to delay the interaction between the query and document embeddings until after they have been generated independently.

This is in contrast to other approaches that perform the interaction during the encoding process itself.

In ColBERT, the late interaction is conducted as a sum of maximum similarity computations.

Specifically, for each query embedding $Eqi$ in $Eq$ , the maximum similarity score is computed against all document embeddings $Edj$ in $Ed$ .

The similarity function used can be either cosine similarity (implemented efficiently as dot products due to embedding normalization) or squared L2 distance.

Mathematically, the relevance score $Sq,d$ is calculated as:

Sq,d := Σ max (Eqi · Edj^T) i∈|Eq| j∈|Ed|

where $|Eq|$ and $|Ed|$ denote the number of embeddings in $Eq$ and $Ed$ , respectively.

Intuitively, this operation finds the best matching document embedding for each query embedding and sums up these maximum similarity scores.

It captures the overall relevance between the query and the document based on the strongest local matches between their contextualised embeddings.

ColBERT is trained end-to-end using pairwise softmax cross-entropy loss.

Given a triple $(q, d+, d-),$ where $d+$ is a positive (relevant) document and $d-$ is a negative (irrelevant) document for query $q$ , ColBERT produces a score for each document individually using the late interaction mechanism.

The model is then optimised to assign higher scores to positive documents compared to negative documents.

It's important to note that the late interaction mechanism itself has no trainable parameters.

The learnable parameters in ColBERT are in the BERT-based encoders and the additional linear layers.

Summary

Query Encoder (fQ): Transforms a query into a set of fixed-size embeddings, where each embedding is contextualised based on the entire query.
Document Encoder (fD): Similar to the query encoder, it transforms a document into a set of embeddings, with each embedding contextualised within the document's content.
Late Interaction Mechanism: Uses a simple yet powerful approach to compute the relevance score between a query and a document by summing the maximum similarity scores across all pairs of query and document embeddings.

Retrieval Process

The retrieval process in ColBERT involves two stages: an offline indexing stage and an online querying stage.

Offline Indexing: During offline indexing, the document encoder $fD$ is run on each document in the collection, and the resulting embeddings are stored.

This process is computationally expensive but needs to be done only once. ColBERT applies several optimisations to speed up indexing:

Parallel encoding of document batches using multiple GPUs.
Padding documents within a batch to the maximum length for efficient batch processing.
Grouping documents by length and processing them in batches of similar lengths (known as length-based bucketing).
Parallelising the text preprocessing (e.g., WordPiece tokenization) across CPU cores.

The document embeddings are then saved to disk using 16-bit or 32-bit floating-point representations.

Online Querying: During online querying, ColBERT can be used in two modes: re-ranking or end-to-end retrieval.

In the re-ranking mode, ColBERT is used to re-rank a small set of candidate documents (e.g., top-k documents from a term-based retrieval model).

The query encoder $fQ$ is run on the input query to obtain the query embeddings $Eq$ .

The pre-computed document embeddings of the candidate documents are loaded from disk and stacked into a 3D tensor $D$ .

The relevance scores are then computed using late interaction between $Eq$ and $D$ , and the documents are sorted based on their scores.

In the end-to-end retrieval mode, ColBERT directly retrieves the top-k documents from the entire collection.

This is done using a two-stage approach:

An approximate nearest neighbor (ANN) search is performed using the FAISS library. For each query embedding in $Eq$ , the top-k' nearest document embeddings are retrieved from the FAISS index. This stage efficiently narrows down the candidate set to a smaller subset of potentially relevant documents.
The candidate documents from the first stage are then re-ranked using the full late interaction mechanism to obtain the final top-k results.

The use of ANN search in the first stage allows ColBERT to efficiently handle large document collections by avoiding exhaustive scoring of all documents. The re-ranking stage ensures that the final results are based on the full ColBERT relevance scores.

Overall, the late interaction mechanism and the two-stage retrieval process enable ColBERT to achieve high-quality retrieval results while maintaining computational efficiency, making it suitable for large-scale information retrieval tasks.

Experimental Evaluation

The experimental evaluation section of the ColBERT paper is where the authors empirically assess the performance of ColBERT in various retrieval contexts, addressing specific research questions (RQs) to determine the model's efficiency and effectiveness in comparison with both traditional and neural ranking models.

Datasets & Metrics

They used two datasets, MS MARCO and TREC-CAR, which are large enough to train and evaluate deep neural networks.

MS MARCO is used to assess reading comprehension and information retrieval with real-world queries, while TREC-CAR is based on Wikipedia and focuses on complex answer retrieval.

The evaluation metrics include Mean Reciprocal Rank at 10 (MRR@10) and others like Recall@k for different k values.

Implementation

The authors implemented ColBERT using PyTorch and the transformers library.

They fine-tuned the BERT models with specific learning rates and batch sizes, adjusting the number of embeddings per query and the dimension of ColBERT embeddings. They experimented with both BERT base and large models, adjusting training iterations based on the dataset.

Hardware & Time Measurements

They measured latency using a Tesla V100 GPU for re-ranking tasks and Titan V GPUs for CPU-based retrieval and indexing experiments, ensuring a fair assessment of computational efficiency.

Re-ranking Efficiency and Effectiveness (RQ1

ColBERT's re-ranking performance was compared with other neural rankers like KNRM, Duet, and fastText+ConvKNRM, and variations of BERT-based rankers.

The aim was to understand if ColBERT could bridge the gap between efficient and effective neural models.

The results showed that ColBERT achieved comparable or better effectiveness (measured by MRR@10) than these models while being significantly more efficient in terms of latency and computational resources (FLOPs).

End-to-End Retrieval (RQ2)

Beyond re-ranking, the authors tested if ColBERT could support efficient and effective end-to-end retrieval from a large document collection.

They demonstrated that ColBERT could retrieve top-k documents directly from the MS MARCO collection with high effectiveness (MRR@10) and recall metrics, indicating strong performance in filtering a large collection to find relevant documents.

Component Contribution (RQ3)

The authors conducted an ablation study to evaluate the contribution of different components of ColBERT, such as late interaction and query augmentation, to the overall performance. This study aimed to identify which parts of the model were critical for its effectiveness and efficiency.

Indexing-Related Costs (RQ4)

The study also looked into the costs associated with offline computation and memory overhead for indexing with ColBERT. This aspect is crucial for understanding the practicality of deploying ColBERT in real-world scenarios, especially considering the balance between offline indexing costs and online retrieval efficiency.

In summary, the experimental evaluation demonstrated ColBERT's ability to maintain or exceed the effectiveness of state-of-the-art neural rankers while significantly reducing computational costs, highlighting its potential for practical, large-scale information retrieval applications.

Conclusion

ColBERT represents a significant advancement in the field of information retrieval by successfully addressing the trade-off between efficiency and effectiveness in deep learning-based ranking models.

By introducing a late interaction mechanism and leveraging the power of BERT's contextualised representations, ColBERT achieves remarkable speedups and requires substantially fewer computational resources compared to existing BERT-based models, while maintaining high-quality retrieval performance.

The extensive experimental evaluation demonstrates ColBERT's superiority over non-BERT baselines and its potential for practical deployment in large-scale information retrieval systems.

As the volume of digital information continues to grow, the development of efficient and effective retrieval models like ColBERT will be crucial in enabling users to access relevant information quickly and accurately

Colbert in Action

Here's an example of how ColBERT would work in practice, focusing on the query encoding, document encoding, and late interaction stages.

Let's consider a simple query and a set of two documents:

Query: "What is the capital of France?"

Document 1: "Paris is the capital and most populous city of France, with an estimated population of 2,165,423 residents as of 2019 in an area of more than 105 square kilometres."

Document 2: "London is the capital and largest city of England and the United Kingdom, with a population of just under 9 million."

Query Encoding

The query "What is the capital of France?" is tokenized into WordPiece tokens: ['what', 'is', 'the', 'capital', 'of', 'france', '?'].
The special token [Q] is prepended to the query, and the sequence is padded with [mask] tokens to reach the predefined length Nq. Let's assume Nq = 10. Input to BERT: [CLS] [Q] 'what' 'is' 'the' 'capital' 'of' 'france' '?' [mask] [mask]
The padded sequence is passed through BERT, and the contextualized representations of each token are obtained.
The contextualized representations are passed through a linear layer to reduce their dimensionality to $m$ . Let's assume m = 128.
The resulting embeddings are normalized to have an L2 norm of 1. Query Embeddings $(Eq): [eq1, eq2, ..., eq10]$ , where each $eqi$ is a 128-dimensional vector.

Document Encoding

Document 1 is tokenized into WordPiece tokens: ['paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city', 'of', 'france', ',', 'with', 'an', 'estimated', 'population', 'of', '2', ',', '165', ',', '423', 'residents', 'as', 'of', '2019', 'in', 'an', 'area', 'of', 'more', 'than', '105', 'square', 'kilometres', '.']
The special tokens [CLS] and [D] are prepended to the document. Input to BERT: [CLS] [D] 'paris' 'is' 'the' 'capital' 'and' 'most' 'populous' 'city' 'of' 'france' ',' 'with' 'an' 'estimated' 'population' 'of' '2' ',' '165' ',' '423' 'residents' 'as' 'of' '2019' 'in' 'an' 'area' 'of' 'more' 'than' '105' 'square' 'kilometres' '.'
The sequence is passed through BERT, and the contextualized representations of each token are obtained.
The contextualized representations are passed through a linear layer to reduce their dimensionality to m (128).
The resulting embeddings are normalized, and the embeddings corresponding to punctuation are filtered out.
Document 1 Embeddings $(Ed1): [ed1_1, ed1_2, ..., ed1_n]$ , where each $ed1_i$ is a 128-dimensional vector and $n$ is the number of non-punctuation tokens in Document 1.
The same process is applied to Document 2, resulting in its embeddings $(Ed2)$ .

Late Interaction

For each query embedding $eqi$ in $Eq$ , the maximum similarity score (e.g., cosine similarity) is computed against all document embeddings $ed1_j$ in $Ed1$ .
The same process is repeated for Document 2 embeddings $(Ed2)$ .
The maximum similarity scores are summed across all query embeddings to obtain the relevance scores $Sq,d1$ and $Sq,d2$ for Documents 1 and 2, respectively.
Relevance Score for Document 1 $(Sq,d1) = sum(max(eq1 · ed1_1, eq1 · ed1_2, ...), max(eq2 · ed1_1, eq2 · ed1_2, ...), ...)$
Relevance Score for Document 2 $(Sq,d2) = sum(max(eq1 · ed2_1, eq1 · ed2_2, ...), max(eq2 · ed2_1, eq2 · ed2_2, ...), ...)$
The documents are ranked based on their relevance scores. In this example, Document 1 would likely receive a higher relevance score than Document 2, as it contains more information related to the query "What is the capital of France?".

This example demonstrates how ColBERT encodes queries and documents separately, and then uses late interaction to compute relevance scores based on the maximum similarity between query and document embeddings.

This approach allows for efficient retrieval while leveraging the power of contextualized representations from BERT.

PreviousDecoding Sentence-BERT NextSimCSE: Simple Contrastive Learning of Sentence Embeddings

Last updated 10 months ago

Was this helpful?