# ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

This influential and widely cited <mark style="color:blue;">**April 2020**</mark> paper introduced a novel <mark style="color:blue;">**information retrieval (IR) model**</mark> that addressed the efficiency and effectiveness challenges faced by existing deep learning-based ranking models.

Information retrieval systems are judged on their ability to quickly and cost-effectively search large document collections for relevant information.

Recent advances in deep learning, particularly deep language models like BERT (2018), have significantly improved the effectiveness of ranking models.

However, these models are <mark style="color:yellow;">computationally expensive</mark>, leading to high latency and resource requirements.&#x20;

<mark style="color:blue;">**ColBERT (Contextualized Late Interaction over BERT)**</mark> is a ranking model that addresses these challenges by allowing *<mark style="color:yellow;">independent encoding of queries and documents</mark>*, followed by a computationally cheap interaction step, making it suitable for both re-ranking and end-to-end retrieval scenarios.

{% embed url="https://arxiv.org/abs/2004.12832" %}
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
{% endembed %}

### <mark style="color:purple;">**Contrasting ColBERT with Other Methods**</mark>

#### <mark style="color:green;">**Representation-based Similarity (DSSM, SNRM)**</mark>

These methods compute embeddings for the query and document <mark style="color:yellow;">independently</mark>, then calculate a single similarity score between the two vectors.

In contrast, ColBERT computes multiple embeddings for each query and document term, allowing for a more fine-grained interaction.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FDTIseaz30YqD7H0hzkyt%2Fimage.png?alt=media&#x26;token=c82f60ff-b4e1-4b21-9a55-01204b15ce63" alt=""><figcaption><p>Schematic diagrams illustrating query–document matching paradigms in neural IR.</p></figcaption></figure>

<details>

<summary><mark style="color:green;"><strong>Explanation of Fine-Grained Embeddings</strong></mark></summary>

**What Are Fine-Grained Embeddings?**

Fine-grained embeddings refer to the practice of representing text at a detailed level by generating embeddings (dense vector representations) <mark style="color:yellow;">for individual tokens</mark> (words or subword units) *<mark style="color:yellow;">rather than for entire sentences or documents</mark>*. In the context of ColBERT (Contextualized Late Interaction over BERT), both queries and documents are encoded into sequences of token-level embeddings using a language model like BERT.

**Why Are They Called "Fine-Grained"?**

The term "fine-grained" signifies that the *<mark style="color:yellow;">embeddings capture detailed, granular information about each token in the text</mark>*. Instead of summarizing an entire document or query into a single vector (coarse-grained), fine-grained embeddings preserve the contextual nuances of each token within its sequence. This granularity allows the model to consider the specific contributions of individual words when determining relevance.

**How Do Fine-Grained Embeddings Work in ColBERT?**

1. **Tokenization and Encoding:**
   * **Queries and Documents:** Both are tokenized into sequences of tokens using BERT's WordPiece tokenizer.
   * **Contextualization:** Each token is passed through BERT to obtain a contextualized embedding that captures its meaning in context.
2. **Independent Encoding:**
   * **Queries and Documents are Encoded Separately:** This allows document embeddings to be precomputed and stored offline.
   * **Embeddings for Each Token:** Instead of a single vector per query or document, ColBERT maintains an embedding for each token.
3. **Late Interaction Mechanism:**
   * **MaxSim Operation:** At query time, ColBERT computes the maximum similarity between each query token embedding and all document token embeddings.
   * **Aggregation:** The relevance score is the sum of these maximum similarities across all query tokens.
   * **Fine-Grained Matching:** This process captures detailed interactions between specific query terms and document terms.

**Benefits of Fine-Grained Embeddings in ColBERT:**

* **Detailed Interactions:** Captures nuanced relationships between query and document terms, improving retrieval effectiveness.
* **Contextual Understanding:** Embeddings consider the context in which tokens appear, allowing for better handling of synonyms and polysemous words.
* **Efficiency:** Independent encoding enables precomputing document embeddings, reducing computational load at query time.

</details>

#### <mark style="color:green;">**Query-Document Interaction (e.g., DRMM, KNRM, Conv-KNRM)**</mark>

These methods model word-level interactions between the <mark style="color:blue;">query</mark> and <mark style="color:blue;">document</mark>, typically using an interaction matrix processed by a neural network.

ColBERT also models interactions, but does so in a "late" fashion. It <mark style="color:yellow;">first computes embeddings for query and document terms independently</mark>, then calculates interactions using MaxSim operations. This allows for pre-computation of document embeddings, improving efficiency.

#### <mark style="color:green;">**All-to-all Interaction (BERT)**</mark>

BERT-like models consider interactions among all words in the query and document simultaneously, using the Transformer's attention mechanism.

ColBERT also uses BERT, but employs it differently.&#x20;

Instead of a single, computationally expensive all-to-all interaction, it uses BERT to generate term-level embeddings, then applies cheaper MaxSim operations for interaction. This retains the power of BERT's representations while improving efficiency.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FTysyfbqKzlUlVE2ajZJW%2Fimage.png?alt=media&#x26;token=caf15925-136d-4036-8e50-68a578b88a83" alt="" width="563"><figcaption><p>Source Vespa: <em>Illustration of regular text embedding models that encode all the words in the context window of the language model into a single vector representation. The query document similarity expression is only considering the lone vector representation from the pooling operation. For a great practical introduction and behind-the-scenes of text embedding models, we can recommend</em> <a href="https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/"><em>this blog post</em></a></p></figcaption></figure>

<details>

<summary><mark style="color:green;"><strong>A conversation on ColBERT</strong></mark></summary>

In this YouTube conversation, Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the two influential papers introducing ColBERT (from 2020) and ColBERT v2 (from 2022). A summary of the discussion:

<mark style="color:blue;">**Introduction**</mark>

* ColBERT is a neural information retrieval model that differs from other <mark style="color:yellow;">dense retrieval methods</mark>.
* Most dense retrieval methods represent queries and documents as single, low-dimensional vectors (embeddings).
* ColBERT represents queries and documents <mark style="color:yellow;">using multiple vectors</mark>, one for each term, allowing for a more expressive representation.

<mark style="color:blue;">**ColBERT Architecture**</mark>

* ColBERT uses a dual-encoder architecture, with separate encoders for queries and documents.
* The query encoder takes a fixed-length query (32 terms, truncated or padded) and produces an embedding for each query term.
* The document encoder processes the document text and generates an embedding for each document term.
* The similarity between a query and a document is computed using the MaxSim operation, which calculates the maximum cosine similarity between each query term embedding and all document term embeddings.

<mark style="color:blue;">**Retrieval and Re-ranking**</mark>

* ColBERT can be used as both a retriever and a re-ranker.
* For retrieval, ColBERT finds the top-k candidate documents for each query term and then re-ranks these candidates using the full MaxSim score.
* The re-ranking step computes the exact MaxSim score between the query and the retrieved documents, considering all query terms.

<mark style="color:blue;">**Training and Negative Sampling**</mark>

* ColBERT is trained using triples (query, positive document, negative document) from datasets like MS MARCO.
* Negative sampling is crucial for effective training, and ColBERT explores different strategies, such as random negatives, hard negatives (using BM25), and in-batch negatives.
* Increasing the number of negatives, especially hard negatives, can significantly improve retrieval performance.

<mark style="color:blue;">**ColBERT v2 Improvements**</mark>

* ColBERT v2 introduces several improvements over the original model:&#x20;

a. <mark style="color:yellow;">Distillation</mark> from a cross-encoder (e.g., BERT) to enhance the training process.&#x20;

b. <mark style="color:yellow;">Hybrid data augmentation</mark>, combining random sampling and denoised hard negatives.&#x20;

c. <mark style="color:yellow;">Clustering-based compression</mark> of document embeddings using centroids and delta vectors to reduce storage requirements.

<mark style="color:blue;">**Evaluation and Results**</mark>

* ColBERT is evaluated on benchmark datasets like MS MARCO (in-domain) and BEIR (out-of-domain).
* ColBERT v2 outperforms strong baselines, including single-vector dense retrieval methods such as RocketQA, particularly in out-of-domain settings.
* The improvements in ColBERT v2 are primarily attributed to the advanced training techniques, such as distillation and hard negative mining.

<mark style="color:blue;">**Future Directions**</mark>

* Further improving the storage efficiency of ColBERT, e.g., through centroid pruning and dimensionality reduction.
* Investigating the reasons behind ColBERT's strong out-of-domain performance compared to single-vector dense retrieval methods.
* Exploring alternative granularities for the MaxSim operation, such as sentence-level or paragraph-level representations.

ColBERT is a powerful and efficient neural information retrieval model that strikes a balance between the expressiveness of term-level representations and the efficiency of dense retrieval.

Its strong performance, particularly in out-of-domain settings, has made it a popular choice for researchers and practitioners in the field.

</details>

### <mark style="color:green;">Architecture</mark>

### <mark style="color:blue;">**Query and Document Encoders**</mark>

The first step in ColBERT is encoding the query and documents into a bag of <mark style="color:purple;">fixed-size embeddings</mark>.&#x20;

This is done using BERT-based encoders, denoted as $$fQ$$ for the <mark style="color:blue;">**query encoder**</mark> and $$fD$$ for the <mark style="color:blue;">**document encoder**</mark>.

Although a <mark style="color:blue;">**single BERT model**</mark> is shared between the query and document encoders, ColBERT <mark style="color:yellow;">distinguishes between query and document input sequences</mark> by prepending a <mark style="color:blue;">**special token**</mark> $$\[Q]$$ to <mark style="color:green;">**queries**</mark> and $$\[D]$$ to <mark style="color:green;">**documents**</mark>.

### <mark style="color:green;">Query Encoder</mark>

For the <mark style="color:blue;">**query encoder**</mark> $$(fQ)$$, a textual query $$q$$ is first tokenized into BERT-based WordPiece tokens $$q_1, q_2, ..., q_l$$.

The <mark style="color:blue;">**special token**</mark> $$\[Q]$$ is prepended to the query, right after BERT's sequence start token \[CLS].

If the query has fewer than a predefined number of tokens $$N_q$$, it is padded with BERT's special \[MASK] tokens up to length $$N_q$$.

This is called <mark style="color:blue;">**query augmentation**</mark>, which allows BERT to produce query-based embeddings at the positions corresponding to these masks.&#x20;

Query augmentation serves as a soft, differentiable mechanism for learning to expand queries with new terms or re-weigh existing terms based on their importance for matching the query.

The <mark style="color:yellow;">padded sequence of input tokens is then passed into BERT's deep transformer architecture</mark>, which computes a contextualised representation of each token.&#x20;

The contextualised output representations are then passed through a <mark style="color:purple;">**linear layer**</mark> with no activations.&#x20;

This layer serves to <mark style="color:yellow;">control the dimension</mark> of ColBERT's embeddings, producing m-dimensional embeddings for the layer's output size m, which is typically much smaller than BERT's fixed hidden dimension.

Finally, the output embeddings are normalised so each has an L2 norm equal to one.&#x20;

This ensures that the <mark style="color:blue;">**dot-product**</mark> of any two <mark style="color:blue;">**embeddings**</mark> becomes equivalent to their cosine similarity, falling in the $$[-1, 1]$$ range.
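The query-side preprocessing described above can be sketched in plain Python (whitespace tokenization stands in for BERT's WordPiece tokenizer, and the token strings are illustrative):

```python
N_Q = 32  # fixed query length used in the paper

def prepare_query_tokens(query: str) -> list[str]:
    """Prepend [CLS] and [Q], truncate long queries, then pad short ones
    with [MASK] tokens up to N_Q (query augmentation)."""
    tokens = ["[CLS]", "[Q]"] + query.lower().split()
    tokens = tokens[:N_Q]                        # truncate if too long
    tokens += ["[MASK]"] * (N_Q - len(tokens))   # pad if too short
    return tokens

toks = prepare_query_tokens("what compounds protect the digestive system")
# fixed-length sequence; trailing positions are [MASK] tokens
```

The padded `[MASK]` positions are what BERT later turns into query-augmentation embeddings.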

### <mark style="color:green;">Document Encoder</mark>

The <mark style="color:blue;">**document encoder**</mark> $$(fD)$$ follows a similar architecture.&#x20;

The <mark style="color:blue;">**document**</mark> $$d$$ is segmented into its constituent tokens $$d_1, d_2, ..., d_m$$, and BERT's start token \[CLS] followed by the special token \[D] is prepended.

Unlike queries, \[mask] tokens are not appended to documents.

After passing the input sequence through BERT and the linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols, determined via a predefined list.&#x20;

This filtering is meant to reduce the number of embeddings per document, as contextualized embeddings of punctuation are hypothesised to be unnecessary for effectiveness.
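A minimal sketch of this filtering step, using token strings for illustration (a real implementation compares token ids against a predefined skip list):

```python
import string

PUNCT = set(string.punctuation)

def keep_embedding(token: str) -> bool:
    """Keep the embedding unless its token consists purely of punctuation."""
    return not all(ch in PUNCT for ch in token)

doc_tokens = ["[CLS]", "[D]", "colbert", "is", "efficient", ",", "and", "effective", "."]
kept = [t for t in doc_tokens if keep_embedding(t)]
```

Only the `,` and `.` embeddings are dropped; special tokens and word tokens survive the filter.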

<details>

<summary><mark style="color:green;">Some definitions and questions answered</mark></summary>

#### <mark style="color:blue;">**What Does Dense Retrieval Mean?**</mark>

**Definition:**

Dense retrieval refers to information retrieval methods where <mark style="color:yellow;">both queries and documents</mark> are represented as <mark style="color:purple;">**dense vectors**</mark> (embeddings) in a continuous, high-dimensional space. These embeddings capture semantic meanings, allowing for retrieval based on semantic similarity rather than just exact keyword matching.

**Key Applications:**

* **Semantic Search:** Finding documents that are semantically related to a query, even if they don't share exact keywords.
* **Question Answering Systems:** Retrieving passages that answer a user's question based on meaning rather than keyword overlap.
* **Recommendation Systems:** Suggesting items similar in meaning or context to those the user has interacted with.

***

#### <mark style="color:blue;">**How ColBERT Differs from Other Dense Retrieval Methods**</mark>

While traditional dense retrieval methods encode queries and documents into single dense vectors and compute similarity (e.g., using cosine similarity), ColBERT maintains <mark style="color:purple;">**fine-grained token-level embeddings**</mark>. This allows it to capture more detailed interactions between query terms and document terms, leading to better retrieval effectiveness. Moreover, ColBERT's architecture enables precomputing document embeddings offline, significantly improving efficiency compared to methods that require joint encoding of query-document pairs.

***

#### <mark style="color:blue;">**Understanding Encoding in ColBERT**</mark>

**What Does Encoding Mean in This Context?**

In ColBERT, <mark style="color:blue;">**encoding**</mark> refers to the process of transforming queries and documents from sequences of words into sequences of dense vectors (embeddings) using a language model like BERT. Each token (word or subword unit) in the query or document is mapped to an embedding that captures its contextual meaning within the sequence.

**How Does It Work?**

1. **Tokenization:**
   * The text (query or document) is split into tokens using BERT's WordPiece tokenizer.
   * This process breaks down words into subword units to handle rare words and improve vocabulary coverage.
2. **Embedding with BERT:**
   * The tokens are passed through BERT, which produces a contextualized embedding for each token.
   * These embeddings capture the meaning of each token considering its context in the sequence.
3. **Linear Projection:**
   * The high-dimensional embeddings from BERT are projected to a lower-dimensional space using a linear layer.
   * This reduces the size of the embeddings, making storage and computation more efficient.
4. **Normalization:**
   * Each embedding is normalized (e.g., to have a unit L2 norm).
   * This allows for efficient similarity computations using dot products equivalent to cosine similarity.

**Concept Behind It:**

The idea is to represent the semantic content of queries and documents in a numerical form that can be easily compared. By encoding at the token level and maintaining contextual information, ColBERT captures detailed interactions between queries and documents, leading to better retrieval performance.
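Steps 3 and 4 (linear projection and normalization) can be sketched with NumPy; the matrices here are random stand-ins for BERT's hidden states and the learned projection:

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(32, 768))   # BERT hidden states: 32 tokens x 768 dims
W = rng.normal(size=(768, 128))  # learned linear layer projecting to m = 128 dims

E = H @ W                                         # projection, no activation
E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit L2 norm per token embedding
# now E[i] @ E[j] is exactly the cosine similarity between embeddings i and j
```

Because every row has unit norm, similarity search can use plain dot products, which vector indexes handle efficiently.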

***

#### <mark style="color:blue;">**What is a Dual Encoder and Why is it Used in ColBERT?**</mark>

**Definition:**

A **dual encoder** architecture consists of <mark style="color:yellow;">two separate encoders</mark>: one for queries and one for documents. Each encoder processes its input independently to produce embeddings.

**Why Use a Dual Encoder?**

* **Efficiency:**
  * Document embeddings can be precomputed and stored since they don't depend on the query.
  * At query time, only the query needs to be encoded, reducing computation.
* **Scalability:**
  * Enables handling large document collections because the expensive document encoding is done offline.
* **Flexibility:**
  * Allows for independent optimization of query and document encoders if needed.

<mark style="color:blue;">**Why Does It Take a Fixed-Length Query?**</mark>

* **Consistency:**
  * Fixing the query length (e.g., to 32 tokens) ensures that the model handles inputs uniformly.
  * Queries shorter than the fixed length are padded, and longer ones are truncated.
* **Efficiency:**
  * Simplifies batching and computational processes within the model.

**What is a Term in This Context?**

* A **term** refers to a token in the tokenized input sequence.
* In ColBERT, each term corresponds to a token embedding produced by the encoder.

***

#### <mark style="color:blue;">**Why is a Special Token Used? What Does It Do?**</mark>

**Special Tokens in ColBERT:**

* **\[CLS]:** Standard BERT start-of-sequence token.
* **\[Q]:** A special token prepended to queries after \[CLS].
* **\[D]:** A special token prepended to documents after \[CLS].
* **\[MASK]:** Used for padding queries up to the fixed length.

**Purpose of Special Tokens:**

* **Distinguish Inputs:**
  * \[Q] and \[D] indicate whether the sequence is a query or a document.
  * Helps the model learn different representations for queries and documents even if they contain similar words.
* **Provide Context:**
  * Special tokens influence the contextual embeddings by signalling the type of input.
  * They can capture different patterns or features relevant to queries or documents.

**Idea Behind It:**

By introducing special tokens, the model can condition its embeddings based on whether it's processing a query or a document. This allows for more nuanced representations and can improve retrieval effectiveness.

***

#### <mark style="color:blue;">**Query Encoder in ColBERT**</mark>

**Process:**

1. **Tokenization:**
   * The query text is tokenized into WordPiece tokens.
2. **Preparation:**
   * Prepend \[CLS] and \[Q] tokens to the tokenized query.
   * If the query is shorter than the fixed length (e.g., 32 tokens), append \[MASK] tokens to reach the fixed length.
3. **Query Augmentation:**
   * The use of \[MASK] tokens allows the model to learn to <mark style="color:yellow;">**expand**</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">the query</mark> by predicting relevant tokens at these positions during training.
   * This provides a **soft, differentiable mechanism** for query expansion or re-weighting terms.
4. **Encoding:**
   * The prepared token sequence is passed through BERT to get contextualized embeddings.
5. **Linear Projection and Normalization:**
   * Embeddings are passed through a linear layer to reduce dimensionality.
   * Embeddings are normalized to have unit length.

<mark style="color:blue;">**Why Append \[MASK] Tokens to Queries?**</mark>

* **Query Augmentation:**
  * Helps the model learn which additional terms might be relevant for the query.
  * Enhances the ability to match documents that may contain relevant information not explicitly mentioned in the query.
* **Differentiable Mechanism:**
  * Allows the model to learn this behavior during training without hardcoding additional terms.

***

#### <mark style="color:blue;">**Document Encoder in ColBERT**</mark>

**Process:**

1. **Tokenization:**
   * The document text is tokenized into WordPiece tokens.
2. **Preparation:**
   * Prepend \[CLS] and \[D] tokens to the tokenized document.
3. **Encoding:**
   * The prepared token sequence is passed through BERT to get contextualized embeddings.
4. **Linear Projection and Normalization:**
   * Embeddings are passed through a linear layer to reduce dimensionality.
   * Embeddings are normalized to have unit length.
5. **Filtering:**
   * Embeddings corresponding to punctuation tokens are filtered out.
   * This reduces the number of embeddings per document, focusing on meaningful content.

<mark style="color:blue;">**Why Aren't \[MASK] Tokens Appended to Documents?**</mark>

* **Efficiency:**
  * Documents are generally longer, and adding \[MASK] tokens would increase computational load and storage requirements.
* **Relevance:**
  * Unlike queries, documents don't need augmentation for matching purposes. They already contain all the information.
* **Purpose:**
  * Query augmentation is about expanding or re-weighting the query terms to better match documents.
  * Documents don't require this process since they are matched against the query embeddings.

***

#### <mark style="color:blue;">**Similarity Computation Using MaxSim**</mark>

**How is the Similarity Computed?**

1. **For Each Query Term Embedding:**
   * Compute the similarity (e.g., dot product) with each document term embedding.
   * Since embeddings are normalized, the dot product equals the cosine similarity.
2. **MaxSim Operation:**
   * For each query term, take the maximum similarity across all document terms.
   * This captures the strongest match for that query term in the document.
3. **Aggregation:**
   * Sum the maximum similarities for all query terms to get the final relevance score.

<mark style="color:blue;">**Why Use MaxSim?**</mark>

* **Captures Strongest Matches:**
  * Focuses on the best possible match for each query term.
* **Efficient Computation:**
  * Reduces the computational complexity by avoiding the need to model all possible interactions.
* **Pruning-Friendly:**
  * Supports efficient retrieval using vector similarity indexes, allowing for fast top-k retrieval.

***

#### <mark style="color:blue;">**How These Features Work Together to Make ColBERT Effective**</mark>

1. **Independent Encoding with Contextualization:**
   * Queries and documents are encoded separately using BERT, capturing rich contextual information.
   * This allows for precomputing document embeddings and efficient query-time computation.
2. **Use of Special Tokens and Query Augmentation:**
   * Special tokens \[Q] and \[D] help the model distinguish between queries and documents.
   * Query augmentation with \[MASK] tokens enables the model to expand queries implicitly, improving matching with relevant documents.
3. **Dual-Encoder Architecture:**
   * Enables scalability by allowing document embeddings to be precomputed and stored.
   * The query encoder only needs to process the query at runtime, reducing latency.
4. **MaxSim Similarity Computation:**
   * Efficiently captures the most significant term matches between queries and documents.
   * Supports efficient retrieval using vector similarity search methods.
5. **Dimensionality Reduction and Normalization:**
   * Reduces storage requirements and computational costs.
   * Normalized embeddings facilitate efficient similarity computations.
6. **Filtering in Document Encoder:**
   * Removing punctuation embeddings reduces the size of document representations without losing meaningful information.
7. **Overall Efficiency and Effectiveness Balance:**
   * ColBERT maintains high retrieval effectiveness comparable to full BERT-based models.
   * Achieves significant improvements in efficiency, making it practical for real-time applications.

***

**Conclusion**

ColBERT effectively combines the strengths of deep language models like BERT with efficient retrieval mechanisms.&#x20;

By independently encoding queries and documents while preserving fine-grained contextual embeddings, and employing an efficient interaction step using MaxSim, ColBERT provides a powerful solution for neural information retrieval. Its design allows for scalable, low-latency retrieval without compromising on the quality of results, making it suitable for a wide range of applications that require both accuracy and speed.

</details>

### <mark style="color:purple;">**Late Interaction**</mark>

Late interaction is the process of <mark style="color:yellow;">**estimating the relevance score**</mark> between a <mark style="color:blue;">**query**</mark> $$q$$ and a <mark style="color:blue;">**document**</mark> $$d$$, denoted as $$S_{q,d}$$, using their bags of contextualized embeddings ($$E_q$$ and $$E_d$$) obtained from the query and document encoders.

The key idea behind late interaction is to <mark style="color:yellow;">**delay the interaction between the query and document embeddings**</mark> *until after they have been generated independently*.&#x20;

This is in contrast to other approaches that perform the interaction during the encoding process itself.

In ColBERT, the late interaction is conducted as a sum of maximum similarity computations.&#x20;

Specifically, for each <mark style="color:blue;">**query embedding**</mark> $$E_{q_i}$$ in $$E_q$$, the maximum similarity score is computed against all document embeddings $$E_{d_j}$$ in $$E_d$$.

The similarity function used can be either cosine similarity (implemented efficiently as dot products due to embedding normalization) or squared L2 distance.

Mathematically, the relevance score $$Sq,d$$ is calculated as:

$$
S_{q,d} := \sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}^T
$$

where $$|E_q|$$ and $$|E_d|$$ denote the number of embeddings in $$E_q$$ and $$E_d$$, respectively.

Intuitively, <mark style="color:yellow;">this operation finds the best matching document embedding for each query embedding</mark> and <mark style="color:yellow;">sums up these maximum similarity scores</mark>.

It captures the overall relevance between the query and the document based on the strongest local matches between their contextualised embeddings.
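This late-interaction scoring can be sketched in NumPy, with random unit vectors standing in for the encoders' outputs (dimensions are our choices, not prescribed by the paper):

```python
import numpy as np

def l2_normalize(E: np.ndarray) -> np.ndarray:
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def maxsim_score(E_q: np.ndarray, E_d: np.ndarray) -> float:
    """Sum over query embeddings of the maximum dot-product similarity
    with any document embedding (the MaxSim late interaction)."""
    sim = E_q @ E_d.T            # (|Eq| x m) @ (m x |Ed|) -> pairwise similarities
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
E_q = l2_normalize(rng.normal(size=(32, 128)))   # 32 query embeddings, m = 128
E_d = l2_normalize(rng.normal(size=(180, 128)))  # 180 document embeddings

score = maxsim_score(E_q, E_d)   # one scalar relevance score for this document
```

Since the rows are unit-normalized, every dot product is a cosine similarity in $$[-1, 1]$$, so the score is bounded by the number of query embeddings (here at most 32).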

ColBERT is trained end-to-end using <mark style="color:blue;">**pairwise softmax cross-entropy loss**</mark>.&#x20;

Given a triple $$(q, d^+, d^-)$$, where $$d^+$$ is a positive (relevant) document and $$d^-$$ is a negative (irrelevant) document for query $$q$$, ColBERT produces a score for each document individually using the late interaction mechanism.

The model is then optimised to assign higher scores to positive documents compared to negative documents.

It's important to note that the late interaction mechanism itself has <mark style="color:yellow;">**no trainable parameters**</mark>.&#x20;

The learnable parameters in ColBERT are in the BERT-based encoders and the additional linear layers.
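The pairwise softmax cross-entropy objective can be sketched as follows; the two scores would come from running late interaction on the positive and negative documents, and the numbers below are illustrative:

```python
import math

def pairwise_softmax_ce(score_pos: float, score_neg: float) -> float:
    """Cross-entropy over the two scores, treating the positive document
    as the correct 'class': -log softmax([s+, s-])[0]."""
    m = max(score_pos, score_neg)  # subtract the max for numerical stability
    z = math.exp(score_pos - m) + math.exp(score_neg - m)
    return -(score_pos - m - math.log(z))

loss = pairwise_softmax_ce(14.2, 9.7)  # small loss when s+ clearly exceeds s-
```

The loss is `log 2` when the two scores tie and shrinks toward zero as the positive document pulls ahead, which is exactly the pressure that trains the encoders.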

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2F5HbcIosrMZc4PB5qUIIh%2Fimage.png?alt=media&#x26;token=e36e4893-2fd0-464f-8fea-e257b093c238" alt=""><figcaption><p>The general architecture of ColBERT given a query q and a document d.</p></figcaption></figure>

### <mark style="color:green;">Summary</mark>

* <mark style="color:blue;">**Query Encoder (fQ):**</mark> Transforms a query into a set of fixed-size embeddings, where each embedding is contextualised based on the entire query.
* <mark style="color:blue;">**Document Encoder (fD):**</mark> Similar to the query encoder, it transforms a document into a set of embeddings, with each embedding contextualised within the document's content.
* <mark style="color:blue;">**Late Interaction Mechanism:**</mark> Uses a simple yet powerful approach to compute the relevance score between a query and a document by summing the maximum similarity scores across all pairs of query and document embeddings.

### <mark style="color:purple;">**Retrieval Process**</mark>

The retrieval process in ColBERT involves two stages: an <mark style="color:yellow;">offline indexing stage</mark> and an <mark style="color:yellow;">online querying stage</mark>.

*<mark style="color:blue;">**Offline Indexing:**</mark>* During offline indexing, the <mark style="color:blue;">**document encoder**</mark> $$fD$$ is run on each document in the collection, and the resulting embeddings are stored.&#x20;

This process is computationally expensive but *<mark style="color:yellow;">**needs to be done only once**</mark>*. ColBERT applies several optimisations to speed up indexing:

1. Parallel encoding of document batches using multiple GPUs.
2. Padding documents within a batch to the maximum length for efficient batch processing.
3. Grouping documents by length and processing them in batches of similar lengths (known as length-based bucketing).
4. Parallelising the text preprocessing (e.g., WordPiece tokenization) across CPU cores.

The document embeddings are then saved to disk using 16-bit or 32-bit floating-point representations.
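Length-based bucketing (optimisation 3 above) can be sketched like this (the batch size and toy documents are our choices):

```python
def length_bucketed_batches(docs: list[str], batch_size: int) -> list[list[str]]:
    """Group documents of similar length into the same batch so that
    per-batch padding to the longest member wastes less compute."""
    ordered = sorted(docs, key=lambda d: len(d.split()))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

docs = ["a b c", "a", "a b c d e", "a b"]
batches = length_bucketed_batches(docs, batch_size=2)
```

Without bucketing, the one-token document would be padded to five tokens; with it, the short documents share a batch and pad only to two.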

*<mark style="color:blue;">**Online Querying:**</mark>* During online querying, ColBERT can be used in two modes: <mark style="color:yellow;">**re-ranking**</mark> or <mark style="color:yellow;">**end-to-end retrieval**</mark>.

In the <mark style="color:yellow;">**re-ranking mode**</mark>, ColBERT is used to re-rank a small set of candidate documents (e.g., top-k documents from a term-based retrieval model).

The query encoder $$fQ$$ is run on the input query to obtain the query embeddings $$E_q$$.

The pre-computed document embeddings of the candidate documents are loaded from disk and stacked into a 3D tensor $$D$$.&#x20;

The relevance scores are then computed using late interaction between $$E_q$$ and $$D$$, and the documents are sorted based on their scores.

In the <mark style="color:yellow;">**end-to-end retrieval mode**</mark>, ColBERT directly retrieves the top-k documents from the entire collection.&#x20;

This is done using a two-stage approach:

1. An <mark style="color:blue;">**approximate nearest neighbor (ANN) search**</mark> is performed using the [<mark style="color:blue;">**FAISS library**</mark>](https://training.continuumlabs.ai/knowledge/vector-databases/faiss-facebook-ai-similarity-search). For each query embedding in $$E_q$$, the top-k' nearest document embeddings are retrieved from the FAISS index. This stage efficiently narrows the candidate set to a smaller subset of potentially relevant documents.
2. The candidate documents from the first stage are then re-ranked using the full late interaction mechanism to obtain the final top-k results.

The use of ANN search in the first stage allows ColBERT to efficiently handle large document collections by avoiding exhaustive scoring of all documents. The re-ranking stage ensures that the final results are based on the full ColBERT relevance scores.
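The two stages can be sketched as follows. A brute-force nearest-neighbour scan stands in for the FAISS index so the example stays self-contained; the function name, `emb2doc` mapping, and parameter names are all illustrative assumptions.

```python
import numpy as np

def two_stage_retrieve(Eq, index_embs, emb2doc, doc_embs, k_prime=3, k=2):
    """End-to-end retrieval sketch: brute-force NN stands in for FAISS.

    index_embs: (N, dim) matrix of every document token embedding.
    emb2doc: array mapping each row of index_embs to its document id.
    doc_embs: per-document embedding matrices, indexed by document id.
    """
    # Stage 1: for each query embedding, grab the k' nearest token embeddings
    # and collect the documents they came from.
    sims = Eq @ index_embs.T                         # (nq, N)
    top = np.argsort(-sims, axis=1)[:, :k_prime]
    candidates = sorted(set(emb2doc[top].ravel()))
    # Stage 2: re-rank the candidates with the full MaxSim late interaction.
    scores = {d: (Eq @ doc_embs[d].T).max(axis=1).sum() for d in candidates}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

In practice FAISS replaces the exhaustive `Eq @ index_embs.T` scan with an inverted-file index over quantised embeddings, but the shape of the computation is the same.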

Overall, the late interaction mechanism and the two-stage retrieval process enable ColBERT to achieve high-quality retrieval results while maintaining computational efficiency, making it suitable for large-scale information retrieval tasks.

### <mark style="color:purple;">Experimental Evaluation</mark>

The experimental evaluation section of the ColBERT paper is where the authors empirically assess the performance of ColBERT in various retrieval contexts, addressing specific <mark style="color:blue;">research questions (RQs)</mark> to determine the model's efficiency and effectiveness in comparison with both traditional and neural ranking models.

<mark style="color:blue;">**Datasets & Metrics**</mark>

They used two datasets, MS MARCO and TREC-CAR, which are large enough to train and evaluate deep neural networks. &#x20;

MS MARCO is used to assess <mark style="color:yellow;">reading comprehension and information retrieval with real-world queries,</mark> while TREC-CAR is based on Wikipedia and focuses on <mark style="color:yellow;">complex answer retrieval.</mark>&#x20;

The evaluation metrics include Mean Reciprocal Rank at 10 (MRR\@10) and others like Recall\@k for different k values.
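MRR\@10 is simple enough to compute directly: each query contributes the reciprocal of the rank of its first relevant passage, or zero if none appears in the top 10. A minimal sketch (the input format is an assumption for illustration):

```python
def mrr_at_10(first_relevant_ranks):
    """MRR@10 over a list where each entry is the 1-based rank of the first
    relevant document for a query, or None if it is not in the top 10."""
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= 10:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

print(mrr_at_10([1, 2, None, 5]))  # (1 + 0.5 + 0 + 0.2) / 4 = 0.425
```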

<mark style="color:blue;">**Implementation**</mark>

The authors implemented ColBERT using PyTorch and the transformers library.&#x20;

They fine-tuned the BERT models with specific learning rates and batch sizes, adjusting the number of embeddings per query and the dimension of the ColBERT embeddings. They experimented with both BERT base and large models, adjusting training iterations based on the dataset.

<mark style="color:blue;">**Hardware & Time Measurements**</mark>

They measured latency using a Tesla V100 GPU for the re-ranking tasks, Titan V GPUs for indexing, and CPUs for the end-to-end retrieval experiments, ensuring a fair assessment of computational efficiency.

<mark style="color:blue;">**Re-ranking Efficiency and Effectiveness (RQ1)**</mark>

ColBERT's re-ranking performance was compared with other neural rankers like KNRM, Duet, and fastText+ConvKNRM, and variations of BERT-based rankers. &#x20;

The aim was to understand if ColBERT could bridge the gap between efficient and effective neural models.&#x20;

The results showed that ColBERT achieved comparable or better effectiveness (measured by MRR\@10) than these models while being significantly more efficient in terms of latency and computational resources (FLOPs).

<mark style="color:blue;">**End-to-End Retrieval (RQ2)**</mark>

Beyond re-ranking, the authors tested if ColBERT could support efficient and effective end-to-end retrieval from a large document collection.&#x20;

They demonstrated that ColBERT could retrieve top-k documents directly from the MS MARCO collection with high effectiveness (MRR\@10) and recall metrics, indicating strong performance in filtering a large collection to find relevant documents.

<mark style="color:blue;">**Component Contribution (RQ3)**</mark>

The authors conducted an ablation study to evaluate the contribution of different components of ColBERT, such as late interaction and query augmentation, to the overall performance. This study aimed to identify which parts of the model were critical for its effectiveness and efficiency.

<mark style="color:blue;">**Indexing-Related Costs (RQ4)**</mark>

The study also looked into the costs associated with offline computation and memory overhead for indexing with ColBERT. This aspect is crucial for understanding the <mark style="color:yellow;">practicality of deploying ColBERT in real-world scenarios</mark>, especially considering the balance between offline indexing costs and online retrieval efficiency.

In summary, the experimental evaluation demonstrated ColBERT's ability to maintain or exceed the effectiveness of state-of-the-art neural rankers while significantly reducing computational costs, highlighting its potential for practical, large-scale information retrieval applications.

### <mark style="color:purple;">Conclusion</mark>

ColBERT represents a significant advancement in the field of information retrieval by successfully addressing the trade-off between efficiency and effectiveness in deep learning-based ranking models.&#x20;

By introducing a late interaction mechanism and leveraging the power of BERT's contextualised representations, ColBERT achieves remarkable speedups and requires substantially fewer computational resources compared to existing BERT-based models, while maintaining high-quality retrieval performance.&#x20;

The extensive experimental evaluation demonstrates ColBERT's superiority over non-BERT baselines and its potential for practical deployment in large-scale information retrieval systems.&#x20;

As the volume of digital information continues to grow, the development of efficient and effective retrieval models like ColBERT will be crucial in enabling users to access relevant information quickly and accurately.

### <mark style="color:purple;">ColBERT in Action</mark>

Here's an example of how ColBERT would work in practice, focusing on the query encoding, document encoding, and late interaction stages.

Let's consider a <mark style="color:blue;">**simple query**</mark> and a <mark style="color:blue;">**set of two documents**</mark>:

<mark style="color:blue;">**Query:**</mark> "What is the capital of France?"

<mark style="color:blue;">**Document 1**</mark><mark style="color:blue;">:</mark> "Paris is the capital and most populous city of France, with an estimated population of 2,165,423 residents as of 2019 in an area of more than 105 square kilometres."

<mark style="color:blue;">**Document 2:**</mark> "London is the capital and largest city of England and the United Kingdom, with a population of just under 9 million."

### <mark style="color:purple;">**Query Encoding**</mark>

1. The <mark style="color:blue;">**query**</mark> "What is the capital of France?" is <mark style="color:blue;">**tokenized**</mark> into WordPiece tokens: \['what', 'is', 'the', 'capital', 'of', 'france', '?'].
2. The <mark style="color:blue;">**special token**</mark> \[Q] is <mark style="color:blue;">**prepended to the query**</mark>, and the sequence is padded with \[mask] tokens to *<mark style="color:yellow;">reach the predefined length</mark>* $$N_q$$. Let's assume $$N_q = 10$$. Input to BERT: \[CLS] \[Q] 'what' 'is' 'the' 'capital' 'of' 'france' '?' \[mask] \[mask]
3. The padded sequence is passed through BERT, and the contextualized representations of each token are obtained.
4. The contextualized representations are passed through a linear layer to reduce their dimensionality to $$m$$. Let's assume $$m = 128$$.
5. The resulting embeddings are normalized to have an L2 norm of 1. Query Embeddings: $$E_q = [e_{q_1}, e_{q_2}, \ldots, e_{q_{10}}]$$, where each $$e_{q_i}$$ is a 128-dimensional vector.
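Steps 1–2 above (prepending \[Q] and padding with \[mask] tokens, ColBERT's query augmentation) can be sketched without a real tokenizer; the function name is illustrative and real ColBERT uses BERT's WordPiece vocabulary and token ids rather than strings.

```python
def build_query_input(query_tokens, nq=10):
    """Pad the query to Nq tokens with [mask] (ColBERT's query augmentation).

    A tokenizer-free sketch: real ColBERT works on WordPiece token ids.
    """
    seq = ["[Q]"] + query_tokens
    seq = seq[:nq] + ["[mask]"] * max(0, nq - len(seq))  # pad (or truncate) to Nq
    return ["[CLS]"] + seq

tokens = ["what", "is", "the", "capital", "of", "france", "?"]
print(build_query_input(tokens))
# ['[CLS]', '[Q]', 'what', 'is', 'the', 'capital', 'of', 'france', '?', '[mask]', '[mask]']
```

The \[mask] tokens are not wasted: BERT produces contextualized embeddings for them too, which act as learned query expansion terms.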

### <mark style="color:purple;">**Document Encoding**</mark>

1. <mark style="color:blue;">**Document 1**</mark> is <mark style="color:blue;">**tokenized**</mark> into WordPiece tokens: \['paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city', 'of', 'france', ',', 'with', 'an', 'estimated', 'population', 'of', '2', ',', '165', ',', '423', 'residents', 'as', 'of', '2019', 'in', 'an', 'area', 'of', 'more', 'than', '105', 'square', 'kilometres', '.']
2. The <mark style="color:blue;">**special tokens**</mark> \[CLS] and \[D] are prepended to the document. Input to BERT: \[CLS] \[D] 'paris' 'is' 'the' 'capital' 'and' 'most' 'populous' 'city' 'of' 'france' ',' 'with' 'an' 'estimated' 'population' 'of' '2' ',' '165' ',' '423' 'residents' 'as' 'of' '2019' 'in' 'an' 'area' 'of' 'more' 'than' '105' 'square' 'kilometres' '.'
3. The sequence is passed through BERT, and the contextualized representations of each token are obtained.
4. The contextualized representations are passed through a linear layer to reduce their dimensionality to m (128).
5. The resulting embeddings are normalized, and the embeddings corresponding to punctuation are filtered out.&#x20;
6. <mark style="color:blue;">**Document 1 Embeddings**</mark>: $$E_{d_1} = [e_{d_1,1}, e_{d_1,2}, \ldots, e_{d_1,n}]$$, where each $$e_{d_1,i}$$ is a 128-dimensional vector and $$n$$ is the number of non-punctuation tokens in Document 1.
7. The same process is applied to Document 2, resulting in its embeddings $$E_{d_2}$$.
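The punctuation filtering in step 5 amounts to dropping the embeddings of punctuation-only tokens. A minimal sketch, with integers standing in for the 128-dimensional vectors (the function name is an assumption):

```python
import string

def filter_punctuation(tokens, embeddings):
    """Drop embeddings whose token is pure punctuation, as ColBERT does
    when indexing documents. `embeddings` is any per-token sequence."""
    keep = [i for i, t in enumerate(tokens)
            if not all(c in string.punctuation for c in t)]
    return [tokens[i] for i in keep], [embeddings[i] for i in keep]

toks = ["paris", "is", "the", "capital", ",", "with", "."]
embs = list(range(len(toks)))            # stand-ins for 128-d vectors
kept_toks, kept_embs = filter_punctuation(toks, embs)
print(kept_toks)  # ',' and '.' removed
```

Filtering is worthwhile because punctuation embeddings carry little matching signal yet would otherwise consume index space for every document.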

### <mark style="color:purple;">**Late Interaction**</mark>

1. For each query embedding $$e_{q_i}$$ in $$E_q$$, the maximum similarity score (e.g., cosine similarity) is computed against all document embeddings $$e_{d_1,j}$$ in $$E_{d_1}$$.
2. The same process is repeated for the Document 2 embeddings $$E_{d_2}$$.
3. The maximum similarity scores are summed across all query embeddings to obtain the relevance scores $$S_{q,d_1}$$ and $$S_{q,d_2}$$ for Documents 1 and 2, respectively.&#x20;
4. Relevance Score for Document 1: $$S_{q,d_1} = \sum_{i} \max_{j} \; e_{q_i} \cdot e_{d_1,j}$$
5. Relevance Score for Document 2: $$S_{q,d_2} = \sum_{i} \max_{j} \; e_{q_i} \cdot e_{d_2,j}$$
6. The documents are ranked based on their relevance scores. In this example, Document 1 would likely receive a higher relevance score than Document 2, as it contains more information related to the query "What is the capital of France?".
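The late-interaction arithmetic can be made concrete with toy numbers. Here 2-dimensional vectors stand in for the 128-dimensional embeddings of the example (the values are invented purely for illustration, not real BERT outputs):

```python
import numpy as np

# Toy 2-d stand-ins for the 128-d ColBERT embeddings in the example above.
Eq  = np.array([[1.0, 0.0], [0.0, 1.0]])              # two query embeddings
Ed1 = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]])  # "Document 1" tokens
Ed2 = np.array([[0.4, 0.6], [0.2, 0.8]])              # "Document 2" tokens

def maxsim(Eq, Ed):
    # For each query embedding take its best-matching document embedding,
    # then sum those maxima: S = sum_i max_j (e_qi . e_dj)
    return (Eq @ Ed.T).max(axis=1).sum()

s1, s2 = maxsim(Eq, Ed1), maxsim(Eq, Ed2)
print(s1, s2)  # 1.8 vs 1.2: Document 1 scores higher and ranks first
```

Each query term is free to "choose" a different document token as its best match, which is what makes the interaction fine-grained despite the encoders never seeing each other's input.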

This example demonstrates how ColBERT encodes queries and documents separately, and then uses late interaction to compute relevance scores based on the maximum similarity between query and document embeddings.&#x20;

This approach allows for efficient retrieval while leveraging the power of contextualized representations from BERT.
