# BERT as a reranking engine

This <mark style="color:blue;">**April 2020**</mark> paper discusses the integration of pre-trained deep language models, like BERT, into retrieval and ranking pipelines, an approach that has <mark style="color:yellow;">shown significant improvements over traditional bag-of-words models like BM25 in passage retrieval tasks</mark>.&#x20;

While BERT has been effective as a re-ranker, its high computational cost at query time makes it impractical as an initial retriever, *<mark style="color:yellow;">**necessitating the use of BM25 for initial retrieval before BERT re-ranking**</mark>*.

{% embed url="<https://arxiv.org/abs/1901.04085>" %}
Passage Re-ranking with BERT
{% endembed %}

### <mark style="color:purple;">What is the point of reranking?</mark>

Reranking is a second-stage process in information retrieval systems. Its purpose is to improve the initial ranking of documents or passages produced by a less computationally intensive search system, ensuring that the most relevant results are placed at the top.

#### <mark style="color:green;">**Real-World Applications of Reranking**</mark>

<mark style="color:blue;">**Search Engines**</mark>

* **Improving Search Results:** Search engines initially retrieve a large set of potentially relevant documents using basic criteria. Reranking refines this list to improve user satisfaction by presenting the most relevant results first, based on more complex criteria.

<mark style="color:blue;">**E-commerce Platforms**</mark>

* **Product Recommendations:** In e-commerce, reranking can optimise the order of product listings to better match user intent and preferences, potentially increasing sales and improving customer experiences.

<mark style="color:blue;">**Content Discovery Platforms**</mark>

* **Media and News Aggregation:** Platforms like news aggregators or streaming services use reranking to tailor the feed to the user’s preferences, ensuring that the most appealing articles or shows are highlighted.

<mark style="color:blue;">**Question Answering Systems**</mark>

* **Optimising Answers to Queries:** In systems designed to provide direct answers to user queries, reranking is used to select the most accurate and relevant answers from a set of possible candidates.

<mark style="color:blue;">**Legal and Research Databases**</mark>

* **Document Retrieval:** For professionals who rely on precise information, such as lawyers or researchers, reranking helps by prioritising documents that are most relevant to their specific queries, thereby saving time and improving outcomes.

### <mark style="color:purple;">Key Insights and Methodologies</mark>

The authors adapted BERT, originally designed for a broad range of natural language processing tasks, to specifically focus on the re-ranking of passages in response to queries.&#x20;

This adaptation involves fine-tuning the pre-trained BERT model on the passage re-ranking task.

### <mark style="color:green;">BERT Model Overview</mark>

<mark style="color:blue;">**BERT (Bidirectional Encoder Representations from Transformers)**</mark> uses the transformer architecture, which is based on self-attention mechanisms.&#x20;

The core idea is to model all tokens of the input sequence simultaneously and compute attention weights reflecting how tokens influence each other.&#x20;

The key mathematical components of BERT and transformers are:

* **Self-Attention Mechanism**: This mechanism computes a representation of each token in the context of all tokens in the same input sequence.&#x20;
* For a given token, attention weights determine the influence of all tokens (including itself) on its new representation.&#x20;
* Mathematically, the attention weights are calculated using the softmax of scaled dot products of queries $$(Q)$$, keys $$(K)$$, and values $$(V)$$ matrices derived from the input embeddings:

  $$
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d\_k}}\right)V
  $$

  * $$𝑄$$ is the query matrix,
  * $$𝐾$$ is the key matrix,
  * $$𝑉$$ is the value matrix,
  * $$d\_k \text{ is the dimension of the keys}$$
  * $$K^T \text{ denotes the transpose of } K$$
  * $$softmax$$ is the softmax function applied across the relevant dimension to normalize the weights
* <mark style="color:blue;">**Positional Encoding:**</mark> BERT adds positional information to the input embeddings to retain the order of the words. Unlike the original Transformer, which used sinusoidal functions of different frequencies, BERT uses learned positional embeddings.
* <mark style="color:blue;">**Layer-wise Feed-forward Networks:**</mark> Each transformer block contains a feed-forward neural network applied to each position separately and identically. This consists of two linear transformations with a non-linear activation in between (a GELU in BERT; the original Transformer used a ReLU).
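The attention formula above can be sketched directly in NumPy. This is a minimal toy example (random inputs, small shapes), not BERT's actual multi-head implementation:

```python
# Minimal sketch of scaled dot-product attention using NumPy.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

# Toy example: 3 tokens, embedding dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, w = attention(Q, K, V)
print(out.shape)       # (3, 4): one new representation per token
print(w.sum(axis=1))   # each row of attention weights sums to 1
```

Each token's output row is a weighted mix of all value vectors, which is exactly the "every token attends to every other token" behaviour described above.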

### <mark style="color:green;">**Passage Re-Ranking Task Using BERT**</mark>

For the passage re-ranking task, BERT is employed as a binary classification model.&#x20;

The steps involve:

* <mark style="color:blue;">**Input Representation:**</mark> Concatenate the <mark style="color:yellow;">query</mark> and the <mark style="color:yellow;">passage</mark> into a <mark style="color:yellow;">single input sequence</mark> for BERT, using the <mark style="color:yellow;">query as sentence A</mark> and the <mark style="color:yellow;">passage as sentence B</mark>, separated by the special \[SEP] token and preceded by a \[CLS] token that serves as the aggregate representation.
* <mark style="color:blue;">**Token Limitation:**</mark> Due to computational constraints, the <mark style="color:blue;">inputs</mark> (query and passage) are <mark style="color:yellow;">truncated to fit within BERT’s maximum sequence length</mark> (512 tokens), ensuring that the model processes only the most relevant portions of the text.
* <mark style="color:blue;">**Output for Classification**</mark><mark style="color:blue;">:</mark> The output vector corresponding to the \[CLS] token, which has been contextually informed by all tokens through the layers of attention and feed-forward networks, is used as the feature vector for classification. This vector is fed into a simple logistic regression layer (or a single-layer neural network) to compute the probability that the passage is relevant to the query.
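The input construction and truncation steps can be sketched in plain Python. The whitespace tokenizer and `build_input` helper below are illustrative stand-ins for BERT's WordPiece tokenizer, not part of the paper:

```python
# Toy illustration of the [CLS] query [SEP] passage [SEP] layout.
# The whitespace split is a stand-in for BERT's WordPiece tokenizer.
MAX_LEN = 512  # BERT's maximum sequence length

def build_input(query: str, passage: str, max_len: int = MAX_LEN) -> list[str]:
    # Sentence A = query, Sentence B = passage, joined with special tokens.
    tokens = ["[CLS]"] + query.split() + ["[SEP]"] + passage.split() + ["[SEP]"]
    if len(tokens) > max_len:
        # Truncate the passage side, keeping the closing [SEP].
        tokens = tokens[: max_len - 1] + ["[SEP]"]
    return tokens

tokens = build_input(
    "What are the benefits of a Mediterranean diet?",
    "The Mediterranean diet emphasises eating primarily plant-based foods.",
)
print(tokens[:3])   # ['[CLS]', 'What', 'are']
print(len(tokens))
```

In practice a real tokenizer also emits token IDs and segment (sentence A/B) IDs, but the sequence layout is the same.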

### <mark style="color:purple;">Example Scenario: Passage Re-Ranking with BERT</mark>

<mark style="color:green;">**Context**</mark>

Suppose you have a search engine query, "What are the benefits of a Mediterranean diet?" and you want to use BERT to re-rank passages that might contain relevant information.

**Steps:**

1. <mark style="color:blue;">**Input Representation:**</mark>
   * <mark style="color:yellow;">**Query**</mark>**&#x20;(Sentence A):** "What are the benefits of a Mediterranean diet?"
   * <mark style="color:yellow;">**Passage**</mark>**&#x20;(Sentence B):** "The Mediterranean diet emphasises eating primarily plant-based foods, such as fruits and vegetables, whole grains, legumes, and nuts."
   * <mark style="color:yellow;">**Concatenation**</mark>**&#x20;with Special Tokens:**

      <pre class="language-text" data-overflow="wrap"><code class="lang-text">[CLS] What are the benefits of a Mediterranean diet? [SEP] The Mediterranean diet emphasises eating primarily plant-based foods, such as fruits and vegetables, whole grains, legumes, and nuts. [SEP]
      </code></pre>
   * This input is then processed into token IDs using BERT’s tokenizer.
2. <mark style="color:blue;">**Token Limitation:**</mark>
   * If the combined token count exceeds 512, the input is truncated accordingly to fit the model's maximum input size requirements, focusing on retaining the most relevant parts of the query and the passage.
3. <mark style="color:blue;">**Output for Classification:**</mark>
   * BERT processes the input through multiple layers of transformers where each token is embedded, self-attention is applied, and the context is aggregated.
   * The output vector for the \[CLS] token, which aggregates context from the entire sequence, is extracted from the final layer of BERT.
   * This vector is then passed to a logistic regression layer:

$$
\text{Probability of Relevance} = \sigma(W \cdot h\_{\text{CLS}} + b)
$$

* $$\sigma$$ is the sigmoid function, which maps the linear combination to a probability between 0 and 1,
* $$W$$ is the weight matrix,
* $$h\_{\text{CLS}}$$ is the feature vector extracted from the \[CLS] token output by BERT,
* $$b$$ is the bias term, and
* $$\cdot$$ denotes the matrix–vector product between $$W$$ and $$h\_{\text{CLS}}$$.

**Example Output:**

* The <mark style="color:yellow;">logistic regression outputs a probability score</mark>, say 0.85, indicating a high relevance of the passage to the query.

**Use in Re-Ranking:**

* Suppose you have several passages retrieved by a simpler method (e.g., BM25). BERT evaluates each passage's relevance probability as described.
* These passages are then re-ranked based on the probability scores. The passage with the highest score is presented first, followed by others in descending order of their relevance.
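This scoring-and-sorting step can be sketched in a few lines. The `[CLS]` feature vectors, weights, and passage names below are invented for illustration; real values come from BERT's final layer and fine-tuning:

```python
# Hedged sketch: re-rank candidate passages by a logistic layer over each
# passage's [CLS] vector. All vectors and weights here are made up.
import math

def relevance(h_cls, w, b):
    # p = sigmoid(w . h_cls + b)
    z = sum(wi * hi for wi, hi in zip(w, h_cls)) + b
    return 1.0 / (1.0 + math.exp(-z))

w, b = [0.8, -0.5, 1.2], 0.1
candidates = {
    "passage_a": [0.9, 0.2, 0.7],   # hypothetical [CLS] features
    "passage_b": [0.1, 0.9, 0.0],
    "passage_c": [0.5, 0.5, 0.5],
}

scores = {name: relevance(h, w, b) for name, h in candidates.items()}
reranked = sorted(scores, key=scores.get, reverse=True)
print(reranked)   # ['passage_a', 'passage_c', 'passage_b']
```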

#### <mark style="color:blue;">Practical Application:</mark>

This re-ranking approach can dramatically improve the quality of search results in real-world applications, where the initial retrieval might fetch a broad set of potentially relevant documents, and the re-ranking ensures that the most pertinent information is presented to the user first.&#x20;

This method is particularly useful in information-heavy fields such as legal document retrieval, academic research, or any detailed content discovery platform where precision in search results is critical.

<mark style="color:blue;">**Probability Calculation and Ranking**</mark><mark style="color:blue;">:</mark>

The logistic regression model outputs a probability score $$𝑝$$ using the sigmoid function applied to the linear combination of features in the \[CLS] token’s output:

$$
p = \sigma(w^T h\_{\text{CLS}} + b)
$$

Where:

* $$𝑤$$ is the weight vector,
* $$𝑏$$ is a bias term,
* $$h\_{\text{CLS}}$$ is the feature vector from the <mark style="color:yellow;">\[CLS] token</mark>, and
* $$𝜎$$ is the sigmoid function.

<mark style="color:blue;">**Loss Function**</mark>

The model is trained using <mark style="color:blue;">**cross-entropy loss**</mark>, which for binary classification can be expressed as:

$$
L = -\sum\_{j \in J\_{\text{pos}}} \log(p\_j) - \sum\_{j \in J\_{\text{neg}}} \log(1 - p\_j)
$$

where $$J\_{\text{pos}}$$ and $$J\_{\text{neg}}$$ are the sets of indices of relevant and non-relevant passages, respectively.
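For a toy batch of three passages with one relevant label (the probabilities and labels below are invented), the loss can be computed directly:

```python
# Binary cross-entropy ranking loss for a toy batch of passages.
import math

def ranking_loss(probs, relevant):
    # L = -sum_{j in J_pos} log p_j - sum_{j in J_neg} log(1 - p_j)
    loss = 0.0
    for j, p in enumerate(probs):
        loss += -math.log(p) if j in relevant else -math.log(1.0 - p)
    return loss

probs = [0.85, 0.10, 0.60]   # predicted relevance per passage
relevant = {0}               # only the first passage is labelled relevant
print(round(ranking_loss(probs, relevant), 4))
```

Confident predictions that match the labels (like the 0.85 and 0.10 above) contribute little loss; the overconfident 0.60 on a non-relevant passage dominates.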

#### <mark style="color:green;">**Training and Fine-tuning**</mark>

The pre-trained BERT model is fine-tuned on the specific task of passage re-ranking.&#x20;

This involves adjusting the pre-trained parameters to minimise the loss function on a dataset where passages are labeled as relevant or not relevant based on their relationship to queries.&#x20;

The fine-tuning allows BERT to adapt its complex language model to the nuances of determining relevance in passage ranking.

In summary, passage re-ranking with BERT leverages deep learning and the transformer architecture's powerful ability to model language context, refined by fine-tuning on specific information retrieval tasks, to deliver highly relevant search results efficiently.

### <mark style="color:purple;">Benchmarks in Information Retrieval</mark>

In the context of machine learning and information retrieval, a **benchmark** typically refers to a standard dataset used to evaluate and compare the performance of different models systematically.&#x20;

Benchmarks are crucial because they provide consistent scenarios or problems that models must solve, allowing for fair comparisons across different approaches and techniques.

The experimental results section discusses how the model was trained and evaluated using <mark style="color:yellow;">two specific passage-ranking datasets</mark>, MS MARCO and TREC-CAR.

Let's dissect why these datasets are used, the specific methodologies implemented, and the benchmarks provided to gauge the model's performance.

### <mark style="color:purple;">Why Use MS MARCO and TREC-CAR?</mark>

#### <mark style="color:green;">MS MARCO (Microsoft MAchine Reading COmprehension)</mark>

* **Purpose**: MS MARCO is designed to <mark style="color:yellow;">simulate real-world information retrieval scenarios</mark>. It uses queries derived from actual user searches and provides manually annotated relevant and non-relevant passages. This setup challenges models to perform well in practical, real-life situations where the relevance of information can vary greatly.
* **Key Metrics**:
  * **MRR\@10 (Mean Reciprocal Rank at 10)**: This metric is particularly suited for tasks where the user is interested in the top result, such as in web searches or when looking for specific information. It measures the reciprocal of the rank at which the first relevant document is retrieved, averaged over all queries. The focus on the top 10 results reflects typical user behavior in browsing search results.
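A small sketch of how MRR@10 is computed over a set of queries (the toy rankings below are invented):

```python
# MRR@10: for each query, take 1/rank of the first relevant result within
# the top 10 (0 if there is none), then average over all queries.
def mrr_at_10(rankings):
    # rankings: for each query, booleans in ranked order (is result relevant?)
    total = 0.0
    for results in rankings:
        for rank, is_relevant in enumerate(results[:10], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

rankings = [
    [False, True, False],    # first relevant result at rank 2 -> 1/2
    [True],                  # rank 1 -> 1
    [False] * 10,            # nothing relevant in the top 10 -> 0
]
print(mrr_at_10(rankings))   # (0.5 + 1.0 + 0.0) / 3 = 0.5
```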

#### <mark style="color:green;">**TREC-CAR (Text REtrieval Conference - Complex Answer Retrieval)**</mark>

* **Purpose**: This benchmark tests the model's ability to <mark style="color:yellow;">handle complex queries that require understanding and retrieving information from specific sections of long documents</mark>, such as Wikipedia articles. This mimics academic or in-depth research scenarios where queries can be very detailed and require precise answers.
* **Key Metrics**:
  * **MAP (Mean Average Precision)**: Ideal for scenarios where multiple relevant documents exist, it measures the precision averaged over all relevant documents retrieved and is useful for assessing retrieval effectiveness across a list of documents.
  * **MRR\@10**: Like MS MARCO, this metric assesses the precision at the top ranks, crucial for evaluating how well the system retrieves the most relevant document within the first few results.
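MAP can be sketched the same way: compute average precision per query, then average over queries. The example rankings below are invented:

```python
# MAP: average precision per query (precision@k at each relevant hit),
# then the mean over all queries.
def average_precision(results):
    # results: booleans in ranked order (is this result relevant?)
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(results, start=1):
        if is_relevant:
            hits += 1
            total += hits / k   # precision at this cut-off
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

rankings = [
    [True, False, True],    # AP = (1/1 + 2/3) / 2
    [False, True],          # AP = (1/2) / 1
]
print(round(mean_average_precision(rankings), 4))   # 0.6667
```

Unlike MRR@10, MAP rewards retrieving *all* relevant documents highly, which is why it suits TREC-CAR's multi-answer queries.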

#### <mark style="color:green;">Training and Evaluation Details</mark>

* **Data Characteristics**: Both datasets feature large-scale query sets and diverse document types, offering comprehensive training and testing grounds for models. For instance, each MS MARCO query typically has only one relevant passage (and sometimes none), which effectively tests the model's precision and recall.
* **Real-World Simulation**: The datasets mirror the variety and complexity of real-world data, helping to ensure that improvements in model performance translate into better user experiences in practical applications, not just theoretical or overly simplified scenarios.

#### <mark style="color:green;">Avoiding Data Leakage</mark>

* To ensure the model's generalizability and to prevent it from merely memorizing specific answers, BERT models are pre-trained or fine-tuned in a controlled manner. For TREC-CAR, particularly, BERT is trained only on parts of Wikipedia not used in the test set to avoid inadvertently learning the test cases.

### <mark style="color:purple;">Results</mark>

The results presented in the table provide a detailed comparison of different information retrieval methods applied to the MS MARCO and TREC-CAR datasets, specifically <mark style="color:yellow;">measuring their performance using the metrics</mark> <mark style="color:blue;">**MRR\@10 (Mean Reciprocal Rank at cut-off 10)**</mark> and <mark style="color:blue;">**MAP (Mean Average Precision)**</mark>.

These metrics are standard benchmarks used to evaluate the effectiveness of search and retrieval systems. Let's break down the benchmarks, methods, and the reported numbers to understand the significance of these results.

### <mark style="color:green;">Methods Evaluated</mark>

* <mark style="color:blue;">**BM25 (Lucene, no tuning; Anserini, tuned)**</mark><mark style="color:blue;">:</mark> A standard information retrieval function that uses <mark style="color:yellow;">term frequency (TF) and inverse document frequency (IDF) to rank documents</mark> based on the query terms they contain. "Lucene, no tuning" implies a basic configuration, whereas "Anserini, tuned" suggests optimisations were made.
* <mark style="color:blue;">**Co-PACRR:**</mark> A convolutional neural model that <mark style="color:yellow;">captures positional and proximity-based features between query and document terms</mark>.
* <mark style="color:blue;">**KNRM**</mark><mark style="color:blue;">:</mark> Kernel-based Neural Ranking Model that uses <mark style="color:yellow;">kernel pooling to model soft matches between query and document terms</mark>.
* <mark style="color:blue;">**Conv-KNRM**</mark><mark style="color:blue;">:</mark> Enhances KNRM by integrating convolutional neural networks to learn n-gram soft matches between query and document terms.
* <mark style="color:blue;">**IRNet†**</mark><mark style="color:blue;">:</mark> Denotes the previous state-of-the-art model, whose details were unpublished at the time, and which led the leaderboard until these results were reported.
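To make the BM25 baseline concrete, here is a minimal, self-contained scorer over a toy corpus. It is not Lucene/Anserini's implementation; `k1` and `b` are set to common defaults, and the corpus and query are invented:

```python
# Minimal BM25 scorer: combines smoothed IDF with a saturating term
# frequency, normalised by document length.
import math

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n   # average document length
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)            # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)   # smoothed IDF
        tf = doc.count(term)                                # term frequency
        denom = tf + k1 * (1.0 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1.0) / denom
    return score

corpus = [
    "mediterranean diet health benefits".split(),
    "keto diet weight loss".split(),
    "history of the roman empire".split(),
]
query = "mediterranean diet benefits".split()
scores = [bm25_score(query, d, corpus) for d in corpus]
print(scores.index(max(scores)))   # 0: the first document ranks highest
```

Because BM25 only matches exact terms, a passage phrased with synonyms scores zero; that lexical gap is precisely what BERT re-ranking recovers.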

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FmoainaPeDlODuTffL81D%2Fchrome_9UFOIQcCXF.png?alt=media&#x26;token=fcb798e2-80d8-4f91-b13f-8d2037fa902a" alt=""><figcaption></figcaption></figure>

### <mark style="color:green;">Results Analysis</mark>

The table shows the <mark style="color:yellow;">MRR\@10 and MAP scores</mark> for different methods across the development (Dev), evaluation (Eval), and test (Test) datasets for both MS MARCO and TREC-CAR.

* **MS MARCO**:
  * <mark style="color:yellow;">**BERT Large**</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">shows the highest MRR\@10 scores</mark> across Dev and Eval sets, significantly outperforming all other methods. For instance, BERT Large achieves 36.5 on the Dev set and 35.8 on the Eval set, compared to the next best, IRNet, which scores 29.0 and 27.8 respectively.

When BERT Large achieves an MRR\@10 of 36.5 on the Dev set and 35.8 on the Eval set for MS MARCO, it means that, <mark style="color:yellow;">on average</mark>, the first relevant result appears very close to the top of the results list. These scores are percentages, i.e. mean reciprocal ranks of 0.365 and 0.358; a mean reciprocal rank of 0.365 corresponds to the first relevant result typically appearing around rank 3, since 1/0.365 ≈ 2.7.

* **TREC-CAR**:
  * Here, <mark style="color:yellow;">BERT Large also leads with a MAP</mark> of 33.5 on the Test set, significantly higher than the nearest competitor, IRNet, at 28.1.

This indicates that, as on MS MARCO, BERT Large is substantially better at identifying and ranking relevant documents.&#x20;

The test-set score of 33.5 against IRNet's 28.1 underscores BERT's superior capability to discern and rank relevant information even in the complex query scenarios typical of TREC-CAR, where queries are formed from combinations of Wikipedia article titles and section titles.

### <mark style="color:purple;">Results interpretation</mark>

The presented data underscores the superiority of BERT Large in handling complex query passage matching tasks, showcasing its ability to understand and process natural language more effectively than traditional methods and other neural approaches.&#x20;

The large margins by which BERT outperforms other methods highlight its advanced capabilities in semantic understanding and relevance scoring in the context of large-scale information retrieval tasks. These results validate the adoption of BERT for tasks requiring high precision in document retrieval and underscore its impact on advancing the state of the art in search technologies.

### <mark style="color:purple;">Conclusion</mark>

This study has explored the transformative impact of integrating pre-trained deep language models such as BERT into retrieval and ranking pipelines.&#x20;

BERT has demonstrated a substantial enhancement in the accuracy and relevance of passage retrieval tasks over traditional models like BM25, especially when employed as a re-ranker.&#x20;

Despite its computational demands, BERT's sophisticated understanding of context and language nuances significantly improves the quality of search results, confirming its superiority in complex information retrieval scenarios.

The applications of BERT in various real-world systems, from search engines to legal and research databases, illustrate its potential to change how we interact with information, making searches more efficient and results more pertinent.&#x20;

As industries continue to generate and rely on vast amounts of data, the relevance and precision of search technologies powered by models like BERT become increasingly critical.
