BERT as a reranking engine
This April 2020 paper discusses the integration of pre-trained deep language models, like BERT, into retrieval and ranking pipelines, which has shown significant improvements over traditional bag-of-words models like BM25 in passage retrieval tasks.
While BERT has been effective as a re-ranker, its high computational cost at query time makes it impractical as an initial retriever, necessitating the use of BM25 for initial retrieval before BERT re-ranking.
Reranking is a process used in information retrieval systems. The purpose of reranking is to improve the initial ranking of documents or passages provided by a less computationally intensive search system, ensuring that the most relevant results are placed at the top.
Search Engines
Improving Search Results: Search engines initially retrieve a large set of potentially relevant documents using basic criteria. Reranking refines this list to improve user satisfaction by presenting the most relevant results first, based on more complex criteria.
E-commerce Platforms
Product Recommendations: In e-commerce, reranking can optimise the order of product listings to better match user intent and preferences, potentially increasing sales and improving customer experiences.
Content Discovery Platforms
Media and News Aggregation: Platforms like news aggregators or streaming services use reranking to tailor the feed to the user’s preferences, ensuring that the most appealing articles or shows are highlighted.
Question Answering Systems
Optimising Answers to Queries: In systems designed to provide direct answers to user queries, reranking is used to select the most accurate and relevant answers from a set of possible candidates.
Legal and Research Databases
Document Retrieval: For professionals who rely on precise information, such as lawyers or researchers, reranking helps by prioritising documents that are most relevant to their specific queries, thereby saving time and improving outcomes.
The authors adapted BERT, originally designed for a broad range of natural language processing tasks, to specifically focus on the re-ranking of passages in response to queries.
This adaptation involves fine-tuning the pre-trained BERT model on the passage re-ranking task.
BERT (Bidirectional Encoder Representations from Transformers) uses the transformer architecture, which is based on self-attention mechanisms.
The core idea is to model all tokens of the input sequence simultaneously and compute attention weights reflecting how tokens influence each other.
The key mathematical components of BERT and transformers are:
Self-Attention Mechanism: This mechanism computes a representation of each token in the context of all tokens in the same input sequence. For a given token, attention weights determine the influence of all tokens (including itself) on its new representation. Mathematically, the attention weights are calculated using the softmax of scaled dot products of the query ($Q$), key ($K$), and value ($V$) matrices derived from the input embeddings:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, $d_k$ is the dimensionality of the keys, and the softmax is applied across the relevant dimension to normalise the weights.
Positional Encoding: BERT adds positional embeddings to the input token embeddings so that the model retains word order. Unlike the original Transformer, which used fixed sinusoidal functions of different frequencies, BERT's position embeddings are learned during pre-training.
Layer-wise Feed-forward Networks: Each transformer block contains a feed-forward neural network applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
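To make these components concrete, here is a minimal sketch, in plain NumPy with randomly initialised weights, of single-head scaled dot-product attention followed by a position-wise feed-forward network. It is illustrative only: real BERT layers add multi-head attention, residual connections, layer normalisation, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq_len, seq_len) attention weights
    return weights @ V                          # contextualised token representations

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: two linear maps with a ReLU in between, applied to each token."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 8, 16, 64
X = rng.normal(size=(seq_len, d_model))                      # token + position embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

context = self_attention(X, Wq, Wk, Wv)
out = feed_forward(context, W1, b1, W2, b2)
print(out.shape)  # (8, 16): one contextualised vector per input token
```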
For the passage re-ranking task, BERT is employed as a binary classification model.
The steps involve:
Input Representation: Concatenate the query and the passage as a single input sequence to BERT. This is done by using the query as sentence A and the passage as sentence B, separated by special tokens (e.g., [SEP]) and preceded by a [CLS] token that serves as the aggregate representation.
Token Limitation: Due to computational constraints, the inputs (query and passage) are truncated to fit within BERT’s maximum sequence length (512 tokens), ensuring that the model processes only the most relevant portions of the text.
Output for Classification: The output vector corresponding to the [CLS] token, which has been contextually informed by all tokens through the layers of attention and feed-forward networks, is used as the feature vector for classification. This vector is fed into a simple logistic regression layer (or a single-layer neural network) to compute the probability that the passage is relevant to the query.
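The sketch below illustrates these three steps with the Hugging Face transformers library, assuming the bert-base-uncased checkpoint for convenience (the paper uses larger BERT variants). The linear scoring head is randomly initialised here purely to show where the [CLS] vector feeds into the classifier; in practice it would be learned during fine-tuning on relevance labels.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # logistic-regression layer (untrained)

query = "what are the benefits of a mediterranean diet"
passage = "The Mediterranean diet emphasises eating primarily plant-based foods ..."

# Sentence A = query, sentence B = passage: [CLS] query [SEP] passage [SEP],
# truncated to BERT's 512-token maximum sequence length.
inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    cls_vector = encoder(**inputs).last_hidden_state[:, 0, :]  # [CLS] representation
    prob = torch.sigmoid(score_head(cls_vector))               # relevance probability

print(float(prob))
```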
Context
Suppose you have a search engine query, "What are the benefits of a Mediterranean diet?" and you want to use BERT to re-rank passages that might contain relevant information.
Steps:
Input Representation:
Query (Sentence A): "What are the benefits of a Mediterranean diet?"
Passage (Sentence B): "The Mediterranean diet emphasises eating primarily plant-based foods, such as fruits and vegetables, whole grains, legumes, and nuts."
Concatenation with Special Tokens: the two sentences are joined as [CLS] What are the benefits of a Mediterranean diet? [SEP] The Mediterranean diet emphasises eating primarily plant-based foods, such as fruits and vegetables, whole grains, legumes, and nuts. [SEP]
This input is then processed into token IDs using BERT's tokenizer.
Token Limitation:
If the combined token count exceeds 512, the input is truncated accordingly to fit the model's maximum input size requirements, focusing on retaining the most relevant parts of the query and the passage.
Output for Classification:
BERT processes the input through multiple layers of transformers where each token is embedded, self-attention is applied, and the context is aggregated.
The output vector for the [CLS] token, which aggregates context from the entire sequence, is extracted from the final layer of BERT.
This vector is then passed to a logistic regression layer:

$$P(\text{relevant} \mid \text{query}, \text{passage}) = \sigma(W \cdot x_{\text{[CLS]}} + b)$$

where $\sigma$ is the sigmoid function, $W$ is the weight matrix, $x_{\text{[CLS]}}$ is the feature vector extracted from the [CLS] token output by BERT, $\cdot$ denotes matrix multiplication, and $b$ is the bias term. The sigmoid function maps this linear combination to a probability between 0 and 1.
Example Output:
The logistic regression outputs a probability score, say 0.85, indicating a high relevance of the passage to the query.
Use in Re-Ranking:
Suppose you have several passages retrieved by a simpler method (e.g., BM25). BERT evaluates each passage's relevance probability as described.
These passages are then re-ranked based on the probability scores. The passage with the highest score is presented first, followed by others in descending order of their relevance.
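As a rough illustration of this re-ranking loop, the sketch below scores a handful of candidate passages for a query and sorts them by relevance. It assumes a publicly available MS MARCO cross-encoder checkpoint (cross-encoder/ms-marco-MiniLM-L-6-v2) as a lightweight stand-in for the paper's fine-tuned BERT Large re-ranker.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # publicly available MS MARCO cross-encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "what are the benefits of a mediterranean diet"
candidates = [  # e.g. the top passages returned by BM25
    "The Mediterranean diet emphasises eating primarily plant-based foods ...",
    "BM25 is a bag-of-words ranking function used by many search engines.",
    "Olive oil, whole grains and vegetables are linked to better heart health.",
]

with torch.no_grad():
    inputs = tokenizer([query] * len(candidates), candidates,
                       padding=True, truncation=True, max_length=512,
                       return_tensors="pt")
    scores = model(**inputs).logits.squeeze(-1)   # one relevance score per passage

# Re-rank: highest-scoring passage first
reranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
for passage, score in reranked:
    print(f"{score:.3f}  {passage[:60]}")
```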
This re-ranking approach can dramatically improve the quality of search results in real-world applications, where the initial retrieval might fetch a broad set of potentially relevant documents, and the re-ranking ensures that the most pertinent information is presented to the user first.
This method is particularly useful in information-heavy fields such as legal document retrieval, academic research, or any detailed content discovery platform where precision in search results is critical.
Probability Calculation and Ranking: The logistic regression model outputs a probability score by applying the sigmoid function to a linear combination of the features in the [CLS] token's output:

$$s_j = \sigma(w^{\top} x_j + b)$$

where $w$ is the weight vector, $b$ is a bias term, $x_j$ is the feature vector from the [CLS] token for the $j$-th query–passage pair, and $\sigma$ is the sigmoid function. Candidate passages are then re-ranked in descending order of $s_j$.
Loss Function
The model is trained using cross-entropy loss, which for binary classification can be expressed as:

$$L = -\sum_{j \in J_{\text{pos}}} \log(s_j) \;-\; \sum_{j \in J_{\text{neg}}} \log(1 - s_j)$$

where $J_{\text{pos}}$ and $J_{\text{neg}}$ are the sets of indices for the relevant and non-relevant passages, respectively, and $s_j$ is the predicted relevance probability for passage $j$.
The pre-trained BERT model is fine-tuned on the specific task of passage re-ranking.
This involves adjusting the pre-trained parameters to minimise the loss function on a dataset where passages are labeled as relevant or not relevant based on their relationship to queries.
The fine-tuning allows BERT to adapt its complex language model to the nuances of determining relevance in passage ranking.
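A hedged sketch of what one such fine-tuning step might look like is shown below, using a single-logit relevance head and binary cross-entropy as described above. The toy (query, passage, label) triples, the bert-base-uncased checkpoint, and the learning rate are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

# Toy training triples: (query, passage, relevance label)
train_triples = [
    ("what are the benefits of a mediterranean diet",
     "The Mediterranean diet emphasises eating primarily plant-based foods ...", 1.0),
    ("what are the benefits of a mediterranean diet",
     "BM25 is a bag-of-words ranking function used by many search engines.", 0.0),
]

queries, passages, labels = zip(*train_triples)
batch = tokenizer(list(queries), list(passages), padding=True,
                  truncation=True, max_length=512, return_tensors="pt")
targets = torch.tensor(labels).unsqueeze(-1)

model.train()
logits = model(**batch).logits                       # (batch, 1) relevance logits
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
loss.backward()                                      # one gradient step of fine-tuning
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```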
In summary, passage re-ranking with BERT leverages deep learning and the transformer architecture's powerful ability to model language context, refined by fine-tuning on specific information retrieval tasks, to deliver highly relevant search results efficiently.
In the context of machine learning and information retrieval, a benchmark typically refers to a standard dataset used to evaluate and compare the performance of different models systematically.
Benchmarks are crucial because they provide consistent scenarios or problems that models must solve, allowing for fair comparisons across different approaches and techniques.
The experimental results section discusses how the model was trained and evaluated using two specific passage-ranking datasets, MS MARCO and TREC-CAR.
Let's dissect why these datasets are used, the specific methodologies implemented, and the benchmarks provided to gauge the model's performance.
Purpose: MS MARCO is designed to simulate real-world information retrieval scenarios. It uses queries derived from actual user searches and provides manually annotated relevant and non-relevant passages. This setup challenges models to perform well in practical, real-life situations where the relevance of information can vary greatly.
Key Metrics:
MRR@10 (Mean Reciprocal Rank at 10): This metric is particularly suited for tasks where the user is interested in the top result, such as in web searches or when looking for specific information. It measures the reciprocal of the rank at which the first relevant document is retrieved, averaged over all queries. The focus on the top 10 results reflects typical user behavior in browsing search results.
Purpose: TREC-CAR tests the model's ability to handle complex queries that require understanding and retrieving information from specific sections of long documents, such as Wikipedia articles. This mimics academic or in-depth research scenarios where queries can be very detailed and require precise answers.
Key Metrics:
MAP (Mean Average Precision): Ideal for scenarios where multiple relevant documents exist, it measures the precision averaged over all relevant documents retrieved and is useful for assessing retrieval effectiveness across a list of documents.
MRR@10: Like MS MARCO, this metric assesses the precision at the top ranks, crucial for evaluating how well the system retrieves the most relevant document within the first few results.
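For reference, the following is a small sketch of how MRR@10 and MAP can be computed from ranked result lists; the rankings and relevance judgements here are toy data.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant result within the top k, averaged over queries."""
    total = 0.0
    for qid, ranked in rankings.items():
        rr = 0.0
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant[qid]:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(rankings)

def mean_average_precision(rankings, relevant):
    """Precision averaged over all relevant documents retrieved, averaged over queries."""
    total = 0.0
    for qid, ranked in rankings.items():
        hits, precision_sum = 0, 0.0
        for rank, pid in enumerate(ranked, start=1):
            if pid in relevant[qid]:
                hits += 1
                precision_sum += hits / rank
        total += precision_sum / max(len(relevant[qid]), 1)
    return total / len(rankings)

rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p2", "p5"]}
relevant = {"q1": {"p1"}, "q2": {"p2", "p5"}}
print(mrr_at_k(rankings, relevant))            # 0.5
print(mean_average_precision(rankings, relevant))  # ~0.54
```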
Data Characteristics: Both datasets feature large-scale query sets and diverse document types, offering comprehensive training and testing grounds for models. For instance, MS MARCO typically has only one (and sometimes zero) annotated relevant passage per query, which makes it an effective test of the model's precision and recall.
Real-World Simulation: The datasets mirror the variety and complexity of real-world data, helping to ensure that improvements in model performance translate into better user experiences in practical applications, not just theoretical or overly simplified scenarios.
To ensure the model's generalisability and to prevent it from merely memorising specific answers, BERT models are pre-trained or fine-tuned in a controlled manner. For TREC-CAR in particular, BERT is pre-trained only on the parts of Wikipedia that are not used in the test set, to avoid inadvertently learning the test cases.
The results presented in the table provide a detailed comparison of different information retrieval methods applied to the MS MARCO and TREC-CAR datasets, specifically measuring their performance using the metrics MRR@10 (Mean Reciprocal Rank at cut-off 10) and MAP (Mean Average Precision).
These metrics are standard benchmarks used to evaluate the effectiveness of search and retrieval systems. Let's break down the benchmarks, methods, and the reported numbers to understand the significance of these results.
BM25 (Lucene, no tuning; Anserini, tuned): A standard information retrieval function that uses term frequency (TF) and inverse document frequency (IDF) to rank documents based on the query terms they contain. "Lucene, no tuning" implies a basic configuration, whereas "Anserini, tuned" suggests optimisations were made.
Co-PACRR: A convolutional neural model that captures positional and proximity-based features between query and document terms.
KNRM: Kernel-based Neural Ranking Model that uses kernel pooling to model soft matches between query and document terms.
Conv-KNRM: Enhances KNRM by integrating convolutional neural networks to learn n-gram soft matches between query and document terms.
IRNet†: A previous state-of-the-art model whose details are unpublished; it held the leading results on these benchmarks until those reported here.
The table shows the MRR@10 and MAP scores for different methods across the development (Dev), evaluation (Eval), and test (Test) datasets for both MS MARCO and TREC-CAR.
MS MARCO:
BERT Large shows the highest MRR@10 scores across Dev and Eval sets, significantly outperforming all other methods. For instance, BERT Large achieves 36.5 on the Dev set and 35.8 on the Eval set, compared to the next best, IRNet, which scores 29.0 and 27.8 respectively.
When BERT Large achieves 36.5 on the Dev set and 35.8 on the Eval set for MS MARCO, it means that, on average, the first relevant result appears very close to the top of the results list. An MRR@10 of 36.5 corresponds to a mean reciprocal rank of 0.365 (the scores are reported as percentages), so the first relevant passage typically appears around rank 3, since 1/0.365 ≈ 2.7.
TREC-CAR:
Here, BERT Large also excels with a top MRR@10 score of 33.5 on the Test set, which is significantly higher than the nearest competitor, IRNet, which scores 28.1.
This indicates that, similar to the MS MARCO results, BERT Large effectively retrieves relevant documents, placing them typically within the top three results.
The test set score of 33.5 is significantly higher than IRNet's 28.1, underscoring BERT's superior capability to discern and rank relevant information even in complex query scenarios typical of TREC-CAR, where queries are based on combinations of Wikipedia article titles and section titles.
The presented data underscores the superiority of BERT Large in handling complex query passage matching tasks, showcasing its ability to understand and process natural language more effectively than traditional methods and other neural approaches.
The large margins by which BERT outperforms other methods highlight its advanced capabilities in semantic understanding and relevance scoring in the context of large-scale information retrieval tasks. These results validate the adoption of BERT for tasks requiring high precision in document retrieval and underscore its impact on advancing the state of the art in search technologies.
This study has explored the transformative impact of integrating pre-trained deep language models such as BERT into retrieval and ranking pipelines.
BERT has demonstrated a substantial enhancement in the accuracy and relevance of passage retrieval tasks over traditional models like BM25, especially when employed as a re-ranker.
Despite its computational demands, BERT's sophisticated understanding of context and language nuances significantly improves the quality of search results, confirming its superiority in complex information retrieval scenarios.
The applications of BERT in various real-world systems, from search engines to legal and research databases, illustrate its potential to change how we interact with information, making searches more efficient and results more pertinent.
As industries continue to generate and rely on vast amounts of data, the relevance and precision of search technologies powered by models like BERT become increasingly critical.