Massive Text Embedding Benchmark
The leaderboard for embedding models
This March 2023 paper introduces the Massive Text Embedding Benchmark (MTEB), which addresses a gap in how text embeddings are evaluated.
The key points about MTEB are:
Scope of Evaluation
MTEB expands the evaluation of text embeddings beyond a narrow focus.
Traditional evaluations often limit themselves to a small set of datasets from a single task, not adequately covering the diverse applications of text embeddings.
Comprehensive Benchmarking
The benchmark covers 8 embedding tasks: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarisation. It spans 58 datasets and 112 languages.
Inclusiveness of Models
It benchmarks 33 different models, making it one of the most comprehensive benchmarks to date in the field of text embeddings.
Lack of Dominant Method
The evaluation reveals that no single text embedding method consistently outperforms others across all tasks. This suggests the absence of a universal text embedding method that can provide state-of-the-art results for all embedding tasks.
Diversity of Use Cases
The paper highlights the vast range of use cases for natural language embeddings, from clustering and topic representation to search systems and text mining, and as features for downstream models.
Practicality and Intractability
The paper notes the infeasibility of using generative language models or cross-encoders for certain applications due to their extensive computational requirements.
Limited Evaluation Regimes
Current text embedding models are often evaluated in a constrained manner, focusing on tasks like STS and classification, but not thoroughly tested for transferability to other tasks like search or clustering.
Poor Correlation with Real-World Use Cases
It's mentioned that STS evaluations may not correlate well with other real-world applications, indicating a gap in the current evaluation methodologies.
Influence of Implementation Details
The paper emphasises the impact that pre-processing and hyperparameter settings can have on model performance, suggesting that these factors can obscure whether performance improvements are genuine or a result of favourable evaluation setups.
Introduction of MTEB: The paper introduces MTEB as a solution to provide clarity on model performance across a variety of embedding tasks.
Ease of Evaluation: MTEB's open-source software allows any embedding model to be evaluated with minimal coding effort, as shown in the sketch below.
Holistic View: The paper promises a holistic view of the state of text embedding models, including both open-source models and those accessible via APIs.
No Single Best Solution: An important finding is that there is no single best solution for text embeddings, as different models excel in different tasks.
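As a rough illustration of the "ease of evaluation" point above, the sketch below follows the usage pattern documented in the MTEB repository. The task names and model checkpoint are examples, and the `mteb` package API may differ in later versions.

```python
# Minimal MTEB evaluation sketch; assumes the `mteb` and `sentence-transformers`
# packages are installed. Task names and the model checkpoint are illustrative.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any object exposing an encode(list_of_texts) -> embeddings method can be evaluated.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Pick one or more MTEB tasks by name.
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])

# Results are written as JSON files to the output folder.
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```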
Existing Benchmarks: The paper references benchmarks like (Super)GLUE, BIG-bench, SemEval, and SentEval, which have traditionally been used for text embedding evaluation.
Limitations: These benchmarks have limitations, particularly in representing the variety of real-world applications. For instance, SemEval focuses mostly on semantic textual similarity (STS), while SentEval lacks tasks like retrieval or clustering. USEB is mentioned as primarily reranking-focused and BEIR as the standard for zero-shot information retrieval.
Insufficiency of STS: The document highlights the insufficiency of STS-focused benchmarks in capturing the broader spectrum of text embedding applications.
Evolution of Models: The transition from context-unaware models like GloVe to context-aware models based on the transformer architecture (like BERT and SBERT) is outlined.
Fine-Tuning: The paper mentions the trend of fine-tuning transformer models with a contrastive loss objective for text pair embeddings; a minimal sketch of this objective appears below.
Variety and Confusion: There's an emphasis on the variety of pre-trained transformer models available, leading to confusion about which model is best for specific embedding use cases.
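To make the contrastive fine-tuning point concrete, here is a minimal sketch of the in-batch-negatives (InfoNCE-style) objective commonly used for text-pair embedding training. The function name, batch size, and temperature are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              positive_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss with in-batch negatives.

    query_emb, positive_emb: (batch_size, dim) embeddings of paired texts;
    each query's positive is the matching row, every other row acts as a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    # Cosine similarity matrix scaled by temperature: (batch, batch)
    logits = q @ p.T / temperature
    # The correct "class" for row i is column i.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Illustrative usage with random tensors standing in for encoder outputs.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
loss = in_batch_contrastive_loss(q, p)
```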
Retrieval: Involves identifying relevant documents for given queries. The model embeds queries and documents, and rankings are based on cosine similarity scores. Metrics like nDCG@k and MRR@k are used, with nDCG@10 being the primary metric; a small scoring sketch follows these task descriptions.
Semantic Textual Similarity (STS): The task is to determine the similarity of sentence pairs. The similarity is computed using distance metrics, benchmarked against ground truth similarities. Spearman correlation based on cosine similarity is the main metric.
Summarisation: Involves scoring machine-generated summaries against human-written ones. Each machine summary's highest cosine similarity to any human summary is taken as the model's score. Pearson and Spearman correlations with human assessments are the key metrics.
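The sketch below illustrates the scoring described above for retrieval and STS: documents are ranked by cosine similarity and scored with nDCG@10, and STS predictions are cosine similarities compared against gold labels via Spearman correlation. The embeddings and relevance labels are random stand-ins for real model outputs and datasets.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity matrix between rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def ndcg_at_k(relevance_in_ranked_order: np.ndarray, k: int = 10) -> float:
    """nDCG@k for a single query, given graded relevance in ranked order."""
    def dcg(rel):
        rel = rel[:k]
        return float(((2.0 ** rel - 1) / np.log2(np.arange(2, len(rel) + 2))).sum())
    ideal = dcg(np.sort(relevance_in_ranked_order)[::-1])
    return dcg(relevance_in_ranked_order) / ideal if ideal > 0 else 0.0

rng = np.random.default_rng(0)

# Retrieval: rank the corpus for each query by cosine similarity, score with nDCG@10.
query_emb = rng.standard_normal((2, 384))        # stand-in for model.encode(queries)
doc_emb = rng.standard_normal((100, 384))        # stand-in for model.encode(corpus)
relevance = rng.integers(0, 2, (2, 100))         # illustrative gold relevance labels
scores = cosine_sim(query_emb, doc_emb)
for qi in range(scores.shape[0]):
    ranking = np.argsort(-scores[qi])
    print(f"query {qi}: nDCG@10 = {ndcg_at_k(relevance[qi][ranking]):.3f}")

# STS: cosine similarity of sentence pairs vs. gold scores, Spearman correlation.
emb_a = rng.standard_normal((50, 384))           # first sentence of each pair
emb_b = rng.standard_normal((50, 384))           # second sentence of each pair
gold = rng.uniform(0, 5, 50)                     # illustrative human similarity scores
pred = np.sum(emb_a * emb_b, axis=1) / (
    np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
rho, _ = spearmanr(pred, gold)
print(f"STS Spearman = {rho:.3f}")
```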
Self-supervised Methods
Transformer-based: BERT, when used with mean-pooling, directly produces text embeddings (a pooling sketch appears below). SimCSE-Unsup further enhances BERT with additional self-supervised training.
Non-transformer: Models like Komninos and GloVe provide faster, context-unaware word embeddings.
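As a concrete illustration of the mean-pooling approach mentioned above, the sketch below pools BERT token states into a single sentence embedding while masking out padding. It assumes the Hugging Face `transformers` library; the checkpoint name and sentences are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["Embedding benchmarks cover many tasks.",
             "MTEB spans eight task families."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_states = model(**batch).last_hidden_state       # (batch, seq, hidden)

# Mean over real (non-padding) tokens only.
mask = batch["attention_mask"].unsqueeze(-1).float()       # (batch, seq, 1)
embeddings = (token_states * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                                     # torch.Size([2, 768])
```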
Supervised Methods
Transformer Encoder Methods: Include models like coCondenser, Contriever, LaBSE, and SimCSE-BERT-sup, which are BERT-based with variations in training stages or data.
Transformer Decoder Methods: SGPT Bi-Encoders demonstrate fine-tuning of a minimal fraction of GPT parameters, focusing on STS or retrieval tasks depending on the variant.
Non-transformer Context-Aware Model: LASER uses LSTM architecture and is trained on parallel data for bitext mining applications.
Classification: ST5 models excel in classification tasks, with ST5-XXL showing particularly high performance.
Clustering: MPNet, despite being smaller, competes well with larger models like ST5-XXL. This suggests that fine-tuning on diverse datasets benefits clustering tasks.
Pair Classification: GTR-XL and GTR-XXL lead in this area, but models rank differently on STS, underscoring the need for diverse task benchmarking.
Reranking: MPNet and MiniLM models show strong performance, possibly due to training dataset overlaps.
Retrieval: SGPT-5.8B-msmarco excels in retrieval, while retrieval-specialised models underperform in STS tasks, indicating a division in the field between retrieval-focused and similarity-focused models.
Bitext Mining: Dominated by LaBSE, with varying performance across languages.
Multilingual Classification and STS: Mixed results, with SGPT-BLOOM-7B1-msmarco performing well in languages it has been pre-trained on.
No Universal Best Model: The benchmark reveals no single model dominates across all tasks, highlighting the need for task-specific model selection.
Trade-off Between Size and Performance: Larger models generally perform better but come with higher computational costs.
Context-Aware vs. Word Embeddings: Context-aware transformer models generally outperform traditional word embeddings but require more computational resources.
Task-Specific Fine-Tuning: The effectiveness of a model can significantly vary based on how it has been fine-tuned and the specific task it is applied to.
Bifurcation of Retrieval and Similarity Tasks: A clear distinction is observed between models optimised for retrieval tasks and those for similarity tasks, indicating different underlying model requirements for these types of tasks.
Multilingual Capabilities: Performance varies significantly across languages, reflecting the challenges in developing truly universal, multilingual embedding models.
The analysis of the Massive Text Embedding Benchmark (MTEB) provides an understanding of the current landscape of embedding tools and their effectiveness across various tasks.
Leveraging these insights, we can explore creative ideas for new embedding tools that address specific challenges or unexplored areas in the field.
Here are some innovative ideas for new embedding tools, followed by a summary of the current best tools for different tasks as indicated by the MTEB analysis:
Multimodal Embedding Generator: Develop an embedding tool that can process and integrate multiple types of data (text, audio, visual) to create rich, multimodal embeddings. This would be particularly useful for applications that require understanding content across different media formats, such as social media analysis or multimedia content categorization.
Dynamic Temporal Embeddings: Create embeddings that evolve over time to capture the changing meanings or relevance of words and phrases. This tool could be especially useful in fields like trend analysis, where the significance and context of terms can shift rapidly.
Cross-Cultural Embedding Tool: Develop a tool that focuses on cross-cultural nuances in language, capable of understanding idioms, slang, and culturally specific references. This would be invaluable for global sentiment analysis, marketing, and cultural studies.
Interactive Embedding Visualiser: A tool that not only generates embeddings but also provides an interactive visualisation platform. Users could explore how different words or phrases relate to each other in the embedding space, which would be beneficial for educational purposes and to help researchers develop better embeddings (a minimal projection sketch follows these ideas).
Domain-Specific Embedding Optimiser: Given the variability of model performance across tasks, a tool that optimises existing embeddings for specific domains (like legal, medical, or technical fields) would be highly beneficial. This tool could fine-tune general embeddings to make them more effective for specialized applications.
Embedding Personalisation Engine: A tool that creates user-specific embeddings based on their interaction with content. This could be used for personalised recommendation systems or tailored content generation.
Low-Resource Language Embedding Enhancer: Focus on developing embeddings for languages that have limited digital resources available. This tool could use techniques like transfer learning from resource-rich languages to improve NLP capabilities in underrepresented languages.
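As a rough, non-interactive starting point for the visualiser idea above, the sketch below projects embeddings to 2D with PCA and labels the points. The words and embeddings are random placeholders; a real tool would swap in actual model outputs and add an interactive front-end.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["bank", "river", "loan", "water", "credit", "stream"]
embeddings = np.random.randn(len(words), 384)   # stand-in for model.encode(words)

# Reduce to 2D for plotting.
coords = PCA(n_components=2).fit_transform(embeddings)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    ax.annotate(word, (x, y))
ax.set_title("2D PCA projection of word embeddings")
plt.show()
```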
Classification: ST5 models (e.g., ST5-XXL) showed the highest average performance in classification tasks.
Clustering: MPNet and MiniLM demonstrated strong performance in clustering tasks, competing well even with larger models.
Pair Classification: GTR-XL and GTR-XXL were noted for their strong performance in pair classification tasks.
Reranking: MPNet and MiniLM again performed strongly in reranking tasks, especially in specific datasets like SciDocsRR.
Retrieval: SGPT-5.8B-msmarco excelled in retrieval tasks, particularly in the BEIR benchmark subset.
STS (Semantic Textual Similarity): For STS tasks, ST5-XXL had the highest performance, indicating its effectiveness in capturing semantic similarities.
Bitext Mining and Multilingual Tasks: LaBSE dominated in bitext mining, while performance in multilingual classification and STS was mixed, with SGPT-BLOOM-7B1-msmarco performing well in certain languages.