# Massive Text Embedding Benchmark

This <mark style="color:blue;">**March 2023**</mark> paper introduces the <mark style="color:blue;">**Massive Text Embedding Benchmark (MTEB)**</mark>, a large-scale benchmark for evaluating text embedding models across a wide range of tasks, datasets, and languages.&#x20;

{% embed url="https://arxiv.org/abs/2210.07316" %}
Massive Text Embedding Benchmark paper
{% endembed %}

### <mark style="color:purple;">Overview of MTEB</mark>

MTEB is introduced to address a <mark style="color:yellow;">gap in the evaluation of text embeddings.</mark>&#x20;

The key points about MTEB are:

<mark style="color:green;">**Scope of Evaluation**</mark>

MTEB expands the evaluation of text embeddings beyond a narrow focus.&#x20;

Traditional evaluations often limit themselves to a small set of datasets from a single task, not adequately covering the diverse applications of text embeddings.

<mark style="color:green;">**Comprehensive Benchmarking**</mark>

The <mark style="color:yellow;">benchmark encompasses 8 embedding tasks</mark>: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization, spanning 58 datasets and 112 languages.

<mark style="color:green;">**Inclusiveness of Models**</mark>

It benchmarks 33 different models, making it one of the most comprehensive benchmarks to date in the field of text embeddings.

### <mark style="color:purple;">Key Observations and Findings</mark>

<mark style="color:green;">**Lack of Dominant Method**</mark>

The evaluation reveals that <mark style="color:yellow;">no single text embedding method consistently outperforms others across all tasks</mark>. This suggests the absence of a universal text embedding method that can provide state-of-the-art results for all embedding tasks.

<mark style="color:green;">**Diversity of Use Cases**</mark>

The paper highlights the vast range of use cases for natural language embeddings, from clustering and topic representation to search systems and text mining, and as features for downstream models.

<mark style="color:green;">**Practicality and Intractability**</mark>

The paper notes the infeasibility of using generative language models or cross-encoders for certain applications due to their extensive computational requirements.

### <mark style="color:purple;">Challenges in the Field</mark>

<mark style="color:green;">**Limited Evaluation Regimes**</mark>

Current text embedding models are often evaluated in a constrained manner, focusing on tasks like STS and classification, but not thoroughly tested for transferability to other tasks like search or clustering.

<mark style="color:green;">**Poor Correlation with Real-World Use Cases**</mark><mark style="color:green;">:</mark>

It's mentioned that <mark style="color:yellow;">STS evaluations may not correlate well with other real-world applications</mark>, indicating a gap in the current evaluation methodologies.

<mark style="color:green;">**Influence of Implementation Details**</mark>

The paper emphasises the impact that pre-processing and hyperparameter settings can have on model performance, suggesting that these factors can obscure whether performance improvements are genuine or a result of favourable evaluation setups.

### <mark style="color:purple;">Contributions of the Paper</mark>

<mark style="color:green;">**Introduction of MTEB**</mark><mark style="color:green;">:</mark> The paper introduces MTEB as a solution to provide clarity on model performance across a variety of embedding tasks.

<mark style="color:green;">**Ease of Evaluation**</mark><mark style="color:green;">:</mark> MTEB's open-source software allows for easy evaluation of any embedding model with minimal coding effort.

<mark style="color:green;">**Holistic View**</mark><mark style="color:green;">:</mark> The paper promises a holistic view of the state of text embedding models, including both open-source models and those accessible via APIs.

<mark style="color:green;">**No Single Best Solution**</mark><mark style="color:green;">:</mark> An important finding is that there is no single best solution for text embeddings, as different models excel in different tasks.

### <mark style="color:purple;">**Benchmarks**</mark>

<mark style="color:green;">**Existing Benchmarks**</mark><mark style="color:green;">:</mark> The paper references various benchmarks like (Super)GLUE, Big-BENCH, and SemEval, which have traditionally been used for text embedding evaluation.

<mark style="color:green;">**Limitations**</mark><mark style="color:green;">:</mark> These benchmarks have limitations, particularly in representing the variety of real-world applications. For instance, SemEval focuses mostly on <mark style="color:yellow;">semantic textual similarity (STS),</mark> while SentEval lacks tasks like retrieval or clustering. USEB is mentioned as primarily reranking-focused and BEIR as the standard for zero-shot information retrieval.

<mark style="color:green;">**Insufficiency of STS**</mark><mark style="color:green;">:</mark> The document highlights the insufficiency of STS-focused benchmarks in capturing the broader spectrum of text embedding applications.

### <mark style="color:purple;">**Embedding Models**</mark>

<mark style="color:green;">**Evolution of Models**</mark><mark style="color:green;">:</mark> The transition from context-unaware models like Glove to context-aware models based on the transformer architecture (like BERT and SBERT) is outlined.

<mark style="color:green;">**Fine-Tuning**</mark><mark style="color:green;">:</mark> The paper mentions the trend of fine-tuning transformer models with a contrastive loss objective for text pair embeddings.

<mark style="color:green;">**Variety and Confusion**</mark><mark style="color:green;">:</mark> There's an emphasis on the variety of pre-trained transformer models available, leading to confusion about which model is best for specific embedding use cases.

### <mark style="color:purple;">Additional MTEB Tasks</mark>

<mark style="color:green;">**Retrieval**</mark><mark style="color:green;">:</mark> Involves identifying relevant documents for given queries. The model embeds queries and documents, and rankings are based on cosine similarity scores. Metrics like nDCG\@k and MRR\@k are used, with nDCG\@10 being the primary metric.

<details>

<summary><mark style="color:green;">Normalized Discounted Cumulative Gain (NDCG)</mark></summary>

Normalized Discounted Cumulative Gain (NDCG) is a popular metric used to evaluate the performance of ranking models, particularly in search engines and recommendation systems.&#x20;

Ranking models predict the ranks of items based on search queries and assign relevance scores to each item.

NDCG is a measure of ranking quality that compares the relevance of items returned by a search engine or recommendation system to an ideal ranking.

The components of NDCG are:

1. Cumulative Gain (CG): The sum of relevance scores (gains) for items within a search query
2. Discounted Cumulative Gain (DCG): Extends CG by discounting the gains based on the item's position in the ranking.&#x20;
3. Ideal Discounted Cumulative Gain (IDCG): The best possible DCG for a group, assuming the most relevant items are at the top.&#x20;
4. Normalized Discounted Cumulative Gain (NDCG): Normalizes DCG by dividing it by IDCG, allowing for fair comparisons between different search groups.

NDCG\@K considers only the top K ranked items in the calculation.

NDCG is used in model monitoring to evaluate the performance of ranking models in production. For example, music streaming apps and social media platforms use NDCG to assess the relevance of their recommendations.

A low NDCG value in production can indicate performance degradation in a recommendation system, signalling that the ranking no longer matches what users actually find relevant.&#x20;

</details>
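
The sketch below, assuming graded relevance labels listed in the order the system ranked the items, shows how these quantities combine into NDCG@K:

```python
import numpy as np

def dcg_at_k(relevances, k: int) -> float:
    # Discounted Cumulative Gain: each gain is discounted by log2 of its position.
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # positions 1..k -> log2(2)..log2(k+1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k: int) -> float:
    # Normalise by the ideal DCG, i.e. the DCG of the relevances sorted best-first.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance scores of items in the order they were actually ranked:
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```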

<mark style="color:green;">**Semantic Textual Similarity (STS)**</mark><mark style="color:green;">:</mark> The task is to determine the similarity of sentence pairs. The similarity is computed using distance metrics, benchmarked against ground truth similarities. Spearman correlation based on cosine similarity is the main metric.

<mark style="color:green;">**Summarisation**</mark><mark style="color:green;">:</mark> Involves scoring machine-generated summaries against human-written ones. The closest score based on cosine similarity is used as the model’s score. Pearson and Spearman correlations with human assessments are the key metrics.

### <mark style="color:purple;">**Specific Insights on Model Categories**</mark>

<mark style="color:green;">**Self-supervised Methods**</mark>

* **Transformer-based**: BERT, when used with mean-pooling, directly produces text embeddings (see the mean-pooling sketch after this list). SimCSE-Unsup further enhances BERT with additional self-supervised training.
* **Non-transformer**: Models like Komninos and GloVe provide faster, context-unaware word embeddings.
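
A minimal sketch of the mean-pooling step, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["Text embeddings map sentences to fixed-size vectors."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, tokens, hidden)

# Average the token embeddings, ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```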

<mark style="color:green;">**Supervised Methods**</mark>

* **Transformer Encoder Methods**: Include models like coCondenser, Contriever, LaBSE, and SimCSE-BERT-sup, which are BERT-based with variations in training stages or data.
* **Transformer Decoder Methods**: SGPT Bi-Encoders demonstrate fine-tuning of a minimal fraction of GPT parameters, focusing on STS or retrieval tasks depending on the variant.
* **Non-transformer Context-Aware Model**: LASER uses an LSTM architecture and is trained on parallel data for bitext mining applications.

### <mark style="color:purple;">Analysis of Task-Specific Performance</mark>

* **Classification**: ST5 models excel in classification tasks, with ST5-XXL showing particularly high performance.
* **Clustering**: MPNet, despite being smaller, competes well with larger models like ST5-XXL. This suggests that fine-tuning on diverse datasets benefits clustering tasks.
* **Pair Classification**: GTR-XL and GTR-XXL lead in this area, but models rank differently on STS, underscoring the need for diverse task benchmarking.
* **Reranking**: MPNet and MiniLM models show strong performance, possibly due to training dataset overlaps.
* **Retrieval**: SGPT-5.8B-msmarco excels in retrieval, while retrieval-specialised models underperform in STS tasks, indicating a division in the field between retrieval-focused and similarity-focused models.

### <mark style="color:purple;">Multilingual Performance</mark>

* **Bitext Mining**: Dominated by LaBSE, with varying performance across languages.
* **Multilingual Classification and STS**: Mixed results, with SGPT-BLOOM-7B1-msmarco performing well in languages it has been pre-trained on.

### <mark style="color:purple;">Key Insights</mark>

<mark style="color:green;">**No Universal Best Model**</mark><mark style="color:green;">:</mark> The benchmark reveals no single model dominates across all tasks, highlighting the need for task-specific model selection.

<mark style="color:green;">**Trade-off Between Size and Performance**</mark><mark style="color:green;">:</mark> Larger models generally perform better but come with higher computational costs.

<mark style="color:green;">**Context-Aware vs. Word Embeddings**</mark><mark style="color:green;">:</mark> <mark style="color:yellow;">Context-aware transformer models generally outperform traditional word embeddings but require more computational resources.</mark>

<mark style="color:green;">**Task-Specific Fine-Tuning**</mark><mark style="color:green;">:</mark> The effectiveness of a model can significantly vary based on how it has been fine-tuned and the specific task it is applied to.

<mark style="color:green;">**Bifurcation of Retrieval and Similarity Tasks**</mark><mark style="color:green;">:</mark> A clear distinction is observed between models optimized for retrieval tasks and those for similarity tasks, indicating different underlying model requirements for these types of tasks.

<mark style="color:green;">**Multilingual Capabilities**</mark><mark style="color:green;">:</mark> Performance varies significantly across languages, reflecting the challenges in developing truly universal, multilingual embedding models.

The analysis of the Massive Text Embedding Benchmark (MTEB) provides an understanding of the current landscape of embedding tools and their effectiveness across various tasks.&#x20;

Leveraging these insights, we can explore creative ideas for new embedding tools that address specific challenges or unexplored areas in the field.&#x20;

Here are some innovative ideas for new embedding tools, followed by a summary of the current best tools for different tasks as indicated by the MTEB analysis:

### <mark style="color:purple;">Creative Ideas for New Embedding Tools</mark>

<mark style="color:green;">**Multimodal Embedding Generator**</mark><mark style="color:green;">:</mark> Develop an embedding tool that can <mark style="color:yellow;">process and integrate multiple types of data (text, audio, visual) to create rich, multimodal embeddings</mark>. This would be particularly useful for applications that require understanding content across different media formats, such as social media analysis or multimedia content categorization.

<mark style="color:green;">**Dynamic Temporal Embeddings**</mark><mark style="color:green;">:</mark> Create embeddings that evolve over time to capture the changing meanings or relevance of words and phrases. This tool could be especially useful in fields like trend analysis, where the significance and context of terms can shift rapidly.

<mark style="color:green;">**Cross-Cultural Embedding Tool**</mark><mark style="color:green;">:</mark> Develop a tool that focuses on cross-cultural nuances in language, capable of understanding idioms, slang, and culturally specific references. This would be invaluable for global sentiment analysis, marketing, and cultural studies.

<mark style="color:green;">**Interactive Embedding Visualiser**</mark><mark style="color:green;">:</mark> A tool that not only generates embeddings but also provides an interactive visualisation platform. Users could explore how different words or phrases relate to each other in the embedding space, which would be beneficial for educational purposes and to help researchers develop better embeddings.

<mark style="color:green;">**Domain-Specific Embedding Optimiser**</mark><mark style="color:green;">:</mark> Given the variability of model performance across tasks, a tool that optimises existing embeddings for specific domains (like legal, medical, or technical fields) would be highly beneficial. This tool could fine-tune general embeddings to make them more effective for specialized applications.

<mark style="color:green;">**Embedding Personalisation Engine**</mark><mark style="color:green;">:</mark> A tool that creates user-specific embeddings based on their interaction with content. This could be used for personalised recommendation systems or tailored content generation.

<mark style="color:green;">**Low-Resource Language Embedding Enhancer**</mark>: Focus on developing embeddings for languages that have limited digital resources available. This tool could use techniques like transfer learning from resource-rich languages to improve NLP capabilities in underrepresented languages.

### <mark style="color:purple;">Current Best Embedding Tools for Different Tasks (Based on MTEB)</mark>

<mark style="color:green;">**Classification**</mark><mark style="color:green;">:</mark> ST5 models (e.g., ST5-XXL) showed the highest average performance in classification tasks.

<mark style="color:green;">**Clustering**</mark><mark style="color:green;">:</mark> MPNet and MiniLM demonstrated strong performance in clustering tasks, competing well even with larger models.

<mark style="color:green;">**Pair Classification**</mark><mark style="color:green;">:</mark> GTR-XL and GTR-XXL were noted for their strong performance in pair classification tasks.

<mark style="color:green;">**Reranking**</mark><mark style="color:green;">:</mark> MPNet and MiniLM again performed strongly in reranking tasks, especially in specific datasets like SciDocsRR.

<mark style="color:green;">**Retrieval**</mark><mark style="color:green;">:</mark> SGPT-5.8B-msmarco excelled in retrieval tasks, particularly in the BEIR benchmark subset.

<mark style="color:green;">**STS (Semantic Textual Similarity)**</mark><mark style="color:green;">:</mark> For STS tasks, ST5-XXL had the highest performance, indicating its effectiveness in capturing semantic similarities.

<mark style="color:green;">**Bitext Mining and Multilingual Tasks**</mark><mark style="color:green;">:</mark> LaBSE dominated in bitext mining, while performance in multilingual classification and STS was mixed, with SGPT-BLOOM-7B1-msmarco performing well in certain languages.

