# Improving Text Embeddings with Large Language Models

This <mark style="color:blue;">**January 2024**</mark> paper introduces a <mark style="color:green;">**novel method to obtain high-quality text embeddings**</mark> *<mark style="color:yellow;">**using synthetic data.**</mark>*

Their approach seeks to simplify the process of obtaining high-quality text embeddings while also achieving strong performance on competitive benchmarks, surpassing previous methods by a significant margin.

The commercial implications of this research are significant.&#x20;

By providing a more efficient and effective method for obtaining text embeddings, this approach can help businesses improve the performance of their natural language processing (NLP) applications, such as search engines, recommendation systems, and customer support chatbots.&#x20;

Additionally, the ability to generate high-quality text embeddings for a wide range of tasks and languages can enable companies to expand their services to new markets and domains, potentially increasing their customer base and revenue streams.

### <mark style="color:purple;">What is an embedding model?</mark>

An embedding model is designed to <mark style="color:yellow;">convert text into numerical representations (vectors) in a way that</mark> *<mark style="color:yellow;">**captures the semantic meaning of the text**</mark>*.&#x20;

These vectors can then be used in various machine learning tasks to compare, categorise, or understand texts based on their semantic similarity.

An embedding model doesn't generate text or predictions directly.  Instead, it transforms text into a high-dimensional space where similar meanings are placed closer together.  This can be used in tasks like semantic search, clustering, or as part of a larger system for more complex tasks.

<mark style="color:green;">**Example**</mark>

For instance, if you input "apple" into an embedding model, it provides a vector that represents the concept of "apple."

If you input "fruit," you get a different vector, but it should be close to "apple" in the vector space because they are semantically related.
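This closeness can be made concrete with cosine similarity. Below is a minimal sketch using hand-made toy vectors — the numbers are purely illustrative, not real model outputs, and real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (illustrative only).
apple = np.array([0.9, 0.8, 0.1, 0.0])
fruit = np.array([0.8, 0.9, 0.2, 0.1])
car   = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine_similarity(apple, fruit))  # high: semantically related concepts
print(cosine_similarity(apple, car))    # low: unrelated concepts
```

Because "apple" and "fruit" point in roughly the same direction in the vector space, their cosine similarity is high, while "apple" and "car" score much lower.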

These embeddings capture the semantic essence of text in a <mark style="color:yellow;">**continuous, dense vector space**</mark> (low-dimensional compared with sparse bag-of-words representations), facilitating tasks like information retrieval, question answering, and more. &#x20;

The goal here is to refine these embeddings so they can better understand and represent the nuanced meanings of text.

### <mark style="color:purple;">What does this paper propose?</mark>

The method outlined in this paper deviates from traditional multi-stage training methods, which often rely on large volumes of weakly-supervised text pairs and manually curated datasets constrained by task diversity and language coverage.

{% embed url="https://arxiv.org/abs/2401.00368" %}
"Improving Text Embeddings with Large Language Models"
{% endembed %}

### <mark style="color:purple;">Key points and insights from the paper</mark>

<mark style="color:green;">**Use of Synthetic Data**</mark>

Proprietary LLMs are used to create diverse synthetic data, which is then used to fine-tune open-source decoder-only LLMs *<mark style="color:yellow;">**using standard contrastive loss**</mark>*.  This approach contrasts with existing methods that require multi-stage training and labelled data.

By generating synthetic data using LLMs, the model can learn from a vast range of artificially created text embedding scenarios, covering a wide spectrum of languages and tasks.

This approach not only addresses the limitations of task diversity and language coverage but also simplifies the training process by eliminating the need for complex, multi-stage training pipelines.

<mark style="color:green;">**Empirical Results**</mark>

The new method demonstrates strong performance on competitive text embedding benchmarks (BEIR and MTEB) *<mark style="color:yellow;">**without relying on labelled data**</mark>*.  When mixed with labelled data, the model sets new state-of-the-art results, showing a significant improvement.

<mark style="color:green;">**Efficiency and Performance**</mark>

The proposed method achieves competitive or even state-of-the-art performance on text embedding benchmarks with fewer than 1k training steps and without relying on labelled data.&#x20;

This indicates a significant advancement in training efficiency and effectiveness of text embeddings.

<mark style="color:green;">**Contrastive Loss**</mark>

The training involves using standard contrastive loss, *<mark style="color:yellow;">**a method that helps the model learn by contrasting positive examples (similar or related texts) against negative ones (unrelated texts)**</mark>*.&#x20;

This helps in refining the embeddings so that similar texts are closer in the embedding space, while dissimilar ones are further apart.

### <mark style="color:purple;">Creation of the training dataset</mark>

<mark style="color:green;">**Categorisation of Embedding Tasks**</mark>

The researchers categorise text embedding tasks into two main groups: *<mark style="color:yellow;">**asymmetric and symmetric tasks**</mark>*.&#x20;

This is crucial for tailoring the data generation process to the specific needs of different types of embedding tasks, ensuring that the synthetic data covers a wide range of potential scenarios.

### <mark style="color:purple;">**Asymmetric Tasks**</mark>

These involve <mark style="color:yellow;">semantically related queries and documents that are</mark> <mark style="color:yellow;"></mark>*<mark style="color:yellow;">**not**</mark>* <mark style="color:yellow;"></mark><mark style="color:yellow;">direct paraphrases</mark>.&#x20;

In the context mentioned, <mark style="color:green;">**"not paraphrases"**</mark> refers to the relationship between the query and the document in asymmetric tasks. &#x20;

When it's stated that the query and document are <mark style="color:yellow;">**semantically related**</mark> but are not paraphrases of each other, it means that while the query and the document share a thematic or conceptual connection, the wording, structure, or phrasing between the two is not identical or nearly identical.

Paraphrasing typically involves rewording a sentence or passage while retaining the original meaning. So, if a query and a document were paraphrases of each other, they would convey the same message but with different words or sentence structures.

However, in <mark style="color:blue;">**asymmetric tasks**</mark>, the *<mark style="color:yellow;">**objective is to capture a broader and more nuanced relationship**</mark>* where the document is relevant to the query but does not merely restate the query in different words. For example:

* Query: "How to prepare for a marathon?"
* Positive Document: "Marathon training requires consistent running, proper nutrition, and a well-planned schedule."

Here, the document provides relevant information in response to the query but doesn't paraphrase it.&#x20;

This kind of relationship is essential for tasks like information retrieval or question answering, where the goal is to find documents that provide valuable and pertinent information in response to a query, rather than just rephrasing the query itself.

These tasks are <mark style="color:yellow;">**further divided based on the length of queries and documents**</mark>, creating subcategories like <mark style="color:blue;">**short-long, long-short, short-short, and long-long matches.**</mark>&#x20;

#### <mark style="color:green;">Short-long match</mark>

A short query and a long document, which is a common scenario in commercial search engines.&#x20;

Example:&#x20;

Query: "Apple stock price",&#x20;

Document: "Apple Inc. (AAPL) is a multinational technology company... (a detailed financial report)"

#### <mark style="color:green;">Long-short match</mark>

A long query and a short document.&#x20;

Example: Query: "What are the health benefits of regular exercise for adults over 50?",&#x20;

Document: "Regular exercise can help improve cardiovascular health, maintain muscle mass, and reduce the risk of chronic diseases in older adults."

#### <mark style="color:green;">Short-short match</mark>

Both the query and document are short.&#x20;

Example: Query: "Best Italian restaurants",&#x20;

Document: "Top-rated Italian dining spots in the city, offering authentic cuisine and cozy ambiance."

#### <mark style="color:green;">Long-long match</mark>

Both the query and document are long.&#x20;

Example: Query: "A detailed comparison of the features and specifications of iPhone 13 and Samsung Galaxy S22",&#x20;

Document: "The iPhone 13 and Samsung Galaxy S22 are two of the most popular smartphones on the market. Let's take a closer look at their features and specifications... (a comprehensive comparison)"

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FsL36iTgCEQ57EYGsFNlO%2Fimage.png?alt=media&#x26;token=34ab0b0e-80e9-4cd6-9193-2256a3fe2713" alt="" width="491"><figcaption><p>Statistics of the Synthetic Data</p></figcaption></figure>

To create the synthetic dataset, the authors generated 500k examples with 150k unique instructions using GPT-3.5-Turbo and GPT-4. The total token consumption was about 180 million.&#x20;

### <mark style="color:purple;">**Symmetric Tasks**</mark>

These tasks involve queries and documents with similar semantic meanings but different <mark style="color:blue;">**surface forms.**</mark>&#x20;

The term "surface form" refers to the *<mark style="color:yellow;">**literal, explicit way in which words are presented or arranged in text**</mark>*, as opposed to their deeper, underlying semantic meaning. &#x20;

When discussing symmetric tasks in text embeddings, the mention of queries and documents having similar semantic meanings but different surface forms means that while the text pieces convey the same or very similar information or intent, the words and their arrangement (the surface form) differ.

For example, consider the two sentences:

1. "How can I increase the battery life of my phone?"
2. "What are some ways to extend my phone's battery duration?"

Both sentences ask essentially the same question but use different wording and structure—that is, they have different surface forms.  Yet, their semantic meaning or intent (inquiring about improving phone battery life) is the same.

In tasks like semantic textual similarity (STS) and bitext retrieval, the goal is often to identify and link texts that have similar meanings, regardless of their surface forms.&#x20;

This is crucial for many applications in natural language processing, such as machine translation, information retrieval, and question answering systems, where understanding that different phrases can convey the same meaning is vital for effective processing and response generation.

### <mark style="color:purple;">**Reasons for Structuring the Training Method**</mark>

* <mark style="color:green;">**Enhanced Diversity:**</mark> The detailed categorisation and two-step prompting strategy ensure a wide range of scenarios are covered, essential for training a model to handle diverse real-world tasks.
* <mark style="color:green;">**Quality Assurance:**</mark> The methodical approach to template design and data filtering ensures that only high-quality, relevant data is included in the training set.
* <mark style="color:green;">**Global Applicability:**</mark> By generating data in multiple languages, the model is trained to be effective across different linguistic contexts, broadening its usability.
* <mark style="color:green;">**Efficiency:**</mark> Despite the detailed and nuanced approach, the method is designed to be efficient, requiring less than 1k training steps, making it practical for real-world applications.

This structured training method is tailored to generate a rich, diverse, and high-quality dataset, which is necessary for training robust text embedding models capable of handling a wide array of tasks across different languages and contexts.

The training process in this paper involves fine-tuning a pretrained large language model (LLM) using a combination of synthetic data and labelled datasets. &#x20;

The authors employ a <mark style="color:blue;">**contrastive learning approach with the InfoNCE loss function**</mark> to optimise the model.

### <mark style="color:purple;">Here's a step-by-step breakdown of the training process</mark>

### <mark style="color:blue;">Data Preparation</mark>

The authors created several different prompt templates and used ChatGPT to generate a large synthetic dataset from them.

#### <mark style="color:green;">Example: Long-short matching subgroup prompt template</mark>

{% code overflow="wrap" %}

```json
You have been assigned a text classification task: Classify whether a given movie review is positive or negative.

Your mission is to write one text classification example for this task in JSON format. The JSON object must contain the following keys:
- "input_text": a string, the input text specified by the classification task.
- "label": a string, the correct label of the input text.
- "misleading_label": a string, an incorrect label that is related to the task.

Please adhere to the following guidelines:
- The "input_text" should be at least 100 words and diverse in expression.
- The "misleading_label" must be a valid label for the given task, but not as appropriate as the "label" for the "input_text".
- The values for all fields should be in English.
- Avoid including the values of the "label" and "misleading_label" fields in the "input_text", that would make the task too easy.
- The "input_text" is clear and requires high school level education to comprehend.

Your output must always be a JSON object only, do not explain yourself or output anything else. Be creative!
```

{% endcode %}

#### <mark style="color:green;">The resulting synthetic data generated by the LLM</mark>

{% code overflow="wrap" %}

```json
{
  "input_text": "The film is a masterful blend of stunning visuals, impeccable performances, and a thought-provoking narrative. The director's vision is evident in every frame, as the cinematography captures the essence of the characters' emotional journeys. The lead actors deliver nuanced and powerful performances, bringing depth and authenticity to their roles. The screenplay is intelligent and well-crafted, exploring complex themes with finesse. While the pacing may be slower than some viewers prefer, it allows for a rich and immersive experience. Overall, this film is a must-see for anyone who appreciates artful storytelling and the power of cinema.",
  "label": "positive",
  "misleading_label": "neutral"
}
```

{% endcode %}

#### <mark style="color:green;">Example: Monolingual STS</mark>&#x20;

{% code overflow="wrap" %}

```json
Write a sentence triple with varying semantic similarity scores in JSON format. The semantic similarity score ranges from 1 to 5, with 1 denotes least similar and 5 denotes most similar.

Please adhere to the following guidelines:
- The keys in JSON are "S1", "S2", and "S3", the values are all strings in English, do not add any other keys.
- There should be some word overlaps between all three sentences.
- The similarity score between S1 and S2 should be 4.5.
- The similarity score between S1 and S3 should be 2.5.
- The sentences require college level education to understand and should be diverse in terms of topic and length.

Your output must always be a JSON object only with three keys "S1", "S2" and "S3", do not explain yourself or output anything else. Be creative!
```

{% endcode %}

#### <mark style="color:green;">The resulting synthetic data generated by the LLM</mark>

{% code overflow="wrap" %}

```json
{
  "S1": "The economic implications of the proposed policy changes are far-reaching and complex.",
  "S2": "The suggested policy amendments have significant and multifaceted consequences for the economy.",
  "S3": "The new regulations will have some impact on businesses, but the exact effects are unclear."
}
```

{% endcode %}

These examples demonstrate how the LLM is given a specific set of instructions and guidelines to *<mark style="color:yellow;">**generate synthetic data in a structured JSON format**</mark>*.&#x20;

The generated data adhered to the provided requirements, such as word count, language, similarity scores, and educational level needed for comprehension.
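In practice, generations that violate the requested format need to be filtered out before training. Below is a minimal validation sketch for the text classification template above — the key names follow that prompt, while the word-count threshold and leakage check are illustrative assumptions, not the paper's exact filtering procedure:

```python
import json

# Keys required by the classification prompt template shown above.
REQUIRED_KEYS = {"input_text", "label", "misleading_label"}

def validate_example(raw: str, min_words: int = 100) -> bool:
    """Return True if a generated string is a usable classification example."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False  # model did not emit valid JSON
    if set(obj) != REQUIRED_KEYS:
        return False  # missing or extra keys
    if not all(isinstance(v, str) for v in obj.values()):
        return False  # all values must be strings
    # The prompt asks for at least `min_words` words in the input text.
    if len(obj["input_text"].split()) < min_words:
        return False
    # The prompt forbids leaking the label into the input text.
    if obj["label"].lower() in obj["input_text"].lower():
        return False
    return True
```

A check like this lets the pipeline silently drop malformed generations instead of contaminating the fine-tuning set.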

### <mark style="color:blue;">Query Instruction Template</mark>

The <mark style="color:blue;">**query instruction template**</mark> is used to <mark style="color:blue;">**modify the original query**</mark> $$q^+$$ to <mark style="color:blue;">**create a new query**</mark> $$q^+_{\text{inst}}$$ that *<mark style="color:yellow;">**includes a task definition**</mark>*.&#x20;

This is done to provide additional context to the model about the specific task it should perform when processing the query-document pair.

Here's a detailed explanation of the process:

For each <mark style="color:blue;">**relevant query-document pair**</mark> $$(q^+, d^+)$$, the <mark style="color:blue;">**original query**</mark> $$q^+$$ is modified using the <mark style="color:blue;">**instruction template**</mark>.

#### <mark style="color:green;">The instruction template consists of two parts</mark>

* **"Instruct:&#x20;**<mark style="color:purple;">**{task\_definition}**</mark>**"**: This part provides a one-sentence description of the embedding task. The <mark style="color:purple;">**"{task\_definition}"**</mark> placeholder is *<mark style="color:yellow;">**replaced with the actual task description**</mark>*.
* <mark style="color:purple;">**"Query: {q+}"**</mark>: This part includes the original query $$q^+$$.

The two parts are then <mark style="color:blue;">**concatenated with a newline character**</mark>**&#x20;**<mark style="color:purple;">**(\n)**</mark> to form the <mark style="color:blue;">**new query**</mark> $$q^+_{\text{inst}}$$.
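This concatenation amounts to a simple string template. A minimal sketch (the function name and the example task definition are illustrative, not taken from the paper's code):

```python
def build_instructed_query(task_definition: str, query: str) -> str:
    """Prepend a one-sentence task definition to the query, joined by a newline."""
    return f"Instruct: {task_definition}\nQuery: {query}"

# Hypothetical task definition for a retrieval-style task.
q_inst = build_instructed_query(
    "Given a web search query, retrieve relevant passages that answer the query.",
    "how to prepare for a marathon",
)
print(q_inst)
# Instruct: Given a web search query, retrieve relevant passages that answer the query.
# Query: how to prepare for a marathon
```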

For the synthetic data, the <mark style="color:blue;">**task definitions**</mark> are obtained from the outputs of the LLM.&#x20;

For other datasets like MS-MARCO, the <mark style="color:blue;">**task definitions**</mark> are manually crafted and applied to all queries in the dataset.

The <mark style="color:blue;">**document side**</mark> $$d^+$$ is *<mark style="color:yellow;">**not modified with any instruction prefix**</mark>*. &#x20;

This allows the document index to be prebuilt, and the task can be customised by changing only the query side.

By applying the <mark style="color:blue;">**query instruction template**</mark>, the model receives additional information about the specific task it should perform for each query-document pair.  This can help the model better understand the context and generate more relevant embeddings for the given task.

To recap: for each relevant <mark style="color:blue;">**query-document pair**</mark> $$(q^+, d^+)$$, the <mark style="color:blue;">**instruction template**</mark> above is applied to the <mark style="color:blue;">**original query**</mark> $$q^+$$ to generate a <mark style="color:blue;">**new query**</mark> $$q^+_{\text{inst}}$$.

Here, <mark style="color:purple;">**"{task\_definition}"**</mark> is a placeholder for a *<mark style="color:yellow;">one-sentence description of the embedding task</mark>*.&#x20;

For generated synthetic data, the outputs from the brainstorming step are used.  For other datasets, such as MS-MARCO, the task definitions are manually crafted and applied to all the queries in the dataset.

<details>

<summary><mark style="color:green;"><strong>Why the use of MARCO Dataset in the training dataset</strong></mark></summary>

The MS-MARCO dataset is used in addition to the synthetic data for a few reasons:

<mark style="color:purple;">Comparison with previous work:</mark> The authors mention that they report results when the only labeled supervision is the MS-MARCO passage ranking dataset to provide a fair comparison with some previous work. This suggests that using MS-MARCO allows them to benchmark their model against existing methods that have been evaluated on this dataset.

<mark style="color:purple;">Real-world data:</mark> While synthetic data is valuable for generating a large and diverse training set, it may not capture all the nuances and complexities of real-world data. Including a well-established dataset like MS-MARCO ensures that the model is exposed to actual user queries and relevant documents, which can help improve its performance on real-world tasks.

<mark style="color:purple;">Validation of the approach:</mark> By using both synthetic data and a public dataset like MS-MARCO, the authors can demonstrate that their approach is effective not only on artificially generated data but also on a widely-used benchmark dataset. This helps to validate the generalizability and robustness of their method.

</details>

#### <mark style="color:green;">Embedding Extraction</mark>

* Given a pretrained LLM (in this case, Mistral-7b), append an <mark style="color:blue;">**\[EOS] token**</mark> to the end of the query $$q^+_{\text{inst}}$$ and document $$d^+$$.
* Feed the <mark style="color:blue;">**modified query**</mark> and <mark style="color:blue;">**document**</mark> into the pretrained LLM.
* Obtain the <mark style="color:blue;">**query and document embeddings**</mark> $$(h_{q^+_{\text{inst}}}, h_{d^+})$$ by taking the last layer <mark style="color:blue;">**\[EOS]**</mark> vector.
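The \[EOS]-pooling step can be sketched as follows, using dummy hidden states in place of a real model's outputs. The sketch assumes right-padding, so each sequence's last non-padded position holds the appended \[EOS] token (the function name is illustrative):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the last-layer vector at each sequence's final non-padded
    position (the appended [EOS] token, assuming right-padding).

    hidden_states:  (batch, seq_len, dim) last-layer outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    last_positions = attention_mask.sum(axis=1) - 1   # index of [EOS] per sequence
    batch_indices = np.arange(hidden_states.shape[0])
    return hidden_states[batch_indices, last_positions]  # (batch, dim)

# Dummy batch: 2 sequences of length 4, embedding dim 3.
h = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],    # 3 real tokens -> pool position 2
                 [1, 1, 1, 1]])   # 4 real tokens -> pool position 3
emb = last_token_pool(h, mask)
print(emb.shape)  # (2, 3)
```

In the actual pipeline, `h` would be the last-layer hidden states returned by the Mistral-7b forward pass rather than dummy values.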

### <mark style="color:blue;">Fine-tuning Dataset</mark>

The fine-tuning dataset consists of *<mark style="color:yellow;">**both the generated synthetic data and a collection of 13 public datasets**</mark>*, yielding approximately <mark style="color:blue;">**1.8M examples**</mark> after sampling.&#x20;

At the end of the data generation process, the fine-tuning dataset would have the following structure:

<mark style="color:blue;">**Synthetic Data**</mark>

* <mark style="color:purple;">**Short-long matching:**</mark> 167k examples
* <mark style="color:purple;">**Long-short matching:**</mark> 122k examples
* <mark style="color:purple;">**Short-short matching:**</mark> 13k examples
* <mark style="color:purple;">**Long-long matching:**</mark> 17k examples
* <mark style="color:purple;">**Bitext retrieval:**</mark> 89k examples
* <mark style="color:purple;">**Monolingual STS**</mark><mark style="color:blue;">**:**</mark> 99k examples
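Summing the synthetic subsets above confirms the roughly 500k total examples reported earlier:

```python
# Example counts per synthetic task category, as listed above (in thousands).
synthetic_counts = {
    "short-long matching": 167_000,
    "long-short matching": 122_000,
    "short-short matching": 13_000,
    "long-long matching": 17_000,
    "bitext retrieval": 89_000,
    "monolingual STS": 99_000,
}
print(sum(synthetic_counts.values()))  # 507000, i.e. ~500k synthetic examples
```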

#### <mark style="color:blue;">Public Datasets</mark>

* 13 datasets (e.g., MS-MARCO, NQ, SQuAD, TriviaQA, etc.)

The final fine-tuning dataset is a combination of the synthetic data and the public datasets, totalling approximately 1.8M examples after sampling.&#x20;

This diverse dataset, covering various task types and languages, is used to train the Mistral-7b model using the <mark style="color:blue;">**InfoNCE loss function.**</mark>

### <mark style="color:blue;">Training Objective</mark>

The authors used the <mark style="color:blue;">**InfoNCE loss function**</mark> to train the embedding model:

{% embed url="https://paperswithcode.com/method/infonce" %}
An explanation of this loss function
{% endembed %}

$$
\min \; \mathcal{L} = -\log \frac{\phi(q^+_{\text{inst}}, d^+)}{\phi(q^+_{\text{inst}}, d^+) + \sum_{n_i \in \mathbb{N}} \phi(q^+_{\text{inst}}, n_i)}
$$

* $$\mathbb{N}$$ denotes the <mark style="color:blue;">**set of all negatives**</mark> (i.e., non-relevant documents).
* $$\phi(q, d)$$ is a function that computes the matching score between query $$q$$ and document $$d$$, using the <mark style="color:blue;">**temperature-scaled cosine similarity**</mark>:

$$
\phi(q, d) = \exp\left(\frac{1}{\tau} \cos(h_q, h_d)\right)
$$

* $$\tau$$ is a <mark style="color:blue;">**temperature hyperparameter**</mark>, fixed to 0.02 in the experiments.
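The loss can be sketched numerically as follows, with toy embeddings standing in for real model outputs and $$\tau = 0.02$$ as in the paper:

```python
import numpy as np

def infonce_loss(q: np.ndarray, d_pos: np.ndarray,
                 d_negs: np.ndarray, tau: float = 0.02) -> float:
    """InfoNCE loss with temperature-scaled cosine similarity.

    q:      (dim,) query embedding
    d_pos:  (dim,) positive (relevant) document embedding
    d_negs: (n, dim) negative (non-relevant) document embeddings
    """
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos_score = np.exp(cos(q, d_pos) / tau)
    neg_scores = np.exp(np.array([cos(q, n) for n in d_negs]) / tau)
    return float(-np.log(pos_score / (pos_score + neg_scores.sum())))

# Toy embeddings (illustrative only).
q = np.array([1.0, 0.2, 0.0])
d_pos = np.array([0.9, 0.3, 0.1])       # close to the query
d_negs = np.array([[0.0, 1.0, 0.9],     # unrelated
                   [0.1, 0.8, 1.0]])
print(infonce_loss(q, d_pos, d_negs))   # near zero: the positive dominates
```

Minimising this loss pushes the query embedding towards its relevant document and away from the negatives, which is exactly the geometry the contrastive training aims for.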

### <mark style="color:purple;">What is the point of this research?</mark>

The purpose of this research is to develop a simpler and more effective method for obtaining high-quality text embeddings, which are vector representations of natural language that encode semantic information.&#x20;

Text embeddings are widely used in various commercial applications, such as:

<mark style="color:green;">**Information retrieval (IR):**</mark> Text embeddings enable efficient retrieval of relevant documents from large-scale corpora, which is essential for search engines, content recommendation systems, and knowledge management platforms.

<mark style="color:green;">**Question answering:**</mark> Text embeddings can be used to find the most relevant passages or documents to answer a given question, improving the accuracy and efficiency of question-answering systems.

<mark style="color:green;">**Semantic textual similarity:**</mark> Text embeddings help determine the semantic similarity between two pieces of text, which is crucial for tasks like duplicate detection, plagiarism checking, and data deduplication.

<mark style="color:green;">**Bitext mining:**</mark> Text embeddings can be employed to identify parallel sentences across different languages, facilitating the creation of bilingual corpora for machine translation and cross-lingual information retrieval.

<mark style="color:green;">**Item recommendation:**</mark> Text embeddings can be used to represent user preferences and item descriptions, enabling personalized recommendations in e-commerce and content platforms.

The main motivation behind this research is to overcome the limitations of existing methods for obtaining text embeddings.&#x20;

Current methods often rely on complex multi-stage training pipelines, requiring substantial engineering efforts to curate large amounts of relevance pairs.&#x20;

Also, they depend on manually collected datasets that are often constrained by the diversity of tasks and the coverage of languages.

By leveraging large language models (LLMs) to generate diverse synthetic data and fine-tuning open-source decoder-only LLMs using contrastive loss, <mark style="color:yellow;">the proposed method eliminates the need for complex training pipelines and manually collected datasets.</mark>&#x20;

This approach not only simplifies the process of obtaining high-quality text embeddings but also achieves strong performance on competitive benchmarks, surpassing previous methods by a significant margin.


### <mark style="color:purple;">References</mark>

#### <mark style="color:green;">Sentence and Text Embeddings</mark>

* Sanjeev Arora et al. propose a baseline for sentence embeddings, ICLR 2017.
* Tianyu Gao et al. introduce SimCSE, a simple contrastive learning method, EMNLP 2021.
* Xianming Li and Jing Li explore angle-optimized text embeddings, ArXiv 2023.
* Zehan Li et al. work on general text embeddings with multi-stage contrastive learning, ArXiv 2023.
* Tomas Mikolov et al. present Efficient Estimation of Word Representations in Vector Space, ICLR 2013.
* Niklas Muennighoff discusses GPT sentence embeddings for semantic search, ArXiv 2022.
* Niklas Muennighoff et al. introduce the Massive Text Embedding Benchmark, EACL 2023.
* Jianmo Ni et al. present Sentence-T5, ACL 2022.
* Jeffrey Pennington et al. describe GloVe: Global Vectors for Word Representation, EMNLP 2014.
* Nils Reimers and Iryna Gurevych introduce Sentence-BERT, EMNLP-IJCNLP 2019.

#### <mark style="color:green;">Information Retrieval and Question Answering</mark>

* Luiz Henrique Bonifacio et al. discuss unsupervised dataset generation, SIGIR 2022.
* Daniel Fernando Campos et al. describe MS MARCO, a machine reading comprehension dataset, ArXiv 2016.
* Zhuyun Dai et al. present Promptagator for few-shot dense retrieval, ICLR 2022.
* Vladimir Karpukhin et al. discuss Dense Passage Retrieval, EMNLP 2020.
* Patrick S. H. Lewis et al. explore Retrieval-Augmented Generation, NeurIPS 2020.
* Yifu Qiu et al. introduce DuReader-retrieval for passage retrieval, EMNLP 2022.
* Liang Wang et al. discuss Query2doc using large language models, EMNLP 2023.
* Zhilin Yang et al. introduce HotpotQA, a multi-hop question answering dataset, EMNLP 2018.

#### <mark style="color:green;">Natural Language Processing (NLP) and AI Models</mark>

* Samuel R. Bowman et al. develop a large annotated corpus for natural language inference, EMNLP 2015.
* Tom B. Brown et al. discuss language models as few-shot learners, NeurIPS 2020.
* Alexis Conneau et al. study supervised learning of universal sentence representations, EMNLP 2017.
* Jacob Devlin et al. introduce BERT for language understanding, NAACL-HLT 2019.
* Angela Fan et al. present ELI5 for long-form question answering, ACL 2019.
* Tianyu Gao et al. discuss enabling LLMs to generate text with citations, ArXiv 2023.
* Edward J. Hu et al. describe Lora, a low-rank adaptation of LLMs, ICLR 2022.
* OpenAI details GPT-4 in their technical report, ArXiv 2023.

#### <mark style="color:green;">Diverse Topics in NLP</mark>

* Alexis Conneau et al. explore unsupervised cross-lingual representation learning, ACL 2020.
* Gautier Izacard et al. discuss unsupervised dense information retrieval, ArXiv 2021.
* Albert Q Jiang et al. introduce Mistral 7b, ArXiv 2023.
* Baptiste Rozière et al. discuss Code llama for open foundation models, ArXiv 2023.
* Hongjin Su et al. detail instruction-fine-tuned text embeddings, ACL 2023.
* Kexin Wang et al. discuss Generative Pseudo Labeling for domain adaptation, NAACL-HLT 2022.
* Liang Wang et al. explore text embeddings by weakly-supervised contrastive pre-training, ArXiv 2022.
* Shitao Xiao et al. introduce C-pack for advancing general Chinese embedding, ArXiv 2023.
* Xiaohui Xie et al. develop T2ranking, a Chinese benchmark for passage ranking, ArXiv 2023.
* Xinyu Zhang et al. present Mr. TyDi for multilingual dense retrieval, ACL Workshop 2021.
* Xinyu Crystina Zhang et al. discuss Miracl, a multilingual retrieval dataset, Transactions of the ACL 2023.
