Improving Text Embeddings with Large Language Models
Liang Wang and the Microsoft team
Copyright Continuum Labs - 2023
This January 2024 paper introduces a novel method to obtain high-quality text embeddings using synthetic data.
Their approach seeks to simplify the process of obtaining high-quality text embeddings while also achieving strong performance on competitive benchmarks, surpassing previous methods by a significant margin.
The commercial implications of this research are significant.
By providing a more efficient and effective method for obtaining text embeddings, this approach can help businesses improve the performance of their natural language processing (NLP) applications, such as search engines, recommendation systems, and customer support chatbots.
Additionally, the ability to generate high-quality text embeddings for a wide range of tasks and languages can enable companies to expand their services to new markets and domains, potentially increasing their customer base and revenue streams.
An embedding model is designed to convert text into numerical representations (vectors) in a way that captures the semantic meaning of the text.
These vectors can then be used in various machine learning tasks to compare, categorise, or understand texts based on their semantic similarity.
An embedding model doesn't generate text or predictions directly. Instead, it transforms text into a high-dimensional space where similar meanings are placed closer together. This can be used in tasks like semantic search, clustering, or as part of a larger system for more complex tasks.
Example
For instance, if you input "apple" into an embedding model, it provides a vector that represents the concept of "apple." If you input "fruit," you get a different vector, but it should be close to "apple" in the vector space because they are semantically related.
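The idea of "closeness" in the vector space can be made concrete with cosine similarity. The sketch below uses hand-written 4-dimensional vectors that are purely hypothetical (a real embedding model would produce them, and they would have far more dimensions); it only illustrates how related concepts score higher than unrelated ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, written by hand for illustration only.
toy_embeddings = {
    "apple": [0.9, 0.8, 0.1, 0.0],
    "fruit": [0.8, 0.9, 0.2, 0.1],
    "carburetor": [0.0, 0.1, 0.9, 0.8],
}

sim_apple_fruit = cosine_similarity(toy_embeddings["apple"], toy_embeddings["fruit"])
sim_apple_carb = cosine_similarity(toy_embeddings["apple"], toy_embeddings["carburetor"])

print(f"apple vs fruit:      {sim_apple_fruit:.3f}")
print(f"apple vs carburetor: {sim_apple_carb:.3f}")
```

With these toy values, "apple" scores much closer to "fruit" than to "carburetor", which is the behaviour a trained embedding model is expected to show for semantically related text.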
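These embeddings capture the semantic essence of text in a continuous vector space that is dense and far lower-dimensional than sparse bag-of-words representations (though it still typically has hundreds or thousands of dimensions), facilitating tasks like information retrieval, question answering, and more.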
The goal here is to refine these embeddings so they can better understand and represent the nuanced meanings of text.
The method outlined in this paper deviates from traditional multi-stage training methods, which often rely on large volumes of weakly-supervised text pairs and manually curated datasets that are constrained in task diversity and language coverage.
Use of Synthetic Data
Proprietary LLMs are used to create diverse synthetic data, which is then used to fine-tune open-source decoder-only LLMs using standard contrastive loss. This approach contrasts with existing methods that require multi-stage training and labelled data.
By generating synthetic data using LLMs, the model can learn from a vast range of artificially created text embedding scenarios, covering a wide spectrum of languages and tasks.
This approach not only addresses the limitations of task diversity and language coverage but also simplifies the training process by eliminating the need for complex, multi-stage training pipelines.
Empirical Results
The new method demonstrates strong performance on competitive text embedding benchmarks (BEIR and MTEB) without relying on labelled data. When mixed with labelled data, the model sets new state-of-the-art results, showing a significant improvement.
Efficiency and Performance
The proposed method achieves competitive or even state-of-the-art performance on text embedding benchmarks with fewer than 1k training steps and without relying on labelled data.
This indicates a significant advancement in training efficiency and effectiveness of text embeddings.
Contrastive Loss
The training involves using standard contrastive loss, a method that helps the model learn by contrasting positive examples (similar or related texts) against negative ones (unrelated texts).
This helps in refining the embeddings so that similar texts are closer in the embedding space, while dissimilar ones are further apart.
Categorisation of Embedding Tasks
The researchers categorise text embedding tasks into two main groups: asymmetric and symmetric tasks.
This is crucial for tailoring the data generation process to the specific needs of different types of embedding tasks, ensuring that the synthetic data covers a wide range of potential scenarios.
Asymmetric tasks involve semantically related queries and documents that are not direct paraphrases.
Here, "not paraphrases" describes the relationship between the query and the document: the two share a thematic or conceptual connection, but the document does not simply restate the query in different words.
Paraphrasing typically involves rewording a sentence or passage while retaining the original meaning. So, if a query and a document were paraphrases of each other, they would convey the same message but with different words or sentence structures.
However, in asymmetric tasks, the objective is to capture a broader and more nuanced relationship where the document is relevant to the query but does not merely restate the query in different words. For example:
Query: "How to prepare for a marathon?"
Positive Document: "Marathon training requires consistent running, proper nutrition, and a well-planned schedule."
Here, the document provides relevant information in response to the query but doesn't paraphrase it.
This kind of relationship is essential for tasks like information retrieval or question answering, where the goal is to find documents that provide valuable and pertinent information in response to a query, rather than just rephrasing the query itself.
Asymmetric tasks are further divided based on the length of queries and documents, creating four subcategories: short-long, long-short, short-short, and long-long matches.
Short-long match: A short query and a long document, which is a common scenario in commercial search engines.
Example:
Query: "Apple stock price"
Document: "Apple Inc. (AAPL) is a multinational technology company... (a detailed financial report)"
Long-short match: A long query and a short document.
Example:
Query: "What are the health benefits of regular exercise for adults over 50?"
Document: "Regular exercise can help improve cardiovascular health, maintain muscle mass, and reduce the risk of chronic diseases in older adults."
Short-short match: Both the query and document are short.
Example:
Query: "Best Italian restaurants"
Document: "Top-rated Italian dining spots in the city, offering authentic cuisine and cozy ambiance."
Long-long match: Both the query and document are long.
Example:
Query: "A detailed comparison of the features and specifications of iPhone 13 and Samsung Galaxy S22"
Document: "The iPhone 13 and Samsung Galaxy S22 are two of the most popular smartphones on the market. Let's take a closer look at their features and specifications... (a comprehensive comparison)"
To create the synthetic dataset, the authors generated 500k examples with 150k unique instructions using GPT-35-Turbo and GPT-4. The total token consumption was about 180 million.
Symmetric tasks involve queries and documents with similar semantic meanings but different surface forms.
The term "surface form" refers to the literal, explicit way in which words are presented or arranged in text, as opposed to their deeper, underlying semantic meaning.
When discussing symmetric tasks in text embeddings, the mention of queries and documents having similar semantic meanings but different surface forms means that while the text pieces convey the same or very similar information or intent, the words and their arrangement (the surface form) differ.
For example, consider the two sentences:
"How can I increase the battery life of my phone?"
"What are some ways to extend my phone's battery duration?"
Both sentences ask essentially the same question but use different wording and structure—that is, they have different surface forms. Yet, their semantic meaning or intent (inquiring about improving phone battery life) is the same.
In tasks like semantic textual similarity (STS) and bitext retrieval, the goal is often to identify and link texts that have similar meanings, regardless of their surface forms.
This is crucial for many applications in natural language processing, such as machine translation, information retrieval, and question answering systems, where understanding that different phrases can convey the same meaning is vital for effective processing and response generation.
Enhanced Diversity: The detailed categorisation and two-step prompting strategy ensure a wide range of scenarios are covered, essential for training a model to handle diverse real-world tasks.
Quality Assurance: The methodical approach to template design and data filtering ensures that only high-quality, relevant data is included in the training set.
Global Applicability: By generating data in multiple languages, the model is trained to be effective across different linguistic contexts, broadening its usability.
Efficiency: Despite the detailed and nuanced approach, the method is designed to be efficient, requiring fewer than 1k training steps, making it practical for real-world applications.
This structured data generation method is tailored to produce a rich, diverse, and high-quality dataset, which is necessary for training robust text embedding models capable of handling a wide array of tasks across different languages and contexts.
The training process in this paper involves fine-tuning a pretrained large language model (LLM) using a combination of synthetic data and labeled datasets.
The authors employ a contrastive learning approach with the InfoNCE loss function to optimise the model.
The authors created several prompt templates that instruct GPT-35-Turbo and GPT-4 to generate a large synthetic dataset.
In each template, the LLM is given a specific set of instructions and guidelines and is asked to return the synthetic data in a structured JSON format.
The generated data adheres to the requirements stated in the prompt, such as word count, language, similarity scores, and the educational level needed for comprehension.
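Because the LLM is asked to return JSON, each response can be parsed and checked before it enters the training set. The sketch below is a minimal illustration of that kind of filtering, not the authors' pipeline: the key names (`user_query`, `positive_document`, `hard_negative_document`), the helper `parse_generated_example`, and the sample records are all assumptions made for this example.

```python
import json

# Assumed key names, chosen for illustration; the article does not list the exact schema.
REQUIRED_KEYS = ("user_query", "positive_document", "hard_negative_document")

def parse_generated_example(raw_text):
    """Return the parsed record, or None if the LLM output is unusable."""
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(record, dict):
        return None  # valid JSON, but not an object
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return None  # missing or empty field
    return record

# Invented sample outputs: one well-formed, one incomplete, one not JSON.
raw_outputs = [
    json.dumps({
        "user_query": "How to prepare for a marathon?",
        "positive_document": "Marathon training requires consistent running, "
                             "proper nutrition, and a well-planned schedule.",
        "hard_negative_document": "Marathon events are held in many major cities each year.",
    }),
    '{"user_query": "Apple stock price"}',
    "Sure! Here is the JSON you asked for...",
]

kept = [r for r in (parse_generated_example(t) for t in raw_outputs) if r is not None]
print(f"kept {len(kept)} of {len(raw_outputs)} generated examples")
```

Only the first record survives here; the other two are dropped because they are incomplete or not parseable.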
The query instruction template is used to modify the original query to create a new query that includes a task definition.
This is done to provide additional context to the model about the specific task it should perform when processing the query-document pair.
Here's a detailed explanation of the process:
For each relevant query-document pair (q+, d+), the original query is modified using the instruction template.
"Instruct: {task_definition}": This part provides a one-sentence description of the embedding task. The "{task_definition}" placeholder is replaced with the actual task description.
"Query: {q+}": This part includes the original query (q+).
The two parts are then concatenated with a newline character (\n) to form the new query (q+_inst).
For the synthetic data, the task definitions are obtained from the outputs of the LLM.
For other datasets like MS-MARCO, the task definitions are manually crafted and applied to all queries in the dataset.
The document side is not modified with any instruction prefix.
This allows the document index to be prebuilt, and the task can be customised by changing only the query side.
By applying the query instruction template, the model receives additional information about the specific task it should perform for each query-document pair. This can help the model better understand the context and generate more relevant embeddings for the given task.
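The template described above amounts to a small string-formatting step applied to queries only. The following sketch shows it under stated assumptions: the function name `build_instructed_query` and the example task definition are illustrative, not taken from the authors' code.

```python
def build_instructed_query(task_definition: str, query: str) -> str:
    """Prefix a query with a one-sentence task definition.

    The two parts are joined with a newline, following the
    "Instruct: {task_definition}" / "Query: {q+}" template.
    """
    return f"Instruct: {task_definition}\nQuery: {query}"

# Illustrative task definition (an assumption for this sketch).
task_definition = "Given a web search query, retrieve relevant passages that answer the query"
query = "How to prepare for a marathon?"
document = ("Marathon training requires consistent running, proper nutrition, "
            "and a well-planned schedule.")

instructed_query = build_instructed_query(task_definition, query)

print(instructed_query)
# The document is passed to the encoder unchanged, with no instruction prefix.
print(document)
```

Because only the query string changes, the same set of document embeddings can be reused when the task definition is swapped for a different one.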
For each relevant query-document pair (q+, d+), apply the following instruction template to the original query q+ to generate a new query q+_inst:
q+_inst = Instruct: {task_definition} \n Query: {q+}
Here, "{task_definition}" is a placeholder for a one-sentence description of the embedding task.
For generated synthetic data, the outputs from the brainstorming step are used. For other datasets, such as MS-MARCO, the task definitions are manually crafted and applied to all the queries in the dataset.
Given a pretrained LLM (in this case, Mistral-7b), append an [EOS] token to the end of both the instructed query and the document.
Feed the modified query and document into the pretrained LLM.
Obtain the query and document embeddings by taking the last-layer [EOS] vector, that is, the final-layer hidden state at the position of the [EOS] token.
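Taking the last-layer [EOS] vector is a form of last-token pooling. The sketch below illustrates the selection step on plain Python lists rather than real model outputs; the function name `last_token_pool`, the toy hidden states, and the use of an attention mask to locate the final real token are assumptions for illustration.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the last-layer vector at the final real (non-padding) token.

    hidden_states: per-token vectors from the last layer, for one sequence.
    attention_mask: 1 for real tokens, 0 for padding, same length.
    Assumes [EOS] was appended to the text, so it is the final real token.
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have equal length")
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        raise ValueError("sequence contains no real tokens")
    return hidden_states[real_positions[-1]]

# Toy values: three real tokens (the third stands in for [EOS]) plus two padding slots.
toy_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0], [0.0, 0.0]]
toy_mask = [1, 1, 1, 0, 0]

embedding = last_token_pool(toy_hidden, toy_mask)
print(embedding)
```

In a decoder-only model with causal attention, the final token is the only position that can attend to the entire input, which is one plausible reason for pooling there rather than averaging over all tokens.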
The fine-tuning dataset consists of both the generated synthetic data and a collection of 13 public datasets, yielding approximately 1.8M examples after sampling.
At the end of the data generation process, the fine-tuning dataset has the following structure:
Synthetic Data
Short-long matching: 167k examples
Long-short matching: 122k examples
Short-short matching: 13k examples
Long-long matching: 17k examples
Bitext retrieval: 89k examples
Monolingual STS: 99k examples
Public Datasets
13 datasets (e.g., MS-MARCO, NQ, SQuAD, TriviaQA)
The final fine-tuning dataset is a combination of the synthetic data and the public datasets, totalling approximately 1.8M examples after sampling.
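As a quick sanity check on the figures above, the per-category synthetic counts can be summed and compared with the roughly 500k generated examples reported earlier. The snippet is plain arithmetic on the numbers listed in this article; the variable names are arbitrary.

```python
# Synthetic example counts per category, in thousands, as listed above.
synthetic_counts_k = {
    "short-long": 167,
    "long-short": 122,
    "short-short": 13,
    "long-long": 17,
    "bitext retrieval": 89,
    "monolingual STS": 99,
}

# The first four categories are asymmetric tasks; the last two are symmetric.
asymmetric_k = sum(synthetic_counts_k[name] for name in
                   ("short-long", "long-short", "short-short", "long-long"))
symmetric_k = sum(synthetic_counts_k[name] for name in
                  ("bitext retrieval", "monolingual STS"))
synthetic_total_k = sum(synthetic_counts_k.values())

for name, count in synthetic_counts_k.items():
    print(f"{name}: {count}k ({count / synthetic_total_k:.1%} of synthetic data)")
print(f"asymmetric: {asymmetric_k}k, symmetric: {symmetric_k}k, total: {synthetic_total_k}k")
```

The categories sum to 507k, in line with the approximately 500k synthetic examples, with asymmetric tasks (319k) making up the larger share.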
This diverse dataset, covering various task types and languages, is used to train the Mistral-7b model using the InfoNCE loss function.
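The authors used the InfoNCE loss function to train the embedding model:
$$\mathcal{L} = -\log \frac{\phi(q^+_{\text{inst}}, d^+)}{\phi(q^+_{\text{inst}}, d^+) + \sum_{n_i \in \mathbb{N}} \phi(q^+_{\text{inst}}, n_i)}$$
$\mathbb{N}$ denotes the set of all negatives (i.e., non-relevant documents).
$\phi(q, d)$ is a function that computes the matching score between query $q$ and document $d$, using the temperature-scaled cosine similarity:
$$\phi(q, d) = \exp\left(\frac{1}{\tau} \cos(\mathbf{h}_q, \mathbf{h}_d)\right)$$
Here $\mathbf{h}_q$ and $\mathbf{h}_d$ are the query and document embeddings, and $\tau$ is a temperature hyperparameter, fixed to 0.02 in the experiments.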
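To make the loss concrete, the sketch below computes InfoNCE for a single query with one positive and a few negatives, using the exponential of cosine similarity divided by the temperature as the matching score. It is a minimal pure-Python illustration with made-up 2-dimensional embeddings, not the authors' training code; practical implementations usually work on batches and use a log-sum-exp formulation for numerical stability.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def matching_score(h_q, h_d, tau=0.02):
    """Temperature-scaled cosine similarity, exponentiated."""
    return math.exp(cosine(h_q, h_d) / tau)

def info_nce_loss(h_q, h_pos, h_negs, tau=0.02):
    """InfoNCE for one query: -log(pos / (pos + sum of negatives))."""
    pos = matching_score(h_q, h_pos, tau)
    neg = sum(matching_score(h_q, h_n, tau) for h_n in h_negs)
    return -math.log(pos / (pos + neg))

# Made-up embeddings for illustration only.
h_query = [1.0, 0.0]
close_doc = [0.9, 0.1]
near_miss = [0.9, 0.2]
far_doc = [0.0, 1.0]

# Positive is the closest document: low loss.
loss_aligned = info_nce_loss(h_query, close_doc, [near_miss, far_doc])
# Positive is the farthest document: high loss.
loss_misaligned = info_nce_loss(h_query, far_doc, [near_miss, close_doc])

print(f"loss when the positive is closest:  {loss_aligned:.4f}")
print(f"loss when the positive is farthest: {loss_misaligned:.4f}")
```

The small temperature (0.02) sharpens the distribution, so even slight differences in cosine similarity between the positive and a hard negative have a noticeable effect on the loss.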
The purpose of this research is to develop a simpler and more effective method for obtaining high-quality text embeddings, which are vector representations of natural language that encode semantic information.
Text embeddings are widely used in various commercial applications, such as:
Information retrieval (IR): Text embeddings enable efficient retrieval of relevant documents from large-scale corpora, which is essential for search engines, content recommendation systems, and knowledge management platforms.
Question answering: Text embeddings can be used to find the most relevant passages or documents to answer a given question, improving the accuracy and efficiency of question-answering systems.
Semantic textual similarity: Text embeddings help determine the semantic similarity between two pieces of text, which is crucial for tasks like duplicate detection, plagiarism checking, and data deduplication.
Bitext mining: Text embeddings can be employed to identify parallel sentences across different languages, facilitating the creation of bilingual corpora for machine translation and cross-lingual information retrieval.
Item recommendation: Text embeddings can be used to represent user preferences and item descriptions, enabling personalized recommendations in e-commerce and content platforms.
The main motivation behind this research is to overcome the limitations of existing methods for obtaining text embeddings.
Current methods often rely on complex multi-stage training pipelines, requiring substantial engineering efforts to curate large amounts of relevance pairs.
Also, they depend on manually collected datasets that are often constrained by the diversity of tasks and the coverage of languages.
By leveraging large language models (LLMs) to generate diverse synthetic data and fine-tuning open-source decoder-only LLMs using contrastive loss, the proposed method eliminates the need for complex training pipelines and manually collected datasets.
This approach not only simplifies the process of obtaining high-quality text embeddings but also achieves strong performance on competitive benchmarks, surpassing previous methods by a significant margin.