# Improving Text Embeddings with Large Language Models

This <mark style="color:blue;">**January 2024**</mark> paper introduces a <mark style="color:green;">**novel method to obtain high-quality text embeddings**</mark> *<mark style="color:yellow;">**using synthetic data.**</mark>*

Their approach seeks to simplify the process of obtaining high-quality text embeddings while also achieving strong performance on competitive benchmarks, surpassing previous methods by a significant margin.

The commercial implications of this research are significant.&#x20;

By providing a more efficient and effective method for obtaining text embeddings, this approach can help businesses improve the performance of their natural language processing (NLP) applications, such as search engines, recommendation systems, and customer support chatbots.&#x20;

Additionally, the ability to generate high-quality text embeddings for a wide range of tasks and languages can enable companies to expand their services to new markets and domains, potentially increasing their customer base and revenue streams.

### <mark style="color:purple;">What is an embedding model?</mark>

An embedding model is designed to <mark style="color:yellow;">convert text into numerical representations (vectors) in a way that</mark> *<mark style="color:yellow;">**captures the semantic meaning of the text**</mark>*.&#x20;

These vectors can then be used in various machine learning tasks to compare, categorise, or understand texts based on their semantic similarity.

An embedding model doesn't generate text or predictions directly.  Instead, it transforms text into a high-dimensional space where similar meanings are placed closer together.  This can be used in tasks like semantic search, clustering, or as part of a larger system for more complex tasks.

<mark style="color:green;">**Example**</mark>

For instance, if you input "apple" into an embedding model, it provides a vector that represents the concept of "apple."

If you input "fruit," you get a different vector, but it should be close to "apple" in the vector space because they are semantically related.
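This closeness can be made concrete with cosine similarity. Below is a minimal sketch using hand-made toy vectors — the numbers are purely illustrative, not real model outputs, and real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (illustrative only).
apple = np.array([0.9, 0.8, 0.1, 0.0])
fruit = np.array([0.8, 0.9, 0.2, 0.1])
car   = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine_similarity(apple, fruit))  # high: semantically related concepts
print(cosine_similarity(apple, car))    # low: unrelated concepts
```

Because "apple" and "fruit" point in roughly the same direction in the vector space, their cosine similarity is high, while "apple" and "car" score much lower.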

These embeddings capture the semantic essence of text in a <mark style="color:yellow;">**continuous, dense vector space**</mark> (low-dimensional compared with sparse bag-of-words representations), facilitating tasks like information retrieval, question answering, and more. &#x20;

The goal here is to refine these embeddings so they can better understand and represent the nuanced meanings of text.

### <mark style="color:purple;">What does this paper propose?</mark>

The method outlined in this paper deviates from traditional multi-stage training methods, which often rely on large volumes of weakly-supervised text pairs and manually curated datasets constrained by task diversity and language coverage.

{% embed url="https://arxiv.org/abs/2401.00368" %}
"Improving Text Embeddings with Large Language Models"
{% endembed %}

### <mark style="color:purple;">Key points and insights from the paper</mark>

<mark style="color:green;">**Use of Synthetic Data**</mark>

Proprietary LLMs are used to create diverse synthetic data, which is then used to fine-tune open-source decoder-only LLMs *<mark style="color:yellow;">**using standard contrastive loss**</mark>*.  This approach contrasts with existing methods that require multi-stage training and labelled data.

By generating synthetic data using LLMs, the model can learn from a vast range of artificially created text embedding scenarios, covering a wide spectrum of languages and tasks.

This approach not only addresses the limitations of task diversity and language coverage but also simplifies the training process by eliminating the need for complex, multi-stage training pipelines.

<mark style="color:green;">**Empirical Results**</mark>

The new method demonstrates strong performance on competitive text embedding benchmarks (BEIR and MTEB) *<mark style="color:yellow;">**without relying on labelled data**</mark>*.  When mixed with labelled data, the model sets new state-of-the-art results, showing a significant improvement.

<mark style="color:green;">**Efficiency and Performance**</mark>

The proposed method achieves competitive or even state-of-the-art performance on text embedding benchmarks with fewer than 1k training steps and without relying on labelled data.&#x20;

This indicates a significant advancement in training efficiency and effectiveness of text embeddings.

<mark style="color:green;">**Contrastive Loss**</mark>

The training involves using standard contrastive loss, *<mark style="color:yellow;">**a method that helps the model learn by contrasting positive examples (similar or related texts) against negative ones (unrelated texts)**</mark>*.&#x20;

This helps in refining the embeddings so that similar texts are closer in the embedding space, while dissimilar ones are further apart.

### <mark style="color:purple;">Creation of the training dataset</mark>

<mark style="color:green;">**Categorisation of Embedding Tasks**</mark>

The researchers categorise text embedding tasks into two main groups: *<mark style="color:yellow;">**asymmetric and symmetric tasks**</mark>*.&#x20;

This is crucial for tailoring the data generation process to the specific needs of different types of embedding tasks, ensuring that the synthetic data covers a wide range of potential scenarios.

### <mark style="color:purple;">**Asymmetric Tasks**</mark>

These involve <mark style="color:yellow;">semantically related queries and documents that are</mark> <mark style="color:yellow;"></mark>*<mark style="color:yellow;">**not**</mark>* <mark style="color:yellow;"></mark><mark style="color:yellow;">direct paraphrases</mark>.&#x20;

In the context mentioned, <mark style="color:green;">**"not paraphrases"**</mark> refers to the relationship between the query and the document in asymmetric tasks. &#x20;

When it's stated that the query and document are <mark style="color:yellow;">**semantically related**</mark> but are not paraphrases of each other, it means that while the query and the document share a thematic or conceptual connection, the wording, structure, or phrasing between the two is not identical or nearly identical.

Paraphrasing typically involves rewording a sentence or passage while retaining the original meaning. So, if a query and a document were paraphrases of each other, they would convey the same message but with different words or sentence structures.

However, in <mark style="color:blue;">**asymmetric tasks**</mark>, the *<mark style="color:yellow;">**objective is to capture a broader and more nuanced relationship**</mark>* where the document is relevant to the query but does not merely restate the query in different words. For example:

* Query: "How to prepare for a marathon?"
* Positive Document: "Marathon training requires consistent running, proper nutrition, and a well-planned schedule."

Here, the document provides relevant information in response to the query but doesn't paraphrase it.&#x20;

This kind of relationship is essential for tasks like information retrieval or question answering, where the goal is to find documents that provide valuable and pertinent information in response to a query, rather than just rephrasing the query itself.

These tasks are <mark style="color:yellow;">**further divided based on the length of queries and documents**</mark>, creating subcategories like <mark style="color:blue;">**short-long, long-short, short-short, and long-long matches.**</mark>&#x20;

#### <mark style="color:green;">Short-long match</mark>

A short query and a long document, which is a common scenario in commercial search engines.&#x20;

Example:&#x20;

Query: "Apple stock price",&#x20;

Document: "Apple Inc. (AAPL) is a multinational technology company... (a detailed financial report)"

#### <mark style="color:green;">Long-short match</mark>

A long query and a short document.&#x20;

Example: Query: "What are the health benefits of regular exercise for adults over 50?",&#x20;

Document: "Regular exercise can help improve cardiovascular health, maintain muscle mass, and reduce the risk of chronic diseases in older adults."

#### <mark style="color:green;">Short-short match</mark>

Both the query and document are short.&#x20;

Example: Query: "Best Italian restaurants",&#x20;

Document: "Top-rated Italian dining spots in the city, offering authentic cuisine and cozy ambiance."

#### <mark style="color:green;">Long-long match</mark>

Both the query and document are long.&#x20;

Example: Query: "A detailed comparison of the features and specifications of iPhone 13 and Samsung Galaxy S22",&#x20;

Document: "The iPhone 13 and Samsung Galaxy S22 are two of the most popular smartphones on the market. Let's take a closer look at their features and specifications... (a comprehensive comparison)"

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FsL36iTgCEQ57EYGsFNlO%2Fimage.png?alt=media&#x26;token=34ab0b0e-80e9-4cd6-9193-2256a3fe2713" alt="" width="491"><figcaption><p>Statistics of the Synthetic Data</p></figcaption></figure>

To create the synthetic dataset, the authors generated 500k examples with 150k unique instructions using GPT-3.5-Turbo and GPT-4. The total token consumption was about 180 million.&#x20;

### <mark style="color:purple;">**Symmetric Tasks**</mark>

These tasks involve queries and documents with similar semantic meanings but different <mark style="color:blue;">**surface forms.**</mark>&#x20;

The term "surface form" refers to the *<mark style="color:yellow;">**literal, explicit way in which words are presented or arranged in text**</mark>*, as opposed to their deeper, underlying semantic meaning. &#x20;

When discussing symmetric tasks in text embeddings, the mention of queries and documents having similar semantic meanings but different surface forms means that while the text pieces convey the same or very similar information or intent, the words and their arrangement (the surface form) differ.

For example, consider the two sentences:

1. "How can I increase the battery life of my phone?"
2. "What are some ways to extend my phone's battery duration?"

Both sentences ask essentially the same question but use different wording and structure—that is, they have different surface forms.  Yet, their semantic meaning or intent (inquiring about improving phone battery life) is the same.

In tasks like semantic textual similarity (STS) and bitext retrieval, the goal is often to identify and link texts that have similar meanings, regardless of their surface forms.&#x20;

This is crucial for many applications in natural language processing, such as machine translation, information retrieval, and question answering systems, where understanding that different phrases can convey the same meaning is vital for effective processing and response generation.

### <mark style="color:purple;">**Reasons for Structuring the Training Method**</mark>

* <mark style="color:green;">**Enhanced Diversity:**</mark> The detailed categorisation and two-step prompting strategy ensure a wide range of scenarios are covered, essential for training a model to handle diverse real-world tasks.
* <mark style="color:green;">**Quality Assurance:**</mark> The methodical approach to template design and data filtering ensures that only high-quality, relevant data is included in the training set.
* <mark style="color:green;">**Global Applicability:**</mark> By generating data in multiple languages, the model is trained to be effective across different linguistic contexts, broadening its usability.
* <mark style="color:green;">**Efficiency:**</mark> Despite the detailed and nuanced approach, the method is designed to be efficient, requiring less than 1k training steps, making it practical for real-world applications.

This structured training method is tailored to generate a rich, diverse, and high-quality dataset, which is necessary for training robust text embedding models capable of handling a wide array of tasks across different languages and contexts.

The training process in this paper involves fine-tuning a pretrained large language model (LLM) using a combination of synthetic data and labelled datasets. &#x20;

The authors employ a <mark style="color:blue;">**contrastive learning approach with the InfoNCE loss function**</mark> to optimise the model.

### <mark style="color:purple;">Here's a step-by-step breakdown of the training process</mark>

### <mark style="color:blue;">Data Preparation</mark>

The authors created several different prompt templates and used ChatGPT to generate a large synthetic dataset from them.

#### <mark style="color:green;">Example: Long-short matching subgroup prompt template</mark>

{% code overflow="wrap" %}

```json
You have been assigned a text classification task: Classify whether a given movie review is positive or negative.

Your mission is to write one text classification example for this task in JSON format. The JSON object must contain the following keys:
- "input_text": a string, the input text specified by the classification task.
- "label": a string, the correct label of the input text.
- "misleading_label": a string, an incorrect label that is related to the task.

Please adhere to the following guidelines:
- The "input_text" should be at least 100 words and diverse in expression.
- The "misleading_label" must be a valid label for the given task, but not as appropriate as the "label" for the "input_text".
- The values for all fields should be in English.
- Avoid including the values of the "label" and "misleading_label" fields in the "input_text", that would make the task too easy.
- The "input_text" is clear and requires high school level education to comprehend.

Your output must always be a JSON object only, do not explain yourself or output anything else. Be creative!
```

{% endcode %}

#### <mark style="color:green;">The resulting synthetic data generated by the LLM</mark>

{% code overflow="wrap" %}

```json
{
  "input_text": "The film is a masterful blend of stunning visuals, impeccable performances, and a thought-provoking narrative. The director's vision is evident in every frame, as the cinematography captures the essence of the characters' emotional journeys. The lead actors deliver nuanced and powerful performances, bringing depth and authenticity to their roles. The screenplay is intelligent and well-crafted, exploring complex themes with finesse. While the pacing may be slower than some viewers prefer, it allows for a rich and immersive experience. Overall, this film is a must-see for anyone who appreciates artful storytelling and the power of cinema.",
  "label": "positive",
  "misleading_label": "neutral"
}
```

{% endcode %}

#### <mark style="color:green;">Example: Monolingual STS</mark>&#x20;

{% code overflow="wrap" %}

```json
Write a sentence triple with varying semantic similarity scores in JSON format. The semantic similarity score ranges from 1 to 5, with 1 denotes least similar and 5 denotes most similar.

Please adhere to the following guidelines:
- The keys in JSON are "S1", "S2", and "S3", the values are all strings in English, do not add any other keys.
- There should be some word overlaps between all three sentences.
- The similarity score between S1 and S2 should be 4.5.
- The similarity score between S1 and S3 should be 2.5.
- The sentences require college level education to understand and should be diverse in terms of topic and length.

Your output must always be a JSON object only with three keys "S1", "S2" and "S3", do not explain yourself or output anything else. Be creative!
```

{% endcode %}

#### <mark style="color:green;">The resulting synthetic data generated by the LLM</mark>

{% code overflow="wrap" %}

```json
{
  "S1": "The economic implications of the proposed policy changes are far-reaching and complex.",
  "S2": "The suggested policy amendments have significant and multifaceted consequences for the economy.",
  "S3": "The new regulations will have some impact on businesses, but the exact effects are unclear."
}
```

{% endcode %}

These examples demonstrate how the LLM is given a specific set of instructions and guidelines to *<mark style="color:yellow;">**generate synthetic data in a structured JSON format**</mark>*.&#x20;

The generated data adhered to the provided requirements, such as word count, language, similarity scores, and educational level needed for comprehension.
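In practice, generations that violate the requested format need to be filtered out before training. Below is a minimal validation sketch for the text classification template above — the key names follow that prompt, while the word-count threshold and leakage check are illustrative assumptions, not the paper's exact filtering procedure:

```python
import json

# Keys required by the classification prompt template shown above.
REQUIRED_KEYS = {"input_text", "label", "misleading_label"}

def validate_example(raw: str, min_words: int = 100) -> bool:
    """Return True if a generated string is a usable classification example."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False  # model did not emit valid JSON
    if set(obj) != REQUIRED_KEYS:
        return False  # missing or extra keys
    if not all(isinstance(v, str) for v in obj.values()):
        return False  # all values must be strings
    # The prompt asks for at least `min_words` words in the input text.
    if len(obj["input_text"].split()) < min_words:
        return False
    # The prompt forbids leaking the label into the input text.
    if obj["label"].lower() in obj["input_text"].lower():
        return False
    return True
```

A check like this lets the pipeline silently drop malformed generations instead of contaminating the fine-tuning set.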

### <mark style="color:blue;">Query Instruction Template</mark>

The <mark style="color:blue;">**query instruction template**</mark> is used to <mark style="color:blue;">**modify the original query**</mark> $$q^+$$ to <mark style="color:blue;">**create a new query**</mark> $$q^+_{\text{inst}}$$ that *<mark style="color:yellow;">**includes a task definition**</mark>*.&#x20;

This is done to provide additional context to the model about the specific task it should perform when processing the query-document pair.

Here's a detailed explanation of the process:

For each <mark style="color:blue;">**relevant query-document pair**</mark> $$(q^+, d^+)$$, the <mark style="color:blue;">**original query**</mark> $$q^+$$ is modified using the <mark style="color:blue;">**instruction template**</mark>.

#### <mark style="color:green;">The instruction template consists of two parts</mark>

* **"Instruct:&#x20;**<mark style="color:purple;">**{task\_definition}**</mark>**"**: This part provides a one-sentence description of the embedding task. The <mark style="color:purple;">**"{task\_definition}"**</mark> placeholder is *<mark style="color:yellow;">**replaced with the actual task description**</mark>*.
* <mark style="color:purple;">**"Query: {q+}"**</mark>: This part includes the original query $$q^+$$.

The two parts are then <mark style="color:blue;">**concatenated with a newline character**</mark>**&#x20;**<mark style="color:purple;">**(\n)**</mark> to form the <mark style="color:blue;">**new query**</mark> $$q^+_{\text{inst}}$$.
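This concatenation amounts to a simple string template. A minimal sketch (the function name and the example task definition are illustrative, not taken from the paper's code):

```python
def build_instructed_query(task_definition: str, query: str) -> str:
    """Prepend a one-sentence task definition to the query, joined by a newline."""
    return f"Instruct: {task_definition}\nQuery: {query}"

# Hypothetical task definition for a retrieval-style task.
q_inst = build_instructed_query(
    "Given a web search query, retrieve relevant passages that answer the query.",
    "how to prepare for a marathon",
)
print(q_inst)
# Instruct: Given a web search query, retrieve relevant passages that answer the query.
# Query: how to prepare for a marathon
```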

For the synthetic data, the <mark style="color:blue;">**task definitions**</mark> are obtained from the outputs of the LLM.&#x20;

For other datasets like MS-MARCO, the <mark style="color:blue;">**task definitions**</mark> are manually crafted and applied to all queries in the dataset.

The <mark style="color:blue;">**document side**</mark> $$d^+$$ is *<mark style="color:yellow;">**not modified with any instruction prefix**</mark>*. &#x20;

This allows the document index to be prebuilt, and the task can be customised by changing only the query side.

By applying the <mark style="color:blue;">**query instruction template**</mark>, the model receives additional information about the specific task it should perform for each query-document pair.  This can help the model better understand the context and generate more relevant embeddings for the given task.

To recap: for each relevant <mark style="color:blue;">**query-document pair**</mark> $$(q^+, d^+)$$, the <mark style="color:blue;">**instruction template**</mark> above is applied to the <mark style="color:blue;">**original query**</mark> $$q^+$$ to generate a <mark style="color:blue;">**new query**</mark> $$q^+_{\text{inst}}$$.

Here, <mark style="color:purple;">**"{task\_definition}"**</mark> is a placeholder for a *<mark style="color:yellow;">one-sentence description of the embedding task</mark>*.&#x20;

For generated synthetic data, the outputs from the brainstorming step are used.  For other datasets, such as MS-MARCO, the task definitions are manually crafted and applied to all the queries in the dataset.

<details>

<summary><mark style="color:green;"><strong>Why the use of MARCO Dataset in the training dataset</strong></mark></summary>

The MS-MARCO dataset is used in addition to the synthetic data for a few reasons:

<mark style="color:purple;">Comparison with previous work:</mark> The authors mention that they report results when the only labeled supervision is the MS-MARCO passage ranking dataset to provide a fair comparison with some previous work. This suggests that using MS-MARCO allows them to benchmark their model against existing methods that have been evaluated on this dataset.

<mark style="color:purple;">Real-world data:</mark> While synthetic data is valuable for generating a large and diverse training set, it may not capture all the nuances and complexities of real-world data. Including a well-established dataset like MS-MARCO ensures that the model is exposed to actual user queries and relevant documents, which can help improve its performance on real-world tasks.

<mark style="color:purple;">Validation of the approach:</mark> By using both synthetic data and a public dataset like MS-MARCO, the authors can demonstrate that their approach is effective not only on artificially generated data but also on a widely-used benchmark dataset. This helps to validate the generalizability and robustness of their method.

</details>

#### <mark style="color:green;">Embedding Extraction</mark>

* Given a pretrained LLM (in this case, Mistral-7b), append an <mark style="color:blue;">**\[EOS] token**</mark> to the end of the query $$q^+_{\text{inst}}$$ and document $$d^+$$.
* Feed the <mark style="color:blue;">**modified query**</mark> and <mark style="color:blue;">**document**</mark> into the pretrained LLM.
* Obtain the <mark style="color:blue;">**query and document embeddings**</mark> $$(h_{q^+_{\text{inst}}}, h_{d^+})$$ by taking the last layer <mark style="color:blue;">**\[EOS]**</mark> vector.
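The \[EOS]-pooling step can be sketched as follows, using dummy hidden states in place of a real model's outputs. The sketch assumes right-padding, so each sequence's last non-padded position holds the appended \[EOS] token (the function name is illustrative):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the last-layer vector at each sequence's final non-padded
    position (the appended [EOS] token, assuming right-padding).

    hidden_states:  (batch, seq_len, dim) last-layer outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    last_positions = attention_mask.sum(axis=1) - 1   # index of [EOS] per sequence
    batch_indices = np.arange(hidden_states.shape[0])
    return hidden_states[batch_indices, last_positions]  # (batch, dim)

# Dummy batch: 2 sequences of length 4, embedding dim 3.
h = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],    # 3 real tokens -> pool position 2
                 [1, 1, 1, 1]])   # 4 real tokens -> pool position 3
emb = last_token_pool(h, mask)
print(emb.shape)  # (2, 3)
```

In the actual pipeline, `h` would be the last-layer hidden states returned by the Mistral-7b forward pass rather than dummy values.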

### <mark style="color:blue;">Fine-tuning Dataset</mark>

The fine-tuning dataset consists of *<mark style="color:yellow;">**both the generated synthetic data and a collection of 13 public datasets**</mark>*, yielding approximately <mark style="color:blue;">**1.8M examples**</mark> after sampling.&#x20;

At the end of the data generation process, the fine-tuning dataset would have the following structure:

<mark style="color:blue;">**Synthetic Data**</mark>

* <mark style="color:purple;">**Short-long matching:**</mark> 167k examples
* <mark style="color:purple;">**Long-short matching:**</mark> 122k examples
* <mark style="color:purple;">**Short-short matching:**</mark> 13k examples
* <mark style="color:purple;">**Long-long matching:**</mark> 17k examples
* <mark style="color:purple;">**Bitext retrieval:**</mark> 89k examples
* <mark style="color:purple;">**Monolingual STS**</mark><mark style="color:blue;">**:**</mark> 99k examples
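Summing the synthetic subsets above confirms the roughly 500k total examples reported earlier:

```python
# Example counts per synthetic task category, as listed above (in thousands).
synthetic_counts = {
    "short-long matching": 167_000,
    "long-short matching": 122_000,
    "short-short matching": 13_000,
    "long-long matching": 17_000,
    "bitext retrieval": 89_000,
    "monolingual STS": 99_000,
}
print(sum(synthetic_counts.values()))  # 507000, i.e. ~500k synthetic examples
```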

#### <mark style="color:blue;">Public Datasets</mark>

* 13 datasets (e.g., MS-MARCO, NQ, SQuAD, TriviaQA, etc.)

The final fine-tuning dataset is a combination of the synthetic data and the public datasets, totalling approximately 1.8M examples after sampling.&#x20;

This diverse dataset, covering various task types and languages, is used to train the Mistral-7b model using the <mark style="color:blue;">**InfoNCE loss function.**</mark>

### <mark style="color:blue;">Training Objective</mark>

The authors used the <mark style="color:blue;">**InfoNCE loss function**</mark> to train the embedding model:

{% embed url="https://paperswithcode.com/method/infonce" %}
An explanation of this loss function
{% endembed %}

$$
\min \; \mathcal{L} = -\log \frac{\phi(q^+_{\text{inst}}, d^+)}{\phi(q^+_{\text{inst}}, d^+) + \sum_{n_i \in \mathbb{N}} \phi(q^+_{\text{inst}}, n_i)}
$$

* $$\mathbb{N}$$ denotes the <mark style="color:blue;">**set of all negatives**</mark> (i.e., non-relevant documents).
* $$\phi(q, d)$$ is a function that computes the matching score between query $$q$$ and document $$d$$, using the <mark style="color:blue;">**temperature-scaled cosine similarity**</mark>:

$$
\phi(q, d) = \exp\left(\frac{1}{\tau} \cos(h_q, h_d)\right)
$$

* $$\tau$$ is a <mark style="color:blue;">**temperature hyperparameter**</mark>, fixed to 0.02 in the experiments.
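The loss can be sketched numerically as follows, with toy embeddings standing in for real model outputs and $$\tau = 0.02$$ as in the paper:

```python
import numpy as np

def infonce_loss(q: np.ndarray, d_pos: np.ndarray,
                 d_negs: np.ndarray, tau: float = 0.02) -> float:
    """InfoNCE loss with temperature-scaled cosine similarity.

    q:      (dim,) query embedding
    d_pos:  (dim,) positive (relevant) document embedding
    d_negs: (n, dim) negative (non-relevant) document embeddings
    """
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos_score = np.exp(cos(q, d_pos) / tau)
    neg_scores = np.exp(np.array([cos(q, n) for n in d_negs]) / tau)
    return float(-np.log(pos_score / (pos_score + neg_scores.sum())))

# Toy embeddings (illustrative only).
q = np.array([1.0, 0.2, 0.0])
d_pos = np.array([0.9, 0.3, 0.1])       # close to the query
d_negs = np.array([[0.0, 1.0, 0.9],     # unrelated
                   [0.1, 0.8, 1.0]])
print(infonce_loss(q, d_pos, d_negs))   # near zero: the positive dominates
```

Minimising this loss pushes the query embedding towards its relevant document and away from the negatives, which is exactly the geometry the contrastive training aims for.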

### <mark style="color:purple;">What is the point of this research?</mark>

The purpose of this research is to develop a simpler and more effective method for obtaining high-quality text embeddings, which are vector representations of natural language that encode semantic information.&#x20;

Text embeddings are widely used in various commercial applications, such as:

<mark style="color:green;">**Information retrieval (IR):**</mark> Text embeddings enable efficient retrieval of relevant documents from large-scale corpora, which is essential for search engines, content recommendation systems, and knowledge management platforms.

<mark style="color:green;">**Question answering:**</mark> Text embeddings can be used to find the most relevant passages or documents to answer a given question, improving the accuracy and efficiency of question-answering systems.

<mark style="color:green;">**Semantic textual similarity:**</mark> Text embeddings help determine the semantic similarity between two pieces of text, which is crucial for tasks like duplicate detection, plagiarism checking, and data deduplication.

<mark style="color:green;">**Bitext mining:**</mark> Text embeddings can be employed to identify parallel sentences across different languages, facilitating the creation of bilingual corpora for machine translation and cross-lingual information retrieval.

<mark style="color:green;">**Item recommendation:**</mark> Text embeddings can be used to represent user preferences and item descriptions, enabling personalized recommendations in e-commerce and content platforms.

The main motivation behind this research is to overcome the limitations of existing methods for obtaining text embeddings.&#x20;

Current methods often rely on complex multi-stage training pipelines, requiring substantial engineering efforts to curate large amounts of relevance pairs.&#x20;

Also, they depend on manually collected datasets that are often constrained by the diversity of tasks and the coverage of languages.

By leveraging large language models (LLMs) to generate diverse synthetic data and fine-tuning open-source decoder-only LLMs using contrastive loss, <mark style="color:yellow;">the proposed method eliminates the need for complex training pipelines and manually collected datasets.</mark>&#x20;

This approach not only simplifies the process of obtaining high-quality text embeddings but also achieves strong performance on competitive benchmarks, surpassing previous methods by a significant margin.


### <mark style="color:purple;">References</mark>

#### <mark style="color:green;">Sentence and Text Embeddings</mark>

* Sanjeev Arora et al. propose a baseline for sentence embeddings, ICLR 2017.
* Tianyu Gao et al. introduce SimCSE, a simple contrastive learning method, EMNLP 2021.
* Xianming Li and Jing Li explore angle-optimized text embeddings, ArXiv 2023.
* Zehan Li et al. work on general text embeddings with multi-stage contrastive learning, ArXiv 2023.
* Tomas Mikolov et al. present Efficient Estimation of Word Representations in Vector Space, ICLR 2013.
* Niklas Muennighoff discusses GPT sentence embeddings for semantic search, ArXiv 2022.
* Niklas Muennighoff et al. introduce the Massive Text Embedding Benchmark, EACL 2023.
* Jianmo Ni et al. present Sentence-T5, ACL 2022.
* Jeffrey Pennington et al. describe GloVe: Global Vectors for Word Representation, EMNLP 2014.
* Nils Reimers and Iryna Gurevych introduce Sentence-BERT, EMNLP-IJCNLP 2019.

#### <mark style="color:green;">Information Retrieval and Question Answering</mark>

* Luiz Henrique Bonifacio et al. discuss unsupervised dataset generation, SIGIR 2022.
* Daniel Fernando Campos et al. describe MS MARCO, a machine reading comprehension dataset, ArXiv 2016.
* Zhuyun Dai et al. present Promptagator for few-shot dense retrieval, ICLR 2022.
* Vladimir Karpukhin et al. discuss Dense Passage Retrieval, EMNLP 2020.
* Patrick S. H. Lewis et al. explore Retrieval-Augmented Generation, NeurIPS 2020.
* Yifu Qiu et al. introduce DuReader-retrieval for passage retrieval, EMNLP 2022.
* Liang Wang et al. discuss Query2doc using large language models, EMNLP 2023.
* Zhilin Yang et al. introduce HotpotQA, a multi-hop question answering dataset, EMNLP 2018.

#### <mark style="color:green;">Natural Language Processing (NLP) and AI Models</mark>

* Samuel R. Bowman et al. develop a large annotated corpus for natural language inference, EMNLP 2015.
* Tom B. Brown et al. discuss language models as few-shot learners, NeurIPS 2020.
* Alexis Conneau et al. study supervised learning of universal sentence representations, EMNLP 2017.
* Jacob Devlin et al. introduce BERT for language understanding, NAACL-HLT 2019.
* Angela Fan et al. present ELI5 for long-form question answering, ACL 2019.
* Tianyu Gao et al. discuss enabling LLMs to generate text with citations, ArXiv 2023.
* Edward J. Hu et al. describe Lora, a low-rank adaptation of LLMs, ICLR 2022.
* OpenAI details GPT-4 in their technical report, ArXiv 2023.

#### <mark style="color:green;">Diverse Topics in NLP</mark>

* Alexis Conneau et al. explore unsupervised cross-lingual representation learning, ACL 2020.
* Gautier Izacard et al. discuss unsupervised dense information retrieval, ArXiv 2021.
* Albert Q Jiang et al. introduce Mistral 7b, ArXiv 2023.
* Baptiste Rozière et al. discuss Code llama for open foundation models, ArXiv 2023.
* Hongjin Su et al. detail instruction-fine-tuned text embeddings, ACL 2023.
* Kexin Wang et al. discuss Generative Pseudo Labeling for domain adaptation, NAACL-HLT 2022.
* Liang Wang et al. explore text embeddings by weakly-supervised contrastive pre-training, ArXiv 2022.
* Shitao Xiao et al. introduce C-pack for advancing general Chinese embedding, ArXiv 2023.
* Xiaohui Xie et al. develop T2ranking, a Chinese benchmark for passage ranking, ArXiv 2023.
* Xinyu Zhang et al. present Mr. TyDi for multilingual dense retrieval, ACL Workshop 2021.
* Xinyu Crystina Zhang et al. discuss Miracl, a multilingual retrieval dataset, Transactions of the ACL 2023.
