One Embedder, Any Task: Instruction-Finetuned Text Embeddings

This May 2023 paper introduces INSTRUCTOR, a method for computing text embeddings using task instructions.

INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains without requiring further task-specific fine-tuning.

This is achieved through instruction-based fine-tuning on a multitask mixture of 330 diverse datasets with human-written task instructions (MEDI dataset).

Key points

INSTRUCTOR embeds every input together with instructions explaining the use case (e.g., task and domain descriptions).
The same input text will be encoded into different embeddings based on the end task (e.g., duplicate detection, information retrieval, or question classification).
INSTRUCTOR can be used for a wide range of downstream applications without additional fine-tuning, including classification, semantic textual similarity, information retrieval, text generation evaluation, and prompt retrieval for in-context learning.
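
To make the second point concrete, the sketch below pairs one sentence with several task instructions. The helper name pair_with_instructions and the instruction wordings are illustrative assumptions, not taken from the paper; the [instruction, text] pair layout follows the examples in the repository's README.

```python
def pair_with_instructions(text, instructions):
    """Return one [instruction, text] pair per task instruction."""
    return [[instruction, text] for instruction in instructions]


# Hypothetical instructions for three different end tasks.
instructions = [
    "Represent the question for detecting duplicate questions:",
    "Represent the question for retrieving supporting documents:",
    "Represent the question for classification:",
]
pairs = pair_with_instructions("Who wrote On the Origin of Species?", instructions)

for instruction, text in pairs:
    print(f"{instruction!r} + {text!r}")

# With a loaded model, each pair would be encoded into its own vector:
#   embeddings = model.encode(pairs)  # one row per [instruction, text] pair
```

The text is identical in all three pairs; only the instruction changes, and that is the signal the model uses to produce a different embedding for each task.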

GitHub: xlang-ai/instructor-embedding, [ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

GitHub Repository

INSTRUCTOR generates task-specific text embeddings without further fine-tuning, and the repository packages it so that it can be integrated into applications such as text classification, semantic similarity, and information retrieval.

Here's how you can use INSTRUCTOR in your development workflow, based on the provided GitHub repository:

Installation:

Create a virtual environment using conda and activate it: conda create -n instructor python=3.7, then conda activate instructor
Clone the INSTRUCTOR repository: git clone https://github.com/HKUNLP/instructor-embedding
Install the required dependencies from inside the cloned directory: pip install -r requirements.txt
Install the InstructorEmbedding package, either from PyPI (pip install InstructorEmbedding) or from the cloned source (pip install -e .)

Loading a pre-trained model:

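Choose a pre-trained INSTRUCTOR checkpoint from the repository's model list (e.g., hkunlp/instructor-large); the weights are downloaded the first time the model is loaded.
Load the model using the INSTRUCTOR class from the InstructorEmbedding package: from InstructorEmbedding import INSTRUCTOR, then model = INSTRUCTOR('hkunlp/instructor-large')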

Generating customised embeddings:

Prepare your text inputs along with the corresponding task instructions using the unified template: Represent the domain text_type for task_objective:, where domain and task_objective are optional, and text_type is required.
Call the encode function of the loaded model to generate customised embeddings: embeddings = model.encode(texts_with_instructions), where texts_with_instructions is a list of [instruction, text] pairs.
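
The template and the encode call above can be sketched as follows. The helper names build_instruction, load_model, and encode_with_instruction are hypothetical conveniences introduced here, not part of the InstructorEmbedding package; only the INSTRUCTOR class and its encode method come from the library.

```python
def build_instruction(text_type, domain=None, task_objective=None):
    """Fill the template: Represent the [domain] text_type [for task_objective]:"""
    if not text_type:
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append(f"for {task_objective}")
    return " ".join(parts) + ":"


def load_model(name="hkunlp/instructor-large"):
    """Load a checkpoint; imported lazily so the helpers work without the package."""
    from InstructorEmbedding import INSTRUCTOR
    return INSTRUCTOR(name)


def encode_with_instruction(model, texts, instruction):
    """Pair every text with the same instruction and encode the whole batch."""
    return model.encode([[instruction, text] for text in texts])


print(build_instruction("title", domain="Science"))
print(build_instruction("document", domain="Wikipedia", task_objective="retrieval"))

# Typical use once the package and a checkpoint are available:
#   model = load_model()
#   instruction = build_instruction("title", domain="Science")
#   embeddings = encode_with_instruction(model, ["A paper title"], instruction)
```

Building the instruction from its three parts keeps the wording consistent across a project, which matters because the instruction text is part of the model input.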

Applying INSTRUCTOR to specific use cases:

Compute similarities between texts:
- Encode two groups of sentences using INSTRUCTOR with customised instructions.
- Calculate the cosine similarity between the generated embeddings using cosine_similarity from sklearn.metrics.pairwise.
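
As a rough illustration of the similarity step, the sketch below computes a pairwise cosine similarity matrix in plain Python. It is a dependency-free stand-in for cosine_similarity from sklearn.metrics.pairwise, and the toy vectors replace real model output; the function name cosine_similarity_matrix is an assumption made for this example.

```python
import math


def cosine_similarity_matrix(rows_a, rows_b):
    """Cosine similarity between every vector in rows_a and every vector in rows_b."""
    def norm(vector):
        return math.sqrt(sum(x * x for x in vector))

    matrix = []
    for a in rows_a:
        row = []
        for b in rows_b:
            dot = sum(x * y for x, y in zip(a, b))
            denom = norm(a) * norm(b)
            # A zero vector has no direction, so its similarity is reported as 0.
            row.append(dot / denom if denom else 0.0)
        matrix.append(row)
    return matrix


# Toy 2-dimensional vectors standing in for the two groups of embeddings, e.g.
#   embeddings_a = model.encode([[instruction_a, s] for s in sentences_a])
#   embeddings_b = model.encode([[instruction_b, s] for s in sentences_b])
embeddings_a = [[1.0, 0.0], [0.0, 1.0]]
embeddings_b = [[1.0, 0.0], [1.0, 1.0]]

similarities = cosine_similarity_matrix(embeddings_a, embeddings_b)
print([[round(s, 3) for s in row] for row in similarities])
# → [[1.0, 0.707], [0.0, 0.707]]
```

Entry [i][j] of the result compares sentence i of the first group with sentence j of the second, which is the same layout the scikit-learn function returns.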
Use customised embeddings for information retrieval:
- Encode the query and corpus documents using INSTRUCTOR with appropriate instructions.
- Calculate the cosine similarity between the query and document embeddings.
- Retrieve the most relevant document based on the similarity scores.
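
The retrieval steps can be sketched as a ranking function over precomputed embeddings. The names cosine and rank_documents are hypothetical, the vectors are toy values rather than model output, and the instruction wordings in the comments are only examples of how a query and a document might be described differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(u, v))
    denom = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / denom if denom else 0.0


def rank_documents(query_embedding, corpus_embeddings):
    """Return (document_index, score) pairs, best match first."""
    scores = [cosine(query_embedding, doc) for doc in corpus_embeddings]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# In practice the vectors would come from the model, with separate instructions
# for the two sides of the task, for example:
#   query_embedding = model.encode([["Represent the question for retrieving supporting documents:", query]])[0]
#   corpus_embeddings = model.encode([["Represent the document for retrieval:", d] for d in corpus])
query_embedding = [1.0, 0.2]
corpus_embeddings = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

ranking = rank_documents(query_embedding, corpus_embeddings)
print([index for index, _ in ranking])
# → [1, 2, 0]
```

Taking ranking[0] gives the single most relevant document; slicing ranking[:k] gives the top k.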
Use customised embeddings for clustering:
- Encode the sentences using INSTRUCTOR with customised instructions for clustering.
- Apply a clustering algorithm (e.g., MiniBatchKMeans from sklearn.cluster) to the generated embeddings.
- Assign cluster labels to the sentences based on the clustering results.
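
A minimal sketch of the clustering steps, assuming scikit-learn is installed. The helper names cluster_embeddings and group_by_cluster, the default of two clusters, and the fixed seed are choices made for this example; the demonstration at the bottom uses hand-written labels so it runs without a model.

```python
from collections import defaultdict


def cluster_embeddings(embeddings, n_clusters=2, seed=0):
    """Fit MiniBatchKMeans and return one integer cluster label per embedding."""
    # Imported lazily; this function needs scikit-learn to be installed.
    from sklearn.cluster import MiniBatchKMeans

    clustering_model = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
    clustering_model.fit(embeddings)
    return clustering_model.labels_.tolist()


def group_by_cluster(sentences, labels):
    """Map each cluster label to the sentences assigned to it."""
    groups = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        groups[label].append(sentence)
    return dict(groups)


sentences = ["protein folding", "gene expression", "interest rates", "bond yields"]

# With a loaded model the labels would be computed from real embeddings:
#   embeddings = model.encode([[clustering_instruction, s] for s in sentences])
#   labels = cluster_embeddings(embeddings, n_clusters=2)
labels = [0, 0, 1, 1]  # hand-written labels for illustration only

print(group_by_cluster(sentences, labels))
```

Cluster numbers carry no meaning of their own, so grouping the sentences by label is usually the easiest way to inspect the result.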

Training INSTRUCTOR (optional):

If you want to train INSTRUCTOR on your own dataset, follow these steps:
- Prepare your training data in the unified format used by the MEDI dataset.
- Run the provided training script (train.py) with the appropriate arguments, specifying the pre-trained checkpoint, output directory, cache directory, and other training hyperparameters.

Evaluation (optional):

To evaluate the performance of INSTRUCTOR on benchmark datasets, follow the provided evaluation scripts for MTEB, Billboard, and Prompt Retrieval.
Install the necessary dependencies and run the evaluation scripts with the desired model checkpoint and task name.

The repository's tools and examples are a practical starting point for integrating INSTRUCTOR into a development workflow. Because the instruction is part of the model input, it is worth experimenting with different task instructions to find the wording that works best for each downstream task.

Why it works

INSTRUCTOR is trained on MEDI, a collection of 330 text embedding datasets newly annotated with human-written task instructions. This multitask mixture covers diverse task categories and domains.
The model is trained with a contrastive loss that maximises the similarity between semantically related text pairs while minimising the similarity between unrelated pairs.
Instruction-based fine-tuning enables INSTRUCTOR to learn task-specific representations by conditioning the embeddings on the task instructions. This allows the model to adapt to different downstream tasks and domains.
The diverse training data in MEDI, which includes both symmetric (e.g., text similarity) and asymmetric (e.g., open-domain QA) tasks, helps INSTRUCTOR generate broadly applicable embeddings.
The inclusion of the Super-NaturalInstructions dataset in MEDI improves INSTRUCTOR's robustness to paraphrased instructions, making it less sensitive to variations in instruction format and style.
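
The contrastive objective mentioned above can be sketched for a single query as an InfoNCE-style loss over similarity scores. This is an illustrative simplification rather than the repository's training code: the function name contrastive_loss and the temperature of 0.1 are assumptions, and the similarity scores are toy values.

```python
import math


def contrastive_loss(sim_positive, sims_negative, temperature=0.1):
    """Negative log-softmax of the positive pair's score among all candidates."""
    logits = [sim_positive / temperature] + [s / temperature for s in sims_negative]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# The loss is small when the related pair scores higher than the unrelated ones...
well_separated = contrastive_loss(0.9, [0.1, 0.0])
# ...and large when an unrelated text scores higher than the related one.
confused = contrastive_loss(0.1, [0.9, 0.0])

print(well_separated < confused)
```

Minimising this quantity raises the similarity of the related pair relative to the unrelated ones, which is the behaviour the training description calls for.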

Why it doesn't need fine-tuning for domain-specific tasks

INSTRUCTOR is designed to generate task-aware embeddings based on the provided task instructions, eliminating the need for further task-specific fine-tuning.
The model is trained on a diverse set of tasks and domains in MEDI, which allows it to generalise well to unseen tasks and domains.
The instruction-based approach enables INSTRUCTOR to adapt its embeddings to different use cases on-the-fly, based on the task instructions provided at inference time.
The paper evaluates INSTRUCTOR on 70 embedding tasks, 66 of which were not seen during training, and reports an average improvement of 3.4% over the previous best results, demonstrating its strong generalisation capabilities.

In summary, INSTRUCTOR is a text embedding model that conditions each embedding on a task instruction.

By training on a diverse multitask mixture with human-written instructions, INSTRUCTOR learns to generate task-aware embeddings that can be used for various downstream applications without the need for further fine-tuning.

This approach significantly improves the model's generalisation capabilities and makes it a powerful tool for a wide range of natural language processing tasks.
