One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Copyright Continuum Labs - 2023
This May 2023 paper introduces INSTRUCTOR, a method for computing text embeddings using task instructions.
INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains without requiring further task-specific fine-tuning.
This is achieved through instruction-based fine-tuning on a multitask mixture of 330 diverse datasets with human-written task instructions (MEDI dataset).
INSTRUCTOR embeds every input together with instructions explaining the use case (e.g., task and domain descriptions).
The same input text will be encoded into different embeddings based on the end task (e.g., duplicate detection, information retrieval, or question classification).
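The effect of the instruction can be illustrated with a minimal sketch. This is not the INSTRUCTOR model: `toy_embed` is a hypothetical stand-in encoder (a hashed bag of words) that, like INSTRUCTOR, encodes the instruction and the input as a single sequence. The instruction strings follow the "Represent the ... for ..." pattern but are illustrative, and the dimensionality is arbitrary.

```python
import hashlib
import math

DIM = 64  # toy dimensionality, chosen arbitrarily for this sketch


def toy_embed(instruction: str, text: str) -> list[float]:
    """Hypothetical stand-in for an instruction-aware encoder.

    The instruction and the input are concatenated and encoded together,
    so the instruction influences the resulting vector.
    """
    vec = [0.0] * DIM
    for token in f"{instruction} {text}".lower().split():
        # Stable hash so results are reproducible across runs.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit-length vectors (plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


text = "Who sings the song Love Story?"
emb_dup = toy_embed("Represent the question for duplicate detection:", text)
emb_ret = toy_embed("Represent the question for retrieving supporting documents:", text)
emb_cls = toy_embed("Represent the question for classification:", text)

# Same input text, different instructions: the vectors are related but not identical.
print(round(cosine(emb_dup, emb_ret), 3))
print(round(cosine(emb_dup, emb_cls), 3))
```

In the real model a learned transformer encoder takes the place of the hashing step, so the instruction changes the representation in a task-meaningful way rather than by simple token overlap.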
INSTRUCTOR can be used for a wide range of downstream applications without additional fine-tuning, including classification, semantic textual similarity, information retrieval, text generation evaluation, and prompt retrieval for in-context learning.
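As an illustration of using embeddings without any further training, the sketch below ranks documents for a query by cosine similarity, which is the usual way frozen embeddings are applied to retrieval. The vectors are hypothetical, hand-written values standing in for embedder output, and `rank` is a helper defined only for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)


# Hypothetical, hand-written vectors standing in for embedder output.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "doc_a": [0.8, 0.2, 0.1],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

ranking = rank(query_vec, doc_vecs)
print(ranking)
```

The same pattern extends to the other applications listed: for classification, for instance, the fixed embeddings can be fed to a lightweight classifier while the embedder itself stays untouched.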
INSTRUCTOR is trained on MEDI, a collection of 330 text embedding datasets newly annotated with human-written task instructions. This multitask mixture covers diverse task categories and domains.
The model is trained with a contrastive loss that maximises the similarity between semantically related text pairs while minimising the similarity between unrelated pairs.
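One common way to write such an objective is an InfoNCE-style loss over cosine similarities. The sketch below assumes that form; the temperature value, the function names, and the toy vectors are illustrative and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def contrastive_loss(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log( exp(s(q,p)/t) / sum over candidates exp(s(q,c)/t) )."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


query = [1.0, 0.0, 0.2]
related = [0.9, 0.1, 0.3]
unrelated = [[-0.5, 1.0, 0.0], [0.0, -1.0, 0.4]]

good = contrastive_loss(query, related, unrelated)
# Swap roles: treat an unrelated text as the "positive" to see the loss rise.
bad = contrastive_loss(query, unrelated[0], [related, unrelated[1]])
print(good < bad)
```

Minimising this quantity pushes the related pair's similarity up relative to every other candidate, which is the behaviour described above.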
Instruction-based fine-tuning enables INSTRUCTOR to learn task-specific representations by conditioning the embeddings on the task instructions. This allows the model to adapt to different downstream tasks and domains.
The diverse training data in MEDI, which includes both symmetric (e.g., text similarity) and asymmetric (e.g., open-domain QA) tasks, helps INSTRUCTOR generate broadly applicable embeddings.
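The symmetric/asymmetric distinction can be made concrete with a small record type. This is a hypothetical layout for illustration, not the MEDI schema: it assumes that a symmetric task uses the same instruction on both sides of a pair, while an asymmetric task uses different instructions for the query and the document. All field names and example strings are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructedPair:
    """Hypothetical record layout for one training pair (not the MEDI schema)."""
    task: str
    query_instruction: str
    query: str
    doc_instruction: str
    doc: str

    @property
    def is_symmetric(self) -> bool:
        # In this sketch, a task is symmetric when both sides share one instruction.
        return self.query_instruction == self.doc_instruction


similarity = InstructedPair(
    task="text similarity",
    query_instruction="Represent the sentence for semantic similarity:",
    query="A man is playing a guitar.",
    doc_instruction="Represent the sentence for semantic similarity:",
    doc="Someone plays the guitar.",
)

open_qa = InstructedPair(
    task="open-domain QA",
    query_instruction="Represent the question for retrieving supporting documents:",
    query="Who wrote Hamlet?",
    doc_instruction="Represent the document for retrieval:",
    doc="Hamlet is a tragedy written by William Shakespeare.",
)

print(similarity.is_symmetric, open_qa.is_symmetric)
```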
The inclusion of the Super-NaturalInstructions dataset in MEDI improves INSTRUCTOR's robustness to paraphrased instructions, making it less sensitive to variations in instruction format and style.
Because the instruction is part of the input, the task is specified at inference time rather than fixed through fine-tuning: changing the instruction changes the embedding, with no further training required.
Training across the varied tasks and domains in MEDI is what allows this behaviour to carry over to tasks and domains the model has not seen.
Experiments show that INSTRUCTOR outperforms prior state-of-the-art embedding models across 70 evaluation datasets, 66 of which were not seen during training, with an average improvement of 3.4% over the previous best results, demonstrating strong generalisation.
In summary, INSTRUCTOR is a text embedding model that conditions its output on task instructions.
By training on a diverse multitask mixture with human-written instructions, INSTRUCTOR learns to generate task-aware embeddings that can be used for various downstream applications without the need for further fine-tuning.
This approach significantly improves the model's generalisation capabilities and makes it a powerful tool for a wide range of natural language processing tasks.