DSPy: In-Context Learning for Extreme Multi-Label Classification

This January 2024 paper addresses the challenge of solving multi-label classification problems with thousands of classes using in-context learning alone.

Language models (LMs) often lack prior knowledge about the specific classes, and demonstrating every class in a prompt is impractical.

The authors propose a general program named Infer–Retrieve–Rank (IReRa) to efficiently tackle such problems.

Implemented using the DSPy programming model, IReRa defines multi-step interactions between LMs and retrievers. DSPy optimizers tune the program towards specific datasets using only a few labeled examples.

The proposed solution achieves state-of-the-art results across various benchmarks without requiring finetuning, prompt engineering, or extensive labeled data. The program is highly adaptable to new tasks and datasets, demonstrating competitive performance even in benchmarks with vastly different characteristics.

Explanation of "Classes" in this Context

In the context of this paper, "classes" refer to the different categories or labels that an item can belong to in a multi-label classification task.

Here’s a simplified explanation:

  • Multi-label Classification: This is a type of problem where each item can be assigned more than one label or category. For example, an email might be classified as both "important" and "work-related."

  • Classes: These are the possible labels or categories that an item can be assigned to. In this case, there can be upwards of 10,000 different classes. For example, if you were classifying job descriptions, classes might include labels like "software development," "data analysis," "project management," etc.

  • Extreme Multi-label Classification (XMC): When there are a very large number of possible classes, it becomes an extreme classification problem. Handling such a large number of classes is challenging because a language model needs to understand and distinguish between all these potential categories.

Why This is Challenging

  1. Lack of Prior Knowledge: Language models might not have prior knowledge about the specific classes, especially when there are thousands of them.

  2. Infeasibility of Demonstration: It’s generally impractical to demonstrate every class in a prompt because of the sheer number of classes.

  3. Complex Configuration: Existing methods often require complex configurations with multiple LM calls, prompts, and hyperparameters, making it difficult to apply them to new datasets or LMs.

Proposed Solution: Infer-Retrieve-Rank

To address these challenges, the authors propose a method called Infer-Retrieve-Rank (IReRa), which involves the following steps:

  1. Infer: An in-context learning module processes the input and predicts a set of applicable terms (queries).

  2. Retrieve: These predicted terms are then related to the actual label space using a frozen retriever.

  3. Rank: A second in-context learning module re-ranks the retrieved labels.

Key Features

  • Minimal Prompt: A minimal prompt is used to bootstrap the process, making it easier to configure and adapt to new tasks.

  • Zero-shot Teacher LM: This model generates initial demonstrations to optimise the few-shot Student LM.

  • Efficiency: The approach uses only about 50 labeled examples to achieve state-of-the-art results.

  • DSPy Programming Model: The DSPy model allows for the separate specification and optimisation of the program, making it flexible and generalisable.

Advantages

  • No Finetuning Required: The method does not require extensive finetuning for different datasets.

  • Ease of Adaptation: Adapting to new datasets involves simple steps like writing a new prompt and configuring the LMs.

  • State-of-the-art Performance: The proposed method achieves state-of-the-art results in various benchmarks with minimal labeled data and no prompt engineering.

Related Work

The related work section outlines previous approaches and methods used to tackle extreme multi-label classification (XMC) problems and compares them with the proposed Infer-Retrieve-Rank method.

Canonical Approaches

  1. Finetuning Specialized Retrievers or Binary Classifiers:

    • Specialized Retrievers: These are finetuned over the label space to efficiently retrieve relevant labels.

    • Binary Classifiers: One binary classifier is finetuned per class to decide whether an input belongs to a specific class or not.

    • Drawbacks: Both approaches require a significant amount of labeled data, as each class needs at least a few labeled examples to train effectively.

  2. Distant Supervision:

    • Purpose: Used to avoid manual data labeling.

    • Method: Employs heuristics or external resources to automatically label data. This approach can provide initial labels without manual annotation.

  3. Synthetic Data Bootstrapping:

    • Method: Large Language Models (LLMs) are used to generate synthetic data to augment the training dataset.

    • Examples: Decorte et al. (2023), Clavié and Soulié (2023), and De Raedt et al. (2023) used this method to bootstrap synthetic data.

  4. Finetuning Retrievers on Adjacent Problems:

    • Method: Retrievers are finetuned on related problems where labeled data is available.

    • Example: Remy et al. (2022) used this approach to leverage available data from adjacent problems to improve retriever performance.

  5. Reranking with Additional LLM Calls:

    • Method: An additional LLM call is made at inference time to rerank a list of candidate labels, aiming to boost performance.

    • Example: Clavié and Soulié (2023) employed this technique to enhance label accuracy.

  6. Inference-time Multiple LLM Calls:

    • Method: Zhu and Zamani (2023) utilized multiple GPT-3.5 calls combined with retrieval to bootstrap synthetic prompts per input, infer labels, and rerank them.

    • Evaluation: This approach was evaluated on two recommendation tasks where the input and output documents were of the same type.

Comparison with Infer-Retrieve-Rank

The authors compare their Infer-Retrieve-Rank program with the aforementioned methods, highlighting several key differences and advantages:

  1. Efficiency:

    • Minimal Data Requirement: Infer-Retrieve-Rank can achieve state-of-the-art performance using only approximately 50 labeled examples, making it much more data-efficient compared to other methods that require extensive labeled data.

    • Few LLM Calls: Unlike methods requiring numerous LLM calls per input, Infer-Retrieve-Rank minimizes the number of LLM calls, enhancing efficiency.

  2. No Finetuning Required:

    • The proposed method does not rely on finetuning the LMs or retrievers, simplifying the development and deployment process.

  3. Modular and Declarative Program Logic:

    • Flexibility: The program logic is defined in a modular and declarative manner, allowing it to be seamlessly applied to different benchmarks with a minimal seed-prompt.

    • Automatic Optimization: The DSPy programming model handles optimisation automatically, significantly reducing the need for iterative prompt engineering. This optimisation can be completed in as little as ten minutes.

  4. Configurable Components:

    • Adaptability: The choice of LMs and retrievers can be configured, so the program remains relevant and can improve as stronger components become available.

    • Single Seed-prompt: The method requires at most one seed-prompt per task and in-context module, simplifying the setup process.

Breakdown and explanation of the Infer–Retrieve–Rank Code Block

The provided code block implements the Infer-Retrieve-Rank program using the DSPy framework.

This program is designed to tackle extreme multi-label classification tasks, which involve assigning multiple labels to a given input from a very large set of possible labels.

The Infer-Retrieve-Rank approach uses language models (LMs) and retrievers in a modular and efficient manner to predict, retrieve, and rerank labels based on the input data.

Below is a detailed breakdown and explanation of each part of the code:

1 class InferRetrieveRank(dspy.Module):
2     def __init__(self, infer_sig, rank_sig, retr):
3         # Initialize LM modules with Signatures
4         self.infer = dspy.ChainOfThought(infer_sig)
5         self.rank = dspy.ChainOfThought(rank_sig)
6         self.retrieve = retr
7
8     def forward(self, text: str) -> Prediction:
9         # Predict with LM
10         preds = self.infer(text).completions.labels
11
12         # Parse LM output
13         preds = extract_labels_from_strings(preds)
14
15         # Use LM outputs to retrieve labels
16         labels = self.retrieve(preds)
17
18         # Use LM to rerank labels
19         labels = self.rank(text, labels)
20
21         return dspy.Prediction(labels=labels)

Explanation

Class Definition and Initialization

Class Definition:

1 class InferRetrieveRank(dspy.Module):
  • This line defines a new class named InferRetrieveRank that inherits from dspy.Module. This class represents the Infer-Retrieve-Rank program.

Initialization Method:

2     def __init__(self, infer_sig, rank_sig, retr):
3         # Initialize LM modules with Signatures
4         self.infer = dspy.ChainOfThought(infer_sig)
5         self.rank = dspy.ChainOfThought(rank_sig)
6         self.retrieve = retr
  • Line 2: Defines the __init__ method which initializes the object.

  • Lines 4-5: The infer and rank attributes are initialized using the dspy.ChainOfThought class, which takes infer_sig and rank_sig as signatures. These signatures define the task instructions and the input/output fields for the language model (LM) calls used in the infer and rank steps.

  • Line 6: The retrieve attribute is initialized with the retriever module passed as an argument.

Forward Method

Forward Method Definition:

8     def forward(self, text: str) -> Prediction:
  • This line defines the forward method, which is the core logic for the Infer-Retrieve-Rank program. It takes a text input and returns a prediction.

Inference Step:

9         # Predict with LM
10         preds = self.infer(text).completions.labels
  • Line 10: Uses the infer LM to process the input text and generate predictions (preds). Accessing completions.labels collects the label outputs produced by the LM's completions.

Parsing LM Output:

12         # Parse LM output
13         preds = extract_labels_from_strings(preds)
  • Line 13: The raw predictions are parsed and extracted into a format suitable for retrieval. This involves cleaning and structuring the LM output into a list of labels.
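
The helper extract_labels_from_strings is not defined in the snippet above. A minimal, hypothetical version of such a parser (the function name and cleaning rules are ours, not the authors' implementation) might simply split each completion on commas, normalise whitespace, and de-duplicate:

def extract_labels_from_strings(completions):
    # Hypothetical parser: each completion is expected to be a
    # comma-separated string of labels, e.g. "nausea, headache, dizziness".
    labels = []
    for completion in completions:
        for label in completion.split(","):
            label = label.strip().lower()
            if label and label not in labels:
                labels.append(label)  # de-duplicate while preserving order
    return labels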

Retrieval Step:

15         # Use LM outputs to retrieve labels
16         labels = self.retrieve(preds)
  • Line 16: The parsed labels (preds) are used as queries to the retrieve module, which returns a ranked list of labels based on their similarity to the queries.

Reranking Step:

18         # Use LM to rerank labels
19         labels = self.rank(text, labels)
  • Line 19: The initial list of labels is reranked using the rank LM, which takes both the original input text and the retrieved labels to produce a final ranked list of labels.

Return Prediction:

21         return dspy.Prediction(labels=labels)
  • Line 21: The final ranked labels are returned as a dspy.Prediction object.
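
To make the flow concrete, here is a rough usage sketch of the program. Everything below is illustrative rather than taken from the paper: the placeholder signatures, the toy retriever, and the example text are assumptions, and an LM must be configured for DSPy before anything runs. Because the paper's forward method is itself schematic, read this as pseudocode-level wiring rather than a drop-in script.

import dspy

# An LM must be configured first, e.g. (exact API varies by DSPy version):
# dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

class InferSignature(dspy.Signature):
    """Given the input text, list the applicable labels."""
    text = dspy.InputField(prefix="Text:")
    labels = dspy.OutputField(prefix="Labels:", desc="list of comma-separated labels")

class RankSignature(dspy.Signature):
    """Given the input text and candidate options, pick the most applicable labels."""
    text = dspy.InputField(prefix="Text:")
    options = dspy.InputField(prefix="Options:", desc="list of comma-separated options")
    labels = dspy.OutputField(prefix="Labels:", desc="list of comma-separated labels")

def toy_retriever(queries):
    # Placeholder retriever: a real one would embed the queries and search the label ontology.
    return ["label A", "label B", "label C"]

program = InferRetrieveRank(InferSignature, RankSignature, toy_retriever)
prediction = program(text="A short snippet describing the item to classify.")
print(prediction.labels)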

Breakdown and Explanation of Seed-prompts Code Blocks

The provided code snippets define seed-prompts for the Infer and Rank modules for different datasets using the DSPy framework.

These prompts are organised using the DSPy Signature abstraction, which specifies the structure and behavior of each in-context learning module.

Seed-prompt for BioDEX Infer Module

The BioDEX Infer Module seed-prompt is constructed to address the following key challenges:

Identification of Adverse Drug Reactions

The primary goal is to accurately identify adverse drug reactions mentioned in medical article snippets. By clearly defining the input and output fields, the module is guided to focus on extracting these reactions from the provided text.

Structured Data Processing

The use of dspy.InputField and dspy.OutputField ensures that the data is processed in a structured manner. This structuring helps maintain consistency and accuracy in identifying and formatting the adverse drug reactions.

Efficiency and Accuracy

By providing a clear task description and structured input/output fields, the seed-prompt ensures that the Infer module operates efficiently. The module can quickly process the input text and accurately identify the relevant adverse drug reactions without requiring extensive manual intervention.

Adaptability

The modular approach facilitated by the DSPy framework allows for easy adaptation of the Infer module to different datasets or slightly varied tasks. By changing the input and output field definitions, the module can be tailored to new requirements, demonstrating flexibility and scalability.

The code block

1 class BiodexInferSignature(dspy.Signature):
2     """Given a snippet from a medical article, identify the adverse drug reactions affecting the patient. Always return reactions."""
3
4     text = dspy.InputField(prefix="Article:")
5     output = dspy.OutputField(
6         prefix="Reactions:",
7         desc="list of comma-separated adverse drug reactions"
8     )

Class Definition

1 class BiodexInferSignature(dspy.Signature):
  • Defines a new class BiodexInferSignature that inherits from dspy.Signature. This class specifies the signature for the Infer module on the BioDEX dataset.

Docstring

2     """Given a snippet from a medical article, identify the adverse drug reactions affecting the patient. Always return reactions."""
  • Provides a task description. This docstring explains the task: given a snippet from a medical article, the module should identify and return the adverse drug reactions affecting the patient.

Input Field

4     text = dspy.InputField(prefix="Article:")
  • Defines an input field with the prefix "Article:". This indicates that the input to the module will be a snippet from a medical article.

Output Field

5     output = dspy.OutputField(
6         prefix="Reactions:",
7         desc="list of comma-separated adverse drug reactions"
8     )
  • Defines an output field with the prefix "Reactions:". The description specifies that the output should be a list of comma-separated adverse drug reactions.

The seed-prompt for the BioDEX Infer Module is designed to streamline the identification of adverse drug reactions from medical article snippets.

Using the DSPy framework, this seed-prompt employs a structured and declarative approach to define the behavior of the Infer module, ensuring efficient and accurate performance within the BioDEX dataset.

By explicitly outlining the input and output fields, it facilitates a clear and consistent processing pipeline, enabling the module to reliably extract relevant information.
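
As a purely illustrative usage (the snippet text and printed output below are invented, and an LM is assumed to already be configured for DSPy), this signature would typically be handed to an in-context module such as dspy.ChainOfThought and called on an article snippet:

import dspy

# Assumes an LM has already been configured, e.g. via dspy.settings.configure(lm=...);
# the exact configuration call varies between DSPy versions.
infer = dspy.ChainOfThought(BiodexInferSignature)

snippet = "The patient developed severe nausea and a widespread rash after starting the drug."
result = infer(text=snippet)

# The completion is a comma-separated string, as requested by the output field.
print(result.output)  # e.g. "nausea, rash"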

Seed-prompt for BioDEX Rank Module

The BioDEX Rank Module seed-prompt is constructed to address the following key challenges:

Selection of Relevant Adverse Drug Reactions

The primary goal is to accurately select the most relevant adverse drug reactions from a given list of options within a medical article snippet. By clearly defining the input and output fields, the module is guided to focus on picking the top 10 applicable reactions from the provided options, ensuring relevance and precision.

Structured Data Processing

The use of dspy.InputField and dspy.OutputField ensures that the data is processed in a structured manner. This structuring helps maintain consistency and accuracy in identifying and formatting the adverse drug reactions, facilitating a reliable extraction and ranking process.

Efficiency and Accuracy

By providing a clear task description and structured input/output fields, the seed-prompt ensures that the Rank module operates efficiently. The module can swiftly process the input text and accurately identify the most relevant adverse drug reactions from the options without requiring extensive manual intervention.

Adaptability

The modular approach facilitated by the DSPy framework allows for easy adaptation of the Rank module to different datasets or slightly varied tasks. By changing the input and output field definitions, the module can be tailored to new requirements, demonstrating flexibility and scalability.

1 class BiodexRankSignature(dspy.Signature):
2     """Given a snippet from a medical article, pick the 10 most applicable adverse reactions from the options that are directly expressed in the snippet."""
3
4     text = dspy.InputField(prefix="Article:")
5     options = dspy.InputField(
6         prefix="Options:",
7         desc="List of comma-separated options to choose from"
8     )
9     output = dspy.OutputField(
10         prefix="Reactions:",
11         desc="list of comma-separated adverse drug reactions"
12     )

Class Definition

1 class BiodexRankSignature(dspy.Signature):
  • Defines a new class BiodexRankSignature that inherits from dspy.Signature. This class specifies the signature for the Rank module on the BioDEX dataset.

Docstring

2     """Given a snippet from a medical article, pick the 10 most applicable adverse reactions from the options that are directly expressed in the snippet."""
  • Provides a task description. This docstring explains that the task is to pick the 10 most applicable adverse reactions from a given list of options.

Input Fields

4     text = dspy.InputField(prefix="Article:")
  • Defines an input field with the prefix "Article:". This indicates that the input to the module will be a snippet from a medical article.

5     options = dspy.InputField(
6         prefix="Options:",
7         desc="List of comma-separated options to choose from"
8     )
  • Defines another input field with the prefix "Options:". The description specifies that this field will contain a list of comma-separated options to choose from.

Output Field

9     output = dspy.OutputField(
10         prefix="Reactions:",
11         desc="list of comma-separated adverse drug reactions"
12     )
  • Defines an output field with the prefix "Reactions:". The description specifies that the output should be a list of comma-separated adverse drug reactions.

The seed-prompt for the BioDEX Rank Module is designed to assist in selecting the most relevant adverse drug reactions from a medical article snippet.

Using the DSPy framework, this seed-prompt defines the structure for both input and output fields, ensuring that the Rank module can accurately identify and rank the most applicable adverse reactions. This structured approach allows for efficient processing and ranking of labels within the context of the BioDEX dataset.

How the BioDEX Infer and Rank Modules Work Together

The BioDEX Infer and Rank Modules collaborate to efficiently and accurately identify and rank adverse drug reactions from medical article snippets.

Here’s how they interact and the workflow they follow:

BioDEX Infer Module

The BioDEX Infer Module is the first step in the process. Its primary function is to identify all adverse drug reactions mentioned in a given medical article snippet.

BioDEX Rank Module

Once the BioDEX Infer Module has identified all potential adverse drug reactions, the BioDEX Rank Module steps in to rank these reactions based on relevance. Here’s how it works:

Workflow Interaction

  1. Initial Extraction:

    • The BioDEX Infer Module processes the medical article snippet to extract all mentioned adverse drug reactions. This initial step ensures that all potentially relevant reactions are identified.

  2. Ranking and Selection:

    • The extracted reactions are then passed as options to the BioDEX Rank Module. The Rank Module evaluates these options against the same medical article snippet to select and rank the top 10 most applicable reactions.

  3. Final Output:

    • The ranked reactions are then output by the Rank Module, providing a concise, prioritized list of the most relevant adverse drug reactions for the given medical article snippet.

The BioDEX Infer and Rank Modules work together to efficiently and accurately identify and rank adverse drug reactions from medical article snippets.

The Infer Module extracts all mentioned reactions, while the Rank Module selects and prioritizes the top 10 most relevant reactions.

This collaborative workflow ensures a structured, consistent, and accurate identification and ranking process, leveraging the DSPy framework's modular and declarative approach for optimal performance within the BioDEX dataset.

Seed-prompt for ESCO Infer Module

The ESCO Infer Module seed-prompt is constructed to address the following key challenges:

  1. Identification of Relevant Skills:

    • The primary problem is to accurately identify all job skills mentioned in a job vacancy snippet. By clearly defining the input and output fields, the module is guided to focus on extracting skills from the provided text.

  2. Structured Data Processing:

    • The use of dspy.InputField and dspy.OutputField ensures that the data is processed in a structured manner. This structuring helps in maintaining consistency and accuracy in identifying and formatting the job skills.

  3. Efficiency and Accuracy:

    • By providing a clear task description and structured input/output fields, the seed-prompt ensures that the Infer module operates efficiently. The module can quickly process the input text and accurately identify the relevant skills without additional manual intervention.

  4. Adaptability:

    • The modular approach facilitated by the DSPy framework allows for easy adaptation of the Infer module to different datasets or slightly varied tasks. By changing the input and output field definitions, the module can be tailored to new requirements.

This is accomplished using the DSPy framework, which provides a structured and clear approach to defining the behavior of the Infer module.

Here's how the code block works and what it aims to achieve:

1 class EscoInferSignature(dspy.Signature):
2     """Given a snippet from a job vacancy, identify all the ESCO job skills mentioned. Always return skills."""
3
4     text = dspy.InputField(prefix="Vacancy:")
5     output = dspy.OutputField(
6         prefix="Skills:",
7         desc="list of comma-separated ESCO skills"
8     )

Class Definition

1 class EscoInferSignature(dspy.Signature):
  • Defines a new class EscoInferSignature that inherits from dspy.Signature. This class specifies the signature for the Infer module on the ESCO job vacancy dataset.

Docstring

2     """Given a snippet from a job vacancy, identify all the ESCO job skills mentioned. Always return skills."""
  • Provides a task description. This docstring explains that the task is to identify and return all the ESCO job skills mentioned in a job vacancy snippet.

Input Field

4     text = dspy.InputField(prefix="Vacancy:")
  • Defines an input field with the prefix "Vacancy:". This indicates that the input to the module will be a snippet from a job vacancy.

Output Field

5     output = dspy.OutputField(
6         prefix="Skills:",
7         desc="list of comma-separated ESCO skills"
8     )
  • Defines an output field with the prefix "Skills:". The description specifies that the output should be a list of comma-separated ESCO skills.

Seed-prompt for ESCO Rank Module

The ESCO Rank Module seed-prompt is constructed to address the following key challenges:

Selection of Relevant Job Skills

The primary goal is to accurately select the most relevant job skills from a given list of options within a job vacancy snippet.

By clearly defining the input and output fields, the module is guided to focus on picking the top 10 applicable skills from the provided options, ensuring relevance and precision.

Structured Data Processing

The use of dspy.InputField and dspy.OutputField ensures that the data is processed in a structured manner. This structuring helps maintain consistency and accuracy in identifying and formatting the job skills, facilitating a reliable extraction and ranking process.

Efficiency and Accuracy

By providing a clear task description and structured input/output fields, the seed-prompt ensures that the Rank module operates efficiently. The module can swiftly process the input text and accurately identify the most relevant job skills from the options without requiring extensive manual intervention.

Adaptability

The modular approach facilitated by the DSPy framework allows for easy adaptation of the Rank module to different datasets or slightly varied tasks. By changing the input and output field definitions, the module can be tailored to new requirements, demonstrating flexibility and scalability.

Code Block

1 class EscoRankSignature(dspy.Signature):
2     """Given a snippet from a job vacancy, pick the 10 most applicable skills from the options that are directly expressed in the snippet."""
3
4     text = dspy.InputField(prefix="Vacancy:")
5     options = dspy.InputField(
6         prefix="Options:",
7         desc="List of comma-separated options to choose from"
8     )
9     output = dspy.OutputField(
10         prefix="Skills:",
11         desc="list of comma-separated ESCO skills"
12     )

Class Definition

1 class EscoRankSignature(dspy.Signature):
  • Defines a new class EscoRankSignature that inherits from dspy.Signature. This class specifies the signature for the Rank module on the ESCO job vacancy dataset.

Docstring

2     """Given a snippet from a job vacancy, pick the 10 most applicable skills from the options that are directly expressed in the snippet."""
  • Provides a task description. This docstring explains that the task is to pick the 10 most applicable ESCO job skills from a given list of options.

Input Fields

4     text = dspy.InputField(prefix="Vacancy:")
  • Defines an input field with the prefix "Vacancy:". This indicates that the input to the module will be a snippet from a job vacancy.

5     options = dspy.InputField(
6         prefix="Options:",
7         desc="List of comma-separated options to choose from"
8     )
  • Defines another input field with the prefix "Options:". The description specifies that this field will contain a list of comma-separated options to choose from.

Output Field

9     output = dspy.OutputField(
10         prefix="Skills:",
11         desc="list of comma-separated ESCO skills"
12     )
  • Defines an output field with the prefix "Skills:". The description specifies that the output should be a list of comma-separated ESCO skills.

The seed-prompt for the ESCO Rank Module is designed to assist in selecting the most relevant job skills from a job vacancy snippet.

Using the DSPy framework, this seed-prompt defines the structure for both input and output fields, ensuring that the Rank module can accurately identify and rank the most applicable job skills. This structured approach allows for efficient processing and ranking of skills within the context of the ESCO dataset.

How the Infer and Rank Modules Work Together

The ESCO Infer and Rank Modules are designed to work in tandem to efficiently and accurately identify and rank job skills from job vacancy snippets.

Here's how they interact and the workflow they follow:

ESCO Infer Module

The ESCO Infer Module is the first step in the process. Its primary function is to identify all the job skills mentioned in a given job vacancy snippet.

ESCO Rank Module

Once the ESCO Infer Module has identified all potential job skills, the ESCO Rank Module steps in to rank these skills based on relevance. Here's how it works:

Workflow Interaction

  1. Initial Extraction:

    • The ESCO Infer Module processes the job vacancy snippet to extract all mentioned job skills. This initial step ensures that all potentially relevant skills are identified.

  2. Ranking and Selection:

    • The extracted skills are then passed as options to the ESCO Rank Module. The Rank Module evaluates these options against the same job vacancy snippet to select and rank the top 10 most applicable skills.

  3. Final Output:

    • The ranked skills are then output by the Rank Module, providing a concise, prioritized list of the most relevant job skills for the given job vacancy.

The ESCO Infer and Rank Modules work together to efficiently and accurately identify and rank job skills from job vacancy snippets. The Infer Module extracts all mentioned skills, while the Rank Module selects and prioritizes the top 10 most relevant skills.

This collaborative workflow ensures a structured, consistent, and accurate identification and ranking process, leveraging the DSPy framework's modular and declarative approach for optimal performance within the ESCO dataset.

Explanation of the Metrics

Rank-Precision (RP)

Rank-Precision (RP) is a metric used to evaluate the quality of a ranked list of labels produced by the model. It measures how accurately the top-ranked labels match the true (gold) labels.

Rank-Precision at K (RP@K)

Rank-Precision at K (RP@K) is a specific version of rank-precision that evaluates the precision of the ranking up to the top K positions. Here's a breakdown of the metric:

  1. K: This is the rank position up to which we measure the precision. For example, RP@5 would evaluate the precision of the top 5 ranked labels.

  2. Rn: The total number of gold (true) labels for the n-th input. This varies for each input.

  3. Rel(n,k): A relevance function that returns 1 if the k-th output label for input n is relevant (i.e., it matches one of the gold labels), and 0 otherwise.
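
Putting these pieces together, RP@K for a dataset of N inputs is typically computed as the average over inputs of (number of relevant labels in the top K) divided by min(K, Rn). The following is a minimal sketch of that computation; the function and variable names are ours, not the paper's:

def rank_precision_at_k(predictions, gold_labels, k):
    # predictions: one ranked list of predicted labels per input
    # gold_labels: one set of true (gold) labels per input
    scores = []
    for ranked, gold in zip(predictions, gold_labels):
        if not gold:
            continue  # skip inputs with no gold labels
        hits = sum(1 for label in ranked[:k] if label in gold)
        scores.append(hits / min(k, len(gold)))
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: gold labels {"a", "b", "c"} and ranked prediction
# ["a", "x", "b", "y", "z"] give RP@5 = 2 / min(5, 3) ≈ 0.667.
print(rank_precision_at_k([["a", "x", "b", "y", "z"]], [{"a", "b", "c"}], 5))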

Why the Authors Use These Metrics

  1. Relevance and Precision: RP@K directly measures the relevance of the top K predictions, which is crucial for multi-label classification tasks where the goal is to accurately rank the most relevant labels at the top.

  2. Adaptability to Varying Number of Labels: By considering both precision and recall depending on the relationship between K and Rn, RP@K provides a balanced evaluation metric that can adapt to varying numbers of gold labels across different inputs.

  3. Comprehensive Evaluation: RP@K allows for a detailed and nuanced assessment of the model's performance, taking into account the ranked nature of the outputs and the varying importance of correctly ranking relevant labels.

In summary, the authors use RP@K to comprehensively evaluate the effectiveness of their Infer-Retrieve-Rank program in producing relevant and accurately ranked labels across different multi-label classification tasks.

Dataset

The evaluation of the method and baselines was conducted on four extreme classification datasets, one in the biomedical field and three in the human resources field.

BioDEX Dataset

BioDEX (Biomedical Drug Event Extraction):

  • Source: The dataset is composed of biomedical papers that describe various adverse drug events and include expert-created labels for the specific types of medical reactions discussed.

  • Ontology Used: The labels are encoded using the MedDRA ontology (Medical Dictionary for Regulatory Activities), which is a standardized set containing approximately 24,300 medical reactions.

  • Data Characteristics:

    • Input Length: Inputs can be very long, with half of the inputs exceeding roughly 20,000 characters.

    • Domain Knowledge: Requires biomedical domain knowledge to accurately infer the correct reactions, as only adverse reactions need to be reported, not all medical reactions.

  • Real-world Relevance: BioDEX models a crucial step in real-world drug safety pipelines.

  • Dataset Splits:

    • Training examples: 10

    • Validation examples: 50

    • Test examples: 250

  • Label Distribution:

    • Median number of labels per input: 3

    • 95th percentile number of labels per input: 14

ESCO Dataset

ESCO (European Skills, Competences, Qualifications and Occupations):

  • Source: The dataset comes from the ESCO ontology, which is managed by the European Commission Directorate-General for Employment, Social Affairs, and Inclusion. The ontology contains approximately 13,900 distinct concepts used to encode skills, competences, qualifications, and occupations.

  • Data Characteristics:

    • The datasets consist of snippets (typically one sentence) of online job vacancies in English with their relevant ESCO labels.

  • Sub-datasets:

    • HOUSE: Contains 262 test examples.

    • TECH: Contains 338 test examples.

    • TECHWOLF: Contains 326 test examples.

  • Dataset Splits:

    • HOUSE and TECH:

      • Training examples: 10 each from the validation sets

      • Validation examples: Remaining 51 (HOUSE) and 65 (TECH)

    • TECHWOLF: No specific validation or training split, uses the HOUSE training and validation split.

  • Label Distribution:

    • Median number of labels per input across these datasets: 1

    • 95th percentile number of labels per input: 4

The evaluation uses datasets that cover both the biomedical domain and the human resources domain.

The BioDEX dataset focuses on identifying adverse drug reactions from lengthy biomedical articles, while the ESCO dataset focuses on identifying job skills from short job vacancy snippets.

The structured data processing and use of well-defined ontologies ensure consistency, accuracy, and real-world relevance, making these datasets suitable for evaluating extreme multi-label classification methods.

Comprehensive Analysis of the Results

The results show the test performance of various models and tasks on the BioDEX and ESCO datasets, measured using the rank-precision (RP) metric at 5 and 10 (RP@5 and RP@10).

The analysis includes baseline methods, the proposed Infer–Retrieve–Rank method, and several fine-tuned systems from the literature.

Baseline Methods

Prior:

  • Description: This baseline uses the prior probability distribution of labels based on all training data to rank labels.

  • Performance: Generally low across all datasets, indicating the need for more sophisticated methods to achieve better accuracy.

Exact-Match:

  • Description: This method matches label names exactly within the input document.

  • Performance: Shows moderate improvement over the prior baseline, particularly effective in ESCO tasks, with RP@5 and RP@10 scores around 4-6.

Naive-Retrieve:

  • Description: Employs pre-trained retrievers (BioLORD for BioDEX and all-mpnet-base-v2 for ESCO tasks) to embed input documents and retrieve relevant labels.

  • Performance:

    • ESCO Tasks: Significantly outperforms the prior and exact-match baselines, with RP@5 and RP@10 scores ranging from 26 to 50.

    • BioDEX Task: Shows weaker performance due to the complexity and length of biomedical documents, with RP@5 and RP@10 scores around 11.

Proposed Method: Infer–Retrieve–Rank

Configuration:

  • Infer Module: Uses Llama-2-7b-chat as the student LM and GPT-3.5-turbo as the teacher LM.

  • Rank Module: Uses GPT-4 as both the student and teacher.

  • Retrieval Module: Employs BioLORD for BioDEX and all-mpnet-base-v2 for ESCO tasks.

Performance:

  • ESCO Datasets (HOUSE, TECH, TECHWOLF):

    • Achieves state-of-the-art performance, with RP@5 and RP@10 scores between 56 and 71.

    • Significantly outperforms both baselines and fine-tuned systems, demonstrating the effectiveness of the combined modules.

  • BioDEX Dataset:

    • Competitive performance with RP@5 and RP@10 scores of 24.73 and 27.67, respectively.

    • While not surpassing the best fine-tuned systems, the method shows substantial gains over baselines and indicates potential for further improvement with optimization.

Summary of Results

The results demonstrate the efficacy of the Infer–Retrieve–Rank approach in addressing extreme multi-label classification tasks, particularly for the ESCO datasets.

The method provides state-of-the-art performance with significantly less data and no finetuning, making it a cost-effective and scalable solution.

For BioDEX, while not the best, the approach shows promise and potential for further improvement with optimization. The clear advantages in efficiency, adaptability, and modular design underscore the robustness and versatility of the proposed method.

Conclusion

This study introduced Infer–Retrieve–Rank, a program for extreme multi-label classification tasks.

By combining a frozen retriever with two in-context learning modules, the approach demonstrates state-of-the-art performance across multiple benchmarks, including ESCO and BioDEX datasets.

This methodology highlights the potential of modular and optimized programs in overcoming the complexities and limitations often associated with prompt and pipeline engineering.

The findings underscore that robust, general-purpose solutions can be achieved without the need for extensive finetuning or large amounts of data. The success of Infer–Retrieve–Rank not only sets a new standard for multi-label classification but also paves the way for future advancements in the field. This approach exemplifies how a well-structured and modular design, facilitated by frameworks like DSPy, can deliver high efficiency, adaptability, and scalability.

The promising results of Infer–Retrieve–Rank suggest a shift towards more resilient and efficient methods in prompt and pipeline engineering. As the landscape of machine learning and natural language processing continues to evolve, such modular programs offer a glimpse into a future where complex tasks can be managed with simplicity and precision.

Role of Large Language Models (LLMs) in the Infer–Retrieve–Rank Process

Large Language Models (LLMs) play a central role at every stage of the Infer–Retrieve–Rank process, enabling efficient and accurate multi-label classification.

Here's a detailed explanation of their roles and interactions:

Key Components and Roles

  1. Infer Module:

    • Student LM (Llama-2-7b-chat): This model processes the input text and generates initial label predictions. It acts as the first layer of understanding, leveraging its trained knowledge to infer potential labels from the provided input.

    • Teacher LM (GPT-3.5-turbo): This model helps to optimise the student LM by providing guidance and feedback during the training phase. The teacher model's responses are used to improve the accuracy and relevance of the student LM's predictions.

  2. Retrieve Module:

    • Frozen Retriever (BioLORD for BioDEX and all-mpnet-base-v2 for ESCO tasks): This component uses embeddings to map the initial label predictions from the Infer module to the actual label space. The retriever helps refine the set of potential labels by comparing them against a pre-trained database of embeddings.

    • Role: The retriever is essential for narrowing down the wide range of possible labels to a more relevant subset based on the context provided by the Infer module (a minimal sketch of this retrieval step follows this list).

  3. Rank Module:

    • Student LM (GPT-4): This model takes the initial predictions and retrieved labels and re-ranks them to prioritise the most relevant ones. It further refines the label set by considering both the original input text and the retrieved labels.

    • Teacher LM (GPT-4): Similar to the Infer module's teacher, this model helps optimise the ranking process by providing high-quality feedback and corrections during training.
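
The frozen retriever described under the Retrieve module is conceptually simple: encode every label in the ontology once, encode each query produced by the Infer module, and return the nearest labels by embedding similarity. Below is a minimal sketch using the sentence-transformers library; the tiny ontology, the model choice, and the top_k value are illustrative rather than the paper's exact setup (the paper uses BioLORD for BioDEX and all-mpnet-base-v2 for ESCO).

from sentence_transformers import SentenceTransformer, util

# Illustrative three-label ontology; the real label spaces contain
# roughly 24,300 (MedDRA) or 13,900 (ESCO) concepts.
ontology = ["software development", "data analysis", "project management"]
encoder = SentenceTransformer("all-mpnet-base-v2")
label_embeddings = encoder.encode(ontology, convert_to_tensor=True)

def retrieve(queries, top_k=2):
    # Map each free-text query from the Infer module onto ontology labels.
    query_embeddings = encoder.encode(queries, convert_to_tensor=True)
    hits = util.semantic_search(query_embeddings, label_embeddings, top_k=top_k)
    results = []
    for per_query in hits:
        for hit in per_query:
            label = ontology[hit["corpus_id"]]
            if label not in results:
                results.append(label)  # keep first-seen order, drop duplicates
    return results

print(retrieve(["writing Python code", "coordinating project timelines"]))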

How LLMs Work Together

  1. Inference:

    • The Infer module uses the student LM (Llama-2-7b-chat) to analyse the input text and predict possible labels.

    • The predicted labels are then parsed and structured for further processing.

  2. Retrieval:

    • The Retrieve module uses the frozen retriever to find the most relevant labels from a pre-trained embedding space based on the predictions from the Infer module.

    • This step ensures that the labels are not only relevant but also contextually accurate.

  3. Reranking:

    • The Rank module uses the student LM (GPT-4) to re-evaluate and reorder the retrieved labels, ensuring that the most relevant labels are prioritised.

    • The rank module considers both the input text and the initial predictions to fine-tune the final label set.

Optimization and Training

  • During the training phase, the teacher LMs (GPT-3.5-turbo for Infer and GPT-4 for Rank) guide the student LMs by providing examples and corrections.

  • The optimization process involves multiple calls to the teacher models to refine the predictions and rankings, reducing errors and improving overall performance.
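
A rough sketch of how this teacher/student bootstrapping might be expressed with an off-the-shelf DSPy optimizer is shown below. The optimizer class, metric, LM wrappers, and the program and train_examples placeholders are illustrative: the paper describes its own optimization procedure, and the exact DSPy API differs between versions.

import dspy
from dspy.teleprompt import BootstrapFewShot

# Illustrative student/teacher LMs; the LM wrapper classes vary by DSPy version.
student_lm = dspy.OpenAI(model="gpt-3.5-turbo")  # stand-in for the student LM
teacher_lm = dspy.OpenAI(model="gpt-4")          # stand-in for the teacher LM
dspy.settings.configure(lm=student_lm)

def at_least_one_hit(example, prediction, trace=None):
    # Simplified acceptance metric for bootstrapping: keep a demonstration
    # if the program recovers at least one gold label in its top-10 ranking.
    return any(label in prediction.labels[:10] for label in example.labels)

# 'program' would be an InferRetrieveRank instance and 'train_examples' a small
# list (about 50 items in the paper's setting) of dspy.Example objects with
# text inputs and gold labels.
optimizer = BootstrapFewShot(metric=at_least_one_hit, max_bootstrapped_demos=2,
                             teacher_settings=dict(lm=teacher_lm))
optimized_program = optimizer.compile(program, trainset=train_examples)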

Summary

In the Infer–Retrieve–Rank process:

  • LLMs are integral to each step, from initial inference to retrieval and reranking.

  • Student LMs handle the core tasks of prediction and ranking, while teacher LMs provide optimisation and guidance.

  • The combination of these models allows for a highly modular, adaptable, and efficient approach to extreme multi-label classification, capable of achieving state-of-the-art performance with minimal data and no finetuning.

Reference: In-Context Learning for Extreme Multi-Label Classification (arXiv.org)
The authors propose Infer-Retrieve-Rank, an efficient in-context learning program designed for multi-label classification tasks with over 10,000 classes. The program consists of three steps: first, an in-context learning module predicts queries that route to a frozen retriever. Second, a second in-context module re-ranks the retrieved documents. Third, a zero-shot Teacher LM bootstraps demonstrations from a minimal prompt to optimize the few-shot Student LM. Using approximately 50 labeled inputs, the optimization can achieve state-of-the-art results with around 20 Teacher and 1,500 Student calls. This optimization logic is implemented using the DSPy programming model.
The test results compare baselines, programs, and fine-tuned systems on the HOUSE, TECH, TECHWOLF, and BioDEX extreme multi-label classification tasks using rank-precision (RP) at 5 and 10. The Infer–Retrieve–Rank method, which employs a Llama-2-7b-chat model for inference, a frozen BioLORD or all-mpnet-base-v2 retriever, and a GPT-4 model for ranking, achieves state-of-the-art results without finetuning and with significantly less data. Each program requires several language model (LM) calls for bootstrapping, far fewer than the training set sizes used by the fine-tuned systems. The best results are highlighted in bold within a 0.5 interval, and second-best results are underlined. Fine-tuned system results are sourced from previous studies.