Reinforcement Learning from Human Feedback (RLHF)
Most often useful when creating domain-specific models
The ability of neural language models to generate text that is not only diverse but also contextually relevant and compelling has been remarkable. However, a significant challenge lies in defining what constitutes "good" text.
This is inherently subjective and varies widely depending on the context - be it creative writing, informative text, or executable code snippets.
Traditionally, language models have been trained using simple next-token prediction loss functions, such as cross-entropy. While this approach has its merits, it falls short in capturing the nuanced preferences of human readers.
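As a minimal illustration of that traditional objective, the sketch below computes a standard next-token cross-entropy loss in PyTorch. The shapes and tensors are placeholders rather than outputs of a real model; the point is simply that this loss measures how well the model predicts the reference's next token, not whether a human would prefer the result.

```python
import torch
import torch.nn.functional as F

# Placeholder shapes and tensors standing in for a real causal language model.
batch_size, seq_len, vocab_size = 2, 16, 50257
logits = torch.randn(batch_size, seq_len, vocab_size)         # model outputs
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))   # input token ids

# Shift so that each position predicts the *next* token.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = tokens[:, 1:].reshape(-1)

# Standard next-token cross-entropy: it says nothing about whether a human
# reader would actually prefer the generated text.
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())
```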
A technique called Reinforcement Learning from Human Feedback (RLHF) was developed to make models more useful in human interactions by incorporating human feedback directly into the model training process.
RLHF leverages methods from reinforcement learning to optimise language models based on actual human responses and preferences. This technique enables models to align more closely with complex human values, a feat that was previously unattainable with general corpus training alone.
RLHF begins with a language model that's already been pretrained using classical objectives.
For instance, OpenAI used a smaller version of GPT-3 for InstructGPT, while other organisations like Anthropic and DeepMind have employed models ranging from 10 million to 280 billion parameters.
These models can be further fine-tuned, but the primary requirement is their ability to respond effectively to diverse instructions.
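As a rough sketch, a pretrained base model of this kind could be loaded and sampled with the Hugging Face transformers library. Here "gpt2" is only a small, publicly available stand-in for whichever instruction-capable model an organisation actually starts from.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is an illustrative stand-in for the real base model.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy_model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain RLHF in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = policy_model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```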
Reward Model Training
Central to RLHF is the training of a reward model calibrated to human preferences. The goal is a system that takes a sequence of text and outputs a scalar reward representing how much humans prefer it.
This model can either be a fine-tuned language model or one trained from scratch on preference data.
The training data consists of prompt-generation pairs, with human annotators ranking the competing outputs. Ranking is crucial because it normalises annotators' differing assessment styles into a single, comparable reward signal.
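A common way to train such a reward model on ranked pairs is a pairwise (Bradley-Terry style) ranking loss, which pushes the score of the human-preferred response above the rejected one. The sketch below uses a toy embedding-average "reward model" and random token ids purely for illustration; a real system would fine-tune a full language model with a scalar head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: an embedding average plus a scalar head.
    A real system would fine-tune a full language model instead."""
    def __init__(self, vocab_size=50257, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)   # (batch, hidden)
        return self.head(pooled).squeeze(-1)         # scalar reward per sequence

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder batch: token ids of the human-preferred and rejected responses.
chosen = torch.randint(0, 50257, (4, 32))
rejected = torch.randint(0, 50257, (4, 32))

# Pairwise ranking loss: push the chosen reward above the rejected one.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```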
Historically, training language models with RL was seen as a daunting task. However, recent advances have made it practical to fine-tune a model using Proximal Policy Optimization (PPO), often with some parameters frozen because updating every parameter of an extremely large model is prohibitively expensive.
The RL process involves the model generating text based on a prompt, which is then evaluated by the reward model to assign a scalar 'preferability' score. Additionally, a penalty is applied for deviating too far from the initial pretrained model, ensuring coherence in the generated text.
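In sketch form, combining the preference score with a penalty for drifting from the frozen pretrained (reference) model might look like the following. The tensors and the penalty coefficient are placeholders; a real implementation would compute per-token log-probabilities from the policy and reference models and feed the result into a PPO advantage estimator.

```python
import torch

beta = 0.02                                   # KL penalty coefficient (assumed value)
logprobs_policy = torch.randn(1, 24)          # per-token log-probs under the current policy (placeholder)
logprobs_ref = torch.randn(1, 24)             # per-token log-probs under the frozen reference model (placeholder)
reward_model_score = torch.tensor([1.3])      # scalar 'preferability' score for the response (placeholder)

per_token_kl = logprobs_policy - logprobs_ref   # approximate KL contribution per token
rewards = -beta * per_token_kl                  # penalise deviating from the pretrained model
rewards[:, -1] += reward_model_score            # preference score applied at the final token

print(rewards.sum(dim=-1))   # total reward passed to the PPO update
```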
RLHF is still a burgeoning field with many uncharted territories.
The choice of the base model, the dynamics of reward model training, and the specific implementation of the RL optimiser all present a vast landscape of research opportunities.
Advanced algorithms like ILQL, which align well with offline RL optimisation, are beginning to emerge, offering new pathways to refine the RLHF process further.
Despite its potential, RLHF does face challenges, particularly in generating human preference data, which can be costly and time-consuming. Additionally, human annotators often have varying opinions, adding to the complexity and potential variance in training data. Yet, these limitations only underscore the vast potential for innovation in RLHF.
A critical challenge in RLHF is achieving an optimal balance between the safety and utility of models. This balance is crucial because increasing a model's harmlessness often decreases its helpfulness.
To address this issue, a nuanced approach, termed the 'hostage negotiator' model, is proposed.
This concept extends beyond the basic idea of harmlessness, which might traditionally involve a model refraining from responding to potentially harmful queries.
Instead, the 'hostage negotiator' model would enable the system to understand and articulate why certain requests might be harmful and engage in a dialogue that could lead users to reconsider their queries.
The challenge in training models to adopt this 'hostage negotiation' tactic lies in the nature of data collection for harmlessness. Typically, data collection processes may inadvertently focus on identifying clearly harmful responses without providing exposure to more sophisticated interactions.
As a result, models learn to avoid certain responses but do not necessarily learn how to navigate complex interactions constructively.
To evolve beyond this limitation, training efforts are shifting towards a higher emphasis on helpfulness prompts.
Future training strategies are expected to involve collecting harmlessness data where human annotators can identify the most constructive possible responses from models.
This method aims to enable models not just to avoid harmful interactions but to actively engage in nuanced and thoughtful dialogue.
While these concepts and methods are at an exploratory stage, they represent significant progress in the quest to develop RLHF models that are safe yet effectively useful.
These models are envisioned to adeptly handle the complexities of human interaction with both sophistication and sensitivity.
As this area of research advances, it is anticipated that RLHF systems will become more adept at balancing safety with practical utility, thereby enhancing their applicability and effectiveness in real-world scenarios.
Beyond these design questions, one of the most significant challenges lies in effectively incorporating human feedback into the learning process itself - in essence, a problem of active learning.
This task is far from straightforward, as it demands a deep understanding of various factors, including the intricacies of the acquisition function and the nuances of human psychology.
The Critical Role of Acquisition Function in RLHF
At the core of RLHF training is the acquisition function. This function is pivotal in determining the quality of queries presented to human labellers for feedback.
Unlike traditional active learning, which typically operates under a supervised learning setting, RLHF involves reinforcement learning.
This means the agent not only influences the data distribution but also decides which data should be labelled. As the RL agent's policy changes, queries must be generated actively so that they remain informative.
Let's consider an example. In a customer service chatbot scenario, the acquisition function must weigh factors like the complexity of customer queries, the diversity of responses needed, and the cost of generating these responses quickly and effectively.
Here, uncertainty plays a crucial role, as the chatbot must navigate through ambiguous customer requests and provide helpful and precise responses.
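One simple, illustrative acquisition strategy (not a specific published method) is to score candidate response pairs by how much an ensemble of reward models disagrees about which response is preferred, and send the most contentious pairs to human labellers. Everything below, including the dummy "reward models", is invented for the sketch.

```python
import torch

def acquisition_scores(candidate_pairs, reward_ensemble):
    """Score (prompt, response_a, response_b) pairs by ensemble disagreement.
    High disagreement suggests a human label would be informative."""
    scores = []
    for prompt, resp_a, resp_b in candidate_pairs:
        # Preference margin under each ensemble member.
        margins = torch.stack([rm(prompt, resp_a) - rm(prompt, resp_b)
                               for rm in reward_ensemble])
        scores.append(margins.std())   # spread across the ensemble as uncertainty
    return torch.stack(scores)

# Dummy stand-ins so the sketch runs: each "reward model" just hashes text to a score.
dummy_ensemble = [lambda p, r, s=s: torch.tensor(float(hash((p, r, s)) % 100) / 100.0)
                  for s in range(4)]
pairs = [("How do I reset my password?", "Click 'forgot password'.", "Contact support."),
         ("Where is my order?", "Check the tracking link.", "It will arrive eventually.")]

scores = acquisition_scores(pairs, dummy_ensemble)
print(torch.topk(scores, k=1).indices)   # index of the pair to send to labellers first
```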
Adaptive Choice of Feedback Type in RLHF
Choosing the right feedback type is crucial in RLHF.
This choice can depend on a variety of factors, including the rationality of the human labeller and task-specific requirements, which may evolve over time. For instance, in a medical diagnosis assistant application, the feedback type might need to adapt to the evolving complexity of medical cases and the varying expertise levels of the medical professionals providing feedback.
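Purely as a hypothetical illustration of such adaptivity, a system might route each query to a feedback type based on an estimated labeller expertise score and how ambiguous the case is. The categories and thresholds below are invented, not drawn from any particular system.

```python
def choose_feedback_type(labeller_expertise: float, case_ambiguity: float) -> str:
    """Hypothetical rule of thumb for picking a feedback type per query."""
    if labeller_expertise > 0.8 and case_ambiguity > 0.6:
        return "free-text critique"    # experts can explain subtle failures
    if case_ambiguity > 0.6:
        return "pairwise comparison"   # rankings are robust to noisy judgements
    return "binary rating"             # cheap and adequate for clear-cut cases

print(choose_feedback_type(labeller_expertise=0.9, case_ambiguity=0.7))
print(choose_feedback_type(labeller_expertise=0.4, case_ambiguity=0.2))
```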
Navigating Human Labelling Challenges in RLHF
Human labelling in RLHF intersects with disciplines like psychology and social sciences, as it encompasses designing interactions for informative query responses.
Understanding human psychology is essential for effective preference elicitation.
For instance, consider a language model used for educational content creation. The preference elicitation process must account for cognitive biases of educators and varying emotional responses of different age groups of students.
Ensuring High-Quality Labels and Researcher-Labeller Agreement
A significant challenge in RLHF is the mismatch between the researcher's goals and the labeller's actual labels, known as researcher-labeller disagreement.
To combat this, methods like onboarding, maintaining open communication, and providing labellers with feedback are crucial.
Imagine a scenario in an e-commerce recommendation system where labellers might have differing opinions on what constitutes a 'good' recommendation. Ensuring alignment between the researchers' objectives and labellers' understanding is critical for the system’s accuracy and effectiveness.
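A lightweight, illustrative way to monitor this alignment is to measure each labeller's agreement against a small "gold" set that the researchers have labelled themselves. The data and threshold idea below are invented for the sketch.

```python
# Researcher gold labels vs one labeller's choices on the same items (invented values).
gold = {"q1": "A", "q2": "B", "q3": "A", "q4": "A", "q5": "B"}
labeller = {"q1": "A", "q2": "A", "q3": "A", "q4": "B", "q5": "B"}

matches = sum(gold[q] == labeller[q] for q in gold)
agreement = matches / len(gold)
print(f"Agreement with researcher gold labels: {agreement:.0%}")

# Items where the labeller and researchers disagree are natural candidates
# for feedback sessions or clarified guidelines.
print("Items to review:", [q for q in gold if gold[q] != labeller[q]])
```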
Conclusion: A Multidisciplinary Approach for Optimising RLHF Systems
Active Learning in RLHF is a multifaceted process that requires careful consideration of factors ranging from the technical aspects of acquisition functions to the psychological elements of human interaction.
Addressing the complexities of human feedback and integrating insights from various fields are essential for creating RLHF systems that align closely with human behavior and preferences. This approach not only enhances the efficacy of RLHF models but also paves the way for their broader and more effective application across diverse domains.