Llama 2 - Analysis

Meta introduced Llama 2 in July 2023

The release of Llama 2 was a landmark moment: a powerful open-source large language model, available at no cost for both research and commercial use.

Llama 2 is the successor to Llama 1, a model released in February 2023. It is available for free and comes with model weights and starter code for both the pretrained and the conversational fine-tuned versions.

Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs), ranging from 7 billion to 70 billion parameters. The fine-tuned models, optimised specifically for dialogue use cases, are known as Llama 2-Chat.

Pretraining Process

The pretraining process for the Llama 2 models incorporated several enhancements over its predecessor, Llama 1.

Pretraining Data and Sources

For the Llama 2 models, a new mix of publicly available online data sources was curated, explicitly excluding data from Meta’s products or services and removing content from sites known for containing personal information.

The models were trained on 2 trillion tokens, a budget chosen as a good trade-off between performance and cost.

The data mix was also designed to improve model knowledge and reduce inaccurate predictions, or 'hallucinations', by up-sampling the most factual sources. Comprehensive investigations of the pretraining data were conducted to understand the potential capabilities and limitations of the models.

Model Architecture and Training Details

The core architecture of Llama 2 models is based on the standard transformer architecture, as defined by Vaswani et al. (2017).

Key features of this architecture include:

Pre-Normalization: Utilizing RMSNorm (Zhang and Sennrich, 2019) for stabilising the training process.
SwiGLU Activation Function: Adopted from Shazeer (2020), enhancing the model's capability to capture complex patterns.
Rotary Positional Embeddings (RoPE): As proposed by Su et al. (2022), which helps the model understand the order of input tokens better.
Grouped-Query Attention (GQA): A significant architectural difference from Llama 1 that improves inference scalability for the larger models. In GQA, groups of query heads share a single set of key and value heads, which shrinks the key-value cache that must be held in memory during generation. This is especially beneficial for models with a high number of parameters.
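
To make the grouping concrete, the following is a minimal sketch of how query heads can map onto shared key/value heads, and what that does to the size of the key-value cache. The layer count, head counts, head dimension, and 16-bit cache precision are assumptions modelled on a 70B-style configuration and are not stated in this article; the helper names are hypothetical.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head shared by a given query head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the cached keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed 70B-style configuration (illustrative, not taken from the text above).
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM, SEQ_LEN = 80, 64, 8, 128, 4096

mapping = [kv_head_for_query_head(h, N_Q_HEADS, N_KV_HEADS) for h in range(N_Q_HEADS)]
mha_bytes = kv_cache_bytes(N_LAYERS, N_Q_HEADS, HEAD_DIM, SEQ_LEN)   # one KV head per query head
gqa_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)  # 8 shared KV heads

print(mapping[:10])  # first eight query heads share KV head 0
print(f"MHA cache: {mha_bytes / 2**30:.2f} GiB, GQA cache: {gqa_bytes / 2**30:.2f} GiB")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads (here a factor of 8), which is where the inference-time saving comes from.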

Additionally, Llama 2 doubles the context length of Llama 1, from 2,048 to 4,096 tokens, allowing the models to attend to longer spans of text and capture more contextual information.
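
Rotary positional embeddings, listed among the key features above, are what let attention scores depend on the relative offset between tokens rather than their absolute positions. The sketch below is a minimal pure-Python illustration, assuming the adjacent-pair rotation and a base of 10000 from the original RoPE formulation; the function names and the sample vectors are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    if len(vec) % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # The pair starting at dimension i rotates at frequency base^(-i/d).
        angle = position * base ** (-i / d)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same offset (query 2 positions after key) at two different absolute positions.
score_near_start = dot(rope(q, 5), rope(k, 3))
score_far_along = dot(rope(q, 3005), rope(k, 3003))
print(abs(score_near_start - score_far_along) < 1e-9)
```

The two scores agree because rotating both vectors by position leaves only the difference of the two angles in their dot product.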

Hyperparameters and Training Optimisation

The models were trained with the AdamW optimizer (Loshchilov and Hutter, 2017), with specific settings for the beta values, epsilon, learning rate schedule (a cosine schedule with warmup and decay), weight decay, and gradient clipping. These hyperparameters were chosen to keep training stable and achieve the desired model performance.
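
As an illustration of the schedule described above, here is a minimal sketch of a cosine learning rate with warmup and decay. The linear shape of the warmup, the peak rate, the warmup length, the total step count, and the final ratio are all assumptions chosen for illustration and are not given in this article.

```python
import math


def cosine_lr(step, peak_lr, warmup_steps, total_steps, final_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    min_lr = final_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed values, for illustration only.
PEAK_LR, WARMUP, TOTAL = 3e-4, 2000, 500_000

for step in (0, 1000, 2000, 251_000, 500_000):
    print(step, f"{cosine_lr(step, PEAK_LR, WARMUP, TOTAL):.2e}")
```

The rate climbs from zero to its peak during warmup, then follows half a cosine wave down to a floor instead of decaying all the way to zero.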

Tokenizer: Llama 2 uses the same tokenizer as Llama 1, a byte-pair encoding (BPE) algorithm (Sennrich et al., 2016) as implemented in SentencePiece (Kudo and Richardson, 2018). Numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes, which helps the model process diverse linguistic inputs. The total vocabulary size is 32,000 tokens.
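
The two behaviours mentioned above, digit splitting and byte fallback, can be illustrated with a toy function. This is not the BPE merge procedure and does not call SentencePiece; it is a sketch that assumes a hypothetical set of known characters and writes fallback bytes in the `<0xNN>` style.

```python
def toy_pieces(text, known_chars):
    """Toy splitter: one piece per digit, byte pieces for unknown characters.

    Runs of known, non-digit characters are kept together as a single piece,
    which is a simplification of what a trained BPE vocabulary would do.
    """
    pieces = []
    word = ""
    for ch in text:
        is_digit = ch in "0123456789"
        if is_digit or ch not in known_chars:
            if word:
                pieces.append(word)
                word = ""
            if is_digit:
                pieces.append(ch)  # each digit becomes its own piece
            else:
                # Unknown character: one piece per UTF-8 byte.
                pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
        else:
            word += ch
    if word:
        pieces.append(word)
    return pieces


known = set("abcdefghijklmnopqrstuvwxyz ")  # hypothetical known characters
print(toy_pieces("year 2023 ☃", known))
```

The snowman character is outside the assumed character set, so it is emitted as its three UTF-8 bytes rather than as an unknown-token placeholder.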

Comparative Analysis and Training Efficiency: Compared with Llama 1, the Llama 2 models are trained on more tokens, use a longer context length, and introduce GQA for the larger models. The training loss for Llama 2 shows no sign of saturation even after 2 trillion tokens, suggesting that further training could still improve the models.

Fine-Tuning Phase

The fine-tuning process of Llama 2-Chat involved a combination of supervised fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF), supplemented by a novel technique known as Ghost Attention (GAtt).

This process, which demands significant computational and annotation resources, is aimed at aligning the model for specific use cases, particularly dialogue interactions.

Supervised Fine-Tuning (SFT)

Initial Data: The process began with publicly available instruction tuning data, which serves as a foundation for further fine-tuning.
Quality of Data: Recognising the limitations of third-party SFT data in terms of diversity and quality, especially for dialogue-style instructions, the team focused on collecting a smaller set of high-quality SFT examples. Annotations in the order of tens of thousands proved sufficient, and collection stopped at a total of 27,540 annotations.
Fine-Tuning Details: For the actual fine-tuning, the team used a cosine learning rate schedule, an initial learning rate of 2 × 10^−5, weight decay of 0.1, batch size of 64, and a sequence length of 4096 tokens. Prompts and responses from the training set were concatenated, separated by a special token. The training used an autoregressive objective with the loss zeroed out on user prompt tokens, so that only answer tokens contributed to the gradient. The model was fine-tuned for 2 epochs.
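
The prompt masking described above can be sketched as follows. The `-100` sentinel mirrors a common PyTorch convention and is an assumption, as are the helper names, the token ids, and the decision to mask the separator token as well; `token_log_probs[i]` stands for the log-probability the model assigned to the token at position `i` given the preceding tokens.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel for "no loss at this position"


def build_example(prompt_ids, answer_ids, sep_id):
    """Concatenate prompt and answer; mask prompt and separator in the labels."""
    input_ids = prompt_ids + [sep_id] + answer_ids
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + answer_ids
    return input_ids, labels


def masked_nll(token_log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not masked."""
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical token ids: a 3-token prompt, a separator, and a 2-token answer.
input_ids, labels = build_example([11, 12, 13], [21, 22], sep_id=2)

# Made-up per-token probabilities; only the last two (the answer) should count.
log_probs = [math.log(0.1)] * 4 + [math.log(0.5), math.log(0.25)]
loss = masked_nll(log_probs, labels)
print(labels, round(loss, 4))
```

Because the four masked positions are skipped, the poor probabilities on the prompt have no effect on the loss, and so none on the gradient.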

Reinforcement Learning with Human Feedback (RLHF)

Preference-Based Annotation

After supervised fine-tuning, the team shifted focus to RLHF, using preference-based annotation. Rather than writing responses themselves, human annotators compared model-generated samples, and these comparisons were used to train a reward model. Outputs sampled from the SFT model were found to be competitive with SFT data written by human annotators, which suggested that annotation effort could be reprioritised toward RLHF.
Human Preference Data Collection: RLHF involved collecting human preference data, where annotators chose between two model responses to a prompt and indicated which response they preferred and how strongly. This data was used to train two separate reward models, one optimised for helpfulness and the other for safety.
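
A reward model trained on pairwise choices needs a loss that pushes the score of the preferred response above the score of the rejected one. The sketch below uses a binary ranking loss of the form -log(sigmoid(r_chosen - r_rejected - margin)), a common objective for this kind of data. The function name and the optional `margin` parameter are assumptions for illustration, and the reward values are made up.

```python
import math


def ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    """-log(sigmoid(reward_chosen - reward_rejected - margin)), computed stably."""
    diff = reward_chosen - reward_rejected - margin
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Made-up scalar rewards for a preferred and a rejected response.
print(round(ranking_loss(2.0, 0.0), 4))   # preferred response scored higher: small loss
print(round(ranking_loss(0.0, 0.0), 4))   # tie: loss is log(2)
print(round(ranking_loss(0.0, 2.0), 4))   # ranking is wrong: large loss
```

A positive margin demands a larger gap between the two scores before the loss becomes small, which is one way to make use of annotations that record how strong a preference was.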

Ghost Attention (GAtt)

Dialogue Flow Control: The GAtt technique, introduced in the fine-tuning process, was found to be effective in controlling dialogue flow over multiple turns, enhancing the coherence and consistency of the model’s responses in extended dialogues.

Performance

Comparative evaluations show that Llama 2-Chat models perform strongly against rival chat models.

They outperform open-source models on both single-turn and multi-turn prompts, and are competitive with closed-source models such as ChatGPT.

The 34B version of Llama 2-Chat, for instance, achieved a win rate of over 75% against similarly sized models, and the 70B model outperformed the PaLM-bison chat model.

Safety

The development of Llama 2 involved a process to ensure safety and responsibility, particularly during its pretraining phase.

This process was pivotal in understanding the content and implications of the pretraining data, crucial for identifying potential biases and downstream issues that might arise.

Steps for Responsible Pretraining

Meta's standard privacy and legal review processes were rigorously followed for each dataset used in training. Notably, no Meta user data were included in the training process. The team also excluded data from sites with high volumes of personal information to protect individual privacy.

Data Toxicity Measurement

Toxicity in the pretraining data was assessed using a HateBERT classifier fine-tuned on the ToxiGen dataset. This evaluation showed that a small percentage of the pretraining data contained toxic elements.
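
A measurement of this kind reduces to scoring each document with the classifier and counting how many exceed a cut-off. The sketch below assumes the classifier returns a toxicity likelihood between 0 and 1 and uses a threshold of 0.5. The scores are made up for illustration and are not results from the Llama 2 data, and the function name is hypothetical.

```python
def toxic_fraction(scores, threshold=0.5):
    """Share of documents whose classifier score is at or above the threshold."""
    if not scores:
        raise ValueError("no scores given")
    return sum(s >= threshold for s in scores) / len(scores)


# Made-up classifier outputs for ten documents (illustrative only).
scores = [0.01, 0.03, 0.72, 0.10, 0.02, 0.00, 0.55, 0.04, 0.01, 0.02]
print(f"{100 * toxic_fraction(scores):.1f}% of documents flagged")
```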

However, the team chose not to scrub the data aggressively, so that Llama 2 remains applicable to a broader range of tasks, including hate speech detection.

Safety Benchmarks Evaluation

Llama 2 was tested against several automatic safety benchmarks to assess its truthfulness, toxicity, and bias.

These benchmarks provided insights into the model's ability to produce reliable, non-toxic content and its propensity to reproduce social biases.

The evaluations showed that Llama 2 performed variably across different metrics, with some increase in toxicity for larger models, likely due to the larger pretraining data or different dataset mixes.