# Llama 2 - Analysis

The release of Llama 2 is a landmark moment: a powerful open-source large language model, available for both research and commercial use at no cost.

Llama 2 is the next iteration of Llama 1, a model released in February 2023. Available for free, Llama 2 comes with model weights and starting code for both the pretrained and conversational fine-tuned versions.

{% embed url="<https://arxiv.org/abs/2307.09288>" %}
Llama 2 paper
{% endembed %}

Llama 2 is a collection of <mark style="color:yellow;">pretrained and fine-tuned large language models (LLMs)</mark>, ranging from <mark style="color:yellow;">7 billion to 70 billion parameters</mark>. The fine-tuned models, optimised for dialogue use cases, are known as Llama 2-Chat.

### <mark style="color:purple;">Pretraining Process</mark>

Pretraining for the Llama 2 models incorporated several enhancements over their predecessor, Llama 1.

<mark style="color:green;">**Pretraining Data and Sources**</mark>

For the Llama 2 models, a new mix of publicly available online data sources was curated, explicitly excluding data from Meta’s products or services and removing content from sites known for containing personal information.

The models were trained on 2 trillion tokens, a figure chosen for its balance of performance and cost-effectiveness.

The most factual sources were up-sampled in the data mix, with the aim of improving model knowledge and reducing inaccurate predictions, or 'hallucinations'. Comprehensive pretraining data investigations were conducted to understand the potential capabilities and limitations of the models.

### <mark style="color:purple;">**Model Architecture and Training Details**</mark>

The core architecture of Llama 2 models is based on the standard transformer architecture, as defined by Vaswani et al. (2017).

Key features of this architecture include:

* <mark style="color:green;">**Pre-Normalization**</mark><mark style="color:green;">:</mark> Utilizing RMSNorm (Zhang and Sennrich, 2019) for stabilising the training process.
* <mark style="color:green;">**SwiGLU Activation Function**</mark><mark style="color:green;">:</mark> Adopted from Shazeer (2020), enhancing the model's capability to capture complex patterns.
* <mark style="color:green;">**Rotary Positional Embeddings (RoPE)**</mark><mark style="color:green;">:</mark> As proposed by Su et al. (2022), which helps the model understand the order of input tokens better.
* <mark style="color:green;">**Grouped-Query Attention (GQA)**</mark><mark style="color:green;">:</mark> This is a significant architectural difference from Llama 1, improving inference scalability for larger models. With GQA, groups of query heads share a single key and value head, which shrinks the key-value cache that must be held in memory during inference. This is especially beneficial for models with a high number of parameters.
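To make two of the features above concrete, here is a minimal pure-Python sketch of RMSNorm and grouped-query attention. It is an illustration under simplifying assumptions (plain lists instead of tensors, causal masking, no learned projections), not the Llama 2 implementation; the function names `rms_norm` and `grouped_query_attention` are hypothetical.

```python
import math

def rms_norm(x, gain, eps=1e-5):
    """RMSNorm: rescale a vector by its root-mean-square (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_query_attention(q_heads, k_heads, v_heads):
    """Causal attention where several query heads share one key/value head.

    q_heads: [n_q_heads][seq_len][head_dim]
    k_heads, v_heads: [n_kv_heads][seq_len][head_dim]
    """
    n_q, n_kv = len(q_heads), len(k_heads)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group_size = n_q // n_kv
    head_dim = len(q_heads[0][0])
    scale = 1.0 / math.sqrt(head_dim)

    outputs = []
    for h, q in enumerate(q_heads):
        # Query head h reads the K/V head of its group.
        k = k_heads[h // group_size]
        v = v_heads[h // group_size]
        head_out = []
        for t, q_t in enumerate(q):
            # Causal mask: position t only attends to positions 0..t.
            scores = [
                scale * sum(a * b for a, b in zip(q_t, k[s]))
                for s in range(t + 1)
            ]
            weights = softmax(scores)
            head_out.append([
                sum(weights[s] * v[s][j] for s in range(t + 1))
                for j in range(head_dim)
            ])
        outputs.append(head_out)
    return outputs
```

With `n_kv_heads == n_q_heads` this reduces to ordinary multi-head attention, and with `n_kv_heads == 1` to multi-query attention; GQA sits between the two, trading a smaller key-value cache against per-head flexibility.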

Additionally, Llama 2 models have *<mark style="color:yellow;">**doubled the context length**</mark>* compared to Llama 1, from 2,048 to 4,096 tokens, allowing them to consider longer segments of text during training and so capture more contextual information.

### <mark style="color:purple;">**Hyperparameters and Training Optimisation**</mark>

The training used the AdamW optimizer (Loshchilov and Hutter, 2017) with β1 = 0.9, β2 = 0.95, and eps = 10^−5. The learning rate followed a cosine schedule with a 2,000-step warmup and decayed to 10% of its peak value. Weight decay was set to 0.1 and gradient clipping to 1.0.
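The schedule can be sketched in a few lines. The values below are the ones reported in the Llama 2 paper (2,000 warmup steps, decay to 10% of the peak, peak learning rate of 3 × 10^−4 for the smaller models); the linear shape of the warmup, the function name `cosine_lr`, and the `ADAMW_SETTINGS` dictionary are assumptions made for illustration.

```python
import math

# Optimizer settings as reported in the paper, collected in a hypothetical dict.
ADAMW_SETTINGS = {
    "beta1": 0.9,
    "beta2": 0.95,
    "eps": 1e-5,
    "weight_decay": 0.1,
    "grad_clip_norm": 1.0,
}

def cosine_lr(step, total_steps, peak_lr=3e-4, warmup_steps=2000, final_ratio=0.1):
    """Learning rate at a given step: warmup, then cosine decay to a floor."""
    if step < warmup_steps:
        # Assumption: the warmup ramps linearly up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    min_lr = peak_lr * final_ratio
    # Cosine goes from 1 to -1, so the rate goes from peak_lr to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Calling `cosine_lr(step, total_steps)` once per optimizer step gives a rate that rises during warmup, peaks at step 2,000, and settles at one tenth of the peak by the final step.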

<mark style="color:green;">**Tokenizer**</mark><mark style="color:green;">:</mark> Llama 2 uses the same tokenizer as Llama 1: a byte-pair encoding (BPE) algorithm (Sennrich et al., 2016) as implemented in SentencePiece (Kudo and Richardson, 2018). The vocabulary contains 32,000 tokens. All numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes, which helps the model process diverse linguistic inputs.
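The two special behaviours mentioned above, digit splitting and byte fallback, can be illustrated with a toy tokenizer. This sketch uses greedy longest-match against a tiny hypothetical vocabulary rather than real BPE merges, and it ignores SentencePiece's whitespace handling; only the `<0xNN>` byte-piece format mirrors what SentencePiece emits.

```python
ASCII_DIGITS = "0123456789"

def toy_tokenize(text, vocab):
    """Toy tokenizer: digits stay separate, unknown characters become bytes."""
    pieces = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in ASCII_DIGITS:
            # Numbers are never merged: each digit is its own piece.
            pieces.append(ch)
            i += 1
            continue
        # Greedy longest match against the vocabulary (a stand-in for BPE).
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is not None:
            pieces.append(match)
            i += len(match)
        else:
            # Byte fallback: emit one piece per UTF-8 byte of the character.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
            i += 1
    return pieces
```

For example, with the hypothetical vocabulary `{"year", " "}`, the input `"year 2023 é"` yields the word piece, one piece per digit, and two byte pieces for the accented character.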

<mark style="color:green;">**Comparative Analysis and Training Efficiency**</mark><mark style="color:green;">:</mark> Compared with Llama 1, the Llama 2 models were trained on more tokens, have a longer context length, and introduce GQA for the larger (34B and 70B) models. The training loss for Llama 2 models showed no sign of saturation even after 2 trillion tokens, suggesting the potential for further improvement with more training.

### <mark style="color:purple;">Fine-Tuning Phase</mark>

The fine-tuning process of Llama 2-Chat involved a combination of <mark style="color:yellow;">supervised fine-tuning (SFT)</mark> and <mark style="color:yellow;">Reinforcement Learning with Human Feedback (RLHF)</mark>, supplemented by a novel technique known as <mark style="color:yellow;">Ghost Attention (GAtt)</mark>.

This process, which demands significant computational and annotation resources, is aimed at aligning the model for specific use cases, *<mark style="color:green;">**particularly dialogue interactions.**</mark>*

### <mark style="color:purple;">**Supervised Fine-Tuning (SFT)**</mark>

* <mark style="color:green;">**Initial Data**</mark><mark style="color:green;">:</mark> The process began with publicly available instruction tuning data, which serves as a foundation for further fine-tuning.
* <mark style="color:green;">**Quality of Data**</mark><mark style="color:green;">:</mark> Recognising the limitations of third-party SFT data in terms of diversity and quality, especially for dialogue-style instructions, the team focused on collecting a smaller set of high-quality SFT examples. Annotations on the order of tens of thousands proved sufficient, *<mark style="color:yellow;">**with a total of 27,540 annotations collected.**</mark>*
* <mark style="color:green;">**Fine-Tuning Details**</mark><mark style="color:green;">:</mark> For the actual fine-tuning, the team used a cosine learning rate schedule, an initial learning rate of 2 × 10^−5, weight decay of 0.1, batch size of 64, and a sequence length of 4096 tokens. Each training sample concatenated a prompt and its response, separated by a special token. The training used an autoregressive objective with the loss zeroed out on user prompt tokens, so only answer tokens contributed to the gradient. The model was fine-tuned for 2 epochs.
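The loss masking described in the last bullet can be sketched as follows. This is a minimal illustration, assuming per-token log-probabilities are already available; the helper names `build_sft_example` and `masked_nll` are hypothetical, and masking the separator token itself is an assumption rather than something the source specifies.

```python
def build_sft_example(prompt_ids, answer_ids, sep_id):
    """Concatenate prompt, separator, and answer; mark which tokens carry loss."""
    token_ids = list(prompt_ids) + [sep_id] + list(answer_ids)
    # Assumption: prompt and separator tokens are masked, answer tokens are not.
    loss_mask = [False] * (len(prompt_ids) + 1) + [True] * len(answer_ids)
    return token_ids, loss_mask

def masked_nll(token_logprobs, loss_mask):
    """Average negative log-likelihood over the unmasked (answer) tokens only."""
    kept = [-lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    if not kept:
        raise ValueError("no answer tokens to train on")
    return sum(kept) / len(kept)
```

Because masked positions are dropped before averaging, the model is never rewarded or penalised for how well it predicts the user's own prompt.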

### <mark style="color:purple;">**Reinforcement Learning with Human Feedback (RLHF)**</mark>

#### <mark style="color:green;">**Preference-Based Annotation**</mark>

* After supervised fine-tuning, the team shifted focus to RLHF, using preference-based annotation. Outputs sampled from the SFT model were often competitive with the human-written SFT data, which suggested that annotation effort could be reprioritized toward preference annotation for RLHF.
* **Human Preference Data Collection**: Annotators wrote a prompt, then chose between two sampled model responses and indicated how strongly they preferred their choice. This data was used to train two separate reward models, one optimized for helpfulness and the other for safety.
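The Llama 2 paper trains its reward models with a binary ranking loss that pushes the score of the chosen response above the score of the rejected one, optionally by a margin that grows with the strength of the annotator's preference. A minimal sketch of that loss, with the function name `ranking_loss` and any specific margin value being assumptions for illustration:

```python
import math

def ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    """Binary ranking loss: -log(sigmoid(r_chosen - r_rejected - margin))."""
    x = reward_chosen - reward_rejected - margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), written to avoid overflow
    # when x is a large negative number.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss is small when the chosen response already scores well above the rejected one, and a larger margin demands a wider gap before the loss falls.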

#### <mark style="color:green;">**Ghost Attention (GAtt)**</mark>

* **Dialogue Flow Control**: GAtt, introduced during fine-tuning, helps the model keep respecting an initial instruction (for example, a persona or a format constraint) across multiple turns, which improves the coherence and consistency of its responses in extended dialogues.
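As described in the Llama 2 paper, GAtt builds its training data in two steps: responses are sampled with the instruction attached to every user message, and the resulting dialogue is then stored with the instruction kept only in the first turn and the loss zeroed on all earlier turns. The sketch below is a simplified reading of that procedure; `sample_fn` is a hypothetical stand-in for sampling from the current chat model, and the tuple layout is an assumption.

```python
def gatt_training_sample(instruction, user_msgs, sample_fn):
    """Build one GAtt-style training dialogue.

    Returns a list of (role, text, has_loss) tuples.
    """
    # Step 1: sample replies while the instruction is attached to every user turn.
    history = []
    replies = []
    for msg in user_msgs:
        history.append(("user", f"{instruction} {msg}"))
        reply = sample_fn(history)
        history.append(("assistant", reply))
        replies.append(reply)

    # Step 2: keep the instruction in the first turn only, and train solely
    # on the final assistant reply (loss is zero on all previous turns).
    turns = []
    last = len(user_msgs) - 1
    for i, (msg, reply) in enumerate(zip(user_msgs, replies)):
        user_text = f"{instruction} {msg}" if i == 0 else msg
        turns.append(("user", user_text, False))
        turns.append(("assistant", reply, i == last))
    return turns
```

The model therefore learns to produce an instruction-following final reply even though the instruction appears only once, at the start of the conversation.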

### <mark style="color:purple;">Performance</mark>

In human evaluations of helpfulness, Llama 2-Chat models compare favourably with rival models.

They outperform open-source models on both single-turn and multi-turn prompts, and show competitive results against closed-source models such as ChatGPT.

For instance, Llama 2-Chat 34B has an overall win rate of over 75% against the similarly sized Vicuna-33B and Falcon 40B models, and the 70B model outperformed the PaLM-bison chat model.

### <mark style="color:purple;">Safety</mark>

The development of Llama 2 included steps to ensure safety and responsibility, starting with the pretraining phase.

These steps helped the team understand the content of the pretraining data and its implications, which is crucial for identifying potential biases and downstream issues.

<mark style="color:green;">**Steps for Responsible Pretraining**</mark>

Meta's standard privacy and legal review processes were rigorously followed for each dataset used in training. Notably, *<mark style="color:yellow;">**no Meta user data were included in the training process**</mark>*. The team also excluded data from sites with high volumes of personal information to protect individual privacy.

<mark style="color:green;">**Data Toxicity Measurement**</mark>

Toxicity in the English-language portion of the pretraining data was assessed using a HateBERT classifier fine-tuned on the ToxiGen dataset. This evaluation showed that a small percentage of the pretraining data (about 0.2% of the documents evaluated) contained toxic elements.

However, the team chose not to scrub the data aggressively, so that Llama 2 remains applicable to a broader range of tasks, including hate speech detection.

<mark style="color:green;">**Safety Benchmarks Evaluation**</mark>

Llama 2 was tested against several automatic safety benchmarks to assess its truthfulness (TruthfulQA), toxicity (ToxiGen), and bias (BOLD).

These benchmarks provided insights into the model's ability to produce reliable, non-toxic content and its propensity to reproduce social biases.

Results varied across metrics: the pretrained 13B and 70B models showed some increase in toxicity, which may result from the larger pretraining data or a different dataset mix.
