# Better Call Saul - SaulLM-7B - a legal large language model

This <mark style="color:blue;">**March 2024**</mark> paper introduces SaulLM-7B, a large language model (LLM) designed for the legal domain.&#x20;

The authors argue that while LLMs have made significant advancements in various fields, the legal domain has yet to fully benefit from this technology.&#x20;

Legal professionals are faced with an increasing volume of complex documents, and there is a growing need for a dedicated LLM to help navigate and interpret legal material.

{% embed url="https://arxiv.org/abs/2403.03883" %}
SaulLM-7B
{% endembed %}

### <mark style="color:purple;">Main Contributions</mark>

#### <mark style="color:green;">**A family of legal LLMs**</mark>

The authors introduce SaulLM-7B, a 7-billion-parameter language model trained on a large and diverse legal dataset. They also release SaulLM-7B-Instruct, an instruction-tuned variant that outperforms existing models on various legal tasks.

#### <mark style="color:green;">An improved evaluation protocol for legal LLMs</mark>

The authors introduce LegalBench-Instruct, an iteration of LegalBench designed to better assess the legal proficiency of language models. They also include legal tasks from the MMLU benchmark in their evaluation protocol.

### <mark style="color:purple;">The methodology for creating SaulLM-7B involves a two-step process</mark>

#### <mark style="color:green;">Enhancing Mistral's Legal Capabilities</mark>

The authors choose Mistral 7B, a high-performing open-source model, as the backbone for SaulLM-7B.&#x20;

They curate a *<mark style="color:yellow;">**high-quality legal dataset containing 30 billion tokens**</mark>* and perform continued pretraining to enhance the model's performance on legal tasks.

#### <mark style="color:green;">Improving Legal Instruction Following</mark>

To support user requests and conversational interaction, the authors fine-tune SaulLM-7B using both generic and legal instructions.&#x20;

The generic instructions help improve the model's understanding and following of commands, while the legal instructions cover tasks such as legal question answering and summarisation.
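The mixing of generic and legal instructions can be pictured with a small sketch. The chat template and the example instructions below are hypothetical; the paper does not publish its exact formatting:

```python
# Hypothetical sketch: combining generic and legal instructions into one
# fine-tuning set. The chat markers below are illustrative, not the
# authors' actual template.
def format_example(instruction: str, response: str) -> str:
    return f"<|user|>\n{instruction}\n<|assistant|>\n{response}"

generic = [("Explain what a hash map is.", "A hash map stores key-value pairs...")]
legal = [("Summarise this clause: ...", "The clause limits liability to...")]

# Mix both sources so the model keeps general instruction-following
# ability while gaining legal expertise.
train_set = [format_example(i, r) for i, r in generic + legal]
```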

The authors note that many LLM training pipelines include an additional step of aligning the model with human preferences. However, their early experiments did not show any meaningful improvement in performance, so they opted not to pursue this avenue in the present paper.

### <mark style="color:purple;">Data</mark>

In the "Data" section of the paper, the authors describe their data collection and cleaning processes for both the legal pretraining corpora and the instruction fine-tuning datasets.

#### <mark style="color:green;">Legal Pretraining Corpora</mark>

The authors collected legal texts from various English-speaking jurisdictions, including the U.S., Europe, and Australia, to *<mark style="color:yellow;">**capture the diversity of legal systems**</mark>*.&#x20;

They combined previously available datasets, such as subsets from The Pile and MultiLegal Pile, with data scraped from publicly available sources on the Web.&#x20;

The sources included FreeLaw, EDGAR, English EuroParl, GovInfo, Law Stack Exchange, Open Australian Legal Corpus, EU Legislation, UK Legislation, Court Transcripts, and USPTO.&#x20;

These sources contained noise and duplicated documents, which were filtered and deduplicated, resulting in a <mark style="color:blue;">**30 billion token dataset**</mark>.

To reduce the risk of catastrophic forgetting during continued pretraining, the authors incorporated "general" data from Wikipedia, StackExchange, and GitHub, comprising roughly 2% of the final training mix.&#x20;
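The ~2% general-data mix can be sketched as a simple sampling step. The helper below is hypothetical: the paper reports the proportion but not the exact sampling procedure:

```python
import random

def build_training_mix(legal_docs, general_docs, general_fraction=0.02, seed=0):
    """Sample a pretraining mix where roughly `general_fraction` of the
    documents come from general-domain sources (Wikipedia, StackExchange,
    GitHub) and the rest from the legal corpus.

    Hypothetical helper: illustrates the reported ~2% proportion only.
    """
    rng = random.Random(seed)
    # Number of general docs so that general / (legal + general) ~= fraction
    n_general = round(len(legal_docs) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_docs))
    mix = list(legal_docs) + rng.sample(list(general_docs), n_general)
    rng.shuffle(mix)
    return mix

# Toy usage: 980 "legal" documents plus a pool of general ones.
legal = [f"legal-{i}" for i in range(980)]
general = [f"general-{i}" for i in range(500)]
mix = build_training_mix(legal, general)  # 1000 docs, 20 of them general
```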

Additionally, they included conversational data from the Super-Natural Instructions and FLAN collections during pretraining, inspired by recent advances in neural machine translation.

The authors employed various data cleaning techniques to address issues in the collected data, such as text normalisation, rule-based filtering, and perplexity filtering.&#x20;

They also removed duplicates and near-duplicates using a deduplication tool, resulting in a high-quality 30B token dataset.
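A minimal version of the deduplication step might look like the following. The paper uses a dedicated tool that also catches near-duplicates; this sketch only handles exact duplicates after light normalisation:

```python
import hashlib
import re

def normalise(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies
    # hash to the same value.
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(docs):
    """Drop exact duplicates after light normalisation (sketch only;
    near-duplicate detection would need shingling/MinHash or similar)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha1(normalise(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The court held X.", "the  court held x.", "Statute Y applies."]
clean = deduplicate(docs)  # 2 documents remain
```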

#### <mark style="color:green;">Instruction Fine-tuning Mixes</mark>

The authors emphasise the *<mark style="color:yellow;">**importance of instruction fine-tuning for optimal performance across different tasks.**</mark>* They used a mix of general and legal instructions to train the model, with a focus on legal expertise.

For general instructions, they gathered data from several primary sources, including:

<mark style="color:blue;">**SlimOrca:**</mark> A subset of the [FLAN collection](#user-content-fn-1)[^1] comprising generic instructions for various tasks.

<mark style="color:blue;">**Meta Math Question Answering Instructions:**</mark> A dataset designed for mathematical inquiry, facilitating research in math-based natural language processing.

<mark style="color:blue;">**General Conversations from UltraChat:**</mark> A GPT-derived dataset capturing diverse conversational contexts to enhance natural language understanding and generation.

### <mark style="color:purple;">Evaluation</mark>

They employed three main benchmarks:

#### <mark style="color:green;">**Perplexity Measurement**</mark>

The authors evaluate the adaptability of the model to various legal documents by measuring <mark style="color:blue;">**perplexity**</mark> on benchmark datasets from four distinct legal domains: <mark style="color:yellow;">contracts</mark>, <mark style="color:yellow;">judicial decisions</mark>, <mark style="color:yellow;">opinion text</mark>, and <mark style="color:yellow;">legislation</mark>.&#x20;

They ensure the datasets are up to date and sourced after the training-data cut-off dates of the models being evaluated, to avoid data leakage.
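Perplexity is the exponential of the average negative log-likelihood the model assigns to each token, so lower is better. A minimal computation from per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.
    `token_logprobs` are natural-log probabilities the model assigned to
    each token of a document."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# exp(-mean(log 0.25)) = exp(log 4) = 4.
ppl = perplexity([math.log(0.25)] * 10)
```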

#### <mark style="color:green;">**LegalBench-Instruct**</mark>

During their investigations, the authors found limitations in the original prompts of LegalBench.&#x20;

The complex nature of the prompts, combined with the challenges faced by open-source LLMs in adhering to instructions and handling formatting, led to a substantial drop in performance.&#x20;

To address this issue, they refined the prompts by removing distracting few-shot examples and ending each prompt with a specific instruction telling the model what to generate. This refinement aimed to provide a more accurate assessment of the model's performance on legal tasks.
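The refinement can be illustrated with a small prompt builder. The wording below is hypothetical, not the authors' actual LegalBench-Instruct template:

```python
# Illustrative sketch of the prompt refinement: no few-shot distractors,
# and one explicit closing instruction about the expected answer format.
def refine_prompt(task_description: str, passage: str) -> str:
    return (
        f"{task_description}\n\n"
        f"{passage}\n\n"
        "Answer with only one word: Yes or No."
    )

prompt = refine_prompt(
    "Does the clause below contain an audit-rights provision?",
    "The receiving party shall permit inspection of relevant records...",
)
```

Ending on a single unambiguous instruction makes the expected output easy for smaller open-source models to follow and easy to score automatically.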

#### <mark style="color:green;">Massive Multitask Language Understanding (MMLU)</mark>

The authors also used the legal section of the MMLU benchmark to gain additional insights into the model's legal knowledge. They focused specifically on three legal domains: international law, professional law, and jurisprudence.

The authors used balanced accuracy as the primary metric on both LegalBench-Instruct and the legal tasks of MMLU. It was chosen because it handles the imbalanced classification tasks present in both benchmarks better than plain accuracy.
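Balanced accuracy is the mean of per-class recalls, so a majority-class predictor cannot game it the way it can game plain accuracy. A self-contained implementation:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; equals plain accuracy only when the
    classes are balanced."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Imbalanced toy task: 8 "No" labels vs 2 "Yes" labels.
y_true = ["No"] * 8 + ["Yes"] * 2
always_no = ["No"] * 10
# Plain accuracy would be 0.8 here, but balanced accuracy is 0.5:
# recall("No") = 1.0, recall("Yes") = 0.0, mean = 0.5.
score = balanced_accuracy(y_true, always_no)
```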

By using diverse benchmarks and refining the evaluation process, they aimed to provide a comprehensive understanding of the model's strengths and limitations in the legal domain. The use of up-to-date datasets and the focus on specific legal domains further enhance the validity and relevance of their findings.

### <mark style="color:purple;">Experimental Setting</mark>

#### <mark style="color:green;">Baselines</mark>

The authors compare the SaulLM-7B family to other 7B and 13B open-source models, including instruction-tuned and DPO-finetuned variants of Mistral-7B, zephyr-7b-beta, and the Llama2 family (Llama2-7b-Chat and Llama2-13b-Chat).

#### <mark style="color:green;">Implementation Details</mark>

The codebase is built using open-source frameworks like PyTorch, DeepSpeed, and Flash Attention.

The models are available on the Hugging Face Hub. Continued pretraining uses 256 MI250 AMD GPUs, while instruction fine-tuning is distributed across 16 MI250 GPUs. Evaluation is conducted on a single MI250 GPU.

#### <mark style="color:green;">Results on Legal-MMLU</mark>

SaulLM-7B-Instruct consistently outperforms non-legal instruction-tuned models on the three legal tasks of MMLU, confirming its strong performance in the legal domain.

#### <mark style="color:green;">Perplexity Analysis</mark>

SaulLM-7B *<mark style="color:yellow;">**consistently outperforms Mistral-7B across all legal document categories**</mark>*, exhibiting lower average perplexity scores with reduced variance.&#x20;

Llama2-7B demonstrates lower perplexity specifically in legislation documents, suggesting a potentially higher proportion of legislative text in its training corpora.

Overall, the experimental setting and results demonstrate the effectiveness of SaulLM-7B and SaulLM-7B-Instruct in the legal domain, establishing them as strong foundations for building models tailored to legal workflows.&#x20;

The authors provide a comprehensive analysis of their models' performance compared to other state-of-the-art open-source models, highlighting the benefits of legal-specific pretraining and instruction fine-tuning.

### <mark style="color:purple;">Conclusion</mark>

In conclusion, SaulLM-7B and SaulLM-7B-Instruct represent an advancement in the application of large language models to the legal domain.&#x20;

By leveraging extensive pretraining on legal corpora and incorporating legal-specific instruction fine-tuning, these models demonstrate superior performance on legal benchmarks such as LegalBench-Instruct and Legal-MMLU compared to generic open-source models.&#x20;

SaulLM-7B and SaulLM-7B-Instruct serve as strong foundations for building models tailored to legal workflows, paving the way for further innovation and adoption of AI in the legal field.&#x20;

[^1]: FLAN (Fine-tuned LAnguage Net) collection refers to a set of models that have been fine-tuned for a variety of tasks using a technique known as instruction tuning. This approach involves training language models to follow instructions embedded within the input data, enhancing their ability to perform specific tasks as directed by those instructions.&#x20;
