Better Call Saul - SaulLM-7B - a legal large language model
This March 2024 paper introduces SaulLM-7B, a large language model (LLM) designed for the legal domain.
The authors argue that while LLMs have made significant advancements in various fields, the legal domain has yet to fully benefit from this technology.
Legal professionals are faced with an increasing volume of complex documents, and there is a growing need for a dedicated LLM to help navigate and interpret legal material.
The authors introduce SaulLM-7B, a 7-billion-parameter language model trained on a large and diverse legal dataset. They also release SaulLM-7B-Instruct, an instruction-tuned variant that outperforms existing models on various legal tasks.
The authors introduce LegalBench-Instruct, an iteration of LegalBench designed to better assess the legal proficiency of language models. They also include legal tasks from the MMLU benchmark in their evaluation protocol.
The authors choose Mistral 7B, a high-performing open-source model, as the backbone for SaulLM-7B.
They curate a high-quality legal dataset containing 30 billion tokens and perform continued pretraining to enhance the model's performance on legal tasks.
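As a rough illustration of what continued pretraining looks like in practice, the sketch below continues causal-language-model training of a Mistral-7B backbone on legal text using the Hugging Face stack; the corpus path, sequence length, and hyperparameters are placeholder assumptions rather than the authors' actual configuration.

```python
# Illustrative sketch: continued pretraining of a Mistral-7B backbone on legal text.
# Corpus path, sequence length, and hyperparameters are assumptions for demonstration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical legal corpus stored as plain-text files.
legal_corpus = load_dataset("text", data_files={"train": "legal_corpus/*.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = legal_corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="saul-7b-base-sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # Causal LM objective: labels are the input ids, shifted inside the model.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```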
To support user requests and conversational interaction, the authors fine-tune SaulLM-7B using both generic and legal instructions.
The generic instructions help improve the model's understanding and following of commands, while the legal instructions cover tasks such as legal question answering and summarisation.
The authors note that many common LLMs include an additional step of aligning the model with human preferences. However, their early experiments did not show any meaningful improvement in performance, so they opted not to pursue this avenue for the present paper.
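The paper does not reproduce its fine-tuning code, but the following sketch shows one plausible way to mix generic and legal instruction/response pairs and render them with a chat template for supervised fine-tuning; the example records and the use of the Mistral-Instruct template are assumptions for illustration.

```python
# Sketch: combine generic and legal instruction data and format it for SFT.
# The example records and the chat template choice are illustrative assumptions.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

generic_instructions = [
    {"instruction": "Summarise the following paragraph in one sentence.",
     "response": "The paragraph describes the company's quarterly results."},
]
legal_instructions = [
    {"instruction": "Does this clause create an indemnification obligation?",
     "response": "Yes. The clause requires one party to compensate the other for specified losses."},
]

def to_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    # apply_chat_template renders the turns into the model's expected prompt format.
    return tokenizer.apply_chat_template(messages, tokenize=False)

mixed = generic_instructions + legal_instructions
random.shuffle(mixed)
sft_texts = [to_chat(ex) for ex in mixed]
```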
In the "Data" section of the paper, the authors describe their data collection and cleaning processes for both the legal pretraining corpora and the instruction fine-tuning datasets.
The authors collected legal texts from various English-speaking jurisdictions, including the U.S., Europe, and Australia, to capture the diversity of legal systems.
They combined previously available datasets, such as subsets from The Pile and MultiLegal Pile, with data scraped from publicly available sources on the Web.
The sources included FreeLaw, EDGAR, English EuroParl, GovInfo, Law Stack Exchange, the Open Australian Legal Corpus, EU Legislation, UK Legislation, Court Transcripts, and the USPTO.
These sources contained noise and duplicated documents, which were filtered and deduplicated, resulting in a 30 billion token dataset.
To reduce the risk of catastrophic forgetting during continued pretraining, the authors incorporated "general" data from Wikipedia, StackExchange, and GitHub, comprising roughly 2% of the final training mix.
Additionally, they included conversational data from the Super Natural Instruction and FLAN collection during pretraining, inspired by recent advances in neural machine translation.
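A hedged sketch of how such a roughly 98/2 legal-to-general mix could be expressed with the `datasets` library follows; the source dataset identifiers are placeholders, not the authors' exact sources.

```python
# Sketch: blend a legal corpus with ~2% general-domain data to mitigate
# catastrophic forgetting during continued pretraining.
# Dataset identifiers here are placeholders, not the authors' exact sources.
from datasets import interleave_datasets, load_dataset

legal = load_dataset("text", data_files="legal_corpus/*.txt", split="train", streaming=True)
general = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
general = general.remove_columns(["id", "url", "title"])  # keep only the "text" column

# Sample ~98% of examples from the legal corpus and ~2% from general data.
mixed = interleave_datasets([legal, general], probabilities=[0.98, 0.02], seed=42)
```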
The authors employed various data cleaning techniques to address issues in the collected data, such as text normalisation, rule-based filtering, and perplexity filtering.
They also removed duplicates and near-duplicates using a deduplication tool, resulting in a high-quality 30B token dataset.
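The exact cleaning stack is not reproduced here, but a simplified sketch of the kind of normalisation, rule-based filtering, and exact deduplication described above might look like the following; the length and alphabetic-ratio thresholds are illustrative assumptions, and the perplexity-filtering step is omitted.

```python
# Simplified sketch of text normalisation, rule-based filtering, and exact
# deduplication for a legal pretraining corpus. Thresholds are illustrative.
import hashlib
import re

def normalise(text: str) -> str:
    # Strip control characters and collapse whitespace.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def passes_rules(text: str) -> bool:
    # Rule-based filters: drop very short documents and documents that are
    # mostly non-alphabetic (often OCR noise or boilerplate tables).
    if len(text) < 200:
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio > 0.6

def deduplicate(docs):
    seen = set()
    for doc in docs:
        doc = normalise(doc)
        if not passes_rules(doc):
            continue
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc

# Usage: cleaned = list(deduplicate(raw_documents))
```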
The authors emphasise the importance of instruction fine-tuning for optimal performance across different tasks. They used a mix of general and legal instructions to train the model, with a focus on legal expertise.
For general instructions, they gather data from several primary sources, including:
SlimOrca: A curated subset of the OpenOrca dataset comprising generic instructions for a wide range of tasks.
Meta Math Question Answering Instructions: A dataset designed for mathematical inquiry, facilitating research in math-based natural language processing.
General Conversations from UltraChat: A GPT-derived dataset capturing diverse conversational contexts to enhance natural language understanding and generation.
They employed three main benchmarks:
Perplexity Measurement
For this benchmark, they ensure the evaluation documents are up-to-date and published after the training-data collection cut-off dates of the evaluated LLMs, to avoid data leakage.
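Perplexity here is the exponentiated average negative log-likelihood of a document under the model. A minimal sketch of computing it with a causal language model is shown below; the model name and the sample clause are placeholders.

```python
# Sketch: document-level perplexity of a causal LM, i.e. exp(mean NLL per token).
# Model name and sample text are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        # With labels set, the model returns the mean cross-entropy over tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("This agreement shall be governed by the laws of the State of New York."))
```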
LegalBench-Instruct
During their investigations, the authors found limitations in the original prompts of LegalBench.
The complex nature of the prompts, combined with the challenges faced by open-source LLMs in adhering to instructions and handling formatting, led to a substantial drop in performance.
To address this issue, they refined the prompts by removing distracting few-shot examples and concluding with a specific instruction for the model to generate tags. This refinement aimed to provide a more accurate assessment of the model's performance on legal tasks.
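As an illustration only, and not the benchmark's exact wording, a refined zero-shot prompt in this spirit might look like the following:

```python
# Illustrative zero-shot prompt in the spirit of LegalBench-Instruct.
# The task and wording are invented for demonstration, not the benchmark's exact prompts.
clause = "The Supplier shall not be liable for indirect or consequential damages."
prompt = (
    "Determine whether the following contract clause limits the supplier's liability.\n\n"
    f"Clause: {clause}\n\n"
    "Answer with exactly one word, Yes or No.\n"
    "Answer:"
)
```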
The authors also used the legal section of the MMLU benchmark to gain additional insights into the model's legal knowledge. They focused specifically on three legal domains: international law, professional law, and jurisprudence.
The authors used balanced accuracy as the primary metric for evaluating the model's performance on both LegalBench-Instruct and the legal tasks of MMLU. Balanced accuracy is chosen to better handle imbalanced classification tasks present in both benchmarks.
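Balanced accuracy is the mean of per-class recall, so a model that simply predicts the majority class no longer looks artificially strong. A small sketch with scikit-learn, using made-up labels, shows the difference:

```python
# Sketch: balanced accuracy averages per-class recall, penalising models that
# simply predict the majority class. Labels below are made up for illustration.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"]
y_pred = ["Yes"] * 10  # a degenerate model that always answers "Yes"

print(accuracy_score(y_true, y_pred))           # 0.8, looks deceptively strong
print(balanced_accuracy_score(y_true, y_pred))  # 0.5, reveals the class imbalance
```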
By using diverse benchmarks and refining the evaluation process, they aimed to provide a comprehensive understanding of the model's strengths and limitations in the legal domain. The use of up-to-date datasets and the focus on specific legal domains further enhance the validity and relevance of their findings.
The authors compare the SaulLM-7B family to other 7B and 13B open-source models, including instruction-tuned and DPO-finetuned variants of Mistral-7B, zephyr-7b-beta, and the Llama2 family (Llama2-7b-Chat and Llama2-13b-Chat).
The codebase is built using open-source frameworks like PyTorch, DeepSpeed, and Flash Attention.
The models are available on the Hugging Face Hub. Continued pretraining uses 256 AMD MI250 GPUs, instruction fine-tuning is distributed across 16 MI250 GPUs, and evaluation is conducted on a single MI250 GPU.
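For readers who want to try the released checkpoints, a minimal inference sketch follows; the repository id below is an assumption based on the public release and should be verified on the Hub before use.

```python
# Minimal inference sketch. The repository id is assumed from the public release;
# check the Hugging Face Hub for the exact name before relying on it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Equall/Saul-7B-Instruct-v1"  # assumed repository id; verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain the concept of force majeure in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```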
SaulLM-7B-Instruct consistently outperforms non-legal instruction-tuned models on the three legal tasks of MMLU, confirming its strong performance in the legal domain.
SaulLM-7B consistently outperforms Mistral-7B across all legal document categories, exhibiting lower average perplexity scores with reduced variance.
Llama2-7B demonstrates lower perplexity specifically in legislation documents, suggesting a potentially higher proportion of legislative text in its training corpora.
Overall, the experimental setting and results demonstrate the effectiveness of SaulLM-7B and SaulLM-7B-Instruct in the legal domain, establishing them as strong foundations for building models tailored to legal workflows.
The authors provide a comprehensive analysis of their models' performance compared to other state-of-the-art open-source models, highlighting the benefits of legal-specific pretraining and instruction fine-tuning.
In conclusion, SaulLM-7B and SaulLM-7B-Instruct represent an advancement in the application of large language models to the legal domain.
By leveraging extensive pretraining on legal corpora and incorporating legal-specific instruction fine-tuning, these models demonstrate superior performance on legal benchmarks such as LegalBench-Instruct and Legal-MMLU compared to generic open-source models.
SaulLM-7B and SaulLM-7B-Instruct serve as strong foundations for building models tailored to legal workflows, paving the way for further innovation and adoption of AI in the legal field.
The authors also evaluate the model's adaptability to different types of legal documents by measuring perplexity on benchmark datasets drawn from four distinct legal domains: contracts, judicial decisions, opinion text, and legislation.