# Getting the most out of your tokenizer for pre-training and domain adaptation

Tokenization is an often understudied and neglected component in the development of language models.

In this <mark style="color:blue;">**February 2024**</mark> paper, the authors highlight that most published works use a single tokenizer for all experiments, often borrowed from another model, without performing rigorous analysis or ablations to optimize the tokenization process.

Furthermore, when fine-tuning a pre-trained LLM for a specific task or domain, the tokenizer is generally kept unchanged, leading to sub-optimal performance and efficiency.

The authors argue that the <mark style="color:yellow;">**size of the tokenizer's vocabulary**</mark>, the <mark style="color:yellow;">**pre-tokenization regular expression**</mark>, and the <mark style="color:yellow;">**training data**</mark> used for the tokenizer can significantly impact the model's generation speed, effective context size, memory usage, and downstream performance.

To address this issue, the authors train specialized <mark style="color:blue;">**Byte-Pair Encoding (BPE)**</mark> code tokenizers and conduct extensive ablations to study the impact of tokenizer design on the performance of LLMs for code generation, evaluated on benchmarks such as HumanEval and MBPP.

They provide recommendations for selecting appropriate tokenizer hyper-parameters and suggest switching the tokenizer when fine-tuning a pre-trained LLM.

The experiments are performed on models trained from scratch and on pre-trained models, verifying the applicability of their findings to a wide range of use-cases.

The authors find that when fine-tuning on more than 50 billion tokens, it is possible to specialize the tokenizer of a pre-trained LLM and obtain significant gains in generation speed and effective context size.

{% embed url="https://arxiv.org/abs/2402.01035" %}
Getting the most out of your tokenizer for pre-training and domain adaptation
{% endembed %}

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FS9mDHCOTwtyk8SeZbxNf%2Fimage.png?alt=media&#x26;token=41d8b4a1-b5a8-4c71-b57c-95b6b4604412" alt=""><figcaption><p>Three ways to increase in-domain compression in a BPE tokenizer with their respective trade-offs</p></figcaption></figure>

### <mark style="color:purple;">Aspects of tokenizer design</mark>

### <mark style="color:green;">Compression Trade-offs</mark>

<mark style="color:blue;">**Training Data:**</mark> Using data sampled from the target domain (e.g., code) will increase compression for that domain.

<mark style="color:blue;">**Pre-tokenization Scheme:**</mark> The regular expression used to split the text before applying BPE affects compression. Splitting on whitespaces prevents BPE from merging across words, leading to shorter tokens and worse compression.

<mark style="color:blue;">**Vocabulary Size:**</mark> A larger vocabulary size leads to higher compression but increases computational and memory costs.

### <mark style="color:green;">Compression Metrics</mark>

<mark style="color:blue;">**Normalized Sequence Length (NSL):**</mark> Measures the average tokenized sequence length of a tokenizer compared to a baseline (Llama tokenizer). An NSL of 0.75 means the tokenizer uses 25% fewer tokens on average.&#x20;

<mark style="color:blue;">**Bytes per Token:**</mark> Calculated by dividing the number of UTF-8 bytes by the number of tokens, providing another measure of compression.

### <mark style="color:green;">BPE Algorithm and Implementation</mark>

The authors use the BPE tokenization algorithm as implemented in the HuggingFace tokenizers library, which supports regular-expression-based pre-tokenization and robust handling of special formatting characters such as the whitespace and indentation common in code.
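
Putting the three design knobs together, a tokenizer along these lines can be trained in a few lines of code. The corpus path, vocabulary size, and special token below are illustrative placeholders, not the paper's exact configuration:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# BPE with a byte-level (regex-based) pre-tokenizer and matching decoder.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train on in-domain data (e.g., code); the vocabulary size is a trade-off knob.
trainer = trainers.BpeTrainer(
    vocab_size=64_000,
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["code_corpus.txt"], trainer=trainer)
tokenizer.save("code_bpe.json")

print(tokenizer.encode("def add(a, b):\n    return a + b").tokens)
```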

### <mark style="color:green;">Impact of Training Data</mark>

Unsurprisingly, training the tokenizer on data from the target domain (code, English, multilingual) improves compression for that domain. Training on a mix of all three leads to the best overall compression.

### <mark style="color:green;">Vocabulary Size</mark>

<mark style="color:blue;">Compression vs. Vocabulary Size:</mark> Larger vocabularies improve compression, but gains diminish exponentially as the vocabulary size increases.&#x20;

<mark style="color:blue;">Inference Optimal Vocabulary Size</mark>: The authors calculate the optimal vocabulary size for inference time by considering the trade-off between compression gains and additional computation costs.&#x20;

<mark style="color:blue;">Memory Optimal Vocabulary Size:</mark> The authors derive an equation to find the memory-optimal vocabulary size, considering the model size, sequence length, batch size, and the memory savings from reduced attention cache size due to compression.

### <mark style="color:purple;">Ideas for creating tokenizers in different fields</mark>

<mark style="color:green;">**Biomedical Domain**</mark>

* Develop a tokenizer tailored for biomedical literature, such as research papers, clinical notes, and scientific reports.
* Train the tokenizer on a large corpus of biomedical texts, incorporating domain-specific vocabulary, abbreviations, and naming conventions.
* Use pre-tokenization schemes that preserve important biomedical entities, such as gene names, protein structures, and chemical compounds.
* Integrate domain-specific knowledge bases or ontologies to improve tokenization accuracy and semantic understanding.

<mark style="color:green;">**Legal and Regulatory Domain**</mark>

* Create a tokenizer specifically designed for legal documents, contracts, regulations, and legislative texts.
* Train the tokenizer on a diverse corpus of legal texts, including case laws, statutes, and regulatory guidelines.
* Implement pre-tokenization rules that preserve legal terminology, citations, and references to specific clauses or sections.
* Incorporate legal ontologies and dictionaries to accurately tokenize complex legal phrases and terms.

<mark style="color:green;">**Financial and Accounting Domain**</mark>

* Develop a tokenizer tailored for financial reports, accounting statements, and market data.
* Train the tokenizer on a corpus of financial documents, including annual reports, balance sheets, and market analyses.
* Implement pre-tokenization rules to handle financial notation, abbreviations, and numerical representations.
* Integrate domain-specific knowledge bases or lexicons to accurately tokenize financial terms and concepts.

<mark style="color:green;">**Cybersecurity Domain**</mark>

* Create a tokenizer specifically designed for cybersecurity logs, incident reports, and technical documentation.
* Train the tokenizer on a corpus of cybersecurity-related texts, including system logs, vulnerability reports, and security guidelines.
* Implement pre-tokenization rules to preserve technical terminology, IP addresses, and other relevant cybersecurity entities (see the sketch after this list).
* Integrate domain-specific knowledge bases or ontologies to accurately tokenize cybersecurity-related terms and concepts.
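
As a concrete illustration of the pre-tokenization rule suggested above (the same pattern transfers to the other domains), a custom split can isolate IPv4 addresses as atomic spans before BPE is applied. The regex is a deliberately simple assumption, not a production rule:

```python
from tokenizers import Regex, pre_tokenizers

# Isolate IPv4 addresses so later BPE merges stay within each address and
# never bleed into the surrounding log text.
ip_splitter = pre_tokenizers.Split(
    Regex(r"\d{1,3}(?:\.\d{1,3}){3}"), behavior="isolated"
)

log_line = "Blocked inbound traffic from 192.168.0.12 on port 443"
print([t for t, _ in ip_splitter.pre_tokenize_str(log_line)])
# In a full pipeline, this splitter would be composed with the tokenizer's
# usual pre-tokenizer via pre_tokenizers.Sequence([...]).
```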

<mark style="color:green;">**Social Media and Conversational Data**</mark>

* Develop a tokenizer tailored for social media data, such as tweets, online forums, and chat conversations.
* Train the tokenizer on a diverse corpus of social media data, including slang, abbreviations, and internet-specific language.
* Implement pre-tokenization rules to handle emoticons, hashtags, and other social media-specific constructs.
* Incorporate language models or lexicons specific to social media and conversational data to improve tokenization accuracy.

### <mark style="color:purple;">Key Takeaways</mark>

Based on the experiments described in the paper, here are the key conclusions and best practices regarding tokenizer construction and usage:

#### <mark style="color:green;">Tokenizer Impact on Performance</mark>

* Changing the tokenizer of a pre-trained LLM during fine-tuning can have a negligible impact on downstream performance, provided that the fine-tuning is done on a sufficiently large amount of data (50 billion tokens or more).
* The authors demonstrate that models fine-tuned with alternative tokenizers, such as the GPT-4 tokenizer or their Punct tokenizer, can achieve competitive or even better performance compared to models using the original Llama tokenizer.

#### <mark style="color:green;">Vocabulary Size and Performance</mark>

* The experiments suggest that the vocabulary size of the tokenizer (within the tested range of 32k to 256k) has a minimal impact on the downstream performance of the LLM.
* The authors found no statistically significant correlation between vocabulary size and performance metrics like Pass\@1 and Pass\@100 on code generation tasks.
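
For context, Pass\@k is typically computed with the unbiased estimator popularized by the Codex evaluation: generate n samples per problem, count the c that pass the tests, and average the quantity below over problems. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations (c of them correct), passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```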

#### <mark style="color:green;">Tokenizer Update Methods</mark>

* Using techniques like Fast Vocabulary Transfer (FVT) to initialize the new tokenizer's embeddings from the pre-trained model leads to noticeable performance improvements compared to not using FVT (see the sketch after this list).
* Extending an existing tokenizer (e.g., Llama) by adding domain-specific tokens provides only small gains compared to using a completely different tokenizer like GPT-4.
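
A minimal sketch of the FVT-style initialization, assuming transformers-style tokenizers (exposing `get_vocab` and an `encode` that returns token ids) and a PyTorch embedding matrix; feeding token strings straight back through the old tokenizer is a simplification, since byte-level markers may need decoding first:

```python
import torch

def fvt_embeddings(old_tok, new_tok, old_emb: torch.Tensor) -> torch.Tensor:
    """FVT-style init: decompose each new token with the old tokenizer and
    average the corresponding old embeddings."""
    new_vocab = new_tok.get_vocab()  # maps token string -> id
    new_emb = torch.empty(len(new_vocab), old_emb.size(1))
    for token, idx in new_vocab.items():
        old_ids = old_tok.encode(token, add_special_tokens=False)
        if old_ids:
            new_emb[idx] = old_emb[old_ids].mean(dim=0)
        else:
            new_emb[idx] = old_emb.mean(dim=0)  # fallback for empty encodings
    return new_emb
```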

#### <mark style="color:green;">Tokenizer Choice and Compression</mark>

* While highly compressing tokenizers like Identity (which applies no pre-tokenization splitting) can offer significant compression benefits, they may result in deteriorated downstream performance on code generation tasks.
* Tokenizers like Punct and GPT-4, which strike a balance between compression and preserving syntactic and semantic information, can achieve both better performance and better compression compared to the Llama tokenizer.

#### <mark style="color:green;">Scaling to Larger LLMs</mark>

* The authors demonstrate that their findings regarding tokenizer switching and its negligible impact on performance hold true for larger LLMs like Llama 2 7B when fine-tuned on a sufficient amount of data.

### <mark style="color:purple;">Summary</mark>

In this study, the authors investigated the impact of tokenizer design choices on the performance, compression, and efficiency of large language models (LLMs), with a focus on code generation tasks.

Their findings highlight the importance of carefully considering tokenization strategies, as they can significantly influence model capabilities and resource utilization.

Through extensive experimentation, they demonstrated that changing the tokenizer of a pre-trained LLM during fine-tuning can have a negligible impact on downstream performance, provided that the fine-tuning is conducted on a sufficiently large dataset (50 billion tokens or more).

This insight opens up opportunities for optimizing tokenizers for specific domains or tasks without sacrificing model accuracy.

Furthermore, their results suggest that the vocabulary size of the tokenizer, within a reasonable range (32k to 256k), has a minimal effect on the LLM's downstream performance. This finding allows for flexibility in balancing compression and memory/compute trade-offs based on the specific requirements of the application.

Overall, this study underscores the importance of carefully considering tokenization strategies in the development and fine-tuning of LLMs, as well as the potential for optimizing tokenizers to enhance model efficiency and domain-specific performance.

### <mark style="color:purple;">References</mark>

1. **01.AI Yi series models (2023)**: Discusses the Yi series of large language models available on the Hugging Face Model Repository, highlighting their advanced capabilities in language understanding.
2. **Ahmad et al. (2021)**: Explores unified pre-training methods for program understanding and generation, demonstrating a method to enhance model efficiency in understanding and generating programmatic content.
3. **Ainslie et al. (2023)**: Describes training generalized multi-query transformer models from multi-head checkpoints, advancing the flexibility of transformer architectures in handling diverse queries simultaneously.
4. **Allal et al. (2023)**: Introduces 'Santacoder', a concept for a playful, yet robust approach to coding assistance, enhancing code-related tasks without aiming for overly ambitious, unattainable goals.
5. **Almazrouei et al. (2023)**: Discusses the Falcon series of open language models, emphasizing the development of accessible and robust language model frameworks.
6. **Anthropic (2023)**: Details the release of 'Claude' by Anthropic, focusing on a new language model that aims to improve ethical considerations and robustness in AI.
7. **Austin et al. (2021)**: Investigates the use of large language models for program synthesis, showing their potential in automating coding tasks and generating programmatic content from high-level descriptions.
8. **Biderman et al. (2023)**: Presents 'Pythia', a suite designed for analyzing large language models across different stages of their training and scaling, aiming to understand their behavior and improve their design.
9. **Black et al. (2022)**: Discusses GPT-NeoX-20B, an open-source autoregressive language model, contributing to the open research and development of scalable language models.
10. **Chen et al. (2021)**: Evaluates large language models trained on code, providing insights into their effectiveness and areas for improvement in programming language understanding.
11. **Chirkova & Troshin (2023)**: Explores subtokenization options for pretraining large language models on source code, aiming to optimize model performance on coding tasks.
12. **Deci (2023)**: Introduces 'Decicoder', a model touted as a new standard in efficient and accurate code generation, enhancing the capabilities of AI in software development.
13. **DeepSeek AI (2023)**: Details a series of code language models, enhancing the tools available for developers and programmers in automated code generation.
14. **Devlin et al. (2019)**: Describes the training of BERT, a foundational model that uses deep bidirectional transformers for improved language understanding, setting a new standard in NLP.
15. **Elsen et al. (2023)**: Announces the release of Persimmon-8B, a language model designed to follow short-form instructions effectively, demonstrating its practical applications.
16. **Forsythe (2023)**: Introduces 'Tokenmonster', a tokenizer and vocabulary trainer designed to improve the efficiency of language processing in Python, Go, and JavaScript.
17. **Fried et al. (2023)**: Presents 'Incoder', a generative model for code infilling and synthesis, aimed at enhancing automated coding tasks by filling in gaps and generating syntactically correct code snippets.
18. **Gee et al. (2022, 2023)**: Discusses methods for fast vocabulary transfer and multi-word tokenization for language model compression and sequence compression, respectively, aiming to enhance model efficiency and manage larger vocabularies effectively.
19. **Gowda & May (2020)**: Investigates the optimal vocabulary size for neural machine translation, providing insights that help improve translation accuracy and efficiency.
20. **Goyal et al. (2023)**: Describes the training of language models with pause tokens, introducing a method to enhance natural language generation by incorporating thoughtful pauses in speech or text.
21. **guidance-ai (2023)**: Details a guidance language designed for controlling large language models, emphasizing the development of more responsive and controllable AI systems.
22. **Jiang et al. (2023)**: Discusses 'Mistral 7b', a model designed for multi-turn program synthesis, enhancing the interactive capabilities of AI in coding tasks.
23. **Kocetkov et al. (2022)**: Presents 'The Stack', a large dataset of permissively licensed source code, aimed at facilitating the training and development of code-oriented AI models.
24. **Kudo (2018)**: Investigates subword regularization techniques, aiming to improve translation models by managing subword variability effectively.
25. **Kudo & Richardson (2018)**: Introduces 'Sentencepiece', a tokenizer that simplifies text processing by providing a consistent subword tokenization method
