Tokenization Is More Than Compression
Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner
This February 2024 paper explores the role of tokenization in Natural Language Processing (NLP) tasks and challenges the common understanding of why certain tokenization methods, such as Byte-Pair Encoding (BPE), are effective.
Tokenization is the process of converting raw text into a sequence of distinct tokens that can be used by statistical models.
The authors divide tokenization into three stages:
Pre-tokenization: Optional initial rules that restrict or enforce the creation of certain tokens (e.g., splitting a corpus on whitespace).
Vocabulary Construction: The core algorithm that constructs a vocabulary of tokens (V) of size m from a given text corpus (C), while adhering to pre-tokenization rules.
Segmentation: The process of splitting a document (d) into a series of tokens (t_1, ..., t_{K_d}) from the vocabulary (V), such that the concatenation of the tokens equals the original document.
The authors introduce a new metric called Corpus Token Count (CTC), which is the total number of tokens used in the segmentation of all documents in a corpus.
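As a minimal illustration of the metric, the sketch below sums segment lengths over a toy corpus; the segmentation function passed in is a placeholder, not any of the paper's tokenizers.

```python
# Corpus Token Count (CTC): the total number of tokens used to segment every
# document in a corpus. `segment` stands in for any tokenizer's segmentation
# step and is purely illustrative.
from typing import Callable, Iterable, List

def corpus_token_count(corpus: Iterable[str],
                       segment: Callable[[str], List[str]]) -> int:
    """Sum the number of tokens K_d over every document d in the corpus C."""
    return sum(len(segment(doc)) for doc in corpus)

# Example with whitespace splitting as a stand-in segmenter:
docs = ["the cat sat", "tokenization is more than compression"]
print(corpus_token_count(docs, str.split))  # 3 + 5 = 8
```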
The paper challenges the hypothesis that the effectiveness of BPE stems from its ability to compress text into a short sequence of tokens.
To test this, they introduce a novel tokenizer called PATHPIECE, which finds a segmentation with the minimum possible number of tokens (K_d) for a given document and vocabulary. The PATHPIECE vocabulary construction routine is a top-down procedure that heuristically minimizes CTC on a training corpus.
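The paper's exact implementation is not reproduced here, but the core idea of minimum-token segmentation can be sketched as a shortest-path dynamic program over byte positions. The code below is an illustrative reimplementation under the assumptions that every single byte is in the vocabulary (so a segmentation always exists) and that tokens are at most max_len bytes long.

```python
# Minimum-token segmentation in the spirit of PATHPIECE: best[i] holds the
# fewest tokens needed to cover the first i bytes, and back[i] records where
# the last token of that optimal segmentation starts.
from typing import List, Set

def min_token_segmentation(text: bytes, vocab: Set[bytes], max_len: int = 16) -> List[bytes]:
    n = len(text)
    best = [0] + [n + 1] * n          # n + 1 acts as "infinity"
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            piece = text[i - l:i]
            if piece in vocab and best[i - l] + 1 < best[i]:
                # Ties among equally short segmentations could instead be
                # broken toward the longest token or at random, mirroring the
                # paper's two PATHPIECE tie-breaking variants.
                best[i] = best[i - l] + 1
                back[i] = i - l
    tokens, i = [], n
    while i > 0:                       # recover the optimal path backwards
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {bytes([b]) for b in range(256)} | {b"the", b" cat", b"cat"}
print(min_token_segmentation(b"the cat", vocab))  # [b'the', b' cat'] -> 2 tokens
```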
The authors conduct experiments by training 64 language models (LMs) with varying tokenization methods and vocabulary sizes:
54 LMs with 350M parameters
6 LMs with 1.3B parameters
4 LMs with 2.4B parameters
They evaluate the impact of different tokenization stages and vocabulary sizes on downstream task performance. The paper provides open-source access to PATHPIECE, token vocabularies, and all 64 trained LMs.
Related Work
The related work section discusses pre-tokenization methods, vocabulary construction algorithms, and segmentation methods in detail.
Pre-tokenization Methods
Pre-tokenization is the process of breaking text into chunks, which are then tokenized independently.
Tokens are not allowed to cross pre-tokenization boundaries. The authors discuss three pre-tokenization methods (illustrated in the sketch after this list):
FirstSpace: Used by BPE, WordPiece, and Unigram, it requires new chunks to begin whenever a space is encountered. If a space appears in a chunk, it must be the first character.
Space: Suggested by Gow-Smith et al. (2022), it treats spaces as individual tokens.
Digit: Popularized by Llama (Touvron et al., 2023), it treats each digit as an individual token.
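As a rough illustration, the snippet below implements simplified, regex-based versions of these three rules; they are stand-ins for the paper's exact pre-tokenizers, not the authors' code.

```python
# Simplified pre-tokenization rules. Each function returns the chunks that
# would then be tokenized independently.
import re

def pretokenize_firstspace(text: str):
    # New chunk whenever a space is seen; a space may only be a chunk's first character.
    return re.findall(r" ?[^ ]+| ", text)

def pretokenize_space(text: str):
    # Every space becomes its own chunk.
    return re.findall(r" |[^ ]+", text)

def pretokenize_digit(text: str):
    # Every digit becomes its own chunk.
    return re.findall(r"\d|[^\d]+", text)

print(pretokenize_firstspace("pay 100 dollars"))  # ['pay', ' 100', ' dollars']
print(pretokenize_space("pay 100 dollars"))       # ['pay', ' ', '100', ' ', 'dollars']
print(pretokenize_digit("pay 100 dollars"))       # ['pay ', '1', '0', '0', ' dollars']
```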
Vocabulary Construction
The authors focus on byte-level, lossless subword tokenization algorithms that split text into word and subword units based on frequency and co-occurrence patterns learned from training data. They analyse four subword tokenizers (a minimal BPE training sketch follows this list):
Byte-Pair Encoding (BPE): A bottom-up method that starts with single bytes as tokens and merges the most commonly occurring pair of adjacent tokens in a training corpus into a single new token until the desired vocabulary size is reached.
WordPiece: Similar to BPE, but uses Pointwise Mutual Information (PMI) as the criteria to identify candidates to merge, prioritizing pairs that occur together more frequently than expected, relative to the individual token frequencies.
Unigram Language Model: A top-down approach that starts from a large initial vocabulary and progressively prunes groups of tokens whose removal causes the smallest decrease in corpus likelihood, as measured by a simple unigram language model.
SaGe: Proposed by Yehezkel and Pinter (2023), it incorporates contextual information into an ablation loss via a skip-gram objective and operates top-down, pruning from an initial vocabulary to a desired size.
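To make the merge-based construction concrete, here is a minimal, illustrative BPE training sketch that operates on a toy word-frequency table rather than a raw byte corpus; it is not the authors' implementation. WordPiece would differ mainly in scoring candidate pairs with a PMI-style criterion instead of raw pair frequency.

```python
# Minimal BPE vocabulary construction: repeatedly merge the most frequent
# adjacent token pair until the vocabulary reaches the target size.
from collections import Counter

def train_bpe(word_freqs: dict, target_vocab_size: int):
    words = {w: list(w) for w in word_freqs}        # each word starts as characters
    vocab = {c for toks in words.values() for c in toks}
    while len(vocab) < target_vocab_size:
        pair_counts = Counter()
        for w, freq in word_freqs.items():
            toks = words[w]
            for a, b in zip(toks, toks[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]   # most frequent adjacent pair
        vocab.add(a + b)
        for w, toks in words.items():               # apply the merge everywhere
            merged, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    merged.append(a + b); i += 2
                else:
                    merged.append(toks[i]); i += 1
            words[w] = merged
    return vocab

print(sorted(train_bpe({"low": 5, "lower": 2, "lowest": 3}, 12)))
```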
Segmentation Methods
Segmentation converts text into a series of tokens, given a tokenizer and a vocabulary of tokens.
The authors ensure that all 256 single-byte tokens are included in the vocabulary to avoid out-of-vocabulary issues.
Some segmentation methods are tightly coupled to the vocabulary construction step, such as merge rules for BPE or the maximum likelihood approach for Unigram. Others, like the WordPiece approach of greedily taking the longest prefix token in the vocabulary at each point, can be applied to any vocabulary.
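A minimal sketch of that greedy longest-prefix strategy is given below; it assumes, as noted above, that all 256 single-byte tokens are in the vocabulary, so the loop always makes progress and nothing falls out of vocabulary.

```python
# Greedy longest-prefix segmentation, usable with any vocabulary of byte strings.
from typing import List, Set

def greedy_segment(text: bytes, vocab: Set[bytes], max_len: int = 16) -> List[bytes]:
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary token that is a prefix of the remainder.
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in vocab:
                tokens.append(text[i:i + l])
                i += l
                break
    return tokens

vocab = {bytes([b]) for b in range(256)} | {b"token", b"iza", b"tion"}
print(greedy_segment(b"tokenization", vocab))  # [b'token', b'iza', b'tion']
```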
Alternative segmentation schemes include Dynamic Programming BPE, BPE-Dropout, and FLOTA.
Experiments
The experiments section provides details on the authors' experimental setup, the downstream evaluation tasks used, and the various tokenization stage variants tested.
The authors then present and analyse the results of their experiments.
Downstream Evaluation Tasks
The authors selected 10 benchmarks from the lm-evaluation-harness to evaluate the downstream performance of LMs trained with each tokenization configuration.
These benchmarks are all multiple-choice tasks with 2, 4, or 5 options and were run with 5-shot prompting.
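For reference, a hedged sketch of how such an evaluation can be launched with EleutherAI's lm-evaluation-harness is shown below; it assumes a recent harness version that exposes lm_eval.simple_evaluate, and the checkpoint path and task names are placeholders rather than the paper's exact configuration.

```python
# Hedged sketch: 5-shot multiple-choice evaluation via lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # HuggingFace-backed causal LM
    model_args="pretrained=path/to/350M-model",  # placeholder checkpoint path
    tasks=["arc_easy", "hellaswag", "piqa"],     # placeholder subset of the 10 benchmarks
    num_fewshot=5,                               # 5-shot prompting, as in the paper
)
print(results["results"])
```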
Tokenization Stage Variants
The authors conducted 18 experimental variants, each repeated at vocabulary sizes of 32,768, 40,960, and 49,152.
They used BPE, Unigram, WordPiece, and SaGe as baseline vocabulary creation methods, and two variants of PATHPIECE with different tie-breaking strategies (longest token and random). They also varied the initial vocabulary for PATHPIECE and SaGe, and the pre-tokenization schemes.
Results
The authors report the downstream performance across all experimental settings and make several key observations:
Vocabulary Size: The authors found a high correlation between downstream performance at different vocabulary sizes, indicating that vocabulary size is not a crucial decision over the range of 30k to 50k tokens.
Overall Performance: The top five tokenizers (PATHPIECEL-BPE, Unigram, BPE, BPE-Greedy, and WordPiece) show no statistically significant differences in performance, and SaGe-BPE (rank 6) is only marginally worse than the leader. This suggests that no single tokenization algorithm is significantly better than the others.
Model Size: The authors built larger models (1.3B and 2.4B parameters) for a subset of the experiments. They found that the relative performance of the tokenizers varies with model size, but a group of highly performant tokenizers still yields comparable results.
Corpus Token Count vs Accuracy
The authors did not find a straightforward relationship between corpus token count (CTC) and downstream accuracy at any of the vocabulary sizes.
The range of CTC is quite narrow within each vocabulary construction method, even though changes in pre-tokenization and segmentation lead to significant accuracy differences. The authors suggest that there may be an inverted-U-shaped relationship between CTC and downstream performance.
In summary, the experiments and results demonstrate that there is no single tokenizer algorithm that significantly outperforms the others, and the relationship between corpus token count and downstream performance is not straightforward.
The authors' findings suggest that factors other than vocabulary size and corpus token count play a role in the effectiveness of tokenization for language modelling tasks.
Conclusion
In this paper, the authors investigate the hypothesis that reducing the corpus token count (CTC) would improve downstream performance in natural language processing tasks.
They compare various tokenization methods and analyse the impact of different stages of tokenization on downstream task performance.
The main conclusion is that the relationship between CTC and downstream accuracy is not straightforward, as five different tokenizers with varying CTCs perform comparably. This finding challenges the current understanding of why Byte-Pair Encoding (BPE) is particularly effective.
The authors also find that Unigram tokenizers align better with morphological segmentation than BPE tokenizers do, further suggesting that the effectiveness of a tokenizer cannot be explained by a single factor.