Tokenization - SentencePiece
The Unsupervised Text Tokenizer for Neural Networks
Copyright Continuum Labs - 2023
SentencePiece is a language-independent subword tokenizer and detokenizer designed for neural text processing, particularly Neural Machine Translation (NMT).
Its main goal is to provide a simple, efficient, and reproducible preprocessing and postprocessing tool that can be easily integrated into neural network-based NLP systems.
Its core strength lies in its ability to manage vocabulary size before training neural models, a critical factor in the efficiency and effectiveness of these systems.
Unlike conventional tokenizers, it does not rely on whitespace for tokenization, making it well suited to languages such as Chinese and Japanese that are written without explicit word boundaries.
It learns subword units, using byte-pair encoding (BPE) or a unigram language model, directly from raw sentences. This approach keeps frequent, informative units within a fixed vocabulary while minimising redundancy.
Tokenization, the process of breaking down text into words or subwords, is fundamental in NLP.
SentencePiece excels in splitting words into subwords, capturing frequent and diverse subwords within a predetermined vocabulary size.
Setting a vocabulary size limit is vital in preventing the inclusion of rare or complex words that may not justify their own embedding vectors. This balance is key to efficient and effective models.
SentencePiece comprises four primary components:
Normalizer: Canonicalises semantically equivalent Unicode characters into a standard form (based on NFKC).
Trainer: Learns the subword vocabulary from a corpus using either BPE or the unigram language model.
Encoder and Decoder: Convert text into subword sequences and back, ensuring lossless tokenization (a minimal usage sketch follows this list).
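As a rough illustration of how these components fit together, the sketch below trains a small model and round-trips a sentence through the encoder and decoder using the Python bindings (recent versions of the sentencepiece package). The corpus path, the model prefix and the vocabulary size are placeholder values, not settings mandated by SentencePiece.

```python
import sentencepiece as spm

# Trainer: learn a subword vocabulary from a raw, untokenized corpus
# (one sentence per line). 'corpus.txt' and the prefix 'm' are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="m",      # writes m.model and m.vocab
    vocab_size=8000,       # target vocabulary size
)

# Encoder / Decoder: load the trained model and round-trip a sentence.
sp = spm.SentencePieceProcessor(model_file="m.model")

pieces = sp.encode("This is a test.", out_type=str)  # subword pieces
ids = sp.encode("This is a test.")                   # integer ids
text = sp.decode(ids)                                # back to a plain string

print(pieces)
print(ids)
print(text)
```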
SentencePiece implements two subword segmentation algorithms - byte-pair encoding (BPE) and the unigram language model. These algorithms allow the tokenizer to break words down into smaller units (subwords), reducing the vocabulary size and handling out-of-vocabulary words effectively.
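The choice of algorithm is made at training time; in the Python bindings it is the model_type argument, with the unigram model as the default. The corpus path, prefixes and vocabulary size below are illustrative only.

```python
import sentencepiece as spm

# Same placeholder corpus, two segmentation algorithms.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m_unigram",
    vocab_size=8000, model_type="unigram",   # unigram LM (default)
)
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m_bpe",
    vocab_size=8000, model_type="bpe",       # byte-pair encoding
)
```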
SentencePiece can directly train subword models from raw sentences without relying on language-specific pre-tokenization. This enables the creation of purely end-to-end and language-independent NMT systems.
SentencePiece treats the input text as a sequence of Unicode characters, including whitespace, which is escaped with a meta symbol ("▁", U+2581). This allows for reversible encoding and decoding without losing information, making the process language-agnostic.
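A small sketch of what this looks like in practice, assuming the placeholder m.model from the earlier example: whitespace is carried into the pieces as the "▁" marker, so the decoder can restore the original spacing. The exact segmentation depends on the trained model, so the pieces shown in the comment are only indicative.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # placeholder model path

pieces = sp.encode("Hello world.", out_type=str)
# Typically something like ['▁Hello', '▁world', '.'] -- the leading '▁'
# (U+2581) records where whitespace occurred in the raw text.

restored = sp.decode(pieces)   # '▁' markers are turned back into spaces
print(restored)                # 'Hello world.'
```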
SentencePiece manages the vocabulary-to-id mapping, enabling direct conversion of text into an id sequence and vice versa. This is particularly useful for NMT systems, as their input and output are typically id sequences.
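In practice this means the processor itself acts as the vocabulary lookup table. A sketch, again assuming a trained placeholder model file named m.model:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # placeholder path

ids = sp.encode("Subword units are useful.")   # text -> id sequence
text = sp.decode(ids)                          # id sequence -> text

# The same mapping is exposed piece by piece.
vocab_size = sp.get_piece_size()
some_piece = sp.id_to_piece(10)
same_id = sp.piece_to_id(some_piece)
print(vocab_size, some_piece, same_id)
```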
SentencePiece includes a normalizer module that canonicalizes semantically-equivalent Unicode characters, ensuring consistent input for the subword model training.
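NFKC normalization itself is a standard Unicode operation; the snippet below uses Python's unicodedata module purely to illustrate the kind of canonicalization the normalizer performs. SentencePiece applies its own NFKC-derived rules during training and encoding, selectable (in current releases) via the normalization_rule_name training option.

```python
import unicodedata

# Compatibility characters collapse to their canonical equivalents under NFKC.
print(unicodedata.normalize("NFKC", "Ｈｅｌｌｏ"))  # full-width letters -> 'Hello'
print(unicodedata.normalize("NFKC", "①"))           # circled digit -> '1'
```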
The SentencePiece paper demonstrates that training subword models directly from raw sentences can achieve accuracy comparable to systems that rely on language-specific pre-tokenization, evaluated on an English-Japanese NMT task.
By providing a simple, language-independent, and reversible tokenization process, SentencePiece aims to standardise and simplify the preprocessing and postprocessing steps in neural network-based NLP systems.
SentencePiece is designed to be language-independent, meaning it can be applied to any language without requiring language-specific knowledge or preprocessing. This is particularly useful in multilingual NLP tasks or when working with low-resource languages. By treating the input text as a sequence of Unicode characters and directly learning subword units from raw sentences, SentencePiece eliminates the need for language-specific tokenization rules or tools.
One of the key challenges in tokenization is managing the vocabulary size. A large vocabulary can lead to increased model complexity and computational costs, while a small vocabulary may not capture important words or subwords. SentencePiece allows you to specify a desired vocabulary size, and it automatically learns the most frequent and informative subwords to include in the vocabulary. This helps in striking a balance between model efficiency and expressiveness.
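The vocabulary budget is a single training argument. The sketch below trains two models of different sizes on the same placeholder corpus; the file names, sizes and example word are illustrative, and smaller vocabularies will generally split words into more, shorter pieces.

```python
import sentencepiece as spm

for size in (4000, 32000):                     # illustrative vocabulary budgets
    spm.SentencePieceTrainer.train(
        input="corpus.txt",                    # placeholder corpus
        model_prefix=f"m_{size}",
        vocab_size=size,
    )

small = spm.SentencePieceProcessor(model_file="m_4000.model")
large = spm.SentencePieceProcessor(model_file="m_32000.model")
print(small.encode("internationalization", out_type=str))  # more, shorter pieces
print(large.encode("internationalization", out_type=str))  # fewer, longer pieces
```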
SentencePiece implements subword segmentation algorithms, such as byte-pair encoding (BPE) and the unigram language model, which break words down into smaller units (subwords). This approach has several advantages:
It reduces the vocabulary size by representing rare or out-of-vocabulary words as combinations of subwords.
It captures morphological and semantic information within words, as subwords often correspond to meaningful units like prefixes, suffixes, or roots.
It enables the model to handle unseen words by composing them from learned subwords, as the sketch below illustrates.
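A quick way to see the last point is to encode a word that is unlikely to be a single vocabulary entry. The model path is a placeholder and the exact pieces depend entirely on the trained model, so the comment only shows the shape of the output.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")   # placeholder path

print(sp.encode("untranslatability", out_type=str))
# A typical model returns several pieces, e.g. something like
# ['▁un', 'trans', 'lat', 'ability'] -- the full word itself never
# needs to exist in the vocabulary.
```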
SentencePiece provides a lossless tokenization process, meaning that the original text can be perfectly reconstructed from the tokenized representation. It achieves this by treating whitespace and other special characters as separate tokens and escaping them with a meta symbol. This reversibility ensures that no information is lost during tokenization and detokenization, making it easier to integrate SentencePiece into existing NLP pipelines.
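Reversibility can be checked directly: for input covered by the model's character set (and with default whitespace handling), decoding the encoded ids should reproduce the original string. A minimal check, assuming the placeholder model from the earlier sketches:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")   # placeholder path

text = "Lossless tokenization makes detokenization trivial."
assert sp.decode(sp.encode(text)) == text   # round trip recovers the input
```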
SentencePiece offers a simple and intuitive API for tokenization and detokenization. It provides straightforward methods for training subword models, encoding text into subword sequences, and decoding subword sequences back into text. The library is well documented, and pre-trained SentencePiece models for many languages are widely available alongside published NLP models, making it easy to get started with tokenization tasks.
SentencePiece promotes reproducibility by providing a standardised and deterministic tokenization process.
Given the same input text and trained model, SentencePiece guarantees consistent tokenization results across different platforms and implementations. This is crucial for reproducible research and ensures that models trained using SentencePiece can be easily shared and deployed.