
Sample Packing

The October 2022 paper "Efficient Sequence Packing for Transformers" addresses the inefficiency introduced by padding tokens in large language models (LLMs) and presents new algorithms for efficient sequence packing that improve training throughput while preserving accuracy.

The authors highlight that up to 50% of all tokens in common NLP datasets can be padding tokens, leading to significant inefficiency in processing variable-length sequences on hardware accelerators.

Key points and technical details

Padding tokens in NLP datasets

  • Common practice in LLMs is to introduce padding tokens to handle variable-length sequences on hardware accelerators.

  • The variation in sequence lengths in NLP datasets can result in up to 50% of all tokens being padding tokens.

  • In extreme cases (e.g., GLUE-cola with sequence length 128), the padding ratio can be as high as 89%.

Existing methods and their limitations

  • Naïve batching: Widely used but inefficient due to the high percentage of padding tokens.

  • Separator tokens: Used to separate sequences from different documents (e.g., in RoBERTa) but can have a significant impact on performance.

  • Un-padding (Effective Transformer) and sorted batching (Faster Transformer, lingvo, fairseq): Require substantial hardware-specific low-level code optimizations, mostly available on GPUs.

Formalizing the packing problem

  • The authors frame the sequence packing problem in the context of the well-studied bin packing problem.

  • They present new deterministic packing algorithms, based on established solvers, that can pack datasets with millions of sequences in a matter of seconds (the least-squares formulation is sketched below).
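To make the bin-packing framing concrete, the non-negative least squares variant can be sketched roughly as follows; the notation is illustrative rather than the paper's exact formulation. Let b be the histogram of sequence lengths, let each column of A describe a valid packing "strategy" (a combination of lengths whose sum fits within the maximum sequence length), and let x count how often each strategy is used:

```latex
\min_{x \ge 0} \; \lVert \, w \odot (A x - b) \, \rVert_2^2
```

Here w is an optional per-length weighting. The solution x is rounded to integer strategy counts, and any sequences left over by the rounding can be handled separately; the shortest-pack-first variant instead makes a greedy pass over the same histogram.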

Cross-contamination and model adjustments

  • Cross-contamination: The accuracy degradation that occurs when tokens from different documents interact in self-attention within a pack; separator tokens alone do not prevent it.

  • The authors show how the BERT model can be adjusted to ensure mathematical equivalence between the original and packed models, avoiding cross-contamination with little overhead.

Empirical results

  • The proposed packing algorithms produce a nearly-optimal packing scheme for the Wikipedia pre-training dataset.

  • Experiments demonstrate that the convergence of the BERT-large model on the packed dataset is equivalent to that on the un-packed dataset, with a 2x throughput increase on the Wikipedia sequence length 512 pre-training dataset.

In summary, this paper addresses the inefficiency caused by padding tokens in LLMs and presents new algorithms for efficient sequence packing.

The authors formalize the packing problem, introduce techniques to avoid cross-contamination, and demonstrate improved training throughput on the Wikipedia pre-training dataset without loss of accuracy.

The proposed methods can be adapted to existing models, ensuring mathematical equivalence between the original and packed models, and can be applied to various machine learning algorithms with differently sized data samples.

Sequence length distributions

Sequence length distributions refer to the frequency or probability of sequences of different lengths within a dataset.

In the context of natural language processing (NLP) and large language models (LLMs), sequence length refers to the number of tokens in a given input sequence, such as a sentence or a document.

The sequence length distributions reported in the paper cover several datasets:

Wikipedia BERT pre-training dataset

  • For a maximum sequence length of 128, 59.9% of the sequences are shorter than the maximum, leading to a theoretical maximum speed-up of 1.210 if padding tokens are not used.

  • For a maximum sequence length of 384, 30.6% of the sequences are shorter than the maximum, with a theoretical maximum speed-up of 1.742.

  • For a maximum sequence length of 512, 23.5% of the sequences are shorter than the maximum, with a theoretical maximum speed-up of 2.001.

GLUE datasets (cola, sst2, mrpc, qqp, stsb, mnli, rte, wnli)

  • The sequence length distributions for these datasets are shown in the top right graph, indicating that they have varying sequence lengths, with some datasets having a higher concentration of shorter sequences.

SQuAD 1.1 dataset

  • The sequence length distribution for the SQuAD 1.1 dataset is shown in the bottom left graph, with a theoretical maximum speed-up of 2.2x if padding tokens are not used.

The key findings from these sequence length distributions are:

  1. Many datasets have a significant portion of sequences that are shorter than the maximum sequence length, leading to a large number of padding tokens when using fixed-length input sequences.

  2. The skewed sequence length distributions are not limited to text data (Wikipedia, GLUE, SQuAD) but also apply to audio data (LibriSpeech) and molecular data (QM9).

  3. The theoretical maximum speed-up that can be achieved by not using padding tokens varies with the dataset and the maximum sequence length, ranging from 1.21x to 2.2x in the presented examples (the sketch after this section shows how such estimates are computed).

These findings highlight the potential for improving the efficiency of LLMs by using techniques like sequence packing to minimise the number of padding tokens and optimise the input sequences for faster processing.
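As a rough illustration of where these speed-up figures come from: the theoretical maximum speed-up from removing padding is the ratio of tokens processed with fixed-length padding to the tokens actually needed. The helper below is a hypothetical sketch, not code from the paper.

```python
from typing import Sequence

def theoretical_speedup(lengths: Sequence[int], max_seq_len: int) -> float:
    """Estimate the maximum speed-up from eliminating padding.

    With naive fixed-length batching every sequence costs max_seq_len tokens
    of compute; without padding it costs only its true length (sequences
    longer than the maximum are assumed to be truncated).
    """
    real_tokens = sum(min(length, max_seq_len) for length in lengths)
    padded_tokens = max_seq_len * len(lengths)
    return padded_tokens / real_tokens

# Toy example: three short sequences and one full-length sequence at length 512
print(theoretical_speedup([100, 200, 300, 512], max_seq_len=512))  # ~1.84
```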

The methods described in the paper consist of three main components:

Efficient data packing during pre-processing

  • Two new heuristic offline algorithms are proposed: Shortest-pack-first histogram-packing (SPFHP) and Non-negative least squares histogram-packing (NNLSHP).

  • These algorithms pack the dataset so that each pack comes as close as possible to the maximum sequence length, leaving little room for padding (a simplified sketch of the greedy variant follows this list).
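As a rough illustration of the greedy variant, here is a simplified sketch loosely in the spirit of shortest-pack-first histogram-packing; the function name and details are illustrative, not the paper's implementation.

```python
from collections import Counter
from typing import Dict, List, Sequence

def shortest_pack_first(lengths: Sequence[int], max_seq_len: int,
                        max_depth: int = 3) -> List[List[int]]:
    """Greedy histogram-based packing, loosely in the spirit of SPFHP.

    Processes sequence lengths from longest to shortest and drops each
    sequence into the currently shortest open pack that still has room
    (and fewer than max_depth sequences); otherwise it opens a new pack.
    """
    histogram: Dict[int, int] = Counter(lengths)
    packs: List[List[int]] = []  # each pack is a list of sequence lengths

    for length in sorted(histogram, reverse=True):
        for _ in range(histogram[length]):
            # Open packs that can still accommodate a sequence of this length.
            candidates = [p for p in packs
                          if sum(p) + length <= max_seq_len and len(p) < max_depth]
            if candidates:
                # "Shortest pack first": choose the least-filled candidate.
                min(candidates, key=sum).append(length)
            else:
                packs.append([length])
    return packs

# Toy example with a maximum sequence length of 512:
packs = shortest_pack_first([512, 400, 300, 200, 100, 100, 64], max_seq_len=512)
print(packs)                              # [[512], [400, 100], [300, 200], [100, 64]]
print("packing factor:", 7 / len(packs))  # 1.75 sequences per pack on average
```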

Model changes to preserve equivalence with the original BERT implementation (packedBERT)

  • Adjust positional embeddings: Replace the bias add operation with an embedding look-up so that position indices restart at zero for each sequence in a pack.

  • Adjust attention masking: Use a block-diagonal mask before the attention softmax to prevent tokens from different sequences within a pack from attending to each other (see the sketch after this list).

  • Adjust per-sequence loss and accuracy: Unpack the logits and labels to compute the loss on each sequence separately, ensuring consistency with the un-packed BERT implementation.
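To make the positional-embedding and attention-mask adjustments concrete, here is a minimal NumPy sketch; it is illustrative only, not the packedBERT implementation, and the function name is hypothetical.

```python
import numpy as np

def packed_positions_and_mask(seq_lens, max_seq_len):
    """Build position ids and an attention mask for a single pack.

    seq_lens: lengths of the sequences packed together, e.g. [3, 2]
              (their sum must not exceed max_seq_len; the rest is padding).
    Returns:
      positions: position ids that restart at 0 for every sequence in the pack
      mask:      1 where query and key tokens belong to the same sequence,
                 0 elsewhere (block-diagonal), so cross-contamination is
                 blocked before the attention softmax
    """
    positions = np.zeros(max_seq_len, dtype=np.int32)
    seq_ids = np.zeros(max_seq_len, dtype=np.int32)  # 0 marks padding
    offset = 0
    for i, length in enumerate(seq_lens, start=1):
        positions[offset:offset + length] = np.arange(length)
        seq_ids[offset:offset + length] = i
        offset += length

    same_sequence = seq_ids[:, None] == seq_ids[None, :]
    not_padding = (seq_ids != 0)[:, None] & (seq_ids != 0)[None, :]
    mask = (same_sequence & not_padding).astype(np.int32)
    return positions, mask

positions, mask = packed_positions_and_mask([3, 2], max_seq_len=8)
print(positions)  # [0 1 2 0 1 0 0 0]
print(mask)       # a 3x3 and a 2x2 block of ones on the diagonal, zeros elsewhere
```

In the additive form typically used before the softmax, the zero entries of this mask would become a large negative value added to the attention logits.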

Hyperparameter adjustments for comparable convergence behavior

  • The primary consequence of packing is an increase in the effective batch size, which requires adjusting hyperparameters sensitive to the number of sequences and tokens.

  • One approach is to reduce the computational batch size by the packing factor and keep other hyperparameters the same.

  • Another approach is to preserve the batch size and update the decay parameters of the LAMB optimizer using the heuristic β1 := β1^p, β2 := β2^p, where p is the packing factor (a numeric example follows this list).
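As a quick numeric illustration of this heuristic, the snippet below uses common LAMB defaults chosen only for illustration.

```python
def adjust_lamb_betas(beta1: float, beta2: float, packing_factor: float):
    """Heuristic from the paper: raise the LAMB decay parameters to the power
    of the packing factor to compensate for the larger effective batch size."""
    return beta1 ** packing_factor, beta2 ** packing_factor

# Example: typical defaults beta1=0.9, beta2=0.999 and a packing factor of 2
print(adjust_lamb_betas(0.9, 0.999, 2))  # (0.81, ~0.998)
```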

Advice on setting hyperparameters

  1. Determine the packing factor (average number of sequences per pack) based on your dataset and the chosen packing algorithm.

  2. If you want to keep the convergence behavior as close as possible to the un-packed BERT implementation, reduce the computational batch size by the packing factor and keep other hyperparameters unchanged (see the small helper after this section). This approach might lead to under-utilization of memory/compute resources.

  3. If you want to preserve the batch size and optimize hardware utilization, update the decay parameters of the LAMB optimizer using the provided heuristic (β1 := β1^p, β2 := β2^p). Keep in mind that this is an approximate adjustment, and the convergence behavior might not be identical to the un-packed version.

  4. Be cautious when scaling the learning rate with the batch size, as the experiments in the paper show that this can reduce convergence speed.

  5. Monitor the convergence behavior of your model closely and be prepared to fine-tune the hyperparameters if necessary. The provided adjustments are heuristics and may not fully undo the impact of the increased batch size.

  6. If you encounter any issues with convergence or performance, consider adjusting other hyperparameters, such as the learning rate, warmup steps, or the number of training epochs, based on your specific use case and dataset.

Remember that the optimal hyperparameter settings may vary depending on your dataset, model architecture, and hardware setup. It is always a good practice to experiment with different configurations and validate the performance of your model on a holdout dataset or through cross-validation.
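For the batch-size option in step 2, the arithmetic is simply dividing the per-step sequence budget by the packing factor. The helper and numbers below are hypothetical, for illustration only.

```python
def packed_batch_size(unpacked_batch_size: int, packing_factor: float) -> int:
    """Shrink the computational batch size (counted in packs) so that the
    number of sequences seen per optimizer step stays roughly constant."""
    return max(1, round(unpacked_batch_size / packing_factor))

# Hypothetical numbers: 1536 sequences per step and a packing factor of ~2
print(packed_batch_size(1536, 2.0))  # 768 packs per step
```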

Experiments

The experiments in this paper focus on evaluating the proposed sequence packing algorithms (SPFHP and NNLSHP) and their impact on the training of BERT models.

The main ideas and insights from these experiments are as follows:

Bin packing algorithm comparison

  • The authors compare their proposed packing algorithms (SPFHP and NNLSHP) with baseline methods like no packing (NONE), sorted batching (SORT), and greedy packing (GREEDY).

  • They evaluate the algorithms using metrics such as the number of packs, number of tokens, number of padding tokens, solution time, packing efficiency, speed-up achieved, and average number of sequences per sample (packing factor).

  • The results show that NNLSHP with packing depth 3 achieves the best speed-up (1.913) and packing efficiency (99.7%), close to the theoretical upper bound of 2.001.

  • The overhead from the attention masking and loss adjustment slightly increases with packing depth but is outweighed by the benefits of packing.

  • The packing factor and improvement in efficiency provide an accurate estimate of the speed-up.

MLPerf™ phase 2 pretraining setup

  • The authors compare the learning curves and hyperparameter adjustments for packed and unpacked BERT training on the MLPerf™ BERT pre-training benchmark.

  • They analyze the impact of packing on convergence behavior and the theoretical speed-up in practice.

  • The learning curves show that with the same hyperparameters and adjusted accumulation count, packed and unpacked training have almost identical performance when normalized by the number of samples processed.

  • Adjusting the hyperparameters (e.g., LAMB decay parameters) to compensate for the increased batch size in packed training helps match the performance at later training stages but cannot completely recover the early convergence behavior of the smaller batch size.

  • The realized total speed-up from packing exceeds 2x due to the reduction in computational work and latency of transferring data to the device.

Full pretraining and SQuAD finetuning

  • The authors validate that downstream performance (on SQuAD 1.1) is not impacted by packing after a full pre-training of BERT base and large models plus fine-tuning.

  • They use the same hyperparameters and number of training steps for packed and unpacked runs, reducing the gradient accumulation count for packed training to match the total number of sequences processed before each weight update.

  • The SQuAD scores for packed and unpacked models are comparable to the reference scores, confirming that packing does not degrade downstream performance.

Scaling analysis: Impact of accelerators count

  • The authors discuss the advantage of packing over un-padding approaches in terms of inherent load balancing.

  • Un-padding relies on dynamically launching custom kernels that ignore padding, which can lead to load-imbalance between devices and a decrease in speed-up as the number of accelerators increases.

  • Packing, on the other hand, is inherently load-balanced, with processing time on each accelerator being independent of the content inside the batch received by the device.

These experiments provide valuable insights into how to effectively apply sequence packing in the training of large language models like BERT.

The key takeaways are:

  1. Use efficient packing algorithms like NNLSHP with an appropriate packing depth to maximize speed-up and efficiency.

  2. Apply the necessary model adjustments (attention masking and positional embedding) to maintain performance.

  3. Adjust hyperparameters (e.g., batch size, accumulation count, optimizer parameters) to compensate for the increased effective batch size in packed training.

  4. Sequence packing can provide significant speed-up without compromising downstream performance, making it a valuable technique for efficient training of large language models.

Key Conclusions

Sequence length distributions

  • Visualising sequence length distributions of various datasets (language, audio, molecular) highlights the prevalence of padding and the potential benefits of packing.

  • Packing can lead to more than 2x acceleration by removing 50% or more padding in these datasets.

Efficient packing algorithms

  • The proposed packing approaches (SPFHP and NNLSHP) based on established solvers are highly efficient, leaving almost no padding and handling large datasets quickly.

  • These algorithms outperform existing approaches that are slow and suboptimal.

Model adjustments for packed sequences

  • Without adjusting the sequence processing algorithm (e.g., BERT) to the packed sequences, predictive performance is reduced.

  • The proposed model adjustments (attention masking, positional embedding, and loss/accuracy computation) are all necessary to maintain predictive performance.

  • These adjustments come with an overhead of less than 5% but enable significant speed-up while preserving performance.

Hyperparameter tuning

  • Packing increases the effective batch size, which requires adjusting hyperparameters such as the learning rate, optimizer parameters (e.g., LAMB decay parameters), and gradient accumulation count.

  • Carefully tuning these hyperparameters can help maintain convergence behavior and performance when using packed sequences.

Downstream performance and speed-up

  • Experiments demonstrate that downstream performance (e.g., on SQuAD) is not impacted by packing when the proposed model adjustments are applied.

  • The anticipated 2x acceleration can be achieved in practice, making packing a valuable technique for efficient training of large language models.

Future directions

  • Packing can be explored for other domains, such as computer vision, where images of different sizes and resolutions can be packed to accelerate training.

  • Improving the performance of other models (RoBERTa, GPT-3, T5) by avoiding contamination between non-contiguous segments from different documents is an interesting direction for future research.

  • Even BERT itself might benefit from avoiding contamination between the two concatenated segments.

In summary, the paper highlights the importance of considering sequence length distributions, applying efficient packing algorithms, and making necessary model adjustments to maintain performance while achieving significant speed-up through sequence packing.

Hyperparameter tuning plays a crucial role in ensuring convergence behavior and performance when using packed sequences. The insights and techniques presented in the paper can be valuable for accelerating the training of large language models and potentially other domains, such as computer vision.
