Sample Packing

This October 2022 paper, "Efficient Sequence Packing without Cross-contamination", addresses the issue of padding tokens in large language models (LLMs) and presents new algorithms for efficient sequence packing to improve training throughput and accuracy.

The authors highlight that up to 50% of all tokens in common NLP datasets can be padding tokens, leading to significant inefficiency in processing variable-length sequences on hardware accelerators.

Key points and technical details

Padding tokens in NLP datasets

  • Common practice in LLMs is to introduce padding tokens to handle variable-length sequences on hardware accelerators.

  • The variation in sequence lengths in NLP datasets can result in up to 50% of all tokens being padding tokens.

  • In extreme cases (e.g., GLUE-cola with sequence length 128), the padding ratio can be as high as 89%. The short sketch below shows how these ratios are computed.
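
To make these numbers concrete, here is a minimal sketch (mine, not from the paper) of how the padding ratio and the theoretical maximum speed-up can be estimated from a dataset's token counts. The sequence lengths below are illustrative placeholders, not real dataset statistics.

```python
# Illustrative sketch: estimate padding overhead for a fixed maximum sequence
# length. `lengths` stands in for the token counts of a real dataset.

def padding_stats(lengths, max_seq_len):
    """Return the padding ratio and the theoretical speed-up from removing padding."""
    clipped = [min(n, max_seq_len) for n in lengths]   # longer sequences are truncated/split
    real_tokens = sum(clipped)
    total_slots = len(clipped) * max_seq_len           # tokens actually processed with padding
    padding_ratio = 1 - real_tokens / total_slots
    theoretical_speedup = total_slots / real_tokens    # upper bound if no padding is processed
    return padding_ratio, theoretical_speedup

lengths = [20, 35, 60, 128, 128, 90, 15, 128, 40, 70]  # toy data: many short, a few full-length
ratio, speedup = padding_stats(lengths, max_seq_len=128)
print(f"padding ratio: {ratio:.1%}, theoretical max speed-up: {speedup:.2f}x")
```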

Existing methods and their limitations

  • Naïve batching: Widely used but inefficient due to the high percentage of padding tokens.

  • Separator tokens: Used to separate sequences from different documents (e.g., in RoBERTa) but can have a significant impact on performance.

  • Un-padding (Effective Transformer) and sorted batching (Faster Transformer, lingvo, fairseq): Require substantial hardware-specific low-level code optimizations, mostly available on GPUs.

Formalizing the packing problem

  • The authors frame the sequence packing problem in the context of the well-studied bin packing problem.

  • They present new deterministic packing algorithms, based on established solvers, that can pack datasets with millions of sequences in a matter of seconds. A standard bin-packing formulation is sketched below.
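
For reference, here is a standard bin-packing formulation of the problem (notation mine, not the paper's): given sequence lengths ℓ1, …, ℓn and a maximum sequence length s_max, choose an assignment of sequences to packs that minimises the number of packs used while every pack fits. The paper's algorithms additionally cap the number of sequences per pack (the packing depth discussed in the experiments below).

```latex
\min_{x,\,y}\ \sum_{b=1}^{n} y_b
\quad \text{s.t.} \quad
\sum_{i=1}^{n} \ell_i\, x_{ib} \le s_{\max}\, y_b \ \ \forall b,
\qquad
\sum_{b=1}^{n} x_{ib} = 1 \ \ \forall i,
\qquad
x_{ib},\, y_b \in \{0,1\}
```

Here x_{ib} = 1 if sequence i is placed in pack b, and y_b = 1 if pack b is used at all.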

Cross-contamination and model adjustments

  • Cross-contamination: the accuracy loss that occurs when sequences from different documents interact through self-attention; separator tokens do not mitigate it.

  • The authors show how the BERT model can be adjusted to ensure mathematical equivalence between the original and packed models, avoiding cross-contamination with little overhead.

Empirical results

  • The proposed packing algorithms produce a nearly-optimal packing scheme for the Wikipedia pre-training dataset.

  • Experiments demonstrate that the convergence of the BERT-large model on the packed dataset is equivalent to that on the un-packed dataset, with a 2x throughput increase on the Wikipedia sequence length 512 pre-training dataset.

In summary, this paper addresses the inefficiency caused by padding tokens in LLMs and presents new algorithms for efficient sequence packing.

The authors formalize the packing problem, introduce techniques to avoid cross-contamination, and demonstrate improved training throughput and accuracy on the Wikipedia pre-training dataset.

The proposed methods can be adapted to existing models, ensuring mathematical equivalence between the original and packed models, and can be applied to various machine learning algorithms with differently sized data samples.

Sequence length distributions

Sequence length distributions refer to the frequency or probability of sequences of different lengths within a dataset.

In the context of natural language processing (NLP) and large language models (LLMs), sequence length refers to the number of tokens in a given input sequence, such as a sentence or a document.

The paper reports the sequence length distributions for various datasets:

Wikipedia BERT pre-training dataset

  • For a maximum sequence length of 128, 59.9% of the sequences are shorter than the maximum, leading to a theoretical maximum speed-up of 1.210 if padding tokens are not used.

  • For a maximum sequence length of 384, 30.6% of the sequences are shorter than the maximum, with a theoretical maximum speed-up of 1.742.

  • For a maximum sequence length of 512, 23.5% of the sequences are shorter than the maximum, with a theoretical maximum speed-up of 2.001.

GLUE datasets (cola, sst2, mrpc, qqp, stsb, mnli, rte, wnli)

  • These datasets have varying sequence length distributions, with some datasets showing a higher concentration of shorter sequences.

SQuAD 1.1 dataset

  • The sequence length distribution for the SQuAD 1.1 dataset implies a theoretical maximum speed-up of 2.2x if padding tokens are not used.

The key findings from these sequence length distributions are:

  1. Many datasets have a significant portion of sequences that are shorter than the maximum sequence length, leading to a large number of padding tokens when using fixed-length input sequences.

  2. The skewed sequence length distributions are not limited to text data (Wikipedia, GLUE, SQuAD) but also apply to audio data (LibriSpeech) and molecular data (QM9).

  3. The theoretical maximum speed-up that can be achieved by not using padding tokens varies depending on the dataset and the maximum sequence length, ranging from about 1.21x to 2.2x in the presented examples.

These findings highlight the potential for improving the efficiency of LLMs by using techniques like sequence packing to minimise the number of padding tokens and optimise the input sequences for faster processing.

The methods described in the paper consist of three main components

Efficient data packing during pre-processing

  • Two new heuristic offline algorithms are proposed: Shortest-pack-first histogram-packing (SPFHP) and Non-negative least squares histogram-packing (NNLSHP).

  • These algorithms efficiently pack the dataset to maximize the utilization of the maximum sequence length (a simplified sketch of the greedy histogram idea follows).
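
As a rough illustration (a simplification of mine, not the paper's exact SPFHP or NNLSHP implementation), the following greedy sketch works on the histogram of sequence lengths, walks from longest to shortest, and places each sequence into an open pack with enough room, subject to a maximum packing depth.

```python
# Illustrative greedy histogram packing (a simplification, not the paper's exact
# SPFHP/NNLSHP code): walk sequence lengths from longest to shortest and place
# each sequence into an open pack with enough room, up to a fixed packing depth.

from collections import Counter

def greedy_histogram_pack(lengths, max_seq_len, max_depth=3):
    histogram = Counter(lengths)        # sequence length -> number of sequences
    packs = []                          # each pack is a list of sequence lengths
    for length in sorted(histogram, reverse=True):
        for _ in range(histogram[length]):
            target = next((p for p in packs
                           if len(p) < max_depth and sum(p) + length <= max_seq_len), None)
            if target is None:          # no open pack has room: start a new one
                target = []
                packs.append(target)
            target.append(length)
    return packs

lengths = [20, 35, 60, 128, 90, 15, 40, 70]
packs = greedy_histogram_pack(lengths, max_seq_len=128)
print(packs)                                               # [[128], [90, 35], [70, 40, 15], [60, 20]]
print(f"packing factor: {len(lengths) / len(packs):.2f}")  # average sequences per pack
```

The packing factor (average number of sequences per pack) is what drives the effective batch-size increase discussed under hyperparameters below.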

Model changes to preserve equivalence with the original BERT implementation (packedBERT)

  • Adjust positional embeddings: Replace the bias add operation with an embedding look-up to handle packed sequences.

  • Adjust attention masking: Use a block-diagonal mask before the attention softmax to prevent tokens from different sequences within a pack from attending to each other.

  • Adjust per-sequence loss and accuracy: Unpack the logits and labels to compute the loss on each sequence separately, ensuring consistency with the un-packed BERT implementation. A sketch of these three adjustments follows.
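
Below is a minimal PyTorch sketch (my own, not the packedBERT reference code) of these three adjustments for a single pack of three sequences; the shapes, names, and random logits are purely illustrative.

```python
# Hypothetical sketch of packed-sequence adjustments for one pack containing
# sequences of lengths [4, 3, 2]. Not the paper's reference implementation.

import torch

seq_lens = [4, 3, 2]
max_seq_len = sum(seq_lens)            # assume the pack exactly fills the row

# 1) Positional ids restart at 0 for every sequence in the pack, so the
#    positional-embedding look-up treats each sequence independently.
position_ids = torch.cat([torch.arange(n) for n in seq_lens])
# tensor([0, 1, 2, 3, 0, 1, 2, 0, 1])

# 2) Block-diagonal attention mask: token i may attend to token j only if both
#    belong to the same sequence, which prevents cross-contamination.
seq_ids = torch.repeat_interleave(torch.arange(len(seq_lens)), torch.tensor(seq_lens))
attention_mask = seq_ids[:, None] == seq_ids[None, :]          # (9, 9) bool, block-diagonal
attention_bias = torch.zeros(max_seq_len, max_seq_len).masked_fill(~attention_mask, float("-inf"))
# attention_bias is added to the attention scores before the softmax.

# 3) Per-sequence loss: unpack logits/labels along the packed dimension and
#    compute the loss per sequence, matching the un-packed behaviour.
vocab_size = 11
logits = torch.randn(max_seq_len, vocab_size)                  # stand-in model output
labels = torch.randint(vocab_size, (max_seq_len,))
losses = [torch.nn.functional.cross_entropy(l, y)
          for l, y in zip(logits.split(seq_lens), labels.split(seq_lens))]
loss = torch.stack(losses).mean()
print(position_ids, attention_bias.shape, loss.item())
```

In a real packed batch these masks and position ids would be built once per pack by the data loader; the point is simply that tokens from different sequences never attend to each other and each sequence contributes its own loss term.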

Hyperparameter adjustments for comparable convergence behavior

  • The primary consequence of packing is an increase in the effective batch size, which requires adjusting hyperparameters sensitive to the number of sequences and tokens.

  • One approach is to reduce the computational batch size by the packing factor and keep other hyperparameters the same.

  • Another approach is to preserve the batch size and update the decay parameters of the LAMB optimizer using the heuristic β1 := β1^p, β2 := β2^p, where p is the packing factor.

Advice on setting hyperparameters

  1. Determine the packing factor (average number of sequences per pack) based on your dataset and the chosen packing algorithm.

  2. If you want to keep the convergence behavior as close as possible to the un-packed BERT implementation, reduce the computational batch size by the packing factor and keep other hyperparameters unchanged. This approach might lead to under-utilization of memory/compute resources.

  3. If you want to preserve the batch size and optimize hardware utilization, update the decay parameters of the LAMB optimizer using the provided heuristic (β1 := β1^p, β2 := β2^p). Keep in mind that this is an approximate adjustment, and the convergence behavior might not be identical to the un-packed version. Both strategies are sketched at the end of this section.

  4. Be cautious when scaling the learning rate with the batch size, as the experiments in the paper show that this can reduce convergence speed.

  5. Monitor the convergence behavior of your model closely and be prepared to fine-tune the hyperparameters if necessary. The provided adjustments are heuristics and may not fully undo the impact of the increased batch size.

  6. If you encounter any issues with convergence or performance, consider adjusting other hyperparameters, such as the learning rate, warmup steps, or the number of training epochs, based on your specific use case and dataset.

Remember that the optimal hyperparameter settings may vary depending on your dataset, model architecture, and hardware setup. It is always a good practice to experiment with different configurations and validate the performance of your model on a holdout dataset or through cross-validation.
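
To make points 2 and 3 above concrete, here is a hypothetical sketch of the two adjustment strategies. The function name, the base values, and the use of gradient accumulation to shrink the per-update sequence count are my own illustration; the beta rule is the approximate LAMB heuristic quoted above, not an exact equivalence.

```python
# Hypothetical sketch of the two hyperparameter-adjustment strategies described
# above. Values are illustrative; the beta adjustment is the approximate
# heuristic beta := beta ** packing_factor, not an exact correction.

def adjust_for_packing(batch_size, grad_accum, beta1, beta2, packing_factor,
                       preserve_batch_size=False):
    if not preserve_batch_size:
        # Strategy 1: keep the number of sequences per weight update roughly
        # constant by shrinking the accumulation count (equivalently, the
        # computational batch size could be divided by the packing factor).
        grad_accum = max(1, round(grad_accum / packing_factor))
        return dict(batch_size=batch_size, grad_accum=grad_accum,
                    beta1=beta1, beta2=beta2)
    # Strategy 2: keep the batch size for better hardware utilisation and
    # compensate the larger effective batch by decaying the LAMB betas faster.
    return dict(batch_size=batch_size, grad_accum=grad_accum,
                beta1=beta1 ** packing_factor, beta2=beta2 ** packing_factor)

print(adjust_for_packing(batch_size=512, grad_accum=8,
                         beta1=0.9, beta2=0.999, packing_factor=2.0))
print(adjust_for_packing(batch_size=512, grad_accum=8,
                         beta1=0.9, beta2=0.999, packing_factor=2.0,
                         preserve_batch_size=True))
```

The first branch keeps the number of sequences per update roughly constant at the cost of hardware utilisation; the second keeps the hardware busy and compensates only approximately, so convergence should still be monitored as advised above.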

Experiments

The experiments in this paper focus on evaluating the proposed sequence packing algorithms (SPFHP and NNLSHP) and their impact on the training of BERT models.

The main ideas and insights from these experiments are as follows:

Bin packing algorithm comparison

  • The authors compare their proposed packing algorithms (SPFHP and NNLSHP) with baseline methods like no packing (NONE), sorted batching (SORT), and greedy packing (GREEDY).

  • They evaluate the algorithms using metrics such as the number of packs, number of tokens, number of padding tokens, solution time, packing efficiency, speed-up achieved, and average number of sequences per sample (packing factor).

  • The results show that NNLSHP with packing depth 3 achieves the best speed-up (1.913) and packing efficiency (99.7%), close to the theoretical upper bound of 2.001.

  • The overhead from the attention masking and loss adjustment slightly increases with packing depth but is outweighed by the benefits of packing.

  • The packing factor and improvement in efficiency provide an accurate estimate of the speed-up.

MLPerf™ phase 2 pretraining setup

  • The authors compare the learning curves and hyperparameter adjustments for packed and unpacked BERT training on the MLPerf™ BERT pre-training benchmark.

  • They analyze the impact of packing on convergence behavior and the theoretical speed-up in practice.

  • The learning curves show that with the same hyperparameters and adjusted accumulation count, packed and unpacked training have almost identical performance when normalized by the number of samples processed.

  • Adjusting the hyperparameters (e.g., LAMB decay parameters) to compensate for the increased batch size in packed training helps match the performance at later training stages but cannot completely recover the early convergence behavior of the smaller batch size.

  • The realized total speed-up from packing exceeds 2x due to the reduction in computational work and latency of transferring data to the device.

Full pretraining and SQuAD finetuning

  • The authors validate that downstream performance (on SQuAD 1.1) is not impacted by packing after a full pre-training of BERT base and large models plus fine-tuning.

  • They use the same hyperparameters and number of training steps for packed and unpacked runs, reducing the gradient accumulation count for packed training to match the total number of sequences processed before each weight update.

  • The SQuAD scores for packed and unpacked models are comparable to the reference scores, confirming that packing does not degrade downstream performance.

Scaling analysis: Impact of accelerators count

  • The authors discuss the advantage of packing over un-padding approaches in terms of inherent load balancing.

  • Un-padding relies on dynamically launching custom kernels that ignore padding, which can lead to load-imbalance between devices and a decrease in speed-up as the number of accelerators increases.

  • Packing, on the other hand, is inherently load-balanced, with processing time on each accelerator being independent of the content inside the batch received by the device.

These experiments provide valuable insights into how to effectively apply sequence packing in the training of large language models like BERT.

The key takeaways are:

  1. Use efficient packing algorithms like NNLSHP with an appropriate packing depth to maximize speed-up and efficiency.

  2. Apply the necessary model adjustments (attention masking and positional embedding) to maintain performance.

  3. Adjust hyperparameters (e.g., batch size, accumulation count, optimizer parameters) to compensate for the increased effective batch size in packed training.

  4. Sequence packing can provide significant speed-up without compromising downstream performance, making it a valuable technique for efficient training of large language models.

Key Conclusions

Sequence length distributions

  • Visualising sequence length distributions of various datasets (language, audio, molecular) highlights the prevalence of padding and the potential benefits of packing.

  • Packing can lead to more than 2x acceleration by removing 50% or more padding in these datasets.

Efficient packing algorithms

  • The proposed packing approaches (SPFHP and NNLSHP) based on established solvers are highly efficient, leaving almost no padding and handling large datasets quickly.

  • These algorithms outperform existing approaches that are slow and suboptimal.

Model adjustments for packed sequences

  • Without adjusting the sequence processing algorithm (e.g., BERT) to the packed sequences, predictive performance is reduced.

  • The proposed model adjustments (attention masking, positional embedding, and loss/accuracy computation) are all necessary to maintain predictive performance.

  • These adjustments come with an overhead of less than 5% but enable significant speed-up while preserving performance.

Hyperparameter tuning

  • Packing increases the effective batch size, which requires adjusting hyperparameters such as the learning rate, optimizer parameters (e.g., LAMB decay parameters), and gradient accumulation count.

  • Carefully tuning these hyperparameters can help maintain convergence behavior and performance when using packed sequences.

Downstream performance and speed-up

  • Experiments demonstrate that downstream performance (e.g., on SQuAD) is not impacted by packing when the proposed model adjustments are applied.

  • The anticipated 2x acceleration can be achieved in practice, making packing a valuable technique for efficient training of large language models.

Future directions

  • Packing can be explored for other domains, such as computer vision, where images of different sizes and resolutions can be packed to accelerate training.

  • Improving the performance of other models (RoBERTa, GPT-3, T5) by avoiding contamination between non-contiguous segments from different documents is an interesting direction for future research.

  • Even BERT itself might benefit from avoiding contamination between the two concatenated segments.

In summary, the paper highlights the importance of considering sequence length distributions, applying efficient packing algorithms, and making necessary model adjustments to maintain performance while achieving significant speed-up through sequence packing.

Hyperparameter tuning plays a crucial role in ensuring convergence behavior and performance when using packed sequences. The insights and techniques presented in the paper can be valuable for accelerating the training of large language models and potentially other domains, such as computer vision.

Reference: Efficient Sequence Packing without Cross-contamination (arXiv.org)