Sequence Length Warmup
The paper proposes Sequence Length Warmup, a method that addresses the stability-efficiency dilemma in large-scale language model pre-training by avoiding extreme gradient variance values. The technique gradually increases the sequence length (the number of tokens per training sample) during the initial stages of training, mitigating training instability while maintaining computational efficiency.
The implementation linearly increases the sequence length over the first 30% of training and then keeps it at the maximum sequence length for the remainder of the training process.
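As a concrete illustration, such a schedule can be written as a small function. This is only a sketch: the function name, defaults, and the `step_size` rounding are illustrative assumptions, not part of the paper or any library.

```python
def seq_length_at_step(step: int, total_steps: int,
                       min_seq_len: int = 8,
                       max_seq_len: int = 1024,
                       warmup_frac: float = 0.3,
                       step_size: int = 8) -> int:
    """Linearly ramp the sequence length over the first `warmup_frac`
    of training, then hold it at `max_seq_len`."""
    warmup_steps = int(total_steps * warmup_frac)
    if step >= warmup_steps:
        return max_seq_len
    # Linear interpolation between the minimum and maximum lengths.
    frac = step / max(warmup_steps, 1)
    length = min_seq_len + frac * (max_seq_len - min_seq_len)
    # Round down to a multiple of `step_size` for hardware-friendly shapes.
    return max(min_seq_len, int(length) // step_size * step_size)
```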
To use Sequence Length Warmup, define a warmup schedule that increases the sequence length at each step until it reaches the maximum value. This can be done through a functional interface or by passing the algorithm to the Composer Trainer along with the maximum sequence length and other parameters, as sketched below.
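For example, Composer ships a `SeqLengthWarmup` algorithm that can be handed to its Trainer. The outline below follows Composer's documented arguments, but exact names and defaults may vary across versions, and `model` and `train_dataloader` are assumed to be defined elsewhere:

```python
from composer import Trainer
from composer.algorithms import SeqLengthWarmup

# `model` and `train_dataloader` are assumed: a Composer-compatible
# language model and its training dataloader.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    algorithms=[
        SeqLengthWarmup(
            duration=0.3,         # warm up over the first 30% of training
            min_seq_length=8,     # starting sequence length
            max_seq_length=1024,  # final (maximum) sequence length
            truncate=True,        # truncate long samples rather than segment them
        )
    ],
)
trainer.fit()
```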
Sequence Length Warmup is a form of curriculum learning in which example difficulty is determined by sequence length. It presents sequences of increasing length during training, but it does not draw naturally shorter sentences from the dataset; instead, the implementation may truncate or segment longer sequences to create the shorter ones.
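The two shortening strategies can be illustrated on a batch of token IDs. This is a minimal plain-PyTorch sketch with hypothetical function names, not the paper's code:

```python
import torch

def truncate_batch(input_ids: torch.Tensor, curr_seq_len: int) -> torch.Tensor:
    """Keep only the first `curr_seq_len` tokens of each sample;
    the remaining tokens are discarded."""
    return input_ids[:, :curr_seq_len]

def segment_batch(input_ids: torch.Tensor, curr_seq_len: int) -> torch.Tensor:
    """Split each sample into several shorter samples so that no
    tokens are discarded (the batch dimension grows accordingly)."""
    batch_size, seq_len = input_ids.shape
    usable = (seq_len // curr_seq_len) * curr_seq_len
    return input_ids[:, :usable].reshape(-1, curr_seq_len)
```

Truncation keeps the batch size fixed but drops tokens; segmentation preserves tokens at the cost of a larger effective batch.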
The practice of increasing batch sizes and learning rates in large-scale autoregressive language model pre-training can lead to a stability-efficiency dilemma. While it improves training efficiency, it can result in training instability, leading to poor generalization accuracy or failed runs.
Training instability is strongly correlated with extreme values of gradient variance. Samples with long sequence lengths contribute disproportionately to these extreme values, particularly at the beginning of training, indicating that long sequences are a significant source of instability.
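One way to probe this correlation is to estimate gradient variance across micro-batches. The sketch below is a hypothetical diagnostic, not the paper's measurement code; `loss_fn` and `batches` are assumed to be supplied by the caller:

```python
import torch

def gradient_variance(model: torch.nn.Module, batches, loss_fn) -> float:
    """Estimate gradient variance as the mean per-parameter variance of
    gradients computed independently on several micro-batches."""
    grads = []
    for batch in batches:
        model.zero_grad()
        loss = loss_fn(model, batch)
        loss.backward()
        # Flatten and concatenate all parameter gradients for this batch.
        grads.append(torch.cat([p.grad.flatten()
                                for p in model.parameters()
                                if p.grad is not None]))
    stacked = torch.stack(grads)  # shape: (num_batches, num_params)
    return stacked.var(dim=0).mean().item()
```

Comparing this estimate on batches of short versus long sequences early in training would reflect the effect described above.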
The authors also introduce a lightweight tuning strategy for Sequence Length Warmup that selects its hyperparameters using only a small portion of a full training run.
Experimental results on GPT-2 (117M and 1.5B) and GPT-3 (125M) models demonstrate that the proposed approach enables stable training with significantly larger batch sizes and learning rates compared to the baseline approach. It achieves similar or better zero-shot evaluation results while reducing the required number of training tokens and wall clock time.