Datasets
The data used to train neural language models has always been important, but as the field has matured it has become clear that more data is not always better.
Data Quantity and Scaling Laws
The relationship between model size, training dataset size, and data repetition has historically been considered crucial. Research on scaling laws shows that model performance improves systematically when model size and training data size are scaled up together.
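As a rough illustration of this relationship, the widely cited "Chinchilla" result is often summarised as a heuristic of roughly 20 training tokens per model parameter. The sketch below assumes that simplified ratio purely for illustration; it is not a precise reproduction of the published scaling-law fits.

```python
# Minimal sketch of the ~20-tokens-per-parameter heuristic often quoted
# from compute-optimal scaling-law work. The ratio is an approximation,
# not a hard rule, and real budgets depend on compute and data quality.

def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Estimate a compute-optimal training-token budget for a given model size."""
    return n_params * tokens_per_param

for n_params in (1e9, 7e9, 70e9):
    tokens = compute_optimal_tokens(n_params)
    print(f"{n_params / 1e9:>5.0f}B params -> ~{tokens / 1e9:.0f}B tokens")
```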
However, recent research has shown that the quality of data, and how it is structured and ingested into the training process, is just as important.
Ensuring Data Quality
We have found quality assurance techniques to be indispensable during both the pretraining and fine-tuning phases.
Practices like deduplication, quality filtering, and toxicity filtering serve multiple purposes: they enhance training efficiency, minimise privacy risks, and reduce model memorisation.
Deduplication is particularly noteworthy, as it prevents train-test overlap and makes perplexity evaluations more reliable.
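As a minimal illustration (not the pipeline used for any particular model), exact deduplication can be done by hashing a normalised form of each document and discarding repeats; production pipelines typically add near-duplicate detection such as MinHash on top of this.

```python
import hashlib

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())

def deduplicate(documents: list[str]) -> list[str]:
    """Drop exact duplicates by hashing a normalised form of each document."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown  fox jumps over the lazy dog.",  # duplicate after normalisation
    "An entirely different training example.",
]
print(len(deduplicate(corpus)))  # -> 2
```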
Domain Composition and Data Diversity
The diversity and domain composition of datasets are paramount.
Heterogeneous dataset compositions equip models with a broad range of abilities, reflecting the breadth of knowledge they are expected to cover. Techniques for domain re-weighting and composition highlight the need for datasets that are balanced and varied, underscoring the importance of inclusivity in training data.
This is only the case for 'general' models, however; models built for highly specific use cases do not necessarily require this level of diversity.
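To make domain re-weighting concrete, the sketch below samples training examples according to a fixed mixture of domain weights. The domain names and weights are hypothetical and chosen only for illustration; they are not taken from any published training recipe.

```python
import random

# Hypothetical domain mixture: weights are illustrative only.
domain_weights = {"web": 0.6, "code": 0.2, "books": 0.1, "scientific": 0.1}

domain_data = {
    "web": ["web doc 1", "web doc 2"],
    "code": ["code doc 1"],
    "books": ["book doc 1"],
    "scientific": ["paper doc 1"],
}

def sample_batch(batch_size: int, rng: random.Random) -> list[str]:
    """Draw a training batch whose domain proportions follow the mixture weights."""
    domains = list(domain_weights)
    weights = [domain_weights[d] for d in domains]
    batch = []
    for _ in range(batch_size):
        domain = rng.choices(domains, weights=weights, k=1)[0]
        batch.append(rng.choice(domain_data[domain]))
    return batch

print(sample_batch(8, random.Random(0)))
```

Up-weighting or down-weighting domains in this way is one simple lever for balancing a mixture; more elaborate approaches tune the weights against downstream validation performance.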
This section explores the history of datasets used for training foundation models, as well as those used for fine-tuning pre-trained models.