# Datasets

The data used to train neural language models has always been important, but as the field has evolved it has learned that <mark style="color:green;">more data is not always better</mark>.

### <mark style="color:purple;">Data Quantity and Scaling Laws</mark>

The relationship between model size, training dataset size, and data repetition has historically been considered crucial. Research shows that model performance improves systematically when model size and training data size are scaled up together.

However, recent research has shown that the quality of the data, and how it is structured and ingested into the training process, is just as important.

### <mark style="color:purple;">Ensuring Data Quality</mark>

We have found quality assurance techniques to be indispensable during both the pretraining and fine-tuning phases.

Practices like deduplication, quality filtering, and toxicity filtering serve multiple purposes: they enhance training efficiency, minimise privacy risks, and reduce model memorisation.

The focus on deduplication is particularly noteworthy: by preventing train-test overlap, it yields more trustworthy perplexity scores and more reliable evaluation.
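
As a minimal sketch of the idea, exact deduplication can be done by hashing a normalised form of each document and keeping only the first occurrence. The function names here are illustrative, and production pipelines typically add fuzzy matching (e.g. MinHash) on top of this:

```python
import hashlib

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical documents hash alike."""
    return " ".join(text.lower().split())

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each exact duplicate (after normalisation)."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",   # duplicate after normalisation
    "A different document entirely.",
]
print(deduplicate(corpus))  # two documents remain
```

The same hashing pass can be run over held-out evaluation sets to detect and remove train-test overlap before training begins.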

### <mark style="color:purple;">Domain Composition and Data Diversity</mark>

The diversity and domain composition of datasets are paramount.

Creating heterogeneous dataset compositions ensures models are equipped with a broad range of abilities - reflecting the expanse of human knowledge. Techniques for domain re-weighting and composition highlight the need for training data to be balanced, varied, and inclusive.
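
To make domain re-weighting concrete, here is an illustrative sketch (all names and numbers are hypothetical) that converts target domain proportions into per-document sampling probabilities, so a small but important domain is sampled more often than its raw document count would suggest:

```python
def reweight(domain_counts: dict[str, int],
             target_weights: dict[str, float]) -> dict[str, float]:
    """Return per-document sampling probabilities such that each domain
    contributes its target share of the overall training mixture."""
    total = sum(target_weights.values())
    return {
        domain: (target_weights[domain] / total) / count
        for domain, count in domain_counts.items()
    }

# Hypothetical corpus: web text dwarfs code and books by document count.
counts = {"web": 1_000_000, "code": 50_000, "books": 10_000}
weights = {"web": 0.6, "code": 0.3, "books": 0.1}

probs = reweight(counts, weights)
# Each code document now has a far higher sampling probability than each
# web document, so code contributes 30% of the mixture despite its size.
```

Real mixtures are usually tuned empirically (or learned, as in domain re-weighting methods), but the mechanics reduce to exactly this kind of probability assignment.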

But this only applies to 'general' models - models built for highly specific use cases do not necessarily require this level of diversity.

This section on data explores the history of datasets used for training foundation models, as well as those used for fine-tuning pre-trained models.

