Pre-Training Data
Training Foundation Models
Foundation models underpin the generative AI industry. Since the release of the paper that introduced the Transformer architecture ("Attention Is All You Need", 2017), foundation model development has grown rapidly.
The first public release of a pre-trained large language model based on the Transformer was the 117-million-parameter GPT model from OpenAI. Following this, models of ever-increasing size were developed by many different companies, including OpenAI, Google, Meta, Microsoft, and Nvidia.
Foundation models are trained on extensive corpora, often encompassing billions of documents. Most of this text data has been derived from public sources, but there is an increasing demand for proprietary or private datasets.
The capacity of these models to accurately predict the next element in a sequence is based on the language patterns and context they learn from the datasets on which they are trained.
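To make that objective concrete, the sketch below shows next-token prediction trained with a cross-entropy loss. It is a minimal illustration only: the toy model, vocabulary size, and random token ids are assumptions made for the example, while real foundation models use deep Transformer stacks and far larger vocabularies and corpora.

```python
# Minimal sketch of the next-token prediction objective
# (illustrative toy model, not an actual foundation model).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 100, 32            # toy sizes, chosen only for the example
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),   # token ids -> vectors
    nn.Linear(d_model, vocab_size),      # vectors -> logits over the next token
)

tokens = torch.randint(0, vocab_size, (1, 16))   # stand-in for a tokenised document
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position predicts the next token

logits = model(inputs)                                  # (batch, seq, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size),  # compare predictions...
                       targets.reshape(-1))             # ...against the true next tokens
loss.backward()  # gradients from this loss are what pre-training optimises
```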
These models have been the bedrock of the generative AI revolution, offering both proprietary and increasingly open-source options for a range of applications.
Their development over recent years has underscored the potential of large-scale models to transform various sectors by providing advanced capabilities in language understanding and generation.
Sources of Training Data
The Internet
The Internet has been a rich resource for pre-training LLMs, offering a breadth of linguistic knowledge due to the wide range of content available online.
However, the quality of such data varies greatly, ranging from high-quality sources like Wikipedia to low-quality sources such as spam emails.
This source of data will remain important, but it is critical that it is cleaned to improve its quality, for example with simple heuristic filters such as the sketch below.
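As a rough illustration of what such cleaning can look like, the sketch below applies a minimum-length filter, a prose-to-symbol ratio check, and exact deduplication by content hash. The thresholds and rules are assumptions chosen for the example, not a description of any specific production pipeline.

```python
# Simplified sketch of heuristic web-text cleaning; thresholds are illustrative.
import hashlib

def clean_corpus(documents):
    seen_hashes = set()
    for text in documents:
        text = text.strip()
        # Drop very short pages, which are often navigation stubs or error pages.
        if len(text.split()) < 50:
            continue
        # Drop pages that are mostly symbols or markup rather than prose.
        alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
        if alpha_ratio < 0.8:
            continue
        # Exact deduplication via a content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        yield text

# Usage: kept = list(clean_corpus(raw_documents))
```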
Conversational Text
Conversational text from sources like Reddit or social media platforms is used to improve a model's ability to engage in dialogue and perform question-answering tasks.
The primary issue with this source of data is privacy: conversational text, while posted on public forums, was never intended to be used as training data for artificial intelligence.
Books
Book data provides formal and coherent long texts, contributing to a model's ability to understand complex linguistic structures and dependencies.
Open-source datasets like Books3 and BookCorpus2, both included in The Pile, are common sources for this type of data, which aids LLMs in generating narrative texts and understanding formal language.
The Pile
The most famous source of training data for foundation models is known as "The Pile".
The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.
The primary constituents of "The Pile" are listed below.
| Source | % of Dataset | Details |
| --- | --- | --- |
| Pile-CC | 18.11% | Common Crawl: collection of website crawls, including web pages, metadata, and text extractions. |
| PubMed Central | 14.40% | Subset of the PubMed repository of biomedical articles, with open full-text access. |
| Books3† | 12.07% | Dataset of books derived from the Bibliotik private tracker; a mix of fiction and non-fiction. |
| OpenWebText2 | 10.01% | Web-scraped dataset inspired by WebText and OpenWebTextCorpus, with content sourced from Reddit. |
| ArXiv | 8.96% | Preprint server for research papers, predominantly in mathematics, computer science, and physics. |
| GitHub | 7.59% | Large corpus of open-source code repositories, enabling improvements on code-related tasks. |
| FreeLaw | 6.12% | Access to, and analytical tools for, academic studies in the legal realm from the FreeLaw Project. |
| StackExchange | 5.13% | User-contributed content from a network of question-and-answer websites covering a wide variety of subjects. |
| USPTO Backgrounds | 3.65% | Background sections from patents granted by the US Patent and Trademark Office. |
| PubMed Abstracts | 3.07% | Abstracts from publications in PubMed, covering a wide range of biomedical topics. |
| Gutenberg (PG-19)† | 2.17% | Classic Western literature from Project Gutenberg books published before 1919. |
| OpenSubtitles | 1.55% | English subtitles from movies and TV shows, providing natural dialogue. |
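As an illustration of how such a mixture can be used, the sketch below samples a source dataset for each training document in proportion to the weights in the table above. The weights cover only the constituents listed (so they do not sum to 100%), and the proportional-sampling scheme is a simplifying assumption rather than EleutherAI's actual training recipe.

```python
# Illustrative sketch of mixing The Pile's constituent datasets by weight.
import random

# Weights (% of dataset) for the constituents listed in the table above;
# the remaining smaller components are omitted here.
PILE_WEIGHTS = {
    "Pile-CC": 18.11, "PubMed Central": 14.40, "Books3": 12.07,
    "OpenWebText2": 10.01, "ArXiv": 8.96, "GitHub": 7.59,
    "FreeLaw": 6.12, "StackExchange": 5.13, "USPTO Backgrounds": 3.65,
    "PubMed Abstracts": 3.07, "Gutenberg (PG-19)": 2.17, "OpenSubtitles": 1.55,
}

def sample_sources(n_documents, weights=PILE_WEIGHTS, seed=0):
    """Pick, for each training document, which constituent dataset it comes from."""
    rng = random.Random(seed)
    names = list(weights)
    probs = list(weights.values())
    return rng.choices(names, weights=probs, k=n_documents)

# Over many draws, roughly 18% of documents should come from Pile-CC.
print(sample_sources(5))
```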