# Data Architecture

We believe the impact of neural language models on the architecture of data will be profound.

This stems from the ability of these models to structure data rapidly, and allow its ingestion into databases for future retrieval.

Unstructured data is everywhere. It streams in continuously, and most of it is either useless or unused.

Organisations believe there is value in this data, which is why they collect it in data lakes and data warehouses. The challenge is deriving timely insights from it.

Neural language models can be inserted into the data pipeline and trained to review streaming unstructured data, converting it into a form the organisation can use.
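A minimal sketch of such a pipeline step, under stated assumptions: the `TicketRecord` schema, the sample messages, and the `tickets` table are hypothetical, and `rule_based_extractor` is a simple regex stand-in for the language model call so that the example runs without any external service. In practice the `extract` argument would wrap whichever model the organisation uses.

```python
import re
import sqlite3
from dataclasses import dataclass, astuple
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class TicketRecord:
    """Hypothetical target schema for one customer message."""
    customer_email: Optional[str]
    product: Optional[str]
    sentiment: str


def rule_based_extractor(text: str) -> dict:
    """Stand-in for a language model call; returns the schema fields as a dict."""
    email = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    product = re.search(r"\b(router|modem|handset)\b", text, re.IGNORECASE)
    negative = re.search(r"\b(broken|refund|faulty)\b", text, re.IGNORECASE)
    return {
        "customer_email": email.group(0) if email else None,
        "product": product.group(1).lower() if product else None,
        "sentiment": "negative" if negative else "neutral",
    }


def structure_stream(
    messages: Iterable[str], extract: Callable[[str], dict]
) -> Iterator[TicketRecord]:
    """Convert each unstructured message into a typed record."""
    for text in messages:
        yield TicketRecord(**extract(text))


def load(records: Iterable[TicketRecord], conn: sqlite3.Connection) -> None:
    """Ingest structured records into a table for later retrieval."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tickets "
        "(customer_email TEXT, product TEXT, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO tickets VALUES (?, ?, ?)", [astuple(r) for r in records]
    )
    conn.commit()


# Usage: structure two raw messages, then query them like any other table.
conn = sqlite3.connect(":memory:")
raw = [
    "My router is broken, please email jo@example.com.",
    "Just set up the new modem, all good so far.",
]
load(structure_stream(raw, rule_based_extractor), conn)
rows = conn.execute(
    "SELECT product, sentiment FROM tickets ORDER BY rowid"
).fetchall()
```

Keeping the extractor behind a plain callable means the model can be swapped or retrained without touching the ingestion code.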

### <mark style="color:purple;">What is unstructured data?</mark>

Unstructured data refers to information that either does not have a pre-defined data model or is not organised in a pre-defined manner.

This type of data is usually text-heavy, but may also contain dates, numbers, and facts.

Here are various types of unstructured data commonly encountered, particularly in the context of business and analytics:

<mark style="color:green;">**Social Media Content**</mark>

Posts, tweets, comments, likes, shares, and other forms of engagement on social media platforms are unstructured. They provide valuable insights into consumer behavior, preferences, and trends.

<mark style="color:green;">**Business Documents**</mark>

This includes documents like contracts, legal agreements, financial statements, and technical documentation. They are often text-heavy and lack a uniform format.

<mark style="color:green;">**Sensor Data**</mark>

Data generated by sensors (like IoT devices) can be unstructured, especially when it's in the form of signals or readings that are not organized in a predefined manner.

<mark style="color:green;">**Emails and Communication Records**</mark>

Business communications, customer service emails, and other forms of digital correspondence are usually unstructured and contain valuable insights into business processes, customer relations, and more.

<mark style="color:green;">**Web Pages and Blogs**</mark>

The content on websites and blogs is typically unstructured, containing text, images, links, and sometimes multimedia elements.

<mark style="color:green;">**Customer Feedback and Reviews**</mark>

Customer opinions, feedback forms, survey responses, and product reviews often come in unstructured formats.

<mark style="color:green;">**Scientific and Research Data**</mark>

This can include experiment results, research notes, and other forms of data collected during scientific work, which often lacks a standardized format.

<mark style="color:green;">**Health Records**</mark>

Patient records, doctors' notes, and medical imaging are unstructured and contain critical information for healthcare analysis.

<mark style="color:green;">**Geospatial Data**</mark>

This includes data related to geographic locations, which can be derived from GPS data, satellite imagery, etc.

<mark style="color:green;">**Machine-Generated Data**</mark>

Logs generated by computers, servers, network devices, and other technology infrastructure often contain unstructured data that can be crucial for IT operations and security analysis.

<mark style="color:green;">**Multimedia Data**</mark>

This encompasses images, audio files, and videos. For instance, videos uploaded on social media, audio recordings of customer service calls, or images captured by surveillance cameras. Analysing this data requires specialized techniques like image recognition and audio processing.

### <mark style="color:purple;">Current Data Strategy Not Fit for Purpose</mark>

Many enterprises are currently focusing on the hype surrounding generative AI technologies like large language models (LLMs), vector databases, and retrieval-augmented generation (RAG), while neglecting the foundational data issues that can derail their AI efforts.

We see several key problems that plague enterprise data ecosystems, including:

1. Data silos and <mark style="color:yellow;">poor data discoverability</mark>
2. Inadequate <mark style="color:yellow;">data governance</mark> and audit trails
3. Lack of data lineage, tracing, and <mark style="color:yellow;">observability</mark>
4. Issues with <mark style="color:yellow;">data preparation</mark>, <mark style="color:yellow;">entity resolution</mark>, and <mark style="color:yellow;">data quality</mark>
5. Absence of a clear data monetisation strategy

These issues become even more critical when deploying generative AI applications, as they often involve customer-facing use cases, integrate data from various sources, and introduce new complexities in data governance.

### <mark style="color:purple;">Issues with AI and Data Management</mark>

Generative AI applications heavily rely on the quality and accessibility of data. Poor data infrastructure can lead to delays, inaccuracies, and even data privacy violations.

Data silos and lack of interoperability hinder the development of AI applications that require data from multiple sources.

Inadequate data governance poses serious security and privacy risks, especially for customer-facing AI applications.

Without proper data lineage and observability, troubleshooting data quality issues in real time becomes challenging, which in turn affects the performance of AI applications.

Poor data quality, including issues with entity resolution and PII masking, can negatively impact the accuracy and safety of AI models.
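To make these two terms concrete, here is a small illustrative sketch. It assumes hypothetical record dictionaries with an `email` field, uses deliberately narrow regular expressions for masking, and resolves entities by exact match on a normalised key, which is the simplest deterministic form of entity resolution. Production systems typically need much broader PII detection and fuzzy or model-assisted matching.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def mask_pii(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def resolve_entities(records: List[dict]) -> Dict[str, List[dict]]:
    """Group records that share the same normalised email address."""
    groups: Dict[str, List[dict]] = {}
    for record in records:
        key = record["email"].strip().lower()
        groups.setdefault(key, []).append(record)
    return groups


# Usage: resolve entities on the raw identifiers first, then mask free text
# before it is passed to a model or stored in a shared index.
customers = [
    {"name": "Jo Smith", "email": "Jo@Example.com "},
    {"name": "J. Smith", "email": "jo@example.com"},
    {"name": "Al Jones", "email": "al@example.com"},
]
groups = resolve_entities(customers)
safe_text = mask_pii("Call +61 400 123 456 or mail jo@example.com today")
```

The ordering matters: masking before resolution would destroy the very identifiers the matching step depends on.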

### <mark style="color:purple;">Ideas for Recreating Data Architecture</mark>

1. Establish a central data catalog with a streamlined data sharing process to break down data silos and improve data discoverability.
2. Implement a robust data governance framework with precise permission boundaries, effective enforcement, and comprehensive audit trails.
3. Invest in data lineage, tracing, and observability solutions to enable quick identification and resolution of data quality issues.
4. Prioritize data quality initiatives, including entity resolution, PII masking, and data curation, to ensure the accuracy and safety of AI models.
5. Develop a clear data monetization strategy that aligns with the company's business objectives and identifies core datasets to protect, acquire, or monetize.
6. Create an enterprise-wide RAG service that abstracts the underlying vector databases, embedding services, and retrievers, allowing teams to focus on building AI applications.
7. Establish guidelines for data governance and access control specific to LLM agents, considering their unique characteristics and requirements.
8. Leverage LLMs to assist with data management tasks, such as identity resolution, entity resolution, and improving data pipeline definitions.
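As an illustration of idea 6, the sketch below shows one way an enterprise RAG service could hide its embedding and vector store components behind small interfaces. All class and method names here are hypothetical, and the bag-of-words embedder and in-memory store are toy stand-ins chosen so the example runs with the standard library alone. A real deployment would plug an embedding model and a vector database in behind the same interfaces.

```python
import math
from collections import Counter
from typing import Dict, List, Protocol, Tuple

Vector = Dict[str, float]


class Embedder(Protocol):
    def embed(self, text: str) -> Vector: ...


class VectorStore(Protocol):
    def add(self, doc_id: str, vector: Vector, text: str) -> None: ...
    def search(self, vector: Vector, k: int) -> List[str]: ...


class BagOfWordsEmbedder:
    """Toy embedder: word counts as a sparse vector."""

    def embed(self, text: str) -> Vector:
        return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Toy store: brute-force similarity search over a Python list."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Vector, str]] = []

    def add(self, doc_id: str, vector: Vector, text: str) -> None:
        self._items.append((doc_id, vector, text))

    def search(self, vector: Vector, k: int) -> List[str]:
        ranked = sorted(
            self._items, key=lambda item: cosine(vector, item[1]), reverse=True
        )
        return [text for _, _, text in ranked[:k]]


class RagService:
    """Single entry point for teams; the backends can change underneath it."""

    def __init__(self, embedder: Embedder, store: VectorStore) -> None:
        self._embedder = embedder
        self._store = store

    def index(self, doc_id: str, text: str) -> None:
        self._store.add(doc_id, self._embedder.embed(text), text)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        return self._store.search(self._embedder.embed(query), k)


# Usage: application teams only ever call index() and retrieve().
rag = RagService(BagOfWordsEmbedder(), InMemoryVectorStore())
rag.index("policy-1", "refund policy for faulty routers")
rag.index("hr-7", "office opening hours and holidays")
top = rag.retrieve("what is the refund policy", k=1)
```

Because application code depends only on `RagService`, the governance and access controls described in ideas 2 and 7 could be enforced in one place rather than in every team's application.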

In conclusion, enterprises must prioritize their data strategy and infrastructure to successfully deploy and scale generative AI applications. By addressing the existing data challenges and adapting their data architecture to the specific needs of AI workloads, companies can unlock the full potential of generative AI while mitigating risks and ensuring long-term success.