Unstructured Data and Generative AI

Unstructured Data Growth and Challenges

A significant portion of business-relevant information (approximately 80%) is unstructured.

This includes diverse formats like text (emails, reports, customer reviews), audio, video, and data from remote system monitoring.

Despite its abundance, unstructured data is difficult to manage with traditional data tools: its complexity makes searching, analysing, and querying it particularly challenging.

Need for Modern Data Management Platforms

The limitations of legacy data management tools in handling unstructured data necessitate modern platforms capable of integrating unstructured, structured, and semi-structured data.

Such platforms should provide comprehensive data analysis, improve decision-making insights, and possess capabilities like breaking down data silos, offering fast and flexible data processing, and ensuring secure and easy data access.

Big Data Defined

Big Data refers to the enormous volume of data accumulated from various sources like social media, IoT devices, and software applications. It is characterised not just by its volume but by other attributes, commonly known as the "six Vs":

  • Volume: The large quantities of data generated continuously.

  • Variety: The diverse types of data, including structured, unstructured, and semi-structured data.

  • Velocity: The rapid rate at which data is created and needs to be processed.

  • Veracity: The accuracy and reliability of data.

  • Variability: Fluctuations in data flow rates and patterns.

  • Value: The importance of turning data into actionable insights that drive business value.

Impact of Neural Language Models

Neural language models (NLMs) are set to redefine how organisations use data, making interactions more intuitive and enhancing decision-making capabilities:

  • Enhanced Data Interaction: NLMs allow users to interact with data in natural language, making complex data analysis accessible to non-experts and promoting a democratisation of data within organisations.

  • Advanced Textual Analysis: These models can deeply analyse vast amounts of unstructured text to extract detailed insights, which are crucial for understanding market trends, customer feedback, and other qualitative data.

  • Improved Data Governance: NLMs help automate data management tasks, enhancing data quality and governance, thus ensuring data used in machine learning and analytics is clean and reliable.

  • Customised User Experiences: NLMs can personalise data interactions based on the user’s role and expertise level, improving the utility and accessibility of business intelligence tools.

  • Automation of Routine Data Tasks: Automating routine tasks with NLMs frees up resources for more strategic activities, accelerating the data analysis pipeline and enabling quicker insights.

  • Ethical and Responsible AI Use: With the growing reliance on AI and NLMs, ensuring these technologies are used responsibly and ethically is paramount to maintain privacy, security, and fairness in data handling and analysis.

In conclusion, neural language models are poised to significantly alter the landscape of data utilisation in business, enhancing how data-driven insights are generated and applied for operational efficiency and strategic advantage.

Evolution of Data Types and Management Technologies

Structured Data

Initially, data management systems were designed for structured data, which came in predictable formats and fixed schemas. This data was typically stored in table-based data warehouses, and earlier data analyses were mostly confined to this type of structured data.

Semi-Structured Data

The decrease in data storage costs and growth in distributed systems led to a surge in machine-generated, semi-structured data, commonly in formats like JSON and Avro.

Unlike structured data, semi-structured data doesn't fit neatly into tables, but it contains tags or markers that aid processing.
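
As a minimal illustration (the payload below is invented), a semi-structured JSON record carries its own tags, so a parser can navigate it without a predefined table schema:

```python
import json

# An invented, self-describing event: the nested objects and free-text field
# would not fit a fixed relational schema, but the tags guide processing.
raw = '''{
  "event": "sensor_reading",
  "device": {"id": "pump-17", "site": "plant-3"},
  "metrics": {"temperature_c": 71.4, "vibration_mm_s": 2.9},
  "note": "spike observed after maintenance"
}'''

record = json.loads(raw)  # deserialise using the embedded markers
print(record["device"]["id"], record["metrics"]["temperature_c"])
# -> pump-17 71.4
```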

Unstructured Data

Currently, a significant challenge is managing the vast amounts of unstructured data, which is growing rapidly. Estimates suggest that by 2025, 80% of all data will be unstructured, but only a tiny fraction (0.5%) is currently being analysed and used.

Unstructured data, generated predominantly by human interactions, lacks a predefined structure, making it difficult to manage with traditional systems.

Data ingestion

Data ingestion is the process of moving data from one location to another within the data engineering lifecycle.

It involves transferring data from source systems to storage systems as an intermediate step.

Data ingestion differs from data integration, which combines data from various sources into a new dataset. Data pipelines encompass both data ingestion and data integration, along with other data movement and processing patterns.

Source systems and ingestion are often the main bottlenecks in the data engineering lifecycle.

Ingestion can be done in batch or streaming mode

Batch ingestion processes data in large chunks at predetermined intervals or size thresholds, while streaming ingestion provides data in real-time or near real-time to downstream systems.

Choosing between batch and streaming ingestion depends on factors such as downstream system capabilities, the need for real-time data, and specific use cases.
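
As a rough sketch of the two modes (the downstream loader here is a print stub standing in for a real warehouse or stream processor):

```python
from typing import Iterable, List

def load_downstream(records: List[dict]) -> None:
    print(f"loaded {len(records)} record(s)")   # stand-in for a real sink

def batch_ingest(source: Iterable[dict], batch_size: int) -> None:
    """Accumulate records and write them in chunks at a size threshold."""
    batch: List[dict] = []
    for record in source:
        batch.append(record)
        if len(batch) >= batch_size:
            load_downstream(batch)
            batch = []
    if batch:
        load_downstream(batch)                  # flush the final partial chunk

def stream_ingest(source: Iterable[dict]) -> None:
    """Forward each record the moment it arrives (near real-time)."""
    for record in source:
        load_downstream([record])

batch_ingest(({"id": i} for i in range(7)), batch_size=3)   # -> 3, 3, 1
```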

Push and pull are two models of data ingestion

In the push model, source systems write data to a target system, while in the pull model, data is retrieved from the source system.

The line between push and pull can be blurred, as data may be pushed and pulled at different stages of the data pipeline.

The choice between push and pull depends on the ingestion workflow and technologies used, such as ETL (extract, transform, load) or CDC (change data capture).
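
A toy contrast of the two models (in-memory structures stand in for the source and target systems):

```python
from collections import deque

source_outbox = deque([{"id": 1}, {"id": 2}])   # pending data at the source
target_store = []                               # stand-in for the target system

# Push model: the source initiates and writes directly to the target.
def push(record: dict) -> None:
    target_store.append(record)

# Pull model: the target (or an ingestion job) initiates and retrieves data.
def pull():
    return source_outbox.popleft() if source_outbox else None

push({"id": 0})                      # source-initiated
while (rec := pull()) is not None:   # consumer-initiated
    target_store.append(rec)
print(target_store)                  # [{'id': 0}, {'id': 1}, {'id': 2}]
```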

Considerations for ingestion include the ability of downstream systems to handle data flow, the need for real-time ingestion, use cases for streaming, cost and complexity trade-offs, reliability of the ingestion pipeline, appropriate tools and technologies, and the impact on source systems.

When designing an ingestion system, there are several key considerations

Use case: Understand the purpose and intended use of the ingested data.

Data reuse: Determine if the data can be reused to avoid ingesting multiple versions of the same dataset.

Destination: Identify where the data will be stored or sent to for further processing.

Data update frequency: Decide how often the data needs to be updated from the source.

Data volume: Consider the expected volume of data to ensure the ingestion system can handle it efficiently.

Data format: Assess the format of the data and ensure compatibility with downstream storage and transformation processes.

Data quality: Evaluate the quality of the source data and determine if any post-processing or data cleansing is required.

In-flight processing: Determine if the data requires any processing or transformation during the ingestion process, particularly for streaming data sources.

In addition to these considerations, there are several technical aspects to address during the design of an ingestion architecture:

Bounded versus unbounded data: Understand the distinction between data that is finite and complete (bounded) and data that arrives continuously (unbounded). Most business data falls into the unbounded category.

Frequency: Determine the frequency of data ingestion, whether it is batch, micro-batch, or real-time/streaming. Consider the specific use case and the technologies involved.

Serialization and deserialization: Consider the serialization and deserialization process for transferring data between systems, ensuring efficient and accurate data transfer.

Throughput and scalability: Design the ingestion system to handle the expected data throughput and be scalable to accommodate future growth.


Reliability and durability: Ensure the ingestion system is reliable and can handle failures gracefully, while also considering data durability and backup strategies.

Payload: Define the data payload, including metadata and any associated information that needs to be included during data ingestion.

Push versus pull versus poll patterns: Decide on the mechanism for data retrieval, whether it's a push-based approach, a pull-based approach, or a polling mechanism to periodically check for new data.

Overall, data ingestion plays a crucial role in the data engineering lifecycle, and careful consideration of these factors ensures an efficient and robust data ingestion process.

Types of Data Ingestion

Synchronous ingestion involves tightly coupled dependencies between the source, ingestion, and destination, where each stage relies on the completion of the previous stage.

This approach is common in older ETL systems and can lead to long processing times and the need to restart the entire process if any step fails.

In contrast, asynchronous ingestion allows individual events to be processed independently as soon as they are ingested. Dependencies operate at the level of individual events, much as in a microservices architecture. This approach enables parallel processing and faster data availability, with buffering used to absorb rate spikes and prevent data loss.
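
A compact asyncio sketch of the asynchronous pattern: a bounded queue buffers events, and the consumer handles each one as soon as it is available (all components here are in-memory stand-ins):

```python
import asyncio

async def producer(queue: asyncio.Queue) -> None:
    for i in range(5):
        await queue.put({"event_id": i})   # the queue absorbs rate spikes
    await queue.put(None)                  # sentinel: stream finished

async def consumer(queue: asyncio.Queue) -> None:
    while (event := await queue.get()) is not None:
        print("processed", event)          # each event handled independently

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)   # bounded buffer
    await asyncio.gather(producer(queue), consumer(queue))

asyncio.run(main())
```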

Serialization and deserialization are important considerations when moving data between source and destination systems. Data must be properly encoded and prepared for transmission and storage to ensure successful deserialization at the destination.
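
A minimal round trip makes the point (JSON here; formats such as Avro or Protobuf are common in practice):

```python
import json

payload = {"order_id": 42, "amount": 19.99, "currency": "EUR"}

encoded = json.dumps(payload).encode("utf-8")    # serialise for transport
# ... bytes cross the network or land in object storage ...
decoded = json.loads(encoded.decode("utf-8"))    # deserialise at destination

assert decoded == payload                        # the round trip is lossless
```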

Throughput and scalability are critical factors in ingestion systems. Designing systems that can handle varying data volumes and bursts of data generation is essential. Managed services are recommended for handling throughput scaling, as they automate the process and reduce the risk of missing important considerations.

Schema

Schema and data types play a crucial role in data ingestion. Many data payloads have a schema that describes the fields and types of data within those fields. Understanding the underlying schema is essential for making sense of the data and designing effective ingestion processes. APIs and vendor data sources also present schema challenges that data engineers need to address.
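
A deliberately lightweight sketch of schema checking during ingestion (real pipelines typically rely on Avro, JSON Schema, or similar; the field names below are invented):

```python
# Illustrative schema: field name -> expected Python type.
SCHEMA = {"user_id": int, "email": str, "signup_ts": str}

def validate(record: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

print(validate({"user_id": "7", "email": "a@b.c"}, SCHEMA))
# -> ['user_id: expected int', 'missing field: signup_ts']
```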

Data Lakes and Management of Semi-Structured and Unstructured Data

Data lakes have been instrumental in managing semi-structured data.

However, for unstructured data, which encompasses complex formats like images, videos, audio, and various industry-specific file formats (e.g., DICOM, .vcf, .kdf, .hdf5), data lakes are less effective.

Efficient management of unstructured data is crucial as it holds significant potential for customer analytics and marketing intelligence, necessitating new strategies and systems for its handling.

Addressing the challenges of unstructured data management and analytics, particularly in the context of Large Language Models (LLMs) and data lakes, requires a multi-faceted approach that considers data processing, governance, security, and integration with modern data management systems.

Dealing with Unstructured Data Using LLMs

Data Transformation

LLMs can be instrumental in converting unstructured data into a structured format.

For instance, extracting key information from texts, transcribing audio files, and analysing video content. By doing so, LLMs make unstructured data more accessible for traditional analytics tools.
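
A hedged sketch of the pattern: call_llm below is a hypothetical stand-in that returns a canned answer so the example runs offline; in practice it would wrap whichever model API is in use.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stub standing in for a real LLM API call.
    return '{"product": "headphones", "sentiment": "negative", "issue": "battery life"}'

review = "Loved the sound at first, but the battery barely lasts two hours now."

prompt = (
    "Extract the product, sentiment, and main issue from this review "
    "and answer with JSON only:\n" + review
)
structured = json.loads(call_llm(prompt))   # unstructured text -> structured row
print(structured["sentiment"], "-", structured["issue"])
# -> negative - battery life
```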

Enhanced Data Processing

LLMs can streamline the process of analysing unstructured data. They can quickly process large volumes of text, audio, and video files, identifying patterns, sentiments, and key insights that would be time-consuming and computationally intensive to extract using traditional methods.

Data Enrichment

LLMs can augment unstructured data with additional context or metadata, making it easier to integrate with other structured or semi-structured data sources. This enriched data can provide a more comprehensive view when analysed collectively.

Using Data Lakes with LLMs

Data Storage and Access

Data lakes can serve as a centralised repository for all types of data, including unstructured data. When combined with LLMs, data lakes can be transformed into more dynamic resources where data is not only stored but also actively processed and analysed.

Data Integration

Integrating LLMs with data lakes enables more efficient processing of diverse data formats. LLMs can extract and structure relevant information from the data lake, making it usable for various analytical purposes.

Query Performance Improvement

By leveraging LLMs to pre-process and organise data within data lakes, query performance can be significantly improved, reducing problems of poor visibility and inaccessible data.

Are Data Lakes Redundant Once Unstructured Data Is Structured?

Complementary, Not Redundant

Data lakes remain relevant because they provide a scalable and flexible environment for storing massive volumes of diverse data. Even if unstructured data is structured using LLMs, data lakes still play a crucial role in storing and managing this data.

Integrated Ecosystem

A combined ecosystem of data lakes and LLMs facilitates a more efficient data management process. Data lakes store raw data, while LLMs can process and structure this data for advanced analytics.

Challenges and Solutions

Data Governance and Security

Implementing robust governance and security measures is critical. This includes managing permissions, ensuring data privacy compliance (like GDPR), and safeguarding against data breaches. Automated tools and LLMs can help in monitoring and managing these aspects efficiently.

Data Movement Risks

Minimising data movement and duplication is essential to reduce security risks. LLMs can process data in situ within data lakes, reducing the need for multiple copies.

Compliance with Data Privacy Laws

Tools that can identify and manage personal information within unstructured data are essential. LLMs can assist in identifying such data to comply with regulations like the GDPR's "right to be forgotten".
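
As a simplistic stand-in for LLM-assisted detection, even a rule-based scan illustrates the workflow of locating personal data so it can be reviewed, masked, or deleted:

```python
import re

# Toy patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def scan_for_pii(text: str) -> dict:
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

doc = "Contact Jane at jane.doe@example.com or +44 20 7946 0958."
print(scan_for_pii(doc))
# -> {'email': ['jane.doe@example.com'], 'phone': ['+44 20 7946 0958']}
```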

Integration with Existing Systems

Seamless integration of LLMs with existing data management architectures (like data lakes) is crucial to ensure streamlined operations and avoid siloed data.

In conclusion, while LLMs provide powerful tools for transforming and analysing unstructured data, data lakes remain an essential component of the data management ecosystem.

Together, they offer a comprehensive solution for managing the complexities of unstructured data, ensuring efficient processing, governance, and security.

Key Characteristics of an Effective Unstructured Data Management Solution

No Data Silos

A unified platform that supports all data formats (structured, semi-structured, and unstructured) is crucial. This system should enable cloud-agnostic storage and retrieval, allowing for seamless data accessibility across different clouds and regions, while enforcing unified policies.

Fast, Flexible Processing

The solution must have robust processing capabilities to transform, prepare, and enrich unstructured data. It should deliver high performance without manual tuning and handle a large number of users and data without contention. Flexibility in tool selection for data scientists and maintaining a continuous data pipeline are also vital.

Easy, Secure Access

Easy searchability and sharing of unstructured data, possibly through a built-in file catalogue, are important. Implementing scoped access to allow secure sharing without physical data copying or credential sharing is essential.

Governance at Scale with RBAC

Implementing cloud-agnostic Role-Based Access Control (RBAC) to manage access based on user roles is crucial for meeting zero-trust security requirements. This approach simplifies governance and avoids complexities associated with individual cloud provider policies.
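
A toy deny-by-default check conveys the idea (role and permission names are invented for the sketch):

```python
# Illustrative role -> permission mapping.
ROLE_PERMISSIONS = {
    "data_scientist": {"read:raw", "read:curated"},
    "analyst": {"read:curated"},
    "steward": {"read:raw", "read:curated", "write:curated"},
}

def is_allowed(role: str, action: str) -> bool:
    """Deny by default: only explicitly granted permissions pass (zero trust)."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("analyst", "read:curated")
assert not is_allowed("analyst", "write:curated")
```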

Integrating LLMs and Data Lakes

LLMs can enhance the processing of unstructured data within this framework:

Data Transformation and Enrichment: LLMs can convert unstructured data into structured formats, making it more amenable for analysis. They can also enrich data with additional context, improving its usability.

Enhanced Analytics: By integrating LLMs with data lakes, unstructured data can be analysed more effectively, extracting insights that were previously difficult to obtain.

In summary, a modern solution for managing unstructured data should not only focus on storage and processing but also on governance, security, and accessibility, integrating tools like LLMs and data lakes.

This approach ensures that unstructured data becomes a valuable asset for insights and decision-making rather than a cumbersome challenge.

Data Mesh and Data Domains: Definitions and Concepts

Data Mesh

A data mesh is a decentralised approach to data architecture and management that treats data as a product.

It aims to make data more accessible and usable across an organisation by empowering individual domains to manage their own data while adhering to a set of shared principles and standards. The key characteristics of a data mesh include:

  1. Domain-oriented data ownership and architecture

  2. Data as a product

  3. Self-serve data infrastructure as a platform

  4. Federated computational governance

In a data mesh, the responsibility for data management is distributed among domain teams, each responsible for providing high-quality, well-documented, and easily accessible data products to the rest of the organisation.

This approach enables scalability, agility, and improved data quality by leveraging the domain expertise of each team.

Data Domains

Data domains are logical groupings of data based on their business context, such as finance, customer relations, human resources, or supply chain.

Each data domain represents a specific area of the business and contains data that is highly relevant to that area's activities.

Within a data domain, data is managed, maintained, and controlled by domain experts who have the best understanding of its context, meaning, and use cases.

Data domains serve as the building blocks of a data mesh, with each domain responsible for managing its own data as a product, ensuring quality, accessibility, and governance. This decentralised approach allows for greater flexibility, scalability, and faster data-driven decision-making across the organisation.

Impact of Generative AI and AI Agents on Data Mesh and Data Domains

Generative AI and AI agents have the potential to significantly disrupt and augment the concepts of data mesh and data domains. Here are some key ways in which they can influence these architectures:

Enhanced Data Understanding and Insights

Generative AI, particularly large language models (LLMs), can process and analyse vast amounts of unstructured data, extracting valuable insights, patterns, and relationships that might otherwise remain hidden.

By integrating generative AI into data domains, organisations can gain a deeper understanding of their data, enabling more informed decision-making and identifying new opportunities for growth and innovation.

Automated Data Discovery and Cataloguing

AI agents can automatically discover, classify, and catalogue data across various domains, making it easier for users to find and access relevant data products. By leveraging natural language processing and machine learning techniques, these agents can understand the context and semantics of data, enabling more accurate and efficient data discovery and cataloguing processes.
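
As a toy sketch of such a cataloguing pass (the classify stub is a hypothetical stand-in for an agent's semantic classifier):

```python
from pathlib import Path

def classify(text: str) -> str:
    # Hypothetical stub for an AI agent's semantic classifier.
    return "finance" if "invoice" in text.lower() else "general"

def build_catalogue(root: str) -> list:
    """Walk a directory, recording basic metadata plus an inferred domain."""
    catalogue = []
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        catalogue.append({
            "path": str(path),
            "size_bytes": path.stat().st_size,
            "domain": classify(text),
        })
    return catalogue

print(build_catalogue("."))   # catalogue entries for any .txt files found
```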

Intelligent Data Integration and Harmonisation

Generative AI can facilitate the integration and harmonisation of data across different domains, breaking down silos and enabling a more holistic view of the organisation's data assets.

By understanding the relationships and dependencies between data elements, AI agents can automate the process of data integration, ensuring consistency, accuracy, and completeness of data across the mesh.

Augmented Data Governance and Quality

AI agents can support data governance and quality management within a data mesh by continuously monitoring data products, identifying anomalies, and suggesting improvements.

Generative AI can assist in creating and maintaining data documentation, data lineage, and metadata, ensuring that data products are well-described, trustworthy, and compliant with organisational policies and regulations.

Predictive Analytics and Scenario Planning

Generative AI can leverage historical data across multiple domains to enable advanced predictive analytics and scenario planning.

By identifying patterns and trends, AI agents can help organisations anticipate future challenges and opportunities, enabling proactive decision-making and risk management.

Conversational Data Access and Exploration

AI agents powered by generative AI can provide a conversational interface for accessing and exploring data products within a data mesh.

Users can interact with these agents using natural language queries, making it easier for non-technical users to find and utilise relevant data for their specific needs.

By incorporating generative AI and AI agents into data mesh and data domain architectures, organisations can unlock the true potential of their data assets.

These technologies enable a more intelligent, automated, and insight-driven approach to data management, empowering organisations to make faster, better-informed decisions and drive innovation across all areas of the business.

However, it is essential to approach the integration of generative AI and AI agents into data mesh and data domains with care and consideration.

Organisations must ensure that these technologies are implemented in a way that aligns with their overall data strategy, governance framework, and ethical principles.

This includes addressing concerns around data privacy, security, bias, and transparency, and ensuring that the use of AI is guided by clear objectives and measurable outcomes.
