A Comprehensive Survey on Vector Databases
This October 2023 paper presents a comprehensive review of vector databases, specialised databases designed to store and manage high-dimensional data.
Unlike traditional relational databases, vector databases are optimised for handling unstructured data, which they transform into high-dimensional vectors via embedding functions derived from various machine learning models or algorithms.
In vector databases, data storage and retrieval are optimised for high-dimensional vector representations of complex unstructured data, such as images, text, or audio.
Sharding
This process distributes the database across multiple servers or clusters, known as shards. There are two main sharding strategies:
Hash-based sharding: Here, data is allocated to different shards based on the hash value of a key column or a set of columns. This method ensures even data distribution and helps avoid load imbalances (a minimal routing sketch follows this list).
Range-based sharding: This method assigns data to shards based on value ranges of a key column or a set of columns. It allows efficient querying by enabling data retrieval based on specific shard names or value ranges.
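To make the hash-based strategy concrete, here is a minimal sketch of shard routing in Python; the function name `shard_for_key` and the vector IDs are illustrative, but the core pattern, a stable hash taken modulo the shard count, is the essence of the technique:

```python
import hashlib

def shard_for_key(key: str, num_shards: int) -> int:
    # Hash the key value with a stable hash, then map it onto a shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Route three hypothetical vector IDs across four shards.
for vec_id in ["vec-001", "vec-002", "vec-003"]:
    print(vec_id, "-> shard", shard_for_key(vec_id, num_shards=4))
```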
Partitioning
Similar to sharding, partitioning divides the database into smaller, manageable segments, but it typically occurs within a single database system rather than across multiple systems. Two common partitioning methods are:
Range partitioning: Data is divided into partitions based on value ranges of a key column. This method is useful for time-series data or any scenario where data can be segmented into well-defined ranges (see the sketch after this list).
List partitioning: Data is grouped into partitions based on the value lists of a key column. This approach is suitable when the partitioning criteria are categorical or when grouping by specific, discrete values.
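Range partitioning can be sketched in a few lines; the quarterly date boundaries below are hypothetical, and a production system would enforce them inside the database engine rather than in application code:

```python
import bisect

# Hypothetical quarterly partitions for time-series data, keyed by ISO dates.
# Each boundary is the exclusive upper bound of the preceding partition.
BOUNDARIES = ["2023-04-01", "2023-07-01", "2023-10-01"]
PARTITIONS = ["2023_q1", "2023_q2", "2023_q3", "2023_q4"]

def partition_for(date: str) -> str:
    # bisect_right counts the boundaries at or below the date,
    # which is exactly the index of the partition it belongs to.
    return PARTITIONS[bisect.bisect_right(BOUNDARIES, date)]

print(partition_for("2023-05-20"))  # -> 2023_q2
```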
Caching
To enhance data retrieval speed, frequently accessed or recently used data is stored in fast, accessible memory. Caching strategies in vector databases include:
Least Recently Used (LRU) caching: This policy removes the least recently accessed data when the cache is full, prioritising data that is more likely to be accessed again (a minimal implementation follows this list).
Partitioned caching: The cache is divided based on certain criteria, allowing tailored cache management for different data segments and optimising resource use and retrieval efficiency.
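The LRU policy is simple enough to sketch directly; this is a toy in-memory cache, not the eviction machinery of any particular vector database:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
```

Partitioned caching would wrap several such caches, one per data segment, each sized to its segment's access pattern.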
Replication
Creating multiple copies of data ensures higher availability, durability, and read performance. Two primary replication methods are:
Leaderless replication: This method allows any node to handle write and read requests, enhancing scalability and avoiding single points of failure. However, it can lead to consistency challenges that need resolution strategies.
Leader-follower replication: One node acts as the leader (handling write operations) and propagates changes to follower nodes. This method ensures strong consistency but requires effective failover strategies to maintain availability in case the leader node fails.
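A toy model of leader-follower propagation illustrates the write path; real systems add acknowledgements, write-ahead logs, and failover, all omitted here:

```python
class Follower:
    """Replica that applies records propagated from the leader."""
    def __init__(self):
        self.log = []

    def replicate(self, record):
        self.log.append(record)

class Leader:
    """Accepts all writes and synchronously propagates them to followers."""
    def __init__(self, followers):
        self.log = []
        self.followers = followers

    def write(self, record):
        self.log.append(record)
        for follower in self.followers:
            follower.replicate(record)  # followers stay consistent with the leader

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write({"id": "vec-001", "vector": [0.1, 0.2]})
assert followers[0].log == leader.log
```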
In the context of vector databases, nearest neighbour search (NNS) is crucial for finding the closest or most similar data points to a given query point, leveraging vector distances or similarities.
Brute Force Approach
This straightforward method scans every point in the dataset, computes its distance to the query point, and keeps the closest one. While guaranteed to find the exact nearest neighbour, it is computationally expensive, with O(n·d) time complexity for n data points of dimensionality d.
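A minimal NumPy version of the brute force search shows why the cost scales with both dataset size and dimensionality:

```python
import numpy as np

def brute_force_nn(data: np.ndarray, query: np.ndarray) -> int:
    # Compute the Euclidean distance from the query to every point,
    # then return the index of the closest one: O(n * d) work overall.
    distances = np.linalg.norm(data - query, axis=1)
    return int(np.argmin(distances))

data = np.random.rand(10_000, 128)  # 10k points, 128 dimensions
query = np.random.rand(128)
print(brute_force_nn(data, query))
```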
Tree-Based Approaches
These methods enhance efficiency by structuring data in tree formats, enabling quicker nearest neighbour searches by pruning irrelevant sections of the data space.
KD-tree: Organises points in k-dimensional space using a binary tree. Each node represents a point that splits the space, reducing the search area by comparing distances within relevant partitions (see the example after this list).
Ball-tree: Groups data points into hyperspheres, making it effective in high-dimensional spaces. It iteratively refines the search to the most promising hypersphere, enhancing search efficiency.
R-tree: Uses rectangles to encapsulate points, supporting efficient spatial queries. It's particularly beneficial for geographical or multidimensional data, enabling rapid identification of nearest neighbours within defined bounds.
M-tree: Similar to a ball-tree but supports dynamic updates (insertions and deletions), maintaining an efficient structure for ongoing nearest neighbour queries even as data changes.
These tree-based methods reduce the search space significantly compared to the brute force approach, enhancing efficiency, especially in large and high-dimensional datasets.
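As an example of the tree-based approach, here is a KD-tree query using SciPy's `cKDTree` (assuming SciPy is installed); the data is random and purely illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

data = np.random.rand(10_000, 3)    # KD-trees work best at low dimensionality
tree = cKDTree(data)                # build the spatial index once

query = np.random.rand(3)
dist, idx = tree.query(query, k=5)  # 5 nearest neighbours and their distances
print(idx, dist)
```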
Approximate Nearest Neighbour Search (ANNS) in vector databases allows for quick and efficient searches for data points that are close or similar to a query point, albeit with some margin of error.
This is particularly useful in handling vast or high-dimensional datasets where exact nearest neighbour search (NNS) might be computationally intensive.
Hash-Based Approach
Locality-Sensitive Hashing (LSH): Transforms high-dimensional vectors into compact binary codes using hash functions, preserving the locality so that similar vectors have similar codes. This method boosts search speed and reduces memory usage by operating on these codes rather than the original vectors (a sketch follows this list).
Spectral Hashing: Uses spectral graph theory to generate hash functions that minimise quantization error and maximise the variance of binary codes, effective when data points reside on a low-dimensional manifold within a high-dimensional space.
Deep Hashing: Employs deep neural networks to learn hash functions, converting high-dimensional vectors into binary codes. This approach retains semantic information, making it suitable for complex data like images or texts.
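The random-hyperplane variant of LSH can be sketched directly in NumPy; the bit width and hyperplanes below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 128, 16
planes = rng.standard_normal((n_bits, dim))   # random hyperplanes

def lsh_code(vectors: np.ndarray) -> np.ndarray:
    # One bit per hyperplane: which side of the plane the vector lies on.
    # Vectors at a small angle to each other tend to share most bits.
    return (vectors @ planes.T > 0).astype(np.uint8)

data = rng.standard_normal((1000, dim))
codes = lsh_code(data)
query_code = lsh_code(rng.standard_normal((1, dim)))

# Candidates: points whose binary codes are closest in Hamming distance.
hamming = (codes != query_code).sum(axis=1)
candidates = np.argsort(hamming)[:10]
print(candidates)
```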
Tree-Based Approach
Approximate Nearest Neighbours Oh Yeah (Annoy): Creates a forest of binary trees to partition the vector space. It assigns vectors to leaf nodes based on random hyperplanes, collecting candidates from the same nodes for distance computations (see the example after this list).
Best Bin First: Uses a kd-tree to partition data into bins, then prioritises searching in bins closer to the query point, improving search time and accuracy.
K-means Tree: Clusters data points into a hierarchical structure, where each node represents a cluster. It facilitates efficient search by navigating through the branches likely to contain the nearest neighbours.
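Using the `annoy` library itself (assuming it is installed via pip), the forest-of-trees workflow looks like this; the dimensionality and tree count are illustrative:

```python
import random
from annoy import AnnoyIndex  # pip install annoy

dim = 40
index = AnnoyIndex(dim, "angular")        # angular metric, akin to cosine distance
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
index.build(10)                           # 10 trees: more trees, better recall

query = [random.gauss(0, 1) for _ in range(dim)]
print(index.get_nns_by_vector(query, 5))  # 5 approximate nearest neighbours
```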
These ANNS methods provide a balance between search accuracy and computational efficiency, making them invaluable in large-scale data environments typical of vector databases.
They enable applications like image retrieval, recommendation systems, and more, where rapid and efficient data retrieval is crucial.
The graph-based approach to approximate nearest neighbour search (ANNS) in vector databases employs graph structures to efficiently store and retrieve high-dimensional vectors based on their similarity or distance.
This approach includes the Navigable Small World (NSW) and Hierarchical Navigable Small World (HNSW) methods, both optimising search processes through graph structures.
NSW creates a graph where each vector is connected to its nearest neighbours and some randomly chosen distant vectors. These connections form shortcuts, enabling rapid traversal of the graph.
The addition of each new point involves a random walk to find the nearest neighbour and establish connections based on proximity. This method ensures that similar vectors are interconnected, facilitating a quick nearest neighbour search.
An advancement over NSW, HNSW constructs a multi-layered graph where each layer operates at a different scale.
Points are assigned to layers probabilistically, creating a pyramid-like structure with fewer points at higher levels. Searches begin at the top layer and use the hierarchy to quickly narrow down the search space as they descend to the more densely populated lower layers.
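In practice HNSW is usually consumed through a library; a minimal sketch with `hnswlib` (assuming it is installed) shows the key knobs: `M` for per-node connectivity, `ef_construction` for build quality, and `ef` for search breadth:

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype("float32")

index = hnswlib.Index(space="l2", dim=dim)
# M controls per-node connectivity; ef_construction trades build time for quality.
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data)
index.set_ef(50)  # search-time breadth: higher ef, better recall, slower queries

labels, distances = index.knn_query(data[:1], k=5)
print(labels)
```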
The quantization-based approach in vector databases offers a method to compress high-dimensional vectors into compact codes, significantly enhancing the efficiency of approximate nearest neighbour searches.
This approach encompasses three main methods:
Product Quantization (PQ) divides each high-dimensional vector into several sub-vectors and assigns each to the nearest centroid in a predefined codebook, effectively transforming the vector into a sequence of centroid indices.
This process significantly reduces the storage requirement and accelerates the search process by comparing these compact codes instead of the full-dimensional vectors. While PQ simplifies the implementation and enhances search efficiency, its effectiveness depends on factors like data dimensionality, the granularity of sub-vector segmentation, and the size of the codebook.
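A minimal PQ example using Faiss (assuming `faiss-cpu` is installed) shows the compression at work: each 128-dimensional float vector is reduced to 8 one-byte centroid indices:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, m, nbits = 128, 8, 8             # 8 sub-vectors, 2**8 centroids per sub-codebook
xb = np.random.rand(10_000, d).astype("float32")

index = faiss.IndexPQ(d, m, nbits)  # each vector compresses to m bytes
index.train(xb)                     # learn the sub-vector codebooks via k-means
index.add(xb)

xq = np.random.rand(1, d).astype("float32")
distances, ids = index.search(xq, 5)
print(ids)
```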
Optimised Product Quantization (OPQ) refines the PQ process by optimising the space decomposition and the codebook to reduce quantization errors, thereby preserving more information from the original vectors.
This optimisation usually involves a rotation of the data space to align it more effectively with the quantization grid, enhancing the discriminative power of the generated codes.
The method involves balancing the benefits of a more nuanced space decomposition with the increased computational complexity of the optimisation process.
Online Product Quantization (O-PQ) adapts the PQ framework to dynamic datasets by continuously updating the quantization codebook and the codes as new data arrives.
This method is particularly relevant for systems dealing with data streams or incremental datasets, maintaining the relevance of the quantization process over time without the need for complete retraining.
The adaptability of O-PQ provides a robust framework for evolving datasets but requires careful management of learning and forgetting rates to ensure the timely incorporation of new information while discarding outdated data.
These quantization methods transform the challenge of searching in a high-dimensional space into a more manageable problem of searching through a set of discrete codes, balancing between search accuracy and computational efficiency.
They offer a scalable and flexible approach to handling vast datasets, making them valuable tools in the domain of vector databases and their applications in fields such as image retrieval and recommendation systems.
Vector databases, along with the integration of these databases with large language models (LLMs), present multifaceted technical hurdles:
Dimensionality Catastrophe: Traditional indexing methods struggle with high-dimensional data due to the "curse of dimensionality," where the distance between points becomes less meaningful, complicating nearest neighbour searches.
Specialised Techniques: Advanced techniques like approximate nearest neighbour (ANN) search, hashing, quantization, and graph-based methods are essential for handling the complexity and improving the search accuracy for vector data.
Diverse Vector Characteristics: Vector databases must accommodate various vector types—dense, sparse, binary, etc.—each with unique traits such as dimensionality and sparsity.
Adaptive Indexing System: A flexible indexing system is required to efficiently manage different vector data types, optimising performance based on the specific characteristics of each data type.
Scalability: To manage large-scale vector data, vector databases should leverage distributed computing, distributing the data and processing across multiple nodes or clusters.
Distributed Processing Challenges: This involves overcoming issues related to data partitioning, load balancing, fault tolerance, and maintaining consistency across distributed systems.
Framework Compatibility: Seamless integration with prevalent machine learning frameworks is vital for the generation and utilisation of vector embeddings within vector databases.
APIs and Connectors: Providing user-friendly APIs and connectors is essential for ensuring that vector databases can effectively interact with different machine learning frameworks and support various data formats and models.
The paper outlines potential applications of integrating vector databases with large language models (LLMs) and vice versa, highlighting the enhanced capabilities and interactive systems that can emerge from this synergy.
Long-term Memory: Vector databases can augment LLMs with a form of long-term memory. They store information in vector form, allowing quick retrieval of relevant or similar vectors when a user interacts with an LLM, thereby enabling the LLM to provide more personalised and informed responses.
Semantic Search: These databases empower LLMs to perform semantic searches, where users can query using natural language, and the system retrieves text based on meaning rather than mere keywords, allowing the LLM to provide concise summaries or paraphrases (a sketch of the retrieval mechanics follows this list).
Recommendation Systems: By analysing vector representations, vector databases can help LLMs in recommending items (like movies) that align with user preferences, providing reasons for recommendations and additional context or reviews.
Text Generation: LLMs can generate texts corresponding to specific vectors from the database, aiding in content creation across various formats and styles.
Text Augmentation: LLMs can enhance texts by infusing additional relevant details from the vector database, improving text quality and diversity.
Text Transformation: Using vector databases, LLMs can transform texts across languages, domains, or formats, supporting tasks like translation, paraphrasing, and summarization.
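As a sketch of the semantic-search mechanics, the snippet below ranks documents by cosine similarity of their embeddings; `toy_embed` is a deterministic placeholder for a real sentence-encoder model, so the ranking here only illustrates the data flow:

```python
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding: a real system would call a sentence-encoder model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)          # unit-normalise for cosine similarity

docs = ["a film about space exploration",
        "a recipe for sourdough bread",
        "how graph indexes speed up vector search"]
doc_vecs = np.stack([toy_embed(d) for d in docs])

query_vec = toy_embed("approximate nearest neighbour graphs")
sims = doc_vecs @ query_vec               # cosine similarity on unit vectors
print(docs[int(np.argmax(sims))])
```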
Retrieval-Based LLM
The paper defines a retrieval-based LLM as a language model that integrates an external datastore, which it queries primarily at inference time.
Advantages: Such LLMs offer enhanced memory for long-tail knowledge, easy updates without extensive retraining, capabilities for source citation and fact-checking, improved privacy through encryption and anonymisation, and cost-effectiveness through external datastores and efficient retrieval mechanisms.
Inference Process: A user query triggers retrieval from the datastore, where search algorithms identify the relevant data subset, and the retrieved context enables informed, context-rich responses.
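That data flow can be sketched as two functions; here `embed` and `llm` are stand-ins for an embedding model and a language model, not references to any specific API:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, store_vecs: np.ndarray,
             store_texts: list, k: int = 3) -> list:
    # Rank the datastore by similarity and keep the k most relevant passages.
    sims = store_vecs @ query_vec
    return [store_texts[i] for i in np.argsort(-sims)[:k]]

def answer(query: str, embed, llm, store_vecs, store_texts) -> str:
    context = "\n".join(retrieve(embed(query), store_vecs, store_texts))
    # Retrieved passages are prepended to the prompt, grounding the response
    # in the datastore rather than in the model's parameters alone.
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```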