# A Comprehensive Survey on Vector Databases

This <mark style="color:blue;">**October 2023**</mark> paper presents a comprehensive review of vector databases, specialised databases designed to store and manage high-dimensional data.

Unlike traditional relational databases, *<mark style="color:yellow;">**vector databases are optimised for handling unstructured data,**</mark>* transforming it into high-dimensional vectors via embedding functions derived from various machine learning models or algorithms.&#x20;

{% embed url="https://arxiv.org/abs/2310.11703" %}
A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, by Yikun Han, Chunjiang Liu and Pengfei Wang
{% endembed %}

### <mark style="color:purple;">Storage in Vector Databases</mark>

In vector databases, data storage and retrieval are optimised for handling high-dimensional vector representations of complex unstructured data, such as images, text, or audio.&#x20;

<mark style="color:green;">**Sharding**</mark>

This process *<mark style="color:yellow;">**splits the database into segments, known as shards,**</mark>* distributed across multiple servers or clusters. There are two main sharding strategies:

* <mark style="color:blue;">**Hash-based sharding**</mark><mark style="color:blue;">:</mark> Here, data is allocated to different shards based on the hash value of a key column or a set of columns. This method ensures even data distribution and helps avoid load imbalances.
* <mark style="color:blue;">**Range-based sharding**</mark><mark style="color:blue;">:</mark> This method assigns data to shards based on value ranges of a key column or a set of columns. It allows efficient querying by enabling data retrieval based on specific shard names or value ranges.

<mark style="color:green;">**Partitioning**</mark>

Similar to sharding, *<mark style="color:yellow;">**partitioning divides the database into smaller, manageable segments,**</mark>* but it typically occurs within a single database system rather than across multiple systems. Two common partitioning methods are:

* <mark style="color:blue;">**Range partitioning**</mark><mark style="color:blue;">:</mark> Data is divided into partitions based on <mark style="color:yellow;">value ranges of a key column</mark>. This method is useful for time-series data or any scenario where data can be segmented into well-defined ranges.
* <mark style="color:blue;">**List partitioning**</mark><mark style="color:blue;">:</mark> Data is grouped into partitions based on the <mark style="color:yellow;">value lists of a key column</mark>. This approach is suitable when the partitioning criteria are categorical or when grouping by specific, discrete values.

<mark style="color:green;">**Caching**</mark>

To *<mark style="color:yellow;">**enhance data retrieval speed,**</mark>* frequently accessed or recently used data is stored in fast, accessible memory. Caching strategies in vector databases include:

* <mark style="color:blue;">**Least Recently Used (LRU) caching**</mark><mark style="color:blue;">:</mark> This policy removes the least recently accessed data when the cache is full, prioritising data that is more likely to be accessed again.
* <mark style="color:blue;">**Partitioned caching**</mark><mark style="color:blue;">:</mark> The cache is divided based on certain criteria, allowing for tailored cache management for different data segments, optimizing resource use and retrieval efficiency.

<mark style="color:green;">**Replication**</mark>

Creating *<mark style="color:yellow;">**multiple copies of data ensures higher availability, durability, and read performance.**</mark>* Two primary replication methods are:

* <mark style="color:blue;">**Leaderless replication**</mark><mark style="color:blue;">:</mark> This method allows any node to handle write and read requests, enhancing scalability and avoiding single points of failure. However, it can lead to consistency challenges that need resolution strategies.
* <mark style="color:blue;">**Leader-follower replication**</mark><mark style="color:blue;">:</mark> One node acts as the leader (handling write operations) and propagates changes to follower nodes. This method ensures strong consistency but requires effective failover strategies to maintain availability in case the leader node fails.

### <mark style="color:purple;">Nearest Neighbour Search</mark>

In the context of vector databases, nearest neighbour search (NNS) is crucial for finding the closest or most similar data points to a given query point, leveraging vector distances or similarities.

<mark style="color:green;">**Brute Force Approach**</mark>

This straightforward method involves scanning every point in the dataset, calculating its distance to the query point, and keeping track of the closest one found so far. While guaranteed to find the exact nearest neighbour, it is computationally expensive: a query requires $$O(n)$$ distance computations for $$n$$ data points, each costing $$O(d)$$ in $$d$$ dimensions.
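
The brute force scan is a one-liner with NumPy; this sketch returns the index of the exact nearest neighbour under Euclidean distance:

```python
import numpy as np

def brute_force_nn(query: np.ndarray, data: np.ndarray) -> int:
    """Exact nearest neighbour: compute all n distances, return the argmin.
    Costs O(n * d) for n points of dimension d."""
    dists = np.linalg.norm(data - query, axis=1)
    return int(np.argmin(dists))
```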

<mark style="color:green;">**Tree-Based Approaches**</mark>

These methods enhance efficiency by structuring data in tree formats, enabling quicker nearest neighbour searches by pruning irrelevant sections of the data space.

<mark style="color:blue;">**KD-tree**</mark><mark style="color:blue;">:</mark> Organises points in k-dimensional space using a binary tree. Each node represents a point that splits the space, reducing the search area by comparing distances within relevant partitions.

<mark style="color:blue;">**Ball-tree**</mark><mark style="color:blue;">:</mark> Groups data points into hyperspheres, making it effective in high-dimensional spaces. It iteratively refines the search to the most promising hypersphere, enhancing search efficiency.

<mark style="color:blue;">**R-tree**</mark><mark style="color:blue;">:</mark> Uses rectangles to encapsulate points, supporting efficient spatial queries. It's particularly beneficial for geographical or multidimensional data, enabling rapid identification of nearest neighbours within defined bounds.

<mark style="color:blue;">**M-tree**</mark><mark style="color:blue;">:</mark> Similar to a ball-tree but supports dynamic updates (insertions and deletions), maintaining an efficient structure for ongoing nearest neighbour queries even as data changes.

These tree-based methods reduce the search space significantly compared to the brute force approach, enhancing efficiency, especially in large and high-dimensional datasets.
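
To make the pruning idea concrete, here is a minimal pure-Python KD-tree sketch (real libraries offer far more optimised implementations): the search descends towards the query, then backtracks into the other side of a split only when the splitting plane is closer than the best match found so far.

```python
import numpy as np

def build_kdtree(points: np.ndarray, depth: int = 0):
    """Recursively split points on alternating axes at the median."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def kdtree_nn(node, query, depth=0, best=None):
    """Descend towards the query, pruning subtrees that cannot
    contain a closer point than the current best."""
    if node is None:
        return best
    point = node["point"]
    if best is None or np.linalg.norm(query - point) < np.linalg.norm(query - best):
        best = point
    axis = depth % len(query)
    near, far = ("left", "right") if query[axis] < point[axis] else ("right", "left")
    best = kdtree_nn(node[near], query, depth + 1, best)
    # Only cross the splitting plane if it could hide a closer point.
    if abs(query[axis] - point[axis]) < np.linalg.norm(query - best):
        best = kdtree_nn(node[far], query, depth + 1, best)
    return best
```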

### <mark style="color:purple;">Approximate Nearest Neighbour Search (ANNS)</mark>

Approximate Nearest Neighbour Search (ANNS) in vector databases allows for quick and efficient searches for data points that are close or similar to a query point, albeit with some margin of error.&#x20;

This is particularly useful in handling vast or high-dimensional datasets where exact nearest neighbour search (NNS) might be computationally intensive.&#x20;

<mark style="color:green;">**Hash-Based Approach**</mark>

* <mark style="color:blue;">**Locality-Sensitive Hashing (LSH):**</mark> Transforms high-dimensional vectors into compact binary codes using hash functions, preserving the locality so that similar vectors have similar codes. This method boosts search speed and reduces memory usage by operating on these codes rather than the original vectors.
* <mark style="color:blue;">**Spectral Hashing:**</mark> Uses spectral graph theory to generate hash functions that minimise quantization error and maximise the variance of binary codes, effective when data points reside on a low-dimensional manifold within a high-dimensional space.
* <mark style="color:blue;">**Deep Hashing:**</mark> Employs deep neural networks to learn hash functions, converting high-dimensional vectors into binary codes. This approach retains semantic information, making it suitable for complex data like images or texts.

<mark style="color:green;">**Tree-Based Approach**</mark>

* <mark style="color:blue;">**Approximate Nearest Neighbours Oh Yeah (Annoy):**</mark> Creates a forest of binary trees to partition the vector space. It assigns vectors to leaf nodes based on random hyperplanes, collecting candidates from the same nodes for distance computations.
* <mark style="color:blue;">**Best Bin First:**</mark> Uses a kd-tree to partition data into bins, then prioritises searching in bins closer to the query point, improving search time and accuracy.
* <mark style="color:blue;">**K-means Tree:**</mark> Clusters data points into a hierarchical structure, where each node represents a cluster. It facilitates efficient search by navigating through the branches likely to contain the nearest neighbours.

These ANNS methods provide a balance between search accuracy and computational efficiency, making them invaluable in large-scale data environments typical of vector databases.&#x20;

They enable applications like image retrieval, recommendation systems, and more, where rapid and efficient data retrieval is crucial.

### <mark style="color:purple;">Graph Based Approach to ANNS</mark>

The graph-based approach to approximate nearest neighbour search (ANNS) in vector databases employs graph structures to efficiently store and retrieve high-dimensional vectors based on their similarity or distance.&#x20;

This approach includes the Navigable Small World (NSW) and Hierarchical Navigable Small World (HNSW) methods, both optimising search processes through graph structures.

#### <mark style="color:green;">**Navigable Small World (NSW)**</mark>

NSW *<mark style="color:yellow;">**creates a graph where each vector is connected to its nearest neighbours**</mark>* and some randomly chosen distant vectors. These connections form shortcuts, enabling rapid traversal of the graph.&#x20;

The addition of each new point involves a random walk to find the nearest neighbour and establish connections based on proximity. This method ensures that similar vectors are interconnected, facilitating a quick nearest neighbour search.
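
The traversal itself is a simple greedy routing: from an entry node, repeatedly hop to whichever neighbour is closest to the query, stopping at a local minimum. A minimal sketch over an adjacency-list graph (not the paper's implementation):

```python
import numpy as np

def greedy_search(graph: dict, vectors: np.ndarray, query: np.ndarray, entry: int) -> int:
    """NSW-style greedy routing: move to the best neighbour until no
    neighbour is closer to the query than the current node."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        improved = False
        for neighbour in graph[current]:
            d = np.linalg.norm(vectors[neighbour] - query)
            if d < current_dist:
                current, current_dist, improved = neighbour, d, True
        if not improved:
            return current
```

The random long-range shortcuts in NSW exist precisely so this greedy walk can cross the graph in few hops instead of crawling through near neighbours only.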

#### <mark style="color:green;">**Hierarchical Navigable Small World (HNSW)**</mark>

An advancement over NSW, *<mark style="color:yellow;">**HNSW constructs a multi-layered graph where each layer operates at a different scale.**</mark>*&#x20;

Points are assigned to layers probabilistically, creating a pyramid-like structure with fewer points at higher levels. Searches begin at the top layer, utilising the hierarchy to quickly narrow down the search space as they descend to the more densely populated lower layers.
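
The pyramid shape comes from an exponentially decaying layer assignment, commonly written as $$\lfloor -\ln(U) \cdot m_L \rfloor$$ for uniform $$U \in (0, 1)$$. A sketch of that rule (the normalisation factor `m_l` is a tunable parameter):

```python
import math
import random

def assign_layer(m_l: float = 1.0) -> int:
    """HNSW-style layer assignment: floor(-ln(U) * mL), so each successive
    layer holds exponentially fewer points than the one below it."""
    return int(-math.log(random.random()) * m_l)
```

With `m_l = 1.0`, roughly 63% of points land only on layer 0, about 23% reach layer 1, and so on, giving the sparse upper layers that make the top-down search fast.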

### <mark style="color:purple;">The</mark> <mark style="color:purple;"></mark><mark style="color:purple;">**Quantization-Based Approach**</mark>&#x20;

The quantization-based approach in vector databases offers a method to compress high-dimensional vectors into compact codes, significantly enhancing the efficiency of approximate nearest neighbour searches.

This approach encompasses three main methods:

#### <mark style="color:green;">**Product Quantization (PQ)**</mark>

PQ divides each high-dimensional vector into several sub-vectors and assigns each to the nearest centroid in a predefined codebook, effectively transforming the vector into a sequence of centroid indices.&#x20;

This process *<mark style="color:yellow;">**significantly reduces the storage requirement and accelerates the search process**</mark>* by comparing these compact codes instead of the full-dimensional vectors. While PQ simplifies the implementation and enhances search efficiency, its effectiveness depends on factors like data dimensionality, the granularity of sub-vector segmentation, and the size of the codebook.
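
A stripped-down encode/decode sketch of PQ follows. In practice the codebooks are learned per sub-space with k-means; here they are assumed to be given, so the example shows only the encoding and reconstruction steps:

```python
import numpy as np

def pq_encode(vectors: np.ndarray, codebooks: list) -> np.ndarray:
    """Split each vector into m sub-vectors and store, per sub-space,
    the index of the nearest codebook centroid."""
    m = len(codebooks)
    subs = np.split(vectors, m, axis=1)
    codes = [
        np.argmin(np.linalg.norm(sub[:, None, :] - cb[None, :, :], axis=2), axis=1)
        for sub, cb in zip(subs, codebooks)
    ]
    return np.stack(codes, axis=1)  # shape (n, m): m small integers per vector

def pq_decode(codes: np.ndarray, codebooks: list) -> np.ndarray:
    """Reconstruct approximate vectors by concatenating the chosen centroids."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])
```

A 128-dimensional float vector compressed with 8 sub-vectors and 256 centroids per codebook shrinks from 512 bytes to just 8 bytes, at the cost of quantization error.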

#### <mark style="color:green;">**Optimized Product Quantization (OPQ)**</mark>

OPQ refines the PQ process by optimising the space decomposition and the codebook to reduce quantization errors, thereby preserving more information from the original vectors.

This optimisation usually involves a rotation of the data space to align it more effectively with the quantization grid, enhancing the discriminative power of the generated codes.&#x20;

The method involves balancing the benefits of a more nuanced space decomposition with the increased computational complexity of the optimisation process.

#### <mark style="color:green;">**Online Product Quantization (O-PQ)**</mark>

O-PQ adapts the PQ framework to dynamic datasets by continuously updating the quantization codebook and the codes as new data arrives.&#x20;

This method is *<mark style="color:yellow;">**particularly relevant for systems dealing with data streams or incremental datasets,**</mark>* maintaining the relevance of the quantization process over time without the need for complete retraining.&#x20;

The adaptability of O-PQ provides a robust framework for evolving datasets but requires careful management of learning and forgetting rates to ensure the timely incorporation of new information while discarding outdated data.
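
The core idea of the online update can be sketched as a moving-average nudge of the assigned centroid towards each arriving point; real O-PQ updates are more involved, and the learning rate here stands in for the learning/forgetting balance described above:

```python
import numpy as np

def online_centroid_update(centroid: np.ndarray, new_point: np.ndarray,
                           learning_rate: float = 0.1) -> np.ndarray:
    """Nudge the matched centroid towards the incoming point so the
    codebook tracks a drifting data stream without full retraining."""
    return centroid + learning_rate * (new_point - centroid)
```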

These quantization methods transform the challenge of searching in a high-dimensional space into a more manageable problem of searching through a set of discrete codes, balancing between search accuracy and computational efficiency.&#x20;

They offer a scalable and flexible approach to handling vast datasets, making them great tools in the domain of vector databases and their applications in various fields such as image retrieval, recommendation systems, and more.

### <mark style="color:purple;">Challenges</mark>

Vector databases, and their integration with large language models (LLMs), present multifaceted technical hurdles:

#### <mark style="color:green;">**Index Construction and Searching of High-Dimensional Vectors**</mark>

* <mark style="color:blue;">**Dimensionality Catastrophe:**</mark> Traditional indexing methods struggle with high-dimensional data due to the "curse of dimensionality," where the distance between points becomes less meaningful, complicating nearest neighbour searches.
* <mark style="color:blue;">**Specialised Techniques:**</mark> Advanced techniques like approximate nearest neighbour (ANN) search, hashing, quantization, and graph-based methods are essential for handling the complexity and improving the search accuracy for vector data.

#### <mark style="color:green;">**Support for Heterogeneous Vector Data Types**</mark>

* <mark style="color:blue;">**Diverse Vector Characteristics:**</mark> Vector databases must accommodate various vector types—dense, sparse, binary, etc.—each with unique traits such as dimensionality and sparsity.
* <mark style="color:blue;">**Adaptive Indexing System:**</mark> A flexible indexing system is required to efficiently manage different vector data types, optimising performance based on the specific characteristics of each data type.

#### <mark style="color:green;">**Distributed Parallel Processing Support**</mark>

* <mark style="color:blue;">**Scalability:**</mark> To manage large-scale vector data, vector databases should leverage distributed computing, distributing the data and processing across multiple nodes or clusters.
* <mark style="color:blue;">**Distributed Processing Challenges:**</mark> This involves overcoming issues related to data partitioning, load balancing, fault tolerance, and maintaining consistency across distributed systems.

#### <mark style="color:green;">**Integration with Mainstream Machine Learning Frameworks**</mark>

* <mark style="color:blue;">**Framework Compatibility:**</mark> Seamless integration with prevalent machine learning frameworks is vital for the generation and utilization of vector embeddings within vector databases.
* <mark style="color:blue;">**APIs and Connectors:**</mark> Providing user-friendly APIs and connectors is essential for ensuring that vector databases can effectively interact with different machine learning frameworks and support various data formats and models.

### <mark style="color:purple;">**Synergy with Large Language Models**</mark>

The paper outlines potential applications of integrating vector databases with large language models (LLMs) and vice versa, highlighting the enhanced capabilities and interactive systems that can emerge from this synergy.&#x20;

#### <mark style="color:green;">**Applications of Vector Databases on LLMs**</mark>

* <mark style="color:blue;">**Long-term Memory**</mark><mark style="color:blue;">:</mark> Vector databases can augment LLMs with a form of long-term memory. They store information in vector form, allowing quick retrieval of relevant or similar vectors when a user interacts with an LLM, thereby enabling the LLM to provide more personalised and informed responses.
* <mark style="color:blue;">**Semantic Search**</mark><mark style="color:blue;">:</mark> These databases empower LLMs to perform semantic searches, where users can query using natural language, and the system retrieves text based on meaning rather than mere keywords, allowing the LLM to provide concise summaries or paraphrases.
* <mark style="color:blue;">**Recommendation Systems**</mark><mark style="color:blue;">:</mark> By analysing vector representations, vector databases can help LLMs in recommending items (like movies) that align with user preferences, providing reasons for recommendations and additional context or reviews.

#### <mark style="color:green;">**Applications of LLMs on Vector Databases**</mark>

* <mark style="color:blue;">**Text Generation**</mark><mark style="color:blue;">:</mark> LLMs can generate texts corresponding to specific vectors from the database, aiding in content creation across various formats and styles.
* <mark style="color:blue;">**Text Augmentation**</mark><mark style="color:blue;">:</mark> LLMs can enhance texts by infusing additional relevant details from the vector database, improving text quality and diversity.
* <mark style="color:blue;">**Text Transformation**</mark><mark style="color:blue;">:</mark> Using vector databases, LLMs can transform texts across languages, domains, or formats, supporting tasks like translation, paraphrasing, and summarization.

<mark style="color:green;">**Retrieval-Based LLM**</mark>

* Defined as a language model that integrates external datastores, particularly beneficial during inference.
* **Advantages**: Such LLMs offer enhanced memory for long-tail knowledge, easy updates without extensive retraining, capabilities for source citation and fact-checking, improved privacy through encryption and anonymization, and cost-effectiveness by using external datastores and efficient retrieval mechanisms.
* **Inference Process**: Involves a complex data flow where user queries trigger retrieval from a diverse datastore, employing algorithms to identify relevant data subsets, facilitating informed and context-rich responses.
