# Vector Databases

Vector databases are a specialised type of database designed for <mark style="color:yellow;">handling vector embeddings</mark>, which are numerical representations of various data objects.

Traditional relational databases store data in rows and columns and index scalar values, so queries are typically resolved by exact matches or range comparisons. Vector databases, in contrast, index embeddings and answer queries by similarity.

Alongside similarity search, they offer capabilities like CRUD (create, read, update, delete) operations, metadata filtering, and horizontal scaling.

### <mark style="color:purple;">Definition and Functionality of Vector Databases</mark>

<mark style="color:green;">**Storage of Vector Embeddings**</mark>

Vector databases store information as vectors, known as vector embeddings. These embeddings represent data objects numerically and are generated by AI models, including large language models.

<mark style="color:green;">**Handling of Unstructured and Semi-Structured Data**</mark>

They are particularly adept at managing massive datasets of unstructured data (like images and text) and semi-structured data (like sensor data).

<mark style="color:green;">**Capabilities Beyond Vector Search Libraries**</mark>

Unlike vector search libraries or indexes, vector databases offer a comprehensive data management solution, including metadata storage, scalability, dynamic data handling, backups, and security features.

<mark style="color:green;">**High-Dimensional Vectors**</mark>

The data in these databases is organised as high-dimensional vectors, where each dimension captures some learned feature of the data object, although individual dimensions are usually not directly interpretable.

### <mark style="color:purple;">Vector Embeddings</mark>

<mark style="color:green;">**Numerical Representation**</mark>

Vector embeddings are numerical representations of objects such as words, images, or other pieces of data.

<mark style="color:green;">**Distance Measurement for Similarity**</mark>

The distance between two vector embeddings, computed with a measure such as cosine similarity or Euclidean distance, tells the database how similar they are: the closer the vectors, the more related the underlying objects. This supports pattern recognition and helps AI systems model relationships between objects.
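As a minimal sketch of how such measures behave, the snippet below computes the dot product, Euclidean distance, and cosine similarity for three toy vectors. The three-dimensional "embeddings" and their values are invented for illustration; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: closer to 1 means more similar."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors with made-up values (assumption: not produced by any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

if __name__ == "__main__":
    print("cosine(cat, kitten):", cosine_similarity(cat, kitten))
    print("cosine(cat, car):   ", cosine_similarity(cat, car))
    print("euclid(cat, kitten):", euclidean_distance(cat, kitten))
    print("euclid(cat, car):   ", euclidean_distance(cat, car))
```

With these values, `cat` and `kitten` score as more similar than `cat` and `car` under both cosine similarity and Euclidean distance.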

### <mark style="color:purple;">Working Mechanism</mark>

1. <mark style="color:green;">**Indexing**</mark><mark style="color:green;">:</mark> Vectors are indexed using techniques like hashing, quantization, or graph-based methods, facilitating efficient searches.
2. <mark style="color:green;">**Querying**</mark><mark style="color:green;">:</mark> Queries are processed by comparing indexed vectors to the query vector using similarity measures like cosine similarity, Euclidean distance, and dot product similarity.
3. <mark style="color:green;">**Post-processing**</mark><mark style="color:green;">:</mark> After the initial query, the results may be filtered by metadata or re-ranked using a different similarity measure.
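The three steps above can be sketched with a toy hashing-based index. This is an illustrative random-hyperplane scheme, not the implementation used by any particular vector database; the class name `HyperplaneIndex`, its methods, and its parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class HyperplaneIndex:
    """Toy locality-sensitive hashing index (hypothetical, for illustration)."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _hash(self, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes
        )

    def add(self, item_id, vec, metadata):
        # Indexing: store the vector in the bucket named by its hash.
        self.buckets[self._hash(vec)].append((item_id, vec, metadata))

    def query(self, vec, k=3, where=None):
        # Querying: only vectors in the query's bucket are candidates,
        # so the search is approximate and can miss true neighbours.
        candidates = self.buckets.get(self._hash(vec), [])
        # Post-processing: optional metadata filter, then exact re-ranking.
        if where is not None:
            candidates = [c for c in candidates if where(c[2])]
        ranked = sorted(candidates, key=lambda c: cosine(vec, c[1]), reverse=True)
        return [(item_id, cosine(vec, v)) for item_id, v, _ in ranked[:k]]


if __name__ == "__main__":
    index = HyperplaneIndex(dim=2)
    index.add("a", [1.0, 0.0], {"lang": "en"})
    index.add("b", [2.0, 0.0], {"lang": "fr"})
    print(index.query([3.0, 0.0]))
    print(index.query([3.0, 0.0], where=lambda m: m["lang"] == "fr"))
```

The trade-off shown here is typical of approximate indexes: restricting the comparison to one bucket avoids scanning every vector, at the cost of occasionally missing a close neighbour that hashed elsewhere.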

### <mark style="color:purple;">Importance of Vector Databases</mark>

<mark style="color:green;">**Specialisation in Unstructured Data Management**</mark>

They are crucial for managing unstructured data, providing capabilities like indexing, distance metrics, and similarity searches.

<mark style="color:green;">**Enabling Advanced AI and ML Applications**</mark>

Vector databases are fundamental in AI and machine learning applications, enabling efficient processing of complex data types.

### <mark style="color:purple;">How Vector Databases Differ from Vector Indices</mark>

While standalone vector indices like Facebook AI Similarity Search (FAISS) improve search and retrieval, they lack comprehensive data management capabilities.

Vector databases, on the other hand, are purpose-built for managing vector embeddings. They offer a suite of features that standalone indices simply don't, including:

* <mark style="color:green;">**Data Management**</mark><mark style="color:green;">:</mark> Easier and more efficient handling of vector data.
* <mark style="color:green;">**Metadata Storage and Filtering**</mark><mark style="color:green;">:</mark> Enhanced query capabilities through metadata.
* <mark style="color:green;">**Scalability**</mark><mark style="color:green;">:</mark> Optimised for growing data volumes and user demands.
* <mark style="color:green;">**Real-time Updates**</mark><mark style="color:green;">:</mark> Ability to dynamically update data.
* <mark style="color:green;">**Backups and Collections**</mark><mark style="color:green;">:</mark> Routine operations for data security and efficiency.
* <mark style="color:green;">**Ecosystem Integration**</mark><mark style="color:green;">:</mark> Seamless compatibility with other AI tools and data processing ecosystems.
* <mark style="color:green;">**Data Security and Access Control**</mark><mark style="color:green;">:</mark> Essential for protecting sensitive information.

### <mark style="color:purple;">Embeddings in Vector Databases</mark>

Embeddings are vectors that capture semantic relationships: related objects are placed close together in the embedding space. This allows similar items to be compared and retrieved based on their conceptual similarity.

### <mark style="color:purple;">Preprocessing Embeddings</mark>

* <mark style="color:green;">**Normalisation**</mark><mark style="color:green;">:</mark> Involves scaling each vector to unit length, which removes the effect of vector magnitude and allows for consistent high-dimensional data handling. For unit-length vectors the dot product equals cosine similarity, so the cheaper dot product can be used in its place.
* <mark style="color:green;">**Standardisation**</mark><mark style="color:green;">:</mark> This process shifts and scales each dimension to have zero mean and unit variance, so that all dimensions contribute on a comparable scale to distance calculations.
* <mark style="color:green;">**Importance**</mark><mark style="color:green;">:</mark> Normalisation is crucial for meaningful dot product results, especially when using various embedding generation tools like OpenAI, PaLM, or SimCSE.
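A minimal sketch of both preprocessing steps, using only the standard library. The function names are illustrative, and `standardise` is applied here to a single dimension across a dataset (one column of values), which is an assumption about how the step would be used.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def standardise(values):
    """Shift and scale one dimension's values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("cannot standardise constant values")
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    a, b = [3.0, 4.0], [1.0, 2.0]
    # After normalisation, the plain dot product matches cosine similarity.
    print(dot(l2_normalise(a), l2_normalise(b)), cosine_similarity(a, b))
    print(standardise([2.0, 4.0, 6.0]))
```

Normalising once at insert time means each query needs only a dot product per candidate, rather than recomputing vector norms on every comparison.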

### <mark style="color:purple;">Defining a Vector Field</mark>

* <mark style="color:green;">**Vector Type**</mark><mark style="color:green;">:</mark> Vector fields are defined with a VECTOR type, with fixed dimensionality determined by the embedding model.
* <mark style="color:green;">**Embedding Model**</mark><mark style="color:green;">:</mark> The choice of embedding model is crucial for creating a structured embedding space where related objects are near each other.
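To illustrate why fixed dimensionality matters, here is a small sketch of a field definition that rejects vectors of the wrong length. `VectorField` and its `validate` method are hypothetical names for this example and do not correspond to any specific database's schema API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorField:
    """Hypothetical schema entry: a named vector column with a fixed dimension."""

    name: str
    dim: int

    def validate(self, vec):
        """Return the vector as floats, or raise if its length does not match."""
        if len(vec) != self.dim:
            raise ValueError(
                f"field '{self.name}' expects {self.dim} dimensions, got {len(vec)}"
            )
        return [float(x) for x in vec]


# The dimension would normally be dictated by the chosen embedding model;
# 3 is used here only to keep the example small.
embedding_field = VectorField(name="embedding", dim=3)

if __name__ == "__main__":
    print(embedding_field.validate([1, 2, 3]))
    try:
        embedding_field.validate([1, 2])
    except ValueError as err:
        print("rejected:", err)
```

Because the dimension is tied to the model, switching embedding models generally means defining a new field and re-embedding the existing data.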

### <mark style="color:purple;">Popular Embedding Models</mark>

* Models such as 'bge-large-en-v1.5', 'distiluse-base-multilingual-cased-v2', and 'glove.6B.300d' are available from sources like Hugging Face or OpenAI. Each produces vectors of a fixed dimensionality (300 for 'glove.6B.300d', for example).

### <mark style="color:purple;">Vector Search Process</mark>

* Vector search involves three steps: creating a collection of embeddings, generating an embedding for the new content, and conducting a similarity search to find the most similar existing content.
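The process can be sketched end to end with an in-memory collection. The `toy_embed` function below is a stand-in that counts words from a tiny fixed vocabulary; it is an assumption made so the example runs without a model, and a real system would call an embedding model instead.

```python
import math
from collections import Counter

# Tiny fixed vocabulary (illustrative only; real models learn their features).
VOCAB = ["cat", "dog", "pet", "car", "engine", "road"]


def toy_embed(text):
    """Stand-in for an embedding model: word counts over VOCAB."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else sum(x * y for x, y in zip(a, b)) / norm


# Step 1: create a collection of embeddings for existing content.
documents = ["the cat is a pet", "the dog is a pet", "the car engine is loud"]
collection = [(doc, toy_embed(doc)) for doc in documents]


def search(query, k=2):
    # Step 2: embed the new content with the same function as the collection.
    query_vec = toy_embed(query)
    # Step 3: rank existing content by similarity (brute force, no index).
    ranked = sorted(collection, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


if __name__ == "__main__":
    print(search("engine on the road", k=1))
    print(search("my pet cat", k=1))
```

The query and the stored content must be embedded with the same model, otherwise their vectors do not live in the same space and the similarity scores are meaningless.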

### <mark style="color:purple;">Best Practices for Vector Search</mark>

* <mark style="color:green;">**Metadata Storage**</mark><mark style="color:green;">:</mark> Store relevant metadata alongside vectors for more contextual searches.
* <mark style="color:green;">**Model Selection**</mark><mark style="color:green;">:</mark> Choose an embedding model based on your data type (text, images, audio, etc.) and query requirements.

### <mark style="color:purple;">Limitations of Vector Search</mark>

* <mark style="color:green;">**Readability**</mark><mark style="color:green;">:</mark> Vector embeddings are not human-readable.
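* <mark style="color:green;">**Direct Retrieval**</mark><mark style="color:green;">:</mark> Vector search is not suited to exact lookups, such as retrieving a specific row from a table by key; a traditional database query is the better tool for that.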
* <mark style="color:green;">**Completeness**</mark><mark style="color:green;">:</mark> There is a risk of incomplete or incorrect results due to model limitations.

### <mark style="color:purple;">Common Use Cases</mark>

* <mark style="color:green;">**Retrieval-Augmented Generation (RAG)**</mark><mark style="color:green;">:</mark> Enhances LLM accuracy by incorporating relevant content into the LLM’s context window.
* <mark style="color:green;">**AI Agents**</mark><mark style="color:green;">:</mark> Capable of performing actions like Google searches and storing the results and their embeddings in a vector database to build up persistent memory.

### <mark style="color:purple;">Overall Implications</mark>

Vector databases revolutionise how we handle and retrieve data in AI and machine learning by *<mark style="color:yellow;">**focusing on the similarity of content rather than traditional relational data structures**</mark>*.

They are instrumental in enhancing the capabilities of large language models (LLMs), recommendation systems, and other AI-driven applications.

However, their effectiveness heavily relies on the proper selection and preprocessing of embeddings, as well as understanding their limitations in comparison to traditional databases.
