# Vector Databases

Vector databases are a specialised type of database designed for <mark style="color:yellow;">handling vector embeddings</mark>, which are numerical representations of various data objects.

Traditional databases store data in rows and columns and index it for exact-match lookups on stored values. Vector databases, in contrast, are designed to store embeddings and to search them by similarity.

They offer capabilities like CRUD operations, metadata filtering, and horizontal scaling.

### <mark style="color:purple;">Definition and Functionality of Vector Databases</mark>

<mark style="color:green;">**Storage of Vector Embeddings**</mark>

Vector databases store information as vectors, known as vector embeddings. These embeddings represent data objects numerically and are generated by AI models, including large language models.

<mark style="color:green;">**Handling of Unstructured and Semi-Structured Data**</mark>

They are particularly adept at managing massive datasets of unstructured data (like images and text) and semi-structured data (like sensor data).

<mark style="color:green;">**Capabilities Beyond Vector Search Libraries**</mark>

Unlike vector search libraries or indexes, vector databases offer a comprehensive data management solution, including metadata storage, scalability, dynamic data handling, backups, and security features.

<mark style="color:green;">**High-Dimensional Vectors**</mark>

The data in these databases is organised as high-dimensional vectors, where each dimension captures a learned feature of the data object. Individual dimensions are usually not directly interpretable on their own.

### <mark style="color:purple;">Vector Embeddings</mark>

<mark style="color:green;">**Numerical Representation**</mark>

Vector embeddings are numerical representations of various subjects like words, images, or data pieces.

<mark style="color:green;">**Distance Measurement for Similarity**</mark>

The distance between vector embeddings (using various mathematical measures) allows the database to determine the similarity between vectors, aiding in pattern recognition and AI's understanding of relationships.
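
To make the idea of distance concrete, here is a minimal sketch of three common measures in plain Python. The three-dimensional vectors and their names (`cat`, `kitten`, `car`) are invented for illustration; real embeddings typically have hundreds or thousands of dimensions and come from a model.

```python
import math

def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Straight-line distance: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: closer to 1 means more similar.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy embeddings, not produced by any real model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten), cosine_similarity(cat, car))
print(euclidean_distance(cat, kitten), euclidean_distance(cat, car))
```

With these toy values, `cat` scores as more similar to `kitten` than to `car` under both cosine similarity and Euclidean distance.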

### <mark style="color:purple;">Working Mechanism</mark>

1. <mark style="color:green;">**Indexing**</mark><mark style="color:green;">:</mark> Vectors are indexed using techniques like hashing, quantization, or graph-based methods, facilitating efficient searches.
2. <mark style="color:green;">**Querying**</mark><mark style="color:green;">:</mark> Queries are processed by comparing indexed vectors to the query vector using similarity measures like cosine similarity, Euclidean distance, and dot product similarity.
3. <mark style="color:green;">**Post-processing**</mark><mark style="color:green;">:</mark> After the initial query, candidates may be re-ranked with a different similarity measure or filtered on metadata.
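
The three steps above can be sketched with a toy hashing-based index. This is a simplified random-hyperplane scheme written for illustration, not the implementation of any particular vector database; the class name `HashIndex`, its parameters, and the sample vectors are all hypothetical.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class HashIndex:
    """Toy index: vectors with the same bit signature share a bucket."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        # Random hyperplanes through the origin, one per signature bit.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = {}  # signature -> list of (item_id, vector)

    def _signature(self, vec):
        # Each bit records which side of a hyperplane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def add(self, item_id, vec):
        # Indexing: place the vector in the bucket for its signature.
        self.buckets.setdefault(self._signature(vec), []).append((item_id, vec))

    def query(self, vec, k=3):
        # Querying: scan only the bucket that matches the query's signature.
        candidates = self.buckets.get(self._signature(vec), [])
        # Post-processing: re-rank the candidates by exact cosine similarity.
        scored = [(item_id, cosine(vec, stored)) for item_id, stored in candidates]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

index = HashIndex(dim=3)
index.add("doc-a", [0.9, 0.1, 0.0])
index.add("doc-b", [0.8, 0.2, 0.1])
index.add("doc-c", [-0.1, -0.9, 0.2])

hits = index.query([0.9, 0.1, 0.0], k=2)
print(hits)
```

Because only one bucket is scanned, the search is approximate: a true neighbour that hashes to a different bucket is missed. That is the usual trade of recall for speed in this family of indexes.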

### <mark style="color:purple;">Importance of Vector Databases</mark>

<mark style="color:green;">**Specialisation in Unstructured Data Management**</mark>

They are crucial for managing unstructured data, providing capabilities like indexing, distance metrics, and similarity searches.

<mark style="color:green;">**Enabling Advanced AI and ML Applications**</mark>

Vector databases are fundamental to AI and machine learning applications, enabling efficient processing of complex data types.

### <mark style="color:purple;">How Vector Databases Differ from Vector Indices</mark>

While standalone vector indices like Facebook AI Similarity Search (FAISS) improve search and retrieval, they lack comprehensive data management capabilities.

Vector databases, on the other hand, are purpose-built for managing vector embeddings. They offer a suite of features that standalone indices simply don't, including:

* <mark style="color:green;">**Data Management**</mark><mark style="color:green;">:</mark> Easier and more efficient handling of vector data.
* <mark style="color:green;">**Metadata Storage and Filtering**</mark><mark style="color:green;">:</mark> Enhanced query capabilities through metadata.
* <mark style="color:green;">**Scalability**</mark><mark style="color:green;">:</mark> Optimised for growing data volumes and user demands.
* <mark style="color:green;">**Real-time Updates**</mark><mark style="color:green;">:</mark> Ability to dynamically update data.
* <mark style="color:green;">**Backups and Collections**</mark><mark style="color:green;">:</mark> Routine operations for data security and efficiency.
* <mark style="color:green;">**Ecosystem Integration**</mark><mark style="color:green;">:</mark> Seamless compatibility with other AI tools and data processing ecosystems.
* <mark style="color:green;">**Data Security and Access Control**</mark><mark style="color:green;">:</mark> Essential for protecting sensitive information.

### <mark style="color:purple;">Embeddings in Vector Databases</mark>

Embeddings are vectors that encode semantic relationships: related objects sit close together in the embedding space. This allows similar items to be compared and retrieved based on their conceptual similarity.

### <mark style="color:purple;">Preprocessing Embeddings</mark>

* <mark style="color:green;">**Normalisation**</mark><mark style="color:green;">:</mark> Scales each vector to unit length, which removes the effect of vector magnitude and keeps comparisons consistent in high-dimensional spaces. For unit-length vectors the dot product equals cosine similarity, so the cheaper dot product can be used instead.
* <mark style="color:green;">**Standardisation**</mark><mark style="color:green;">:</mark> Shifts and scales each dimension to have zero mean and unit variance across the dataset, so that every dimension contributes comparably to distance calculations.
* <mark style="color:green;">**Importance**</mark><mark style="color:green;">:</mark> Normalisation is crucial for meaningful dot product results, especially when working with embeddings from different generation tools such as OpenAI, PaLM, or SimCSE.
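
As a rough illustration of the two preprocessing steps, the sketch below uses plain Python and invented values. It assumes L2 normalisation per vector and population standard deviation for standardisation, applied to the values of one dimension across a dataset; the function names are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalise(vec):
    # Divide by the vector's length so the result has unit length.
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def standardise(values):
    # Shift to zero mean and scale to unit variance (population variance assumed).
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std for x in values]

a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalisation the dot product matches cosine similarity of the originals.
normalised_dot = dot(l2_normalise(a), l2_normalise(b))
print(normalised_dot, cosine(a, b))

# Values of a single dimension across three hypothetical vectors.
scaled = standardise([2.0, 4.0, 6.0])
print(scaled)
```

Normalising once at write time means each query only needs a dot product, which avoids recomputing vector lengths for every comparison.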

### <mark style="color:purple;">Defining a Vector Field</mark>

* <mark style="color:green;">**Vector Type**</mark><mark style="color:green;">:</mark> Vector fields are defined with a VECTOR type, with fixed dimensionality determined by the embedding model.
* <mark style="color:green;">**Embedding Model**</mark><mark style="color:green;">:</mark> The choice of embedding model is crucial for creating a structured embedding space where related objects are near each other.
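
The fixed-dimensionality constraint can be pictured as a check performed on every write. The sketch below is a hypothetical stand-in for a database's VECTOR column type, not a real client API; `VectorField` and the dimension of 4 are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorField:
    """Hypothetical schema entry: a named vector column with a fixed dimension."""
    name: str
    dim: int

    def validate(self, vec):
        # Reject vectors whose length does not match the declared dimension.
        if len(vec) != self.dim:
            raise ValueError(
                f"{self.name}: expected {self.dim} dimensions, got {len(vec)}"
            )
        return [float(x) for x in vec]

# The dimension is dictated by the embedding model; 4 is a toy value.
field = VectorField(name="embedding", dim=4)
stored = field.validate([0.1, 0.2, 0.3, 0.4])
print(stored)
```

Switching to an embedding model with a different output size therefore usually means defining a new field and re-embedding the existing data.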

### <mark style="color:purple;">Popular Embedding Models</mark>

* Various models like 'bge-large-en-v1.5', 'distiluse-base-multilingual-cased-v2', 'glove.6B.300d', etc., are available, each with specific dimensions and sources like Hugging Face or OpenAI.

### <mark style="color:purple;">Vector Search Process</mark>

* Involves creating a collection of embeddings, generating an embedding for new content, and conducting a similarity search to find similar existing content.
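
The process can be sketched end to end in a few lines. The `toy_embed` function below is a word-count stand-in for a real embedding model and is used only so the example runs without external dependencies; the vocabulary and sample texts are invented.

```python
import math

# Hypothetical fixed vocabulary; a real model learns its own representation.
VOCAB = ["cat", "dog", "pet", "car", "engine", "road"]

def toy_embed(text):
    # Stand-in for an embedding model: counts of each vocabulary word.
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

# Step 1: create a collection of embeddings for existing content.
collection = {
    text: toy_embed(text)
    for text in [
        "the cat is a pet",
        "a dog is a loyal pet",
        "the car engine is loud",
    ]
}

def search(query, k=2):
    # Step 2: embed the new content with the same model.
    query_vec = toy_embed(query)
    # Step 3: rank existing content by similarity to the query.
    scored = sorted(
        collection.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

results = search("my pet cat")
print(results)
```

The query and the stored content must be embedded with the same model, otherwise their vectors do not share an embedding space and the similarity scores are meaningless.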

### <mark style="color:purple;">Best Practices for Vector Search</mark>

* <mark style="color:green;">**Metadata Storage**</mark><mark style="color:green;">:</mark> Store relevant metadata alongside vectors for more contextual searches.
* <mark style="color:green;">**Model Selection**</mark><mark style="color:green;">:</mark> Choose an embedding model based on your data type (text, images, audio, etc.) and query requirements.
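
As an illustration of why metadata is worth storing, the sketch below narrows the candidates by metadata before ranking by similarity. The record layout, field names, and `filtered_search` helper are hypothetical; real vector databases expose their own filter syntax.

```python
import math

# Hypothetical records: each vector is stored with descriptive metadata.
records = [
    {"id": "p1", "vector": [0.9, 0.1], "metadata": {"type": "article", "year": 2023}},
    {"id": "p2", "vector": [0.8, 0.3], "metadata": {"type": "image", "year": 2023}},
    {"id": "p3", "vector": [0.7, 0.2], "metadata": {"type": "article", "year": 2021}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(query_vec, filters, k=5):
    # Keep only records whose metadata matches every requested key/value pair.
    matches = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    # Rank the remaining records by similarity to the query.
    matches.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return [r["id"] for r in matches[:k]]

article_ids = filtered_search([1.0, 0.0], {"type": "article"})
print(article_ids)
```

Here the image record is excluded before any similarity is computed, so the results stay within the requested content type.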

### <mark style="color:purple;">Limitations of Vector Search</mark>

* <mark style="color:green;">**Readability**</mark><mark style="color:green;">:</mark> Vector embeddings are not human-readable.
* <mark style="color:green;">**Direct Retrieval**</mark><mark style="color:green;">:</mark> Not suited for direct data retrieval from a table.
* <mark style="color:green;">**Completeness**</mark><mark style="color:green;">:</mark> There is a risk of incomplete or incorrect results due to model limitations.

### <mark style="color:purple;">Common Use Cases</mark>

* <mark style="color:green;">**Retrieval-Augmented Generation (RAG)**</mark><mark style="color:green;">:</mark> Enhances LLM accuracy by incorporating relevant content into the LLM’s context window.
* <mark style="color:green;">**AI Agents**</mark><mark style="color:green;">:</mark> Capable of performing actions like Google searches and storing the results and their embeddings in a vector database to build up persistent memory.

### <mark style="color:purple;">Overall Implications</mark>

Vector databases revolutionise how we handle and retrieve data in AI and machine learning by *<mark style="color:yellow;">**focusing on the similarity of content rather than traditional relational data structures**</mark>*.

They are instrumental in enhancing the capabilities of large language models (LLMs), recommendation systems, and other AI-driven applications.

However, their effectiveness heavily relies on the proper selection and preprocessing of embeddings, as well as understanding their limitations in comparison to traditional databases.

