Vector Databases

The name sounds cool, but the idea is simple: a vector database is a specialized type of database designed to store, index, and search high-dimensional vectors. These vectors are numerical representations of data such as text, images, audio, or video. In the AI era, they are usually embeddings generated by machine learning models, where the dimensions together encode semantic features of the input (individual dimensions are rarely interpretable on their own). Instead of matching data on exact keywords (as in SQL text search), vector databases support semantic similarity search, finding items that are conceptually close even if they don’t share exact words or identifiers.

Query: “cute small dog”
- Keyword search might miss “adorable puppy”
- Vector search finds it because the embeddings are close in vector space.
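
To make the keyword gap concrete, here is a minimal sketch in plain Python. The `keyword_overlap` helper is a hypothetical name introduced for illustration and is not part of any library; it only counts shared lowercase tokens, which is a simplification of how real keyword search engines work.

```python
def keyword_overlap(query: str, document: str) -> set:
    """Return the lowercase tokens shared by the query and the document."""
    return set(query.lower().split()) & set(document.lower().split())


query = "cute small dog"

# No token is shared, so a purely keyword-based match finds nothing.
print(keyword_overlap(query, "adorable puppy"))

# Shared tokens exist here, so a keyword match succeeds.
print(keyword_overlap(query, "small dog bed on sale"))
```

A vector search sidesteps this by comparing embeddings rather than tokens, so the two phrases can match even with an empty overlap.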

Why Are They Used?

Semantic Search & Retrieval-Augmented Generation (RAG)
- Powering AI assistants to find contextually relevant documents.
Image, Video, and Audio Search
- Finding visually or aurally similar items in huge datasets.
Recommendation Systems
- Matching users with products, music, or content they’re likely to enjoy.
Anomaly Detection
- Identifying unusual patterns in security logs, transactions, or IoT data.
Multi-Modal AI Applications
- Linking text to images, images to audio, etc., via shared embedding spaces.

When Were They Invented?

Late 1990s–2000s (Academic Foundations)
Nearest neighbor search in high-dimensional spaces was researched extensively for computer vision and information retrieval. Tree structures such as KD-trees (which date back to the 1970s) degrade as dimensionality grows, which motivated approximate nearest neighbor (ANN) techniques such as LSH (Locality-Sensitive Hashing) and, later, product quantization.
Indyk and Motwani (1998) introduced much of the theory behind fast approximate similarity search.
2010s (Practical AI Demand)
Deep learning models like Word2Vec (2013) and ResNet (2015) made embeddings mainstream, but existing relational databases weren’t built to handle billions of vectors efficiently.
This led to the emergence of purpose-built systems.
2017–2020 (Modern Vector DBs Appear)
The rise of projects and companies like Milvus (2019), Pinecone (2020), and Weaviate (2019) marked the start of the dedicated vector database era.
These products combined:
- ANN indexes (e.g., HNSW, IVF-PQ), often building on libraries such as Faiss, Annoy, or ScaNN
- Scalable distributed architectures
- API-first designs for AI workflows.
2022–Present (AI Explosion)
The LLM boom supercharged adoption.
Vector databases became the memory layer for Retrieval-Augmented Generation, enabling ChatGPT-like applications to pull from enterprise knowledge bases.

The Core Difference from Relational Databases

| Feature | Relational DB | Vector DB |
| --- | --- | --- |
| Query Type | Exact match, filtering, joins | Approximate nearest neighbor (ANN) search |
| Indexing | B-trees, hash indexes | Graph-based (HNSW), quantization (IVF-PQ) |
| Data Type | Text, numbers, structured data | Dense vectors (float32/float16) |
| Best For | Transactional and structured queries | Similarity search across unstructured data |

Key Players

Open Source: Milvus, Weaviate, Vespa, Qdrant, Chroma
Managed: Pinecone, Azure AI Search (formerly Azure Cognitive Search, vector mode)
Libraries: Faiss (Meta), Annoy (Spotify), ScaNN (Google)

Why They Matter Now

Vector databases are to AI what SQL databases were to business software in the 1980s — the foundational infrastructure that turns theory into production at scale.
As AI continues to integrate into enterprise search, recommendation, and decision-making systems, vector databases will underpin how machines “remember” and “understand” information.

Worked Example: Using a Vector Database for Semantic Search

Scenario

Suppose our product catalog contains the following items:

["red running shoes", "blue hiking boots", "black leather sandals", "green trail sneakers"]

We want a user to be able to search for “scarlet jogging trainers” and get the “red running shoes” result, even though no keywords match.

Step 1: Convert Data into Vectors (Embedding)

We use a pre-trained embedding model, e.g. sentence-transformers/all-MiniLM-L6-v2 (384-dimensional).

red running shoes [0.12, -0.03, 0.45, 0.67, -0.22, ...]
blue hiking boots [0.02, 0.48, 0.35, -0.41, 0.19, ...]
black leather sandals [-0.33, 0.12, 0.14, 0.55, 0.08, ...]
green trail sneakers [0.15, -0.01, 0.51, 0.60, -0.20, ...]

The embedding model transforms text into points in high-dimensional space where semantically similar items are close together.

Step 2: Store in a Vector Database

We insert these vectors into a vector database like Milvus, Pinecone, or Weaviate.

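from pymilvus import connections, CollectionSchema, FieldSchema, DataType, Collection

# Connect before creating collections; host and port assume a local Milvus instance.
connections.connect(alias="default", host="localhost", port="19530")

# Primary key plus a 384-dimensional float vector, matching the embedding model.
id_field = FieldSchema(name="product_id", dtype=DataType.INT64, is_primary=True)
vector_field = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)
schema = CollectionSchema(fields=[id_field, vector_field], description="Product catalog")

collection = Collection("products", schema)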

Step 3: Index for Fast Search

We build an Approximate Nearest Neighbor (ANN) index — e.g., HNSW (Hierarchical Navigable Small World graph).

index_params = {
    "metric_type": "COSINE",  # Milvus documents its metric names in uppercase
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 200}
}
collection.create_index("embedding", index_params)

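M: maximum number of connections per node (graph density); higher values tend to improve recall but use more memory

efConstruction: size of the candidate list during index construction; higher values tend to improve index quality but slow down building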

Step 4: Search with a Query

We embed the user query “scarlet jogging trainers”:

query_vec = [0.11, -0.05, 0.46, 0.68, -0.23, ...]

We then run the ANN search:

# The collection must be loaded into memory before it can be searched.
collection.load()

search_params = {"metric_type": "COSINE", "params": {"ef": 64}}
results = collection.search(data=[query_vec], anns_field="embedding", param=search_params, limit=2)

Step 5: Distance Calculation

The database computes cosine similarity:

Query vs. “red running shoes”: 0.94

Query vs. “green trail sneakers”: 0.88

Query vs. others: <0.60

Result: “red running shoes” is top-ranked.
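
The ranking above can be reproduced in spirit with a few lines of plain Python. This is a minimal sketch, assuming we keep only the five components printed in Steps 1 and 4 rather than the full 384-dimensional embeddings, so the scores it prints will not match the figures quoted above. The `cosine_similarity` helper is written here for illustration and is not the database's own implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: only the first five components shown in the article (assumption).
catalog = {
    "red running shoes": [0.12, -0.03, 0.45, 0.67, -0.22],
    "blue hiking boots": [0.02, 0.48, 0.35, -0.41, 0.19],
    "black leather sandals": [-0.33, 0.12, 0.14, 0.55, 0.08],
    "green trail sneakers": [0.15, -0.01, 0.51, 0.60, -0.20],
}
query_vec = [0.11, -0.05, 0.46, 0.68, -0.23]

# Rank catalog items from most to least similar to the query.
ranked = sorted(
    catalog,
    key=lambda name: cosine_similarity(query_vec, catalog[name]),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {cosine_similarity(query_vec, catalog[name]):.3f}")
```

Even on these truncated vectors, “red running shoes” ranks first and “green trail sneakers” second, which is the same ordering as in the example.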

Step 6: Why This Is Fast

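Without ANN, we’d compare the query to every stored vector, which is O(N) work per query.
With HNSW, we navigate a prebuilt small-world graph and visit only a fraction of the nodes. In practice the search cost grows roughly logarithmically with N, which makes billion-scale search feasible in milliseconds, at the cost of occasionally missing a true nearest neighbor.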
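
For contrast, the O(N) baseline that an ANN index avoids can be sketched as an exact brute-force scan. This is an illustrative sketch on random vectors, not how any particular database implements search; `brute_force_search` and the sizes used (1,000 vectors of 8 dimensions) are assumptions chosen to keep the example small.

```python
import heapq
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query, vectors, k):
    """Exact top-k search: scores every stored vector, so cost grows linearly with N."""
    scored = ((cosine_similarity(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nlargest(k, scored)


rng = random.Random(0)  # fixed seed so the run is repeatable
dim, n = 8, 1000
vectors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
query = [rng.uniform(-1, 1) for _ in range(dim)]

top = brute_force_search(query, vectors, k=3)
for score, idx in top:
    print(idx, round(score, 3))
```

Every query touches all `n` vectors here. An HNSW index trades a small amount of recall for visiting far fewer of them.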
