This tutorial explains how Embedding Studio integrates with vector databases, specifically focusing on the PostgreSQL/pgvector implementation. You'll learn how vectors are stored, indexed, and queried to power semantic search capabilities.
Embedding Studio uses a flexible plugin-based architecture for storing and retrieving vector embeddings. At its core, the system is designed around these key concepts:
- Collections: Logical groups of vectors with common dimensionality and metric type
- Objects: Items that contain metadata and one or more vector parts
- Object Parts: Individual vector embeddings that represent a piece of an object
- Metrics: Different distance measurements (cosine, dot product, Euclidean)
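To make the metric options concrete, here is a plain-Python sketch of how the three distances compare a pair of vectors (pgvector exposes them as the `<=>`, `<->`, and `<#>` operators; the helper functions below are illustrative, not Embedding Studio code):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity (pgvector's <=> operator)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # L2 distance (pgvector's <-> operator)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    # pgvector's <#> operator returns the NEGATIVE dot product,
    # so that smaller values always mean "more similar"
    return -sum(x * y for x, y in zip(a, b))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine_distance(a, b))     # orthogonal vectors -> distance 1.0
print(euclidean_distance(a, b))  # -> sqrt(2), about 1.414
```

All three conventions return lower values for more similar vectors, which is why they are interchangeable as ordering keys in similarity search.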
The main implementation of the vector database in Embedding Studio uses PostgreSQL with the pgvector extension, providing an enterprise-grade foundation for vector operations.
The vector database system follows a layered architecture:
```
┌─────────────────────────────────────┐
│           Vector Database           │
├─────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐   │
│  │ Collections │  │    Query    │   │
│  │             │  │ Collections │   │
│  └─────────────┘  └─────────────┘   │
├─────────────────────────────────────┤
│        Collection Info Cache        │
├─────────────────────────────────────┤
│ ┌───────────────┐ ┌───────────────┐ │
│ │  PostgreSQL   │ │    MongoDB    │ │
│ │  (pgvector)   │ │  (metadata)   │ │
│ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────┘
```
- VectorDb Interface: Abstract interface defining vector database operations
- PgvectorDb Implementation: PostgreSQL implementation using pgvector
- Collection: Interface for storing and querying vectors
- PgvectorCollection: PostgreSQL-specific collection implementation
- CollectionInfoCache: Manages metadata about collections
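Conceptually, the split between interface and implementation looks like the sketch below. This is not the real class: the method names are taken from the examples later in this tutorial, and the exact signatures in Embedding Studio may differ.

```python
from abc import ABC, abstractmethod

class VectorDb(ABC):
    """Illustrative sketch of the abstract vector database interface."""

    @abstractmethod
    def get_or_create_collection(self, embedding_model):
        """Return the collection for a model, creating it if needed."""

    @abstractmethod
    def get_or_create_query_collection(self, embedding_model):
        """Return the companion collection that stores user query vectors."""

    @abstractmethod
    def list_collections(self):
        """Enumerate all known collections."""

    @abstractmethod
    def set_blue_collection(self, embedding_model_id):
        """Mark one collection as the active ("blue") one."""
```

A concrete backend such as `PgvectorDb` then implements these operations on top of PostgreSQL, while the rest of the system codes only against the interface.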
In Embedding Studio, each object can have multiple vector parts, allowing for fine-grained representation:
```
┌─────────────────────────────────────┐
│               Object                │
│  ID: "product_123"                  │
│  Payload: {                         │
│    "name": "Ergonomic Chair",       │
│    "category": "Office Furniture"   │
│  }                                  │
├─────────────────────────────────────┤
│ ┌─────────────┐  ┌───────────────┐  │
│ │   Part 1    │  │    Part 2     │  │
│ │ ID: "p1"    │  │ ID: "p2"      │  │
│ │ Vector: [...│  │ Vector: [...] │  │
│ └─────────────┘  └───────────────┘  │
└─────────────────────────────────────┘
```
This design enables:
- Storage of chunked text embeddings
- Representation of multi-modal objects
- Part-level similarity search
- Aggregation across parts
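As a toy illustration of the aggregation idea (this is not Embedding Studio's internal code), an `is_average` part can be derived by element-wise averaging of the chunk vectors:

```python
def average_vector(vectors):
    # Element-wise mean of equal-length part vectors, e.g. to build
    # the aggregated part flagged with is_average=True
    dims = len(vectors[0])
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dims)]

title_vec = [0.2, 0.4]
description_vec = [0.6, 0.8]
print(average_vector([title_vec, description_vec]))  # approximately [0.4, 0.6]
```

Searching against the averaged part gives one whole-object score, while searching the individual parts supports part-level matches (e.g. a hit on the description only).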
The vector database uses two primary tables for each collection:
- Object Table (`dbo_{collection_id}`):
  - `object_id`: Primary key
  - `payload`: JSONB for metadata
  - `storage_meta`: JSONB for system metadata
  - `user_id`: Owner identification
  - `original_id`: Reference to origin (for derived objects)
  - `session_id`: Session tracking for queries
- Object Part Table (`dbop_{collection_id}`):
  - `part_id`: Primary key
  - `object_id`: Foreign key to the object
  - `vector`: pgvector type storing the embedding
  - `is_average`: Flag for aggregated vectors
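As a rough illustration, the two tables correspond to DDL along the lines below. The exact schema Embedding Studio generates may differ; the column types and the 1536-dimension vector are assumptions for the example.

```python
# Illustrative DDL only; the generated schema may differ in detail.
object_table_ddl = """
CREATE TABLE dbo_example_collection (
    object_id    TEXT PRIMARY KEY,
    payload      JSONB,
    storage_meta JSONB,
    user_id      TEXT,
    original_id  TEXT,
    session_id   TEXT
);
"""

object_part_table_ddl = """
CREATE TABLE dbop_example_collection (
    part_id    TEXT PRIMARY KEY,
    object_id  TEXT REFERENCES dbo_example_collection (object_id),
    vector     vector(1536),           -- pgvector column type
    is_average BOOLEAN DEFAULT FALSE
);
"""
```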
- PostgreSQL 14+ with pgvector extension
- MongoDB for metadata storage
```python
# Example configuration in app context
from embedding_studio.vectordb.pgvector.vectordb import PgvectorDb
from embedding_studio.db import postgres, mongo

# Initialize the vector database
vectordb = PgvectorDb(
    pg_database=postgres.pg_database,
    embeddings_mongo_database=mongo.embeddings_mongo_database,
    prefix="basic",
)

# Initialize a separate database for categories if needed
categories_vectordb = PgvectorDb(
    pg_database=postgres.pg_database,
    embeddings_mongo_database=mongo.embeddings_mongo_database,
    prefix="categories",
)
```

Collections are created based on embedding models. Each collection is optimized for a specific vector dimensionality and distance metric.
```python
from embedding_studio.models.embeddings.models import EmbeddingModelInfo, MetricType

# Define an embedding model
embedding_model = EmbeddingModelInfo(
    id="text-embeddings-ada-002",
    dimensions=1536,
    metric_type=MetricType.COSINE,
)

# Create or get a collection
collection = vectordb.get_or_create_collection(embedding_model)

# Create a query collection for storing user queries
query_collection = vectordb.get_or_create_query_collection(embedding_model)
```

```python
# List all collections
collections = vectordb.list_collections()

# Set active "blue" collection
vectordb.set_blue_collection("text-embeddings-ada-002")

# Get active collections
blue_collection = vectordb.get_blue_collection()
blue_query_collection = vectordb.get_blue_query_collection()

# Delete a collection
vectordb.delete_collection("old-embedding-model")
```

```python
from embedding_studio.models.embeddings.objects import Object, ObjectPart

# Create an object with multiple vector parts
obj = Object(
    object_id="product_123",
    parts=[
        ObjectPart(part_id="product_123_title", vector=[0.1, 0.2, ...], is_average=False),
        ObjectPart(part_id="product_123_description", vector=[0.3, 0.4, ...], is_average=False),
    ],
    payload={"name": "Ergonomic Chair", "category": "Office Furniture"},
)

# Insert into collection
collection.insert([obj])
```

```python
# Simple vector similarity search
results = collection.find_similarities(
    query_vector=[0.1, 0.2, ...],
    limit=10,
    max_distance=0.3,
)
```

```python
# Search with payload filtering
from embedding_studio.models.payload.models import PayloadFilter, TermQuery

results = collection.find_similarities(
    query_vector=[0.1, 0.2, ...],
    limit=10,
    payload_filter=PayloadFilter(
        query=TermQuery(term={"field": "category", "value": "Office Furniture"})
    ),
)
```

```python
# Advanced search with sorting
from embedding_studio.models.sort_by.models import SortByOptions

results = collection.find_similarities(
    query_vector=[0.1, 0.2, ...],
    limit=10,
    sort_by=SortByOptions(field="payload.popularity", order="desc"),
)
```

Embedding Studio uses HNSW (Hierarchical Navigable Small World) indexes for efficient similarity search.
```python
from embedding_studio.models.embeddings.models import HnswParameters, SearchIndexInfo, MetricType, MetricAggregationType

# Example index configuration from a plugin
def get_search_index_info(self) -> SearchIndexInfo:
    return SearchIndexInfo(
        dimensions=384,
        metric_type=MetricType.COSINE,
        metric_aggregation_type=MetricAggregationType.AVG,
        hnsw=HnswParameters(m=16, ef_construction=96),
    )
```

```python
# Create an index for a collection
collection.create_index()
```

You can customize index parameters in your fine-tuning plugins:
```python
# In your custom fine-tuning method
def get_search_index_info(self) -> SearchIndexInfo:
    return SearchIndexInfo(
        dimensions=1024,  # For E5-large
        metric_type=MetricType.DOT,  # Use dot product
        metric_aggregation_type=MetricAggregationType.MIN,  # Take minimum distance
        hnsw=HnswParameters(
            m=16,  # Graph connections
            ef_construction=128,  # Build-time precision
        ),
    )
```

Embedding Studio provides an optimization framework for vector databases. Optimization strategies can be applied to collections to enhance performance.
```python
from plugins.custom.optimizations.indexes import CreateOrderingIndexesOptimization

# Apply optimizations to collections
vectordb.add_optimization(CreateOrderingIndexesOptimization())
vectordb.apply_optimizations()
```

You can create custom optimizations by subclassing PgvectorObjectsOptimization:
```python
from sqlalchemy import text

from embedding_studio.vectordb.pgvector.optimization import PgvectorObjectsOptimization

class VacuumAnalyzeOptimization(PgvectorObjectsOptimization):
    def __init__(self):
        super().__init__(name="VacuumAnalyzeOptimization")

    def _get_statement(self, tablename: str):
        return text(f"VACUUM ANALYZE {tablename}")
```

```python
from embedding_studio.models.payload.models import PayloadFilter, RangeQuery

# Find objects with popularity score between 4.5 and 5.0
popularity_filter = PayloadFilter(
    query=RangeQuery(
        field="popularity",
        range={"gte": 4.5, "lte": 5.0},
    )
)

results = collection.find_by_payload_filter(
    payload_filter=popularity_filter,
    limit=10,
)
```

```python
from embedding_studio.models.payload.models import PayloadFilter, BoolQuery, TermQuery, MatchQuery

# Complex boolean query
complex_filter = PayloadFilter(
    query=BoolQuery(
        must=[
            TermQuery(term={"field": "category", "value": "Office Furniture"}),
        ],
        should=[
            MatchQuery(match={"field": "description", "value": "ergonomic"}),
            MatchQuery(match={"field": "description", "value": "comfortable"}),
        ],
        must_not=[
            TermQuery(term={"field": "in_stock", "value": False}),
        ],
    )
)

results = collection.find_by_payload_filter(
    payload_filter=complex_filter,
    limit=10,
)
```

PgvectorCollection uses connection pooling and prepared statements to maintain performance:
```python
# Configuration parameters (from settings.py)
SQLALCHEMY_POOL_SIZE = 10     # Number of connections in the pool
SQLALCHEMY_MAX_OVERFLOW = 20  # Max extra connections
SQLALCHEMY_POOL_TIMEOUT = 30  # Seconds to wait for a connection
```

For large-scale operations, use batching:
```python
def batch_upsert(collection, objects, batch_size=1000):
    # Upsert objects in fixed-size chunks to keep transactions small
    for i in range(0, len(objects), batch_size):
        batch = objects[i:i + batch_size]
        collection.upsert(batch)
```

- Use `similarity_first=True` for pure similarity-based ranking
- Apply `max_distance` thresholds to limit search scope
- Use the appropriate `MetricAggregationType` (AVG/MIN) for your use case
Vector DB integration is a key component of the fine-tuning process in Embedding Studio:
- Before Fine-Tuning: Original vectors are stored and indexed
- During Fine-Tuning: Clickstream data is analyzed for query-item pairs
- After Fine-Tuning: Updated vectors are stored with the same object IDs
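Why updated vectors must reuse the same object IDs can be seen in this toy sketch (plain Python, not Embedding Studio code): the "blue" pointer flips between collection generations while callers keep resolving the same IDs.

```python
# Two generations of the same logical collection, keyed by model id
collections = {
    "model-v1": {"product_123": [0.10, 0.20]},  # original vectors
    "model-v2": {"product_123": [0.15, 0.18]},  # fine-tuned vectors, SAME ids
}

blue = "model-v1"  # traffic served by the original model
# ... fine-tuning completes and model-v2 is fully populated ...
blue = "model-v2"  # flip the alias; no caller-side changes needed

print("product_123" in collections[blue])  # True: ids resolve exactly as before
```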
Example from a fine-tuning plugin:
```python
def get_vectordb_optimizations(self) -> List[Optimization]:
    """
    Return a list of vector DB optimizations to apply.
    In this case: index ordering by similarity or freshness.
    """
    return [CreateOrderingIndexesOptimization()]
```

Common issues and solutions:
If you encounter database connection problems:
```python
from sqlalchemy import text

# Check connection health
try:
    with vectordb._pg_database.connect() as conn:
        result = conn.execute(text("SELECT 1")).scalar()
        print(f"Connection test: {result == 1}")
except Exception as e:
    print(f"Connection error: {e}")
```

If similarity searches are slow:
- Check that indexes are created: `collection.get_state_info().index_created`
- Verify index parameters: `collection.get_info().embedding_model.hnsw`
- Consider rebuilding the index: `collection.create_index()`
To debug complex queries:
```python
from sqlalchemy import text

# Enable PostgreSQL query logging for the current session
with vectordb._pg_database.connect() as conn:
    conn.execute(text("SET log_statement = 'all'"))
    # Run your query here, then check the PostgreSQL logs
```

Vector database integration is a core feature of Embedding Studio, providing the foundation for semantic search and recommendation capabilities. Understanding the PostgreSQL/pgvector implementation helps you optimize and extend these capabilities for your specific use cases.
For more information, refer to:
- The `vectordb` module documentation
- The pgvector extension documentation
- PostgreSQL performance tuning guides