System Architecture Overview

This document describes the Embedding Studio architecture and explains how its components work together to create, fine-tune, and serve embedding models.

High-Level Architecture

Core API Service

The central API service (embedding_studio container) provides:

REST API endpoints for application integration
Plugin management and discovery
Session and clickstream data collection
Task scheduling and coordination

This service acts as the entry point for applications using Embedding Studio and orchestrates the workflow between components.

Worker Services

Fine-Tuning Worker

The fine_tuning_worker container:

Handles model fine-tuning tasks
Runs training jobs for embedding models
Integrates with MLflow for experiment tracking
Requires GPU acceleration for efficient training
Uses the selected plugin's fine-tuning method

Inference Worker

The inference_worker container:

Serves embedding models via Triton Inference Server
Handles real-time embedding generation
Supports model versioning and A/B testing
Provides gRPC and HTTP endpoints
Manages model deployment lifecycle
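
One common way to run an A/B test between two model versions is to assign each session to a version deterministically, so the same session always hits the same model. The sketch below illustrates that idea only. The function name `pick_model_version`, the labels `baseline` and `candidate`, and the 10% default split are hypothetical; the source does not describe how Embedding Studio actually splits traffic.

```python
import hashlib


def pick_model_version(session_id: str, candidate_share: float = 0.1) -> str:
    """Route a session to 'baseline' or 'candidate' (hypothetical labels).

    The session ID is hashed into one of 10,000 buckets, so the same
    session is always routed to the same model version.
    """
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it
    # to a bucket index in the range 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    threshold = round(candidate_share * 10_000)
    return "candidate" if bucket < threshold else "baseline"
```

Hashing rather than random sampling keeps the assignment stable across requests without storing any per-session state.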

Improvement Worker

The improvement_worker container:

Processes incremental vector adjustments
Applies post-training optimizations to embeddings
Handles small improvements without full fine-tuning
Works on embedding quality enhancement
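
To make "incremental vector adjustment" concrete, here is a minimal sketch of one possible approach: moving an item's embedding a small step toward the embedding of a query for which the item was clicked. This is an assumption made for illustration, not the method Embedding Studio implements; `nudge_towards` and its `step` parameter are hypothetical names.

```python
from typing import List, Sequence


def nudge_towards(
    item_vec: Sequence[float],
    query_vec: Sequence[float],
    step: float = 0.1,
) -> List[float]:
    """Move item_vec a fraction `step` of the way toward query_vec.

    Illustrative only: step=0 leaves the vector unchanged and step=1
    replaces it with query_vec.
    """
    if len(item_vec) != len(query_vec):
        raise ValueError("vectors must have the same dimension")
    return [v + step * (q - v) for v, q in zip(item_vec, query_vec)]
```

A small `step` keeps each update cheap and local, which fits the idea of improving embeddings without a full fine-tuning run.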

Upsertion Worker

The upsertion_worker container:

Manages embedding generation for new content
Handles batch processing of items
Updates vector database with new embeddings
Processes deletion and reindexing tasks
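
Batch processing of items usually comes down to grouping an input stream into fixed-size chunks before embedding and writing them. The helper below is a generic sketch of that pattern; `iter_batches` is a hypothetical name and is not taken from the Embedding Studio codebase.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch
```

Because it is a generator, it works on inputs that are too large to hold in memory at once.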

Data Storage

Vector Database

Embedding Studio uses PostgreSQL with the pgvector extension as its primary vector store:

Stores embedding vectors with metadata
Provides fast approximate nearest neighbor search
Supports various distance metrics (cosine, dot product, Euclidean)
Handles index optimization for performance
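
The three distance metrics listed above can be written out in a few lines of plain Python. This sketch is only meant to show what each metric computes; in pgvector the corresponding operators are `<->` (Euclidean distance), `<#>` (negative inner product), and `<=>` (cosine distance). The function names here are illustrative.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus cosine similarity; 0 means the same direction."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)
```

Cosine distance ignores vector length, so it is a common choice when embeddings are not normalized; for unit-length vectors the three metrics produce the same ranking.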

Document Storage

MongoDB is used for storing:

Fine-tuning task metadata
Session and clickstream data
Improvement and upsertion task tracking
Reindexing task management

Model Storage

MLflow, backed by MinIO and MySQL, manages:

Model versioning and artifacts
Training metrics and parameters
Experiment tracking
Model registry for deployment

Queue System

Redis serves as the task queue and provides:

Distributed task scheduling
Worker coordination
Job priority management
Failure handling and retries
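
The sketch below shows how priority ordering and retry-on-failure fit together in a task queue. It is an in-memory stand-in built on `heapq`, not Redis, and it does not reflect the queueing library Embedding Studio actually uses; the class `RetryingTaskQueue` and its parameters are hypothetical.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class RetryingTaskQueue:
    """In-memory priority queue that retries failed tasks (illustrative)."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.failed: List[str] = []
        self._heap: List[Tuple[int, int, str, Callable[[], None], int]] = []
        # The counter breaks ties so equal-priority tasks run in FIFO order.
        self._counter = itertools.count()

    def submit(self, name: str, func: Callable[[], None], priority: int = 10) -> None:
        """Queue a task; a lower priority number runs first."""
        heapq.heappush(self._heap, (priority, next(self._counter), name, func, 1))

    def run_all(self) -> List[str]:
        """Run tasks in priority order and return the names that succeeded."""
        completed: List[str] = []
        while self._heap:
            priority, _, name, func, attempt = heapq.heappop(self._heap)
            try:
                func()
            except Exception:
                if attempt < self.max_attempts:
                    # Re-queue at the same priority with the attempt count bumped.
                    entry = (priority, next(self._counter), name, func, attempt + 1)
                    heapq.heappush(self._heap, entry)
                else:
                    self.failed.append(name)
            else:
                completed.append(name)
        return completed
```

A production queue would typically also add a delay between retries and persist task state so that work survives a worker restart.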

Data Flow

The typical data flow in Embedding Studio follows these stages:

Content Ingestion:
- Content is loaded via data loaders from S3, GCP, or databases
- Documents are preprocessed and split into appropriate chunks
- Initial embeddings are generated using base models
User Interaction:
- Users search or interact with content
- Clickstream data is collected via API endpoints
- Sessions are processed and converted to training signals
Fine-Tuning Process:
- Training data is prepared from user interactions
- Models are fine-tuned using the specified method
- Experiments are tracked in MLflow
- The best model version is selected for deployment
Model Deployment:
- The fine-tuned model is packaged for Triton
- The inference service is updated with the new model
- Content is reindexed with the improved model
- A/B testing may be performed to validate improvements
Search and Retrieval:
- Queries are embedded using the fine-tuned model
- Vector similarity search is performed
- Results are ranked and returned to users
- The cycle continues with new interactions
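
The step where sessions are "converted to training signals" can be pictured as turning each search session into labeled query/item examples. The following sketch assumes a simple rule, clicked results are positives and shown-but-not-clicked results are negatives; the `Session` fields and the function `session_to_examples` are hypothetical and do not describe Embedding Studio's actual session schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Session:
    """Hypothetical minimal view of one search session."""

    query: str
    shown_ids: List[str]
    clicked_ids: List[str]


def session_to_examples(session: Session) -> List[Tuple[str, str, int]]:
    """Return (query, item_id, label) triples; label 1 means clicked."""
    clicked = set(session.clicked_ids)
    return [
        (session.query, item_id, 1 if item_id in clicked else 0)
        for item_id in session.shown_ids
    ]
```

For example, a session for the query "red shoes" that showed items `a`, `b`, and `c` and recorded a click on `b` yields one positive example and two negatives.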

Plugin Integration Points

Embedding Studio's architecture is highly extensible through plugins that can customize:

Data Ingestion: Custom data loaders for specific sources
Text Processing: Specialized text processors and tokenizers
Image Processing: Custom image transformations and models
Fine-Tuning Methods: Application-specific training approaches
Vector Adjustments: Custom embedding improvement techniques
Query Processing: Specialized query understanding and expansion
Search Optimization: Custom ranking and filtering logic
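
A plugin system of this kind is often built around a base class that defines the customizable hooks plus a registry that looks plugins up by name. The sketch below illustrates that general pattern under stated assumptions: `BasePlugin`, `register_plugin`, `PLUGIN_REGISTRY`, and the hook names are all hypothetical and are not Embedding Studio's actual plugin interface.

```python
from typing import Callable, Dict, List, Type


class BasePlugin:
    """Hypothetical base class with two customizable hooks."""

    def load_items(self) -> List[str]:
        raise NotImplementedError

    def preprocess(self, text: str) -> str:
        # Default text processing: leave the text unchanged.
        return text


PLUGIN_REGISTRY: Dict[str, Type[BasePlugin]] = {}


def register_plugin(name: str) -> Callable[[Type[BasePlugin]], Type[BasePlugin]]:
    """Class decorator that records a plugin under a unique name."""

    def decorator(cls: Type[BasePlugin]) -> Type[BasePlugin]:
        if name in PLUGIN_REGISTRY:
            raise ValueError(f"plugin {name!r} is already registered")
        PLUGIN_REGISTRY[name] = cls
        return cls

    return decorator


@register_plugin("lowercase_text")
class LowercaseTextPlugin(BasePlugin):
    """Example plugin that overrides both hooks."""

    def load_items(self) -> List[str]:
        return ["Hello World", "Embedding Studio"]

    def preprocess(self, text: str) -> str:
        return text.strip().lower()
```

With this pattern the core service only needs a plugin name, for example `PLUGIN_REGISTRY["lowercase_text"]()`, and never imports plugin classes directly.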

Resource Requirements

Resource needs vary by component:

Fine-Tuning Worker: Requires GPU acceleration (NVIDIA CUDA)
Inference Worker: Benefits from GPU for high throughput
Vector Database: Needs sufficient memory for index performance
API and Other Workers: CPU-bound, moderate memory requirements

In the next section, we'll explore the environment variables and configuration options that control this architecture.
