Vector Database Architecture

Status: Updated October 2025 | Current implementation
Last Validated: October 15, 2025 against actual codebase
Architecture Version: 5-Collection Multi-Embedding with Split Captions


📋 Document Summary

This document describes LaunchAgencyBot's 5-collection multi-embedding vector database architecture, which is designed to improve RAG retrieval quality.

What This Document Covers:

  • 5 separate ChromaDB collections (1 image + 4 caption types)
  • Weighted multi-embedding search strategy with independent weights
  • Split caption system for semantic separation and RAG quality
  • 768-dimensional CLIP embeddings via Replicate API
  • Minimal ChromaDB metadata schema (6 fields only)
  • Token address as universal UUID across all systems


Context Tags: #database #vector-store #chromadb #rag #embeddings #clip #multi-embedding #captions


1. Current Architecture Overview

LaunchAgencyBot uses a 5-collection multi-embedding vector database architecture that combines ChromaDB with file-based data management to support RAG retrieval.

5-Collection Multi-Embedding Architecture

The system maintains 5 separate ChromaDB collections to store different types of embeddings:

  1. meme_image_embeddings - CLIP image embeddings from Base64 images
  2. entity_caption_embeddings - Entity captions (characters, objects, actions) - Weight: 4
  3. context_caption_embeddings - Context captions (cultural/meme references) - Weight: 3
  4. visual_caption_embeddings - Visual style captions (aesthetic, composition) - Weight: 2 (optional)
  5. emotions_caption_embeddings - Emotional tone captions (mood, expressions) - Weight: 1 (optional)

Key Characteristics:

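  • Single unified vector database - Every processed memecoin is stored in the same set of collections (the optional visual and emotions collections only when those captions exist); there is no separate pending/confirmed database pair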
  • Token address as universal UUID - Blockchain addresses used consistently across all systems
  • Weighted multi-embedding search - Different embeddings contribute with different weights (4/3/2/1)
  • Vector dimensions: 768 (OpenAI CLIP embeddings via Replicate API)
  • Location: res/memecoins/rag_db/ (ChromaDB persistent storage)
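
The collection layout and weights listed above can be captured in a small lookup table. This is a minimal sketch only: the constant names `EMBEDDING_DIM`, `IMAGE_COLLECTION`, `CAPTION_WEIGHTS`, and `ALL_COLLECTIONS` are illustrative assumptions, not names taken from the codebase.

```python
# Illustrative constants only; the names are assumptions, not taken from the codebase.
EMBEDDING_DIM = 768  # CLIP embedding size shared by every collection

IMAGE_COLLECTION = "meme_image_embeddings"  # searched independently, no caption weight

# Caption collection name -> search weight (values from the list above)
CAPTION_WEIGHTS = {
    "entity_caption_embeddings": 4,
    "context_caption_embeddings": 3,
    "visual_caption_embeddings": 2,    # optional caption
    "emotions_caption_embeddings": 1,  # optional caption
}

# The image collection plus the four caption collections
ALL_COLLECTIONS = [IMAGE_COLLECTION, *CAPTION_WEIGHTS]
```

Keeping the weights next to the collection names in one mapping means a weight change would only need to happen in one place.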

3-Tier Storage System

The architecture implements a 3-tier workflow for quality control and processing:

| Tier | Location | Purpose | Storage Format | Status |
|---|---|---|---|---|
| Tier 1: Pending | res/memecoins/pending/ | Manual curation queue | JSON + JPG | Filesystem-only |
| Tier 2: Approved | res/memecoins/approved/ | Ready for AI processing | JSON + JPG | Filesystem-only |
| Tier 3: Vector DB | res/memecoins/rag_db/ | Fully processed with embeddings | ChromaDB + files | 5 collections |

Workflow Stages:

Pending → Approved → Vector DB
   ↓         ↓           ↓
Delete    Delete    Degrade (back to Approved)
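
The workflow diagram above can be read as a small state machine. The sketch below encodes it as a transition table; the `Tier` enum, the action names (`approve`, `process`, `delete`, `degrade`), and `next_tier` are hypothetical names chosen for illustration and are not confirmed by the codebase.

```python
from enum import Enum


class Tier(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    VECTOR_DB = "rag_db"


# Allowed moves, read off the workflow diagram. None means the entry is deleted.
# The action names are assumptions made for this sketch.
TRANSITIONS = {
    (Tier.PENDING, "approve"): Tier.APPROVED,
    (Tier.PENDING, "delete"): None,
    (Tier.APPROVED, "process"): Tier.VECTOR_DB,
    (Tier.APPROVED, "delete"): None,
    (Tier.VECTOR_DB, "degrade"): Tier.APPROVED,  # degrade goes back to Approved
}


def next_tier(current: Tier, action: str):
    """Return the tier after `action`, or None if the entry is deleted."""
    key = (current, action)
    if key not in TRANSITIONS:
        raise ValueError(f"{action!r} is not allowed from {current.name}")
    return TRANSITIONS[key]
```

For example, `next_tier(Tier.VECTOR_DB, "degrade")` returns `Tier.APPROVED`, while asking to degrade a pending entry raises `ValueError`.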

See Also: DATABASE_STORAGE.md - Complete degradation workflow details

Hybrid Storage Strategy

  • ChromaDB Collections: Store 768-dimensional CLIP embeddings + minimal metadata
  • Image Files: res/memecoins/rag_db/images/{token_address}.jpg (512x512px JPG)
  • Metadata Files: res/memecoins/rag_db/metadata/{token_address}.json (complete metadata including captions)

See Also: DATABASE_STORAGE.md - File management implementation details


2. Collection Architecture Deep Dive

Collection Purposes and Weights

Implementation: src/vector_store/memecoin_store.py:42-47

Each collection serves a specific semantic purpose in the RAG system:

| Collection | Content | Purpose | Weight | Example | Use Case / Rationale |
|---|---|---|---|---|---|
| meme_image_embeddings | CLIP embeddings of Base64 images | Visual similarity | Independent | N/A | Find visually similar memecoins |
| entity_caption_embeddings | Characters, objects, actions | "What is in the image" | 4 (highest) | "Shiba Inu dog wearing space suit, holding rocket, on moon" | Most concrete aspect, critical for matching characters/objects |
| context_caption_embeddings | Cultural/meme references | "What does it mean" | 3 (high) | "Reference to Doge meme, 'to the moon' expression, satirical crypto hype" | Critical for understanding meme context |
| visual_caption_embeddings | Aesthetic style, composition | "How does it look" | 2 (medium) | "Pixel art, neon colors, centered composition, Comic Sans, retro 8-bit" | Style matching (optional if unremarkable) |
| emotions_caption_embeddings | Emotional tone, mood | "How does it feel" | 1 (low) | "Playful and humorous, optimistic, ironic, lighthearted absurdism" | Supplementary (optional if neutral) |

Weighted Search Strategy

Implementation: src/vector_store/memecoin_store.py:410-494

The system performs weighted multi-embedding search to find the most relevant memecoins:

# Query all 4 caption collections; each contributes with a different weight
entity_results = entity_collection.query(query_embeddings=[query_embedding], n_results=n_results)      # weight: 4
context_results = context_collection.query(query_embeddings=[query_embedding], n_results=n_results)    # weight: 3
visual_results = visual_collection.query(query_embeddings=[query_embedding], n_results=n_results)      # weight: 2
emotions_results = emotions_collection.query(query_embeddings=[query_embedding], n_results=n_results)  # weight: 1

# Combine the per-token similarities taken from each result set with weighted scoring
final_score = (
    (entity_similarity * 4) + (context_similarity * 3)
    + (visual_similarity * 2) + (emotions_similarity * 1)
)

Benefits:

  • Precision: Entity and context captions (weights 4+3=7) dominate the search
  • Flexibility: Visual and emotion embeddings (weights 2+1=3) provide nuance
  • Robustness: Optional embeddings don't penalize memecoins that lack them
  • Semantic Separation: Each aspect contributes independently to the final match
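
To make the weighted combination concrete, here is a minimal, self-contained sketch of score fusion. It is not the implementation in `memecoin_store.py`. In particular, dividing by the weights of the collections a token actually appears in is an assumption, chosen as one way to keep a missing optional caption from lowering a score. The names `WEIGHTS` and `fuse_scores` are hypothetical.

```python
from collections import defaultdict

# Caption type -> weight, as described above
WEIGHTS = {"entity": 4, "context": 3, "visual": 2, "emotions": 1}


def fuse_scores(per_collection: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Combine per-collection similarities into one ranked list.

    `per_collection` maps a caption type to {token_address: similarity}.
    ASSUMPTION: each token's score is normalized by the total weight of the
    collections it appears in, so a missing optional caption is not penalized.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for caption_type, hits in per_collection.items():
        weight = WEIGHTS[caption_type]
        for token_address, similarity in hits.items():
            weighted_sum[token_address] += weight * similarity
            weight_total[token_address] += weight
    scores = {t: weighted_sum[t] / weight_total[t] for t in weighted_sum}
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

One caveat of this normalization: a token that surfaces only in a low-weight collection can still rank highly, so a real implementation might also require a hit in the entity or context collection.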

Metadata Schema (Stored in All Collections)

ChromaDB Metadata Fields:

{
  "name": "Token display name",
  "ticker": "Token symbol/ticker",
  "uuid": "Token address (Solana mint address - primary key)",
  "tags": "tag1,tag2,tag3 (comma-separated, NOT semicolon)",
  "tags_categories": "category1,category2 (comma-separated subcategories)",
  "created_at": 1234567890
}

created_at is a UNIX timestamp stored as an integer.

Important Notes:

  • ✅ Tags are comma-separated (not semicolon as in legacy docs)
  • No description field in ChromaDB metadata (stored only in metadata files)
  • No confirmed field (all memecoins in vector DB are implicitly confirmed)
  • ✅ Token address (uuid) is the universal identifier across all systems
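
As an illustration of the 6-field schema, the sketch below builds the metadata dict from already-parsed values. The function name `build_chroma_metadata` and its parameters are assumptions for this example; the field names and the comma-separated tag format come from the schema above.

```python
def build_chroma_metadata(name, ticker, token_address, tags, tags_categories, created_at):
    """Build the 6-field metadata dict stored alongside each embedding."""
    return {
        "name": name,
        "ticker": ticker,
        "uuid": token_address,                         # token address is the primary key
        "tags": ",".join(tags),                        # comma-separated, not semicolon
        "tags_categories": ",".join(tags_categories),  # comma-separated subcategories
        "created_at": int(created_at),                 # UNIX timestamp as an integer
    }
```

Note that `description` and the captions are deliberately absent here; per the notes above they live only in the metadata JSON files.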

3. Storage Architecture

ChromaDB Vector Storage

Location: res/memecoins/rag_db/ (defined in src/constants.py:29)

Collections:

  • meme_image_embeddings - 768-dim image vectors
  • entity_caption_embeddings - 768-dim text vectors (entity)
  • context_caption_embeddings - 768-dim text vectors (context)
  • visual_caption_embeddings - 768-dim text vectors (visual)
  • emotions_caption_embeddings - 768-dim text vectors (emotions)

Characteristics:

  • Persistent storage - Data survives application restarts
  • Minimal metadata - Only 6 fields stored per memecoin
  • Fast similarity search - Optimized for cosine similarity queries
  • Atomic operations - All 5 collections updated together or rolled back
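
The all-or-nothing behaviour in the last bullet could be approximated as follows. This is a sketch under stated assumptions: plain dicts stand in for ChromaDB collections so the example runs on its own, and `add_to_all` is a hypothetical helper, not the store's real method.

```python
def add_to_all(collections: dict, token_address: str, embeddings: dict) -> None:
    """Write one embedding per collection; undo earlier writes if any write fails.

    `collections` maps collection name -> a dict used as an in-memory stand-in
    for a ChromaDB collection (an assumption that keeps the sketch runnable).
    """
    written = []
    try:
        for name, vector in embeddings.items():
            if len(vector) != 768:
                raise ValueError(f"{name}: expected 768 dims, got {len(vector)}")
            collections[name][token_address] = vector
            written.append(name)
    except Exception:
        # Roll back every collection that was already updated, then re-raise.
        for name in written:
            collections[name].pop(token_address, None)
        raise
```

The rollback here simply removes the new entries; if an entry could already exist under the same token address, a real implementation would need to restore the previous value instead.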

File-Based Storage

Image Files

  • Location: res/memecoins/rag_db/images/{token_address}.jpg
  • Format: 512x512px RGB JPG files (quality: 90%)
  • Manager: FileImageManager (src/vector_store/file_managers/image_manager.py)
  • Operations: Atomic save/load/delete with temp file pattern
  • Processing: Automatic resize, RGB conversion, RGBA → white background
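
The "temp file pattern" mentioned above usually means writing to a temporary file and then renaming it over the destination. A minimal standard-library sketch follows; `atomic_write_bytes` is a hypothetical helper and is not taken from `FileImageManager`, and the image processing step (resize, RGB conversion) is left out.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, data: bytes) -> None:
    """Write `data` to `target` so readers never observe a partial file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)  # atomic rename over the destination
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

Because `os.replace` swaps the file in a single step, a crash mid-write leaves either the old image or the new one, never a truncated JPG.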

See Also: DATABASE_STORAGE.md - Complete file manager implementation

Metadata Files (Complete Memecoin Data)

  • Location: res/memecoins/rag_db/metadata/{token_address}.json
  • Manager: TokenMetadataFileManager (src/vector_store/file_managers/metadata_manager.py)
  • Concurrency: Async-safe via asyncio.Lock (serializes coroutines within one event loop; asyncio.Lock is not a cross-thread lock)
  • Validation: Enforces required fields and caption structure

Metadata JSON Structure:

{
  "token_name": "Doge Coin",
  "ticker": "DOGE",
  "description": "Much wow, very crypto, such memecoin",
  "token_address": "9WzDXwBbmkg8ZTbNMqUxvQRAyrZzDsGYdLVL9zYtAWWM",
  "created_at": 1693440000,
  "caption": {
    "entity": "Shiba Inu dog wearing space suit, holding rocket, standing on moon surface",
    "context": "Reference to Doge meme and cryptocurrency culture, 'to the moon' expression, satirical take on crypto investment hype",
    "visual": "Pixel art style with bright neon colors, centered composition, bold Comic Sans font overlay",
    "emotions": "Playful and humorous tone, optimistic energy, ironic and satirical mood"
  }
}

Required Fields: token_name, ticker, description, token_address, created_at, caption
Required Caption Fields: entity, context
Optional Caption Fields: visual, emotions

Filesystem-Only Collections

Pending Collection

  • Location: res/memecoins/pending/
  • Format: {token_address}.json + {token_address}.jpg
  • Purpose: Manual curation queue - new memecoins awaiting review
  • Characteristics: No ChromaDB storage, no captions yet, minimal metadata

Approved Collection

  • Location: res/memecoins/approved/
  • Format: {token_address}.json + {token_address}.jpg
  • Purpose: Approved memecoins ready for AI processing into vector DB
  • Characteristics: No ChromaDB storage, no captions yet, can be bulk-inserted

See Also: DATABASE_STORAGE.md - Complete 3-tier workflow details

Storage Path Constants

Defined in: src/constants.py:27-31

MEMECOINS_DATA_ROOT = "res/memecoins/rag_db/"     # ChromaDB + files
IMAGES_ROOT = "res/memecoins/rag_db/images/"      # JPG image files
METADATA_ROOT = "res/memecoins/rag_db/metadata/"  # JSON metadata files
TRENDS_IMAGES_ROOT = "res/memecoins/trends/"      # Trend analysis images
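
Given these constants and the `{token_address}` file naming described earlier, the per-token paths can be derived as shown below. The helper names `image_path` and `metadata_path` are assumptions for illustration; the constants are repeated so the snippet runs on its own.

```python
from pathlib import Path

# Repeated from the constants above so the snippet is self-contained
IMAGES_ROOT = "res/memecoins/rag_db/images/"
METADATA_ROOT = "res/memecoins/rag_db/metadata/"


def image_path(token_address: str) -> Path:
    """Path of the 512x512 JPG for a token."""
    return Path(IMAGES_ROOT) / f"{token_address}.jpg"


def metadata_path(token_address: str) -> Path:
    """Path of the complete metadata JSON for a token."""
    return Path(METADATA_ROOT) / f"{token_address}.json"
```

Since the token address is the universal identifier, both files for a memecoin can be located from that one value.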

4. Split Caption Architecture (RAG Quality Enhancement)

Why Split Captions?

Traditional RAG systems use single monolithic captions per image, which creates several problems:

Problems with Single Captions:

  1. Semantic conflation - Content, context, style, and emotion mixed together
  2. Poor similarity matching - Can't weight different aspects independently
  3. Search inflexibility - Can't prioritize "what" over "how" or vice versa
  4. Information loss - Important details buried in long paragraphs
  5. One-size-fits-all - Same embedding must serve all query types

Split Caption Solution

LaunchAgencyBot uses 4 separate caption types, each generating its own embedding:

{
  "caption": {
    "entity": "WHAT is in the image - concrete visual elements",
    "context": "WHAT DOES IT MEAN - cultural/meme context",
    "visual": "HOW does it look - aesthetic style (optional)",
    "emotions": "HOW does it feel - emotional tone (optional)"
  }
}

Caption Type Deep Dive

| Caption Type | Purpose | Weight | Status | Guidelines | Example | Rationale |
|---|---|---|---|---|---|---|
| Entity | WHAT is in image | 4 | Required | Focus on factual visual content, describe entities, spatial relationships, actions | "Shiba Inu dog wearing space suit with NASA logo, holding rocket with flames, on cratered moon, Earth in background, golden DOGE coin floating" | Most concrete/factual, critical for matching characters/objects, least ambiguous, foundation for other caption types |
| Context | WHAT it means | 3 | Required | Identify meme references/origins, explain crypto culture, note ironic elements, connect to trends | "Reference to Doge meme + 'to the moon' expression, satirical take on crypto hype/HODL mentality, space race parody for memecoin speculation" | Critical for understanding meme meaning, essential for thematic similarity, captures cultural context, helps match by concept not just visuals |
| Visual | HOW it looks | 2 | Optional | Art style (pixel/3D/photo), color palette/lighting, composition/framing, typography | "Pixel art 8-bit aesthetic, neon cyan/magenta, centered composition, Comic Sans overlay, retro vaporwave gradient" | Important for style matching, less critical than content/context, useful for visual similarity. Omit if: generic photos, unremarkable composition, standard templates |
| Emotions | HOW it feels | 1 | Optional | Emotional tone, humor type (ironic/absurdist/sarcastic), mood/energy, intended response | "Playful humorous tone, optimistic energy, ironic/satirical poking fun at crypto culture, lighthearted absurdism, evoke amusement + relatable frustration" | Most subjective, supplementary to content/context, useful for "vibe" matching. Omit if: emotionally neutral, purely informational, unclear intent |

RAG Quality Improvements

| Improvement | Before | After | Impact |
|---|---|---|---|
| Semantic Separation | Mixed concepts in single caption | Dedicated embedding per concept | Queries match specific aspects without semantic bleed (e.g., "space-themed characters" matches entity, not confused by pixel art style) |
| Independent Weighting | Equal or manual tuning | Mathematical weights (4/3/2/1) | Content/context dominate (7), style/emotions add nuance (3) |
| Optional Embeddings | Forced filler text | Skip optional captions | No penalty, more precise matches, no "filler text" polluting embeddings |
| Query Flexibility | Single embedding serves all | Target specific collections | Search by entity, context, or weighted combination |
| Embedding Efficiency | One 768-dim vector (information loss) | 4-5 vectors (3072-3840 dims) | Richer representation, better discrimination |

Caption Generation Pipeline

Stage: src/domain/processor/stages/caption_generation_stage.py:37-92

Process:

  1. Receive a MemecoinEntry with tags already populated by the previous stage
  2. Generate captions with an LLM (via the LiteLLM service)
  3. Parse the AI response into a 4-part caption dict
  4. Validate the required fields (entity, context)
  5. Populate tags_categories using TagManager
  6. Store the caption data in the context for the next stage (the actual save happens atomically later)

AI Prompt Structure:

Generate 4 caption types for this memecoin image:
1. ENTITY: Describe concrete visual elements...
2. CONTEXT: Explain cultural references and meme lore...
3. VISUAL (optional): Describe aesthetic style if notable...
4. EMOTIONS (optional): Describe emotional tone if present...

Tags to consider: {tag_list}
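
Step 3 of the process (parsing the response into a caption dict) might look like the sketch below. The real response format is not documented here, so this assumes the model echoes the labels from the prompt, one caption per line; `parse_captions` is a hypothetical name.

```python
import re

# Matches lines such as "1. ENTITY: ..." or "VISUAL (optional): ..."
_LABEL = re.compile(
    r"^\s*(?:\d+\.\s*)?(ENTITY|CONTEXT|VISUAL|EMOTIONS)\s*(?:\(optional\))?\s*:\s*(.*)$",
    re.IGNORECASE,
)


def parse_captions(response: str) -> dict[str, str]:
    """Parse labeled lines into a caption dict.

    ASSUMPTION: each caption sits on one line prefixed by its label and a colon.
    Empty or missing optional captions are left out of the result.
    """
    captions = {}
    for line in response.splitlines():
        match = _LABEL.match(line)
        if match and match.group(2).strip():
            captions[match.group(1).lower()] = match.group(2).strip()
    for key in ("entity", "context"):
        if key not in captions:
            raise ValueError(f"missing required caption: {key}")
    return captions
```

Multi-line captions are not handled; if the model wraps text across lines, a structured output format such as JSON would be a more robust choice.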

Storage Location

Captions are stored in unified metadata JSON files (NOT separate caption files):

Location: res/memecoins/rag_db/metadata/{token_address}.json

Why Unified Storage:

  • Atomic operations - Single file write prevents caption/metadata mismatch
  • Data integrity - Caption always tied to correct metadata
  • Efficient loading - Single JSON read gets everything
  • Version control friendly - Complete state in one file
  • No orphaned files - No sync issues between separate files

Alternative Considered (Rejected):

  • ❌ Separate caption files ({token_address}_caption.txt)
  • Issues: Sync problems, orphaned files, complex rollback, slower loads

See Also: DATABASE_STORAGE.md - Atomic file operation details

Metadata Validation

Implementation: src/vector_store/file_managers/metadata_manager.py:104-107

# Required keys
required_keys = ["token_name", "ticker", "description", "token_address", "created_at", "caption"]

# Required caption structure
required_caption_keys = ["entity", "context"]  # visual and emotions are optional

Validation Rules:

  • All 6 top-level keys must exist
  • caption must be a dict
  • caption.entity and caption.context must be non-empty strings
  • caption.visual and caption.emotions are optional
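
The rules above translate into a short validator. This is a sketch of the listed rules rather than the code in `metadata_manager.py`; `validate_metadata` and its return-a-list-of-errors style are assumptions.

```python
REQUIRED_KEYS = ["token_name", "ticker", "description", "token_address", "created_at", "caption"]
REQUIRED_CAPTION_KEYS = ["entity", "context"]  # visual and emotions are optional


def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the metadata is valid."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in metadata]
    caption = metadata.get("caption")
    if not isinstance(caption, dict):
        errors.append("caption must be a dict")
        return errors
    for key in REQUIRED_CAPTION_KEYS:
        value = caption.get(key)
        # Required captions must be strings with non-whitespace content
        if not isinstance(value, str) or not value.strip():
            errors.append(f"caption.{key} must be a non-empty string")
    return errors
```

Returning every error at once, instead of raising on the first one, makes it easier to report all problems with a metadata file in a single pass.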

Usage in RAG Generation

Workflow: src/workflows/rag_memecoin_generation_workflow.py

Process:

  1. User provides generation prompt (e.g., "dog with sunglasses in cyberpunk style")
  2. Generate CLIP embedding from prompt text
  3. Query all 4 caption collections with weighted search
  4. Retrieve top N matching memecoins (e.g., N=5)
  5. Load full metadata + images for matched memecoins
  6. Build RAG context from retrieved examples:
    Example 1:
    Entity: {entity_caption}
    Context: {context_caption}
    Visual: {visual_caption}
    Emotions: {emotions_caption}
    Image: [Base64 image data]
    
    Example 2:
    ...
    
  7. Generate new memecoin metadata using RAG context
  8. Create new memecoin with similar style/theme to successful examples
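
Step 6 above (building the RAG context) can be sketched as a plain string formatter. `build_rag_context` is a hypothetical helper; it works on metadata dicts in the JSON structure shown earlier and leaves out the Base64 image line, which would be attached separately.

```python
def build_rag_context(examples: list[dict]) -> str:
    """Format retrieved memecoins into numbered example blocks.

    Each item is a metadata dict with a `caption` sub-dict. Optional captions
    that are absent or empty are skipped rather than rendered as blank lines.
    """
    blocks = []
    for index, metadata in enumerate(examples, start=1):
        caption = metadata["caption"]
        lines = [f"Example {index}:"]
        for key in ("entity", "context", "visual", "emotions"):
            if caption.get(key):
                lines.append(f"{key.capitalize()}: {caption[key]}")
        blocks.append("\n".join(lines))
    # Blank line between examples, matching the layout in step 6
    return "\n\n".join(blocks)
```

Skipping absent optional captions keeps the prompt free of empty `Visual:` or `Emotions:` lines.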

Benefits:

  • Retrieved examples are semantically similar across multiple dimensions
  • Weighted search prioritizes content match over style match
  • AI generator learns from contextually relevant examples
  • Higher quality outputs that match successful patterns

Related Documentation

  • DATABASE_STORAGE.md - Hybrid storage strategy, degradation system, data flow workflows, file managers
  • DATABASE_API.md - API integration patterns, performance optimization, best practices, troubleshooting

Last Updated: 2025-10-15
Architecture Version: 5-Collection Multi-Embedding with Split Captions
Implementation Files: src/vector_store/memecoin_store.py, src/domain/processor/stages/caption_generation_stage.py