AI Generation Services

📋 Document Summary

What This Document Covers:

  • LiteLLMService architecture for multi-model LLM integration
  • Support for OpenAI, Anthropic, Google Gemini, and other providers
  • Prompt engineering patterns and best practices
  • Token usage optimization strategies
  • Error handling, retry logic, and fallback mechanisms
  • Integration with domain processors and RAG workflows

Context Tags: #ai #llm #litellm #prompt-engineering #token-optimization


Overview

LaunchAgencyBot uses LiteLLMService as the unified interface for all Large Language Model (LLM) interactions. This service provides:

  • Multi-provider support: OpenAI, Anthropic, Google Gemini, and 100+ other models via LiteLLM
  • Automatic fallbacks: Retry logic with alternative models on failure
  • Token optimization: Intelligent token counting and cost management
  • Streaming support: Real-time response streaming for long generations
  • Observability: LangFuse integration for monitoring and debugging

Primary Use Cases:

  1. Tag Classification - 356-tag multimodal classification using Gemini 2.0 Flash Thinking
  2. Caption Generation - 4-part caption structure (entity, context, visual, emotions)
  3. Tag Refinement - Confidence-based tag evaluation and filtering
  4. Metadata Generation - Token names, tickers, descriptions
  5. RAG Generation - Context-enhanced memecoin creation
  6. Edit Planning - AI-powered metadata edit proposals

LiteLLMService Architecture

Location: src/ai_services/lite_llm_service.py

Core Design

LiteLLMService provides a unified LLM interface with multi-provider support (100+ models, including OpenAI, Anthropic, and Gemini), a configurable temperature, and optional LangFuse observability.

Key Features

| Feature | Details |
| --- | --- |
| Multi-Model Support | 100+ models via LiteLLM: OpenAI (gpt-4o, o1), Anthropic (claude-3-5-sonnet/haiku), Gemini (2.0-flash-thinking) |
| Retry Logic | RetryManager with exponential backoff (1s, 2s, 4s), max 3 attempts, optional fallback |
| Token Optimization | Pre-call token counting, cost management, warnings for large prompts |
| Streaming | Real-time response streaming for long generations |
| Observability | LangFuse integration for monitoring and debugging |

LangFuse Observability

Purpose: Monitor LLM usage, costs, and performance

Features:

  • Trace all LLM calls with metadata
  • Track token usage and costs
  • Monitor latency and errors
  • Analyze prompt performance

Configuration:

service = LiteLLMService(
    model="gpt-4o-mini",
    langfuse_secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    langfuse_public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    enable_observability=True
)

Supported Models

OpenAI Models

| Model | Context | Cost (per 1M tokens) | Best For |
| --- | --- | --- | --- |
| gpt-4o-mini | 128K | Input: $0.15, Output: $0.60 | General tasks, low cost |
| gpt-4o | 128K | Input: $2.50, Output: $10.00 | Complex reasoning, high quality |
| o1 | 200K | Input: $15.00, Output: $60.00 | Advanced reasoning, math |

Anthropic Models

| Model | Context | Cost (per 1M tokens) | Best For |
| --- | --- | --- | --- |
| claude-3-5-sonnet-20241022 | 200K | Input: $3.00, Output: $15.00 | Long context, reasoning |
| claude-3-5-haiku-20241022 | 200K | Input: $1.00, Output: $5.00 | Fast responses, low cost |

Google Gemini Models

| Model | Context | Cost (per 1M tokens) | Best For |
| --- | --- | --- | --- |
| gemini-2.0-flash-thinking-exp | 1M | Input: Free, Output: Free | Tag classification, reasoning |
| gemini-2.0-flash-exp | 1M | Input: Free, Output: Free | General tasks, vision |

Note: Gemini 2.0 Flash models are currently free during the preview period.
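The per-million-token prices in the tables above translate directly into a per-call cost estimate. The following is a minimal sketch, assuming the prices are hard-coded from those tables; `Price`, `PRICES_PER_1M`, and `estimate_cost` are illustrative names and are not part of LiteLLMService.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens, copied from the tables above."""
    input_usd: float
    output_usd: float


# Illustrative lookup table (an assumption, not part of LiteLLMService).
PRICES_PER_1M = {
    "gpt-4o-mini": Price(0.15, 0.60),
    "gpt-4o": Price(2.50, 10.00),
    "o1": Price(15.00, 60.00),
    "claude-3-5-sonnet-20241022": Price(3.00, 15.00),
    "claude-3-5-haiku-20241022": Price(1.00, 5.00),
    "gemini-2.0-flash-thinking-exp": Price(0.0, 0.0),  # free during preview
    "gemini-2.0-flash-exp": Price(0.0, 0.0),  # free during preview
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one call in USD."""
    price = PRICES_PER_1M[model]
    total = input_tokens * price.input_usd + output_tokens * price.output_usd
    return total / 1_000_000
```

For example, a call with 2,000 input tokens and 500 output tokens costs about $0.0006 on gpt-4o-mini and about $0.01 on gpt-4o, which is the kind of gap that motivates routing routine tasks to the smaller model.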


Service Integration

Domain Processor Integration

RAGMemecoinInsertionProcessor uses two LiteLLMService instances: one backed by Gemini 2.0 Flash Thinking for 356-tag classification, which needs advanced reasoning, and one backed by GPT-4o-mini for general tasks such as captions and tag refinement, where it is faster and cheaper.

Processing Stage Integration

Stages Using LLM Services:

  1. CategoryClassificationStage - Main category detection
  2. TagClassificationStage - 356-tag multimodal classification
  3. ImageCaptioningStage - 4-part caption generation
  4. TagsRefinementStage - Tag confidence evaluation
  5. MetadataOnlyEditPlanningStage - Edit proposal generation
  6. CaptionAndTagsRegenerationStage - Targeted metadata updates

Prompt Engineering

Structured Output Pattern

Best Practice: Describe the expected JSON structure in the prompt so that responses can be parsed reliably

# Define expected JSON structure in prompt
system_prompt = """
Generate a JSON response with this structure:
{
  "entity": "description of entities",
  "context": "cultural context",
  "visual": "visual style description",
  "emotions": "emotional tone"
}

CRITICAL: Return ONLY valid JSON, no additional text.
"""

response = await llm_service.generate(
    system_prompt=system_prompt,
    user_prompt=user_message
)

# Parse the JSON response; fall back to None so later code can detect failure
try:
    result = json.loads(response)
except json.JSONDecodeError:
    # Handle malformed JSON
    logger.error(f"Invalid JSON response: {response}")
    result = None
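Valid JSON is not necessarily the JSON that was asked for, so it can help to check the four caption keys as well. The helper below is a sketch under stated assumptions: `parse_caption` and `REQUIRED_KEYS` are hypothetical names, and the fence-stripping step assumes the model may wrap its answer in a Markdown code fence despite the instruction not to.

```python
import json
from typing import Optional

REQUIRED_KEYS = ("entity", "context", "visual", "emotions")
FENCE = "`" * 3  # Markdown code fence marker


def parse_caption(response: str) -> Optional[dict]:
    """Return the caption dict, or None if the response is unusable."""
    text = response.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Every expected key must be present and hold a string.
    if any(not isinstance(data.get(key), str) for key in REQUIRED_KEYS):
        return None
    return data
```

Returning None rather than raising keeps the caller's control flow simple: a single check decides whether to retry, fall back, or skip the item.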

Few-Shot Examples

Pattern: Provide 2-3 examples for consistency

system_prompt = """
Classify images into tags. Examples:

Example 1:
Image: Shiba Inu dog in space suit
Tags: ["dog", "space", "cryptocurrency", "meme"]

Example 2:
Image: Pepe frog with sunglasses
Tags: ["frog", "sunglasses", "meme", "internet-culture"]

Now classify this image:
"""

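When the examples live in data rather than in a hand-written string, the few-shot prompt can be assembled programmatically. This is a minimal sketch; `build_few_shot_prompt` is a hypothetical helper, and its output format simply mirrors the prompt shown above.

```python
import json
from typing import Sequence, Tuple


def build_few_shot_prompt(examples: Sequence[Tuple[str, Sequence[str]]]) -> str:
    """Build a classification prompt from (image description, tags) pairs."""
    parts = ["Classify images into tags. Examples:", ""]
    for index, (image, tags) in enumerate(examples, start=1):
        parts.append(f"Example {index}:")
        parts.append(f"Image: {image}")
        # json.dumps renders the tag list with double quotes, as in the prompt above.
        parts.append(f"Tags: {json.dumps(list(tags))}")
        parts.append("")
    parts.append("Now classify this image:")
    return "\n".join(parts)


EXAMPLES = [
    ("Shiba Inu dog in space suit", ["dog", "space", "cryptocurrency", "meme"]),
    ("Pepe frog with sunglasses", ["frog", "sunglasses", "meme", "internet-culture"]),
]
prompt = build_few_shot_prompt(EXAMPLES)
```

Keeping the examples in a list makes it easy to hold the count at two or three and to swap examples per category without editing prompt text.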
Temperature Tuning

Guidelines:

  • 0.0-0.3: Factual tasks (tag classification, validation)
  • 0.4-0.7: General tasks (captions, descriptions)
  • 0.8-1.0: Creative tasks (name generation, artistic descriptions)

# Tag classification - low temperature for consistency
tag_service = LiteLLMService(model="gemini-2.0-flash-thinking", temperature=0.3)

# Caption generation - medium temperature for variation
caption_service = LiteLLMService(model="gpt-4o-mini", temperature=0.7)

# Creative generation - high temperature for diversity
creative_service = LiteLLMService(model="gpt-4o", temperature=0.9)
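The guideline ranges can also be checked in code, for instance in a unit test over service configuration. The sketch below assumes three task kinds named after the guidelines; `TEMPERATURE_RANGES` and `check_temperature` are hypothetical names. Values that fall between the bands, such as 0.35, are rejected here, which is a design choice of this sketch rather than something the guidelines state.

```python
# Inclusive (low, high) bounds taken from the guidelines above.
TEMPERATURE_RANGES = {
    "factual": (0.0, 0.3),
    "general": (0.4, 0.7),
    "creative": (0.8, 1.0),
}


def check_temperature(task_kind: str, temperature: float) -> bool:
    """Return True if the temperature falls inside the band for the task kind."""
    low, high = TEMPERATURE_RANGES[task_kind]
    return low <= temperature <= high
```

With this check, a tag classification service configured at 0.9 would be flagged, while the three configurations shown above would pass.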

Token Optimization

Optimization Techniques

| Technique | Method | Example | Token Savings |
| --- | --- | --- | --- |
| Dynamic Context Loading | Load only relevant context based on task | filter_tags_by_category(category) returns 500-1000 tokens vs load_all_tags() at 5000+ tokens | 80-90% reduction |
| Prompt Compression | Remove unnecessary whitespace and formatting | lines = [line.strip() for line in prompt.split('\n')]; '\n'.join(line for line in lines if line) | 10-20% reduction |
| Response Length Limits | Set max_tokens based on expected output | Tag classification: max_tokens=300, Caption generation: max_tokens=600 | Prevents over-generation |

Implementation: CategoryClassificationStage uses dynamic context loading to filter tags before classification.
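The first two techniques can be sketched in a few lines. `compress_prompt` wraps the one-liner from the table; `filter_tags_by_category` here is a hypothetical stand-in that takes the tag catalogue as an argument, so its signature is an assumption and may differ from the project's own function.

```python
from typing import Dict, List


def compress_prompt(prompt: str) -> str:
    """Strip whitespace around each line and drop blank lines."""
    lines = [line.strip() for line in prompt.split("\n")]
    return "\n".join(line for line in lines if line)


def filter_tags_by_category(
    tags_by_category: Dict[str, List[str]], category: str
) -> List[str]:
    """Return only the tags for one category (empty if the category is unknown)."""
    return list(tags_by_category.get(category, []))


# Illustrative data only; the real catalogue has 356 tags.
TAGS = {"animals": ["dog", "frog"], "themes": ["space", "meme"]}
context = ", ".join(filter_tags_by_category(TAGS, "animals"))
```

Note that this compression removes indentation, so it suits instruction text but not prompts that embed code or other content where leading whitespace carries meaning.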

Model Selection

Strategy: Use the smallest capable model for each task

| Task | Model | Reasoning |
| --- | --- | --- |
| Tag Classification | Gemini 2.0 Flash Thinking | Free, advanced reasoning |
| Caption Generation | GPT-4o-mini | Low cost, good quality |
| Tag Refinement | GPT-4o-mini | Low cost, sufficient quality |
| Complex Editing | GPT-4o | High quality reasoning needed |
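One way to keep this mapping in a single place is a small routing table. This is a sketch only: the task keys, `MODEL_BY_TASK`, `DEFAULT_MODEL`, and `select_model` are hypothetical, the model identifiers follow the code examples in this document, and falling back to gpt-4o-mini for unknown tasks is an assumption.

```python
# Task-to-model routing, mirroring the table above.
MODEL_BY_TASK = {
    "tag_classification": "gemini-2.0-flash-thinking",
    "caption_generation": "gpt-4o-mini",
    "tag_refinement": "gpt-4o-mini",
    "complex_editing": "gpt-4o",
}

# Assumed default: the cheapest general-purpose model in the table.
DEFAULT_MODEL = "gpt-4o-mini"


def select_model(task: str) -> str:
    """Return the model identifier for a task, or the default if unknown."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```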

Error Handling

Error Handling Patterns

Pattern: API Exceptions
Configuration: Catch APIServiceError from src.domain.exceptions
Code:

try:
    response = await llm_service.generate(prompt)
except APIServiceError as e:
    logger.error(f"LLM API error: {e}")

Pattern: Retry with Backoff
Configuration: Max 3 attempts, exponential backoff: 1s → 2s → 4s
Code:

retry_manager = RetryManager(config=API_RETRY_CONFIG)
response = await retry_manager.execute_with_retry(
    lambda: llm_service.generate(prompt),
    fallback=lambda: {"error": "LLM unavailable"})

Pattern: Timeout Handling
Configuration: 30 second timeout recommended for LLM calls
Code:

response = await asyncio.wait_for(
    llm_service.generate(prompt),
    timeout=30.0)
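To show how the retry and timeout patterns combine, here is a self-contained sketch of a retry loop with a per-attempt timeout and exponential backoff. It is not the project's RetryManager, whose internals this document does not show; `call_with_retry` and its parameters are hypothetical, and the broad `except Exception` would likely be narrowed to APIServiceError in real code.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    timeout: float = 30.0,
    fallback: Optional[Callable[[], T]] = None,
) -> T:
    """Run `operation` with a per-attempt timeout and exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except Exception as exc:  # assumption: narrow this in real code
            last_error = exc
            if attempt < max_attempts - 1:
                # Delay doubles each time: base_delay * 1, * 2, * 4, ...
                await asyncio.sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        # All attempts failed; return the caller-supplied default.
        return fallback()
    assert last_error is not None
    raise last_error
```

A call might look like `await call_with_retry(lambda: llm_service.generate(prompt), fallback=lambda: {"error": "LLM unavailable"})`. In this sketch the loop sleeps only between attempts, so three attempts use the 1s and 2s delays; the timeout applies to each attempt separately rather than to the whole sequence.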

Best Practices

Practice: Dedicated Services
✅ Good:

tag_service = LiteLLMService(model="gemini-2.0-flash-thinking")
caption_service = LiteLLMService(model="gpt-4o-mini")

❌ Bad:

service = LiteLLMService(model="gemini-2.0-flash-thinking")
service.model = "gpt-4o-mini" # Don't!

Notes: Create a separate service instance for each model; do not reuse one instance and switch its model.

Practice: JSON Parsing
✅ Good:

try:
    result = json.loads(response)
except json.JSONDecodeError:
    result = {"error": "Invalid format"}

❌ Bad:

result = json.loads(response) # Crashes!

Notes: Always wrap JSON parsing in try/except with a fallback.

Practice: Temperature
✅ Good:

LiteLLMService(model="gemini-2.0-flash-thinking", temperature=0.3) # Factual
LiteLLMService(model="gpt-4o-mini", temperature=0.7) # Balanced
LiteLLMService(model="gpt-4o", temperature=0.9) # Creative

❌ Bad:

LiteLLMService(model="gemini-2.0-flash-thinking", temperature=0.9) # Wrong!

Notes: Match temperature to task: 0.0-0.3 factual, 0.4-0.7 general, 0.8-1.0 creative.

Practice: Token Monitoring
✅ Good:

token_count = llm_service.count_tokens(prompt)
if token_count > 100000:
    logger.warning(f"Large: {token_count} tokens")

❌ Bad:

await llm_service.generate(prompt) # Expensive!

Notes: Count tokens before the API call to avoid unexpected costs.

Practice: Retry Logic
✅ Good:

retry_manager = RetryManager(config=API_RETRY_CONFIG)
await retry_manager.execute_with_retry(
    lambda: llm_service.generate(prompt),
    fallback=lambda: default_response)

❌ Bad:

await llm_service.generate(prompt) # Fails!

Notes: Use retry with a fallback for critical operations to handle API failures.
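The token monitoring practice can be turned into a small pre-flight guard. The sketch below does not use the service's count_tokens method; instead it uses a crude estimate of roughly four characters per token for English text, which is a rule of thumb and not a real tokenizer. `rough_token_count` and `within_budget` are hypothetical names, and the `counter` parameter lets a real tokenizer be substituted.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4


def within_budget(
    prompt: str,
    limit: int = 100_000,
    counter: Callable[[str], int] = rough_token_count,
) -> bool:
    """Return True if the estimated token count does not exceed the limit."""
    return counter(prompt) <= limit
```

A caller could log a warning or trim context when `within_budget(prompt)` returns False, before any API cost is incurred.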

Implementation References

Key Files

| Component | File Path |
| --- | --- |
| LiteLLMService | src/ai_services/lite_llm_service.py |
| RetryManager | src/domain/exceptions/retry_manager.py |
| Tag Classification Stage | src/domain/processor/stages/tag_classification_stage.py |
| Caption Generation Stage | src/domain/processor/stages/image_captioning_stage.py |
| Tag Refinement Stage | src/domain/processor/stages/tags_refinement_stage.py |
| Edit Planning Stage | src/domain/processor/stages/metadata_only_edit_planning_stage.py |

Document Status: Complete
Last Updated: October 16, 2025
Implementation: Multi-provider LLM service with 100+ model support