What This Document Covers:
- LiteLLMService architecture for multi-model LLM integration
- Support for OpenAI, Anthropic, Google Gemini, and other providers
- Prompt engineering patterns and best practices
- Token usage optimization strategies
- Error handling, retry logic, and fallback mechanisms
- Integration with domain processors and RAG workflows
Sections in This Document:
- Overview
- LiteLLMService Architecture
- Supported Models
- Service Integration
- Prompt Engineering
- Token Optimization
- Error Handling
- Usage Examples
- Best Practices
Related Documentation:
- → RAG_SYSTEM.md - RAG workflows and generation
- → entity_extraction_enhancement.md - Entity extraction feature
- → ../architecture/DOMAIN_ARCHITECTURE.md - Domain processor integration
- → ../../.claude/ARCHITECTURE.md - System architecture
Context Tags: #ai #llm #litellm #prompt-engineering #token-optimization
LaunchAgencyBot uses LiteLLMService as the unified interface for all Large Language Model (LLM) interactions. This service provides:
- Multi-provider support: OpenAI, Anthropic, Google Gemini, and 100+ other models via LiteLLM
- Automatic fallbacks: Retry logic with alternative models on failure
- Token optimization: Intelligent token counting and cost management
- Streaming support: Real-time response streaming for long generations
- Observability: LangFuse integration for monitoring and debugging
Primary Use Cases:
- Tag Classification - 356-tag multimodal classification using Gemini 2.0 Flash Thinking
- Caption Generation - 4-part caption structure (entity, context, visual, emotions)
- Tag Refinement - Confidence-based tag evaluation and filtering
- Metadata Generation - Token names, tickers, descriptions
- RAG Generation - Context-enhanced memecoin creation
- Edit Planning - AI-powered metadata edit proposals
Location: src/ai_services/lite_llm_service.py
LiteLLMService provides a unified LLM interface with multi-provider support (100+ models, including OpenAI, Anthropic, and Gemini), configurable temperature, and optional LangFuse observability.
| Feature | Details |
|---|---|
| Multi-Model Support | 100+ models via LiteLLM: OpenAI (gpt-4o, o1), Anthropic (claude-3-5-sonnet/haiku), Gemini (2.0-flash-thinking) |
| Retry Logic | RetryManager with exponential backoff (1s, 2s, 4s), max 3 attempts, optional fallback |
| Token Optimization | Pre-call token counting, cost management, warnings for large prompts |
| Streaming | Real-time response streaming for long generations |
| Observability | LangFuse integration for monitoring and debugging |
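The automatic-fallback behavior listed above can be illustrated with a minimal sketch. It does not reproduce the internals of LiteLLMService: `generate_with_fallback`, `AllModelsFailedError`, and the `fake_call` stub are hypothetical names introduced here, and the stub stands in for a real provider call.

```python
from typing import Callable, Sequence


class AllModelsFailedError(RuntimeError):
    """Raised when every model in the fallback chain has failed."""


def generate_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_name, response) from the first success."""
    errors: list[str] = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real service would catch provider-specific errors
            errors.append(f"{model}: {exc}")
    raise AllModelsFailedError("; ".join(errors))


def fake_call(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a provider call; the first model is simulated as down."""
    if model == "gemini-2.0-flash-thinking":
        raise TimeoutError("simulated outage")
    return f"[{model}] response to: {prompt}"


used_model, text = generate_with_fallback(
    "Describe this image",
    ["gemini-2.0-flash-thinking", "gpt-4o-mini"],
    fake_call,
)
print(used_model)
```

Passing the provider call in as a function keeps the fallback order easy to test without network access.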
LangFuse Observability. Purpose: Monitor LLM usage, costs, and performance
Features:
- Trace all LLM calls with metadata
- Track token usage and costs
- Monitor latency and errors
- Analyze prompt performance
Configuration:

    import os

    from src.ai_services.lite_llm_service import LiteLLMService

    # LangFuse keys are read from the environment; observability is opt-in
    service = LiteLLMService(
        model="gpt-4o-mini",
        langfuse_secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
        langfuse_public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
        enable_observability=True
    )

OpenAI models:

| Model | Context | Cost (per 1M tokens) | Best For |
|---|---|---|---|
| gpt-4o-mini | 128K | Input: $0.15, Output: $0.60 | General tasks, low cost |
| gpt-4o | 128K | Input: $2.50, Output: $10.00 | Complex reasoning, high quality |
| o1 | 200K | Input: $15.00, Output: $60.00 | Advanced reasoning, math |
Anthropic models:

| Model | Context | Cost (per 1M tokens) | Best For |
|---|---|---|---|
| claude-3-5-sonnet-20241022 | 200K | Input: $3.00, Output: $15.00 | Long context, reasoning |
| claude-3-5-haiku-20241022 | 200K | Input: $1.00, Output: $5.00 | Fast responses, low cost |
Google Gemini models:

| Model | Context | Cost (per 1M tokens) | Best For |
|---|---|---|---|
| gemini-2.0-flash-thinking-exp | 1M | Input: Free, Output: Free | Tag classification, reasoning |
| gemini-2.0-flash-exp | 1M | Input: Free, Output: Free | General tasks, vision |

Note: Gemini 2.0 Flash models are currently free during the preview period.
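As a worked example of the per-1M-token pricing, the sketch below estimates the cost of a single call. The prices are copied from the tables in this section; the `PRICING` dictionary and `estimate_cost` helper are hypothetical names, not part of LiteLLMService, and the Gemini preview models are omitted because they are listed as free.

```python
# Prices in USD per 1M tokens, copied from the pricing tables in this section.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "o1": {"input": 15.00, "output": 60.00},
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku-20241022": {"input": 1.00, "output": 5.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one call for the given token counts."""
    price = PRICING[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# 2,000 prompt tokens and 600 completion tokens on gpt-4o-mini:
# (2000 * 0.15 + 600 * 0.60) / 1,000,000 = 0.00066 USD
cost = estimate_cost("gpt-4o-mini", input_tokens=2_000, output_tokens=600)
print(f"${cost:.6f}")
```

The same call on gpt-4o would cost (2000 × 2.50 + 600 × 10.00) / 1,000,000 = $0.011, which is why the smaller model is preferred for routine tasks.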
RAGMemecoinInsertionProcessor uses two LiteLLMService instances: Gemini 2.0 Flash Thinking for 356-tag classification (advanced reasoning) and GPT-4o-mini for general tasks such as captions and tag refinement (faster and cheaper).
Stages Using LLM Services:
- CategoryClassificationStage - Main category detection
- TagClassificationStage - 356-tag multimodal classification
- ImageCaptioningStage - 4-part caption generation
- TagsRefinementStage - Tag confidence evaluation
- MetadataOnlyEditPlanningStage - Edit proposal generation
- CaptionAndTagsRegenerationStage - Targeted metadata updates
Best Practice: Use a JSON schema for structured responses

    import json
    import logging

    logger = logging.getLogger(__name__)

    # Define the expected JSON structure in the prompt
    system_prompt = """
    Generate a JSON response with this structure:
    {
        "entity": "description of entities",
        "context": "cultural context",
        "visual": "visual style description",
        "emotions": "emotional tone"
    }
    CRITICAL: Return ONLY valid JSON, no additional text.
    """

    # Must run inside an async function
    response = await llm_service.generate(
        system_prompt=system_prompt,
        user_prompt=user_message
    )

    # Parse the JSON response
    try:
        result = json.loads(response)
    except json.JSONDecodeError:
        # Handle malformed JSON
        logger.error(f"Invalid JSON response: {response}")
        result = None

Pattern: Provide 2-3 examples for consistency

    system_prompt = """
    Classify images into tags. Examples:

    Example 1:
    Image: Shiba Inu dog in space suit
    Tags: ["dog", "space", "cryptocurrency", "meme"]

    Example 2:
    Image: Pepe frog with sunglasses
    Tags: ["frog", "sunglasses", "meme", "internet-culture"]

    Now classify this image:
    """

Guidelines:
- 0.0-0.3: Factual tasks (tag classification, validation)
- 0.4-0.7: General tasks (captions, descriptions)
- 0.8-1.0: Creative tasks (name generation, artistic descriptions)
Example configurations:

    # Tag classification - low temperature for consistency
    tag_service = LiteLLMService(model="gemini-2.0-flash-thinking", temperature=0.3)

    # Caption generation - medium temperature for variation
    caption_service = LiteLLMService(model="gpt-4o-mini", temperature=0.7)

    # Creative generation - high temperature for diversity
    creative_service = LiteLLMService(model="gpt-4o", temperature=0.9)

| Technique | Method | Example | Token Savings |
|---|---|---|---|
| Dynamic Context Loading | Load only relevant context based on task | `filter_tags_by_category(category)` returns 500-1000 tokens vs `load_all_tags()` at 5000+ tokens | 80-90% reduction |
| Prompt Compression | Remove unnecessary whitespace and formatting | `lines = [line.strip() for line in prompt.split('\n')]`<br>`'\n'.join(line for line in lines if line)` | 10-20% reduction |
| Response Length Limits | Set max_tokens based on expected output | Tag classification: `max_tokens=300`, Caption generation: `max_tokens=600` | Prevents over-generation |
Implementation: CategoryClassificationStage uses dynamic context loading to filter tags before classification.
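The first two techniques can be sketched together as follows. This is an illustration under stated assumptions: `filter_tags_by_category` is named in this document but its signature here is a guess, the four-characters-per-token estimate is a rough heuristic and not a real tokenizer, and the sample tags are made up.

```python
def compress_prompt(prompt: str) -> str:
    """Strip whitespace around each line and drop empty lines."""
    lines = (line.strip() for line in prompt.split("\n"))
    return "\n".join(line for line in lines if line)


def estimate_tokens(text: str) -> int:
    """Rough estimate (~4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def filter_tags_by_category(tags: dict[str, list[str]], category: str) -> list[str]:
    """Return only the tags for one category (hypothetical signature)."""
    return tags.get(category, [])


# Made-up tag set, used only to show the filtering step.
ALL_TAGS = {"animals": ["dog", "frog"], "objects": ["sunglasses", "space-suit"]}

raw_prompt = """
    Classify the image.

        Use only the tags listed below.

"""
compact = compress_prompt(raw_prompt)
relevant_tags = filter_tags_by_category(ALL_TAGS, "animals")
print(estimate_tokens(raw_prompt), estimate_tokens(compact), relevant_tags)
```

Filtering the tag list usually matters far more than whitespace compression, which matches the savings reported in the table.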
Strategy: Use the smallest capable model for each task
| Task | Model | Reasoning |
|---|---|---|
| Tag Classification | Gemini 2.0 Flash Thinking | Free, advanced reasoning |
| Caption Generation | GPT-4o-mini | Low cost, good quality |
| Tag Refinement | GPT-4o-mini | Low cost, sufficient quality |
| Complex Editing | GPT-4o | High quality reasoning needed |
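One way to encode this selection table is a small lookup that fails loudly on unknown tasks. The sketch below is illustrative only: the task keys, `ModelRoute`, `MODEL_ROUTES`, and `choose_model` are hypothetical names, while the models and reasoning mirror the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRoute:
    model: str
    reasoning: str


# Task keys are hypothetical; models and reasoning mirror the selection table.
MODEL_ROUTES = {
    "tag_classification": ModelRoute("gemini-2.0-flash-thinking", "Free, advanced reasoning"),
    "caption_generation": ModelRoute("gpt-4o-mini", "Low cost, good quality"),
    "tag_refinement": ModelRoute("gpt-4o-mini", "Low cost, sufficient quality"),
    "complex_editing": ModelRoute("gpt-4o", "High quality reasoning needed"),
}


def choose_model(task: str) -> ModelRoute:
    """Look up the model for a task, raising ValueError for unknown tasks."""
    try:
        return MODEL_ROUTES[task]
    except KeyError:
        raise ValueError(f"No model configured for task: {task!r}") from None


route = choose_model("caption_generation")
print(route.model)
```

Raising on an unknown task, instead of silently defaulting to an expensive model, keeps cost surprises visible during development.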
| Pattern | Code | Configuration |
|---|---|---|
| API Exceptions | `try:`<br>`    response = await llm_service.generate(prompt)`<br>`except APIServiceError as e:`<br>`    logger.error(f"LLM API error: {e}")` | Catch APIServiceError from src.domain.exceptions |
| Retry with Backoff | `retry_manager = RetryManager(config=API_RETRY_CONFIG)`<br>`response = await retry_manager.execute_with_retry(`<br>`    lambda: llm_service.generate(prompt),`<br>`    fallback=lambda: {"error": "LLM unavailable"})` | Max 3 attempts, exponential backoff: 1s → 2s → 4s |
| Timeout Handling | `response = await asyncio.wait_for(`<br>`    llm_service.generate(prompt), timeout=30.0)` | 30-second timeout recommended for LLM calls |
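The retry, backoff, and timeout patterns in the table combine naturally. The following is a standalone sketch using only `asyncio`, not the project's RetryManager: `retry_with_backoff` and the `flaky` coroutine are hypothetical, and this version sleeps only between attempts (1s, then 2s with the defaults), which may differ from how RetryManager schedules its delays.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    timeout: float = 30.0,
    fallback: Optional[Callable[[], T]] = None,
) -> T:
    """Run `operation` with a per-attempt timeout, doubling the delay after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except Exception as exc:  # a real service would retry only transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                await asyncio.sleep(base_delay * 2**attempt)
    if fallback is not None:
        return fallback()
    assert last_error is not None
    raise last_error


async def demo() -> str:
    calls = {"count": 0}

    async def flaky() -> str:
        # Hypothetical operation that fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("simulated API failure")
        return "ok"

    return await retry_with_backoff(flaky, base_delay=0.01)


print(asyncio.run(demo()))
```

The fallback is only consulted after every attempt has failed, so a cheap default response never masks a call that would have succeeded on retry.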
| Practice | ✅ Good | ❌ Bad | Notes |
|---|---|---|---|
| Dedicated Services | `tag_service = LiteLLMService(model="gemini-2.0-flash-thinking")`<br>`caption_service = LiteLLMService(model="gpt-4o-mini")` | `service = LiteLLMService(model="gemini-2.0-flash-thinking")`<br>`service.model = "gpt-4o-mini"  # Don't!` | Create a separate service instance for each model; do not reuse one instance and switch its model |
| JSON Parsing | `try:`<br>`    result = json.loads(response)`<br>`except json.JSONDecodeError:`<br>`    result = {"error": "Invalid format"}` | `result = json.loads(response)  # Crashes!` | Always wrap JSON parsing in try/except with a fallback |
| Temperature | `LiteLLMService(model="gemini-2.0-flash-thinking", temperature=0.3)  # Factual`<br>`LiteLLMService(model="gpt-4o-mini", temperature=0.7)  # Balanced`<br>`LiteLLMService(model="gpt-4o", temperature=0.9)  # Creative` | `LiteLLMService(model="gemini-2.0-flash-thinking", temperature=0.9)  # Wrong!` | Match temperature to task: 0.0-0.3 factual, 0.4-0.7 general, 0.8-1.0 creative |
| Token Monitoring | `token_count = llm_service.count_tokens(prompt)`<br>`if token_count > 100000:`<br>`    logger.warning(f"Large: {token_count} tokens")` | `await llm_service.generate(prompt)  # Expensive!` | Count tokens before the API call to avoid unexpected costs |
| Retry Logic | `retry_manager = RetryManager(config=API_RETRY_CONFIG)`<br>`await retry_manager.execute_with_retry(`<br>`    lambda: llm_service.generate(prompt),`<br>`    fallback=lambda: default_response)` | `await llm_service.generate(prompt)  # Fails!` | Use retry with a fallback for critical operations to handle API failures |
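The JSON parsing practice can be taken one step further by checking that the expected keys are present. The sketch below assumes the 4-part caption structure (entity, context, visual, emotions) described in this document; `parse_caption_json` and `REQUIRED_KEYS` are hypothetical names, and stripping a Markdown fence is a defensive assumption about model output, not documented behavior of LiteLLMService.

```python
import json
from typing import Any, Optional

REQUIRED_KEYS = ("entity", "context", "visual", "emotions")
FENCE = "`" * 3  # Markdown code-fence marker


def parse_caption_json(response: str) -> Optional[dict[str, Any]]:
    """Return the parsed caption dict, or None if it is malformed or incomplete."""
    text = response.strip()
    # Some models wrap JSON in a Markdown fence despite instructions; remove it.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(key not in data for key in REQUIRED_KEYS):
        return None
    return data


good = '{"entity": "dog", "context": "meme", "visual": "cartoon", "emotions": "joy"}'
print(parse_caption_json(good) is not None)
print(parse_caption_json("Sure! Here is the JSON you asked for") is None)
```

Returning None for both malformed and incomplete responses lets the caller handle every failure case with a single check before retrying or falling back.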
| Component | File Path |
|---|---|
| LiteLLMService | src/ai_services/lite_llm_service.py |
| RetryManager | src/domain/exceptions/retry_manager.py |
| Tag Classification Stage | src/domain/processor/stages/tag_classification_stage.py |
| Caption Generation Stage | src/domain/processor/stages/image_captioning_stage.py |
| Tag Refinement Stage | src/domain/processor/stages/tags_refinement_stage.py |
| Edit Planning Stage | src/domain/processor/stages/metadata_only_edit_planning_stage.py |
- Document Status: Complete
- Last Updated: October 16, 2025
- Implementation: Multi-provider LLM service with 100+ model support