Semantic Foragecast Engine is a configuration-first, modular pipeline for generating audio-driven animated videos. The system is designed around clean separation of concerns, extensibility, and production deployment requirements.
Core Philosophy: Configuration over code changes. Users should be able to create entirely different outputs by modifying YAML files, not Python code.
┌─────────────────────────────────────────────────────────────┐
│ Main Orchestrator │
│ (main.py) │
│ - Validates configuration │
│ - Ensures dependencies │
│ - Routes to phase executors │
└──────────────┬──────────────────────────────┬────────────────┘
│ │
▼ ▼
┌───────────────────────┐ ┌──────────────────────────┐
│ Phase Execution │ │ Configuration Layer │
│ (Sequential) │◄─────┤ (config.yaml) │
└───────────────────────┘ └──────────────────────────┘
│
│
┌───────────┴────────────────────────────────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ PHASE 1 │ │ PHASE 2 │ │ PHASE 3 │
│ Audio Prep │─▶│ Rendering │─▶│ Export │
│ │ │ │ │ │
│ prep_audio.py │ │ blender_script.py│ │ export_video.py │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │ │
▼ ▼ ▼
prep_data.json PNG frames MP4 video
File: prep_audio.py
Input: Audio file (WAV), lyrics file (TXT), configuration
Output: prep_data.json
Responsibilities:
1. Audio Analysis
   - Duration calculation
   - Sample rate validation
   - Tempo detection (LibROSA)
2. Beat Detection
   - Beat times using LibROSA's onset detection
   - Onset times for finer-grained timing
   - Frame number conversion (based on configured FPS)
3. Phoneme Extraction
   - Rhubarb Lip Sync integration (if available)
   - Mock phoneme fallback for testing
   - Time-to-frame conversion
4. Lyrics Parsing
   - Pipe-delimited format parsing (word|start|end)
   - Word-level timing data
   - Frame-based timing calculation
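The time-to-frame conversion and pipe-delimited parsing steps above can be sketched in plain Python (a minimal illustration; the function names are hypothetical, not the pipeline's actual API):

```python
def time_to_frame(t_seconds, fps):
    """Convert an absolute time in seconds to a frame number."""
    return round(t_seconds * fps)

def parse_lyrics(text, fps):
    """Parse pipe-delimited lyric lines of the form word|start|end."""
    words = []
    for line in text.strip().splitlines():
        word, start, end = line.split('|')
        words.append({
            'word': word,
            'start_frame': time_to_frame(float(start), fps),
            'end_frame': time_to_frame(float(end), fps),
        })
    return words

# "hello" sung from 0.5s to 1.0s at 24 fps spans frames 12-24
parse_lyrics("hello|0.5|1.0", 24)
```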
Design Pattern: Data extraction layer
- No rendering logic
- Pure data processing
- Cacheable output (JSON)
- Can be run independently for validation
Key Classes:
AudioPreprocessor
├── load_audio()
├── detect_beats()
├── extract_phonemes()
└── parse_lyrics()

File: blender_script.py
Input: Configuration, prep_data.json, asset files
Output: PNG frame sequence
Responsibilities:
1. Scene Setup
   - Camera positioning
   - Lighting configuration
   - Render settings
2. Animation Mode Dispatch
   - 2D Grease Pencil mode
   - 3D Mesh mode
   - Hybrid mode
   - Extensible to new modes
3. Animation Application
   - Lip sync (phoneme-driven mouth shapes)
   - Beat gestures (scale/rotation on beats)
   - Lyrics display (timed text objects)
4. Frame Rendering
   - EEVEE or Cycles engine
   - Configurable quality/performance
   - Headless-compatible (Xvfb)
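The phoneme-driven lip sync can be illustrated outside Blender: given the frame-stamped phonemes from prep_data.json, pick the active mouth shape for any frame (a sketch only; the real builder keys Grease Pencil layers rather than returning strings):

```python
def mouth_shape_at(frame, phonemes):
    """Return the active mouth shape for a frame.

    phonemes: list of dicts with 'frame' (start frame) and 'shape'
    keys, sorted by frame, as derived from Rhubarb-style output.
    """
    shape = 'X'  # Rhubarb's rest/closed shape
    for p in phonemes:
        if p['frame'] <= frame:
            shape = p['shape']
        else:
            break
    return shape

phonemes = [{'frame': 0, 'shape': 'X'},
            {'frame': 12, 'shape': 'A'},
            {'frame': 20, 'shape': 'E'}]
mouth_shape_at(15, phonemes)  # 'A'
```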
Design Pattern: Builder + Strategy
- Builder: GreasePencilBuilder constructs the scene
- Strategy: Animation mode selected at runtime
- Factory: Creates the appropriate builder based on config
Key Classes:
BlenderSceneBuilder
├── __init__(config, prep_data)
├── setup_scene()
├── build_animation() → dispatches to:
│ ├── GreasePencilBuilder
│ ├── MeshBuilder (planned)
│ └── HybridBuilder (planned)
└── render_frames()
GreasePencilBuilder
├── convert_image_to_strokes()
├── animate_lipsync()
├── add_beat_gestures()
└── create_lyric_text()

Extension Point: Add new animation modes by:
- Creating a new builder class (e.g., ParticleSystemBuilder)
- Implementing the required methods
- Registering it in the build_animation() dispatcher
- Adding the mode to the config schema
File: export_video.py
Input: PNG frames, audio file, configuration
Output: MP4 video file
Responsibilities:
1. Frame Validation
   - Check all frames rendered
   - Validate frame numbering
   - Verify frame resolution
2. Video Encoding
   - FFmpeg integration
   - Codec configuration (libx264, libx265, etc.)
   - Quality presets (low, medium, high)
3. Audio Synchronization
   - Embed original audio track
   - Ensure frame rate matches audio timing
   - Verify final duration
4. Preview Generation
   - Optional lower-resolution preview
   - Quick validation output
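The encoding step boils down to assembling one FFmpeg invocation. A hedged sketch of the command construction (the frame-naming pattern and flag values are assumptions; the exporter's actual command may differ by preset):

```python
import subprocess

def build_ffmpeg_cmd(frames_dir, audio_path, fps, out_path, crf=20):
    """Assemble the FFmpeg argv for encoding a PNG sequence with audio."""
    return [
        'ffmpeg', '-y',
        '-framerate', str(fps),                # input frame rate
        '-i', f'{frames_dir}/frame_%04d.png',  # numbered PNG sequence
        '-i', audio_path,                      # original audio track
        '-c:v', 'libx264', '-crf', str(crf),
        '-pix_fmt', 'yuv420p',                 # broad player compatibility
        '-shortest',                           # stop at the shorter stream
        out_path,
    ]

cmd = build_ffmpeg_cmd('frames', 'song.wav', 24, 'out.mp4')
# subprocess.run(cmd, check=True)  # requires FFmpeg on PATH
```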
Design Pattern: Facade
- Abstracts FFmpeg complexity
- Provides simple encode() interface
- Handles cross-platform paths
Key Classes:
VideoExporter
├── validate_frames()
├── encode_video()
└── create_preview()

Audio File (song.wav)
│
▼
┌─────────────────────┐
│ LibROSA │
│ - Analyze tempo │
│ - Detect beats │
└──────┬──────────────┘
│
▼
┌─────────────────────┐ ┌──────────────────┐
│ Rhubarb/Mock │ │ Lyrics Parser │
│ - Extract phonemes│ │ - Parse timing │
└──────┬──────────────┘ └────────┬─────────┘
│ │
└───────────┬───────────────────┘
▼
prep_data.json
{beats, phonemes, lyrics}
│
▼
┌───────────────────────┐
│ Blender Python API │
│ - Build scene │
│ - Apply animations │
│ - Render frames │
└───────────┬───────────┘
│
▼
PNG Frames (0001-NNNN)
│
▼
┌───────────────────────┐
│ FFmpeg │
│ - Encode video │
│ - Sync audio │
└───────────┬───────────┘
│
▼
Final MP4
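Putting the flow together, the prep_data.json handed from phase 1 to phase 2 might look like this (an illustrative shape only, not the exact schema):

```json
{
  "audio": {"duration": 30.0, "tempo": 120.0, "sample_rate": 44100},
  "beats": {"beat_frames": [0, 12, 24, 36], "onset_frames": [0, 6, 12]},
  "phonemes": [{"frame": 12, "shape": "A"}, {"frame": 20, "shape": "E"}],
  "lyrics": [{"word": "hello", "start_frame": 12, "end_frame": 24}]
}
```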
Pattern: Hierarchical YAML with inheritance
# Top level: environment settings
inputs: {mascot, song, lyrics}
output: {directories, naming}
# Middle: rendering specifications
video: {resolution, fps, quality}
style: {colors, lighting, effects}
# Bottom: implementation details
animation: {mode, features, parameters}
advanced: {debug, threads, optimization}

Responsibility: Single source of truth for all system behavior
Benefits:
- No code changes for different outputs
- Shareable presets (ultra_fast, production, etc.)
- Validation at startup (fail fast)
- Documentation via example configs
Pattern: Declarative references, validated at startup
Assets:
- mascot_image: Source image for the character
- song_file: Audio track (WAV preferred)
- lyrics_file: Timed lyrics text
Validation:
- Check existence
- Verify format/extension
- Validate content (sample rate, image dimensions, etc.)
Pattern: Render engine agnostic design
```python
from abc import ABC, abstractmethod

# Blender-specific implementation hidden behind interface
class Renderer(ABC):
    @abstractmethod
    def setup_scene(self): pass

    @abstractmethod
    def render_frame(self, frame_num): pass

# Currently: BlenderRenderer
# Future: Unity, Unreal, custom engines
```

Decision: Separate phases with JSON intermediate format
Rationale:
- ✅ Can re-render without re-analyzing audio
- ✅ Each phase independently testable
- ✅ Failed renders don't require audio re-processing
- ✅ Parallel development possible
- ❌ Slight disk I/O overhead (negligible)
Alternative Rejected: Single monolithic process
- Would couple audio analysis to rendering
- Makes testing harder
- No caching benefits
Decision: YAML configuration drives all behavior
Rationale:
- ✅ Non-programmers can create presets
- ✅ Easier A/B testing (just swap configs)
- ✅ Configuration can be versioned separately
- ✅ Reduces code changes for common variations
- ❌ More complex validation required
- ❌ Harder to express complex logic in YAML
Alternative Rejected: Programmatic API
- Steeper learning curve
- Less shareable
- Would still need configs for common cases
Decision: 2D Grease Pencil as primary mode
Rationale:
- ✅ Faster rendering (2-3x speedup)
- ✅ Unique artistic style
- ✅ More forgiving of low-quality input images
- ✅ Smaller file sizes
- ❌ Less "polished" appearance
- ❌ Fewer effects available
Alternative Available: 3D mesh mode (config: mode: "3d")
- Higher quality but slower
- More realistic lighting
- Better for professional output
Decision: Phoneme-based with Rhubarb integration
Rationale:
- ✅ Industry-standard approach
- ✅ Accurate mouth shapes
- ✅ Works with any audio (speech or song)
- ✅ Fallback mock mode for testing
- ❌ External dependency (Rhubarb)
- ❌ Requires audio processing time
Alternative Rejected: Volume-based (mouth opens on loud sounds)
- Inaccurate, looks amateurish
- No correlation to actual words
- Cheaper but not production-quality
Decision: Built-in Xvfb compatibility, no GUI required
Rationale:
- ✅ Cloud deployment (AWS, GCP, containers)
- ✅ CI/CD integration possible
- ✅ Batch processing on servers
- ✅ Scalable to render farms
- ❌ Slightly complex local setup (need Xvfb)
Alternative Rejected: Require display/GUI
- Limits deployment options
- Can't run in containers
- Not automation-friendly
Example: Particle system mode
- Create a builder class:

```python
# particle_system.py
class ParticleSystemBuilder:
    def __init__(self, config, prep_data):
        self.config = config
        self.prep_data = prep_data

    def build_scene(self):
        # Setup particle emitter
        # Configure physics
        pass

    def animate(self):
        # Trigger emissions on beats
        # Color changes on phonemes
        pass
```

- Register it in the dispatcher:

```python
# blender_script.py
def build_animation(config, prep_data):
    mode = config['animation']['mode']
    if mode == '2d_grease':
        return GreasePencilBuilder(config, prep_data)
    elif mode == '3d':
        return MeshBuilder(config, prep_data)
    elif mode == 'particles':  # NEW
        return ParticleSystemBuilder(config, prep_data)
```

- Add the config schema:

```yaml
# config_particles.yaml
animation:
  mode: "particles"
  particle_count: 1000
  emission_rate: 100
```

Example: Melody extraction
- Extend the preprocessor:

```python
# prep_audio.py
class AudioPreprocessor:
    def extract_melody(self):
        # Use librosa.piptrack or CREPE; piptrack returns a
        # (pitches, magnitudes) pair of per-frame arrays
        pitches, magnitudes = librosa.piptrack(y=self.audio, sr=self.sr)
        return self._convert_to_frame_data(pitches)
```

- Save to prep_data:
```python
prep_data['melody'] = {
    'pitches': [...],
    'confidence': [...]
}
```

- Use in animation:

```python
# blender_script.py
def apply_melody_animation(self):
    melody = self.prep_data['melody']
    # Map pitch to mascot height/color
```

Example: Camera shake on beats
- Add to the config schema:

```yaml
effects:
  camera_shake:
    enabled: true
    intensity: 0.1
    frequency: 2
```

- Implement in the builder:
```python
import random

def add_camera_shake(self):
    camera = bpy.data.objects['Camera']
    for beat in self.prep_data['beats']['beat_frames']:
        # Add a location keyframe offset by noise on each beat
        camera.location.x += random.uniform(-0.1, 0.1)
        camera.keyframe_insert(data_path="location", frame=beat)
```

Based on empirical testing (30s video, 2D mode, 12fps):
| Resolution | Pixels | Render Time | File Size | Use Case |
|---|---|---|---|---|
| 180p (320x180) | 57.6K | ~3 min | 489KB | Fast testing |
| 360p (640x360) | 230.4K | ~6 min | 806KB | Quality check |
| 540p (960x540) | 518.4K | ~12 min | ~2MB | Preview |
| 1080p (1920x1080) | 2.07M | ~45 min | ~8MB | Production |
Scaling: Render time grows sub-linearly with pixel count; in the table above, doubling the resolution (4x pixels) roughly doubles render time, since per-frame overhead (scene setup, I/O) is resolution-independent and Blender's optimizations help at high resolutions.
| FPS | Frames (30s) | Render Time | Smoothness |
|---|---|---|---|
| 12 | 360 | 1x (base) | Acceptable |
| 24 | 720 | 2x | Good |
| 30 | 900 | 2.5x | Excellent |
| 60 | 1800 | 5x | Overkill (web video) |
Recommendation: 24 FPS for production (best quality/time ratio)
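The FPS table is linear in frame count (2x frames → 2x time), so total render time can be estimated from a measured per-frame cost. A small helper (hypothetical, not part of the pipeline), calibrated from the 180p row (360 frames in ~3 min → 0.5 s/frame):

```python
def estimate_render_time(duration_s, fps, seconds_per_frame):
    """Estimate total render time (seconds) from a per-frame cost."""
    total_frames = round(duration_s * fps)
    return total_frames * seconds_per_frame

per_frame = (3 * 60) / 360  # 0.5 s/frame measured at 180p, 12 fps
estimate_render_time(30, 24, per_frame)  # 360.0 s, matching the 2x row
```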
| Mode | Speed | Quality | File Size | Best For |
|---|---|---|---|---|
| 2D Grease | ⚡⚡⚡ Fast | ⭐⭐ Good | Small | Testing, artistic style |
| 3D Mesh | ⚡⚡ Medium | ⭐⭐⭐ Best | Medium | Professional output |
| Hybrid | ⚡ Slow | ⭐⭐⭐ Best | Large | Maximum quality |
User Machine (Windows/Mac/Linux)
├── Blender installed locally
├── Python environment
└── Direct file system access
Pros: Full GUI access, easy debugging
Cons: Manual setup per machine
Container Image
├── Ubuntu base
├── Blender 4.0+
├── Python + dependencies
├── Xvfb for headless
└── FFmpeg
Pros: Reproducible, portable
Cons: No GPU acceleration (CPU only)
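The container stack above might be sketched as a Dockerfile (illustrative only: apt's Blender package may lag behind 4.0, and the entrypoint assumes main.py drives Blender under xvfb-run itself; pin versions for real use):

```dockerfile
FROM ubuntu:22.04
# Blender, virtual framebuffer for headless rendering, and FFmpeg
RUN apt-get update && apt-get install -y \
    blender xvfb ffmpeg python3 python3-pip
COPY . /app
WORKDIR /app
RUN pip3 install -r requirements.txt
CMD ["python3", "main.py"]
```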
EC2/Compute Instance
├── Headless Blender
├── GPU-enabled instance (optional)
├── S3/Cloud Storage for outputs
└── Queue-based job system
Pros: Scalable, parallel renders
Cons: Network transfer overhead, cost
Coordinator Node
├── Job queue
├── Asset distribution
└── Result aggregation
Worker Nodes (1-N)
├── Blender + Xvfb
├── Pull jobs from queue
└── Upload rendered frames
Pros: Massive parallelization
Cons: Complex setup, only worth it at scale
1. Startup Validation
   - Check all files exist
   - Validate config schema
   - Verify dependencies (Blender, FFmpeg)
2. Phase Boundaries
   - Validate phase 1 output before phase 2
   - Check frame count before encoding
   - Verify video duration matches audio
3. Graceful Degradation
   - Rhubarb missing? Use mock phonemes
   - Lyrics file missing? Skip lyrics
   - Effects unsupported? Disable cleanly
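Graceful degradation for the Rhubarb dependency can be as simple as a PATH check with a mock fallback (a sketch; the mock's alternating-shape scheme is an assumption, and the real Rhubarb invocation is omitted):

```python
import shutil

def mock_phonemes(duration_s, fps):
    """Fallback: alternate open (A) and rest (X) shapes every half second."""
    shapes = []
    t = 0.0
    while t < duration_s:
        shapes.append({'frame': round(t * fps),
                       'shape': 'A' if int(t * 2) % 2 == 0 else 'X'})
        t += 0.5
    return shapes

def extract_phonemes(audio_path, duration_s, fps):
    """Use Rhubarb when available, otherwise degrade to the mock."""
    if shutil.which('rhubarb') is None:
        print('[WARN] Rhubarb not found; using mock phonemes')
        return mock_phonemes(duration_s, fps)
    # Real path: invoke the rhubarb CLI and parse its output (omitted)
    raise NotImplementedError
```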
# Structured logging at multiple levels
[OK] # Success messages
[INFO] # Informational progress
[WARN] # Non-fatal issues (fallback used)
[ERROR] # Fatal issues (abort)

Output: Console for interactive runs, file for batch
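The tagged format above takes only a few lines to implement (a minimal sketch; the pipeline's actual logger and level handling may differ):

```python
import sys

LEVELS = ('OK', 'INFO', 'WARN', 'ERROR')

def log(level, message, stream=sys.stdout):
    """Emit one tagged log line in the pipeline's [LEVEL] format."""
    assert level in LEVELS, f'unknown level: {level}'
    line = f'[{level}] {message}'
    print(line, file=stream)
    return line

log('WARN', 'Rhubarb missing, using mock phonemes')
# prints: [WARN] Rhubarb missing, using mock phonemes
```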
Unit Tests:
- Audio parsing logic
- Configuration validation
- Math utilities (frame conversion, etc.)

Integration Tests:
- Full pipeline with test assets
- Multiple config variations
- Headless vs. GUI modes

Performance Tests:
- Render time benchmarks
- Memory usage profiling
- Scalability testing (long videos)

Visual Regression Tests:
- Frame comparison (detect regressions)
- Position verification (debug mode)
- Output quality checks
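The unit-level items lend themselves to plain pytest-style assertions. For example, the frame-conversion math (a hypothetical test, assuming a time_to_frame helper defined as round(t * fps)):

```python
def time_to_frame(t_seconds, fps):
    return round(t_seconds * fps)

def test_time_to_frame():
    assert time_to_frame(0.0, 24) == 0
    assert time_to_frame(1.0, 24) == 24
    assert time_to_frame(0.5, 12) == 6
    # A 30s clip at 24 fps ends on frame 720
    assert time_to_frame(30.0, 24) == 720

test_time_to_frame()
```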
1. Plugin System
   - External plugins for effects
   - Community-contributed animation modes
   - Hot-reload during development
2. Real-time Preview
   - WebSocket-based progress streaming
   - Browser-based preview player
   - Live scrubbing of timeline
3. Distributed Rendering
   - Split frames across multiple machines
   - Coordinator node for job distribution
   - Result merging and encoding
4. API Service
   - REST API for job submission
   - Webhook callbacks on completion
   - Multi-tenancy support
5. Web UI
   - No-code configuration builder
   - Asset upload interface
   - Progress monitoring dashboard
The architecture prioritizes:
- Modularity: Each phase independent
- Extensibility: Easy to add features
- Configurability: No code changes for variations
- Production-readiness: Headless, scalable, error-handled
This design enables both rapid experimentation (swap configs) and production deployment (Docker, cloud, render farms).