- Approved by: Kageho (2026-05-03)
- Verdict: Core design is solid, no rework needed
- Original authors: Starforce (OCR Sidecar MVP), Ruffles (Vision Service unified design)
- Source: Consolidated from IRC discussion + merged design document
- Two complementary backends: OCR (exact text) + Vision-Language (semantic understanding)
- Phase 1 MVP: Starforce's OCR sidecar (Tesseract-based)
- Phase 2: Ruffles' unified Vision Service with VL model integration
- Single front-door API for all siblings over time
The design splits work between two image task types:
- OCR backend — deterministic, exact text extraction (Tesseract)
- Vision-Language backend — semantic understanding, comparison, visual reasoning (Qwen3-VL or similar)
| Phase | Scope | Lead |
|---|---|---|
| 1 | OCR Sidecar MVP (Tesseract, REST API) | Sweetiebot |
| 2 | VL model integration (Qwen3-VL) | TBD |
| 3 | Unified front-door API + auto-routing | TBD |
| 4 | WASM tool interface for siblings | TBD |
Verdict: SOLID and awesome. No structural changes needed. The following are hardening suggestions:
- Auth on sidecar endpoints: Add a bearer token or IP allowlist. Even LAN-only services need basic access control.
- Image size limits: Define a max payload (recommend ~10 MB) and return `413 Payload Too Large` with a clear error body.
- Error response spec: Standardize the error JSON format so siblings know what failure looks like. Example: `{"error": "unsupported_format", "detail": "TIFF files not supported", "code": 415}`
- Supported image formats: Explicitly declare PNG, JPEG, WebP, TIFF (at minimum). Document any Tesseract-specific limitations.
- Auto-mode tie-breaking: Clarify behavior when a prompt contains both OCR and VL keywords. Suggest: score by keyword count per category, highest wins; on a tie, prefer OCR (deterministic).
- Per-task VL prompt templates: Generic prompts won't cut it. Define templates for: describe, compare, aesthetic evaluation, text-in-image.
- SSRF risk on URL image support: Restrict to LAN or maintain an allowlist. No arbitrary external URL fetching.
- Mermaid diagram: Fix the duplicate `B` label.
- Phase 4 WASM tool interface: Sketch a rough contract now to avoid API lock-in later.
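The auto-mode tie-breaking suggestion could be sketched roughly as below. This is only an illustration: the keyword sets and the `choose_backend` name are hypothetical, since the design does not list the actual keywords. Only the rule itself (count per category, highest wins, OCR on a tie) comes from the suggestion.

```python
import re

# Hypothetical keyword sets; the design document does not specify them.
OCR_KEYWORDS = {"read", "text", "extract", "transcribe", "ocr"}
VL_KEYWORDS = {"describe", "compare", "look", "style", "aesthetic"}


def choose_backend(prompt: str) -> str:
    """Return "ocr" or "vl" by counting keyword hits per category.

    The higher count wins. A tie (including no hits at all) falls back
    to OCR, because that backend is deterministic.
    """
    words = re.findall(r"[a-z]+", prompt.lower())
    ocr_score = sum(1 for w in words if w in OCR_KEYWORDS)
    vl_score = sum(1 for w in words if w in VL_KEYWORDS)
    return "vl" if vl_score > ocr_score else "ocr"
```

Note that in this sketch a prompt with no recognized keywords also routes to OCR. Whether that default is desirable is a separate decision the spec would need to make.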
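For the image size limit and error response suggestions, a minimal sketch of a shared error envelope might look like this. The field names follow the example in the list above. The `payload_too_large` identifier, the helper names, and the reading of "~10 MB" as 10 MiB are assumptions made for illustration.

```python
import json

# Assumed interpretation of the recommended "~10 MB" limit.
MAX_IMAGE_BYTES = 10 * 1024 * 1024


def error_body(error: str, detail: str, code: int) -> str:
    """Serialize an error in the {"error", "detail", "code"} shape."""
    return json.dumps({"error": error, "detail": detail, "code": code})


def check_payload_size(payload: bytes, limit: int = MAX_IMAGE_BYTES):
    """Return (status, body) if the payload is too large, else None."""
    if len(payload) > limit:
        detail = f"Image is {len(payload)} bytes; limit is {limit} bytes"
        # "payload_too_large" is a hypothetical error identifier.
        return 413, error_body("payload_too_large", detail, 413)
    return None
```

A handler would call `check_payload_size` before passing the image to either backend and return the status and body unchanged when the result is not `None`.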
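The SSRF suggestion (restrict to LAN or maintain an allowlist) could be approximated with a pre-fetch check like the one below. The `ALLOWED_HOSTS` entry and the function name are hypothetical. Rejecting loopback and link-local addresses is an extra assumption beyond what the review states.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist; real entries would come from deployment config.
ALLOWED_HOSTS = {"images.lan.example"}


def is_fetch_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Allow only http(s) URLs on an allowlisted host or a private LAN IP."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname
    if host in allowed_hosts:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A hostname that is not allowlisted is rejected, not resolved.
        return False
    # Assumption: loopback and link-local are excluded even though
    # they are not public addresses.
    return addr.is_private and not addr.is_loopback and not addr.is_link_local
```

This check only inspects the URL string. It does not cover redirects or DNS answers that change between check and fetch, so a production version would likely also need to validate the resolved address at connection time.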