@@ -29,6 +29,17 @@ graph LR
 - Time-bounded search filters
 - CLI and MCP server interfaces
 
+## Multi-Modal Support
+
+Librarian supports indexing and searching across multiple file types:
+
+| Asset Type | File Extensions | Features |
+|------------|-----------------|----------|
+| **Text** | `.md`, `.txt` | Frontmatter extraction, header-aware chunking |
+| **Code** | `.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.cpp`, and more | Symbol extraction (classes, functions, methods) |
+| **PDF** | `.pdf` | Page-based text extraction |
+| **Image** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp` | Metadata and EXIF extraction, optional OCR |
+
 ## Installation
 
 ```bash
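The Multi-Modal Support table added in this hunk amounts to a dispatch from file extension to asset type. A minimal sketch of that lookup might look like the following. The `AssetType` enum, `EXTENSION_MAP`, and `classify()` are hypothetical names for illustration, not taken from the Librarian codebase.

```python
from __future__ import annotations

from enum import Enum
from pathlib import Path


class AssetType(str, Enum):
    """Asset categories from the Multi-Modal Support table (hypothetical names)."""
    TEXT = "text"
    CODE = "code"
    PDF = "pdf"
    IMAGE = "image"


# Hypothetical extension lookup built from the table. The table says code support
# covers the listed extensions "and more", so the CODE entries here are incomplete.
EXTENSION_MAP: dict[str, AssetType] = {
    ".md": AssetType.TEXT,
    ".txt": AssetType.TEXT,
    ".pdf": AssetType.PDF,
    **{ext: AssetType.CODE for ext in (".py", ".js", ".ts", ".go", ".rs", ".java", ".cpp")},
    **{ext: AssetType.IMAGE for ext in (".png", ".jpg", ".jpeg", ".gif", ".webp")},
}


def classify(path: str | Path) -> AssetType | None:
    """Return the asset type for a file path, or None if its extension is not listed."""
    # Lower-case the suffix so "Diagram.PNG" is treated the same as "diagram.png".
    suffix = Path(path).suffix.lower()
    return EXTENSION_MAP.get(suffix)
```

Returning `None` for an unknown suffix leaves the caller to decide whether to skip the file; the table's "and more" indicates Librarian recognizes additional code extensions that this sketch omits.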
@@ -43,6 +54,14 @@ Or install manually:
 uv pip install -e ".[dev]"
 ```
 
+Optional multi-modal dependencies:
+
+```bash
+uv pip install -e ".[pdf]"     # PDF support (pypdf)
+uv pip install -e ".[vision]"  # Image support (Pillow)
+uv pip install -e ".[all]"     # All optional features
+```
+
 ## CLI Usage
 
 ```bash
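Because the PDF and image extras are optional, code that relies on them has to cope with their absence. One way to see which extras are present without importing them is `importlib.util.find_spec`. The sketch below assumes the `pdf` extra provides the `pypdf` module and the `vision` extra provides Pillow's `PIL` module; `available_extras()` is a hypothetical helper, not part of Librarian.

```python
from __future__ import annotations

from importlib.util import find_spec

# Assumed mapping from each install extra to the top-level module it provides.
# "pypdf" and "PIL" are the standard import names of pypdf and Pillow.
OPTIONAL_EXTRAS: dict[str, str] = {
    "pdf": "pypdf",
    "vision": "PIL",
}


def available_extras() -> dict[str, bool]:
    """Report which optional extras appear to be installed, without importing them."""
    # find_spec returns None when a top-level module cannot be located.
    return {extra: find_spec(module) is not None for extra, module in OPTIONAL_EXTRAS.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        status = "installed" if ok else f'missing (run: uv pip install -e ".[{extra}]")'
        print(f"{extra}: {status}")
```

Since `find_spec` only locates the module, the check is cheap, but it does not confirm that the installed version is one Librarian supports.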
@@ -126,7 +145,7 @@ librarian/
 │   └── fts_store.py   # FTS5 search
 ├── processing/
 │   ├── embed/         # Embedding providers
-│   ├── parsers/       # Document parsers
+│   ├── parsers/       # Document parsers (md, code, pdf, image)
 │   └── transform/     # Text chunking
 ├── retrieval/
 │   └── search.py      # Hybrid search + MMR
@@ -159,89 +178,3 @@ MIT License - see [LICENSE](LICENSE) for details.
 
 - Email: <contact@arcade.dev>
 - Website: [arcade.dev](https://arcade.dev)
-
-## Current Limitations & Roadmap
-
-### Image Search Limitations
-
-Images are currently indexed by **metadata only** (filename, format, dimensions, EXIF data). The system does not yet understand visual content.
-
-**What works now**:
-- Search by filename: `search("diagram.png")`
-- Search by format: `search("PNG")`
-- Filter results by asset type
-
-**What doesn't work yet**:
-- Visual content search: `search("architecture diagram")` won't understand what's IN the image
-- Text within images: Can't find text that appears inside screenshots or diagrams
-- Image-to-image similarity: Can't find visually similar images
-
-### Multi-Modal Roadmap
-
-| Phase | Feature | Status | Impact | Effort | ETA |
-|-------|---------|--------|--------|--------|-----|
-| **1** | **Documentation & Config** | **In Progress** | Set expectations | Low | v0.6.0 |
-| | Document current limitations | Complete | Users understand metadata-only indexing | - | - |
-| | Add configuration structure | Planned | Prepare for future embedding models | - | - |
-| **2** | **OCR for Images** | **Planned** | Extract text FROM images | High | v0.6.0 |
-| | Add pytesseract integration | Planned | Search text in screenshots | Low | 2-3 days |
-| | Enable text extraction from diagrams | Planned | Find labels, annotations in images | - | - |
-| | Search scanned documents | Planned | Index PDF images and photos | - | - |
-| **3** | **CLIP Visual Embeddings** | Planned | True visual understanding | Very High | v0.7.0 |
-| | Add CLIP model integration | Planned | Text-to-image semantic search | Medium | 5-7 days |
-| | Create vision vector table | Planned | Separate 512-dim embeddings | - | - |
-| | Implement search_images tool | Planned | Find images by visual content | - | - |
-| **4** | **CodeBERT for Code** | Planned | Better code search | Medium | v0.8.0 |
-| | Add CodeBERT embeddings | Planned | Improved semantic code search | Medium | 4-5 days |
-| | Cross-language similarity | Planned | Find similar algorithms across languages | - | - |
-| **5** | **Cross-Modal Search** | Planned | Unified search experience | High | v1.0.0 |
-| | Merge results across modalities | Planned | Single query finds all asset types | High | 3-4 days |
-| | Score normalization | Planned | Fair ranking across embedding spaces | - | - |
-
-### Next Steps
-
-**Immediate** (v0.6.0 - This Month):
-1. Add OCR support with pytesseract
-2. Enable text extraction from images
-3. Document installation and configuration
-4. Test with screenshots and diagrams
-
-**Short-term** (v0.7.0 - Next Month):
-1. Evaluate OCR adoption and usage patterns
-2. Decide on CLIP investment based on image search demand
-3. If validated: Implement CLIP visual embeddings
-4. Add text-to-image semantic search
-
-**Long-term** (v0.8.0+):
-1. CodeBERT for improved code search (if needed)
-2. Cross-modal unified search
-3. Audio transcription (Whisper)
-4. Video frame extraction
-
-**Decision Points**:
-- **After OCR**: Measure adoption before investing in CLIP
-- **After CLIP**: Assess if CodeBERT adds value over text embeddings
-- **After individual modalities**: Evaluate need for unified cross-modal search
-
-### Installing Optional Features
-
-```bash
-# OCR support (v0.6.0+)
-# Enabled by default - requires Tesseract
-uv pip install -e ".[ocr]"
-brew install tesseract  # macOS
-# To disable: export ENABLE_OCR=false
-
-# Vision support with CLIP (v0.7.0+)
-uv pip install -e ".[vision]"
-export ENABLE_VISION_EMBEDDINGS=true
-
-# Code embeddings with CodeBERT (v0.8.0+)
-uv pip install -e ".[code]"
-export ENABLE_CODE_EMBEDDINGS=true
-
-# All features
-uv pip install -e ".[all]"
-```
-
-Vision and code embeddings are **opt-in** and disabled by default. OCR is **enabled by default** (v0.6.0+).