ArcadeAI
diff --git a/README.md b/README.md
Lines changed: 20 additions & 87 deletions
diff --git a/librarian/indexing.py b/librarian/indexing.py
Lines changed: 24 additions & 8 deletions
diff --git a/librarian/processing/parsers/__init__.py b/librarian/processing/parsers/__init__.py
Lines changed: 11 additions & 1 deletion
diff --git a/librarian/processing/parsers/base.py b/librarian/processing/parsers/base.py
Lines changed: 133 additions & 0 deletions
@@ -29,6 +29,17 @@ graph LR
 - Time-bounded search filters
 - CLI and MCP server interfaces
 
+## Multi-Modal Support
+
+Librarian supports indexing and searching across multiple file types:
+
+| Asset Type | File Extensions | Features |
+|------------|----------------|----------|
+| **Text** | `.md`, `.txt` | Frontmatter extraction, header-aware chunking |
+| **Code** | `.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.cpp`, and more | Symbol extraction (classes, functions, methods) |
+| **PDF** | `.pdf` | Page-based text extraction |
+| **Image** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp` | Metadata and EXIF extraction, optional OCR |
+
 ## Installation
 
 ```bash
@@ -43,6 +54,14 @@ Or install manually:
 uv pip install -e ".[dev]"
 ```
 
+Optional multi-modal dependencies:
+
+```bash
+uv pip install -e ".[pdf]"      # PDF support (pypdf)
+uv pip install -e ".[vision]"   # Image support (Pillow)
+uv pip install -e ".[all]"      # All optional features
+```
+
 ## CLI Usage
 
 ```bash
@@ -126,7 +145,7 @@ librarian/
 │   └── fts_store.py     # FTS5 search
 ├── processing/
 │   ├── embed/       # Embedding providers
-│   ├── parsers/     # Document parsers
+│   ├── parsers/     # Document parsers (md, code, pdf, image)
 │   └── transform/   # Text chunking
 ├── retrieval/
 │   └── search.py    # Hybrid search + MMR
@@ -159,89 +178,3 @@ MIT License - see [LICENSE](LICENSE) for details.
 
 - Email: <contact@arcade.dev>
 - Website: [arcade.dev](https://arcade.dev)
-
-## Current Limitations & Roadmap
-
-### Image Search Limitations
-
-Images are currently indexed by **metadata only** (filename, format, dimensions, EXIF data). The system does not yet understand visual content.
-
-**What works now**:
-- Search by filename: `search("diagram.png")`
-- Search by format: `search("PNG")`
-- Filter results by asset type
-
-**What doesn't work yet**:
-- Visual content search: `search("architecture diagram")` won't understand what's IN the image
-- Text within images: Can't find text that appears inside screenshots or diagrams
-- Image-to-image similarity: Can't find visually similar images
-
-### Multi-Modal Roadmap
-
-| Phase | Feature | Status | Impact | Effort | ETA |
-|-------|---------|--------|--------|--------|-----|
-| **1** | **Documentation & Config** | **In Progress** | Set expectations | Low | v0.6.0 |
-| | Document current limitations | Complete | Users understand metadata-only indexing | - | - |
-| | Add configuration structure | Planned | Prepare for future embedding models | - | - |
-| **2** | **OCR for Images** | **Planned** | Extract text FROM images | High | v0.6.0 |
-| | Add pytesseract integration | Planned | Search text in screenshots | Low | 2-3 days |
-| | Enable text extraction from diagrams | Planned | Find labels, annotations in images | - | - |
-| | Search scanned documents | Planned | Index PDF images and photos | - | - |
-| **3** | **CLIP Visual Embeddings** | Planned | True visual understanding | Very High | v0.7.0 |
-| | Add CLIP model integration | Planned | Text-to-image semantic search | Medium | 5-7 days |
-| | Create vision vector table | Planned | Separate 512-dim embeddings | - | - |
-| | Implement search_images tool | Planned | Find images by visual content | - | - |
-| **4** | **CodeBERT for Code** | Planned | Better code search | Medium | v0.8.0 |
-| | Add CodeBERT embeddings | Planned | Improved semantic code search | Medium | 4-5 days |
-| | Cross-language similarity | Planned | Find similar algorithms across languages | - | - |
-| **5** | **Cross-Modal Search** | Planned | Unified search experience | High | v1.0.0 |
-| | Merge results across modalities | Planned | Single query finds all asset types | High | 3-4 days |
-| | Score normalization | Planned | Fair ranking across embedding spaces | - | - |
-
-### Next Steps
-
-**Immediate** (v0.6.0 - This Month):
-1. Add OCR support with pytesseract
-2. Enable text extraction from images
-3. Document installation and configuration
-4. Test with screenshots and diagrams
-
-**Short-term** (v0.7.0 - Next Month):
-1. Evaluate OCR adoption and usage patterns
-2. Decide on CLIP investment based on image search demand
-3. If validated: Implement CLIP visual embeddings
-4. Add text-to-image semantic search
-
-**Long-term** (v0.8.0+):
-1. CodeBERT for improved code search (if needed)
-2. Cross-modal unified search
-3. Audio transcription (Whisper)
-4. Video frame extraction
-
-**Decision Points**:
-- **After OCR**: Measure adoption before investing in CLIP
-- **After CLIP**: Assess if CodeBERT adds value over text embeddings
-- **After individual modalities**: Evaluate need for unified cross-modal search
-
-### Installing Optional Features
-
-```bash
-# OCR support (v0.6.0+)
-# Enabled by default - requires Tesseract
-uv pip install -e ".[ocr]"
-brew install tesseract  # macOS
-# To disable: export ENABLE_OCR=false
-
-# Vision support with CLIP (v0.7.0+)
-uv pip install -e ".[vision]"
-export ENABLE_VISION_EMBEDDINGS=true
-
-# Code embeddings with CodeBERT (v0.8.0+)
-uv pip install -e ".[code]"
-export ENABLE_CODE_EMBEDDINGS=true
-
-# All features
-uv pip install -e ".[all]"
-```
-
-Vision and code embeddings are **opt-in** and disabled by default. OCR is **enabled by default** (v0.6.0+).
@@ -12,6 +12,7 @@
 
 from librarian.config import ENABLE_CODE_EMBEDDINGS, ENABLE_VISION_EMBEDDINGS
 from librarian.processing.embed import get_embedder, get_embedder_for_modality
+from librarian.processing.parsers.base import FileReadError, FileReadTimeoutError
 from librarian.processing.parsers.registry import get_parser_for_file
 from librarian.processing.transform.chunker import Chunker, ChunkingStrategy
 from librarian.processing.transform.code import CodeChunker, chunk_code_by_blocks
@@ -123,29 +124,32 @@ def _embed_image_chunks(
             fallback: list[list[float]] = embedder.embed_documents([c.content for c in chunks])
             return fallback
 
-    def index_file(self, file_path: Path, timeout: float = 5.0) -> dict[str, Any]:
+    def index_file(self, file_path: Path) -> dict[str, Any]:
         """
         Process and index a file (text, code, PDF, image).
 
         Args:
             file_path: Path to the file.
-            timeout: Max seconds to wait for file read (for network filesystems).
 
         Returns:
             Dictionary with indexing results including path, title, chunk count, status.
 
         Raises:
-            TimeoutError: If file read times out (e.g., iCloud not synced).
+            FileReadTimeoutError: If file read times out (e.g., iCloud not synced).
+            FileReadError: For I/O errors (permissions, etc.).
             FileNotFoundError: If file doesn't exist.
         """
         db = get_database()
 
         # Get file modification time for change detection
         try:
             file_mtime = file_path.stat().st_mtime
-        except (OSError, TimeoutError) as e:
-            # Re-raise with context for network/cloud filesystem issues
-            raise TimeoutError(str(file_path)) from e
+        except TimeoutError as e:
+            raise FileReadTimeoutError(
+                f"Timed out accessing {file_path} (file may not be synced from cloud storage)"
+            ) from e
+        except OSError as e:
+            raise FileReadError(f"Cannot access {file_path}: {e}") from e
 
         # Get appropriate parser from registry
         parser, asset_type = get_parser_for_file(file_path)
@@ -159,7 +163,7 @@ def index_file(self, file_path: Path, timeout: float = 5.0) -> dict[str, Any]:
                 "reason": "no parser found",
             }
 
-        # Parse the document
+        # Parse the document (parsers handle their own timeout/IO errors)
         parsed = parser.parse_file(file_path)
 
         # Check if document exists for update vs insert
@@ -286,9 +290,21 @@ def should_reindex(self, file_path: Path) -> bool:
 
         Returns:
             True if file should be reindexed, False if unchanged.
+
+        Raises:
+            FileReadTimeoutError: If stat() times out.
+            FileReadError: For I/O errors.
         """
         db = get_database()
-        current_mtime = file_path.stat().st_mtime
+
+        try:
+            current_mtime = file_path.stat().st_mtime
+        except TimeoutError as e:
+            raise FileReadTimeoutError(
+                f"Timed out accessing {file_path} (file may not be synced from cloud storage)"
+            ) from e
+        except OSError as e:
+            raise FileReadError(f"Cannot access {file_path}: {e}") from e
 
         existing = db.get_document_by_path(str(file_path))
         if not existing:
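
The `except TimeoutError` / `except OSError` ordering used in `index_file` and `should_reindex` can be exercised in isolation. The sketch below is illustrative only: the two exception classes are stand-ins copied from `librarian/processing/parsers/base.py`, while `stat_mtime` and `triage` are hypothetical helpers that do not exist in this PR. It assumes a caller that wants to separate "not synced yet" files from other failures.

```python
from pathlib import Path


# Stand-ins mirroring the classes added in parsers/base.py.
class FileReadError(OSError):
    """Raised when a file cannot be read due to I/O errors."""


class FileReadTimeoutError(FileReadError, TimeoutError):
    """Raised when a file read times out."""


def stat_mtime(file_path: Path) -> float:
    """Hypothetical helper using the same except ordering as index_file."""
    try:
        return file_path.stat().st_mtime
    except TimeoutError as e:
        # TimeoutError is itself an OSError, so it must be handled first.
        raise FileReadTimeoutError(f"Timed out accessing {file_path}") from e
    except OSError as e:
        # FileNotFoundError and PermissionError are OSError subclasses,
        # so they are wrapped here as well.
        raise FileReadError(f"Cannot access {file_path}: {e}") from e


def triage(paths: list[Path]) -> dict[str, list[str]]:
    """Hypothetical caller: sort paths into ok / timed_out / failed buckets."""
    buckets: dict[str, list[str]] = {"ok": [], "timed_out": [], "failed": []}
    for path in paths:
        try:
            stat_mtime(path)
        except FileReadTimeoutError:
            # Most specific class first: a timeout is also a FileReadError.
            buckets["timed_out"].append(str(path))
        except FileReadError:
            buckets["failed"].append(str(path))
        else:
            buckets["ok"].append(str(path))
    return buckets
```

Because `FileReadTimeoutError` inherits from both `FileReadError` and `TimeoutError`, existing callers that catch `TimeoutError` or `OSError` should keep working. One consequence of this ordering is that a missing file reaches the `except OSError` branch during `stat()`, so it surfaces as `FileReadError` rather than a bare `FileNotFoundError`.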
 
@@ -7,12 +7,22 @@
     from librarian.processing.parsers import MarkdownParser, ObsidianParser
 """
 
-from librarian.processing.parsers.base import BaseParser
+from librarian.processing.parsers.base import (
+    BaseParser,
+    FileReadError,
+    FileReadTimeoutError,
+    safe_read_bytes,
+    safe_read_text,
+)
 from librarian.processing.parsers.md import MarkdownParser
 from librarian.processing.parsers.obsidian import ObsidianParser
 
 __all__ = [
     "BaseParser",
+    "FileReadError",
+    "FileReadTimeoutError",
     "MarkdownParser",
     "ObsidianParser",
+    "safe_read_bytes",
+    "safe_read_text",
 ]
@@ -5,11 +5,142 @@
 for different document formats (Markdown, Obsidian, etc.).
 """
 
+import logging
+import signal
 from abc import ABC, abstractmethod
 from pathlib import Path
 
 from librarian.types import ParsedDocument
 
+logger = logging.getLogger(__name__)
+
+# Default timeout for file reads (seconds). Handles network/cloud filesystems
+# (iCloud, Dropbox) where files may not be locally available.
+DEFAULT_READ_TIMEOUT = 10
+
+
+class FileReadError(OSError):
+    """Raised when a file cannot be read due to I/O errors."""
+
+
+class FileReadTimeoutError(FileReadError, TimeoutError):
+    """Raised when a file read times out (e.g., iCloud file not synced)."""
+
+
+def _timeout_handler(signum: int, frame: object) -> None:
+    raise FileReadTimeoutError("File read timed out")
+
+
+def safe_read_text(
+    file_path: Path,
+    timeout: int = DEFAULT_READ_TIMEOUT,
+    encoding: str = "utf-8",
+    fallback_encoding: str | None = "latin-1",
+) -> str:
+    """
+    Read a text file with timeout protection and encoding fallback.
+
+    Handles common issues with cloud-synced filesystems (iCloud, Dropbox)
+    where files may not be locally available, causing read_text() to hang.
+
+    Args:
+        file_path: Path to the file.
+        timeout: Max seconds to wait for the read.
+        encoding: Primary encoding to try.
+        fallback_encoding: Fallback encoding if primary fails. None to skip.
+
+    Returns:
+        File content as string.
+
+    Raises:
+        FileNotFoundError: If file doesn't exist.
+        FileReadTimeoutError: If read exceeds timeout.
+        FileReadError: For other I/O errors (permissions, etc.).
+    """
+    if not file_path.exists():
+        msg = f"File not found: {file_path}"
+        raise FileNotFoundError(msg)
+
+    old_handler = signal.getsignal(signal.SIGALRM)
+    content: str | None = None
+    try:
+        signal.signal(signal.SIGALRM, _timeout_handler)
+        signal.alarm(timeout)
+
+        try:
+            content = file_path.read_text(encoding=encoding)
+        except UnicodeDecodeError:
+            if fallback_encoding:
+                content = file_path.read_text(encoding=fallback_encoding)
+            else:
+                raise
+
+        signal.alarm(0)
+    except FileReadTimeoutError as e:
+        raise FileReadTimeoutError(
+            f"Timed out reading {file_path} after {timeout}s "
+            f"(file may not be synced from cloud storage)"
+        ) from e
+    except FileNotFoundError:
+        raise
+    except PermissionError as e:
+        raise FileReadError(f"Permission denied: {file_path}") from e
+    except OSError as e:
+        raise FileReadError(f"Cannot read {file_path}: {e}") from e
+    finally:
+        signal.alarm(0)
+        signal.signal(signal.SIGALRM, old_handler)
+
+    return content
+
+
+def safe_read_bytes(
+    file_path: Path,
+    timeout: int = DEFAULT_READ_TIMEOUT,
+) -> bytes:
+    """
+    Read a binary file with timeout protection.
+
+    Args:
+        file_path: Path to the file.
+        timeout: Max seconds to wait for the read.
+
+    Returns:
+        File content as bytes.
+
+    Raises:
+        FileNotFoundError: If file doesn't exist.
+        FileReadTimeoutError: If read exceeds timeout.
+        FileReadError: For other I/O errors.
+    """
+    if not file_path.exists():
+        msg = f"File not found: {file_path}"
+        raise FileNotFoundError(msg)
+
+    old_handler = signal.getsignal(signal.SIGALRM)
+    content: bytes | None = None
+    try:
+        signal.signal(signal.SIGALRM, _timeout_handler)
+        signal.alarm(timeout)
+        content = file_path.read_bytes()
+        signal.alarm(0)
+    except FileReadTimeoutError as e:
+        raise FileReadTimeoutError(
+            f"Timed out reading {file_path} after {timeout}s "
+            f"(file may not be synced from cloud storage)"
+        ) from e
+    except FileNotFoundError:
+        raise
+    except PermissionError as e:
+        raise FileReadError(f"Permission denied: {file_path}") from e
+    except OSError as e:
+        raise FileReadError(f"Cannot read {file_path}: {e}") from e
+    finally:
+        signal.alarm(0)
+        signal.signal(signal.SIGALRM, old_handler)
+
+    return content
+
 
 class BaseParser(ABC):
     """
@@ -32,6 +163,8 @@ def parse_file(self, file_path: str | Path) -> ParsedDocument:
 
         Raises:
             FileNotFoundError: If the file doesn't exist.
+            FileReadTimeoutError: If file read times out.
+            FileReadError: For I/O errors.
             ValueError: If the file cannot be parsed.
         """
         ...
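
`safe_read_text` and `safe_read_bytes` repeat the same arm/disarm sequence around `SIGALRM`. A minimal sketch of how that sequence could be factored into a context manager follows. The names `read_deadline` and `read_bytes_with_deadline` are hypothetical and not part of this PR, and the exception classes are stand-ins for the ones defined above. The guard reflects two properties of Python's `signal` module: `SIGALRM` is only available on Unix, and `signal.signal()` may only be called from the main thread. Running unprotected when the alarm is unavailable is an assumption of this sketch, not behavior taken from the PR.

```python
import signal
import threading
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path


# Stand-ins mirroring the classes defined in parsers/base.py.
class FileReadError(OSError):
    """Raised when a file cannot be read due to I/O errors."""


class FileReadTimeoutError(FileReadError, TimeoutError):
    """Raised when a file read times out."""


@contextmanager
def read_deadline(file_path: Path, timeout: int) -> Iterator[None]:
    """Hypothetical helper: arm SIGALRM for the duration of the with-block."""
    can_alarm = (
        hasattr(signal, "SIGALRM")
        and threading.current_thread() is threading.main_thread()
    )
    if not can_alarm:
        # Assumption: run without timeout protection instead of failing.
        yield
        return

    def _on_alarm(signum: int, frame: object) -> None:
        raise FileReadTimeoutError(
            f"Timed out reading {file_path} after {timeout}s"
        )

    # signal.signal() returns the previously installed handler.
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout)
    try:
        yield
    finally:
        # Always cancel the pending alarm and restore the previous handler.
        signal.alarm(0)
        if old_handler is None:
            # None means the old handler was not installed from Python.
            old_handler = signal.SIG_DFL
        signal.signal(signal.SIGALRM, old_handler)


def read_bytes_with_deadline(file_path: Path, timeout: int = 10) -> bytes:
    """Hypothetical counterpart to safe_read_bytes built on read_deadline."""
    with read_deadline(file_path, timeout):
        return file_path.read_bytes()
```

With a helper like this, the text and binary readers would differ only in the body of the `with` block, and the error translation (`PermissionError` and `OSError` to `FileReadError`) could stay in each reader as it is now.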