Commit 33b747e

Merge pull request #8 from arcadeai-labs/Spartee/fix-sources-errors
Add timeout-safe file reading and clean up dead code
2 parents f12022a + fa223e6 commit 33b747e

14 files changed

Lines changed: 516 additions & 707 deletions

README.md

Lines changed: 20 additions & 87 deletions
@@ -29,6 +29,17 @@ graph LR
 - Time-bounded search filters
 - CLI and MCP server interfaces
 
+## Multi-Modal Support
+
+Librarian supports indexing and searching across multiple file types:
+
+| Asset Type | File Extensions | Features |
+|------------|----------------|----------|
+| **Text** | `.md`, `.txt` | Frontmatter extraction, header-aware chunking |
+| **Code** | `.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.cpp`, and more | Symbol extraction (classes, functions, methods) |
+| **PDF** | `.pdf` | Page-based text extraction |
+| **Image** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp` | Metadata and EXIF extraction, optional OCR |
+
 ## Installation
 
 ```bash
@@ -43,6 +54,14 @@ Or install manually:
 uv pip install -e ".[dev]"
 ```
 
+Optional multi-modal dependencies:
+
+```bash
+uv pip install -e ".[pdf]" # PDF support (pypdf)
+uv pip install -e ".[vision]" # Image support (Pillow)
+uv pip install -e ".[all]" # All optional features
+```
+
 ## CLI Usage
 
 ```bash
@@ -126,7 +145,7 @@ librarian/
 │   └── fts_store.py  # FTS5 search
 ├── processing/
 │   ├── embed/  # Embedding providers
-│   ├── parsers/  # Document parsers
+│   ├── parsers/  # Document parsers (md, code, pdf, image)
 │   └── transform/  # Text chunking
 ├── retrieval/
 │   └── search.py  # Hybrid search + MMR
@@ -159,89 +178,3 @@ MIT License - see [LICENSE](LICENSE) for details.
 
 - Email: <contact@arcade.dev>
 - Website: [arcade.dev](https://arcade.dev)
-
-## Current Limitations & Roadmap
-
-### Image Search Limitations
-
-Images are currently indexed by **metadata only** (filename, format, dimensions, EXIF data). The system does not yet understand visual content.
-
-**What works now**:
-- Search by filename: `search("diagram.png")`
-- Search by format: `search("PNG")`
-- Filter results by asset type
-
-**What doesn't work yet**:
-- Visual content search: `search("architecture diagram")` won't understand what's IN the image
-- Text within images: Can't find text that appears inside screenshots or diagrams
-- Image-to-image similarity: Can't find visually similar images
-
-### Multi-Modal Roadmap
-
-| Phase | Feature | Status | Impact | Effort | ETA |
-|-------|---------|--------|--------|--------|-----|
-| **1** | **Documentation & Config** | **In Progress** | Set expectations | Low | v0.6.0 |
-| | Document current limitations | Complete | Users understand metadata-only indexing | - | - |
-| | Add configuration structure | Planned | Prepare for future embedding models | - | - |
-| **2** | **OCR for Images** | **Planned** | Extract text FROM images | High | v0.6.0 |
-| | Add pytesseract integration | Planned | Search text in screenshots | Low | 2-3 days |
-| | Enable text extraction from diagrams | Planned | Find labels, annotations in images | - | - |
-| | Search scanned documents | Planned | Index PDF images and photos | - | - |
-| **3** | **CLIP Visual Embeddings** | Planned | True visual understanding | Very High | v0.7.0 |
-| | Add CLIP model integration | Planned | Text-to-image semantic search | Medium | 5-7 days |
-| | Create vision vector table | Planned | Separate 512-dim embeddings | - | - |
-| | Implement search_images tool | Planned | Find images by visual content | - | - |
-| **4** | **CodeBERT for Code** | Planned | Better code search | Medium | v0.8.0 |
-| | Add CodeBERT embeddings | Planned | Improved semantic code search | Medium | 4-5 days |
-| | Cross-language similarity | Planned | Find similar algorithms across languages | - | - |
-| **5** | **Cross-Modal Search** | Planned | Unified search experience | High | v1.0.0 |
-| | Merge results across modalities | Planned | Single query finds all asset types | High | 3-4 days |
-| | Score normalization | Planned | Fair ranking across embedding spaces | - | - |
-
-### Next Steps
-
-**Immediate** (v0.6.0 - This Month):
-1. Add OCR support with pytesseract
-2. Enable text extraction from images
-3. Document installation and configuration
-4. Test with screenshots and diagrams
-
-**Short-term** (v0.7.0 - Next Month):
-1. Evaluate OCR adoption and usage patterns
-2. Decide on CLIP investment based on image search demand
-3. If validated: Implement CLIP visual embeddings
-4. Add text-to-image semantic search
-
-**Long-term** (v0.8.0+):
-1. CodeBERT for improved code search (if needed)
-2. Cross-modal unified search
-3. Audio transcription (Whisper)
-4. Video frame extraction
-
-**Decision Points**:
-- **After OCR**: Measure adoption before investing in CLIP
-- **After CLIP**: Assess if CodeBERT adds value over text embeddings
-- **After individual modalities**: Evaluate need for unified cross-modal search
-
-### Installing Optional Features
-
-```bash
-# OCR support (v0.6.0+)
-# Enabled by default - requires Tesseract
-uv pip install -e ".[ocr]"
-brew install tesseract # macOS
-# To disable: export ENABLE_OCR=false
-
-# Vision support with CLIP (v0.7.0+)
-uv pip install -e ".[vision]"
-export ENABLE_VISION_EMBEDDINGS=true
-
-# Code embeddings with CodeBERT (v0.8.0+)
-uv pip install -e ".[code]"
-export ENABLE_CODE_EMBEDDINGS=true
-
-# All features
-uv pip install -e ".[all]"
-```
-
-Vision and code embeddings are **opt-in** and disabled by default. OCR is **enabled by default** (v0.6.0+).

librarian/indexing.py

Lines changed: 24 additions & 8 deletions
@@ -12,6 +12,7 @@
 
 from librarian.config import ENABLE_CODE_EMBEDDINGS, ENABLE_VISION_EMBEDDINGS
 from librarian.processing.embed import get_embedder, get_embedder_for_modality
+from librarian.processing.parsers.base import FileReadError, FileReadTimeoutError
 from librarian.processing.parsers.registry import get_parser_for_file
 from librarian.processing.transform.chunker import Chunker, ChunkingStrategy
 from librarian.processing.transform.code import CodeChunker, chunk_code_by_blocks
@@ -123,29 +124,32 @@ def _embed_image_chunks(
         fallback: list[list[float]] = embedder.embed_documents([c.content for c in chunks])
         return fallback
 
-    def index_file(self, file_path: Path, timeout: float = 5.0) -> dict[str, Any]:
+    def index_file(self, file_path: Path) -> dict[str, Any]:
         """
         Process and index a file (text, code, PDF, image).
 
         Args:
             file_path: Path to the file.
-            timeout: Max seconds to wait for file read (for network filesystems).
 
         Returns:
             Dictionary with indexing results including path, title, chunk count, status.
 
         Raises:
-            TimeoutError: If file read times out (e.g., iCloud not synced).
+            FileReadTimeoutError: If file read times out (e.g., iCloud not synced).
+            FileReadError: For I/O errors (permissions, etc.).
             FileNotFoundError: If file doesn't exist.
         """
         db = get_database()
 
         # Get file modification time for change detection
         try:
             file_mtime = file_path.stat().st_mtime
-        except (OSError, TimeoutError) as e:
-            # Re-raise with context for network/cloud filesystem issues
-            raise TimeoutError(str(file_path)) from e
+        except TimeoutError as e:
+            raise FileReadTimeoutError(
+                f"Timed out accessing {file_path} (file may not be synced from cloud storage)"
+            ) from e
+        except OSError as e:
+            raise FileReadError(f"Cannot access {file_path}: {e}") from e
 
         # Get appropriate parser from registry
         parser, asset_type = get_parser_for_file(file_path)
@@ -159,7 +163,7 @@ def index_file(self, file_path: Path, timeout: float = 5.0) -> dict[str, Any]:
                 "reason": "no parser found",
             }
 
-        # Parse the document
+        # Parse the document (parsers handle their own timeout/IO errors)
         parsed = parser.parse_file(file_path)
 
         # Check if document exists for update vs insert
@@ -286,9 +290,21 @@ def should_reindex(self, file_path: Path) -> bool:
 
         Returns:
             True if file should be reindexed, False if unchanged.
+
+        Raises:
+            FileReadTimeoutError: If stat() times out.
+            FileReadError: For I/O errors.
         """
         db = get_database()
-        current_mtime = file_path.stat().st_mtime
+
+        try:
+            current_mtime = file_path.stat().st_mtime
+        except TimeoutError as e:
+            raise FileReadTimeoutError(
+                f"Timed out accessing {file_path} (file may not be synced from cloud storage)"
+            ) from e
+        except OSError as e:
+            raise FileReadError(f"Cannot access {file_path}: {e}") from e
 
         existing = db.get_document_by_path(str(file_path))
         if not existing:
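
Reviewer note: a caller that indexes many files may want to treat the two new exception types differently, for example by setting timed-out files aside for a later retry. The sketch below is illustrative only. `index_many` and the `index_file` callable passed into it are hypothetical names, and the two exception classes are re-declared locally (mirroring `base.py` in this commit) so the snippet runs without the `librarian` package. The timeout class is caught first because it subclasses `FileReadError`.

```python
from pathlib import Path
from typing import Callable, Iterable


class FileReadError(OSError):
    """Re-declared here for illustration; mirrors base.py in this commit."""


class FileReadTimeoutError(FileReadError, TimeoutError):
    """Re-declared here for illustration; mirrors base.py in this commit."""


def index_many(
    paths: Iterable[Path],
    index_file: Callable[[Path], object],  # hypothetical stand-in for the indexer method
) -> dict[str, list[str]]:
    """Index each path and sort the outcomes into three buckets."""
    report: dict[str, list[str]] = {"indexed": [], "timed_out": [], "failed": []}
    for path in paths:
        try:
            index_file(path)
        except FileReadTimeoutError:
            # Most specific handler first: the file may simply not be synced yet.
            report["timed_out"].append(str(path))
        except (FileReadError, FileNotFoundError):
            # Permission problems, other I/O errors, or a missing file.
            report["failed"].append(str(path))
        else:
            report["indexed"].append(str(path))
    return report
```

Because `FileNotFoundError` is itself an `OSError`, the `stat()` branch in the hunk above would appear to report a missing file as `FileReadError`; the sketch catches both for that reason.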

librarian/processing/parsers/__init__.py

Lines changed: 11 additions & 1 deletion
@@ -7,12 +7,22 @@
     from librarian.processing.parsers import MarkdownParser, ObsidianParser
 """
 
-from librarian.processing.parsers.base import BaseParser
+from librarian.processing.parsers.base import (
+    BaseParser,
+    FileReadError,
+    FileReadTimeoutError,
+    safe_read_bytes,
+    safe_read_text,
+)
 from librarian.processing.parsers.md import MarkdownParser
 from librarian.processing.parsers.obsidian import ObsidianParser
 
 __all__ = [
     "BaseParser",
+    "FileReadError",
+    "FileReadTimeoutError",
     "MarkdownParser",
     "ObsidianParser",
+    "safe_read_bytes",
+    "safe_read_text",
 ]
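
Reviewer note: the newly exported `FileReadTimeoutError` inherits from both `FileReadError` and the built-in `TimeoutError`, so code written against the old `Raises: TimeoutError` contract should keep working. The snippet below is a small sketch of that property; the classes are re-declared locally to keep it runnable without `librarian`, and `legacy_caller` is a hypothetical function, not part of this commit.

```python
class FileReadError(OSError):
    """Re-declared here for illustration; mirrors base.py in this commit."""


class FileReadTimeoutError(FileReadError, TimeoutError):
    """Re-declared here for illustration; mirrors base.py in this commit."""


def legacy_caller() -> str:
    """Hypothetical older caller that only knows about the built-in TimeoutError."""
    try:
        raise FileReadTimeoutError("Timed out reading notes.md")
    except TimeoutError as e:
        # Still reached, because FileReadTimeoutError is a TimeoutError subclass.
        return f"retry later: {e}"
```

The same exception is also matched by `except FileReadError` and `except OSError`, which is why handlers should list the timeout case before the broader ones.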

librarian/processing/parsers/base.py

Lines changed: 133 additions & 0 deletions
@@ -5,11 +5,142 @@
 for different document formats (Markdown, Obsidian, etc.).
 """
 
+import logging
+import signal
 from abc import ABC, abstractmethod
 from pathlib import Path
 
 from librarian.types import ParsedDocument
 
+logger = logging.getLogger(__name__)
+
+# Default timeout for file reads (seconds). Handles network/cloud filesystems
+# (iCloud, Dropbox) where files may not be locally available.
+DEFAULT_READ_TIMEOUT = 10
+
+
+class FileReadError(OSError):
+    """Raised when a file cannot be read due to I/O errors."""
+
+
+class FileReadTimeoutError(FileReadError, TimeoutError):
+    """Raised when a file read times out (e.g., iCloud file not synced)."""
+
+
+def _timeout_handler(signum: int, frame: object) -> None:
+    raise FileReadTimeoutError("File read timed out")
+
+
+def safe_read_text(
+    file_path: Path,
+    timeout: int = DEFAULT_READ_TIMEOUT,
+    encoding: str = "utf-8",
+    fallback_encoding: str | None = "latin-1",
+) -> str:
+    """
+    Read a text file with timeout protection and encoding fallback.
+
+    Handles common issues with cloud-synced filesystems (iCloud, Dropbox)
+    where files may not be locally available, causing read_text() to hang.
+
+    Args:
+        file_path: Path to the file.
+        timeout: Max seconds to wait for the read.
+        encoding: Primary encoding to try.
+        fallback_encoding: Fallback encoding if primary fails. None to skip.
+
+    Returns:
+        File content as string.
+
+    Raises:
+        FileNotFoundError: If file doesn't exist.
+        FileReadTimeoutError: If read exceeds timeout.
+        FileReadError: For other I/O errors (permissions, etc.).
+    """
+    if not file_path.exists():
+        msg = f"File not found: {file_path}"
+        raise FileNotFoundError(msg)
+
+    old_handler = signal.getsignal(signal.SIGALRM)
+    content: str | None = None
+    try:
+        signal.signal(signal.SIGALRM, _timeout_handler)
+        signal.alarm(timeout)
+
+        try:
+            content = file_path.read_text(encoding=encoding)
+        except UnicodeDecodeError:
+            if fallback_encoding:
+                content = file_path.read_text(encoding=fallback_encoding)
+            else:
+                raise
+
+        signal.alarm(0)
+    except FileReadTimeoutError as e:
+        raise FileReadTimeoutError(
+            f"Timed out reading {file_path} after {timeout}s "
+            f"(file may not be synced from cloud storage)"
+        ) from e
+    except FileNotFoundError:
+        raise
+    except PermissionError as e:
+        raise FileReadError(f"Permission denied: {file_path}") from e
+    except OSError as e:
+        raise FileReadError(f"Cannot read {file_path}: {e}") from e
+    finally:
+        signal.alarm(0)
+        signal.signal(signal.SIGALRM, old_handler)
+
+    return content
+
+
+def safe_read_bytes(
+    file_path: Path,
+    timeout: int = DEFAULT_READ_TIMEOUT,
+) -> bytes:
+    """
+    Read a binary file with timeout protection.
+
+    Args:
+        file_path: Path to the file.
+        timeout: Max seconds to wait for the read.
+
+    Returns:
+        File content as bytes.
+
+    Raises:
+        FileNotFoundError: If file doesn't exist.
+        FileReadTimeoutError: If read exceeds timeout.
+        FileReadError: For other I/O errors.
+    """
+    if not file_path.exists():
+        msg = f"File not found: {file_path}"
+        raise FileNotFoundError(msg)
+
+    old_handler = signal.getsignal(signal.SIGALRM)
+    content: bytes | None = None
+    try:
+        signal.signal(signal.SIGALRM, _timeout_handler)
+        signal.alarm(timeout)
+        content = file_path.read_bytes()
+        signal.alarm(0)
+    except FileReadTimeoutError as e:
+        raise FileReadTimeoutError(
+            f"Timed out reading {file_path} after {timeout}s "
+            f"(file may not be synced from cloud storage)"
+        ) from e
+    except FileNotFoundError:
+        raise
+    except PermissionError as e:
+        raise FileReadError(f"Permission denied: {file_path}") from e
+    except OSError as e:
+        raise FileReadError(f"Cannot read {file_path}: {e}") from e
+    finally:
+        signal.alarm(0)
+        signal.signal(signal.SIGALRM, old_handler)
+
+    return content
+
 
 class BaseParser(ABC):
     """
@@ -32,6 +163,8 @@ def parse_file(self, file_path: str | Path) -> ParsedDocument:
 
         Raises:
             FileNotFoundError: If the file doesn't exist.
+            FileReadTimeoutError: If file read times out.
+            FileReadError: For I/O errors.
             ValueError: If the file cannot be parsed.
         """
         ...
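
Reviewer note: `signal.SIGALRM` and `signal.alarm` exist only on Unix, and Python only permits `signal.signal` to be called from the main thread, so `safe_read_text` and `safe_read_bytes` as written would likely fail on Windows or when invoked from a worker thread. One portable alternative, offered here as an assumption rather than anything this commit does, is to run the read in a worker thread and bound the wait with `Future.result(timeout=...)`. The function name `read_bytes_with_thread_timeout` and its `reader` parameter are hypothetical, and the exception classes are re-declared locally so the sketch is self-contained.

```python
import concurrent.futures
from pathlib import Path
from typing import Callable


class FileReadError(OSError):
    """Re-declared here for illustration; mirrors base.py in this commit."""


class FileReadTimeoutError(FileReadError, TimeoutError):
    """Re-declared here for illustration; mirrors base.py in this commit."""


def read_bytes_with_thread_timeout(
    file_path: Path,
    timeout: float = 10.0,
    reader: Callable[[Path], bytes] = Path.read_bytes,  # injectable for testing
) -> bytes:
    """Read a file in a worker thread, giving up after `timeout` seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(reader, file_path)
    try:
        # Blocks for at most `timeout` seconds; errors raised by the reader
        # (for example FileNotFoundError) propagate unchanged.
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError as e:
        raise FileReadTimeoutError(
            f"Timed out reading {file_path} after {timeout}s"
        ) from e
    finally:
        # Do not wait for the worker here, or a stuck read would block the caller.
        executor.shutdown(wait=False)
```

The trade-off is that a stuck read cannot be cancelled: the worker thread keeps running until the underlying call returns, so this approach bounds how long the caller waits, not how long the I/O takes.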
