Commit 33cee68

docs(readme): document pluggable document parsers (#77)
1 parent 2959a8d commit 33cee68

1 file changed

Lines changed: 41 additions & 0 deletions

README.md

@@ -266,6 +266,7 @@ Settings are initialized by `openkb init`, and stored in `.openkb/config.yaml`:
 model: gpt-5.4 # LLM model (any LiteLLM-supported provider)
 language: en # Wiki output language
 pageindex_threshold: 20 # PDF pages threshold for PageIndex
+parser: local # Document parser: local | mineru | mistral | vlm
 ```

 Model names use `provider/model` LiteLLM [format](https://docs.litellm.ai/docs/providers) (OpenAI models can omit the prefix):
@@ -276,6 +277,46 @@ Model names use `provider/model` LiteLLM [format](https://docs.litellm.ai/docs/p
 | Anthropic | `anthropic/claude-sonnet-4-6` |
 | Gemini | `gemini/gemini-3.1-pro-preview` |

+### Document parsers
+
+By default OpenKB extracts Markdown locally (pymupdf for PDFs, markitdown for
+Office/HTML) — no extra dependencies, unchanged behavior. For higher accuracy on
+complex documents you can route the file → Markdown step through an online or
+self-hosted parser:
+
+```yaml
+# .openkb/config.yaml
+parser: mineru  # local (default) | mineru | mistral | vlm
+parsers:
+  mineru:
+    mode: cloud  # cloud | self_hosted
+    base_url: http://localhost:8000  # required when mode is self_hosted
+  vlm:
+    model: gemini/gemini-2.5-pro  # any LiteLLM vision model (Gemini, GPT-4o, Claude, …)
+```
+
+Install the optional dependency for your parser:
+
+```bash
+pip install openkb[mistral] # Mistral OCR
+pip install openkb[mineru] # MinerU (HTTP)
+pip install openkb[parsers] # all online parsers
+# vlm uses the existing LiteLLM dependency — no extra needed
+```
+
+Set the API key via environment variable: `MINERU_API_KEY` (MinerU cloud mode),
+`MISTRAL_API_KEY`; the `vlm` parser reuses the existing `LLM_API_KEY`. Override
+the parser for a single run with `openkb add --parser mistral file.pdf`
+(`local | mineru | mistral | vlm`).
+
+Each parser handles a subset of formats — `mineru` covers PDF, Word, PPT, Excel,
+and HTML; `mistral` and `vlm` cover PDF. `.md` and any unsupported format always
+fall back to the local parser.
+
+> **Note:** Long PDFs (≥ `pageindex_threshold` pages, default 20) continue to be
+> indexed with PageIndex and are **not** affected by the `parser` setting. The
+> parser governs the file → Markdown step for shorter documents and non-PDF files.
+
 ### PageIndex Integration

 Long documents are challenging for LLMs due to context limits, context rot, and summarization loss.
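
The selection rules added in this change (format coverage per parser, fallback to `local`, and the PageIndex bypass for long PDFs) can be read as a small decision procedure. The sketch below is not OpenKB's implementation: `resolve_parser`, `api_key_env`, and the extension sets are hypothetical names and assumptions chosen for illustration. The README names formats rather than file extensions, and it only states that `MINERU_API_KEY` applies to MinerU cloud mode, so treating self-hosted mode as needing no key is also an assumption.

```python
from pathlib import Path
from typing import Optional

# Assumed extension sets; the README lists formats (PDF, Word, PPT, Excel, HTML),
# not the exact extensions OpenKB recognizes.
SUPPORTED = {
    "mineru": {".pdf", ".docx", ".pptx", ".xlsx", ".html"},
    "mistral": {".pdf"},
    "vlm": {".pdf"},
}

# Environment variable named in the README for each non-local parser.
API_KEY_ENV = {
    "mineru": "MINERU_API_KEY",   # README: MinerU cloud mode
    "mistral": "MISTRAL_API_KEY",
    "vlm": "LLM_API_KEY",         # reuses the existing LLM key
}


def resolve_parser(path: str, configured: str = "local",
                   page_count: Optional[int] = None,
                   pageindex_threshold: int = 20) -> str:
    """Return "pageindex", the configured parser, or "local" for one file."""
    ext = Path(path).suffix.lower()
    # Long PDFs go to PageIndex regardless of the `parser` setting.
    if (ext == ".pdf" and page_count is not None
            and page_count >= pageindex_threshold):
        return "pageindex"
    # Use the configured parser only for formats it covers.
    if ext in SUPPORTED.get(configured, set()):
        return configured
    # `.md` and any unsupported format fall back to the local parser.
    return "local"


def api_key_env(parser: str, mineru_mode: str = "cloud") -> Optional[str]:
    """Return the environment variable a parser reads, or None if none applies."""
    # Assumption: self-hosted MinerU does not need MINERU_API_KEY.
    if parser == "mineru" and mineru_mode != "cloud":
        return None
    return API_KEY_ENV.get(parser)
```

Under these assumptions, `resolve_parser("report.pdf", "mistral", page_count=5)` selects `mistral`, the same file at 20 pages is routed to PageIndex, and `resolve_parser("notes.md", "mineru")` falls back to `local`.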
