@@ -266,6 +266,7 @@ Settings are initialized by `openkb init`, and stored in `.openkb/config.yaml`:
 model: gpt-5.4              # LLM model (any LiteLLM-supported provider)
 language: en                # Wiki output language
 pageindex_threshold: 20     # PDF pages threshold for PageIndex
+parser: local               # Document parser: local | mineru | mistral | vlm
 ```

 Model names use `provider/model` LiteLLM [format](https://docs.litellm.ai/docs/providers) (OpenAI models can omit the prefix):
@@ -276,6 +277,46 @@ Model names use `provider/model` LiteLLM [format](https://docs.litellm.ai/docs/p
 | Anthropic | `anthropic/claude-sonnet-4-6` |
 | Gemini | `gemini/gemini-3.1-pro-preview` |

+### Document parsers
+
+By default, OpenKB extracts Markdown locally (pymupdf for PDFs, markitdown for
+Office/HTML), with no extra dependencies and unchanged behavior. For higher
+accuracy on complex documents, you can route the file → Markdown step through an
+online or self-hosted parser:
+
+```yaml
+# .openkb/config.yaml
+parser: mineru                        # local (default) | mineru | mistral | vlm
+parsers:
+  mineru:
+    mode: cloud                       # cloud | self_hosted
+    base_url: http://localhost:8000   # required when mode is self_hosted
+  vlm:
+    model: gemini/gemini-2.5-pro      # any LiteLLM vision model (Gemini, GPT-4o, Claude, …)
+```
+
+Install the optional dependency for your parser:
+
+```bash
+# Quote the extras so shells such as zsh do not glob-expand the brackets
+pip install "openkb[mistral]"   # Mistral OCR
+pip install "openkb[mineru]"    # MinerU (HTTP)
+pip install "openkb[parsers]"   # all online parsers
+# vlm uses the existing LiteLLM dependency; no extra needed
+```
+
+Set the API key via an environment variable: `MINERU_API_KEY` (MinerU cloud mode)
+or `MISTRAL_API_KEY` (Mistral OCR); the `vlm` parser reuses the existing
+`LLM_API_KEY`. Override the parser for a single run with
+`openkb add --parser mistral file.pdf`; `--parser` accepts
+`local | mineru | mistral | vlm`.
+
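The key and override rules above can be sketched as a small pre-flight check. This is only an illustration: `effective_parser` and `missing_settings` are hypothetical helpers, not part of OpenKB, and the sketch assumes that MinerU defaults to cloud mode when `mode` is unset and that self-hosted MinerU needs a `base_url` but no API key.

```python
import os

# Parser -> environment variable, as listed in the text above.
REQUIRED_ENV = {
    "mineru": "MINERU_API_KEY",    # assumed to apply to cloud mode only
    "mistral": "MISTRAL_API_KEY",
    "vlm": "LLM_API_KEY",
}
VALID_PARSERS = ("local", "mineru", "mistral", "vlm")


def effective_parser(config, cli_parser=None):
    """Hypothetical helper: --parser beats config, which beats 'local'."""
    name = cli_parser or config.get("parser", "local")
    if name not in VALID_PARSERS:
        raise ValueError(f"unknown parser: {name!r}")
    return name


def missing_settings(config, parser, env=None):
    """Return the settings still missing before `parser` can be used."""
    env = os.environ if env is None else env
    mineru = (config.get("parsers") or {}).get("mineru") or {}
    mode = mineru.get("mode", "cloud")  # assumption: cloud when unspecified
    missing = []
    if parser == "mineru" and mode == "self_hosted":
        # Self-hosted MinerU is configured by URL rather than by API key.
        if not mineru.get("base_url"):
            missing.append("parsers.mineru.base_url")
    elif parser in REQUIRED_ENV and not env.get(REQUIRED_ENV[parser]):
        missing.append(REQUIRED_ENV[parser])
    return missing
```

Calling `missing_settings(config, effective_parser(config, cli_parser))` before a run would surface a missing key early instead of failing in the middle of parsing.
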
+Each parser handles a subset of formats: `mineru` covers PDF, Word, PPT, Excel,
+and HTML; `mistral` and `vlm` cover PDF. `.md` files and any format the selected
+parser does not support always fall back to the local parser.
+
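The fallback rule can be pictured as a lookup by file extension. The sketch below is not OpenKB's implementation: `parser_for_file` is a hypothetical name, and the extension sets are assumptions, since the text names only format families (PDF, Word, PPT, Excel, HTML).

```python
from pathlib import Path

# Assumed extension sets; the text lists format families, not extensions.
SUPPORTED = {
    "mineru": {".pdf", ".doc", ".docx", ".ppt", ".pptx",
               ".xls", ".xlsx", ".html", ".htm"},
    "mistral": {".pdf"},
    "vlm": {".pdf"},
}


def parser_for_file(path, requested):
    """Return the parser that would handle `path`, falling back to 'local'."""
    ext = Path(path).suffix.lower()
    if ext in SUPPORTED.get(requested, set()):
        return requested
    # .md and every format the requested parser does not cover go local.
    return "local"
```

Under these assumptions, `parser_for_file("notes.md", "mineru")` and `parser_for_file("deck.pptx", "mistral")` both resolve to `"local"`, while `parser_for_file("deck.pptx", "mineru")` keeps `"mineru"`.
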
+> **Note:** Long PDFs (≥ `pageindex_threshold` pages, default 20) continue to be
+> indexed with PageIndex and are **not** affected by the `parser` setting. The
+> parser governs the file → Markdown step for shorter documents and non-PDF files.
+
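The interaction between `pageindex_threshold` and `parser` described in the note can be sketched as a single routing decision. `conversion_route` is a hypothetical function written for illustration; how OpenKB actually counts pages or dispatches to PageIndex is not shown in this section.

```python
def conversion_route(path, page_count, parser="local", pageindex_threshold=20):
    """Return 'pageindex' for long PDFs, otherwise the configured parser.

    `page_count` may be None for formats that have no page count.
    """
    is_pdf = path.lower().endswith(".pdf")
    if is_pdf and page_count is not None and page_count >= pageindex_threshold:
        # Long PDFs bypass the `parser` setting entirely.
        return "pageindex"
    # Shorter PDFs and all non-PDF files use the file -> Markdown parser.
    return parser
```

With the default threshold, a 20-page PDF is routed to PageIndex even when `parser` is `mistral`, while a 19-page PDF or a long `.docx` file still goes through the configured parser.
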
 ### PageIndex Integration

 Long documents are challenging for LLMs due to context limits, context rot, and summarization loss.