Commit 220e954

fix(deps): pin markitdown >=0.1.5 with narrow extras (closes #64)
The bare `markitdown[all]` pulled in ~67MB of unused deps (azure-*, pdfminer, pdfplumber, speechrecognition, youtube-transcript-api, pydub, xlrd, olefile). It also let users land on pre-0.1.0 markitdown, where pptx chart parse errors (`#N/A`, `#DIV/0!`, blanks → `ValueError` from python-pptx `CT_StrVal_NumVal_Composite.value`) propagate out and abort the whole conversion.

Switch to `markitdown[docx,pptx,xlsx,xls]>=0.1.5` so that we install only the Office-format extras we actually use and always run against a version whose `_convert_chart_to_markdown` wraps the python-pptx call in `except Exception`: the offending chart degrades to `[unsupported chart]` instead of killing the file.

Also add `.xls` to `SUPPORTED_EXTENSIONS` / `_SHORT_DOC_TYPES` so the `[xls]` extra has a code path that exercises it.

Chart numeric data is still lost on bad cells (upstream limitation in python-pptx; no fix or open PR there). A higher-fidelity pptx path can be added behind a config flag if users need it.
1 parent 91cf6d2 commit 220e954
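
The degrade-instead-of-abort behaviour described in the commit message can be illustrated with a minimal sketch. This is not markitdown's or python-pptx's actual code: `parse_cells`, `chart_to_markdown`, and `convert_charts` are hypothetical names, and `float()` stands in for the python-pptx value accessor that raises `ValueError` on cells such as `#N/A`.

```python
UNSUPPORTED_CHART = "[unsupported chart]"


def parse_cells(cells):
    """Hypothetical stand-in for the python-pptx numeric accessor.

    float() raises ValueError on '#N/A', '#DIV/0!' and blank strings,
    which mimics the failure mode named in the commit message.
    """
    return [float(cell) for cell in cells]


def chart_to_markdown(title, cells):
    """Render one chart, or a placeholder if its data cannot be parsed."""
    try:
        values = parse_cells(cells)
    except Exception:
        # One bad chart must not abort the whole file: degrade to a marker.
        return UNSUPPORTED_CHART
    return f"### Chart: {title}\n" + ", ".join(str(v) for v in values)


def convert_charts(charts):
    """Convert every (title, cells) pair; bad charts become placeholders."""
    return "\n\n".join(chart_to_markdown(title, cells) for title, cells in charts)


if __name__ == "__main__":
    print(convert_charts([("Revenue", ["1", "2.5"]), ("Broken", ["3", "#N/A"])]))
```

The broad `except Exception` trades fidelity for robustness: the numeric data of the failing chart is dropped, which matches the limitation the commit message acknowledges.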

3 files changed

Lines changed: 17 additions & 274 deletions

openkb/cli.py

Lines changed: 2 additions & 2 deletions
@@ -92,7 +92,7 @@ def _setup_llm_key(kb_dir: Path | None = None) -> None:

 # Supported document extensions for the `add` command
 SUPPORTED_EXTENSIONS = {
-    ".pdf", ".md", ".markdown", ".docx", ".pptx", ".xlsx",
+    ".pdf", ".md", ".markdown", ".docx", ".pptx", ".xlsx", ".xls",
     ".html", ".htm", ".txt", ".csv",
 }

@@ -101,7 +101,7 @@ def _setup_llm_key(kb_dir: Path | None = None) -> None:
     "long_pdf": "pageindex",
 }

-_SHORT_DOC_TYPES = {"pdf", "docx", "md", "markdown", "html", "htm", "txt", "csv", "pptx", "xlsx"}
+_SHORT_DOC_TYPES = {"pdf", "docx", "md", "markdown", "html", "htm", "txt", "csv", "pptx", "xlsx", "xls"}


 def _display_type(raw_type: str) -> str:
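
As a rough illustration of why `.xls` has to appear in both sets, the sketch below shows one way such sets might gate a file. The two set literals are copied from the diff; `doc_type` and `is_short_doc` are hypothetical helpers, not functions from `openkb/cli.py`, and the real `add` command may route files differently.

```python
from pathlib import Path

# Copied from the post-commit state of the diff above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".md", ".markdown", ".docx", ".pptx", ".xlsx", ".xls",
    ".html", ".htm", ".txt", ".csv",
}
_SHORT_DOC_TYPES = {"pdf", "docx", "md", "markdown", "html", "htm", "txt", "csv", "pptx", "xlsx", "xls"}


def doc_type(path):
    """Hypothetical helper: return the bare type ('xls') or None if unsupported."""
    suffix = Path(path).suffix.lower()  # case-insensitive: 'REPORT.XLS' -> '.xls'
    if suffix not in SUPPORTED_EXTENSIONS:
        return None
    return suffix.lstrip(".")


def is_short_doc(path):
    """Hypothetical helper: True if the file is accepted and of a short-doc type."""
    kind = doc_type(path)
    return kind is not None and kind in _SHORT_DOC_TYPES


if __name__ == "__main__":
    for name in ("legacy.xls", "REPORT.XLS", "notes.txt", "tool.exe"):
        print(name, doc_type(name), is_short_doc(name))
```

Under these assumptions, a file whose extension is in only one of the two sets would be either rejected outright or accepted without a matching document type, which is why the commit updates both.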

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ classifiers = [
 keywords = ["ai", "rag", "retrieval", "knowledge-base", "llm", "pageindex", "agents", "document"]
 dependencies = [
     "pageindex==0.3.0.dev1",
-    "markitdown[all]",
+    "markitdown[docx,pptx,xlsx,xls]>=0.1.5",
     "trafilatura>=2.0",
     "click>=8.0",
     "watchdog>=3.0",

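For environments installed before this pin, a quick runtime check of the installed markitdown version may be useful. The sketch below is an assumption-laden illustration and not part of this commit: `release_tuple` and `markitdown_is_recent_enough` are hypothetical names, and the comparison looks only at the leading numeric release segments, so it ignores pre-release suffixes (`packaging.version.Version` would be the more rigorous choice).

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = (0, 1, 5)  # mirrors the `>=0.1.5` pin in pyproject.toml


def release_tuple(version_string):
    """Extract the leading numeric release segments, e.g. '0.1.0a1' -> (0, 1, 0)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first segment with no leading digits
        parts.append(int(digits))
    return tuple(parts)


def markitdown_is_recent_enough(installed=None):
    """Hypothetical check: True if markitdown is installed at >= MINIMUM.

    Pass `installed` explicitly to test a version string without
    querying the current environment.
    """
    if installed is None:
        try:
            installed = version("markitdown")
        except PackageNotFoundError:
            return False
    return release_tuple(installed) >= MINIMUM


if __name__ == "__main__":
    print("markitdown >= 0.1.5:", markitdown_is_recent_enough())
```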