Commit f51769b

feat(cli): add unstructured doctor diagnostics command (#4342)
## Summary

Adds a first-class `unstructured doctor` command so users can verify Python extras, optional system tools, and partitioning readiness before hitting runtime import or tool errors.

Closes #4341

## What’s included

- Console script: `unstructured` (see `[project.scripts]` in `pyproject.toml`).
- Module entry: `python -m unstructured` → doctor (and `__main__.py`).
- `unstructured doctor`: tables for environment, system tools (libmagic smoke, tesseract, pandoc, ffmpeg, LibreOffice), and partitionable file types with `pip install "unstructured[extra]"` hints.
- `unstructured doctor --for <type>`: e.g. pdf, docx, image, audio; exits 1 if the requested capability is not ready, 2 on unknown type.
- `unstructured doctor --file <path>`: infer type via `detect_filetype`, same exit semantics.
- Tests: `test_unstructured/test_cli_doctor.py`.
- Release notes: version 0.22.23 and `CHANGELOG.md` entry.

## How to verify

- `unstructured doctor`
- `unstructured doctor --for pdf`
- `unstructured doctor --file path/to/some.pdf`
- `python -m pytest test_unstructured/test_cli_doctor.py -q`

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Medium risk due to introducing a new CLI entrypoint and changing CSV parsing behavior (engine selection) plus tweaks to metrics DataFrame writes; failures would mainly affect tooling/metrics rather than core partitioning output.
>
> **Overview**
> Introduces a first-class `unstructured` CLI (and `python -m unstructured`) with a `doctor` subcommand that reports environment details, optional system tool availability (e.g., libmagic/tesseract/pandoc/ffmpeg/LibreOffice), and per-filetype partitioning readiness; adds `--for` and `--file` modes with non-zero exit codes when capabilities are missing.
>
> Also tightens pandas usage to avoid chained-assignment issues in metrics reporting, adjusts CSV partitioning to use the Python engine when delimiter inference is needed (`sep=None`), and fixes tests to set environment variables as strings; adds comprehensive tests for the new doctor command and bumps version/docs to `0.22.25`.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 580e27b. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
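
The exit-code contract described in the summary (1 when the requested capability is not ready, 2 for an unknown type, and presumably 0 on success) can be sketched as follows. This is a minimal illustration and not the code added in this commit: the `REQUIRED_MODULES` table, the module names in it, the `check_type` helper, and the `doctor-sketch` program name are all hypothetical stand-ins.

```python
import argparse
import importlib.util

# Hypothetical mapping from a file type to the importable modules it needs.
# These entries are placeholders chosen so the sketch runs with only the
# standard library; they do not reflect the project's real extras.
REQUIRED_MODULES = {
    "json": ["json"],
    "pdf": ["module_that_is_not_installed_xyz"],
}


def check_type(file_type: str) -> int:
    """Return 0 if ready, 1 if a dependency is missing, 2 if the type is unknown."""
    if file_type not in REQUIRED_MODULES:
        print(f"unknown type: {file_type}")
        return 2
    missing = [
        name
        for name in REQUIRED_MODULES[file_type]
        if importlib.util.find_spec(name) is None
    ]
    if missing:
        print(f"{file_type}: missing {', '.join(missing)}")
        return 1
    print(f"{file_type}: ready")
    return 0


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="doctor-sketch")
    # `for` is a Python keyword, so the parsed value is stored under another name.
    parser.add_argument("--for", dest="file_type", required=True)
    args = parser.parse_args(argv)
    return check_type(args.file_type)
```

A console-script wrapper would typically hand the return value of `main()` to `sys.exit`, which is what lets shell scripts and CI jobs branch on the result.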
1 parent b909cf4 commit f51769b

13 files changed

Lines changed: 782 additions & 10 deletions

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+## 0.22.25
+
+### Enhancements
+
+- **`unstructured doctor` CLI**: Add a `unstructured` console script and `python -m unstructured` entry point with a `doctor` subcommand for dependency and capability diagnostics (environment, optional system tools such as libmagic, tesseract, pandoc, ffmpeg, and LibreOffice, and per file-type extras). Supports `doctor --for <type>` (including `image` and `audio` families) and `doctor --file <path>`; exits non-zero when the requested capability is not available.
+
 ## 0.22.24
 
 ### Fixes

pyproject.toml

Lines changed: 9 additions & 0 deletions
@@ -141,6 +141,9 @@ ingest = [
     "unstructured-ingest[airtable,astradb,azure,azure-ai-search,bedrock,biomed,box,chroma,confluence,couchbase,databricks-volumes,delta-table,discord,dropbox,elasticsearch,gcs,github,gitlab,google-drive,hubspot,huggingface,jira,kafka,kdbai,milvus,mongodb,notion,octoai,onedrive,openai,opensearch,outlook,pinecone,postgres,qdrant,reddit,remote,s3,salesforce,sftp,sharepoint,singlestore,slack,vectara,vertexai,voyageai,weaviate,wikipedia]>=1.4.0, <2.0.0; platform_system == 'Windows' and python_version < '3.13'",
 ]
 
+[project.scripts]
+unstructured = "unstructured.cli:main"
+
 [project.urls]
 Homepage = "https://github.com/Unstructured-IO/unstructured"
 
@@ -259,6 +262,12 @@ testpaths = [
     "test_unstructured_ingest",
 ]
 
+[tool.coverage.run]
+omit = [
+    # Entrypoint only; exercised via `unstructured` console script, not imports in tests.
+    "unstructured/__main__.py",
+]
+
 [tool.coverage.report]
 fail_under = 90

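The `[project.scripts]` value uses the `module:attribute` entry-point syntax. As a rough illustration of how such a string is resolved, the sketch below uses `importlib.metadata.EntryPoint` with a standard-library target (`json:dumps`) so that it runs without `unstructured` installed. The entry name `demo` is made up for the example, and the `.module` and `.attr` attributes assume Python 3.9 or later.

```python
from importlib.metadata import EntryPoint

# Same "module:attribute" syntax as `unstructured = "unstructured.cli:main"`,
# pointed at a stdlib function so the example loads without extra installs.
ep = EntryPoint(name="demo", value="json:dumps", group="console_scripts")

module_name = ep.module  # the part before the colon
attr_name = ep.attr      # the part after the colon
func = ep.load()         # imports the module and returns the attribute

print(module_name, attr_name, func({"ok": True}))
```

An installer generates a small launcher for each `[project.scripts]` entry that performs the equivalent import and calls the resulting function.
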
test_unstructured/partition/pdf_image/test_pdf_image_utils.py

Lines changed: 4 additions & 1 deletion
@@ -251,7 +251,10 @@ def test_save_elements(
 
 @pytest.mark.parametrize("storage_enabled", [False, True])
 def test_save_elements_with_output_dir_path_none(monkeypatch, storage_enabled):
-    monkeypatch.setenv("GLOBAL_WORKING_DIR_ENABLED", storage_enabled)
+    monkeypatch.setenv(
+        "GLOBAL_WORKING_DIR_ENABLED",
+        "true" if storage_enabled else "false",
+    )
     with (
         patch("PIL.Image.open"),
         patch("unstructured.partition.pdf_image.pdf_image_utils.write_image"),

test_unstructured/partition/test_text_type.py

Lines changed: 4 additions & 4 deletions
@@ -234,7 +234,7 @@ def test_contains_exceeds_cap_ratio(text, expected, monkeypatch):
 def test_set_caps_ratio_with_environment_variable(monkeypatch):
     monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
     monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
-    monkeypatch.setenv("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", 0.8)
+    monkeypatch.setenv("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", str(0.8))
 
     text = "All The King's Horses. And All The King's Men."
     with patch.object(text_type, "exceeds_cap_ratio", return_value=False) as mock_exceeds:
@@ -246,7 +246,7 @@ def test_set_caps_ratio_with_environment_variable(monkeypatch):
 def test_set_title_non_alpha_threshold_with_environment_variable(monkeypatch):
     monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
     monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
-    monkeypatch.setenv("UNSTRUCTURED_TITLE_NON_ALPHA_THRESHOLD", 0.8)
+    monkeypatch.setenv("UNSTRUCTURED_TITLE_NON_ALPHA_THRESHOLD", str(0.8))
 
     text = "/--------------- All the king's horses----------------/"
     with patch.object(text_type, "under_non_alpha_ratio", return_value=False) as mock_exceeds:
@@ -258,7 +258,7 @@ def test_set_title_non_alpha_threshold_with_environment_variable(monkeypatch):
 def test_set_narrative_text_non_alpha_threshold_with_environment_variable(monkeypatch):
     monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
     monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
-    monkeypatch.setenv("UNSTRUCTURED_NARRATIVE_TEXT_NON_ALPHA_THRESHOLD", 0.8)
+    monkeypatch.setenv("UNSTRUCTURED_NARRATIVE_TEXT_NON_ALPHA_THRESHOLD", str(0.8))
 
     text = "/--------------- All the king's horses----------------/"
     with patch.object(text_type, "under_non_alpha_ratio", return_value=False) as mock_exceeds:
@@ -270,7 +270,7 @@ def test_set_narrative_text_non_alpha_threshold_with_environment_variable(monkey
 def test_set_title_max_word_length_with_environment_variable(monkeypatch):
     monkeypatch.setattr(text_type, "word_tokenize", mock_word_tokenize)
     monkeypatch.setattr(text_type, "sent_tokenize", mock_sent_tokenize)
-    monkeypatch.setenv("UNSTRUCTURED_TITLE_MAX_WORD_LENGTH", 5)
+    monkeypatch.setenv("UNSTRUCTURED_TITLE_MAX_WORD_LENGTH", str(5))
 
     text = "Intellectual Property in the United States"
     assert text_type.is_possible_narrative_text(text) is False

test_unstructured/partition/utils/test_config.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ def test_default_config():
 
 
 def test_env_override(monkeypatch):
-    monkeypatch.setenv("IMAGE_CROP_PAD", 1)
+    monkeypatch.setenv("IMAGE_CROP_PAD", str(1))
     from unstructured.partition.utils.config import env_config
 
     assert env_config.IMAGE_CROP_PAD == 1
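
The test fixes in this commit pass strings to `monkeypatch.setenv` because environment variable values are strings. The sketch below shows the underlying constraint using `os.environ` directly, and shows that `str()` on a boolean produces a capitalized value, unlike the lowercase `"true"`/`"false"` spelling used in the updated `GLOBAL_WORKING_DIR_ENABLED` test. The variable name `DEMO_FLAG` is made up for the example.

```python
import os

# os.environ only accepts str values; other types raise TypeError.
try:
    os.environ["DEMO_FLAG"] = True  # hypothetical variable name
    rejected = False
except TypeError:
    rejected = True

# str() on a bool gives "True"/"False", which differs in case from the
# lowercase "true"/"false" spelling used in the updated test.
enabled = True
os.environ["DEMO_FLAG"] = "true" if enabled else "false"

print(rejected, str(enabled), os.environ["DEMO_FLAG"])
```

Whether a capitalized `"True"` would also be accepted depends on how the consuming code parses the variable, so spelling the value out explicitly avoids relying on that.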
