docs(use-struct-tree): clarify that output quality depends on tag quality

bundolee · claude · bundolee · commit ece9c1f6bf3a · 2026-05-15T14:55:42.000+09:00
Objective: A user reported that --use-struct-tree drops markdown
headings (##) and omits some content. The PDF turned out to be a
deck-style document tagged entirely as <P> with zero heading tags,
so the option was behaving as designed but the user expected it to
recover structure. The docs did not set this expectation anywhere.

Approach: Add a one-line clarification to both the README Tagged PDF
Support section and the CLI option description: output quality
depends on tag quality, and for PDFs with sparse or incorrect tags
the default heuristic mode or --hybrid is often a better fit. No
behavior change — keep --use-struct-tree strict and let the docs
align user expectation with actual behavior.
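
As a rough illustration of what "tag quality" means here, the sketch below looks at the structure-type names found in a PDF and suggests which mode is likely to serve better. It is only a sketch under assumptions: `suggest_mode`, its return labels, and the "no heading tags means heuristic" rule are hypothetical and are not part of OpenDataLoader, and the caller is assumed to have collected the tag names by some other means.

```python
from collections import Counter

# Standard PDF structure types that mark headings.
HEADING_TAGS = {"H", "H1", "H2", "H3", "H4", "H5", "H6"}


def suggest_mode(tag_names):
    """Return "struct-tree" or "heuristic" for a list of structure-type names.

    Hypothetical helper: the rule is deliberately crude. If the document has
    no tags at all, or no heading tags, the structure tree cannot supply
    headings, so heuristic extraction is suggested instead.
    """
    counts = Counter(tag_names)
    if sum(counts.values()) == 0:
        return "heuristic"
    heading_count = sum(counts[tag] for tag in HEADING_TAGS)
    if heading_count == 0:
        return "heuristic"
    return "struct-tree"
```

For the reported deck-style PDF, every element was tagged <P>, so a check like this would point away from --use-struct-tree.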

Evidence: Ran `npm run sync` to regenerate bindings and verified
the new description propagates everywhere users see it.

Before: "Use PDF structure tree (tagged PDF) for reading order and
semantic structure"
After: "Use PDF structure tree (tagged PDF) for reading order and
semantic structure. Output quality depends on tag quality"

Verified in:
- java -jar ...cli.jar --help  (CLI help text)
- options.json                  (generator source of truth)
- python/.../cli_options_generated.py (Python SDK)
- node/.../cli-options.generated.ts   (Node SDK)
- README.md Tagged PDF Support section (Note block added)
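
The propagation check above was done by hand. A minimal sketch of how it could be scripted follows; `files_missing_text` is a hypothetical helper, not something in the repository, and it only tests for a plain substring match.

```python
from pathlib import Path

# The sentence added by this commit.
NEW_DESC = "Output quality depends on tag quality"


def files_missing_text(paths, text=NEW_DESC):
    """Return the paths that do not exist or do not contain `text`."""
    missing = []
    for p in paths:
        path = Path(p)
        if not path.is_file() or text not in path.read_text(encoding="utf-8"):
            missing.append(str(p))
    return missing
```

Called from the repository root with paths such as `options.json` and `README.md`, an empty result would mean the new sentence reached every listed file.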

Maven build: SUCCESS, 21 tests passed.

Fixes PDFDLOSP-8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -328,6 +328,8 @@ Combine formats: `format="json,markdown"`
 
 When a PDF has structure tags, OpenDataLoader extracts the **exact layout** the author intended — no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.
 
+> **Output quality depends on tag quality.** Not all tagged PDFs are well-tagged. For PDFs with sparse or incorrect tags, the default heuristic mode or `--hybrid docling-fast` often produces better results.
+
 ```python
 # Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
diff --git a/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java b/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java
@@ -106,7 +106,7 @@ public class CLIOptions {
 
     // ===== Use Struct Tree =====
     private static final String USE_STRUCT_TREE_LONG_OPTION = "use-struct-tree";
-    private static final String USE_STRUCT_TREE_DESC = "Use PDF structure tree (tagged PDF) for reading order and semantic structure";
+    private static final String USE_STRUCT_TREE_DESC = "Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality";
 
     // ===== Table Method =====
     private static final String TABLE_METHOD_LONG_OPTION = "table-method";
diff --git a/node/opendataloader-pdf/src/cli-options.generated.ts b/node/opendataloader-pdf/src/cli-options.generated.ts
@@ -15,7 +15,7 @@ export function registerCliOptions(program: Command): void {
   program.option('--sanitize', 'Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders');
   program.option('--keep-line-breaks', 'Preserve original line breaks in extracted text');
   program.option('--replace-invalid-chars <value>', 'Replacement character for invalid/unrecognized characters. Default: space');
-  program.option('--use-struct-tree', 'Use PDF structure tree (tagged PDF) for reading order and semantic structure');
+  program.option('--use-struct-tree', 'Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality');
   program.option('--table-method <value>', 'Table detection method. Values: default (border-based), cluster (border + cluster). Default: default');
   program.option('--reading-order <value>', 'Reading order algorithm. Values: off, xycut. Default: xycut');
   program.option('--markdown-page-separator <value>', 'Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none');
diff --git a/node/opendataloader-pdf/src/convert-options.generated.ts b/node/opendataloader-pdf/src/convert-options.generated.ts
@@ -21,7 +21,7 @@ export interface ConvertOptions {
   keepLineBreaks?: boolean;
   /** Replacement character for invalid/unrecognized characters. Default: space */
   replaceInvalidChars?: string;
-  /** Use PDF structure tree (tagged PDF) for reading order and semantic structure */
+  /** Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality */
   useStructTree?: boolean;
   /** Table detection method. Values: default (border-based), cluster (border + cluster). Default: default */
   tableMethod?: string;
diff --git a/options.json b/options.json
@@ -70,7 +70,7 @@
       "type": "boolean",
       "required": false,
       "default": false,
-      "description": "Use PDF structure tree (tagged PDF) for reading order and semantic structure"
+      "description": "Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality"
     },
     {
       "name": "table-method",
diff --git a/python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py b/python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py
@@ -88,7 +88,7 @@
         "type": "boolean",
         "required": False,
         "default": False,
-        "description": "Use PDF structure tree (tagged PDF) for reading order and semantic structure",
+        "description": "Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality",
     },
     {
         "name": "table-method",
diff --git a/python/opendataloader-pdf/src/opendataloader_pdf/convert_generated.py b/python/opendataloader-pdf/src/opendataloader_pdf/convert_generated.py
@@ -56,7 +56,7 @@ def convert(
         sanitize: Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders
         keep_line_breaks: Preserve original line breaks in extracted text
         replace_invalid_chars: Replacement character for invalid/unrecognized characters. Default: space
-        use_struct_tree: Use PDF structure tree (tagged PDF) for reading order and semantic structure
+        use_struct_tree: Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality
         table_method: Table detection method. Values: default (border-based), cluster (border + cluster). Default: default
         reading_order: Reading order algorithm. Values: off, xycut. Default: xycut
         markdown_page_separator: Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none
