Commit ece9c1f

bundolee and claude committed
docs(use-struct-tree): clarify that output quality depends on tag quality
Objective: A user reported that --use-struct-tree drops markdown headings (##) and omits some content. The PDF turned out to be a deck-style document tagged entirely as <P> with zero heading tags, so the option was behaving as designed, but the user expected it to recover structure. The docs did not set this expectation anywhere.

Approach: Add a one-line clarification to both the README Tagged PDF Support section and the CLI option description: output quality depends on tag quality, and for PDFs with sparse or incorrect tags the default heuristic mode or --hybrid is often a better fit. No behavior change: keep --use-struct-tree strict and let the docs align user expectation with actual behavior.

Evidence: Ran `npm run sync` to regenerate bindings and verified the new description propagates everywhere users see it.

Before: "Use PDF structure tree (tagged PDF) for reading order and semantic structure"
After: "Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality"

Verified in:
- java -jar ...cli.jar --help (CLI help text)
- options.json (generator source of truth)
- python/.../cli_options_generated.py (Python SDK)
- node/.../cli-options.generated.ts (Node SDK)
- README.md Tagged PDF Support section (Note block added)

Maven build: SUCCESS, 21 tests passed.

Fixes PDFDLOSP-8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
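The failure mode described in the Objective can be illustrated with a small sketch. It is not part of this commit or of OpenDataLoader: `heading_ratio`, `suggest_mode`, and the 1% threshold are hypothetical names and values chosen only to show why a PDF tagged entirely as <P> yields no headings when the structure tree is trusted. The heading tag names (H, H1 to H6) are the standard PDF structure types.

```python
# Illustrative only: these helpers are hypothetical and do not exist in OpenDataLoader.

# Standard PDF structure types that mark headings.
HEADING_TAGS = {"H", "H1", "H2", "H3", "H4", "H5", "H6"}


def heading_ratio(tags):
    """Return the fraction of structure elements that are heading tags."""
    if not tags:
        return 0.0
    headings = sum(1 for tag in tags if tag in HEADING_TAGS)
    return headings / len(tags)


def suggest_mode(tags, min_heading_ratio=0.01):
    """Suggest an extraction mode from the mix of structure tags.

    The threshold is an assumption made for this example, not a value
    used by the real tool.
    """
    if heading_ratio(tags) >= min_heading_ratio:
        return "use-struct-tree"
    return "heuristic"


# A deck-style PDF tagged entirely as <P>, as in the reported case.
deck_tags = ["P"] * 40
# A document whose tags include headings, a list, and a table.
report_tags = ["H1", "P", "P", "H2", "P", "L", "Table"]

print(suggest_mode(deck_tags))    # heuristic
print(suggest_mode(report_tags))  # use-struct-tree
```

With zero heading tags, a strict struct-tree reader has nothing to map to `##`, which is why the docs now point such PDFs at the default heuristic mode or --hybrid.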
1 parent 105bcea commit ece9c1f

7 files changed

Lines changed: 8 additions & 6 deletions

README.md

Lines changed: 2 additions & 0 deletions
@@ -328,6 +328,8 @@ Combine formats: `format="json,markdown"`
 
 When a PDF has structure tags, OpenDataLoader extracts the **exact layout** the author intended — no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.
 
+> **Output quality depends on tag quality.** Not all tagged PDFs are well-tagged. For PDFs with sparse or incorrect tags, the default heuristic mode or `--hybrid docling-fast` often produces better results.
+
 ```python
 # Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ public class CLIOptions {
 
 // ===== Use Struct Tree =====
 private static final String USE_STRUCT_TREE_LONG_OPTION = "use-struct-tree";
-private static final String USE_STRUCT_TREE_DESC = "Use PDF structure tree (tagged PDF) for reading order and semantic structure";
+private static final String USE_STRUCT_TREE_DESC = "Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality";
 
 // ===== Table Method =====
 private static final String TABLE_METHOD_LONG_OPTION = "table-method";

node/opendataloader-pdf/src/cli-options.generated.ts

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ export function registerCliOptions(program: Command): void {
 program.option('--sanitize', 'Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders');
 program.option('--keep-line-breaks', 'Preserve original line breaks in extracted text');
 program.option('--replace-invalid-chars <value>', 'Replacement character for invalid/unrecognized characters. Default: space');
-program.option('--use-struct-tree', 'Use PDF structure tree (tagged PDF) for reading order and semantic structure');
+program.option('--use-struct-tree', 'Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality');
 program.option('--table-method <value>', 'Table detection method. Values: default (border-based), cluster (border + cluster). Default: default');
 program.option('--reading-order <value>', 'Reading order algorithm. Values: off, xycut. Default: xycut');
 program.option('--markdown-page-separator <value>', 'Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none');

node/opendataloader-pdf/src/convert-options.generated.ts

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ export interface ConvertOptions {
 keepLineBreaks?: boolean;
 /** Replacement character for invalid/unrecognized characters. Default: space */
 replaceInvalidChars?: string;
-/** Use PDF structure tree (tagged PDF) for reading order and semantic structure */
+/** Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality */
 useStructTree?: boolean;
 /** Table detection method. Values: default (border-based), cluster (border + cluster). Default: default */
 tableMethod?: string;

options.json

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@
 "type": "boolean",
 "required": false,
 "default": false,
-"description": "Use PDF structure tree (tagged PDF) for reading order and semantic structure"
+"description": "Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality"
 },
 {
 "name": "table-method",

python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@
 "type": "boolean",
 "required": False,
 "default": False,
-"description": "Use PDF structure tree (tagged PDF) for reading order and semantic structure",
+"description": "Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality",
 },
 {
 "name": "table-method",

python/opendataloader-pdf/src/opendataloader_pdf/convert_generated.py

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ def convert(
 sanitize: Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders
 keep_line_breaks: Preserve original line breaks in extracted text
 replace_invalid_chars: Replacement character for invalid/unrecognized characters. Default: space
-use_struct_tree: Use PDF structure tree (tagged PDF) for reading order and semantic structure
+use_struct_tree: Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality
 table_method: Table detection method. Values: default (border-based), cluster (border + cluster). Default: default
 reading_order: Reading order algorithm. Values: off, xycut. Default: xycut
 markdown_page_separator: Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none
