Commit d770f00

bundolee and claude committed
docs: unify code snippets to batch style, fix speed claims, update roadmap
- All Python/CLI code snippets now use multi-file input arrays and include the "Batch all files in one call" comment for AI agent context
- Fix "100+ pages/sec" → "20+ pages/sec" in faq.mdx and index.mdx
- Update compliance workflow from 3 steps to 4 steps (Audit → Auto-Tag → Export PDF/UA → Accessibility Studio) matching the actual pipeline
- Fix auto-tagging timeline: Q1 2026 → Q2 2026 across all docs
- Update upcoming-roadmap.mdx: move Equation & Figure AI to shipped, add v2.0.0 features, add release dates column

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 85365e7 commit d770f00

16 files changed

Lines changed: 187 additions & 156 deletions
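
The batching rationale in the commit message (each `convert()` call spawns a JVM process) can be sketched as a small wrapper that gathers every input first and then makes a single call. This is a minimal sketch, not part of the commit: `batch_convert` and its `convert` parameter are hypothetical names introduced here, and only the `input_path` and `output_dir` keyword arguments are taken from the snippets in this diff.

```python
from typing import Callable, Iterable, List, Optional


def batch_convert(
    paths: Iterable[str],
    output_dir: str,
    convert: Optional[Callable[..., object]] = None,
    **options: object,
) -> List[str]:
    """Send all inputs to the converter in one call (hypothetical helper)."""
    # Drop duplicate paths while keeping the caller's order.
    inputs = list(dict.fromkeys(paths))
    if not inputs:
        # Nothing to convert, so avoid starting a JVM process at all.
        return inputs
    if convert is None:
        # Deferred import so the helper can be exercised without the package.
        import opendataloader_pdf

        convert = opendataloader_pdf.convert
    # One call for the whole batch instead of one call per file.
    convert(input_path=inputs, output_dir=output_dir, **options)
    return inputs
```

Calling `batch_convert(["file1.pdf", "file2.pdf", "folder/"], "output/")` would then issue one `convert()` call for all three inputs, which is the pattern every updated snippet in this commit follows.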

README.md

Lines changed: 19 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -137,6 +137,7 @@ pip install -U opendataloader-pdf
137137
```python
138138
import opendataloader_pdf
139139

140+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
140141
opendataloader_pdf.convert(
141142
input_path=["file1.pdf", "file2.pdf", "folder/"],
142143
output_dir="output/",
@@ -187,12 +188,14 @@ opendataloader-pdf-hybrid --port 5002
187188
**Terminal 2** — Process PDFs:
188189

189190
```bash
191+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
190192
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
191193
```
192194

193195
**Python:**
194196

195197
```python
198+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
196199
opendataloader_pdf.convert(
197200
input_path=["file1.pdf", "file2.pdf", "folder/"],
198201
output_dir="output/",
@@ -224,7 +227,7 @@ Extract mathematical formulas as LaTeX from scientific PDFs:
224227
# Server: enable formula enrichment
225228
opendataloader-pdf-hybrid --enrich-formula
226229

227-
# Client: must use full mode for enrichments
230+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
228231
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
229232
```
230233

@@ -248,7 +251,7 @@ Generate AI descriptions for charts and images — useful for RAG search and acc
248251
# Server
249252
opendataloader-pdf-hybrid --enrich-picture-description
250253

251-
# Client (must use full mode)
254+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
252255
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
253256
```
254257

@@ -317,6 +320,7 @@ Combine formats: `format="json,markdown"`
317320
When a PDF has structure tags, OpenDataLoader extracts the **exact layout** the author intended — no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.
318321

319322
```python
323+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
320324
opendataloader_pdf.convert(
321325
input_path=["file1.pdf", "file2.pdf", "folder/"],
322326
output_dir="output/",
@@ -337,7 +341,8 @@ PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically f
337341
To sanitize sensitive data (emails, URLs, phone numbers → placeholders), enable it explicitly:
338342

339343
```bash
340-
opendataloader-pdf input.pdf --sanitize
344+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
345+
opendataloader-pdf file1.pdf file2.pdf folder/ --sanitize
341346
```
342347

343348
[AI Safety Guide](https://opendataloader.org/docs/ai-safety)
@@ -363,6 +368,7 @@ documents = loader.load()
363368
### Advanced Options
364369

365370
```python
371+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
366372
opendataloader_pdf.convert(
367373
input_path=["file1.pdf", "file2.pdf", "folder/"],
368374
output_dir="output/",
@@ -424,18 +430,14 @@ opendataloader_pdf.convert(
424430
Existing PDFs (untagged)
425431
426432
427-
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
428-
│ 1. Audit │───>│ 2. Remediate │───>│ 3. Export │
429-
│ (check tags) │ │ (auto-tag) │ │ (PDF/UA) │
430-
└─────────────────┘ └─────────────────┘ └─────────────────┘
431-
│ │ │
432-
▼ ▼ ▼
433-
use_struct_tree auto_tag PDF/UA export
434-
(Available now) (Q2 2026, Apache 2.0) (Enterprise)
435-
436-
437-
PDF/UA-1 or PDF/UA-2
438-
compliant output
433+
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
434+
│ 1. Audit │───>│ 2. Auto-Tag │───>│ 3. Export │───>│ 4. Studio │
435+
│ (check tags) │ │ (→ Tagged PDF) │ │ (PDF/UA) │ │ (visual editor) │
436+
└─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘
437+
│ │ │ │
438+
▼ ▼ ▼ ▼
439+
use_struct_tree auto_tag PDF/UA export Accessibility Studio
440+
(Available now) (Q2 2026, Apache 2.0) (Enterprise) (Enterprise)
439441
```
440442

441443
[PDF Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance)
@@ -465,6 +467,7 @@ OpenDataLoader PDF is the only open-source parser that combines: rule-based dete
465467
OpenDataLoader detects tables using border analysis and text clustering, preserving row/column structure. For complex tables, enable hybrid mode, which raises the TEDS score from 0.49 to 0.93 (roughly a 90% relative improvement):
466468

467469
```python
470+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
468471
opendataloader_pdf.convert(
469472
input_path=["file1.pdf", "file2.pdf", "folder/"],
470473
output_dir="output/",
@@ -518,6 +521,7 @@ Every element in JSON output includes a `bounding box` (`[left, bottom, right, t
518521
```python
519522
import opendataloader_pdf
520523

524+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
521525
opendataloader_pdf.convert(
522526
input_path=["file1.pdf", "file2.pdf", "folder/"],
523527
output_dir="output/",

content/docs/accessibility-compliance.mdx

Lines changed: 22 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -46,11 +46,11 @@ Use existing PDF structure tags to understand document organization:
4646
```python
4747
import opendataloader_pdf
4848

49-
# Extract using native structure tags
49+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
5050
opendataloader_pdf.convert(
51-
input_path="document.pdf",
51+
input_path=["file1.pdf", "file2.pdf", "folder/"],
5252
output_dir="output/",
53-
use_struct_tree=True # Use native PDF structure tags
53+
use_struct_tree=True # Use native PDF structure tags
5454
)
5555
```
5656

@@ -61,45 +61,42 @@ This preserves the author's intended reading order and semantic structure.
6161
If the PDF lacks structure tags, OpenDataLoader falls back to visual heuristics (XY-Cut++ algorithm).
6262

6363
```bash
64-
opendataloader-pdf document.pdf --output-dir output/ --use-struct-tree
64+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
65+
opendataloader-pdf file1.pdf file2.pdf folder/ --output-dir output/ --use-struct-tree
6566
```
6667

67-
### 3. Future: Auto-Tagging Engine (Q1 2026)
68+
### 3. Future: Auto-Tagging Engine (Q2 2026)
6869

6970
Generate accessible Tagged PDFs automatically from untagged documents:
7071

7172
```python
72-
# Coming Q1 2026
73+
# API shape preview — available Q2 2026
7374
opendataloader_pdf.convert(
74-
input_path="legacy-document.pdf",
75+
input_path=["file1.pdf", "file2.pdf", "folder/"],
7576
output_dir="output/",
76-
auto_tag=True # Generate structure tags
77+
auto_tag=True # Generate structure tags
7778
)
7879
```
7980

80-
### 4. Future: PDF/UA Validation (Q2 2026)
81+
### 4. Export PDF/UA (Enterprise)
8182

82-
Validate documents against PDF/UA standards:
83+
Convert Tagged PDF to PDF/UA-1 or PDF/UA-2 compliant output. Available now as an enterprise add-on.
8384

84-
```python
85-
# Coming Q2 2026
86-
result = opendataloader_pdf.validate(
87-
input_path="document.pdf",
88-
standard="pdf-ua-2"
89-
)
90-
```
85+
### 5. Accessibility Studio (Enterprise)
86+
87+
Visual editor to review, adjust, and approve tags before export. Available now as an enterprise add-on.
9188

9289
## Compliance Workflow
9390

9491
```
95-
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
96-
│ Audit PDFs │───▶│ Remediate │───▶│ Validate
97-
│ (check tags) │ │ (auto-tag) │ │ (PDF/UA) │
98-
└─────────────────┘ └─────────────────┘ └─────────────────┘
99-
│ │ │
100-
▼ ▼ ▼
101-
use_struct_tree auto_tag validate()
102-
(Available now) (Q1 2026) (Q2 2026)
92+
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
93+
│ 1. Audit │───▶│ 2. Auto-Tag │───▶│ 3. Export │───▶│ 4. Studio │
94+
│ (check tags) │ │ (→ Tagged PDF) │ │ (PDF/UA) │ │ (visual editor) │
95+
└─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘
96+
│ │ │ │
97+
▼ ▼ ▼ ▼
98+
use_struct_tree auto_tag PDF/UA export Accessibility Studio
99+
(Available now) (Q2 2026, Apache 2.0) (Enterprise) (Enterprise)
103100
```
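
The four-step workflow in the diagram above can also be written down as data, which makes it easy to check which steps are usable today. This is a sketch for illustration only: the `Stage` class and `stages_with` function are hypothetical names, and the step names, entry points, and availability labels are copied from the diagram.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    step: int
    name: str
    entry_point: str
    availability: str


# Values transcribed from the compliance workflow diagram.
WORKFLOW = [
    Stage(1, "Audit", "use_struct_tree", "Available now"),
    Stage(2, "Auto-Tag", "auto_tag", "Q2 2026, Apache 2.0"),
    Stage(3, "Export", "PDF/UA export", "Enterprise"),
    Stage(4, "Studio", "Accessibility Studio", "Enterprise"),
]


def stages_with(availability: str) -> List[str]:
    """Return the names of stages carrying the given availability label."""
    return [s.name for s in WORKFLOW if s.availability == availability]
```

For example, `stages_with("Enterprise")` lists the two steps that the updated docs describe as enterprise add-ons.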
104101

105102
## Best Practices

content/docs/accessibility-glossary.mdx

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -207,9 +207,11 @@ A PDF that contains a structure tree with tags identifying the semantic role of
207207

208208
**In OpenDataLoader:**
209209
```python
210+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
210211
opendataloader_pdf.convert(
211-
input_path="document.pdf",
212-
use_struct_tree=True # Use structure tags
212+
input_path=["file1.pdf", "file2.pdf", "folder/"],
213+
output_dir="output/",
214+
use_struct_tree=True # Use structure tags
213215
)
214216
```
215217

content/docs/ai-safety.mdx

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -73,7 +73,8 @@ These filters remove content that is invisible to humans but readable by machine
7373
To disable a specific filter for trusted documents:
7474

7575
```bash
76-
opendataloader-pdf input.pdf --content-safety-off hidden-text
76+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
77+
opendataloader-pdf file1.pdf file2.pdf folder/ --content-safety-off hidden-text
7778
```
7879

7980
`--content-safety-off all` disables all four rendering-mismatch filters. It does not affect `--sanitize`.
@@ -83,12 +84,15 @@ opendataloader-pdf input.pdf --content-safety-off hidden-text
8384
The `--sanitize` flag replaces personally identifiable information with placeholders. This is **disabled by default** because it modifies visible, legitimate content.
8485

8586
```bash
86-
opendataloader-pdf input.pdf --sanitize
87+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
88+
opendataloader-pdf file1.pdf file2.pdf folder/ --sanitize
8789
```
8890

8991
```python
92+
# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
9093
opendataloader_pdf.convert(
91-
input_path="input.pdf",
94+
input_path=["file1.pdf", "file2.pdf", "folder/"],
95+
output_dir="output/",
9296
sanitize=True,
9397
)
9498
```
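
To make the idea of replacing personally identifiable information with placeholders concrete, here is a rough illustration of the general technique. It is not OpenDataLoader's implementation of `--sanitize`: the regular expressions, the placeholder strings (`[URL]`, `[EMAIL]`, `[PHONE]`), and the `sanitize_text` function are all assumptions made up for this sketch.

```python
import re

# Hypothetical patterns and placeholder names, chosen for illustration only.
# URLs are replaced first so that addresses embedded in links are not
# matched a second time by the email pattern.
PATTERNS = [
    ("[URL]", re.compile(r"https?://\S+")),
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[PHONE]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def sanitize_text(text: str) -> str:
    """Replace URLs, emails, and phone-like numbers with placeholders."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Because a pass like this rewrites visible, legitimate content, it fits the docs' statement that sanitization is disabled by default and must be enabled explicitly.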

content/docs/cli-options-reference.mdx

Lines changed: 10 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -43,35 +43,34 @@ This page documents all available CLI options for opendataloader-pdf.
4343
### Basic conversion
4444

4545
```bash
46-
opendataloader-pdf document.pdf -o ./output -f json,markdown
47-
```
48-
49-
### Convert entire folder
50-
51-
```bash
52-
opendataloader-pdf ./pdf-folder -o ./output -f json
46+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
47+
opendataloader-pdf file1.pdf file2.pdf folder/ -o ./output -f json,markdown
5348
```
5449

5550
### Save images as external files
5651

5752
```bash
58-
opendataloader-pdf document.pdf -f markdown --image-output external
53+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
54+
opendataloader-pdf file1.pdf file2.pdf folder/ -f markdown --image-output external
5955
```
6056

6157
### Disable reading order sorting
6258

6359
```bash
64-
opendataloader-pdf document.pdf -f json --reading-order off
60+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
61+
opendataloader-pdf file1.pdf file2.pdf folder/ -f json --reading-order off
6562
```
6663

6764
### Add page separators in output
6865

6966
```bash
70-
opendataloader-pdf document.pdf -f markdown --markdown-page-separator "--- Page %page-number% ---"
67+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
68+
opendataloader-pdf file1.pdf file2.pdf folder/ -f markdown --markdown-page-separator "--- Page %page-number% ---"
7169
```
7270

7371
### Encrypted PDF
7472

7573
```bash
76-
opendataloader-pdf encrypted.pdf -p mypassword -o ./output
74+
# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
75+
opendataloader-pdf encrypted1.pdf encrypted2.pdf -p mypassword -o ./output
7776
```
