- All Python/CLI code snippets now use multi-file input arrays and
include the "Batch all files in one call" comment for AI agent context
- Fix "100+ pages/sec" → "20+ pages/sec" in faq.mdx and index.mdx
- Update compliance workflow from 3 steps to 4 steps (Audit → Auto-Tag
→ Export PDF/UA → Accessibility Studio) matching the actual pipeline
- Fix auto-tagging timeline: Q1 2026 → Q2 2026 across all docs
- Update upcoming-roadmap.mdx: move Equation & Figure AI to shipped,
add v2.0.0 features, add release dates column
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
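
The batching comment added throughout this commit can be illustrated with a small wrapper. This is only a sketch: `collect_inputs` and `convert_batch` are hypothetical helper names, not part of OpenDataLoader, and the `input_path` / `output_dir` keyword names are taken from the snippets in this diff. The converter is passed in as a callable so the batching logic can be exercised without the package or a JVM installed.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def collect_inputs(paths: Iterable[str]) -> List[str]:
    """Return the inputs as an order-preserving list without duplicates.

    Paths are compared after normalisation, so "folder" and "folder/"
    count as the same entry; the first spelling seen is the one kept.
    """
    seen = set()
    batch: List[str] = []
    for p in paths:
        key = str(Path(p))
        if key not in seen:
            seen.add(key)
            batch.append(p)
    return batch


def convert_batch(
    convert: Callable[..., object],
    paths: Iterable[str],
    output_dir: str,
) -> List[str]:
    """Call `convert` once for the whole batch instead of once per file.

    `convert` is assumed to accept `input_path` (a list of files and
    folders) and `output_dir`, as in the snippets shown in this diff.
    """
    batch = collect_inputs(paths)
    if batch:  # skip the call entirely when there is nothing to convert
        convert(input_path=batch, output_dir=output_dir)
    return batch
```

Assuming the package is installed, usage might look like `convert_batch(opendataloader_pdf.convert, ["file1.pdf", "file2.pdf", "folder/"], "output/")`, which results in a single `convert()` call and therefore a single JVM start-up.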
When a PDF has structure tags, OpenDataLoader extracts the **exact layout** the author intended — no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.
```python
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
@@ -337,7 +341,8 @@ PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically f
To sanitize sensitive data (emails, URLs, phone numbers → placeholders), enable it explicitly:
```bash
-opendataloader-pdf input.pdf --sanitize
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
@@ -465,6 +467,7 @@ OpenDataLoader PDF is the only open-source parser that combines: rule-based dete
OpenDataLoader detects tables using border analysis and text clustering, preserving row/column structure. For complex tables, enable hybrid mode for +90% accuracy improvement (0.49 to 0.93 TEDS score):
```python
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
@@ -518,6 +521,7 @@ Every element in JSON output includes a `bounding box` (`[left, bottom, right, t
```python
import opendataloader_pdf

+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
The `--sanitize` flag replaces personally identifiable information with placeholders. This is **disabled by default** because it modifies visible, legitimate content.
```bash
-opendataloader-pdf input.pdf --sanitize
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow