opendataloader-project
diff --git a/README.md b/README.md
Lines changed: 19 additions & 15 deletions
diff --git a/content/docs/accessibility-compliance.mdx b/content/docs/accessibility-compliance.mdx
Lines changed: 22 additions & 25 deletions
diff --git a/content/docs/accessibility-glossary.mdx b/content/docs/accessibility-glossary.mdx
Lines changed: 4 additions & 2 deletions
diff --git a/content/docs/ai-safety.mdx b/content/docs/ai-safety.mdx
Lines changed: 7 additions & 3 deletions
diff --git a/content/docs/cli-options-reference.mdx b/content/docs/cli-options-reference.mdx
Lines changed: 10 additions & 11 deletions
@@ -137,6 +137,7 @@ pip install -U opendataloader-pdf
 ```python
 import opendataloader_pdf
 
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
     input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
@@ -187,12 +188,14 @@ opendataloader-pdf-hybrid --port 5002
 **Terminal 2** — Process PDFs:
 
 ```bash
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
 opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
 ```
 
 **Python:**
 
 ```python
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
     input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
@@ -224,7 +227,7 @@ Extract mathematical formulas as LaTeX from scientific PDFs:
 # Server: enable formula enrichment
 opendataloader-pdf-hybrid --enrich-formula
 
-# Client: must use full mode for enrichments
+# Client: must use full mode for enrichments. Batch all files in one call; each invocation spawns a JVM process, so repeated calls are slow
 opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
 ```
 
@@ -248,7 +251,7 @@ Generate AI descriptions for charts and images — useful for RAG search and acc
 # Server
 opendataloader-pdf-hybrid --enrich-picture-description
 
-# Client (must use full mode)
+# Client: must use full mode for enrichments. Batch all files in one call; each invocation spawns a JVM process, so repeated calls are slow
 opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
 ```
 
@@ -317,6 +320,7 @@ Combine formats: `format="json,markdown"`
 When a PDF has structure tags, OpenDataLoader extracts the **exact layout** the author intended — no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.
 
 ```python
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
     input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
@@ -337,7 +341,8 @@ PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically f
 To sanitize sensitive data (emails, URLs, phone numbers → placeholders), enable it explicitly:
 
 ```bash
-opendataloader-pdf input.pdf --sanitize
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf file1.pdf file2.pdf folder/ --sanitize
 ```
 
 [AI Safety Guide](https://opendataloader.org/docs/ai-safety)
@@ -363,6 +368,7 @@ documents = loader.load()
 ### Advanced Options
 
 ```python
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
     input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
@@ -424,18 +430,14 @@ opendataloader_pdf.convert(
 Existing PDFs (untagged)
     │
     ▼
-┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
-│  1. Audit       │───>│  2. Remediate   │───>│  3. Export       │
-│  (check tags)   │    │  (auto-tag)     │    │  (PDF/UA)        │
-└─────────────────┘    └─────────────────┘    └─────────────────┘
-        │                      │                      │
-        ▼                      ▼                      ▼
-  use_struct_tree         auto_tag              PDF/UA export
-  (Available now)    (Q2 2026, Apache 2.0)   (Enterprise)
-                                                      │
-                                                      ▼
-                                            PDF/UA-1 or PDF/UA-2
-                                            compliant output
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│  1. Audit       │───>│  2. Auto-Tag    │───>│  3. Export      │───>│  4. Studio      │
+│  (check tags)   │    │  (→ Tagged PDF) │    │  (PDF/UA)       │    │ (visual editor) │
+└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
+        │                      │                      │                      │
+        ▼                      ▼                      ▼                      ▼
+  use_struct_tree         auto_tag              PDF/UA export       Accessibility Studio
+  (Available now)    (Q2 2026, Apache 2.0)    (Enterprise)          (Enterprise)
 ```
 
 [PDF Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance)
@@ -465,6 +467,7 @@ OpenDataLoader PDF is the only open-source parser that combines: rule-based dete
 OpenDataLoader detects tables using border analysis and text clustering, preserving row/column structure. For complex tables, enable hybrid mode for +90% accuracy improvement (0.49 to 0.93 TEDS score):
 
 ```python
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
     input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
@@ -518,6 +521,7 @@ Every element in JSON output includes a `bounding box` (`[left, bottom, right, t
 ```python
 import opendataloader_pdf
 
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
     input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
 
@@ -46,11 +46,11 @@ Use existing PDF structure tags to understand document organization:
 ```python
 import opendataloader_pdf
 
-# Extract using native structure tags
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
-    input_path="document.pdf",
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
-    use_struct_tree=True  # Use native PDF structure tags
+    use_struct_tree=True                # Use native PDF structure tags
 )
 ```
 
@@ -61,45 +61,42 @@ This preserves the author's intended reading order and semantic structure.
 If the PDF lacks structure tags, OpenDataLoader falls back to visual heuristics (XY-Cut++ algorithm).
 
 ```bash
-opendataloader-pdf document.pdf --output-dir output/ --use-struct-tree
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf file1.pdf file2.pdf folder/ --output-dir output/ --use-struct-tree
 ```
 
-### 3. Future: Auto-Tagging Engine (Q1 2026)
+### 3. Future: Auto-Tagging Engine (Q2 2026)
 
 Generate accessible Tagged PDFs automatically from untagged documents:
 
 ```python
-# Coming Q1 2026
+# API shape preview — available Q2 2026
 opendataloader_pdf.convert(
-    input_path="legacy-document.pdf",
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
-    auto_tag=True  # Generate structure tags
+    auto_tag=True                       # Generate structure tags
 )
 ```
 
-### 4. Future: PDF/UA Validation (Q2 2026)
+### 4. Export PDF/UA (Enterprise)
 
-Validate documents against PDF/UA standards:
+Convert a Tagged PDF to PDF/UA-1 or PDF/UA-2 compliant output. Available now as an enterprise add-on.
 
-```python
-# Coming Q2 2026
-result = opendataloader_pdf.validate(
-    input_path="document.pdf",
-    standard="pdf-ua-2"
-)
-```
+### 5. Accessibility Studio (Enterprise)
+
+Visual editor to review, adjust, and approve tags before export. Available now as an enterprise add-on.
 
 ## Compliance Workflow
 
 ```
-┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
-│  Audit PDFs     │───▶│  Remediate      │───▶│  Validate       │
-│  (check tags)   │    │  (auto-tag)     │    │  (PDF/UA)       │
-└─────────────────┘    └─────────────────┘    └─────────────────┘
-        │                      │                      │
-        ▼                      ▼                      ▼
-  use_struct_tree         auto_tag              validate()
-  (Available now)      (Q1 2026)              (Q2 2026)
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│  1. Audit       │───▶│  2. Auto-Tag    │───▶│  3. Export      │───▶│  4. Studio      │
+│  (check tags)   │    │  (→ Tagged PDF) │    │  (PDF/UA)       │    │ (visual editor) │
+└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
+        │                      │                      │                      │
+        ▼                      ▼                      ▼                      ▼
+  use_struct_tree         auto_tag              PDF/UA export       Accessibility Studio
+  (Available now)    (Q2 2026, Apache 2.0)    (Enterprise)          (Enterprise)
 ```
 
 ## Best Practices
 
@@ -207,9 +207,11 @@ A PDF that contains a structure tree with tags identifying the semantic role of
 
 **In OpenDataLoader:**
 ```python
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
-    input_path="document.pdf",
-    use_struct_tree=True  # Use structure tags
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
+    output_dir="output/",
+    use_struct_tree=True                # Use structure tags
 )
 ```
 
 
@@ -73,7 +73,8 @@ These filters remove content that is invisible to humans but readable by machine
 To disable a specific filter for trusted documents:
 
 ```bash
-opendataloader-pdf input.pdf --content-safety-off hidden-text
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf file1.pdf file2.pdf folder/ --content-safety-off hidden-text
 ```
 
 `--content-safety-off all` disables all four rendering-mismatch filters. It does not affect `--sanitize`.
@@ -83,12 +84,15 @@ opendataloader-pdf input.pdf --content-safety-off hidden-text
 The `--sanitize` flag replaces personally identifiable information with placeholders. This is **disabled by default** because it modifies visible, legitimate content.
 
 ```bash
-opendataloader-pdf input.pdf --sanitize
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf file1.pdf file2.pdf folder/ --sanitize
 ```
 
 ```python
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
-    input_path="input.pdf",
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
+    output_dir="output/",
     sanitize=True,
 )
 ```
 
@@ -43,35 +43,34 @@ This page documents all available CLI options for opendataloader-pdf.
 ### Basic conversion
 
 ```bash
-opendataloader-pdf document.pdf -o ./output -f json,markdown
-```
-
-### Convert entire folder
-
-```bash
-opendataloader-pdf ./pdf-folder -o ./output -f json
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf file1.pdf file2.pdf folder/ -o ./output -f json,markdown
 ```
 
 ### Save images as external files
 
 ```bash
-opendataloader-pdf document.pdf -f markdown --image-output external
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf file1.pdf file2.pdf folder/ -f markdown --image-output external
 ```
 
 ### Disable reading order sorting
 
 ```bash
-opendataloader-pdf document.pdf -f json --reading-order off
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf file1.pdf file2.pdf folder/ -f json --reading-order off
 ```
 
 ### Add page separators in output
 
 ```bash
-opendataloader-pdf document.pdf -f markdown --markdown-page-separator "--- Page %page-number% ---"
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf file1.pdf file2.pdf folder/ -f markdown --markdown-page-separator "--- Page %page-number% ---"
 ```
 
 ### Encrypted PDF
 
 ```bash
-opendataloader-pdf encrypted.pdf -p mypassword -o ./output
+# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
+opendataloader-pdf encrypted1.pdf encrypted2.pdf -p mypassword -o ./output
 ```
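
Reviewer note: the comment added throughout this change says that every `convert()` call (or CLI invocation) starts its own JVM process, so the cost of N separate calls grows with N. The sketch below shows one way a caller might guarantee a single call. It is illustrative only: `collect_inputs` and `convert_batched` are hypothetical helper names that are not part of `opendataloader_pdf`, and the only assumption about the library is the one the examples in this diff already make, namely that `convert()` accepts a list for `input_path` together with `output_dir`.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def collect_inputs(paths: Iterable[str]) -> List[str]:
    """Drop duplicate paths while keeping first-seen order.

    Hypothetical helper: "folder/" and "folder" are treated as the same entry.
    """
    seen = set()
    ordered: List[str] = []
    for p in paths:
        key = str(Path(p))  # normalizes trailing separators
        if key not in seen:
            seen.add(key)
            ordered.append(p)
    return ordered


def convert_batched(
    convert: Callable[..., object],
    paths: Iterable[str],
    output_dir: str,
    **options,
) -> int:
    """Call `convert` exactly once for all inputs; return how many were passed.

    `convert` is injected so the sketch runs without the library installed.
    With the real package, pass `opendataloader_pdf.convert`.
    """
    batch = collect_inputs(paths)
    if not batch:
        return 0  # nothing to do, so no process is started
    convert(input_path=batch, output_dir=output_dir, **options)
    return len(batch)
```

With the package installed, usage would look like `convert_batched(opendataloader_pdf.convert, ["file1.pdf", "file2.pdf", "folder/"], "output/", format="json,markdown")`. Injecting the callable keeps the helper testable with a stub and makes the "one call per batch" property easy to assert.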