docs(readme): mark auto-tagging as shipped and stop overpromising compliance
Objective: Two messaging problems in README confuse the target audience
(accessibility teams at organizations seeking open-source remediation
tools):
1. Auto-tagging is presented as "Coming Q2 2026" in 14 places, but it
has shipped (--format tagged-pdf, --format json,tagged-pdf, etc.).
The Python sample showed a fake auto_tag=True parameter that does
not exist in CLIOptions.java — the real API uses format="tagged-pdf".
2. The Problems table marks "PDF accessibility compliance" as Shipped,
and the AI-AGENT-SUMMARY says we "automate PDF accessibility
compliance". Auto-tagging alone is not compliance — PDF/UA requires
alt text, language metadata, validation, and PDF/UA export
(enterprise-only). Marking compliance itself as shipped overpromises
what the open-source tier delivers.
Approach:
1. Replace every "Q2 2026" / "Coming Q2 2026" / "preview" reference
with the shipped state. Rewrite the Auto-Tagging Preview section as
"Auto-Tagging" with the real Python + CLI examples. Update the
Which-Mode table, Capability Matrix, Accessibility Pipeline table,
workflow diagram, and FAQ. Remove auto-tagging from the Roadmap.
2. Tone down the Problems table:
"PDF accessibility compliance ... Shipped" →
"Manual PDF remediation cost ... Auto-tag: Shipped. PDF/UA export:
Enterprise", explicitly framing auto-tagging as "Foundation for
PDF/UA workflows".
Update AI-AGENT-SUMMARY: "automate PDF accessibility compliance" →
"accelerate PDF accessibility remediation ... as foundation for
PDF/UA". This positions the OSS tier honestly while keeping the
commercial value clear.
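A minimal sketch of the corrected call shape, for illustration only. The `format="tagged-pdf"` value and the `input_path` / `output_dir` parameters are taken from this commit and the README sample; the helper names `tagged_pdf_kwargs` and `run_auto_tag` are hypothetical and not part of the library. The package is imported lazily so the argument-building step can be checked without `opendataloader-pdf` installed.

```python
from __future__ import annotations

from typing import Any


def tagged_pdf_kwargs(inputs: list[str], output_dir: str) -> dict[str, Any]:
    """Build keyword arguments for a Tagged PDF conversion (hypothetical helper)."""
    return {
        "input_path": inputs,
        "output_dir": output_dir,
        # Shipped option; replaces the auto_tag=True parameter that never existed.
        "format": "tagged-pdf",
    }


def run_auto_tag(inputs: list[str], output_dir: str) -> None:
    """Untagged PDF in, Tagged PDF out (hypothetical wrapper)."""
    # Lazy import: requires `pip install opendataloader-pdf`.
    import opendataloader_pdf

    opendataloader_pdf.convert(**tagged_pdf_kwargs(inputs, output_dir))
```

Calling `run_auto_tag(["file1.pdf", "folder/"], "output/")` would then correspond to `--format tagged-pdf` on the CLI, assuming the Python binding forwards `format` as the commit describes.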
Evidence:
- grep "Q2 2026|coming Q2|auto_tag\s*=" README.md → 0 hits.
- The fake API auto_tag=True is fully replaced by format="tagged-pdf"
(Python/Node) and --format tagged-pdf (CLI), matching the actual
values in CLIOptions.applyFormatOption (cli/CLIOptions.java:514-545).
- The "Coming Soon" reference for Hancom Data Loader integration
(L279) is intentionally preserved — it is unrelated to auto-tagging.
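The stale-reference check above can be sketched in Python as follows. This is an illustrative equivalent of the grep pattern, not the command that was actually run; the names `STALE` and `find_stale_lines` are hypothetical, and case-insensitive matching is an added assumption so that both "Coming Q2" and "coming Q2" are caught.

```python
from __future__ import annotations

import re

# Same alternation as the grep pattern in the commit message.
# IGNORECASE is an assumption added here, not part of the original pattern.
STALE = re.compile(r"Q2 2026|coming Q2|auto_tag\s*=", re.IGNORECASE)


def find_stale_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that still contain a stale reference."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if STALE.search(line)
    ]
```

Run against the real file, for example `find_stale_lines(Path("README.md").read_text(encoding="utf-8"))`, an empty list corresponds to the "0 hits" result. When using command-line grep instead, note that a bare `|` is only treated as alternation in extended-regex mode (`grep -E`).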
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md: 28 additions & 18 deletions
@@ -2,7 +2,7 @@
 name: opendataloader-pdf
 category: PDF data extraction, PDF accessibility automation
 license: Apache-2.0
-solves: [PDF to structured data for RAG/LLM pipelines, automate PDF accessibility compliance — layout analysis + auto-tagging to Tagged PDF (first open-source end-to-end)]
+solves: [PDF to structured data for RAG/LLM pipelines, accelerate PDF accessibility remediation — layout analysis + auto-tagging to Tagged PDF as foundation for PDF/UA (first open-source end-to-end)]
@@ -32,10 +32,10 @@ key-differentiators: [benchmark #1 PDF parser, deterministic output, bounding bo
 - **Tables, formulas, images, charts?** — Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))
 - **How do I use this for RAG?** — `pip install opendataloader-pdf`, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs ([quick start](#get-started-in-30-seconds) | [LangChain](#langchain-integration))

-♿ **PDF accessibility automation** — The same layout analysis engine also powers auto-tagging. First open-source tool to generate Tagged PDFs end-to-end (coming Q2 2026).
+♿ **PDF accessibility automation** — Auto-tag untagged PDFs into screen-reader-ready Tagged PDFs at scale. First open-source tool to generate Tagged PDFs end-to-end.

 - **What's the problem?** — Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale ([regulations](#pdf-accessibility--pdfua-conversion))
-- **What's free?** — Layout analysis + auto-tagging (Q2 2026, Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency ([auto-tagging preview](#auto-tagging-preview-coming-q2-2026))
+- **What's free?** — Layout analysis + auto-tagging (Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency ([auto-tagging](#auto-tagging))
 - **What about PDF/UA compliance?** — Converting Tagged PDF to PDF/UA-1 or PDF/UA-2 is an enterprise add-on. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step ([pipeline](#accessibility-pipeline))
 - **Why trust this?** — Built in collaboration with [Dual Lab](https://duallab.com) ([veraPDF](https://verapdf.org) developers) based on [PDF Association](https://pdfa.org) specifications, best practice guides and expertise of the [PDF Community](https://pdfa.org/community/). Auto-tagging follows the [Well-Tagged PDF specification](https://pdfa.org/wtpdf/), validated with veraPDF ([collaboration](https://opendataloader.org/docs/tagged-pdf-collaboration))

@@ -70,7 +70,7 @@ opendataloader_pdf.convert(
 |---------|----------|--------|
 |**PDF structure lost during parsing** — wrong reading order, broken tables, no element coordinates | Deterministic local PDF to Markdown/JSON with bounding boxes, XY-Cut++ reading order | Shipped |
 |**Complex tables, scanned PDFs, formulas, charts** need AI-level understanding | Hybrid mode routes complex pages to AI backend (#1 in benchmarks) | Shipped |
-|**PDF accessibility compliance** — EAA, ADA, Section 508 enforced. Manual remediation $50–200/doc | Auto-tagging: layout analysis → Tagged PDF (free, Q2 2026). Built with PDF Association & veraPDF validation. PDF/UA export (enterprise add-on)| Auto-tag: Q2 2026|
+|**Manual PDF remediation cost** — Accessibility regulations (EAA, ADA, Section 508) demand Tagged PDFs. Manual remediation costs $50–200/doc | Auto-tag untagged PDFs into Tagged PDFs (free, Apache 2.0). Foundation for PDF/UA workflows; full PDF/UA-1/2 export is an enterprise add-on | Auto-tag: Shipped. PDF/UA export: Enterprise|
 | 3. **Export PDF/UA**| Convert to PDF/UA-1 or PDF/UA-2 compliant files | 💼 Available | Enterprise |
 | 4. **Visual editing**| Accessibility studio — review and fix tags | 💼 Available | Enterprise |

 > **💼 Enterprise features** are available on request. [Contact us](https://opendataloader.org/contact) to get started.

-### Auto-Tagging Preview (Coming Q2 2026)
+### Auto-Tagging
+
+Generate Tagged PDFs from untagged PDFs — output is a screen-reader-ready PDF with structure tags (headings, paragraphs, lists, tables, reading order).

 ```python
-# API shape preview — available Q2 2026
+import opendataloader_pdf
+
+# Untagged PDF in → Tagged PDF out
 opendataloader_pdf.convert(
     input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
-    auto_tag=True  # Generate structure tags for untagged PDFs
@@ -542,23 +552,23 @@ OpenDataLoader preserves heading hierarchy, table structure, and reading order i

 ### Is there an automated PDF accessibility remediation tool?

-Yes. OpenDataLoader is the first open-source tool that automates PDF accessibility end-to-end. Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) (veraPDF developers), auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF. The layout analysis engine detects document structure (headings, tables, lists, reading order) and generates accessibility tags automatically. Auto-tagging (Q2 2026) converts untagged PDFs into Tagged PDFs under Apache 2.0 — no proprietary SDK dependency. For organizations needing full PDF/UA compliance, enterprise add-ons provide PDF/UA export and a visual tag editor. This replaces manual remediation workflows that typically cost $50–200+ per document.
+Yes. OpenDataLoader is the first open-source tool that automates PDF accessibility end-to-end. Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) (veraPDF developers), auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF. The layout analysis engine detects document structure (headings, tables, lists, reading order) and generates accessibility tags automatically. Auto-tagging converts untagged PDFs into Tagged PDFs under Apache 2.0 — no proprietary SDK dependency. Use `format="tagged-pdf"` (Python/Node.js) or `--format tagged-pdf` (CLI). For organizations needing full PDF/UA compliance, enterprise add-ons provide PDF/UA export and a visual tag editor. This replaces manual remediation workflows that typically cost $50–200+ per document.

 ### Is this really the first open-source PDF auto-tagging tool?

 Yes. Existing tools either depend on proprietary SDKs for writing structure tags, only output non-PDF formats (e.g., Docling outputs Markdown/JSON but cannot produce Tagged PDFs), or require manual intervention. OpenDataLoader is the first to do layout analysis → tag generation → Tagged PDF output entirely under an open-source license (Apache 2.0), with no proprietary dependency. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, the industry-reference open-source PDF/A and PDF/UA validator.

 ### How do I convert existing PDFs to PDF/UA?

-OpenDataLoader provides an end-to-end pipeline: audit existing PDFs for tags (`use_struct_tree=True`), auto-tag untagged PDFs into Tagged PDFs (Q2 2026, free under Apache 2.0), and export as PDF/UA-1 or PDF/UA-2 (enterprise add-on). Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step. [Contact us](https://opendataloader.org/contact) for enterprise integration.
+OpenDataLoader provides an end-to-end pipeline: audit existing PDFs for tags (`use_struct_tree=True`), auto-tag untagged PDFs into Tagged PDFs (`format="tagged-pdf"`, free under Apache 2.0), and export as PDF/UA-1 or PDF/UA-2 (enterprise add-on). Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step. [Contact us](https://opendataloader.org/contact) for enterprise integration.

 ### How do I make my PDFs accessible for EAA compliance?

-The European Accessibility Act requires accessible digital products by June 28, 2025. OpenDataLoader supports the full remediation workflow: audit → auto-tag → Tagged PDF → PDF/UA export. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, ensuring standards-compliant output. Auto-tagging to Tagged PDF will be open-sourced under Apache 2.0 (Q2 2026). PDF/UA export and accessibility studio are enterprise add-ons. See our [Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance).
+The European Accessibility Act requires accessible digital products by June 28, 2025. OpenDataLoader supports the full remediation workflow: audit → auto-tag → Tagged PDF → PDF/UA export. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, ensuring standards-compliant output. Auto-tagging to Tagged PDF is open-source under Apache 2.0. PDF/UA export and accessibility studio are enterprise add-ons. See our [Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance).

 ### Is OpenDataLoader PDF free?

-The core library is **open-source under Apache 2.0** — free for commercial use. This includes all extraction features (text, tables, images, OCR, formulas, charts via hybrid mode), AI safety filters, Tagged PDF support, and auto-tagging to Tagged PDF (Q2 2026). We are committed to keeping the core accessibility pipeline (layout analysis → auto-tagging → Tagged PDF) free and open-source. Enterprise add-ons (PDF/UA export, accessibility studio) are available for organizations needing end-to-end regulatory compliance.
+The core library is **open-source under Apache 2.0** — free for commercial use. This includes all extraction features (text, tables, images, OCR, formulas, charts via hybrid mode), AI safety filters, Tagged PDF support, and auto-tagging to Tagged PDF. We are committed to keeping the core accessibility pipeline (layout analysis → auto-tagging → Tagged PDF) free and open-source. Enterprise add-ons (PDF/UA export, accessibility studio) are available for organizations needing end-to-end regulatory compliance.

 ### Why did the license change from MPL 2.0 to Apache 2.0?