Commit cb85b3b

feat: add OpenDataLoaderPDFReader with full extraction support

OpenDataLoaderPDFReader(BasePydanticReader) with 21 synced params + per-page splitting.

- SimpleDirectoryReader file_extractor protocol compatible
- Hybrid AI extraction mode support (docling-fast backend)
- 30 unit tests + 18 integration tests (48 total, all passing)
- CI/CD: test (PR, 3.10+3.13), test-full (multi-platform), release (PyPI OIDC with test gate)
- README with Quick Start, SimpleDirectoryReader example, RAG pipeline, full parameter reference

Code review: 20 CodeRabbit comments addressed (8 fixed, 8 rejected with rationale), 4 rounds of subagent review.

1 parent 2c91f37 commit cb85b3b

16 files changed

Lines changed: 1440 additions & 1 deletion

.github/workflows/release.yml

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+name: Release
+
+on:
+  push:
+    tags: ["v*"]
+
+permissions:
+  id-token: write
+  contents: read
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.13"
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+      - name: Run unit tests
+        run: pytest tests/test_readers_opendataloader_pdf.py -v --disable-socket --allow-unix-socket
+
+  publish:
+    needs: test
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    environment: release
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.13"
+      - name: Install build tools
+        run: pip install --upgrade pip uv
+      - name: Update version from tag
+        run: |
+          VERSION="${GITHUB_REF_NAME#v}"
+          sed -i "s/version = \"0.0.0\"/version = \"$VERSION\"/" pyproject.toml
+          grep "version = \"$VERSION\"" pyproject.toml || { echo "Version update failed"; exit 1; }
+          echo "Publishing version: $VERSION"
+      - name: Build
+        run: uv build
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
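
Review note: the "Update version from tag" step strips the leading `v` from the tag name, substitutes the result for the `0.0.0` placeholder in `pyproject.toml`, and fails the job if the expected version line is missing afterwards. The Python sketch below mirrors that logic for illustration only. The function name `apply_tag_version` and the sample text are hypothetical and not part of this commit, and unlike `sed` without the `g` flag, `str.replace` substitutes every occurrence.

```python
PLACEHOLDER = 'version = "0.0.0"'


def apply_tag_version(tag: str, pyproject_text: str) -> str:
    """Return pyproject text with the placeholder version replaced by the tag's version."""
    # Rough equivalent of ${GITHUB_REF_NAME#v}: drop one leading "v" if present.
    version = tag[1:] if tag.startswith("v") else tag
    updated = pyproject_text.replace(PLACEHOLDER, f'version = "{version}"')
    # Rough equivalent of the grep check: fail loudly if the version line is absent.
    if f'version = "{version}"' not in updated:
        raise ValueError("Version update failed")
    return updated


if __name__ == "__main__":
    sample = '[project]\nname = "example"\nversion = "0.0.0"\n'
    print(apply_tag_version("v1.2.3", sample))
```

The verification step matters because the substitution silently does nothing when the placeholder line in `pyproject.toml` is ever changed; without the check, the build would publish whatever version string happens to be in the file.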

.github/workflows/test-full.yml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+name: Test (Full)
+
+on:
+  workflow_dispatch:
+
+jobs:
+  integration-test:
+    runs-on: ${{ matrix.os }}
+    timeout-minutes: 20
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, windows-latest, macos-latest]
+        python-version: ["3.10", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Java 21
+        uses: actions/setup-java@v4
+        with:
+          distribution: temurin
+          java-version: "21"
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+      - name: Run all tests
+        run: pytest tests/ -v
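
Review note: the `matrix` above expands to the cross product of its two axes, so a manual run of this workflow starts six `integration-test` jobs. The sketch below only illustrates that expansion; `expand_matrix` is a hypothetical helper, not something GitHub Actions or this repository exposes.

```python
from itertools import product

OS_IMAGES = ["ubuntu-latest", "windows-latest", "macos-latest"]
PYTHON_VERSIONS = ["3.10", "3.13"]


def expand_matrix(os_images, python_versions):
    """Return one (os, python-version) pair per job, as a cross product."""
    return list(product(os_images, python_versions))


if __name__ == "__main__":
    for os_image, py in expand_matrix(OS_IMAGES, PYTHON_VERSIONS):
        print(f"integration-test ({os_image}, {py})")
```

The Python versions are quoted in the workflow for a reason: an unquoted `3.10` in YAML is parsed as the number 3.1.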

.github/workflows/test.yml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+name: Test
+
+on:
+  pull_request:
+    branches: [main]
+  push:
+    branches: [main]
+
+jobs:
+  unit-test:
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+      - name: Run unit tests
+        run: pytest tests/test_readers_opendataloader_pdf.py -v --disable-socket --allow-unix-socket
+
+  min-dep-test:
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python 3.10
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+      - name: Install minimum dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install "llama-index-core==0.13.0" "opendataloader-pdf==2.1.0"
+          pip install -e ".[dev]"
+      - name: Run unit tests
+        run: pytest tests/test_readers_opendataloader_pdf.py -v --disable-socket --allow-unix-socket
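
Review note: both jobs run the unit tests with the pytest-socket flags `--disable-socket --allow-unix-socket`, so a test that tries to reach the network fails instead of silently depending on it. The sketch below illustrates the general idea with the standard library only. It is not pytest-socket's implementation, and `block_network` and `SocketBlockedError` are hypothetical names.

```python
import socket
from contextlib import contextmanager


class SocketBlockedError(RuntimeError):
    """Raised when code tries to open a socket while networking is blocked."""


@contextmanager
def block_network(allow_unix: bool = True):
    """Temporarily replace socket.socket with a guard that rejects network sockets."""
    real_socket = socket.socket
    unix_family = getattr(socket, "AF_UNIX", None)  # not defined on every platform

    class GuardedSocket(real_socket):
        def __init__(self, family=socket.AF_INET, *args, **kwargs):
            is_unix = unix_family is not None and family == unix_family
            if not (allow_unix and is_unix):
                raise SocketBlockedError("network access is disabled in this test run")
            super().__init__(family, *args, **kwargs)

    socket.socket = GuardedSocket
    try:
        yield
    finally:
        # Always restore the real class, even if the guarded block raised.
        socket.socket = real_socket


if __name__ == "__main__":
    with block_network():
        try:
            socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        except SocketBlockedError as exc:
            print(f"blocked: {exc}")
```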

.gitignore

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+dist/
+build/
+*.egg
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
+*.so
+.venv/
+venv/
+.env
+.env.*
+*.env
+.claude/

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Changelog
+
+## [Unreleased]
+
+### Added
+- `OpenDataLoaderPDFReader` with 21 extraction parameters and per-page splitting
+- `SimpleDirectoryReader` `file_extractor` support
+- Hybrid AI extraction mode support (docling-fast backend)
+- Unit tests with mock-based testing and pytest-socket network isolation
+- Integration tests with real Java engine and PDF files
+- CI/CD workflows: test (PR), test-full (multi-platform), release (PyPI)

README.md

Lines changed: 229 additions & 1 deletion
@@ -1,2 +1,230 @@
+<!-- AI-AGENT-SUMMARY
+name: opendataloader-pdf-llamaindex
+category: LlamaIndex reader, PDF extraction for RAG
+license: Apache-2.0
+solves: [Load PDFs as LlamaIndex Document objects for RAG pipelines, structured PDF extraction with correct reading order and table preservation]
+input: PDF files (digital, tagged)
+output: LlamaIndex Document objects (text, Markdown, JSON with bounding boxes, HTML)
+sdk: Python
+requirements: Python 3.10+, Java 11+
+key-differentiators: [LlamaIndex-native BasePydanticReader, per-page Document splitting, SimpleDirectoryReader file_extractor support, all opendataloader-pdf extraction features]
+-->
+
 # opendataloader-pdf-llamaindex
-LlamaIndex reader for OpenDataLoader PDF — fast, accurate, local PDF extraction
+
+LlamaIndex reader for [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) — parse PDFs into structured `Document` objects for RAG pipelines.
+
+For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the [OpenDataLoader PDF documentation](https://opendataloader.org/docs).
+
+[![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf-llamaindex.svg)](https://pypi.org/project/opendataloader-pdf-llamaindex/)
+[![License](https://img.shields.io/pypi/l/opendataloader-pdf-llamaindex.svg)](https://github.com/opendataloader-project/opendataloader-pdf-llamaindex/blob/main/LICENSE)
+
+## Features
+
+- **Accurate reading order** — XY-Cut++ algorithm handles multi-column layouts correctly
+- **Table extraction** — Preserves table structure in output
+- **Multiple formats** — Text, Markdown, JSON (with bounding boxes), HTML
+- **Per-page splitting** — Each page becomes a separate `Document` with page number metadata
+- **AI safety** — Built-in prompt injection filtering (hidden text, off-page content, invisible layers)
+- **100% local** — No cloud APIs, your documents never leave your machine
+- **Fast** — Rule-based extraction, no GPU required
+
+## Requirements
+
+- Python >= 3.10
+- Java 11+ available on system `PATH`
+
+Verify Java is installed:
+
+```bash
+java -version
+```
+
+## Installation
+
+```bash
+pip install -U opendataloader-pdf-llamaindex
+```
+
+## Quick Start
+
+```python
+from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader
+
+reader = OpenDataLoaderPDFReader(format="text")
+documents = reader.load_data(file_path="document.pdf")
+
+print(documents[0].text)
+print(documents[0].metadata)
+# {'source': 'document.pdf', 'format': 'text', 'page': 1}
+```
+
+## SimpleDirectoryReader Integration
+
+Use with LlamaIndex's `SimpleDirectoryReader` via the `file_extractor` parameter:
+
+```python
+from llama_index.core import SimpleDirectoryReader
+from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader
+
+reader = SimpleDirectoryReader(
+    input_dir="./documents",
+    file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")}
+)
+documents = reader.load_data()
+```
+
+## Usage Examples
+
+### Output Formats
+
+```python
+from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader
+
+# Plain text (default) — best for simple RAG
+reader = OpenDataLoaderPDFReader(format="text")
+
+# Markdown — preserves headings, lists, tables
+reader = OpenDataLoaderPDFReader(format="markdown")
+
+# JSON — structured data with bounding boxes for source citations
+reader = OpenDataLoaderPDFReader(format="json")
+
+# HTML — styled output
+reader = OpenDataLoaderPDFReader(format="html")
+```
+
+### Tagged PDF Support
+
+For accessible PDFs with structure tags (common in government/legal documents):
+
+```python
+reader = OpenDataLoaderPDFReader(use_struct_tree=True)
+```
+
+### Table Detection
+
+```python
+reader = OpenDataLoaderPDFReader(
+    format="markdown",
+    table_method="cluster"  # Better for borderless tables
+)
+```
+
+### Sensitive Data Sanitization
+
+```python
+reader = OpenDataLoaderPDFReader(sanitize=True)
+# Replaces emails, phone numbers, IPs, credit cards, URLs with placeholders
+```
+
+### Page Selection
+
+```python
+reader = OpenDataLoaderPDFReader(pages="1,3,5-7")
+```
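
Review note: the `pages="1,3,5-7"` value mixes single pages with a range. The sketch below shows one plausible reading of that syntax. It assumes 1-based page numbers and inclusive ranges, which the README does not state explicitly, and `parse_pages` is a hypothetical helper; in the reader the string is presumably passed through to the extraction engine rather than parsed in Python.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1,3,5-7" into sorted, de-duplicated page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))  # assumption: ranges are inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers are assumed to be 1-based")
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1,3,5-7"))  # [1, 3, 5, 6, 7]
```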
+
+### Headers and Footers
+
+```python
+reader = OpenDataLoaderPDFReader(include_header_footer=True)
+```
+
+### Password-Protected PDFs
+
+```python
+reader = OpenDataLoaderPDFReader(password="secret")
+```
+
+### Image Handling
+
+```python
+# Embed images as Base64 in output
+reader = OpenDataLoaderPDFReader(image_output="embedded")
+
+# Save images to external files
+reader = OpenDataLoaderPDFReader(
+    image_output="external",
+    image_dir="./extracted_images"
+)
+```
+
+### Hybrid AI Mode
+
+For higher accuracy on complex documents (requires a running hybrid backend):
+
+```python
+reader = OpenDataLoaderPDFReader(
+    hybrid="docling-fast",
+    hybrid_fallback=True  # Fall back to Java on backend failure
+)
+```
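
Review note: `hybrid_fallback=True` is described as falling back to the Java engine when the hybrid backend fails. The sketch below illustrates that control flow only. The function `extract_with_fallback` and both extractor callables are hypothetical stand-ins, and in the actual package this decision is presumably made inside the extraction engine rather than in the reader.

```python
from typing import Callable


def extract_with_fallback(
    path: str,
    hybrid_extract: Callable[[str], str],
    local_extract: Callable[[str], str],
    hybrid_fallback: bool = False,
) -> str:
    """Try the hybrid backend first; optionally fall back to local extraction."""
    try:
        return hybrid_extract(path)
    except Exception:
        # Without the fallback flag, a backend failure surfaces to the caller.
        if not hybrid_fallback:
            raise
        return local_extract(path)


if __name__ == "__main__":
    def failing_backend(path: str) -> str:
        raise ConnectionError("hybrid backend not reachable")

    def local_engine(path: str) -> str:
        return f"text extracted locally from {path}"

    print(extract_with_fallback("document.pdf", failing_backend, local_engine, hybrid_fallback=True))
```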
+
+## RAG Pipeline Example
+
+```python
+from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
+from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader
+
+# Load PDFs
+reader = SimpleDirectoryReader(
+    input_dir="./documents",
+    file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")}
+)
+documents = reader.load_data()
+
+# Build index and query
+index = VectorStoreIndex.from_documents(documents)
+query_engine = index.as_query_engine()
+response = query_engine.query("What are the key findings?")
+print(response)
+```
+
+## Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `format` | `str` | `"text"` | Output format: `"text"`, `"markdown"`, `"json"`, `"html"` |
+| `split_pages` | `bool` | `True` | Split output into separate Documents per page |
+| `quiet` | `bool` | `False` | Suppress CLI logging output |
+| `content_safety_off` | `list[str]` | `None` | Safety filters to disable: `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"` |
+| `password` | `str` | `None` | Password for encrypted PDFs |
+| `keep_line_breaks` | `bool` | `False` | Preserve original line breaks |
+| `replace_invalid_chars` | `str` | `None` | Replacement for unrecognized characters |
+| `use_struct_tree` | `bool` | `False` | Use PDF structure tree (tagged PDFs) |
+| `table_method` | `str` | `None` | `"default"` (border-based) or `"cluster"` (border + cluster) |
+| `reading_order` | `str` | `None` | `"off"` or `"xycut"` (default when not specified) |
+| `image_output` | `str` | `"off"` | `"off"`, `"embedded"` (Base64), `"external"` (files) |
+| `image_format` | `str` | `None` | `"png"` or `"jpeg"` |
+| `image_dir` | `str` | `None` | Directory for external images |
+| `sanitize` | `bool` | `False` | Mask emails, phones, IPs, credit cards, URLs |
+| `pages` | `str` | `None` | Pages to extract, e.g., `"1,3,5-7"` |
+| `include_header_footer` | `bool` | `False` | Include page headers and footers |
+| `detect_strikethrough` | `bool` | `False` | Detect strikethrough text (experimental) |
+| `hybrid` | `str` | `None` | Hybrid AI backend: `"docling-fast"` |
+| `hybrid_mode` | `str` | `None` | `"auto"` (complex pages only) or `"full"` (all pages) |
+| `hybrid_url` | `str` | `None` | Custom backend server URL |
+| `hybrid_timeout` | `str` | `None` | Backend timeout in milliseconds |
+| `hybrid_fallback` | `bool` | `False` | Fall back to Java on backend failure |
+
+## Document Metadata
+
+Each `Document` includes metadata:
+
+**With `split_pages=True` (default):**
+
+```python
+{"source": "document.pdf", "format": "text", "page": 1}
+```
+
+**With `split_pages=False`:**
+
+```python
+{"source": "document.pdf", "format": "text"}
+```
+
+**With hybrid mode:**
+
+```python
+{"source": "document.pdf", "format": "text", "page": 1, "hybrid": "docling-fast"}
+```
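
Review note: the three metadata shapes documented above differ only in two optional keys: `page` appears when pages are split, and `hybrid` appears when a hybrid backend is configured. The sketch below shows one way such a dictionary could be assembled. `build_metadata` is a hypothetical helper written for this note, not the reader's actual code.

```python
from typing import Optional


def build_metadata(
    source: str,
    fmt: str,
    page: Optional[int] = None,
    hybrid: Optional[str] = None,
) -> dict:
    """Assemble a Document metadata dict with only the keys that apply."""
    metadata: dict = {"source": source, "format": fmt}
    if page is not None:
        metadata["page"] = page  # present only for per-page Documents
    if hybrid is not None:
        metadata["hybrid"] = hybrid  # present only when a hybrid backend is set
    return metadata


if __name__ == "__main__":
    print(build_metadata("document.pdf", "text", page=1))
    print(build_metadata("document.pdf", "text"))
    print(build_metadata("document.pdf", "text", page=1, hybrid="docling-fast"))
```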
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+"""OpenDataLoader PDF Reader for LlamaIndex."""
+
+from llama_index.readers.opendataloader_pdf.base import OpenDataLoaderPDFReader
+
+__all__ = ["OpenDataLoaderPDFReader"]
