| 1 | +<!-- AI-AGENT-SUMMARY |
| 2 | +name: opendataloader-pdf-llamaindex |
| 3 | +category: LlamaIndex reader, PDF extraction for RAG |
| 4 | +license: Apache-2.0 |
| 5 | +solves: [Load PDFs as LlamaIndex Document objects for RAG pipelines, structured PDF extraction with correct reading order and table preservation] |
| 6 | +input: PDF files (digital, tagged) |
| 7 | +output: LlamaIndex Document objects (text, Markdown, JSON with bounding boxes, HTML) |
| 8 | +sdk: Python |
| 9 | +requirements: Python 3.10+, Java 11+ |
| 10 | +key-differentiators: [LlamaIndex-native BasePydanticReader, per-page Document splitting, SimpleDirectoryReader file_extractor support, all opendataloader-pdf extraction features] |
| 11 | +--> |
| 12 | + |
1 | 13 | # opendataloader-pdf-llamaindex |
2 | | -LlamaIndex reader for OpenDataLoader PDF — fast, accurate, local PDF extraction |
| 14 | + |
| 15 | +LlamaIndex reader for [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) — parse PDFs into structured `Document` objects for RAG pipelines. |
| 16 | + |
| 17 | +For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the [OpenDataLoader PDF documentation](https://opendataloader.org/docs). |
| 18 | + |
| 19 | +[PyPI](https://pypi.org/project/opendataloader-pdf-llamaindex/)
| 20 | +[License: Apache-2.0](https://github.com/opendataloader-project/opendataloader-pdf-llamaindex/blob/main/LICENSE)
| 21 | + |
| 22 | +## Features |
| 23 | + |
| 24 | +- **Accurate reading order** — XY-Cut++ algorithm handles multi-column layouts correctly |
| 25 | +- **Table extraction** — Preserves table structure in output |
| 26 | +- **Multiple formats** — Text, Markdown, JSON (with bounding boxes), HTML |
| 27 | +- **Per-page splitting** — Each page becomes a separate `Document` with page number metadata |
| 28 | +- **AI safety** — Built-in prompt injection filtering (hidden text, off-page content, invisible layers) |
| 29 | +- **100% local** — No cloud APIs, your documents never leave your machine |
| 30 | +- **Fast** — Rule-based extraction, no GPU required |
| 31 | + |
| 32 | +## Requirements |
| 33 | + |
| 34 | +- Python >= 3.10 |
| 35 | +- Java 11+ available on system `PATH` |
| 36 | + |
| 37 | +Verify Java is installed: |
| 38 | + |
| 39 | +```bash |
| 40 | +java -version |
| 41 | +``` |
| 42 | + |
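The same check can be done from Python before constructing a reader. This is a minimal sketch using only the standard library; `parse_java_major` and `installed_java_major` are hypothetical helper names and are not part of this package.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_281" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # Most JDKs write the version banner to stderr, so read both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)
```

A caller could compare the result against 11 and raise a clear error early, rather than waiting for the first extraction to fail.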
| 43 | +## Installation |
| 44 | + |
| 45 | +```bash |
| 46 | +pip install -U opendataloader-pdf-llamaindex |
| 47 | +``` |
| 48 | + |
| 49 | +## Quick Start |
| 50 | + |
| 51 | +```python |
| 52 | +from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader |
| 53 | + |
| 54 | +reader = OpenDataLoaderPDFReader(format="text") |
| 55 | +documents = reader.load_data(file_path="document.pdf") |
| 56 | + |
| 57 | +print(documents[0].text) |
| 58 | +print(documents[0].metadata) |
| 59 | +# {'source': 'document.pdf', 'format': 'text', 'page': 1} |
| 60 | +``` |
| 61 | + |
| 62 | +## SimpleDirectoryReader Integration |
| 63 | + |
| 64 | +To load every PDF in a directory, pass the reader to LlamaIndex's `SimpleDirectoryReader` through its `file_extractor` parameter:
| 65 | + |
| 66 | +```python |
| 67 | +from llama_index.core import SimpleDirectoryReader |
| 68 | +from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader |
| 69 | + |
| 70 | +reader = SimpleDirectoryReader( |
| 71 | + input_dir="./documents", |
| 72 | + file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")} |
| 73 | +) |
| 74 | +documents = reader.load_data() |
| 75 | +``` |
| 76 | + |
| 77 | +## Usage Examples |
| 78 | + |
| 79 | +### Output Formats |
| 80 | + |
| 81 | +```python |
| 82 | +from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader |
| 83 | + |
| 84 | +# Plain text (default) — best for simple RAG |
| 85 | +reader = OpenDataLoaderPDFReader(format="text") |
| 86 | + |
| 87 | +# Markdown — preserves headings, lists, tables |
| 88 | +reader = OpenDataLoaderPDFReader(format="markdown") |
| 89 | + |
| 90 | +# JSON — structured data with bounding boxes for source citations |
| 91 | +reader = OpenDataLoaderPDFReader(format="json") |
| 92 | + |
| 93 | +# HTML — styled output |
| 94 | +reader = OpenDataLoaderPDFReader(format="html") |
| 95 | +``` |
| 96 | + |
| 97 | +### Tagged PDF Support |
| 98 | + |
| 99 | +For accessible PDFs with structure tags (common in government/legal documents): |
| 100 | + |
| 101 | +```python |
| 102 | +reader = OpenDataLoaderPDFReader(use_struct_tree=True) |
| 103 | +``` |
| 104 | + |
| 105 | +### Table Detection |
| 106 | + |
| 107 | +```python |
| 108 | +reader = OpenDataLoaderPDFReader( |
| 109 | + format="markdown", |
| 110 | + table_method="cluster" # Better for borderless tables |
| 111 | +) |
| 112 | +``` |
| 113 | + |
| 114 | +### Sensitive Data Sanitization |
| 115 | + |
| 116 | +```python |
| 117 | +reader = OpenDataLoaderPDFReader(sanitize=True) |
| 118 | +# Replaces emails, phone numbers, IP addresses, credit card numbers, and URLs with placeholders
| 119 | +``` |
| 120 | + |
| 121 | +### Page Selection |
| 122 | + |
| 123 | +```python |
| 124 | +reader = OpenDataLoaderPDFReader(pages="1,3,5-7") |
| 125 | +``` |
| 126 | + |
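To illustrate what a page spec such as `"1,3,5-7"` selects, the sketch below expands it into explicit page numbers. It assumes 1-based page numbers and inclusive ranges, which matches the `page: 1` metadata shown for the first document. `expand_pages` is a hypothetical helper for illustration, not the parser the library uses.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-7" into sorted 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        if "-" in part:
            # A range like "5-7" is treated as inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(expand_pages("1,3,5-7"))  # [1, 3, 5, 6, 7]
```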
| 127 | +### Headers and Footers |
| 128 | + |
| 129 | +```python |
| 130 | +reader = OpenDataLoaderPDFReader(include_header_footer=True) |
| 131 | +``` |
| 132 | + |
| 133 | +### Password-Protected PDFs |
| 134 | + |
| 135 | +```python |
| 136 | +reader = OpenDataLoaderPDFReader(password="secret") |
| 137 | +``` |
| 138 | + |
| 139 | +### Image Handling |
| 140 | + |
| 141 | +```python |
| 142 | +# Embed images as Base64 in output |
| 143 | +reader = OpenDataLoaderPDFReader(image_output="embedded") |
| 144 | + |
| 145 | +# Save images to external files |
| 146 | +reader = OpenDataLoaderPDFReader( |
| 147 | + image_output="external", |
| 148 | + image_dir="./extracted_images" |
| 149 | +) |
| 150 | +``` |
| 151 | + |
| 152 | +### Hybrid AI Mode |
| 153 | + |
| 154 | +For higher accuracy on complex documents (requires a running hybrid backend): |
| 155 | + |
| 156 | +```python |
| 157 | +reader = OpenDataLoaderPDFReader( |
| 158 | + hybrid="docling-fast", |
| 159 | + hybrid_fallback=True # Fall back to the local Java engine if the backend fails
| 160 | +) |
| 161 | +``` |
| 162 | + |
| 163 | +## RAG Pipeline Example |
| 164 | + |
| 165 | +```python |
| 166 | +from llama_index.core import VectorStoreIndex, SimpleDirectoryReader |
| 167 | +from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader |
| 168 | + |
| 169 | +# Load PDFs |
| 170 | +reader = SimpleDirectoryReader( |
| 171 | + input_dir="./documents", |
| 172 | + file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")} |
| 173 | +) |
| 174 | +documents = reader.load_data() |
| 175 | + |
| 176 | +# Build index and query |
| 177 | +index = VectorStoreIndex.from_documents(documents) |
| 178 | +query_engine = index.as_query_engine() |
| 179 | +response = query_engine.query("What are the key findings?") |
| 180 | +print(response) |
| 181 | +``` |
| 182 | + |
| 183 | +## Parameters |
| 184 | + |
| 185 | +| Parameter | Type | Default | Description | |
| 186 | +|-----------|------|---------|-------------| |
| 187 | +| `format` | `str` | `"text"` | Output format: `"text"`, `"markdown"`, `"json"`, `"html"` | |
| 188 | +| `split_pages` | `bool` | `True` | Split output into separate Documents per page | |
| 189 | +| `quiet` | `bool` | `False` | Suppress CLI logging output | |
| 190 | +| `content_safety_off` | `list[str]` | `None` | Safety filters to disable: `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"` | |
| 191 | +| `password` | `str` | `None` | Password for encrypted PDFs | |
| 192 | +| `keep_line_breaks` | `bool` | `False` | Preserve original line breaks | |
| 193 | +| `replace_invalid_chars` | `str` | `None` | Replacement for unrecognized characters | |
| 194 | +| `use_struct_tree` | `bool` | `False` | Use PDF structure tree (tagged PDFs) | |
| 195 | +| `table_method` | `str` | `None` | `"default"` (border-based) or `"cluster"` (border + cluster) | |
| 196 | +| `reading_order` | `str` | `None` | `"off"` or `"xycut"`; `"xycut"` is used when not specified |
| 197 | +| `image_output` | `str` | `"off"` | `"off"`, `"embedded"` (Base64), `"external"` (files) | |
| 198 | +| `image_format` | `str` | `None` | `"png"` or `"jpeg"` | |
| 199 | +| `image_dir` | `str` | `None` | Directory for external images | |
| 200 | +| `sanitize` | `bool` | `False` | Mask emails, phones, IPs, credit cards, URLs | |
| 201 | +| `pages` | `str` | `None` | Pages to extract, e.g., `"1,3,5-7"` | |
| 202 | +| `include_header_footer` | `bool` | `False` | Include page headers and footers | |
| 203 | +| `detect_strikethrough` | `bool` | `False` | Detect strikethrough text (experimental) | |
| 204 | +| `hybrid` | `str` | `None` | Hybrid AI backend: `"docling-fast"` | |
| 205 | +| `hybrid_mode` | `str` | `None` | `"auto"` (complex pages only) or `"full"` (all pages) | |
| 206 | +| `hybrid_url` | `str` | `None` | Custom backend server URL | |
| 207 | +| `hybrid_timeout` | `str` | `None` | Backend timeout in milliseconds | |
| 208 | +| `hybrid_fallback` | `bool` | `False` | Fall back to the local Java engine if the backend fails |
| 209 | + |
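Several parameters accept only the string values listed in the table. The sketch below checks a set of keyword options against those values before they are passed to the reader. `ALLOWED` and `check_options` are hypothetical names for illustration, and the reader may perform its own validation differently.

```python
# Allowed values copied from the parameter table above.
ALLOWED = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"off", "xycut"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
    "hybrid_mode": {"auto", "full"},
}


def check_options(**options):
    """Return a list of problems; an empty list means every value is listed."""
    problems = []
    for name, value in options.items():
        allowed = ALLOWED.get(name)
        # Parameters not in ALLOWED, and None values, are not checked here.
        if allowed is not None and value is not None and value not in allowed:
            problems.append(f"{name}={value!r} not in {sorted(allowed)}")
    return problems


print(check_options(format="markdown", table_method="cluster"))  # []
```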
| 210 | +## Document Metadata |
| 211 | + |
| 212 | +Each `Document` carries metadata whose keys depend on the reader options:
| 213 | + |
| 214 | +**With `split_pages=True` (default):** |
| 215 | + |
| 216 | +```python |
| 217 | +{"source": "document.pdf", "format": "text", "page": 1} |
| 218 | +``` |
| 219 | + |
| 220 | +**With `split_pages=False`:** |
| 221 | + |
| 222 | +```python |
| 223 | +{"source": "document.pdf", "format": "text"} |
| 224 | +``` |
| 225 | + |
| 226 | +**With hybrid mode:** |
| 227 | + |
| 228 | +```python |
| 229 | +{"source": "document.pdf", "format": "text", "page": 1, "hybrid": "docling-fast"} |
| 230 | +``` |
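Because the `page` key is present only when pages are split, code that consumes this metadata should not assume it exists. The sketch below groups page numbers by source file using plain dictionaries shaped like the examples above. `pages_by_source` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict


def pages_by_source(metadata_records):
    """Map each source file to the page numbers seen for it."""
    grouped = defaultdict(list)
    for meta in metadata_records:
        # "page" is absent when split_pages=False, so .get() yields None.
        grouped[meta["source"]].append(meta.get("page"))
    return dict(grouped)


records = [
    {"source": "a.pdf", "format": "text", "page": 1},
    {"source": "a.pdf", "format": "text", "page": 2},
    {"source": "b.pdf", "format": "text"},
]
print(pages_by_source(records))  # {'a.pdf': [1, 2], 'b.pdf': [None]}
```

With loaded documents, the same function could be called as `pages_by_source(doc.metadata for doc in documents)`.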