Commit 433f158

[Feat] Add PaddleOCRLoader for local OCR LangChain integration
Add PaddleOCRLoader that wraps the local PaddleOCR library (PP-OCRv5 and PP-StructureV3) to produce LangChain Document objects without requiring any cloud API.

New files:
- langchain_paddleocr/document_loaders/paddleocr.py: PaddleOCRLoader, PaddleOCRConfig dataclass, custom exception hierarchy
- tests/unit_tests/document_loaders/test_paddleocr_loader.py: 29 unit tests
- tests/integration_tests/document_loaders/test_paddleocr_loader.py: Integration tests

Modified files:
- langchain_paddleocr/__init__.py: Add PaddleOCRLoader export (lazy import for PaddleOCRVLLoader)
- langchain_paddleocr/document_loaders/__init__.py: Same
- README.md / README_cn.md: Add PaddleOCRLoader usage docs
1 parent f0b39d4 commit 433f158

7 files changed

Lines changed: 1023 additions & 4 deletions

langchain-paddleocr/README.md

Lines changed: 57 additions & 0 deletions
@@ -41,6 +41,63 @@ for doc in docs[:2]:
     print("---")
 ```

+### `PaddleOCRLoader`
+
+The `PaddleOCRLoader` wraps the **local** PaddleOCR library to extract text from PDF and image files — no cloud API or access token required.
+
+It supports two modes:
+
+- **Basic OCR** (default) — fast text extraction using PP-OCRv5.
+- **Structure mode** — layout-aware extraction (tables, titles, figures) using PP-StructureV3.
+
+#### Basic OCR
+
+```python
+from langchain_paddleocr import PaddleOCRLoader
+
+loader = PaddleOCRLoader(file_path="path/to/document.pdf")
+docs = loader.load()
+
+for doc in docs:
+    print(f"Page {doc.metadata['page']}: {doc.page_content[:100]}...")
+    print(f"Confidence: {doc.metadata['confidence']:.2f}")
+```
+
+#### Structure mode
+
+```python
+from langchain_paddleocr import PaddleOCRLoader
+from langchain_paddleocr.document_loaders.paddleocr import PaddleOCRConfig
+
+config = PaddleOCRConfig(lang="en", use_table_recognition=True)
+loader = PaddleOCRLoader(
+    file_path=["page1.png", "page2.png"],
+    use_structure=True,
+    config=config,
+)
+
+for doc in loader.lazy_load():
+    print(doc.page_content)
+    print(doc.metadata["layout_blocks"])
+```
+
+#### Configuration
+
+Use `PaddleOCRConfig` to pass engine parameters:
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `lang` | `str` | Language code (`"ch"`, `"en"`, `"fr"`, etc.) |
+| `ocr_version` | `str` | Pipeline version (`"PP-OCRv3"`, `"PP-OCRv4"`, `"PP-OCRv5"`) |
+| `use_doc_orientation_classify` | `bool` | Enable document orientation classification |
+| `use_doc_unwarping` | `bool` | Enable document de-warping |
+| `text_det_thresh` | `float` | Detection confidence threshold |
+| `text_rec_score_thresh` | `float` | Recognition confidence threshold |
+| `use_table_recognition` | `bool` | Enable table recognition (structure mode) |
+| `use_chart_recognition` | `bool` | Enable chart recognition (structure mode) |
+
+See the full list in `PaddleOCRConfig`.
+
 ## 📖 Documentation

 For full documentation, see the [API reference](https://reference.langchain.com/python/integrations/langchain_paddleocr/). For conceptual guides, tutorials, and usage examples, see the [LangChain Docs](https://docs.langchain.com/oss/python/integrations/providers/paddleocr).
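
Review note: the Basic OCR example in this README reads a per-page `confidence` value from `doc.metadata`. A minimal sketch of how a caller might use that value to drop low-confidence pages before indexing is shown here. `PageDoc` is a hypothetical stand-in for LangChain's `Document` so the snippet runs without PaddleOCR installed; the 0.80 cutoff, the sample data, and the assumption that `confidence` is a float between 0 and 1 are illustrative and not taken from the commit.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class PageDoc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def keep_confident(docs: list[PageDoc], min_confidence: float = 0.80) -> list[PageDoc]:
    """Return only the documents whose metadata['confidence'] meets the cutoff.

    Documents without a 'confidence' key are treated as 0.0 and dropped.
    """
    return [d for d in docs if d.metadata.get("confidence", 0.0) >= min_confidence]


# Made-up sample data using the two metadata keys the README example reads.
sample_docs = [
    PageDoc("Invoice 2024-001", {"page": 1, "confidence": 0.97}),
    PageDoc("smudged handwriting", {"page": 2, "confidence": 0.41}),
    PageDoc("Total due: 120.00", {"page": 3, "confidence": 0.88}),
]

kept_pages = [d.metadata["page"] for d in keep_confident(sample_docs)]
print(kept_pages)  # [1, 3]
```

With the real loader, the same function could presumably be applied to the result of `loader.load()`, provided each document carries a numeric `confidence` entry as the README example suggests.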

langchain-paddleocr/README_cn.md

Lines changed: 57 additions & 0 deletions
@@ -42,6 +42,63 @@ for doc in docs[:2]:
 ```


+### `PaddleOCRLoader`
+
+`PaddleOCRLoader` wraps the **local** PaddleOCR library to extract text from PDF and image files, with no cloud API or access token required.
+
+It supports two modes:
+
+- **Basic OCR** (default): fast text extraction using PP-OCRv5.
+- **Layout analysis mode**: layout-aware extraction (tables, titles, figures, etc.) using PP-StructureV3.
+
+#### Basic OCR
+
+```python
+from langchain_paddleocr import PaddleOCRLoader
+
+loader = PaddleOCRLoader(file_path="path/to/document.pdf")
+docs = loader.load()
+
+for doc in docs:
+    print(f"Page {doc.metadata['page']}: {doc.page_content[:100]}...")
+    print(f"Confidence: {doc.metadata['confidence']:.2f}")
+```
+
+#### Layout analysis mode
+
+```python
+from langchain_paddleocr import PaddleOCRLoader
+from langchain_paddleocr.document_loaders.paddleocr import PaddleOCRConfig
+
+config = PaddleOCRConfig(lang="ch", use_table_recognition=True)
+loader = PaddleOCRLoader(
+    file_path=["page1.png", "page2.png"],
+    use_structure=True,
+    config=config,
+)
+
+for doc in loader.lazy_load():
+    print(doc.page_content)
+    print(doc.metadata["layout_blocks"])
+```
+
+#### Configuration
+
+Use `PaddleOCRConfig` to pass engine parameters:
+
+| Parameter | Type | Description |
+|------|------|------|
+| `lang` | `str` | Language code (`"ch"`, `"en"`, `"fr"`, etc.) |
+| `ocr_version` | `str` | Pipeline version (`"PP-OCRv3"`, `"PP-OCRv4"`, `"PP-OCRv5"`) |
+| `use_doc_orientation_classify` | `bool` | Enable document orientation classification |
+| `use_doc_unwarping` | `bool` | Enable document unwarping |
+| `text_det_thresh` | `float` | Detection confidence threshold |
+| `text_rec_score_thresh` | `float` | Recognition confidence threshold |
+| `use_table_recognition` | `bool` | Enable table recognition (layout analysis mode) |
+| `use_chart_recognition` | `bool` | Enable chart recognition (layout analysis mode) |
+
+For the full parameter list, see `PaddleOCRConfig`.
+
 ## 📖 Documentation

 For full documentation, see the [API reference](https://reference.langchain.com/python/integrations/langchain_paddleocr/). For conceptual guides, tutorials, and usage examples, see the [LangChain Docs](https://docs.langchain.com/oss/python/integrations/providers/paddleocr).
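
Review note: the configuration table lists optional engine parameters, and the commit message describes `PaddleOCRConfig` as a dataclass. One common pattern for such a class is to default every field to `None` and forward only the fields the caller set, so the engine's own defaults apply otherwise. The sketch below illustrates that pattern only. `OCRConfigSketch` and `to_engine_kwargs` are hypothetical names, and nothing here is confirmed to match how `PaddleOCRConfig` is implemented in this commit, since that file is not shown on this page.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class OCRConfigSketch:
    """Hypothetical config; field names mirror a few rows of the README table."""

    lang: str | None = None
    ocr_version: str | None = None
    text_det_thresh: float | None = None
    use_table_recognition: bool | None = None

    def to_engine_kwargs(self) -> dict[str, Any]:
        """Return only explicitly set fields (None means 'use the engine default')."""
        return {k: v for k, v in asdict(self).items() if v is not None}


cfg = OCRConfigSketch(lang="en", use_table_recognition=False)
engine_kwargs = cfg.to_engine_kwargs()
print(engine_kwargs)  # {'lang': 'en', 'use_table_recognition': False}
```

The filter tests `is not None` rather than truthiness, so an explicit `False` or `0.0` is still forwarded.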

langchain_paddleocr/__init__.py

Lines changed: 11 additions & 2 deletions
@@ -1,3 +1,12 @@
-from .document_loaders import PaddleOCRVLLoader
+from .document_loaders import PaddleOCRLoader

-__all__ = ["PaddleOCRVLLoader"]
+__all__ = ["PaddleOCRLoader", "PaddleOCRVLLoader"]
+
+
+def __getattr__(name: str) -> object:
+    if name == "PaddleOCRVLLoader":
+        from .document_loaders import PaddleOCRVLLoader
+
+        return PaddleOCRVLLoader
+    msg = f"module {__name__!r} has no attribute {name!r}"
+    raise AttributeError(msg)

langchain_paddleocr/document_loaders/__init__.py

Lines changed: 11 additions & 2 deletions
@@ -1,3 +1,12 @@
-from .paddleocr_vl import PaddleOCRVLLoader
+from .paddleocr import PaddleOCRLoader

-__all__ = ["PaddleOCRVLLoader"]
+__all__ = ["PaddleOCRLoader", "PaddleOCRVLLoader"]
+
+
+def __getattr__(name: str) -> object:
+    if name == "PaddleOCRVLLoader":
+        from .paddleocr_vl import PaddleOCRVLLoader
+
+        return PaddleOCRVLLoader
+    msg = f"module {__name__!r} has no attribute {name!r}"
+    raise AttributeError(msg)
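
Review note: both `__init__.py` changes rely on a module-level `__getattr__` hook (PEP 562) so that `PaddleOCRVLLoader` is imported only when first accessed. The following is a self-contained sketch of that mechanism using only the standard library. `demo_pkg`, `_lazy_getattr`, and `load_log` are hypothetical names, and `json.JSONDecoder` merely stands in for the lazily loaded class; it is not the package's code.

```python
import importlib
import types

load_log = []  # records each time the hook performs a lazy lookup


def _lazy_getattr(name: str) -> object:
    """Module-level __getattr__ hook: import the target only when asked for."""
    if name == "JSONDecoder":
        load_log.append(name)
        return importlib.import_module("json").JSONDecoder
    msg = f"module 'demo_pkg' has no attribute {name!r}"
    raise AttributeError(msg)


# Build an empty module and install the hook in its namespace, which is
# where Python looks for __getattr__ when normal attribute lookup fails.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr

decoder_cls = demo_pkg.JSONDecoder  # triggers the hook
print(decoder_cls.__name__, load_log)  # JSONDecoder ['JSONDecoder']
```

Like the hooks in the diff, this sketch does not store the result in the module namespace, so every access goes through `__getattr__` again. That is usually cheap because the import system caches loaded modules, but a variant could assign the resolved object into the module's globals after the first lookup if repeated access matters.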
