Commit 865e7a5

docs: expand PaddleOCR advanced usage guide (#11018)
1 parent 0422544 commit 865e7a5

4 files changed

Lines changed: 175 additions & 0 deletions

docs-website/docs/pipeline-components/converters/paddleocrvldocumentconverter.mdx

Lines changed: 57 additions & 0 deletions
@@ -32,10 +32,41 @@ The component takes `api_url` as a required parameter. To obtain the API URL, vi
 
 By default, the component uses the `AISTUDIO_ACCESS_TOKEN` environment variable for authentication. You can also pass an `access_token` at initialization. The AI Studio access token can be obtained from [this page](https://aistudio.baidu.com/account/accessToken).
 
+`raw_paddleocr_responses` can be useful while tuning layout thresholds, prompt settings, or Markdown post-processing options because it gives you access to the original API output alongside the converted Haystack documents.
+
 :::note
 This component returns Markdown content. Avoid piping it through `DocumentCleaner()` with its default settings because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, and image tags. For page-aware chunking, connect the converter directly to `DocumentSplitter`, or disable those options if you need custom cleanup.
 :::
 
+## When to use it
+
+`PaddleOCRVLDocumentConverter` is a strong fit when you need more than plain OCR text:
+
+- **Scanned PDFs and camera-captured documents** where page orientation and warped text can reduce extraction quality.
+- **Layout-sensitive documents** such as invoices, reports, forms, and multi-column PDFs where preserving structure matters for downstream chunking and retrieval.
+- **Tables, formulas, charts, or seals** where you want more targeted extraction behavior than plain text OCR.
+- **RAG ingestion pipelines** where Markdown output is useful because headings, lists, tables, and page breaks can be preserved for later splitting.
+
+## Useful configuration areas
+
+The full parameter list is available in the [API reference](/reference/integrations-paddleocr). In practice, the most useful options tend to fall into these groups:
+
+- **Input handling and image cleanup**: `file_type`, `use_doc_orientation_classify`, and `use_doc_unwarping` help when you mix PDFs and images or work with skewed scans and mobile photos.
+- **Layout-aware extraction**: `use_layout_detection`, `layout_threshold`, `layout_nms`, `layout_unclip_ratio`, `layout_merge_bboxes_mode`, `layout_shape_mode`, and `merge_layout_blocks` help you tune how regions are detected and merged before Markdown is generated.
+- **Content focus**: `prompt_label`, `use_ocr_for_image_block`, `use_chart_recognition`, and `use_seal_recognition` let you bias extraction toward a particular type of content, such as plain OCR, formulas, tables, charts, or seals.
+- **Markdown output shaping**: `format_block_content`, `markdown_ignore_labels`, `prettify_markdown`, `show_formula_number`, `restructure_pages`, `merge_tables`, and `relevel_titles` help you control how much cleanup and restructuring happens before the result becomes a Haystack document.
+- **VLM generation controls**: `repetition_penalty`, `temperature`, `top_p`, `min_pixels`, `max_pixels`, `max_new_tokens`, `vlm_extra_args`, and `additional_params` are useful when you need to trade off output quality, determinism, and cost.
+- **Debugging and inspection**: `visualize=True` and the returned `raw_paddleocr_responses` are helpful when you are tuning extraction quality for a new document type.
+
+## Typical scenarios
+
+These settings are especially useful in a few common workflows:
+
+- **Scanned contracts or receipts from phones**: start with `use_doc_orientation_classify=True` and `use_doc_unwarping=True`.
+- **Table-heavy financial or operations PDFs**: consider `use_layout_detection=True`, `merge_tables=True`, and `restructure_pages=True`.
+- **Formula-heavy documents**: use `prompt_label="formula"` together with `show_formula_number=True` if formula numbering matters in the final Markdown.
+- **Mixed business documents with figures or seals**: enable `use_chart_recognition=True`, `use_seal_recognition=True`, or `use_ocr_for_image_block=True` depending on the content you want to preserve.
+
 ## Usage
 
 You need to install the `paddleocr-haystack` integration to use `PaddleOCRVLDocumentConverter`:
@@ -64,6 +95,32 @@ result = converter.run(sources=[Path("my_document.pdf")])
 documents = result["documents"]
 ```
 
+Advanced configuration for structure-heavy PDFs:
+
+```python
+from pathlib import Path
+from haystack.utils import Secret
+from haystack_integrations.components.converters.paddleocr import (
+    PaddleOCRVLDocumentConverter,
+)
+
+converter = PaddleOCRVLDocumentConverter(
+    api_url="<your-api-url>",
+    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
+    use_doc_orientation_classify=True,
+    use_doc_unwarping=True,
+    use_layout_detection=True,
+    use_ocr_for_image_block=True,
+    merge_tables=True,
+    restructure_pages=True,
+    prettify_markdown=True,
+)
+
+result = converter.run(sources=[Path("quarterly_report.pdf")])
+documents = result["documents"]
+raw_responses = result["raw_paddleocr_responses"]
+```
+
 ### In a pipeline
 
 Here's an example of an indexing pipeline that processes PDFs with OCR and writes them to a Document Store:
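
The `:::note` in the diff above warns that default `DocumentCleaner()` settings can flatten Markdown. As a rough illustration of that failure mode, the sketch below runs a deliberately naive cleanup over a small Markdown string. `naive_cleanup` is a hypothetical helper written for this example only. It is not `DocumentCleaner`'s implementation, and the real component may behave differently in detail.

```python
import re

# A small Markdown sample with a heading, a paragraph, and a table.
markdown = (
    "# Report\n"
    "\n"
    "Intro paragraph.\n"
    "\n"
    "| item | total |\n"
    "|------|-------|\n"
    "| a    | 1     |\n"
)


def naive_cleanup(text: str) -> str:
    """Hypothetical stand-in for aggressive cleanup; NOT DocumentCleaner's code."""
    # Drop lines that are empty or contain only whitespace.
    kept = [line for line in text.split("\n") if line.strip()]
    # Collapse every whitespace run, newlines included, into a single space.
    return re.sub(r"\s+", " ", "\n".join(kept)).strip()


flattened = naive_cleanup(markdown)
print(flattened)
```

Once the line breaks are gone, the heading, the paragraph, and the table rows all sit on one line, so a Markdown-aware splitter or renderer can no longer tell them apart.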

docs-website/versioned_docs/version-2.26/pipeline-components/converters/paddleocrvldocumentconverter.mdx

Lines changed: 57 additions & 0 deletions
@@ -32,10 +32,41 @@ The component takes `api_url` as a required parameter. To obtain the API URL, vi
 
 By default, the component uses the `AISTUDIO_ACCESS_TOKEN` environment variable for authentication. You can also pass an `access_token` at initialization. The AI Studio access token can be obtained from [this page](https://aistudio.baidu.com/account/accessToken).
 
+`raw_paddleocr_responses` can be useful while tuning layout thresholds, prompt settings, or Markdown post-processing options because it gives you access to the original API output alongside the converted Haystack documents.
+
 :::note
 This component returns Markdown content. Avoid piping it through `DocumentCleaner()` with its default settings because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, and image tags. For page-aware chunking, connect the converter directly to `DocumentSplitter`, or disable those options if you need custom cleanup.
 :::
 
+## When to use it
+
+`PaddleOCRVLDocumentConverter` is a strong fit when you need more than plain OCR text:
+
+- **Scanned PDFs and camera-captured documents** where page orientation and warped text can reduce extraction quality.
+- **Layout-sensitive documents** such as invoices, reports, forms, and multi-column PDFs where preserving structure matters for downstream chunking and retrieval.
+- **Tables, formulas, charts, or seals** where you want more targeted extraction behavior than plain text OCR.
+- **RAG ingestion pipelines** where Markdown output is useful because headings, lists, tables, and page breaks can be preserved for later splitting.
+
+## Useful configuration areas
+
+The full parameter list is available in the [API reference](/reference/integrations-paddleocr). In practice, the most useful options tend to fall into these groups:
+
+- **Input handling and image cleanup**: `file_type`, `use_doc_orientation_classify`, and `use_doc_unwarping` help when you mix PDFs and images or work with skewed scans and mobile photos.
+- **Layout-aware extraction**: `use_layout_detection`, `layout_threshold`, `layout_nms`, `layout_unclip_ratio`, `layout_merge_bboxes_mode`, `layout_shape_mode`, and `merge_layout_blocks` help you tune how regions are detected and merged before Markdown is generated.
+- **Content focus**: `prompt_label`, `use_ocr_for_image_block`, `use_chart_recognition`, and `use_seal_recognition` let you bias extraction toward a particular type of content, such as plain OCR, formulas, tables, charts, or seals.
+- **Markdown output shaping**: `format_block_content`, `markdown_ignore_labels`, `prettify_markdown`, `show_formula_number`, `restructure_pages`, `merge_tables`, and `relevel_titles` help you control how much cleanup and restructuring happens before the result becomes a Haystack document.
+- **VLM generation controls**: `repetition_penalty`, `temperature`, `top_p`, `min_pixels`, `max_pixels`, `max_new_tokens`, `vlm_extra_args`, and `additional_params` are useful when you need to trade off output quality, determinism, and cost.
+- **Debugging and inspection**: `visualize=True` and the returned `raw_paddleocr_responses` are helpful when you are tuning extraction quality for a new document type.
+
+## Typical scenarios
+
+These settings are especially useful in a few common workflows:
+
+- **Scanned contracts or receipts from phones**: start with `use_doc_orientation_classify=True` and `use_doc_unwarping=True`.
+- **Table-heavy financial or operations PDFs**: consider `use_layout_detection=True`, `merge_tables=True`, and `restructure_pages=True`.
+- **Formula-heavy documents**: use `prompt_label="formula"` together with `show_formula_number=True` if formula numbering matters in the final Markdown.
+- **Mixed business documents with figures or seals**: enable `use_chart_recognition=True`, `use_seal_recognition=True`, or `use_ocr_for_image_block=True` depending on the content you want to preserve.
+
 ## Usage
 
 You need to install the `paddleocr-haystack` integration to use `PaddleOCRVLDocumentConverter`:
@@ -64,6 +95,32 @@ result = converter.run(sources=[Path("my_document.pdf")])
 documents = result["documents"]
 ```
 
+Advanced configuration for structure-heavy PDFs:
+
+```python
+from pathlib import Path
+from haystack.utils import Secret
+from haystack_integrations.components.converters.paddleocr import (
+    PaddleOCRVLDocumentConverter,
+)
+
+converter = PaddleOCRVLDocumentConverter(
+    api_url="<your-api-url>",
+    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
+    use_doc_orientation_classify=True,
+    use_doc_unwarping=True,
+    use_layout_detection=True,
+    use_ocr_for_image_block=True,
+    merge_tables=True,
+    restructure_pages=True,
+    prettify_markdown=True,
+)
+
+result = converter.run(sources=[Path("quarterly_report.pdf")])
+documents = result["documents"]
+raw_responses = result["raw_paddleocr_responses"]
+```
+
 ### In a pipeline
 
 Here's an example of an indexing pipeline that processes PDFs with OCR and writes them to a Document Store:
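
The "Typical scenarios" list in this diff pairs each workflow with a handful of keyword arguments. One possible way to keep those pairings reusable is a small preset table, sketched below. The preset names, `SCENARIO_PRESETS`, and `build_converter_kwargs` are hypothetical and exist only for this example. The parameter names and values are the ones the list itself mentions.

```python
# Hypothetical preset table; keys are made-up scenario names, values mirror
# the keyword arguments named in the "Typical scenarios" list.
SCENARIO_PRESETS = {
    "phone_scan": {
        "use_doc_orientation_classify": True,
        "use_doc_unwarping": True,
    },
    "table_heavy": {
        "use_layout_detection": True,
        "merge_tables": True,
        "restructure_pages": True,
    },
    "formula_heavy": {
        "prompt_label": "formula",
        "show_formula_number": True,
    },
    "figures_and_seals": {
        "use_chart_recognition": True,
        "use_seal_recognition": True,
        "use_ocr_for_image_block": True,
    },
}


def build_converter_kwargs(*scenarios: str, **overrides) -> dict:
    """Merge the named presets in order, then apply explicit overrides last."""
    kwargs: dict = {}
    for name in scenarios:
        # An unknown scenario name raises KeyError instead of being ignored.
        kwargs.update(SCENARIO_PRESETS[name])
    kwargs.update(overrides)
    return kwargs


kwargs = build_converter_kwargs("phone_scan", "table_heavy", prettify_markdown=True)
print(kwargs)
```

The merged dictionary could then be unpacked into the constructor, for example `PaddleOCRVLDocumentConverter(api_url="<your-api-url>", **kwargs)`, assuming the integration is installed.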

docs-website/versioned_docs/version-2.27/pipeline-components/converters/paddleocrvldocumentconverter.mdx

Lines changed: 57 additions & 0 deletions
@@ -32,10 +32,41 @@ The component takes `api_url` as a required parameter. To obtain the API URL, vi
 
 By default, the component uses the `AISTUDIO_ACCESS_TOKEN` environment variable for authentication. You can also pass an `access_token` at initialization. The AI Studio access token can be obtained from [this page](https://aistudio.baidu.com/account/accessToken).
 
+`raw_paddleocr_responses` can be useful while tuning layout thresholds, prompt settings, or Markdown post-processing options because it gives you access to the original API output alongside the converted Haystack documents.
+
 :::note
 This component returns Markdown content. Avoid piping it through `DocumentCleaner()` with its default settings because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, and image tags. For page-aware chunking, connect the converter directly to `DocumentSplitter`, or disable those options if you need custom cleanup.
 :::
 
+## When to use it
+
+`PaddleOCRVLDocumentConverter` is a strong fit when you need more than plain OCR text:
+
+- **Scanned PDFs and camera-captured documents** where page orientation and warped text can reduce extraction quality.
+- **Layout-sensitive documents** such as invoices, reports, forms, and multi-column PDFs where preserving structure matters for downstream chunking and retrieval.
+- **Tables, formulas, charts, or seals** where you want more targeted extraction behavior than plain text OCR.
+- **RAG ingestion pipelines** where Markdown output is useful because headings, lists, tables, and page breaks can be preserved for later splitting.
+
+## Useful configuration areas
+
+The full parameter list is available in the [API reference](/reference/integrations-paddleocr). In practice, the most useful options tend to fall into these groups:
+
+- **Input handling and image cleanup**: `file_type`, `use_doc_orientation_classify`, and `use_doc_unwarping` help when you mix PDFs and images or work with skewed scans and mobile photos.
+- **Layout-aware extraction**: `use_layout_detection`, `layout_threshold`, `layout_nms`, `layout_unclip_ratio`, `layout_merge_bboxes_mode`, `layout_shape_mode`, and `merge_layout_blocks` help you tune how regions are detected and merged before Markdown is generated.
+- **Content focus**: `prompt_label`, `use_ocr_for_image_block`, `use_chart_recognition`, and `use_seal_recognition` let you bias extraction toward a particular type of content, such as plain OCR, formulas, tables, charts, or seals.
+- **Markdown output shaping**: `format_block_content`, `markdown_ignore_labels`, `prettify_markdown`, `show_formula_number`, `restructure_pages`, `merge_tables`, and `relevel_titles` help you control how much cleanup and restructuring happens before the result becomes a Haystack document.
+- **VLM generation controls**: `repetition_penalty`, `temperature`, `top_p`, `min_pixels`, `max_pixels`, `max_new_tokens`, `vlm_extra_args`, and `additional_params` are useful when you need to trade off output quality, determinism, and cost.
+- **Debugging and inspection**: `visualize=True` and the returned `raw_paddleocr_responses` are helpful when you are tuning extraction quality for a new document type.
+
+## Typical scenarios
+
+These settings are especially useful in a few common workflows:
+
+- **Scanned contracts or receipts from phones**: start with `use_doc_orientation_classify=True` and `use_doc_unwarping=True`.
+- **Table-heavy financial or operations PDFs**: consider `use_layout_detection=True`, `merge_tables=True`, and `restructure_pages=True`.
+- **Formula-heavy documents**: use `prompt_label="formula"` together with `show_formula_number=True` if formula numbering matters in the final Markdown.
+- **Mixed business documents with figures or seals**: enable `use_chart_recognition=True`, `use_seal_recognition=True`, or `use_ocr_for_image_block=True` depending on the content you want to preserve.
+
 ## Usage
 
 You need to install the `paddleocr-haystack` integration to use `PaddleOCRVLDocumentConverter`:
@@ -64,6 +95,32 @@ result = converter.run(sources=[Path("my_document.pdf")])
 documents = result["documents"]
 ```
 
+Advanced configuration for structure-heavy PDFs:
+
+```python
+from pathlib import Path
+from haystack.utils import Secret
+from haystack_integrations.components.converters.paddleocr import (
+    PaddleOCRVLDocumentConverter,
+)
+
+converter = PaddleOCRVLDocumentConverter(
+    api_url="<your-api-url>",
+    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
+    use_doc_orientation_classify=True,
+    use_doc_unwarping=True,
+    use_layout_detection=True,
+    use_ocr_for_image_block=True,
+    merge_tables=True,
+    restructure_pages=True,
+    prettify_markdown=True,
+)
+
+result = converter.run(sources=[Path("quarterly_report.pdf")])
+documents = result["documents"]
+raw_responses = result["raw_paddleocr_responses"]
+```
+
 ### In a pipeline
 
 Here's an example of an indexing pipeline that processes PDFs with OCR and writes them to a Document Store:
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+---
+enhancements:
+  - |
+    Expand the ``PaddleOCRVLDocumentConverter`` documentation with more detailed guidance on advanced parameters, common usage scenarios, and a more realistic configuration example for layout-heavy documents.
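
The added docs describe `raw_paddleocr_responses` as a debugging aid while tuning extraction. A minimal sketch of one way to keep those payloads around for comparison between runs is shown below: it writes each response to its own JSON file. `dump_raw_responses` is a hypothetical helper, and the sample payload is a placeholder, since the real response schema is not shown in this commit. The sketch only assumes the value is a list of JSON-like objects.

```python
import json
import tempfile
from pathlib import Path


def dump_raw_responses(raw_responses: list, out_dir) -> list:
    """Write each raw response to its own JSON file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, response in enumerate(raw_responses):
        path = out_dir / f"response_{index:03d}.json"
        # default=str keeps the dump from failing on non-JSON values.
        text = json.dumps(response, indent=2, ensure_ascii=False, default=str)
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder payload; NOT the real PaddleOCR API schema.
sample_responses = [{"note": "placeholder response"}]

with tempfile.TemporaryDirectory() as tmp:
    written = dump_raw_responses(sample_responses, tmp)
    reloaded = json.loads(written[0].read_text(encoding="utf-8"))
```

In a real run, the list passed in would be `result["raw_paddleocr_responses"]` from `converter.run(...)`, and the output directory would be a persistent path rather than a temporary one.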
