docs-website/docs/pipeline-components/converters/paddleocrvldocumentconverter.mdx
+57 lines changed: 57 additions & 0 deletions
@@ -32,10 +32,41 @@ The component takes `api_url` as a required parameter. To obtain the API URL, vi
By default, the component uses the `AISTUDIO_ACCESS_TOKEN` environment variable for authentication. You can also pass an `access_token` at initialization. The AI Studio access token can be obtained from [this page](https://aistudio.baidu.com/account/accessToken).

+`raw_paddleocr_responses` can be useful while tuning layout thresholds, prompt settings, or Markdown post-processing options because it gives you access to the original API output alongside the converted Haystack documents.
+
:::note
This component returns Markdown content. Avoid piping it through `DocumentCleaner()` with its default settings because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, and image tags. For page-aware chunking, connect the converter directly to `DocumentSplitter`, or disable those options if you need custom cleanup.
:::
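
To make the warning in the note concrete, the following is a minimal sketch of what whitespace collapsing does to Markdown. The `collapse_whitespace` helper is hypothetical and deliberately crude; it is written for this illustration only and is not Haystack's `DocumentCleaner` implementation. The sample Markdown is made up as well.

```python
import re

# Made-up sample of the kind of Markdown the converter might return.
markdown = "# Invoice\n\n| Item | Price |\n|------|-------|\n| Pen  | 2.50  |\n"


def collapse_whitespace(text: str) -> str:
    """Crude stand-in for aggressive cleanup: every whitespace run becomes one space."""
    return re.sub(r"\s+", " ", text).strip()


flattened = collapse_whitespace(markdown)

# The heading and all table rows now share a single line, so a Markdown
# parser or a structure-aware splitter can no longer tell them apart.
print(flattened)
```

In an actual pipeline, the corresponding choice would be to construct `DocumentCleaner` with `remove_extra_whitespaces=False` and `remove_empty_lines=False`, or to leave the cleaner out.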
+## When to use it
+
+`PaddleOCRVLDocumentConverter` is a strong fit when you need more than plain OCR text:
+
+- **Scanned PDFs and camera-captured documents** where page orientation and warped text can reduce extraction quality.
+- **Layout-sensitive documents** such as invoices, reports, forms, and multi-column PDFs where preserving structure matters for downstream chunking and retrieval.
+- **Tables, formulas, charts, or seals** where you want more targeted extraction behavior than plain text OCR.
+- **RAG ingestion pipelines** where Markdown output is useful because headings, lists, tables, and page breaks can be preserved for later splitting.
+
+## Useful configuration areas
+
+The full parameter list is available in the [API reference](/reference/integrations-paddleocr). In practice, the most useful options tend to fall into these groups:
+
+- **Input handling and image cleanup**: `file_type`, `use_doc_orientation_classify`, and `use_doc_unwarping` help when you mix PDFs and images or work with skewed scans and mobile photos.
+- **Layout-aware extraction**: `use_layout_detection`, `layout_threshold`, `layout_nms`, `layout_unclip_ratio`, `layout_merge_bboxes_mode`, `layout_shape_mode`, and `merge_layout_blocks` help you tune how regions are detected and merged before Markdown is generated.
+- **Content focus**: `prompt_label`, `use_ocr_for_image_block`, `use_chart_recognition`, and `use_seal_recognition` let you bias extraction toward a particular type of content, such as plain OCR, formulas, tables, charts, or seals.
+- **Markdown output shaping**: `format_block_content`, `markdown_ignore_labels`, `prettify_markdown`, `show_formula_number`, `restructure_pages`, `merge_tables`, and `relevel_titles` help you control how much cleanup and restructuring happens before the result becomes a Haystack document.
+- **VLM generation controls**: `repetition_penalty`, `temperature`, `top_p`, `min_pixels`, `max_pixels`, `max_new_tokens`, `vlm_extra_args`, and `additional_params` are useful when you need to trade off output quality, determinism, and cost.
+- **Debugging and inspection**: `visualize=True` and the returned `raw_paddleocr_responses` are helpful when you are tuning extraction quality for a new document type.
+
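
When a configuration grows, it can help to see which of the areas above each setting belongs to. The sketch below is one possible way to do that with plain dictionaries. The parameter names are copied from the list above; the group keys, the `group_by_area` helper, the sample values, and the placeholder URL are assumptions made for this illustration and are not part of the integration.

```python
# Parameter names copied from the groups above; the dictionary keys are
# labels invented for this sketch.
PARAMETER_GROUPS = {
    "input_handling": {"file_type", "use_doc_orientation_classify", "use_doc_unwarping"},
    "layout": {
        "use_layout_detection", "layout_threshold", "layout_nms", "layout_unclip_ratio",
        "layout_merge_bboxes_mode", "layout_shape_mode", "merge_layout_blocks",
    },
    "content_focus": {
        "prompt_label", "use_ocr_for_image_block", "use_chart_recognition", "use_seal_recognition",
    },
    "markdown_output": {
        "format_block_content", "markdown_ignore_labels", "prettify_markdown",
        "show_formula_number", "restructure_pages", "merge_tables", "relevel_titles",
    },
    "vlm_generation": {
        "repetition_penalty", "temperature", "top_p", "min_pixels", "max_pixels",
        "max_new_tokens", "vlm_extra_args", "additional_params",
    },
    "debugging": {"visualize"},
}


def group_by_area(kwargs: dict) -> dict:
    """Bucket keyword arguments by configuration area; unlisted names go to 'other'."""
    grouped = {}
    for name, value in kwargs.items():
        area = next(
            (area for area, names in PARAMETER_GROUPS.items() if name in names),
            "other",
        )
        grouped.setdefault(area, {})[name] = value
    return grouped


# Illustrative values only; the threshold and URL are placeholders.
settings = {
    "api_url": "https://example.invalid/layout-parsing",
    "use_doc_unwarping": True,
    "layout_threshold": 0.5,
    "merge_tables": True,
}
grouped = group_by_area(settings)
print(grouped)
```

Names that are not in any listed group, such as `api_url`, land in the `other` bucket rather than being rejected, because the list above is a selection and not the full parameter list.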
+## Typical scenarios
+
+These settings are especially useful in a few common workflows:
+
+- **Scanned contracts or receipts from phones**: start with `use_doc_orientation_classify=True` and `use_doc_unwarping=True`.
+- **Table-heavy financial or operations PDFs**: consider `use_layout_detection=True`, `merge_tables=True`, and `restructure_pages=True`.
+- **Formula-heavy documents**: use `prompt_label="formula"` together with `show_formula_number=True` if formula numbering matters in the final Markdown.
+- **Mixed business documents with figures or seals**: enable `use_chart_recognition=True`, `use_seal_recognition=True`, or `use_ocr_for_image_block=True` depending on the content you want to preserve.
+
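
The scenarios above can be captured as reusable presets. The following is a minimal sketch under stated assumptions: the preset names and the `build_converter_kwargs` helper are hypothetical, and the figures-and-seals preset enables all three flags for simplicity even though the text suggests picking only the ones you need. The parameter names and values come from the list above.

```python
# Preset names are invented for this sketch; the settings mirror the
# scenarios listed above.
SCENARIO_PRESETS = {
    "phone_scans": {
        "use_doc_orientation_classify": True,
        "use_doc_unwarping": True,
    },
    "table_heavy_pdfs": {
        "use_layout_detection": True,
        "merge_tables": True,
        "restructure_pages": True,
    },
    "formula_heavy": {
        "prompt_label": "formula",
        "show_formula_number": True,
    },
    "figures_and_seals": {
        "use_chart_recognition": True,
        "use_seal_recognition": True,
        "use_ocr_for_image_block": True,
    },
}


def build_converter_kwargs(scenario: str, **overrides) -> dict:
    """Return a copy of the preset for `scenario`, with `overrides` applied on top."""
    if scenario not in SCENARIO_PRESETS:
        raise ValueError(f"Unknown scenario: {scenario!r}")
    return {**SCENARIO_PRESETS[scenario], **overrides}


# Start from the table-heavy preset but keep the original page structure.
kwargs = build_converter_kwargs("table_heavy_pdfs", restructure_pages=False)
print(kwargs)
```

Assuming these settings are accepted as constructor keyword arguments, as the examples on this page suggest, the resulting dictionary could be unpacked into `PaddleOCRVLDocumentConverter` together with the required `api_url`.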
## Usage

You need to install the `paddleocr-haystack` integration to use `PaddleOCRVLDocumentConverter`:
@@ -64,6 +95,32 @@ result = converter.run(sources=[Path("my_document.pdf")])
documents = result["documents"]
```

+Advanced configuration for structure-heavy PDFs:
+
+```python
+from pathlib import Path
+from haystack.utils import Secret
+from haystack_integrations.components.converters.paddleocr import (
docs-website/versioned_docs/version-2.26/pipeline-components/converters/paddleocrvldocumentconverter.mdx
+57 lines changed: 57 additions & 0 deletions
docs-website/versioned_docs/version-2.27/pipeline-components/converters/paddleocrvldocumentconverter.mdx
+57 lines changed: 57 additions & 0 deletions
Expand the ``PaddleOCRVLDocumentConverter`` documentation with more detailed guidance on advanced parameters, common usage scenarios, and a more realistic configuration example for layout-heavy documents.