Commit fb638b4

olruas, szymondudycz
authored and committed
Olivier/paddleocr tuto (#9762)
Co-authored-by: Szymon Dudycz <szymond@pathway.com>
GitOrigin-RevId: 0ea8bdc15d1159398b9e4316eca5fc7568d51d6f
1 parent 85e331d commit fb638b4

1 file changed

Lines changed: 17 additions & 4 deletions

docs/2.developers/4.user-guide/50.llm-xpack/.parsers/parsers.md

@@ -20,20 +20,33 @@ Here is a table listing all available parsers and some details about them:
 
 | Name | Data | Description |
 |--------------------|------|-------------|
-| Utf8Parser | Text | Decodes text encoded in UTF-8. |
-| UnstructuredParser | Text + tables | Leverages Unstructured library to parse various document types. |
 | DoclingParser | PDF + tables + images | Utilizes docling library to extract structured content from PDFs, including images. |
-| PypdfParser | PDF | Uses pypdf library to extract text from PDFs with optional text cleanup. |
 | ImageParser | Image| Transforms images into textual descriptions and extracts structured information. |
+| PaddleOCR | PDF + tables + images | Utilizes the PaddleOCR library to extract structured content from PDFs and images. |
+| PypdfParser | PDF | Uses pypdf library to extract text from PDFs with optional text cleanup. |
 | SlideParser | Slide| Extracts information from PPTX and PDF slide decks using vision-based LLMs. |
-
+| UnstructuredParser | Text + tables | Leverages Unstructured library to parse various document types. |
+| Utf8Parser | Text | Decodes text encoded in UTF-8. |
 
 
 ## Utf8Parser
 
 [`Utf8Parser`](/developers/api-docs/pathway-xpacks-llm/parsers#pathway.xpacks.llm.parsers.Utf8Parser) is a simple parser designed to decode text encoded in UTF-8. It ensures that raw byte-encoded content is converted into a readable string format for further processing in a RAG pipeline.
 
 
+## PaddleOCR
+[`PaddleOCRParser`](/developers/api-docs/pathway-xpacks-llm/parsers/#pathway.xpacks.llm.parsers.PaddleOCRParser) is a parser that relies on the [PaddleOCR](https://aistudio.baidu.com/paddleocr) library.
+It requires the `paddlepaddle` package. The version depends on your hardware.
+If you want to run the OCR on CPU, you can install it with the following pip command: `pip install paddlepaddle>=3.2.0`
+For GPU support, follow the instructions on the [official site](https://www.paddlepaddle.org.cn/en/install/quick).
+
+The `PaddleOCRParser` uses a Paddle pipeline object to perform the parsing/OCR.
+Currently, `PaddleOCR` and `PPStructureV3` pipelines are supported.
+By default, it uses a `PPStructureV3` pipeline.
+
+More details on how to use the PaddleOCRParser in the [associated blog post](/blog/paddleocr/).
+
+
 ## UnstructuredParser
 
 [`UnstructuredParser`](/developers/api-docs/pathway-xpacks-llm/parsers#pathway.xpacks.llm.parsers.UnstructuredParser) leverages the parsing capabilities of [Unstructured](https://unstructured.io/). It supports various document types, including PDFs, HTML, Word documents, and [more](https://docs.unstructured.io/open-source/introduction/supported-file-types), making it a robust out-of-the-box solution for most use cases. Additionally, it offers good performance in terms of speed.
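
A note on the PaddleOCR section added in this commit: typed unquoted into a POSIX shell, `pip install paddlepaddle>=3.2.0` treats `>` as output redirection, so pip installs an unpinned `paddlepaddle` and the shell writes pip's output to a file named `=3.2.0`. The sketch below is one possible way to build a safely quoted command, check whether the package is present, and create the parser with its defaults. The helper names (`shell_command`, `installed_version`, `build_default_parser`) are hypothetical, and the no-argument `PaddleOCRParser()` call is an assumption inferred from the statement that a `PPStructureV3` pipeline is used by default; check the API docs for the actual constructor signature.

```python
from __future__ import annotations

import shlex
from importlib import metadata

# Requirement string taken from the CPU install instruction in the PaddleOCR section.
REQUIREMENT = "paddlepaddle>=3.2.0"


def shell_command(requirement: str = REQUIREMENT) -> str:
    """Return a copy-pasteable pip command; shlex.quote protects the '>' character."""
    return "pip install " + shlex.quote(requirement)


def installed_version(distribution: str = "paddlepaddle") -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def build_default_parser():
    """Hypothetical helper: create the parser with its default pipeline.

    Assumption: PaddleOCRParser can be constructed without arguments, in which
    case the docs say a PPStructureV3 pipeline is used. The import is deferred
    so this module still loads when pathway or paddlepaddle is not installed.
    """
    from pathway.xpacks.llm.parsers import PaddleOCRParser

    return PaddleOCRParser()


if __name__ == "__main__":
    print(shell_command())
    print("installed paddlepaddle:", installed_version())
```

Running the script prints the quoted install command and the detected `paddlepaddle` version (or `None`). Call `build_default_parser()` only once both `pathway` and `paddlepaddle` are installed.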
