docs/2.developers/4.user-guide/50.llm-xpack/.parsers/parsers.md
17 additions & 4 deletions
@@ -20,20 +20,33 @@ Here is a table listing all available parsers and some details about them:
 
 | Name | Data | Description |
 |--------------------|------|-------------|
-| Utf8Parser | Text | Decodes text encoded in UTF-8. |
-| UnstructuredParser | Text + tables | Leverages Unstructured library to parse various document types. |
 | DoclingParser | PDF + tables + images | Utilizes docling library to extract structured content from PDFs, including images. |
-| PypdfParser | PDF | Uses pypdf library to extract text from PDFs with optional text cleanup. |
 | ImageParser | Image| Transforms images into textual descriptions and extracts structured information. |
+| PaddleOCR | PDF + tables + images | Utilizes the PaddleOCR library to extract structured content from PDFs and images. |
+| PypdfParser | PDF | Uses pypdf library to extract text from PDFs with optional text cleanup. |
 | SlideParser | Slide| Extracts information from PPTX and PDF slide decks using vision-based LLMs. |
-
+| UnstructuredParser | Text + tables | Leverages Unstructured library to parse various document types. |
+| Utf8Parser | Text | Decodes text encoded in UTF-8. |
 
 
 ## Utf8Parser
 
 [`Utf8Parser`](/developers/api-docs/pathway-xpacks-llm/parsers#pathway.xpacks.llm.parsers.Utf8Parser) is a simple parser designed to decode text encoded in UTF-8. It ensures that raw byte-encoded content is converted into a readable string format for further processing in a RAG pipeline.
 
 
+## PaddleOCR
+[`PaddleOCRParser`](/developers/api-docs/pathway-xpacks-llm/parsers/#pathway.xpacks.llm.parsers.PaddleOCRParser) is a parser that relies on the [PaddleOCR](https://aistudio.baidu.com/paddleocr) library.
+It requires the `paddlepaddle` package. The version to install depends on your hardware.
+If you want to run the OCR on CPU, you can install it with the following pip command: `pip install "paddlepaddle>=3.2.0"`
+For GPU support, follow the instructions on the [official site](https://www.paddlepaddle.org.cn/en/install/quick).
+
+The `PaddleOCRParser` uses a Paddle pipeline object to perform the parsing/OCR.
+Currently, `PaddleOCR` and `PPStructureV3` pipelines are supported.
+By default, it uses a `PPStructureV3` pipeline.
+
+For more details on how to use the `PaddleOCRParser`, see the [associated blog post](/blog/paddleocr/).
+
+
 ## UnstructuredParser
 
 [`UnstructuredParser`](/developers/api-docs/pathway-xpacks-llm/parsers#pathway.xpacks.llm.parsers.UnstructuredParser) leverages the parsing capabilities of [Unstructured](https://unstructured.io/). It supports various document types, including PDFs, HTML, Word documents, and [more](https://docs.unstructured.io/open-source/introduction/supported-file-types), making it a robust out-of-the-box solution for most use cases. Additionally, it offers good performance in terms of speed.
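
Note on the `Utf8Parser` context lines in this hunk: the decoding step they describe can be illustrated in plain Python. This is a minimal sketch that does not use Pathway at all. `decode_utf8` is a hypothetical helper name, and raising on invalid bytes is an assumption made for the example, not necessarily how `Utf8Parser` behaves.

```python
def decode_utf8(raw: bytes) -> str:
    """Decode UTF-8 bytes into a str (hypothetical helper, not Pathway's API)."""
    # Strict decoding: invalid byte sequences raise UnicodeDecodeError.
    return raw.decode("utf-8")


# Raw bytes as they might arrive from a file or connector.
raw = "Café ☕".encode("utf-8")
text = decode_utf8(raw)
print(text)
```

Strict decoding surfaces corrupted input early. A pipeline that prefers to keep going could pass `errors="replace"` to `bytes.decode` instead.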