title: PDFMinerToDocument
id: pdfminertodocument
slug: /pdfminertodocument
description: A component that converts complex PDF files to documents using pdfminer arguments.

PDFMinerToDocument

A component that converts complex PDF files to documents, with text extraction controlled through pdfminer arguments.

Most common position in a pipeline: Before PreProcessors or right at the beginning of an indexing pipeline
Mandatory run variables: "sources": PDF file paths or ByteStream objects
Output variables: "documents": A list of documents
API reference: Converters
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/pdfminer.py
Package name: haystack-ai

Overview

The PDFMinerToDocument component converts PDF files into documents with the PDFMiner extraction tool and lets you pass PDFMiner's extraction arguments to tune the result.

You can use it in an indexing pipeline to index the contents of a PDF file in a Document Store. It takes a list of file paths or ByteStream objects as input and outputs the converted result as a list of documents. Optionally, you can attach metadata to the documents through the meta input parameter.
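As a minimal sketch of preparing the inputs described above, the snippet below builds one metadata dictionary per source file using only the standard library. The helper name build_meta, the placeholder paths, and the metadata keys are illustrative assumptions, and passing a list of dictionaries (one per source) instead of a single dictionary is assumed to be accepted by the meta parameter; check the API reference before relying on it.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_meta(paths):
    """Return one metadata dict per PDF path (hypothetical helper)."""
    added = datetime.now(timezone.utc).isoformat()
    return [{"file_name": Path(p).name, "date_added": added} for p in paths]


sources = ["reports/q1.pdf", "reports/q2.pdf"]  # placeholder paths
meta = build_meta(sources)

# With haystack-ai and pdfminer.six installed, these inputs could be passed as:
# from haystack.components.converters import PDFMinerToDocument
# documents = PDFMinerToDocument().run(sources=sources, meta=meta)["documents"]
```

Keeping the metadata list the same length and order as sources makes it unambiguous which dictionary belongs to which file.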

When initializing the component, you can adjust several parameters to fit your PDF. See the full parameter list and descriptions in our API reference.
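A hedged sketch of adjusting initialization parameters might look like the following. The keyword names (line_margin, char_margin, detect_vertical) are assumed to mirror the layout-analysis options of pdfminer.six, and the values are arbitrary illustrations rather than recommendations; confirm both against the API reference.

```python
# Assumed keyword names, taken from pdfminer.six's layout-analysis options;
# verify them against the API reference before relying on them.
layout_kwargs = {
    "line_margin": 0.3,        # how close lines must be to form one paragraph
    "char_margin": 1.5,        # how close characters must be to form one line
    "detect_vertical": False,  # skip detection of vertically written text
}

try:
    from haystack.components.converters import PDFMinerToDocument

    converter = PDFMinerToDocument(**layout_kwargs)
except ImportError:
    # haystack-ai or pdfminer.six is not installed in this environment.
    converter = None
```

Collecting the arguments in a dictionary keeps the layout settings in one place, so they can be logged or reused across several converter instances.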

Usage

First, install the pdfminer.six package to start using this converter:

pip install pdfminer.six

On its own

from datetime import datetime

from haystack.components.converters import PDFMinerToDocument

converter = PDFMinerToDocument()
# The same metadata dictionary is attached to every converted document.
results = converter.run(
    sources=["sample.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]

print(documents[0].content)

# 'This is a text from the PDF file.'

In a pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PDFMinerToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", PDFMinerToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Placeholder list of PDF paths; replace it with your own files.
file_names = ["sample.pdf"]

pipeline.run({"converter": {"sources": file_names}})
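
The file_names list passed to the pipeline is not produced by the converter itself. One possible way to gather it, sketched with the standard library only, is to collect every PDF in a folder. The helper name collect_pdfs and the "data" directory are hypothetical.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return sorted PDF paths (as strings) found directly inside folder."""
    return sorted(str(p) for p in Path(folder).glob("*.pdf"))


# "data" is a placeholder directory; an empty list is returned if it
# contains no PDF files.
file_names = collect_pdfs("data")
```

Sorting the paths keeps the indexing order stable between runs, which can make the resulting documents easier to compare.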