A Haystack integration for AWS Textract that extracts text and structured data from documents using OCR.
The AmazonTextractConverter component converts images and single-page PDFs into Haystack Document objects using the AWS Textract synchronous API.
Supported file formats: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
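The format and size limits above can be checked locally before a file is sent to Textract. The following is a minimal sketch, not part of the integration: the `validate_source` helper and both constants are hypothetical names, and it assumes the 10 MB limit applies to every file type and means 10 * 1024 * 1024 bytes. It does not check PDF page counts.

```python
from pathlib import Path

# Extensions derived from the supported formats listed above (assumed mapping).
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024


def validate_source(path: str, size_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file extension: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the {MAX_FILE_SIZE_BYTES}-byte limit")
    return problems
```

For a file on disk, the size can be read with `Path(path).stat().st_size` and passed as `size_bytes`.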
pip install amazon-textract-haystack

Extract plain text from a document using DetectDocumentText:
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
converter = AmazonTextractConverter()
results = converter.run(sources=["document.png"])
documents = results["documents"]
print(documents[0].content)

Use AnalyzeDocument to detect tables and forms by setting feature_types:
converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(sources=["invoice.png"])
documents = results["documents"]
raw_responses = results["raw_textract_response"]

Valid feature_types values: "TABLES", "FORMS", "SIGNATURES", "LAYOUT".
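When FORMS is enabled, the raw responses can be post-processed to recover key-value pairs. The sketch below is illustrative only: `extract_form_fields` is a hypothetical helper, not part of the integration, and it assumes each raw response is an AnalyzeDocument response dict with a `Blocks` list, where KEY_VALUE_SET blocks link to their values and words through `Relationships`. The sample response is hand-written for the example, not real Textract output.

```python
from typing import Any


def _child_text(block: dict[str, Any], blocks_by_id: dict[str, dict[str, Any]]) -> str:
    """Join the text of the WORD blocks referenced by a block's CHILD relationships."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for block_id in relationship["Ids"]:
            child = blocks_by_id.get(block_id, {})
            if child.get("BlockType") == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_form_fields(response: dict[str, Any]) -> dict[str, str]:
    """Map each form key to its value text (hypothetical helper)."""
    blocks_by_id = {block["Id"]: block for block in response.get("Blocks", [])}
    fields = {}
    for block in blocks_by_id.values():
        is_key = block.get("BlockType") == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", [])
        if not is_key:
            continue
        value_parts = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "VALUE":
                continue
            for value_id in relationship["Ids"]:
                value_block = blocks_by_id.get(value_id)
                if value_block is not None:
                    value_parts.append(_child_text(value_block, blocks_by_id))
        fields[_child_text(block, blocks_by_id)] = " ".join(part for part in value_parts if part)
    return fields


# Hand-written sample in the assumed response shape.
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]}, {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}
form_fields = extract_form_fields(sample_response)
```

If `raw_responses` holds one response dict per source, as assumed here, the helper could be applied to each entry in turn.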
Ask questions about a document and get extracted answers. The QUERIES feature type
is enabled automatically when you pass the queries parameter at runtime:
converter = AmazonTextractConverter()
results = converter.run(
    sources=["medical_form.png"],
    queries=["What is the patient name?", "What is the date of birth?"],
)
documents = results["documents"]
raw_responses = results["raw_textract_response"]

Queries can be combined with feature_types for both structural and question-based extraction:
converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(
    sources=["invoice.png"],
    queries=["What is the total amount due?"],
)

Use the converter as a component in a Haystack pipeline:
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
pipeline = Pipeline()
pipeline.add_component("converter", AmazonTextractConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.connect("converter.documents", "cleaner.documents")
result = pipeline.run({"converter": {"sources": ["scan.png"]}})

The component uses the standard boto3 credential chain. You can configure credentials in any of these ways:
- Environment variables (default): Set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION.
- AWS credentials file: Configure via ~/.aws/credentials and ~/.aws/config.
- IAM role: When running on AWS infrastructure (EC2, Lambda, ECS).
- Explicit parameters:
from haystack.utils import Secret
converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"),
    aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"),
    aws_region_name=Secret.from_token("us-east-1"),
)

Unit tests (no AWS credentials needed):
cd integrations/amazon_textract
hatch run test:unit

Integration tests (require AWS credentials and a test image at tests/test_files/sample_text.png):
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
hatch run test:integration

Refer to the general Contribution Guidelines.