Commit 607de6d

Add Amazon Textract integration page (#484)
* Add Amazon Textract integration page
* Address comments
1 parent 83b8d10 commit 607de6d

1 file changed

Lines changed: 127 additions & 0 deletions

integrations/amazon-textract.md

@@ -0,0 +1,127 @@
---
layout: integration
name: Amazon Textract
description: Use Amazon Textract with Haystack to extract text, tables, forms, and answers to queries from documents
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
pypi: https://pypi.org/project/amazon-textract-haystack
repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_textract
type: Data Ingestion
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/aws.png
version: Haystack 2.0
toc: true
---

### **Table of Contents**
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)

## Overview

[`AmazonTextractConverter`](https://docs.haystack.deepset.ai/docs/amazontextractconverter) provides an integration of [Amazon Textract](https://aws.amazon.com/textract/) with Haystack.

This component uses Amazon Textract's synchronous API to convert images and single-page PDFs into Haystack `Document` objects using OCR. It supports plain text extraction, structural analysis for tables and forms, and natural-language queries on documents.

**Supported file formats**: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).

**Key features**:
- Plain text extraction with `DetectDocumentText`
- Table, form, signature, and layout detection with `AnalyzeDocument`
- Natural-language queries to extract specific answers from documents
- Access to the raw Textract response for downstream processing

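Because the synchronous API only accepts the formats and size listed above, it can help to check files before sending them. The following is a minimal sketch, not part of the integration: `check_source`, the extension mapping, and the reading of "10 MB" as 10 MiB are assumptions made for illustration, and the helper does not verify that a PDF has a single page.

```python
from pathlib import Path

# Extensions assumed to correspond to the formats listed above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
# "10 MB" is interpreted here as 10 MiB (assumption).
MAX_SIZE_BYTES = 10 * 1024 * 1024


def check_source(path: str) -> list[str]:
    """Return a list of problems found for `path`; an empty list means it looks OK."""
    problems = []
    file = Path(path)
    if file.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {file.suffix or '(none)'}")
    if not file.is_file():
        problems.append("file does not exist")
    elif file.stat().st_size > MAX_SIZE_BYTES:
        problems.append(f"file is larger than {MAX_SIZE_BYTES} bytes")
    return problems
```

Calling `check_source("document.png")` before `converter.run(...)` lets you skip or report unsuitable files early instead of waiting for an API error.
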
## Installation

Install the Amazon Textract integration:

```bash
pip install amazon-textract-haystack
```

## Usage

The component uses the standard boto3 credential chain. You can set AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) as environment variables, configure them via `~/.aws/credentials` and `~/.aws/config`, rely on an IAM role when running on AWS infrastructure, or pass them explicitly as [Secret](https://docs.haystack.deepset.ai/docs/secret-management) arguments.

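If you rely on the environment-variable option, a quick check like the one below can show which of the three variables are unset before you build the converter. This is a hypothetical helper (`missing_aws_env_vars` is not part of the integration), and a missing variable is not necessarily an error, since the credential chain may still find credentials in `~/.aws` or an IAM role.

```python
import os

# Variable names taken from the paragraph above.
AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_env_vars(environ=None) -> list[str]:
    """Return the names of the AWS variables that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in AWS_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env_vars()
    if missing:
        print("Not set in the environment:", ", ".join(missing))
    else:
        print("All three variables are set.")
```
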
The Textract API is selected automatically based on how you configure the component: `DetectDocumentText` is used by default for plain text extraction, while `AnalyzeDocument` is used whenever you set `feature_types` or pass `queries` at runtime.

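The selection rule can be pictured with a small stand-alone function. This is only a sketch of the behavior described on this page, not the component's actual code; `select_textract_api` is a hypothetical name.

```python
def select_textract_api(feature_types=None, queries=None):
    """Return (api_name, effective_feature_types) following the rule described above."""
    effective = list(feature_types or [])
    # Passing queries implies the QUERIES feature type.
    if queries and "QUERIES" not in effective:
        effective.append("QUERIES")
    if effective:
        return "AnalyzeDocument", effective
    # No feature types and no queries: plain text extraction.
    return "DetectDocumentText", []
```

For example, `select_textract_api()` yields `DetectDocumentText`, while passing either `feature_types=["TABLES"]` or a non-empty `queries` list yields `AnalyzeDocument`.
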
### Basic text extraction

Extract plain text from a document with the default configuration, which calls `DetectDocumentText`:

```python
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter()
results = converter.run(sources=["document.png"])
documents = results["documents"]

print(documents[0].content)
```

### Table and form analysis

Use `AnalyzeDocument` to detect tables and forms by setting `feature_types`:

```python
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(sources=["invoice.png"])

documents = results["documents"]
raw_responses = results["raw_textract_response"]
```

Valid `feature_types` values: `"TABLES"`, `"FORMS"`, `"SIGNATURES"`, `"LAYOUT"`.

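The raw response can be post-processed to rebuild tables. The sketch below assumes that each entry of `results["raw_textract_response"]` is an `AnalyzeDocument` response dictionary with a `Blocks` list, as returned by the Textract API; `tables_from_response` and `SAMPLE_RESPONSE` are hypothetical names, and the sample data is hand-written for illustration.

```python
def tables_from_response(response: dict) -> list[list[list[str]]]:
    """Rebuild each TABLE block in a Textract response as rows of cell text."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}

    def child_ids(block):
        # Collect the Ids of all CHILD relationships of a block.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    tables = []
    for block in blocks.values():
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks.get(cell_id)
            if cell is None or cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[i]["Text"]
                for i in child_ids(cell)
                if i in blocks and blocks[i]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)] for r in range(1, n_rows + 1)]
        )
    return tables


# Hand-written sample in the shape of a Textract response (illustrative only).
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w4", "w5"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Widget"},
        {"Id": "w4", "BlockType": "WORD", "Text": "9.99"},
        {"Id": "w5", "BlockType": "WORD", "Text": "USD"},
    ]
}

print(tables_from_response(SAMPLE_RESPONSE))
```

With real data you would call `tables_from_response(raw_responses[0])`, assuming the first source produced a response. Merged cells and selection elements are ignored in this sketch.
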
### Natural-language queries

Ask questions about a document and get extracted answers. The `QUERIES` feature type is enabled automatically when you pass the `queries` parameter at runtime:

```python
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter()
results = converter.run(
    sources=["medical_form.png"],
    queries=["What is the patient name?", "What is the date of birth?"],
)

documents = results["documents"]
raw_responses = results["raw_textract_response"]
```

Queries can be combined with `feature_types` for both structural and question-based extraction:

```python
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

# Structural analysis is configured at init time; queries are passed at runtime.
converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(
    sources=["invoice.png"],
    queries=["What is the total amount due?"],
)
```

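Query answers can also be read directly from the raw response. The sketch below assumes that each entry of `results["raw_textract_response"]` is an `AnalyzeDocument` response dictionary, in which `QUERY` blocks point to `QUERY_RESULT` blocks through `ANSWER` relationships; `answers_from_response` and `SAMPLE_QUERY_RESPONSE` are hypothetical names, and the sample data is hand-written for illustration.

```python
def answers_from_response(response: dict) -> dict[str, list[str]]:
    """Map each query text to the list of answer texts found in the response."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block["BlockType"] != "QUERY":
            continue
        found = answers.setdefault(block["Query"]["Text"], [])
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for result_id in rel["Ids"]:
                result = blocks.get(result_id)
                if result and result["BlockType"] == "QUERY_RESULT":
                    found.append(result["Text"])
    return answers


# Hand-written sample in the shape of a Textract response (illustrative only).
SAMPLE_QUERY_RESPONSE = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the patient name?"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["r1"]}]},
        {"Id": "r1", "BlockType": "QUERY_RESULT", "Text": "Jane Doe", "Confidence": 98.0},
        # A query with no ANSWER relationship produced no result.
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "What is the date of birth?"}},
    ]
}

print(answers_from_response(SAMPLE_QUERY_RESPONSE))
```

Queries without an answer map to an empty list, which makes missing fields easy to detect downstream.
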
### Explicit credentials

Pass credentials explicitly as `Secret` values instead of relying on the default credential chain:

```python
from haystack.utils import Secret
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"),
    aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"),
    aws_region_name=Secret.from_token("us-east-1"),
)
```

For more details on Amazon Textract capabilities and setup, refer to the [Amazon Textract documentation](https://docs.aws.amazon.com/textract/latest/dg/what-is.html).

### License

`amazon-textract-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
