| 1 | +--- |
| 2 | +layout: integration |
| 3 | +name: Amazon Textract |
| 4 | +description: Use Amazon Textract with Haystack to extract text, tables, forms, and answers to queries from documents |
| 5 | +authors: |
| 6 | +  - name: deepset |
| 7 | +    socials: |
| 8 | +      github: deepset-ai |
| 9 | +      twitter: deepset_ai |
| 10 | +      linkedin: https://www.linkedin.com/company/deepset-ai/ |
| 11 | +pypi: https://pypi.org/project/amazon-textract-haystack |
| 12 | +repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_textract |
| 13 | +type: Data Ingestion |
| 14 | +report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues |
| 15 | +logo: /logos/aws.png |
| 16 | +version: Haystack 2.0 |
| 17 | +toc: true |
| 18 | +--- |
| 19 | + |
| 20 | +### **Table of Contents** |
| 21 | +- [Overview](#overview) |
| 22 | +- [Installation](#installation) |
| 23 | +- [Usage](#usage) |
| 24 | + |
| 25 | +## Overview |
| 26 | + |
| 27 | +[`AmazonTextractConverter`](https://docs.haystack.deepset.ai/docs/amazontextractconverter) provides an integration of [Amazon Textract](https://aws.amazon.com/textract/) with Haystack. |
| 28 | + |
| 29 | +This component uses Amazon Textract's synchronous API to convert images and single-page PDFs into Haystack `Document` objects using OCR. It supports plain text extraction, structural analysis for tables and forms, and natural-language queries on documents. |
| 30 | + |
| 31 | +**Supported file formats**: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB). |
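
A pre-flight check can catch unsupported inputs before a Textract call is made. The sketch below is only an illustration: `check_source`, `SUPPORTED_SUFFIXES`, and `MAX_BYTES` are hypothetical names that are not part of the integration, the limits are copied from the list above, and "10 MB" is assumed to mean 10 × 1024 × 1024 bytes. It does not verify that a PDF has a single page.

```python
from pathlib import Path

# Assumption: values copied from the list above, not read from the component.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
MAX_BYTES = 10 * 1024 * 1024  # assumed interpretation of "10 MB"


def check_source(path):
    """Return a list of problems found for `path` (empty if none were found)."""
    path = Path(path)
    problems = []
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported extension: {path.suffix or '(none)'}")
    if not path.is_file():
        problems.append("file does not exist")
    elif path.stat().st_size > MAX_BYTES:
        problems.append(f"file is larger than {MAX_BYTES} bytes")
    return problems
```

Calling `check_source("document.png")` before `converter.run(...)` returns an empty list when the file looks acceptable under these assumptions.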
| 32 | + |
| 33 | +**Key features**: |
| 34 | +- Plain text extraction with `DetectDocumentText` |
| 35 | +- Table, form, signature, and layout detection with `AnalyzeDocument` |
| 36 | +- Natural-language queries to extract specific answers from documents |
| 37 | +- Access to the raw Textract response for downstream processing |
| 38 | + |
| 39 | +## Installation |
| 40 | + |
| 41 | +Install the Amazon Textract integration: |
| 42 | + |
| 43 | +```bash |
| 44 | +pip install amazon-textract-haystack |
| 45 | +``` |
| 46 | + |
| 47 | +## Usage |
| 48 | + |
| 49 | +The component uses the standard boto3 credential chain. You can set AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) as environment variables, configure them via `~/.aws/credentials` and `~/.aws/config`, rely on an IAM role when running on AWS infrastructure, or pass them explicitly as [Secret](https://docs.haystack.deepset.ai/docs/secret-management) arguments. |
| 50 | + |
| 51 | +The Textract API is selected automatically based on how you configure the component: `DetectDocumentText` is used by default for plain text extraction, while `AnalyzeDocument` is used whenever you set `feature_types` or pass `queries` at runtime. |
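
The selection rule described above can be sketched as a small pure function. This is an illustration of the documented behavior, not the component's actual code: `select_textract_api` is a hypothetical helper, and it assumes `QUERIES` is appended to the feature list whenever queries are passed.

```python
def select_textract_api(feature_types=None, queries=None):
    """Return (operation name, feature types to send) for a configuration."""
    features = list(feature_types or [])
    # Assumption: passing queries implicitly enables the QUERIES feature type.
    if queries and "QUERIES" not in features:
        features.append("QUERIES")
    # Any feature type at all means the structural AnalyzeDocument API is needed.
    if features:
        return "AnalyzeDocument", features
    return "DetectDocumentText", []
```

With no arguments the function returns `("DetectDocumentText", [])`; with `queries=["..."]` alone it returns `("AnalyzeDocument", ["QUERIES"])`.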
| 52 | + |
| 53 | +### Basic text extraction |
| 54 | + |
| 55 | +Extract plain text from a document with the default configuration, which calls `DetectDocumentText`: |
| 56 | + |
| 57 | +```python |
| 58 | +from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter |
| 59 | + |
| 60 | +converter = AmazonTextractConverter() |
| 61 | +results = converter.run(sources=["document.png"]) |
| 62 | +documents = results["documents"] |
| 63 | + |
| 64 | +print(documents[0].content) |
| 65 | +``` |
| 66 | + |
| 67 | +### Table and form analysis |
| 68 | + |
| 69 | +Use `AnalyzeDocument` to detect tables and forms by setting `feature_types`: |
| 70 | + |
| 71 | +```python |
| 72 | +from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter |
| 73 | + |
| 74 | +converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"]) |
| 75 | +results = converter.run(sources=["invoice.png"]) |
| 76 | + |
| 77 | +documents = results["documents"] |
| 78 | +raw_responses = results["raw_textract_response"] |
| 79 | +``` |
| 80 | + |
| 81 | +Valid `feature_types` values: `"TABLES"`, `"FORMS"`, `"SIGNATURES"`, `"LAYOUT"`. |
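
When `TABLES` is enabled, the raw response can be turned back into rows and columns. The sketch below assumes each entry of `raw_textract_response` is a Textract response dictionary in the usual boto3 shape (a `Blocks` list with `TABLE`, `CELL`, and `WORD` blocks linked by `CHILD` relationships). `tables_from_response` is a hypothetical helper, not part of the integration, and it ignores merged cells and selection elements.

```python
def tables_from_response(response):
    """Rebuild each TABLE block in a Textract response as a list of rows."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}

    def children(block, block_type):
        # Follow CHILD relationships and keep only blocks of the wanted type.
        found = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = blocks.get(child_id)
                if child and child["BlockType"] == block_type:
                    found.append(child)
        return found

    tables = []
    for block in blocks.values():
        if block["BlockType"] != "TABLE":
            continue
        cells = children(block, "CELL")
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            text = " ".join(word["Text"] for word in children(cell, "WORD"))
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = text
        tables.append(grid)
    return tables
```

Under that assumption, `tables_from_response(raw_responses[0])` returns one grid per detected table, with empty strings for empty cells.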
| 82 | + |
| 83 | +### Natural-language queries |
| 84 | + |
| 85 | +Ask questions about a document and get extracted answers. The `QUERIES` feature type is enabled automatically when you pass the `queries` parameter at runtime: |
| 86 | + |
| 87 | +```python |
| 88 | +from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter |
| 89 | + |
| 90 | +converter = AmazonTextractConverter() |
| 91 | +results = converter.run( |
| 92 | +    sources=["medical_form.png"], |
| 93 | +    queries=["What is the patient name?", "What is the date of birth?"], |
| 94 | +) |
| 95 | + |
| 96 | +documents = results["documents"] |
| 97 | +raw_responses = results["raw_textract_response"] |
| 98 | +``` |
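
To pair each question with its answer, you can walk the `QUERY` and `QUERY_RESULT` blocks of the raw response. The sketch below assumes each entry of `raw_textract_response` is a Textract response dictionary in the usual boto3 shape, where a `QUERY` block points to its results through an `ANSWER` relationship. `answers_from_response` is a hypothetical helper, not part of the integration, and it keeps only the first answer per query.

```python
def answers_from_response(response):
    """Map each query text to its first answer text, or None if unanswered."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block["BlockType"] != "QUERY":
            continue
        # Collect the ids referenced by ANSWER relationships of this query.
        answer_ids = [
            answer_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for answer_id in rel["Ids"]
        ]
        results = [
            blocks[i]
            for i in answer_ids
            if i in blocks and blocks[i]["BlockType"] == "QUERY_RESULT"
        ]
        answers[block["Query"]["Text"]] = results[0].get("Text") if results else None
    return answers
```

Under that assumption, `answers_from_response(raw_responses[0])` returns a dictionary keyed by the question strings passed in `queries`.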
| 99 | + |
| 100 | +Queries can be combined with `feature_types` for both structural and question-based extraction: |
| 101 | + |
| 102 | +```python |
| 103 | +converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"]) |
| 104 | +results = converter.run( |
| 105 | +    sources=["invoice.png"], |
| 106 | +    queries=["What is the total amount due?"], |
| 107 | +) |
| 108 | +``` |
| 109 | + |
| 110 | +### Explicit credentials |
| 111 | + |
| 112 | +```python |
| 113 | +from haystack.utils import Secret |
| 114 | +from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter |
| 115 | + |
| 116 | +converter = AmazonTextractConverter( |
| 117 | +    aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"), |
| 118 | +    aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"), |
| 119 | +    aws_region_name=Secret.from_token("us-east-1"), |
| 120 | +) |
| 121 | +``` |
| 122 | + |
| 123 | +For more details on Amazon Textract capabilities and setup, refer to the [Amazon Textract documentation](https://docs.aws.amazon.com/textract/latest/dg/what-is.html). |
| 124 | + |
| 125 | +## License |
| 126 | + |
| 127 | +`amazon-textract-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license. |