Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
This module provides document classification capabilities for the IDP Accelerator project, allowing classification of documents based on their text and image content. It supports multiple classification backends including Bedrock LLMs and SageMaker UDOP models.
- Classification of documents using multiple backend options:
- Amazon Bedrock LLMs
- SageMaker UDOP models
- Optional regex-based classification for enhanced performance
- Document name regex matching when all pages should be classified as the same class
- Page content regex matching for multi-modal page-level classification
- Direct integration with the Document data model
- Support for both text and image content
- Concurrent processing of multiple pages
- Structured data models for results
- Grouping of pages into sections by classification
- Comprehensive error handling and retry mechanisms
- DynamoDB caching for resilient page-level classification
- Sequence segmentation using BIO-like approach for document boundary detection
The multimodal page-level classification method implements a sequence segmentation approach similar to BIO (Begin-Inside-Outside) tagging commonly used in NLP. This enables accurate segmentation of multi-document packets where a single file may contain multiple distinct documents.
Each page receives two pieces of information:
- Document Type: The classification label (e.g., "invoice", "letter", "financial_statement")
- Document Boundary: A boundary indicator that signals document transitions:
"start": Indicates the beginning of a new document (similar to "Begin" in BIO)"continue": Indicates continuation of the current document (similar to "Inside" in BIO)
- Multi-Document Packet Support: Accurately segments packets containing multiple documents
- Type-Aware Boundaries: Detects when a new document of the same type begins
- Automatic Section Creation: Pages are grouped into sections based on both type and boundaries
- Improved Accuracy: Context-aware classification that considers document flow
Consider a packet with 6 pages containing two invoices and one letter:
Page 1: type="invoice", boundary="start" → Section 1 (Invoice #1)
Page 2: type="invoice", boundary="continue" → Section 1 (Invoice #1)
Page 3: type="letter", boundary="start" → Section 2 (Letter)
Page 4: type="letter", boundary="continue" → Section 2 (Letter)
Page 5: type="invoice", boundary="start" → Section 3 (Invoice #2)
Page 6: type="invoice", boundary="continue" → Section 3 (Invoice #2)
The system automatically creates three sections, properly separating the two invoices despite them having the same document type.
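The grouping logic can be pictured with a short sketch. This is illustrative only (the service performs the grouping internally during classification), and the variable names are hypothetical:

```python
# Illustrative sketch: group per-page (type, boundary) results into sections,
# reproducing the 6-page example above.
page_results = [
    ("1", "invoice", "start"),
    ("2", "invoice", "continue"),
    ("3", "letter", "start"),
    ("4", "letter", "continue"),
    ("5", "invoice", "start"),
    ("6", "invoice", "continue"),
]

sections = []  # each section: {"classification": ..., "page_ids": [...]}
for page_id, doc_type, boundary in page_results:
    # Start a new section on a "start" boundary, on a type change,
    # or when no section is open yet.
    new_section = (
        boundary == "start"
        or not sections
        or sections[-1]["classification"] != doc_type
    )
    if new_section:
        sections.append({"classification": doc_type, "page_ids": []})
    sections[-1]["page_ids"].append(page_id)

for i, section in enumerate(sections, start=1):
    print(f"Section {i}: {section['classification']} -> pages {section['page_ids']}")
# Section 1: invoice -> pages ['1', '2']
# Section 2: letter -> pages ['3', '4']
# Section 3: invoice -> pages ['5', '6']
```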
The classification service now supports optional regex-based pattern matching to provide significant performance improvements and deterministic classification for known document patterns. This feature enables instant classification without LLM API calls when regex patterns match.
When you want all pages of a document to be classified the same way, document name regex patterns can instantly classify entire documents based on their filename or ID:
classes:
- name: Payslip
description: "Employee wage statement showing earnings and deductions"
document_name_regex: "(?i).*(payslip|paystub|salary|wage).*"
attributes:
- name: EmployeeName
description: "Name of the employee"
attributeType: simple
How it works:
- Works with any number of document classes defined in configuration
- When document ID matches the regex pattern, all pages are classified as that class
- Skips all LLM processing for massive performance gains
- Provides info-level logging when matches occur
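As a rough sketch of the behavior just described (not the service's actual code; the helper name is hypothetical), document name matching amounts to testing the document ID against each class's document_name_regex before any LLM call:

```python
# Illustrative sketch (hypothetical helper): match a document's ID/filename
# against each class's document_name_regex before any LLM call.
import re

classes = [
    {"name": "Payslip", "document_name_regex": r"(?i).*(payslip|paystub|salary|wage).*"},
]

def match_document_name(document_id, classes):
    """Return the first class whose document_name_regex matches the document ID, else None."""
    for cls in classes:
        pattern = cls.get("document_name_regex")
        if pattern and re.search(pattern, document_id):
            return cls["name"]
    return None

print(match_document_name("2024-03_payslip_jdoe.pdf", classes))  # -> Payslip (no LLM call needed)
print(match_document_name("contract_2024.pdf", classes))         # -> None (falls back to LLM)
```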
For multi-modal page-level classification, page content regex patterns can classify individual pages based on text content:
classes:
- name: Invoice
description: "Business invoice document"
document_page_content_regex: "(?i)(invoice\\s+number|bill\\s+to|amount\\s+due)"
attributes:
- name: InvoiceNumber
description: "Invoice number"
attributeType: simple
- name: Payslip
description: "Employee wage statement"
document_page_content_regex: "(?i)(gross\\s+pay|net\\s+pay|employee\\s+id)"
attributes:
- name: EmployeeName
description: "Employee name"
attributeType: simple
How it works:
- Only applies to multi-modal page-level classification method
- Each page's text content is checked against all class regex patterns
- First matching pattern wins and classifies the page instantly
- Falls back to LLM classification when no patterns match
- Provides info-level logging when matches occur
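A minimal sketch of the first-match-wins behavior, assuming the class list shape shown above (the helper is hypothetical; the real service applies this during page-level classification):

```python
# Illustrative sketch (hypothetical helper): check a page's text against each
# class's document_page_content_regex; the first matching pattern wins.
import re

classes = [
    {"name": "Invoice", "document_page_content_regex": r"(?i)(invoice\s+number|bill\s+to|amount\s+due)"},
    {"name": "Payslip", "document_page_content_regex": r"(?i)(gross\s+pay|net\s+pay|employee\s+id)"},
]

def classify_page_by_regex(page_text, classes):
    for cls in classes:
        pattern = cls.get("document_page_content_regex")
        if pattern and re.search(pattern, page_text):
            return cls["name"]  # first matching pattern wins
    return None                 # no match -> fall back to LLM classification

print(classify_page_by_regex("Invoice Number: INV-42\nBill To: ACME Corp", classes))  # -> Invoice
```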
Both regex types are optional and can be used together:
classes:
- name: W2-Form
description: "W2 tax form with wage and tax information"
# Both regex types can be specified
document_name_regex: "(?i).*w-?2.*" # For single-class scenarios
document_page_content_regex: "(?i)(form\\s+w-?2|wage\\s+and\\s+tax)" # For page-level
attributes:
- name: EmployerEIN
description: "Employer identification number"
attributeType: simple
Speed Improvements:
- Regex matching is nearly instantaneous compared to LLM calls
- Document name regex: ~100-1000x faster (entire document classified instantly)
- Page content regex: ~10-50x faster per matched page
Cost Savings:
- Zero token usage for regex-matched classifications
- No Bedrock/SageMaker API calls for matched patterns
- Significant cost reduction for documents with recognizable patterns
Deterministic Results:
- Consistent classification results for pattern-matched documents
- Eliminates LLM variability for known document types
- Reliable classification for high-volume processing scenarios
- Case-Insensitive Matching: Use the (?i) flag for robust matching
  (?i).*(invoice|bill).*  # Matches "Invoice", "INVOICE", "bill", "BILL"
- Flexible Whitespace: Use \\s+ for varying whitespace
  (?i)(gross\\s+pay|net\\s+pay)  # Matches "gross pay", "gross  pay", "GROSS PAY"
- Multiple Alternatives: Use | for different possible terms
  (?i).*(payslip|paystub|salary|wage).*  # Matches any of these terms
- Specific Enough: Balance specificity to avoid false matches
  # Good: Specific to payslips
  (?i)(gross\\s+pay|employee\\s+id|pay\\s+period)
  # Too broad: Could match many document types
  (?i)(pay|id|period)
The regex system includes comprehensive error handling:
- Compilation Errors: Invalid regex patterns are logged and ignored, fallback to LLM
- Runtime Errors: Regex matching failures fallback to standard classification
- Graceful Degradation: System continues to work normally even with invalid patterns
- Detailed Logging: Debug and error logs help with pattern troubleshooting
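The compile-and-skip behavior can be sketched as follows (illustrative only; the service's actual logging and fallback paths may differ):

```python
# Illustrative sketch of the graceful-degradation behavior: invalid patterns
# are logged and skipped so classification falls back to the LLM path.
import logging
import re

logger = logging.getLogger(__name__)

def compile_pattern(pattern):
    """Compile a configured regex, returning None (and logging) if it is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        logger.error("Ignoring invalid regex %r (falling back to LLM): %s", pattern, exc)
        return None

compiled = compile_pattern(r"(?i)(invoice\s+number")  # unbalanced parenthesis
if compiled is None:
    # Proceed with standard LLM-based classification instead of failing.
    pass
```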
from idp_common import classification, get_config
from idp_common.models import Document
# Load configuration with regex patterns
config = get_config()
# Initialize service - regex patterns are automatically used
service = classification.ClassificationService(
region="us-east-1",
config=config,
backend="bedrock"
)
# Classification automatically uses regex when patterns match
document = service.classify_document(document)
# Check if regex was used
for page_id, page in document.pages.items():
metadata = getattr(page, 'metadata', {})
if metadata.get('regex_matched', False):
print(f"Page {page_id} was classified using regex patterns")
else:
print(f"Page {page_id} was classified using LLM")See notebooks/examples/step2_classification_with_regex.ipynb for interactive demonstrations of:
- Document name regex classification
- Page content regex classification
- Performance comparisons between regex and LLM methods
- Configuration examples and best practices
- Error handling scenarios
from idp_common import classification, get_config
from idp_common.models import Document
# Load configuration
config = get_config()
# Initialize classification service with Bedrock backend
service = classification.ClassificationService(
region="us-east-1",
config=config,
backend="bedrock" # This is the default
)
# Create or get a Document object
document = Document(
id="doc-123",
input_bucket="input-bucket",
input_key="document.pdf",
output_bucket="output-bucket",
pages={
"1": {
"page_id": "1",
"parsed_text_uri": "s3://bucket/document/pages/1/result.json",
"image_uri": "s3://bucket/document/pages/1/image.jpg",
"raw_text_uri": "s3://bucket/document/pages/1/rawText.json"
}
}
)
# Classify the document - updates the Document object directly
document = service.classify_document(document)
# Document now contains classification results
print(f"Document has {len(document.sections)} sections")
for section in document.sections:
print(f"Section {section.section_id}: {section.classification}")
print(f"Pages: {section.page_ids}")from idp_common import classification, get_config
from idp_common.models import Document
# Load configuration and add SageMaker endpoint
config = get_config()
config["sagemaker_endpoint_name"] = "udop-classification-endpoint"
# Initialize classification service with SageMaker backend
service = classification.ClassificationService(
region="us-east-1",
config=config,
backend="sagemaker"
)
# Create or get a Document object
document = Document(
id="doc-123",
input_bucket="input-bucket",
input_key="document.pdf",
output_bucket="output-bucket",
pages={
"1": {
"page_id": "1",
"parsed_text_uri": "s3://bucket/document/pages/1/result.json",
"image_uri": "s3://bucket/document/pages/1/image.jpg",
"raw_text_uri": "s3://bucket/document/pages/1/rawText.json"
}
}
)
# Classify the document using SageMaker
document = service.classify_document(document)
# Access classification results from the Document
print(f"Document status: {document.status}")
for page_id, page in document.pages.items():
print(f"Page {page_id} classified as: {page.classification}")# Classify a single page
page_result = service.classify_page(
page_id="1",
text_uri="s3://bucket/document/pages/1/result.json",
image_uri="s3://bucket/document/pages/1/image.jpg"
)
# Print the classification result
print(f"Page classified as: {page_result.classification.doc_type}")
# Classify multiple pages concurrently
pages = {
"1": {"parsedTextUri": "s3://bucket/document/pages/1/result.json", "imageUri": "s3://bucket/document/pages/1/image.jpg"},
"2": {"parsedTextUri": "s3://bucket/document/pages/2/result.json", "imageUri": "s3://bucket/document/pages/2/image.jpg"}
}
result = service.classify_pages(pages)
# Print the sections
for section in result.sections:
print(f"Section {section.section_id}: {section.classification.doc_type}")
for page in section.pages:
print(f" - Page {page.page_id}")
# Convert to dictionary for API response
api_response = result.to_dict()
The classification service uses the following configuration structure:
{
"model_id": "anthropic.claude-3-sonnet-20240229-v1:0", // Top-level model_id for Bedrock (optional)
"sagemaker_endpoint_name": "udop-classification-endpoint", // SageMaker endpoint name (optional)
"classes": [
{
"name": "invoice",
"description": "An invoice that specifies an amount of money to be paid."
},
{
"name": "financial_statement",
"description": "Documents that summarize financial performance, such as income statements, balance sheets, or cash flow statements."
}
],
"classification": {
"model": "anthropic.claude-3-sonnet-20240229-v1:0", // Specific model for classification (used if top-level model_id not specified)
"temperature": 0,
"top_k": 5,
"system_prompt": "You are a document classification expert...",
"task_prompt": "Classify the following document into one of these types: {CLASS_NAMES_AND_DESCRIPTIONS}...\n\nDocument text:\n{DOCUMENT_TEXT}"
}
}
from idp_common import classification, get_config
from idp_common.models import Document, Status
def handler(event, context):
# Extract document from event
document = Document.from_dict(event["OCRResult"]["document"])
# Initialize classification service
config = get_config()
service = classification.ClassificationService(config=config)
# Classify document
document = service.classify_document(document)
# Return response
return {
"document": document.to_dict()
}
from idp_common import classification, get_config
from idp_common.models import Document, Status
import os
def handler(event, context):
# Extract document from event
document = Document.from_dict(event["OCRResult"]["document"])
# Configure SageMaker endpoint
config = get_config() or {}
config["sagemaker_endpoint_name"] = os.environ["SAGEMAKER_ENDPOINT_NAME"]
# Initialize classification service with SageMaker backend
service = classification.ClassificationService(
config=config,
backend="sagemaker"
)
# Classify document using SageMaker
document = service.classify_document(document)
# Return response
return {
"document": document.to_dict()
}
- DocumentType: Definition of a document type with name and description
- DocumentClassification: Classification result with document type and confidence
- PageClassification: Classification result for a single page
- DocumentSection: A section of consecutive pages with the same classification
- ClassificationResult: Overall result of a classification operation
- Document: Core document data model used throughout the IDP pipeline
The classification service now supports optional DynamoDB caching to improve efficiency and resilience when processing documents with multiple pages. This feature addresses throttling scenarios where some pages succeed while others fail, avoiding the need to reclassify already successful pages on retry.
- Cache Check: Before processing, the service checks for cached classification results for the document
- Selective Processing: Only pages without cached results are classified
- Exception-Safe Caching: Successful page results are cached even when other pages fail
- Retry Efficiency: Subsequent retries only process previously failed pages
from idp_common import classification, get_config
config = get_config()
service = classification.ClassificationService(
region="us-east-1",
config=config,
backend="bedrock",
cache_table="classification-cache-table" # Enable caching
)
export CLASSIFICATION_CACHE_TABLE=classification-cache-table
# Cache table will be automatically detected from environment
service = classification.ClassificationService(
region="us-east-1",
config=config,
backend="bedrock"
)
The cache uses the following DynamoDB table structure:
- Primary Key (PK): classcache#{document_id}#{workflow_execution_arn}
- Sort Key (SK): none
- Attributes:
  - page_classifications (String): JSON-encoded successful page results
  - cached_at (String): Unix timestamp of cache creation
  - document_id (String): Document identifier
  - workflow_execution_arn (String): Workflow execution ARN
  - ExpiresAfter (Number): TTL attribute for automatic cleanup (24 hours)
{
"PK": "classcache#doc-123#arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
"SK": "none",
"page_classifications": "{\"1\":{\"doc_type\":\"invoice\",\"confidence\":1.0,\"metadata\":{\"metering\":{...}},\"image_uri\":\"s3://...\",\"text_uri\":\"s3://...\",\"raw_text_uri\":\"s3://...\"},\"2\":{...}}",
"cached_at": "1672531200",
"document_id": "doc-123",
"workflow_execution_arn": "arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
"ExpiresAfter": 1672617600
}
- Cost Reduction: Avoids redundant API calls to Bedrock/SageMaker for already-classified pages
- Improved Resilience: Handles partial failures gracefully during concurrent processing
- Faster Retries: Subsequent attempts only process failed pages, not the entire document
- Automatic Cleanup: TTL ensures cache entries don't accumulate indefinitely
- Thread Safety: Safe for concurrent page processing within the same document
from idp_common import classification, get_config
from idp_common.models import Document
config = get_config()
service = classification.ClassificationService(
region="us-east-1",
config=config,
backend="bedrock",
cache_table="classification-cache-table"
)
# Create document with 5 pages
document = Document(
id="doc-123",
workflow_execution_arn="arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
pages={
"1": {...},
"2": {...},
"3": {...},
"4": {...},
"5": {...}
}
)
try:
# First attempt: pages 1,2,4 succeed, pages 3,5 fail due to throttling
document = service.classify_document(document)
except Exception as e:
# Pages 1,2,4 are cached automatically before exception is raised
print(f"Classification failed: {e}")
try:
# Retry: only pages 3,5 are processed (1,2,4 loaded from cache)
document = service.classify_document(document)
print("Document classified successfully on retry")
except Exception as e:
print(f"Retry failed: {e}")- Creation: Cache entries are created when
classify_document()completes successfully or encounters exceptions - Retrieval: Cache is checked at the start of each
classify_document()call - Update: Cache entries are updated with new successful results from each processing attempt
- Expiration: Entries automatically expire after 24 hours via DynamoDB TTL
- Caching only applies to the
classify_document()method, not individualclassify_page()calls - Cache entries are scoped to specific document and workflow execution combinations
- Only successful page classifications (without errors in metadata) are cached
- The cache is transparent - existing code continues to work without modifications
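For reference, here is a hedged sketch of reading and writing a cache item with the structure and lifecycle described above using boto3 directly; the service manages this automatically when cache_table is set, so this is illustrative rather than required:

```python
# Hedged sketch: reading/writing a cache item with the documented layout using
# boto3 directly. The service does this for you when cache_table is set.
import json
import time
import boto3

document_id = "doc-123"
workflow_execution_arn = "arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123"
table = boto3.resource("dynamodb").Table("classification-cache-table")
key = {"PK": f"classcache#{document_id}#{workflow_execution_arn}", "SK": "none"}

# Load any previously cached page results
item = table.get_item(Key=key).get("Item")
cached_pages = json.loads(item["page_classifications"]) if item else {}

# Merge in newly successful pages and write back with a 24-hour TTL
cached_pages["1"] = {"doc_type": "invoice", "confidence": 1.0}
now = int(time.time())
table.put_item(Item={
    **key,
    "page_classifications": json.dumps(cached_pages),
    "cached_at": str(now),
    "document_id": document_id,
    "workflow_execution_arn": workflow_execution_arn,
    "ExpiresAfter": now + 24 * 3600,
})
```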
The Bedrock backend uses Amazon Bedrock LLMs to classify documents:
- Supports multiple model options (Claude, Titan, etc.)
- Works with both text and image content
- Uses natural language understanding for classification
- Configurable system prompts and parameters
The SageMaker backend uses custom UDOP (Unified Document Processing) models:
- Uses vision-language models specifically trained for document understanding
- Requires both image and raw text URIs to be available
- Better performance for document-specific classification tasks
- Requires a deployed SageMaker endpoint
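For orientation, a hedged sketch of invoking a deployed endpoint directly with boto3; the request and response payload shape depends on your UDOP inference handler and is assumed here, since the service normally builds the real payload for you:

```python
# Hedged sketch: invoking a deployed UDOP endpoint directly with boto3.
# The payload/response format below is assumed for illustration only.
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
payload = {
    "image_uri": "s3://bucket/document/pages/1/image.jpg",        # assumed field name
    "raw_text_uri": "s3://bucket/document/pages/1/rawText.json",  # assumed field name
}
response = runtime.invoke_endpoint(
    EndpointName="udop-classification-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))  # e.g., a predicted class label
```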
The classification service supports few shot learning through example-based prompting. This feature allows you to provide concrete examples of documents with their expected classifications and attribute extractions, significantly improving model accuracy and consistency.
Few shot examples work by including reference documents with known classifications and expected attribute values in the prompts sent to the AI model. This helps the model understand the expected format and accuracy requirements for your specific use case.
Few shot examples are configured in the document class definitions within your configuration file:
classes:
- name: letter
description: "A formal written correspondence..."
attributes:
- name: sender_name
description: "The name of the person who wrote the letter..."
# ... other attributes
examples:
- classPrompt: "This is an example of the class 'letter'"
name: "Letter1"
attributesPrompt: |
expected attributes are:
"sender_name": "Will E. Clark",
"sender_address": "206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056",
"recipient_name": "The Honorable Wendell H. Ford",
# ... other expected attributes
imagePath: "config_library/pattern-2/few_shot_example/example-images/letter1.jpg"
- classPrompt: "This is an example of the class 'letter'"
name: "Letter2"
attributesPrompt: |
expected attributes are:
"sender_name": "William H. W. Anderson",
# ... other expected attributes
imagePath: "config_library/pattern-2/few_shot_example/example-images/letter2.png"Each few shot example includes:
- classPrompt: A description identifying this as an example of the document class
- name: A unique identifier for the example (for reference and debugging)
- attributesPrompt: The expected attribute extraction results in a structured format
- imagePath: Path to example document image(s) - supports single files, local directories, or S3 prefixes
The imagePath field now supports multiple formats for maximum flexibility:
Single Image File (Original functionality):
imagePath: "config_library/pattern-2/few_shot_example/example-images/letter1.jpg"Local Directory with Multiple Images (New):
imagePath: "config_library/pattern-2/few_shot_example/example-images/"S3 Prefix with Multiple Images (New):
imagePath: "s3://my-config-bucket/few-shot-examples/letter/"Direct S3 Image URI:
imagePath: "s3://my-config-bucket/few-shot-examples/letter/example1.jpg"When pointing to a directory or S3 prefix, the system automatically:
- Discovers all image files with supported extensions (.jpg, .jpeg, .png, .gif, .bmp, .tiff, .tif, .webp)
- Sorts them alphabetically by filename for consistent ordering
- Includes each image as a separate content item in the few-shot examples
- Gracefully handles individual image loading failures without breaking the entire process
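A minimal sketch of that discovery behavior (illustrative; the actual loader lives inside the service):

```python
# Illustrative sketch of the discovery behavior: collect supported images,
# sort by filename, and skip individual files that fail to load.
from pathlib import Path

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif", ".webp"}

def discover_example_images(directory):
    paths = sorted(
        p for p in Path(directory).iterdir()
        if p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
    images = []
    for path in paths:
        try:
            images.append(path.read_bytes())  # each image becomes its own content item
        except OSError as exc:
            print(f"Skipping {path}: {exc}")  # one bad file does not break the example set
    return images

examples = discover_example_images("config_library/pattern-2/few_shot_example/example-images/")
```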
The system uses these environment variables for resolving relative paths:
- CONFIGURATION_BUCKET: S3 bucket name for configuration files
  - Used when imagePath doesn't start with s3://
  - The path is treated as a key within this bucket
- ROOT_DIR: Root directory for local file resolution
  - Used when CONFIGURATION_BUCKET is not set
  - The path is treated as relative to this directory
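The resolution order can be sketched as follows (illustrative; the helper name is hypothetical):

```python
# Illustrative sketch of the resolution order described above.
import os

def resolve_image_path(image_path):
    if image_path.startswith("s3://"):
        return image_path                                   # already a full S3 URI
    bucket = os.environ.get("CONFIGURATION_BUCKET")
    if bucket:
        return f"s3://{bucket}/{image_path}"                # key within the configuration bucket
    root = os.environ.get("ROOT_DIR", ".")
    return os.path.join(root, image_path)                   # relative to the local root directory

print(resolve_image_path("config_library/pattern-2/few_shot_example/example-images/letter1.jpg"))
```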
Using few shot examples provides several advantages:
- Improved Accuracy: Models perform better when given concrete examples
- Consistent Formatting: Examples help ensure consistent output structure
- Domain Adaptation: Examples help models understand domain-specific terminology
- Reduced Hallucination: Examples reduce the likelihood of made-up data
- Better Edge Case Handling: Examples can demonstrate how to handle unusual cases
When creating few shot examples:
- Use 1-3 high-quality examples per document class
- Ensure examples are representative of real-world documents
- Include diverse examples that cover different variations
# Good example - specific and complete
attributesPrompt: |
expected attributes are:
"invoice_number": "INV-2024-001",
"invoice_date": "01/15/2024",
"vendor_name": "ACME Corp",
"total_amount": "$1,250.00"
# Avoid incomplete examples
attributesPrompt: |
expected attributes are:
"invoice_number": "INV-2024-001"
# Missing other important attributes
attributesPrompt: |
expected attributes are:
"sender_name": "John Smith",
"cc": null, # Explicitly show when fields are not present
"reference_number": null- Choose examples that represent typical documents in your use case
- Include examples with both common and edge case scenarios
- Ensure image quality is good and text is clearly readable
The few shot examples are automatically integrated when using the classification service:
from idp_common import classification, get_config
from idp_common.models import Document
# Load configuration with few shot examples
config = get_config()
# Initialize service - few shot examples are automatically used
service = classification.ClassificationService(
region="us-east-1",
config=config
)
# Examples are automatically included in prompts during classification
document = service.classify_document(document)
The service automatically:
- Loads few shot examples from the configuration
- Includes them in classification prompts using the {FEW_SHOT_EXAMPLES} placeholder
- Formats examples appropriately for both classification and extraction tasks
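Conceptually, the prompt assembly looks like this sketch (illustrative; the exact text the service renders for {FEW_SHOT_EXAMPLES} is assumed here):

```python
# Illustrative sketch of placeholder substitution into the configured task_prompt.
task_prompt = (
    "Classify this document into exactly one of these categories:\n"
    "{CLASS_NAMES_AND_DESCRIPTIONS}\n\n"
    "<few_shot_examples>\n{FEW_SHOT_EXAMPLES}\n</few_shot_examples>\n\n"
    "<document_ocr_data>\n{DOCUMENT_TEXT}\n</document_ocr_data>"
)

few_shot_examples = (
    "This is an example of the class 'email'\n"
    "expected attributes are:\n"
    '  "from_address": "john.doe@company.com", ...'
)

prompt = task_prompt.format(
    CLASS_NAMES_AND_DESCRIPTIONS="email: A digital message with email headers...",
    FEW_SHOT_EXAMPLES=few_shot_examples,
    DOCUMENT_TEXT="From: john.doe@company.com\nSubject: FW: Meeting Notes 4/20 ...",
)
print(prompt)
```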
Here's a complete example showing how few shot examples integrate with document class definitions:
classes:
- name: email
description: "A digital message with email headers..."
attributes:
- name: from_address
description: "The email address of the sender..."
- name: to_address
description: "The email address of the primary recipient..."
- name: subject
description: "The topic of the email..."
- name: date_sent
description: "The date and time when the email was sent..."
examples:
- classPrompt: "This is an example of the class 'email'"
name: "Email1"
attributesPrompt: |
expected attributes are:
"from_address": "john.doe@company.com",
"to_address": "jane.smith@client.com",
"subject": "FW: Meeting Notes 4/20",
"date_sent": "04/18/2024"
imagePath: "config_library/pattern-2/few_shot_example/example-images/email1.jpg"
classification:
task_prompt: |
Classify this document into exactly one of these categories:
{CLASS_NAMES_AND_DESCRIPTIONS}
<few_shot_examples>
{FEW_SHOT_EXAMPLES}
</few_shot_examples>
<document_ocr_data>
{DOCUMENT_TEXT}
</document_ocr_data>
Common issues and solutions:
- Images Not Found: Ensure image paths are correct and files exist
- Inconsistent Results: Review example quality and ensure they're representative
- Poor Performance: Consider adding more diverse examples or improving example quality
- Format Errors: Ensure attributesPrompt follows exact JSON-like format expected by your prompts
- ✅ Support for SageMaker UDOP models
- ✅ Direct integration with Document data model
- ✅ Improved error handling and retry mechanisms
- ✅ Few shot example support for improved accuracy
- 🔲 Better confidence score estimation
- 🔲 More advanced document structure analysis
- 🔲 Support for additional classification backends (custom models)
- 🔲 Multi-model classification for improved accuracy
- 🔲 Dynamic few shot example selection based on document similarity