Commit 385b9a4

Merge branch 'develop' v0.5.0
2 parents 03f9122 + 1a4e3e8 commit 385b9a4

250 files changed

Lines changed: 22142 additions & 24024 deletions

.clinerules

Lines changed: 72 additions & 0 deletions
@@ -186,3 +186,75 @@ The format is flexible - focus on capturing valuable insights that help me work
 REMEMBER: After every memory reset, I begin completely fresh. The Memory Bank is my only link to previous work. It must be maintained with precision and clarity, as my effectiveness depends entirely on its accuracy.
 
 REMEMBER: I always use mermaid diagrams when I want to visualize any concepts.
+
+## Mandatory QA Review
+
+Before calling `attempt_completion` on ANY task that involves code changes, I MUST perform a QA review. This is not optional — it is a required step in every implementation workflow.
+
+```mermaid
+flowchart TD
+Done[Implementation Complete] --> QA{QA Review Gate}
+
+QA --> C1[1. Code Review]
+QA --> C2[2. Test Verification]
+QA --> C3[3. Consistency Check]
+QA --> C4[4. Side Effect Analysis]
+
+C1 --> Pass{All Checks Pass?}
+C2 --> Pass
+C3 --> Pass
+C4 --> Pass
+
+Pass -->|Yes| Complete[attempt_completion]
+Pass -->|No| Fix[Fix Issues]
+Fix --> QA
+```
+
+### QA Review Checklist
+
+For every code change, I must review and verify:
+
+#### 1. Code Quality
+- [ ] No syntax errors or typos in changed files
+- [ ] Consistent code style with existing codebase (ruff/formatting conventions)
+- [ ] No hardcoded values that should be configurable
+- [ ] Error handling is appropriate (no bare excepts, meaningful error messages)
+- [ ] No commented-out code left behind unless intentional
+
+#### 2. Test Coverage
+- [ ] **MANDATORY**: Run `make test-cicd -C lib/idp_common_pkg` (or `make test` from project root) and verify ALL tests pass — do NOT skip this step
+- [ ] **MANDATORY**: Run `ruff check` (or `make lint`) on changed Python files and verify no new lint errors
+- [ ] New functionality has corresponding tests (or note why tests weren't added)
+- [ ] Test assertions are meaningful, not just "it doesn't crash"
+
+#### 3. Cross-Module Consistency
+- [ ] Changes to shared interfaces (Document model, config schemas) are reflected in all consumers
+- [ ] If config format changed, config_library examples are updated
+- [ ] If API changed, docs are updated
+- [ ] CHANGELOG.md is updated for user-facing changes
+
+#### 4. Side Effect Analysis
+- [ ] Review imports — no circular dependencies introduced
+- [ ] Check if changed functions/methods are called elsewhere (use `search_files` to verify)
+- [ ] Backward compatibility is maintained (or breaking changes are documented)
+- [ ] No unintended changes to files outside the scope of the task
+
+#### 5. Documentation
+- [ ] Code comments for complex logic
+- [ ] docstrings for new public functions/classes
+- [ ] Memory Bank updated if significant patterns or decisions were made
+
+### QA Review Output Format
+
+After completing the QA review, I will include a brief summary in my completion message:
+
+```
+## QA Review ✅
+- **Code Quality**: [pass/issues found and fixed]
+- **Tests**: [ran X tests, all passing / N tests added]
+- **Consistency**: [cross-module impacts checked]
+- **Side Effects**: [none found / details]
+- **Docs**: [updated / not needed]
+```
+
+If any issues are found during QA, I MUST fix them before completing the task. I do NOT present incomplete or unreviewed work.
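
Review note: the gate in the flowchart above (run all checks, fix on failure, re-enter the gate) can be sketched as a small loop. This is only an illustration of the control flow. `run_qa_gate`, the check callables, and the `max_rounds` cap are hypothetical names and are not part of `.clinerules`; the diagram itself loops without a cap.

```python
from typing import Callable, Dict, List

# Hypothetical check type: returns a list of issue descriptions (empty means pass).
Check = Callable[[], List[str]]


def run_qa_gate(
    checks: Dict[str, Check],
    fix: Callable[[str, List[str]], None],
    max_rounds: int = 3,  # assumption: a safety cap that the flowchart does not specify
) -> bool:
    """Return True once every check passes; call `fix` and retry on failures."""
    for _ in range(max_rounds):
        failures: Dict[str, List[str]] = {}
        for name, check in checks.items():
            issues = check()
            if issues:
                failures[name] = issues
        if not failures:
            return True  # "All Checks Pass? -> Yes": safe to complete
        for name, issues in failures.items():
            fix(name, issues)  # "Fix Issues", then loop back to the gate
    return False
```

A caller would only proceed to completion when the function returns `True`.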

CHANGELOG.md

Lines changed: 47 additions & 1 deletion
@@ -5,6 +5,50 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+## [0.5.0]
+
+### Added
+
+- **Unified Pattern** — Merged Pattern-1 (BDA) and Pattern-2 (Pipeline) into a single deployment. Switch between BDA and Pipeline processing modes at runtime using the `use_bda` configuration toggle — no redeployment needed. Use [Test Studio](./docs/test-studio.md) to compare accuracy and cost across both modes to find the optimal approach for your documents. See the [Migration Guide](./docs/migration-v04-to-v05.md) for upgrade instructions.
+
+- **Rule Validation for BDA mode** — Rule validation (business rule checking) is now available in both BDA and Pipeline modes. Previously it was Pipeline-only.
+
+- **Fake W-2 Tax Form Test Set Auto-Deployment** — New pre-deployed benchmark test set with 2,000 synthetically generated US W-2 tax form images and structured ground truth, sourced from HuggingFace (`singhsays/fake-w2-us-tax-form-dataset`, originally from Kaggle under CC0: Public Domain license). Features 45 ground truth fields per document covering employer info (EIN, name, address), employee info (SSN, name, address), federal wages/taxes (boxes 1-8), compensation codes (boxes 12a-d), checkboxes (box 13), and state/local taxes (boxes 15-20). Includes both clean and noisy image variants for testing OCR robustness. Ideal for benchmarking W-2 extraction accuracy, evaluating image quality impact on processing, and testing structured form data extraction at scale.
+
+- **AWS Profile Support for CLI** — Added optional `--profile` parameter to specify AWS credentials profile. Can be placed anywhere in the command. Automatically applies to all AWS SDK calls.
+
+- **Enhanced `status` CLI/MCP Command with Advanced Search, Filtering, and Analytics** — Added PK substring search (`--batch-id` now matches partial batch identifiers across multiple batches), `--object-status` filter for searching by processing status (COMPLETED, FAILED, etc.), `--get-time` flag for timing statistics (processing, queue, total time with min/max outlier tracking), `--include-metering` flag for Lambda GB-seconds usage and cost estimates, and `--show-details` flag for detailed document information. Introduces `TrackingTableSearcher` class for flexible DynamoDB tracking table queries. Fully backward compatible with existing usage.
+
+- **Added Replace/Merge sync modes for BDA synchronization** — Both "Sync from BDA" and "Sync to BDA" now support two modes: **Replace** (default) aligns the target to match the source exactly, removing items not in the source; **Merge** adds source items to the target without removing existing items. The UI modal now always shows a mode selection and ARN input (pre-filled for linked projects).
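
Review note: the difference between the two sync modes in the entry above comes down to set semantics. The sketch below assumes classes can be identified by name alone. `sync_classes` is a hypothetical helper for illustration, not the function used by the BDA sync code.

```python
from typing import Iterable, Set


def sync_classes(source: Iterable[str], target: Iterable[str], mode: str = "replace") -> Set[str]:
    """Return the target's class names after syncing from source (illustrative only)."""
    source_set, target_set = set(source), set(target)
    if mode == "replace":
        # Replace: target ends up matching the source exactly;
        # anything that exists only in the target is removed.
        return source_set
    if mode == "merge":
        # Merge: source items are added, existing target items are kept.
        return target_set | source_set
    raise ValueError(f"unknown sync mode: {mode!r}")
```

With source `{"W2", "Invoice"}` and target `{"Invoice", "Payslip"}`, Replace drops `Payslip` while Merge keeps all three names.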
+
+
+### Deprecated
+
+- **Pattern-1 (BDA) and Pattern-2 (Pipeline) separate deployments** — Replaced by the Unified Pattern. Existing stacks are automatically upgraded. See the [Migration Guide](./docs/migration-v04-to-v05.md) for details.
+
+- **Pattern-3 (UDOP + Bedrock)** — Pattern-3 is no longer available as a deployment option. If you are currently using Pattern-3 with a SageMaker UDOP endpoint, do not upgrade to v0.5.x without first testing in a non-production environment. You can use the [Lambda Inference Hooks](./docs/lambda-hook-inference.md) feature (introduced in v0.4.15) to call your existing SageMaker UDOP endpoint from the unified pattern's classification step via a custom Lambda function.
+
+### Changed
+
+- **Switched `idp_sdk` pyproject.toml to auto-discovery** — Replaced explicit subpackage listing with `setuptools.packages.find` using `include = ["idp_sdk*"]` so new subpackages are automatically included without manual pyproject.toml updates.
+
+- **Resilient Test Set Deployment — Graceful Degradation on Download Failures** — All test set deployer Lambdas (RealKIE-FCC, OmniAI-OCR-Benchmark, DocSplit-Poly-Seq) now handle download failures gracefully instead of causing CloudFormation stack rollbacks. When a dataset source (HuggingFace) is unreachable or a download fails, the deployer creates a FAILED test set record in DynamoDB with a descriptive error message visible in the Test Studio UI, and sends `cfnresponse.SUCCESS` to CloudFormation so the stack deployment continues. Previously failed deployments are automatically retried on the next stack update. This ensures transient third-party service outages never block IDP infrastructure deployment.
+
+- **Replaced PyMuPDF (AGPL-3.0) with pypdfium2 (Apache-2.0/BSD-3-Clause) for PDF rendering** — Resolves license incompatibility with the project's MIT-0 license. pypdfium2 provides equivalent PDF-to-image rendering using PDFium engine. Page rendering is now performed sequentially before parallel OCR processing to ensure thread-safety.
+
+### Fixed
+
+- **Fixed "Sync from BDA" not removing IDP classes absent from BDA project** — Previously, "Sync from BDA" only added new classes from the BDA project without removing classes that weren't in BDA. Now defaults to "Replace" mode which fully aligns the config version's classes with the BDA project, removing classes not present in BDA. A new "Merge" mode is also available to preserve the legacy additive behavior.
+
+- **Fixed insufficient Lambda memory for Extraction, Assessment, and Evaluation functions in unified pattern template** — Increased MemorySize from 512 MB (Extraction, Assessment) and 1024 MB (Evaluation) to 4096 MB to match all other document processing Lambda functions, preventing potential out-of-memory errors during document processing. ([#205](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/205))
+
+- **Fixed DOCX processing to extract text from embedded images and correct page splitting** — DOCX files with embedded images (e.g., `<w:drawing>` elements) now have image content OCR'd and included in the extracted text instead of being silently skipped. Page splitting now uses DOCX metadata (explicit page breaks, image display dimensions from `wp:extent`, section properties) instead of inaccurate height estimates, producing correct page boundaries.
+
+### Templates
+- us-west-2: `https://s3.us-west-2.amazonaws.com/aws-ml-blog-us-west-2/artifacts/genai-idp/idp-main_0.5.0.yaml`
+- us-east-1: `https://s3.us-east-1.amazonaws.com/aws-ml-blog-us-east-1/artifacts/genai-idp/idp-main_0.5.0.yaml`
+- eu-central-1: `https://s3.eu-central-1.amazonaws.com/aws-ml-blog-eu-central-1/artifacts/genai-idp/idp-main_0.5.0.yaml`
+
 ## [0.4.16]
 
 ### Added
@@ -25,12 +69,14 @@ SPDX-License-Identifier: MIT-0
 - **Added support for Claude Sonnet 4.6 model and Long Context (1M) variant**
 - **Included MCP tools `process`, `reprocess`, `status`, `search` for document processing**
 - **Added `process` and `reprocess` CLI commands for batch operations via command line**
-- **Maintained `run-inference` and `rerun-inference` CLI commands with deprecation notices**
+- **Added external mcp client example `examples/external-mcp-client`**
+- **Maintained `run-inference` and `rerun-inference` CLI commands with deprecation notices**
 
 ### Fixed
 
 - **Fixed DynamoDB 400KB item size limit blocking configs with 45+ document classes** — Configuration data is now gzip-compressed before storing to DynamoDB, achieving 37-95x compression ratios. Supports 3,000+ document classes within the 400KB limit. Fully backward compatible with existing deployments. ([#200](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/200), [#201](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/pull/201))
 - **Fixed Processing Flow chart using active stack config instead of the document's actual config version** for determining disabled steps (assessment, summarization, etc.)
+- **Fixed `idp_sdk` pip install from GitHub missing subpackages** — Non-editable pip installs of `idp_sdk` from GitHub were missing `core/`, `models/`, and `operations/` subpackages, causing `ModuleNotFoundError`. Fixed by explicitly declaring all subpackages in `pyproject.toml`. ([#196](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/196))
 
 ### Templates
 - us-west-2: `https://s3.us-west-2.amazonaws.com/aws-ml-blog-us-west-2/artifacts/genai-idp/idp-main_0.4.16.yaml`
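
Review note: the 0.4.16 DynamoDB fix in the hunk above describes gzip-compressing configuration data before it is stored. A minimal sketch of that idea follows, using only the standard library. The function names, the 400 * 1024 byte approximation of the limit, and the way legacy uncompressed values are detected (gzip magic bytes) are assumptions for illustration; the changelog does not say how the project implements backward compatibility.

```python
import gzip
import json

DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024  # the 400KB limit, approximated in bytes
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def compress_config(config: dict) -> bytes:
    """Serialize a config compactly and gzip it for storage as a binary attribute."""
    raw = json.dumps(config, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def load_config(stored) -> dict:
    """Read either a gzip blob or a legacy plain-JSON value (assumed legacy format)."""
    if isinstance(stored, (bytes, bytearray)):
        data = bytes(stored)
        if data[:2] == GZIP_MAGIC:
            return json.loads(gzip.decompress(data).decode("utf-8"))
        stored = data.decode("utf-8")
    return json.loads(stored)
```

Repetitive class definitions compress well, which is consistent with the large ratios the changelog reports, though the actual ratio depends on the config content.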

CLAUDE.md

Lines changed: 0 additions & 1 deletion
@@ -177,7 +177,6 @@ The solution uses a modular architecture with the main template (`template.yaml`
 - OCR with Amazon Textract
 - Classification with UDOP model on SageMaker
 - Extraction with Bedrock
-- Location: `patterns/pattern-3/`
 
 ### Document Processing Flow
 

Makefile

Lines changed: 16 additions & 3 deletions
@@ -49,13 +49,26 @@ ui-start:
 @echo "Starting UI development server..."
 cd src/ui && npm run start
 
-# Run tests in idp_common_pkg, idp_cli, idp_sdk, and capacity planning Lambda
+# Run tests in idp_common_pkg, idp_cli, idp_sdk, capacity planning Lambda, and config library
 test:
 $(MAKE) -C lib/idp_common_pkg test
 cd lib/idp_cli_pkg && python -m pytest -v
 cd lib/idp_sdk && python -m pytest -m "not integration" -v
 @echo "Running capacity planning Lambda tests..."
 cd src/lambda/calculate_capacity && python -m pytest -v
+@echo "Validating config library files..."
+python -m pytest config_library/test_config_library.py -v
+
+# Run only config library validation tests
+test-config-library:
+@echo "Validating config library YAML/JSON files..."
+python -m pytest config_library/test_config_library.py -v
+
+# Run only IDP CLI tests
+test-cli:
+@echo "Running IDP CLI tests..."
+cd lib/idp_cli_pkg && python -m pytest -v
+@echo -e "$(GREEN)✅ All CLI tests passed!$(NC)"
 
 # Run only capacity planning tests
 test-capacity:
@@ -170,9 +183,9 @@ ui-lint:
 STORED_HASH=$$(test -f src/ui/.checksum && cat src/ui/.checksum || echo ""); \
 if [ "$$CURRENT_HASH" != "$$STORED_HASH" ]; then \
 echo "UI code checksum changed - running lint..."; \
-cd src/ui && npm ci --prefer-offline --no-audit && npm run lint -- --fix && \
+cd src/ui && npm ci --prefer-offline --no-audit && npm run lint -- --fix && npm run typecheck && \
 echo "$$CURRENT_HASH" > .checksum; \
-echo -e "$(GREEN)✅ UI lint completed and checksum updated$(NC)"; \
+echo -e "$(GREEN)✅ UI lint and typecheck completed and checksum updated$(NC)"; \
 else \
 echo -e "$(GREEN)✅ UI code checksum unchanged - skipping lint$(NC)"; \
 fi
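
Review note: the `ui-lint` target above skips work when a stored checksum matches the current one, and only records the new checksum after lint and typecheck succeed. The sketch below shows the same gating pattern in Python. How the Makefile computes `CURRENT_HASH` is not visible in this hunk, so the SHA-256 over file paths and contents used here is an assumption, and `lint_if_changed` is a hypothetical name.

```python
import hashlib
from pathlib import Path
from typing import Callable

CHECKSUM_FILE = ".checksum"


def tree_checksum(root: Path) -> str:
    """Hash every file under root (except the checksum file) in a stable order."""
    digest = hashlib.sha256()
    files = sorted(p for p in root.rglob("*") if p.is_file() and p.name != CHECKSUM_FILE)
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def lint_if_changed(root: Path, run_checks: Callable[[], bool]) -> bool:
    """Run checks only when the tree changed; return True if checks were run."""
    current = tree_checksum(root)
    stored_file = root / CHECKSUM_FILE
    stored = stored_file.read_text().strip() if stored_file.exists() else ""
    if current == stored:
        return False  # unchanged: skip lint and typecheck
    if run_checks():
        # Mirror the `&&` chain: record the hash only when every check succeeded.
        stored_file.write_text(current)
    return True
```

As in the Makefile, the hash that gets stored is the one computed before the checks ran, so an auto-fixing linter that rewrites files would trigger one more run next time.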

README.md

Lines changed: 11 additions & 15 deletions
@@ -57,8 +57,7 @@ Concierge support for customization, deployment, and integration of production u
 - **Comprehensive Monitoring**: Rich CloudWatch dashboard with detailed metrics and logs
 - **Web User Interface**: Modern UI for inspecting document workflow status and results
 - **Configuration Versioning**: Support for multiple configuration versions with version-specific processing and test comparison
-- **Human-in-the-Loop (HITL)**: Built-in review system for human validation workflows (Pattern 1 & Pattern 2)
-- **Note**: When deploying multiple patterns with HITL, reuse existing private workteam ARN due to AWS account limits
+- **Human-in-the-Loop (HITL)**: Built-in review system for human validation workflows
 - **AI-Powered Evaluation**: Framework to assess accuracy against baseline data
 - **Extraction Confidence Assessment**: LLM-powered assessment of extraction confidence with multimodal document analysis
 - **Document Knowledge Base Query**: Ask questions about your processed documents
@@ -67,14 +66,13 @@ Concierge support for customization, deployment, and integration of production u
 
 ## Architecture Overview
 
-![Architecture Diagram](./images/IDP.drawio.png)
+![Architecture Diagram](./images/IDP.UnifiedPatterns.drawio.png)
 
 The solution uses a modular architecture with nested CloudFormation stacks to support multiple document processing patterns while maintaining common infrastructure for queueing, tracking, and monitoring.
 
-Current patterns include:
-- Pattern 1: Packet or Media processing with Bedrock Data Automation (BDA)
-- Pattern 2: OCR → Bedrock Classification (page-level or holistic) → Bedrock Extraction
-- Pattern 3: OCR → UDOP Classification (SageMaker) → Bedrock Extraction
+The unified pattern supports two processing modes, controlled by the `use_bda` configuration flag:
+- **Pipeline mode** (default): OCR → Bedrock Classification (page-level or holistic) → Bedrock Extraction → Assessment → Rule Validation → Summarization
+- **BDA mode**: End-to-end processing with Bedrock Data Automation (BDA) → Rule Validation → Summarization
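
Review note: the two modes added to the README above can be read as a single flag selecting one of two stage sequences. The sketch below is only a restatement of that list in code. `stages_for` and the constant names are hypothetical, and this is not how the solution wires its stages together.

```python
from typing import List, Mapping

# Stage sequences as listed in the README text above.
PIPELINE_STAGES: List[str] = [
    "OCR",
    "Bedrock Classification",
    "Bedrock Extraction",
    "Assessment",
    "Rule Validation",
    "Summarization",
]
BDA_STAGES: List[str] = [
    "Bedrock Data Automation",
    "Rule Validation",
    "Summarization",
]


def stages_for(config: Mapping[str, object]) -> List[str]:
    """Pick the stage sequence from `use_bda`; Pipeline mode is the documented default."""
    return BDA_STAGES if config.get("use_bda", False) else PIPELINE_STAGES
```

Both sequences end with Rule Validation and Summarization, which matches the changelog entry making rule validation available in BDA mode.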
 
 ## Quick Start
 
@@ -101,8 +99,7 @@ After deployment, choose the processing method that fits your use case:
 1. Open the Web UI URL from CloudFormation stack Outputs
 2. Log in and click "Upload Document"
 3. Upload a sample document:
-- For Patterns 1 & 2: [samples/lending_package.pdf](./samples/lending_package.pdf)
-- For Pattern 3: [samples/rvl_cdip_package.pdf](./samples/rvl_cdip_package.pdf)
+- [samples/lending_package.pdf](./samples/lending_package.pdf)
 4. Monitor processing and view results in the dashboard
 
 #### Method 2: Direct S3 Upload (Simple)
@@ -161,8 +158,7 @@ To update an existing GenAIIDP stack to a new version:
 7. For detailed instructions, see the [Deployment Guide](./docs/deployment.md#updating-an-existing-stack)
 
 For testing, use these sample files:
-- For Patterns 1 (BDA) and Pattern 2: Use [samples/lending_package.pdf](./samples/lending_package.pdf)
-- For Pattern 3 (UDOP): Use [samples/rvl_cdip_package.pdf](./samples/rvl_cdip_package.pdf)
+- Use [samples/lending_package.pdf](./samples/lending_package.pdf) for both Pipeline and BDA modes
 
 For detailed deployment and testing instructions, see the [Deployment Guide](./docs/deployment.md).
 
@@ -194,11 +190,11 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
 - [Reporting Database](./docs/reporting-database.md) - Analytics database for evaluation metrics and metering data
 - [Troubleshooting](./docs/troubleshooting.md) - Troubleshooting and performance guides
 
-### Processing Patterns
+### Processing Modes
 
-- [Pattern 1: BDA](./docs/pattern-1.md) - Packet or Media processing with Bedrock Data Automation (BDA)
-- [Pattern 2: Textract + Bedrock](./docs/pattern-2.md) - OCR with Textract and generative AI with Bedrock
-- [Pattern 3: Textract + UDOP + Bedrock](./docs/pattern-3.md) - OCR with Textract, UDOP Classification, and Bedrock extraction
+- [Architecture](./docs/architecture.md) - Unified pattern with BDA and Pipeline processing modes
+- [BDA Mode Reference](./docs/pattern-1.md) - Bedrock Data Automation (BDA) concepts and behavior
+- [Pipeline Mode Reference](./docs/pattern-2.md) - Textract + Bedrock classification and extraction
 - [Few-Shot Examples](./docs/few-shot-examples.md) - Implementing few-shot examples for improved accuracy
 
 ### Python Development

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.4.16
+0.5.0
