CHANGELOG.md: 33 additions & 0 deletions
@@ -5,6 +5,39 @@ SPDX-License-Identifier: MIT-0

 ## [Unreleased]

+## [0.5.4]
+
+### Added
+
+- **MLflow Experiment Tracking Integration** — Optional integration with Amazon SageMaker MLflow for automated test run logging. When enabled (`EnableMLflow=true`), every Test Studio run automatically logs metrics (accuracy, cost, field-level scores), configuration parameters (model IDs, temperatures, inference settings), and artifacts (full config snapshots, class definitions, cost breakdowns) to an MLflow tracking server. Fire-and-forget async invocation — never blocks or delays test results. Zero resources created when disabled. See `docs/mlflow-integration.md`.
+
+- **BDA Blueprint Optimization** — Automatically improves BDA extraction accuracy using the `InvokeBlueprintOptimizationAsync` API. When discovery includes a ground truth file and `enable_blueprint_optimization: true` is set, the system optimizes the BDA blueprint by comparing extraction results against ground truth, evaluates before/after metrics, and updates the blueprint schema if improved. Disabled by default. See `docs/discovery.md` — Blueprint Optimization section.
+
+- **idp_common API Reference & Documentation** — Added `docs/idpcommon-api-reference.md` covering all 22 modules, created 6 missing module READMEs (discovery, schema, image, s3, utils, metrics), updated core data model docs to match current code, fixed `IDPConfig` lazy-loading bug in `__init__.py`, and integrated into docs-site sidebar.
+
+- **Consolidated publish and headless deploy into `idp-cli`** — All build/publish/deploy functionality now available through the CLI, deprecating standalone scripts:
+  - `publish.py` and `publish.sh` are deprecated — use `idp-cli publish` instead. `publish.py` remains as a thin backward-compatibility wrapper. `publish.sh` has been removed.
+  - `scripts/generate_govcloud_template.py` is deprecated — use `idp-cli publish --headless` or `idp-cli deploy --headless` instead. The script remains as a thin wrapper.
+  - New `--template-file` option on `idp-cli deploy` for deploying from a local CloudFormation template file produced by a previous `idp-cli publish`.
+  - `idp-cli deploy --headless` (without `--from-code`) now downloads the published template, transforms to headless with GovCloud config defaults, uploads to S3, and deploys — all in one command.
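
As a rough illustration of the "thin backward-compatibility wrapper" described for `publish.py`, the sketch below forwards legacy arguments to `idp-cli publish` and emits a deprecation warning. Passing arguments through unchanged, and the helper names `build_cli_command` and `main`, are assumptions made for illustration; the actual wrapper may map arguments differently.

```python
import subprocess
import sys
import warnings
from typing import List, Optional


def build_cli_command(argv: List[str]) -> List[str]:
    """Build the `idp-cli publish` command line for legacy arguments.

    Assumption: arguments are forwarded unchanged. The real wrapper may
    translate positional arguments into CLI options instead.
    """
    return ["idp-cli", "publish", *argv]


def main(argv: Optional[List[str]] = None) -> int:
    """Warn about the deprecation, then delegate to the CLI (hypothetical entry point)."""
    args = sys.argv[1:] if argv is None else argv
    warnings.warn(
        "publish.py is deprecated; use `idp-cli publish` instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Return the CLI's exit code so callers and CI jobs see failures.
    return subprocess.call(build_cli_command(args))
```

A script built this way would end with `raise SystemExit(main())`, so existing automation that still calls the old script keeps working while users see the deprecation notice.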
+
+### Fixed
+
+- **HITL review start overwrites document sections** — Fixed the Start Review action to update only the Review Status and Review Owner fields, preserving all existing document sections and other fields.
+
+- **Evaluation schema error for free-form objects** — Stickler mapper now detects and skips unevaluable object schemas (e.g., objects with `additionalProperties` but no defined `properties`, and arrays of such objects) instead of raising validation errors.
+
+- **Full document reprocess not re-running OCR** — Fixed bug where clicking "Reprocess" in the UI reused stale OCR results from the previous run instead of re-executing OCR with the current configuration. The reprocess resolver now deletes previous output data from S3 before queuing, preventing the OCR function's retry-safe recovery from reinstalling old results.
+
+- **Agentic extraction timeout on long documents** — Fixed repeated Lambda timeouts when agentic extraction exceeds the 15-minute limit on large documents (e.g., 25-page brokerage statements with 600+ holdings). Added incremental S3 checkpointing that saves extraction state after each tool call — covers both the extraction tools path (`extraction_tool`, `apply_json_patches`, `make_buffer_data_final_extraction`) and the buffer tools path (`patch_buffer_data`) that the agent uses for very large batched extractions. The checkpoint format tracks which state was saved (`current_extraction` vs `intermediate_extraction` buffer) so the correct resume path is used. On Step Function retry, the Lambda loads the checkpoint and the agent resumes from where it left off rather than restarting from scratch. No CloudFormation or Step Function changes required — the existing `Sandbox.Timedout` retry mechanism now makes incremental progress. Only active when agentic extraction is enabled; standard extraction is unaffected.
+
+- **Agentic extraction fails on Bedrock InternalServerException without retrying** — Fixed `InternalServerException` errors (transient Bedrock server-side errors) causing immediate Lambda failure after only botocore's fast 7 retries, bypassing the application-level retry decorator (50 retries with 5s→1800s exponential backoff). Root cause: `InternalServerException` and `InternalServerError` were missing from all three retry layers — the `async_exponential_backoff_retry` decorator's `DEFAULT_RETRYABLE_ERRORS` set (`bedrock_utils.py`), the `BedrockClient._invoke_with_retry()` retryable errors list (`bedrock/client.py`), and the Step Functions ExtractionStep Retry `ErrorEquals` list (`workflow.asl.json`). All three layers now include these transient errors, providing proper exponential backoff retry at the application level and Lambda-level retry via Step Functions as a safety net.
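
The application-level retry described in the last fix can be sketched as a minimal async decorator, assuming errors expose a botocore-style `response["Error"]["Code"]`. The names `async_retry`, `error_code`, and `backoff_delay` are hypothetical, and the retryable set lists only the two error codes named above. The defaults mirror the 50 retries and the 5s to 1800s backoff mentioned in the changelog; the project's `async_exponential_backoff_retry` may differ, for example by adding jitter.

```python
import asyncio
import functools

# Only the two codes named in the changelog entry; the project's
# DEFAULT_RETRYABLE_ERRORS set presumably contains more (assumption).
RETRYABLE_ERRORS = frozenset({"InternalServerException", "InternalServerError"})


def error_code(exc: Exception) -> str:
    """Return a botocore-style error code if present, else the exception class name."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") or type(exc).__name__


def backoff_delay(attempt: int, initial: float = 5.0, maximum: float = 1800.0) -> float:
    """Delay before retry `attempt` (0-based): doubles each time, capped at `maximum`."""
    return min(maximum, initial * (2 ** attempt))


def async_retry(max_retries=50, initial=5.0, maximum=1800.0, retryable=RETRYABLE_ERRORS):
    """Retry an async callable on retryable error codes with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception as exc:
                    # Re-raise non-retryable errors, or when retries are exhausted.
                    if error_code(exc) not in retryable or attempt == max_retries:
                        raise
                    await asyncio.sleep(backoff_delay(attempt, initial, maximum))
        return wrapper
    return decorator
```

The point of the fix is visible in `RETRYABLE_ERRORS`: an error code missing from that set is re-raised on the first failure, no matter how generous the retry budget is.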
The unified architecture supports two processing modes, controlled by the `use_bda` configuration flag:
+
+1. **BDA Mode** (formerly Pattern 1)
    - Uses AWS Bedrock Data Automation for end-to-end processing
    - Handles packet or media documents with integrated OCR, classification, and extraction
-   - Location: `patterns/pattern-1/`

-2. **Pattern 2: Textract + Bedrock**
+2. **Pipeline Mode** (formerly Pattern 2)
    - OCR with Amazon Textract
    - Classification with Bedrock (page-level or holistic)
    - Extraction with Bedrock
    - Supports few-shot examples
-   - Location: `patterns/pattern-2/`

-3. **Pattern 3: Textract + UDOP + Bedrock**
-   - OCR with Amazon Textract
-   - Classification with UDOP model on SageMaker
-   - Extraction with Bedrock
+> **Note**: The separate `patterns/pattern-1/`, `patterns/pattern-2/`, and `patterns/pattern-3/` directories have been removed. All processing is now in `patterns/unified/`. See [pattern-1.md](docs/pattern-1.md) and [pattern-2.md](docs/pattern-2.md) for historical reference.
README.md: 1 addition & 0 deletions
@@ -187,6 +187,7 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
 - [Assessment](./docs/assessment.md) - Extraction confidence evaluation using LLMs
 - [Rule Validation](./docs/rule-validation.md) - Business rule validation and compliance checking
 - [Evaluation Framework](./docs/evaluation.md) - Accuracy assessment system with analytics database and reporting
+- [MLflow Experiment Tracking](./docs/mlflow-integration.md) - Optional MLflow integration for tracking metrics, model parameters, and prompts across test runs
 - [Knowledge Base](./docs/knowledge-base.md) - Document knowledge base query feature
 - [Monitoring](./docs/monitoring.md) - Monitoring and logging capabilities
 - [IDP Accelerator Help Chat Bot](./docs/code-intelligence.md) - Chat bot for asking questions about the IDP code base and features