
Commit 18efdc0

Merge branch 'feature/bda_prompt_optimization_integration' into 'develop'
BDA blueprint optimization feature added. See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!595
2 parents 408306c + 7b50ac5 commit 18efdc0

17 files changed

Lines changed: 2644 additions & 41 deletions

File tree

docs/discovery.md

Lines changed: 98 additions & 0 deletions
@@ -844,13 +844,111 @@ BDABlueprintPermissions:
844844
- bedrock:ListBlueprints
845845
- bedrock:GetBlueprint
846846
- bedrock:DeleteBlueprint
847+
- bedrock:InvokeBlueprintOptimizationAsync
848+
- bedrock:GetBlueprintOptimizationStatus
847849
```
848850
849851
**Monitoring:**
850852
- Blueprint creation/update activities are logged to CloudWatch
851853
- Schema conversion details are captured
852854
- Error conditions are clearly documented
853855
856+
### Blueprint Optimization
857+
858+
The Blueprint Optimization feature uses the BDA `InvokeBlueprintOptimizationAsync` API to automatically improve extraction accuracy for discovered document classes. When a discovery job includes a ground truth file, the system can optimize the BDA blueprint by comparing extraction results against the ground truth and refining the blueprint schema.
859+
860+
#### How It Works
861+
862+
1. **Blueprint Lookup**: The optimizer checks if a blueprint already exists for the discovered class in the BDA project. If found, it reuses the existing blueprint; otherwise, it creates a new one following the standard naming convention (`{StackName}-{ClassName}-{hash}`).
863+
2. **S3 Asset Preparation**: The sample document (PDF) and ground truth (JSON) S3 URIs are constructed from the discovery bucket.
864+
3. **Optimization Invocation**: The `InvokeBlueprintOptimizationAsync` API is called with the blueprint ARN, sample document, ground truth, and an output S3 prefix.
865+
4. **Status Polling**: The system polls `GetBlueprintOptimizationStatus` with exponential backoff (5s initial, 30s max, 15-minute timeout) until a terminal state is reached.
866+
5. **Results Evaluation**: The optimization results (stored at `{outputPrefix}/optimization_results.json`) contain before/after metrics. The system compares `exactMatch` and `f1` scores.
867+
6. **Schema Application**: If the optimized schema shows improvement, the blueprint is updated with the new schema, a new version is created, and the IDP class configuration is updated.
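
The polling behavior in step 4 can be sketched as follows. This is a minimal illustration, not the optimizer's actual code: `poll_until_terminal`, the `get_status` callable, and the terminal state strings are assumptions made for the example, and only the 5s/30s/15-minute figures come from the description above.

```python
import time
from typing import Callable

# Assumed terminal state names (taken from the flow diagram labels); check the
# BDA API reference for the values GetBlueprintOptimizationStatus really returns.
TERMINAL_STATES = {"Success", "ServiceError", "ClientError"}


def poll_until_terminal(
    get_status: Callable[[], str],
    initial_delay: float = 5.0,
    max_delay: float = 30.0,
    timeout: float = 15 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until a terminal state or the timeout is reached."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            return "TIMED_OUT"
        # Never sleep past the deadline, then double the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

In a real caller, `get_status` would wrap the `GetBlueprintOptimizationStatus` call and return its status field. Injecting `sleep` and `clock` keeps the loop testable without waiting.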
868+
869+
#### Optimization Flow
870+
871+
```mermaid
graph TD
    A[Discovery Completes with Ground Truth] --> B[Blueprint Optimization Lambda]
    B --> C{Existing Blueprint?}
    C -->|Yes| D[Reuse Existing Blueprint]
    C -->|No| E[Create New Blueprint]
    D --> F[Invoke Optimization API]
    E --> F
    F --> G[Poll for Completion]
    G --> H{Result?}
    H -->|Success| I[Fetch Results from S3]
    H -->|ServiceError/ClientError| J[Report Failure]
    H -->|Timeout| J
    I --> K{Improved?}
    K -->|Yes| L[Update Blueprint Schema]
    K -->|No| M[Keep Original Schema]
    L --> N[Create Blueprint Version]
    N --> O[Update IDP Config]
    O --> P[Report OPTIMIZATION_COMPLETED]
    M --> P
    J --> Q[Report OPTIMIZATION_FAILED]
```
893+
894+
#### UI Status Display
895+
896+
The Discovery Panel shows optimization progress with dedicated status indicators:
897+
898+
| Status | UI Label | Description |
|--------|----------|-------------|
| `OPTIMIZATION_IN_PROGRESS` | Optimizing | Optimization is running (blueprint creation, API invocation, polling) |
| `OPTIMIZATION_COMPLETED` | Optimized | Optimization finished (improved or no improvement) |
| `OPTIMIZATION_FAILED` | Optimization Failed | Optimization encountered an error |
903+
904+
The Result column shows additional context:
905+
- **Improved**: Class name badge + accuracy improvement message (e.g., "exactMatch: 0.78 → 0.91")
906+
- **No improvement**: Message indicating original schema was kept
907+
- **Failed**: Expandable error details
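
As a rough sketch of how the before/after metrics could drive the Result column message: the improvement rule below (no score regresses and at least one increases) is an assumption, since the documentation only states that `exactMatch` and `f1` are compared. The function names and the flat dict shape are also hypothetical.

```python
def is_improved(before: dict, after: dict) -> bool:
    """Assumed rule: neither score regresses and at least one increases."""
    keys = ("exactMatch", "f1")
    no_regression = all(after[k] >= before[k] for k in keys)
    any_gain = any(after[k] > before[k] for k in keys)
    return no_regression and any_gain


def result_message(before: dict, after: dict) -> str:
    """Build the text shown next to the class name badge."""
    if is_improved(before, after):
        return f"exactMatch: {before['exactMatch']:.2f} → {after['exactMatch']:.2f}"
    return "No improvement detected, original schema kept"


print(result_message({"exactMatch": 0.78, "f1": 0.80},
                     {"exactMatch": 0.91, "f1": 0.93}))
```

A stricter or looser rule (for example, a minimum gain threshold) would only change `is_improved`; the message formatting stays the same.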
908+
909+
#### Components
910+
911+
- **`BlueprintOptimizer`** (`lib/idp_common_pkg/idp_common/bda/blueprint_optimizer.py`): Core orchestrator — manages the full optimization lifecycle including blueprint lookup/creation, API invocation, polling, evaluation, and schema application.
912+
- **`blueprint_optimization` Lambda** (`src/lambda/blueprint_optimization/index.py`): Async Lambda handler invoked by the discovery processor. Manages AppSync status updates and error reporting.
913+
- **`OptimizationResult`**: Dataclass returned by the optimizer with status, metrics, blueprint ARN, and optionally the optimized schema.
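
For orientation, the result type might look roughly like the sketch below. The attribute names `status`, `before_metrics`, `after_metrics`, `exact_match`, and `error_message`, and the three enum members, match the README usage example; `blueprint_arn`, `optimized_schema`, `f1`, and the `OptimizationMetrics` class name are assumptions, and the real dataclass may define further fields or statuses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OptimizationStatus(Enum):
    # Members seen in the README usage example; others may exist.
    IMPROVED = "IMPROVED"
    NO_IMPROVEMENT = "NO_IMPROVEMENT"
    FAILED = "FAILED"


@dataclass
class OptimizationMetrics:
    # Hypothetical container for the two scores the optimizer compares.
    exact_match: float
    f1: float


@dataclass
class OptimizationResult:
    status: OptimizationStatus
    blueprint_arn: Optional[str] = None          # assumed field name
    before_metrics: Optional[OptimizationMetrics] = None
    after_metrics: Optional[OptimizationMetrics] = None
    optimized_schema: Optional[dict] = None      # assumed field name
    error_message: Optional[str] = None
```

Making every field except `status` optional lets a failed run carry only an error message, while a successful run carries metrics and, when improved, the new schema.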
914+
915+
#### Configuration
916+
917+
Blueprint optimization is disabled by default. To enable it, set both `use_bda: true` and `enable_blueprint_optimization: true` in your configuration version via the View/Edit Configuration UI or directly in the config YAML:
918+
919+
```yaml
use_bda: true
enable_blueprint_optimization: true
```
923+
924+
When enabled, the optimizer uses:
925+
- The same BDA project as the main blueprint service (per configuration version)
926+
- The same blueprint naming convention (`{StackName}-{ClassName}-{hash}`)
927+
- The discovery bucket for S3 input/output URIs
928+
- The `bedrock-data-automation` client with `boto3>=1.42.0` (bundled in the Lambda function's `requirements.txt`)
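
The two-flag gate can be expressed as a small check. This is an illustrative sketch that assumes the configuration has already been loaded into a flat dict; `optimization_enabled` is a hypothetical helper, not a function from the codebase.

```python
def optimization_enabled(config: dict) -> bool:
    """Return True only when both flags are explicitly set to true."""
    return (
        config.get("use_bda") is True
        and config.get("enable_blueprint_optimization") is True
    )


print(optimization_enabled({"use_bda": True}))  # second flag missing
print(optimization_enabled({"use_bda": True, "enable_blueprint_optimization": True}))
```

Treating a missing key as disabled mirrors the documented default, where optimization stays off unless both settings are present.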
929+
930+
#### IAM Permissions
931+
932+
The Blueprint Optimization Lambda requires these additional Bedrock permissions (configured in `template.yaml`):
933+
934+
```yaml
- bedrock:InvokeBlueprintOptimizationAsync
- bedrock:GetBlueprintOptimizationStatus
```
938+
939+
Resource ARN patterns:
940+
```yaml
- arn:${AWS::Partition}:bedrock:${AWS::Region}:${AWS::AccountId}:blueprint/*
- arn:${AWS::Partition}:bedrock:${AWS::Region}:${AWS::AccountId}:blueprint-optimization-invocation/*
```
944+
945+
#### Retry and Error Handling
946+
947+
- **S3 Eventual Consistency**: The optimization results file may not be immediately available after the API reports success. The system retries up to 5 times with 2-second delays.
948+
- **Polling Timeout**: If optimization doesn't complete within 15 minutes, the result is `TIMED_OUT`.
949+
- **API Errors**: `ServiceError` and `ClientError` from the BDA API are captured and reported as `OPTIMIZATION_FAILED`.
950+
- **Blueprint Not Found**: The optimization API expects the blueprint in the `LIVE` stage. If the referenced stage doesn't match, the API returns `ResourceNotFoundException`.
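
The results-file retry can be sketched as a bounded loop. This is a minimal illustration under stated assumptions: `fetch_with_retry` and the `fetch` callable are hypothetical, and the defaults simply mirror the 5 attempts and 2-second delay quoted above.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], Optional[bytes]],
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it returns data; fetch returns None while the object is missing."""
    for attempt in range(1, attempts + 1):
        data = fetch()
        if data is not None:
            return data
        # Wait between attempts, but not after the final one.
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"results not available after {attempts} attempts")
```

With boto3, `fetch` could wrap an S3 `get_object` call for `{outputPrefix}/optimization_results.json` and return `None` when the key does not exist yet; that wiring is left out here to keep the sketch dependency-free.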
951+
854952
### BdaIDP Sync Feature
855953

856954
The BdaIDP Sync feature provides bidirectional synchronization between BDA (Bedrock Data Automation) blueprints and IDP custom classes. This feature enables seamless integration between BDA's blueprint management system and IDP's document class configuration, with support for AWS Standard blueprints, optimized parallel processing, and configurable **Replace** or **Merge** sync modes.

lib/idp_common_pkg/idp_common/bda/README.md

Lines changed: 46 additions & 0 deletions
@@ -18,6 +18,9 @@ The BDA module enables seamless integration with Amazon Bedrock Data Automation
1818

1919
- **BdaService**: Main service class for interacting with BDA
2020
- **BdaInvocation**: Data class for handling BDA job results
21+
- **BdaBlueprintService**: Blueprint lifecycle management, schema conversion, and project synchronization
22+
- **BDABlueprintCreator**: Blueprint CRUD operations (create, update, version, delete)
23+
- **BlueprintOptimizer**: Orchestrates BDA blueprint optimization using ground truth data to improve extraction accuracy
2124
- **CloudFormation Templates**: Templates for creating BDA projects and blueprints
2225

2326
## Usage
@@ -157,3 +160,46 @@ For optimal performance with BDA:
157160
## Thread Safety
158161

159162
The BDA service is designed to be thread-safe, supporting concurrent processing of multiple documents in parallel workloads.
163+
164+
## Blueprint Optimization
165+
166+
The `BlueprintOptimizer` class uses the BDA `InvokeBlueprintOptimizationAsync` API to improve extraction accuracy by comparing results against ground truth data.
167+
168+
### Usage
169+
170+
```python
from idp_common.bda.blueprint_optimizer import BlueprintOptimizer, OptimizationStatus
from idp_common.bda.bda_blueprint_service import BdaBlueprintService
from idp_common.bda.bda_blueprint_creator import BDABlueprintCreator
from idp_common.config.configuration_manager import ConfigurationManager

optimizer = BlueprintOptimizer(
    blueprint_service=BdaBlueprintService(),
    blueprint_creator=BDABlueprintCreator(),
    config_manager=ConfigurationManager(),
)

result = optimizer.optimize(
    class_schema=class_schema,               # IDP JSON Schema dict
    document_key="path/to/doc.pdf",          # S3 key for sample document
    ground_truth_key="path/to/gt.json",      # S3 key for ground truth
    bucket="my-bucket",
    version="default",
    status_callback=lambda msg: print(msg),  # Optional progress callback
)

if result.status == OptimizationStatus.IMPROVED:
    print(f"Accuracy improved: {result.before_metrics.exact_match:.2f} → {result.after_metrics.exact_match:.2f}")
elif result.status == OptimizationStatus.NO_IMPROVEMENT:
    print("No improvement detected, original schema kept")
elif result.status == OptimizationStatus.FAILED:
    print(f"Optimization failed: {result.error_message}")
```
198+
199+
### Key Behaviors
200+
201+
- **Blueprint Reuse**: Looks up existing blueprints in the BDA project before creating new ones
202+
- **Standard Naming**: New blueprints follow `{StackName}-{ClassName}-{hash}` convention
203+
- **LIVE Stage**: Blueprints are created and referenced in `LIVE` stage
204+
- **S3 Retry**: Results fetched with retry logic (10 attempts, 2s delay) for S3 eventual consistency
205+
- **Polling**: Exponential backoff from 5s to 30s, 15-minute timeout
