docs/discovery.md (98 additions & 0 deletions)
@@ -844,13 +844,111 @@ BDABlueprintPermissions:
- bedrock:ListBlueprints
- bedrock:GetBlueprint
- bedrock:DeleteBlueprint
- bedrock:InvokeBlueprintOptimizationAsync
- bedrock:GetBlueprintOptimizationStatus
```

**Monitoring:**
- Blueprint creation/update activities are logged to CloudWatch
- Schema conversion details are captured
- Error conditions are clearly documented

### Blueprint Optimization

The Blueprint Optimization feature uses the BDA `InvokeBlueprintOptimizationAsync` API to automatically improve extraction accuracy for discovered document classes. When a discovery job includes a ground truth file, the system can optimize the BDA blueprint by comparing extraction results against the ground truth and refining the blueprint schema.

#### How It Works

1. **Blueprint Lookup**: The optimizer checks if a blueprint already exists for the discovered class in the BDA project. If found, it reuses the existing blueprint; otherwise, it creates a new one following the standard naming convention (`{StackName}-{ClassName}-{hash}`).
2. **S3 Asset Preparation**: The sample document (PDF) and ground truth (JSON) S3 URIs are constructed from the discovery bucket.
3. **Optimization Invocation**: The `InvokeBlueprintOptimizationAsync` API is called with the blueprint ARN, sample document, ground truth, and an output S3 prefix.
4. **Status Polling**: The system polls `GetBlueprintOptimizationStatus` with exponential backoff (5s initial, 30s max, 15-minute timeout) until a terminal state is reached.
5. **Results Evaluation**: The optimization results (stored at `{outputPrefix}/optimization_results.json`) contain before/after metrics. The system compares `exactMatch` and `f1` scores.
6. **Schema Application**: If the optimized schema shows improvement, the blueprint is updated with the new schema, a new version is created, and the IDP class configuration is updated.
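
The status polling in step 4 can be sketched as below. This is a minimal illustration, not the code in `blueprint_optimizer.py`: `poll_until_terminal` and `get_status` are hypothetical names, the doubling factor is an assumption (the text only says "exponential"), and the terminal state names are taken from the flow diagram rather than from the BDA API reference. The status source is passed in as a callable so the sketch does not guess at the `GetBlueprintOptimizationStatus` request or response shape.

```python
import time
from typing import Callable

# Assumed terminal states, taken from the flow diagram; the real API may report others.
TERMINAL_STATES = {"Success", "ServiceError", "ClientError"}


def poll_until_terminal(
    get_status: Callable[[], str],
    initial_delay: float = 5.0,   # 5s initial delay (from the docs)
    max_delay: float = 30.0,      # 30s cap (from the docs)
    timeout: float = 900.0,       # 15-minute timeout (from the docs)
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() with exponential backoff until a terminal state or timeout."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            return "TIMED_OUT"
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        # Assumption: the delay doubles each round, capped at max_delay.
        delay = min(delay * 2, max_delay)
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time; in a Lambda the defaults would be used.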

#### Optimization Flow

```mermaid
graph TD
    A[Discovery Completes with Ground Truth] --> B[Blueprint Optimization Lambda]
    B --> C{Existing Blueprint?}
    C -->|Yes| D[Reuse Existing Blueprint]
    C -->|No| E[Create New Blueprint]
    D --> F[Invoke Optimization API]
    E --> F
    F --> G[Poll for Completion]
    G --> H{Result?}
    H -->|Success| I[Fetch Results from S3]
    H -->|ServiceError/ClientError| J[Report Failure]
    H -->|Timeout| J
    I --> K{Improved?}
    K -->|Yes| L[Update Blueprint Schema]
    K -->|No| M[Keep Original Schema]
    L --> N[Create Blueprint Version]
    N --> O[Update IDP Config]
    O --> P[Report OPTIMIZATION_COMPLETED]
    M --> P
    J --> Q[Report OPTIMIZATION_FAILED]
```
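
The final branches of the diagram reduce to a small mapping from the polling outcome to the status reported to the UI. The sketch below assumes the outcome arrives as one of the strings `Success`, `ServiceError`, `ClientError`, or `TIMED_OUT`; `reported_status` is a hypothetical helper, and treating any unrecognized outcome as a failure is an assumption, not documented behavior.

```python
def reported_status(poll_result: str) -> str:
    """Map a polling outcome to the status reported to the UI.

    Per the flow diagram, a successful run reports OPTIMIZATION_COMPLETED
    whether or not the schema improved; every other outcome reports failure.
    """
    if poll_result == "Success":
        return "OPTIMIZATION_COMPLETED"
    # Assumption: ServiceError, ClientError, TIMED_OUT, and anything
    # unrecognized are all reported as a failure.
    return "OPTIMIZATION_FAILED"
```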

#### UI Status Display

The Discovery Panel shows optimization progress with dedicated status indicators:

| Status | UI Label | Description |
|--------|----------|-------------|
| `OPTIMIZATION_IN_PROGRESS` | Optimizing | Optimization is running (blueprint creation, API invocation, polling) |
| `OPTIMIZATION_COMPLETED` | Optimized | Optimization finished (improved or no improvement) |

- **Improved**: Class name badge + accuracy improvement message (e.g., "exactMatch: 0.78 → 0.91")
- **No improvement**: Message indicating original schema was kept
- **Failed**: Expandable error details
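
The improvement message for the **Improved** outcome could be derived from the before/after metrics roughly as follows. This is a sketch under stated assumptions: the layout of `optimization_results.json` is not shown here, so the functions take two plain dicts keyed by `exactMatch` and `f1`; `is_improved` and `improvement_message` are hypothetical names; and the rule "no metric regresses and at least one strictly improves" is one plausible reading of "shows improvement", not necessarily the rule the optimizer applies.

```python
from typing import Dict

# The two metrics the documentation says are compared.
METRICS = ("exactMatch", "f1")


def is_improved(before: Dict[str, float], after: Dict[str, float]) -> bool:
    """Assumed rule: no metric regresses and at least one strictly improves."""
    deltas = [after[m] - before[m] for m in METRICS]
    return all(d >= 0 for d in deltas) and any(d > 0 for d in deltas)


def improvement_message(before: Dict[str, float], after: Dict[str, float]) -> str:
    """Format a message in the style of the UI example ("exactMatch: 0.78 → 0.91")."""
    return ", ".join(f"{m}: {before[m]:.2f} → {after[m]:.2f}" for m in METRICS)
```

For example, with `before = {"exactMatch": 0.78, "f1": 0.80}` and `after = {"exactMatch": 0.91, "f1": 0.88}`, `is_improved` returns `True` and the message reads `exactMatch: 0.78 → 0.91, f1: 0.80 → 0.88`.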

#### Components

- **`BlueprintOptimizer`** (`lib/idp_common_pkg/idp_common/bda/blueprint_optimizer.py`): Core orchestrator that manages the full optimization lifecycle, including blueprint lookup/creation, API invocation, polling, evaluation, and schema application.
- **`blueprint_optimization` Lambda** (`src/lambda/blueprint_optimization/index.py`): Async Lambda handler invoked by the discovery processor. Manages AppSync status updates and error reporting.
- **`OptimizationResult`**: Dataclass returned by the optimizer with status, metrics, blueprint ARN, and optionally the optimized schema.

#### Configuration

Blueprint optimization is disabled by default. To enable it, set both `use_bda: true` and `enable_blueprint_optimization: true` in your configuration version via the View/Edit Configuration UI or directly in the config YAML:

```yaml
use_bda: true
enable_blueprint_optimization: true
```

When enabled, the optimizer uses:
- The same BDA project as the main blueprint service (per configuration version)
- The same blueprint naming convention (`{StackName}-{ClassName}-{hash}`)
- The discovery bucket for S3 input/output URIs
- The `bedrock-data-automation` client with `boto3>=1.42.0` (bundled in the Lambda function's `requirements.txt`)
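
Because both flags must be set, a caller would typically gate the optimizer on a check like the one below. This is only an illustration: `optimization_enabled` is a hypothetical helper, and it assumes the configuration has already been loaded into a dict with `use_bda` and `enable_blueprint_optimization` as top-level boolean keys, as in the YAML above.

```python
from typing import Any, Mapping


def optimization_enabled(config: Mapping[str, Any]) -> bool:
    """Return True only when both flags are explicitly set to true.

    Missing keys count as disabled, matching the documented default.
    """
    return (
        config.get("use_bda") is True
        and config.get("enable_blueprint_optimization") is True
    )
```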

#### IAM Permissions

The Blueprint Optimization Lambda requires these additional Bedrock permissions (configured in `template.yaml`):

- **S3 Eventual Consistency**: The optimization results file may not be immediately available after the API reports success. The system retries up to 5 times with 2-second delays.
- **Polling Timeout**: If optimization doesn't complete within 15 minutes, the result is `TIMED_OUT`.
- **API Errors**: `ServiceError` and `ClientError` from the BDA API are captured and reported as `OPTIMIZATION_FAILED`.
- **Blueprint Not Found**: If the blueprint stage doesn't match (must be `LIVE`), the API returns `ResourceNotFoundException`.
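
The retry behavior described for the results file might look like the following sketch. The names `fetch_results_with_retry` and `ResultsNotReady` are hypothetical, and "up to 5 times" is read here as five attempts in total, which is an assumption. The S3 read is passed in as a callable so the retry logic stays independent of the client.

```python
import json
import time
from typing import Any, Callable, Dict


class ResultsNotReady(Exception):
    """Raised by the fetch callable when the results object is not yet visible."""


def fetch_results_with_retry(
    fetch: Callable[[], str],
    attempts: int = 5,            # assumption: 5 attempts in total
    delay_seconds: float = 2.0,   # 2-second delay between attempts (from the docs)
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until it returns the results JSON or attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return json.loads(fetch())
        except ResultsNotReady as exc:
            last_error = exc
            # Only wait if another attempt will follow.
            if attempt < attempts:
                sleep(delay_seconds)
    raise TimeoutError(f"results not available after {attempts} attempts") from last_error
```

In practice `fetch` would wrap an S3 `get_object` call for `{outputPrefix}/optimization_results.json` and raise `ResultsNotReady` when the key does not exist yet; any other error propagates immediately rather than being retried.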

### BdaIDP Sync Feature

The BdaIDP Sync feature provides bidirectional synchronization between BDA (Bedrock Data Automation) blueprints and IDP custom classes. This feature enables seamless integration between BDA's blueprint management system and IDP's document class configuration, with support for AWS Standard blueprints, optimized parallel processing, and configurable **Replace** or **Merge** sync modes.
- **BlueprintOptimizer**: Orchestrates BDA blueprint optimization using ground truth data to improve extraction accuracy
- **CloudFormation Templates**: Templates for creating BDA projects and blueprints

## Usage
@@ -157,3 +160,46 @@ For optimal performance with BDA:
## Thread Safety

The BDA service is designed to be thread-safe, supporting concurrent processing of multiple documents in parallel workloads.

## Blueprint Optimization

The `BlueprintOptimizer` class uses the BDA `InvokeBlueprintOptimizationAsync` API to improve extraction accuracy by comparing results against ground truth data.

### Usage

```python
from idp_common.bda.blueprint_optimizer import BlueprintOptimizer, OptimizationStatus
from idp_common.bda.bda_blueprint_service import BdaBlueprintService
from idp_common.bda.bda_blueprint_creator import BDABlueprintCreator
from idp_common.config.configuration_manager import ConfigurationManager

optimizer = BlueprintOptimizer(
    blueprint_service=BdaBlueprintService(),
    blueprint_creator=BDABlueprintCreator(),
    config_manager=ConfigurationManager(),
)

result = optimizer.optimize(
    class_schema=class_schema,  # IDP JSON Schema dict
    document_key="path/to/doc.pdf",  # S3 key for sample document
    ground_truth_key="path/to/gt.json",  # S3 key for ground truth