aws-solutions-library-samples
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 6 additions & 1 deletion
diff --git a/docs/idp-cli.md b/docs/idp-cli.md
Lines changed: 98 additions & 0 deletions
diff --git a/docs/idp-sdk.md b/docs/idp-sdk.md
Lines changed: 62 additions & 1 deletion
@@ -5,18 +5,23 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+### Added
+
+- **Configuration Version in Metering Database** — Added `config_version` field to the metering database to enable cost tracking and analytics per configuration version. The metering Glue table now includes a `config_version` column, and all metering Parquet files store the configuration version used for each document. Enables Athena queries to compare costs across different configurations, support A/B testing analytics, and optimize per-version costs. Documents without a config version default to "default".
+
 ## [0.5.6]
 
 ### Added
 
+- **Test Studio CLI Commands** — `idp-cli test-result` to retrieve test results with automatic evaluation triggering and `--wait`/`--output-dir` options, and `idp-cli test-compare` to compare multiple test runs with JSON/CSV export. See `docs/idp-cli.md`.
+
 - **Custom Model Fine-Tuning** — Fine-tune Amazon Nova 2 models (Lite and Pro) for document classification and extraction using your own labeled Test Sets. The end-to-end workflow — validate data, generate training data, train via Bedrock, and deploy an on-demand custom model endpoint — is driven from a new **Custom Models** page in the Web UI. Custom models can then be selected in any configuration version for classification and/or extraction. Available to Admin and Author roles. **Note:** currently requires deployment in `us-east-1`. See `docs/custom-model-finetuning.md`.
 
 - **External SAML/OIDC Identity Provider Federation** — Optional support for federating authentication through an external SAML or OIDC identity provider via Amazon Cognito. Enables organizations to use existing enterprise identity providers (PingOne, Okta, Microsoft Entra ID, etc.) for single sign-on. All federation functionality is opt-in through 12 new CloudFormation parameters — leaving them empty results in zero additional resources and identical behavior to existing Cognito-native authentication. See `docs/external-idp.md`.
 
 - **Private Network Deployment** — Deploy the IDP Accelerator in fully private / air-gapped environments. New `AppSyncVisibility` parameter (`GLOBAL` | `PRIVATE`) makes the AppSync API accessible only from inside the VPC. All processing Lambda functions (21 across 3 templates) are conditionally placed in customer VPC subnets with an HTTPS-only security group. Includes a separate VPC endpoint CloudFormation template (`scripts/vpc-endpoints.yaml`) with 16 interface endpoints (AppSync, Bedrock, SQS, DynamoDB, S3, Lambda, SSM, KMS, STS, Textract, and more) and per-endpoint creation flags to skip pre-existing endpoints. All features are off by default — existing deployments are completely unaffected. See `docs/deployment-private-network.md`.
 
 - **Enhanced Information Panels** — Added comprehensive help content to the Information (ⓘ) panel on every page in the Web UI. Each panel now includes a feature summary, list of key capabilities, and "Learn more" links to relevant docs-site documentation pages. Created new panels for 8 pages that previously had none (Pricing, Capacity Planning, Custom Models, Discovery, User Management, Test Studio), and enriched the existing 7 panels with fuller descriptions and documentation links.
-
 ### Changed
 
 - **Removed Claude Sonnet 4:1m and Sonnet 4.5:1m model variants** — The 1M context window beta for Claude Sonnet 4 (`claude-sonnet-4-20250514-v1:0:1m`) and Sonnet 4.5 (`claude-sonnet-4-5-20250929-v1:0:1m`) is being retired effective April 30, 2026. These `:1m` model variants have been removed from all enum lists, UI dropdowns, quota code mappings, pricing, and documentation. Users needing 1M context windows should migrate to Claude Sonnet 4.6 (`claude-sonnet-4-6:1m`), where the 1M context window is generally available (GA).
 
@@ -50,6 +50,8 @@ https://github.com/user-attachments/assets/3d448a74-ba5b-4a4a-96ad-ec03ac0b4d7d
   - [config-list](#config-list)
   - [config-activate](#config-activate)
   - [config-delete](#config-delete)
+  - [test-result](#test-result)
+  - [test-compare](#test-compare)
   - [chat](#chat)
 - [Complete Evaluation Workflow](#complete-evaluation-workflow)
   - [Step 1: Deploy Your Stack](#step-1-deploy-your-stack)
@@ -2055,6 +2057,102 @@ This uses the same mechanism as the Web UI configuration management system.
 
 ---
 
+### `test-result`
+
+Get test results for a specific Test Studio test run with automatic evaluation triggering.
+
+**Usage:**
+```bash
+idp-cli test-result [OPTIONS]
+```
+
+**Options:**
+- `--stack-name` (required): CloudFormation stack name
+- `--test-run-id` (required): Test run ID to retrieve results for
+- `--wait`: Wait for evaluation to complete (polls until metrics are calculated)
+- `--timeout`: Timeout in seconds when using `--wait` (default: 600)
+- `--output-dir`: Directory to save results as a JSON file
+- `--region`: AWS region (optional)
+
+**Examples:**
+```bash
+# Get results immediately (may show "EVALUATING" status if metrics not ready)
+idp-cli test-result \
+  --stack-name my-stack \
+  --test-run-id fake-w2-20260409-123456
+
+# Wait for evaluation to complete (recommended for CI/CD)
+idp-cli test-result \
+  --stack-name my-stack \
+  --test-run-id fake-w2-20260409-123456 \
+  --wait --timeout 900
+
+# Save results to JSON file
+idp-cli test-result \
+  --stack-name my-stack \
+  --test-run-id fake-w2-20260409-123456 \
+  --wait --output-dir ./results
+```
+
+**Output:**
+- Overall accuracy, precision, recall, F1 score
+- Total cost
+- Files completed/failed
+- Created/completed timestamps
+- JSON file: `<test-run-id>-result.json` (when `--output-dir` specified)
+
+**Behavior:**
+- Triggers lazy evaluation if metrics are not yet calculated (first call after the test run completes)
+- Polls Lambda every 10 seconds when `--wait` is used
+- Returns complete test run data including field-level metrics and cost breakdown
+
+---
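The `--wait` behavior described above (poll at a fixed interval until the metrics are ready, give up after `--timeout` seconds) can be sketched as a small loop. This is a minimal illustration, not the CLI's actual implementation: `wait_for_result`, the `fetch` callable, and the `"status"` key are hypothetical, and only the `EVALUATING` status, the 10-second interval, and the 600-second default come from the text above.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    fetch: Callable[[], Dict[str, Any]],
    timeout: float = 600.0,
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch() until the result is no longer EVALUATING or the timeout is hit.

    `fetch` is a hypothetical stand-in for whatever retrieves the test run;
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        # Assumption: the payload exposes its state under a "status" key.
        if result.get("status") != "EVALUATING":
            return result
        # Stop if another full poll interval would overrun the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"evaluation still running after {timeout} seconds")
        sleep(poll_interval)
```

Using a monotonic clock for the deadline keeps the timeout unaffected by system clock adjustments during a long CI run.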
+
+### `test-compare`
+
+Compare metrics and configurations from multiple Test Studio test runs.
+
+**Usage:**
+```bash
+idp-cli test-compare [OPTIONS]
+```
+
+**Options:**
+- `--stack-name` (required): CloudFormation stack name
+- `--test-run-ids` (required): Comma-separated list of test run IDs to compare (minimum 2)
+- `--output-dir`: Directory to save comparison as JSON and CSV files
+- `--region`: AWS region (optional)
+
+**Examples:**
+```bash
+# Compare two test runs
+idp-cli test-compare \
+  --stack-name my-stack \
+  --test-run-ids "fake-w2-20260409-123456,fake-w2-20260409-234567"
+
+# Compare multiple runs and export to files
+idp-cli test-compare \
+  --stack-name my-stack \
+  --test-run-ids "run1,run2,run3" \
+  --output-dir ./comparisons
+```
+
+**Output:**
+- **Console**: Side-by-side table with accuracy, precision, recall, F1 score, and cost for each test run
+- **JSON file**: `comparison-<timestamp>.json` - Complete comparison data with full test results and config differences
+- **CSV file**: `comparison-<timestamp>.csv` - Metrics table suitable for spreadsheets
+
+**Configuration Differences:**
+- Automatically detects and displays configuration differences between test runs
+- Shows nested config paths (e.g., `classification.model`, `extraction.temperature`)
+- Highlights values that differ across test runs
+
+**Requirements:**
+- All test runs must be in `COMPLETE` or `PARTIAL_COMPLETE` status
+- Minimum 2 test runs required for comparison
+
+---
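One way to picture the configuration-difference report is to flatten each run's nested config into dotted paths (such as `classification.model`) and keep only the paths whose values are not identical across runs. The sketch below works on plain dictionaries under that assumption; `flatten`, `config_differences`, and the `<missing>` placeholder are hypothetical names, and the CLI's real comparison logic may differ.

```python
from typing import Any, Dict, Iterator, Tuple

_MISSING = "<missing>"  # placeholder for a path that a run's config does not define


def flatten(config: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def config_differences(configs: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map each differing dotted path to its value in every run.

    `configs` maps a test run ID to that run's (nested) configuration.
    """
    flat = {run_id: dict(flatten(cfg)) for run_id, cfg in configs.items()}
    all_paths = sorted({path for values in flat.values() for path in values})
    differences: Dict[str, Dict[str, Any]] = {}
    for path in all_paths:
        per_run = {run_id: values.get(path, _MISSING) for run_id, values in flat.items()}
        # Compare by repr so unhashable values such as lists are handled too.
        if len({repr(value) for value in per_run.values()}) > 1:
            differences[path] = per_run
    return differences
```

Paths that are identical in every run are dropped, so the result contains only the settings that could explain a metric difference.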
+
 ### `discover`
 
 Discover document class schemas from sample documents using Amazon Bedrock.
 
@@ -1611,7 +1611,7 @@ else:
 
 ## Testing Operations
 
-Operations for load testing and performance validation.
+Operations for load testing, Test Studio evaluation results, and performance validation.
 
 ### testing.load_test()
 
@@ -1640,6 +1640,65 @@ print(f"Total files: {result.total_files}")
 print(f"Success: {result.success}")
 ```
 
+### testing.get_test_result()
+
+Get Test Studio evaluation results for a specific test run.
+
+**Parameters:**
+- `test_run_id` (str, required): Test run identifier
+- `stack_name` (str, optional): Stack name override
+- `wait` (bool, optional): Wait for test run to complete if still in progress (default: False)
+- `timeout` (int, optional): Maximum wait time in seconds (default: 300)
+- `poll_interval` (int, optional): Polling interval in seconds (default: 5)
+
+**Returns:** `TestRunResult` with evaluation metrics
+
+```python
+# Get result immediately (may be evaluating)
+result = client.testing.get_test_result(
+    test_run_id="Fake-W2-Tax-Forms-20260410-173735"
+)
+
+# Wait for evaluation to complete
+result = client.testing.get_test_result(
+    test_run_id="Fake-W2-Tax-Forms-20260410-173735",
+    wait=True,
+    timeout=900
+)
+
+print(f"Status: {result.status}")
+print(f"Overall Accuracy: {result.overall_accuracy:.2%}")
+print(f"Precision: {result.accuracy_breakdown['precision']:.2%}")
+print(f"Recall: {result.accuracy_breakdown['recall']:.2%}")
+print(f"F1 Score: {result.accuracy_breakdown['f1_score']:.2%}")
+print(f"Total Cost: ${result.total_cost:.2f}")
+```
+
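For readers checking the printed numbers by hand: F1 is conventionally the harmonic mean of precision and recall. Whether the SDK derives `f1_score` from the overall precision and recall in exactly this way, or averages per-field scores instead, is an assumption this sketch does not settle; `f1_score` below is a standalone helper, not an SDK function.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```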
+### testing.compare_test_runs()
+
+Compare multiple Test Studio evaluation runs.
+
+**Parameters:**
+- `test_run_ids` (list[str], required): List of test run identifiers to compare (minimum 2)
+- `stack_name` (str, optional): Stack name override
+
+**Returns:** `TestComparisonResult` with metrics for each test run
+
+```python
+result = client.testing.compare_test_runs(
+    test_run_ids=[
+        "Fake-W2-Tax-Forms-20260410-173735",
+        "Fake-W2-Tax-Forms-20260409-191545"
+    ]
+)
+
+for test_run_id, metrics in result.metrics.items():
+    print(f"\nTest Run: {test_run_id}")
+    print(f"  Accuracy: {metrics['overallAccuracy']:.2%}")
+    print(f"  Completed: {metrics['completedFiles']}/{metrics['filesCount']}")
+    print(f"  Cost: ${metrics['totalCost']:.2f}")
+```
+
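Once the per-run metrics are in hand, a common follow-up is to rank the runs. The sketch below sorts by accuracy (highest first) and breaks ties by cost (lowest first). It assumes only that each metrics entry is a mapping with the `overallAccuracy` and `totalCost` keys used in the example above; `rank_runs` is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Dict, List, Tuple


def rank_runs(metrics: Dict[str, Dict[str, Any]]) -> List[Tuple[str, float, float]]:
    """Return (test_run_id, accuracy, cost) tuples, best accuracy first.

    Ties on accuracy are broken by lower total cost.
    """
    rows = [
        (run_id, float(m["overallAccuracy"]), float(m["totalCost"]))
        for run_id, m in metrics.items()
    ]
    return sorted(rows, key=lambda row: (-row[1], row[2]))
```

With a real comparison, this would be called as `rank_runs(result.metrics)`, assuming `result.metrics` behaves like the dictionary iterated in the example above.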
 ---
 
 ## Response Models
@@ -1728,6 +1787,8 @@ from idp_sdk import (
     ExecutionsStoppedResult,
     DocumentsAbortedResult,
     LoadTestResult,
+    TestRunResult,
+    TestComparisonResult,
 
     # Enums
     DocumentState,