Commit cc745ae

rstrahan and Bob Strahan authored
feat: Make Discovery accessible from CLI and SDK (#228) (#232)
* Add discovery CLI command and SDK operations for automated schema generation
* Add discovery SDK/CLI with local and stack-connected modes, remove S3 upload dependency
* docs: add discovery CLI command and SDK operations documentation
* Add Discovery module documentation to README files
* Add batch ground truth matching and flexible output modes to discovery CLI
* Add stdout output for discover command in batch mode when no output file specified

---------

Co-authored-by: Bob Strahan <strahans@amazon.com>
1 parent 46b6bae commit cc745ae

14 files changed

Lines changed: 1197 additions & 18 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@ SPDX-License-Identifier: MIT-0

## [Unreleased]

### Added

- **Discovery accessible from CLI and SDK** — Discovery can now be run programmatically via the IDP SDK (`client.discovery.run()`) and CLI (`idp-cli discover`), enabling users with many document classes to automate schema generation without the Web UI. Supports both modes: without ground truth (exploratory) and with ground truth (optimized). ([#228](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/228))

### Changed

- **Sync to BDA no longer auto-activates the config version** — Previously, performing "Sync to BDA" would automatically set the current config version as active. Since each config version now has its own BDA project, auto-activation is unnecessary. Users can manually choose which version to activate via the Versions table. The "Sync to BDA" confirmation modal text has been updated accordingly.

config.yaml

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
key: value

docs/idp-cli.md

Lines changed: 49 additions & 0 deletions
@@ -1995,6 +1995,55 @@ This uses the same mechanism as the Web UI configuration management system.

---

### `discover`

Discover document class schemas from sample documents using Amazon Bedrock.

**Two modes:**
- **Stack-connected** (`--stack-name`): Uses stack's discovery config and saves schema to DynamoDB configuration
- **Local** (no `--stack-name`): Uses system default Bedrock settings, prints schema to stdout without saving

**Ground truth matching:** Ground truth files (`-g`) are auto-matched to documents (`-d`) by filename stem. For example, `invoice.pdf` matches `invoice.json`. Unmatched documents run without ground truth.
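
The stem-matching rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the CLI's actual implementation; the helper name `match_ground_truth` is hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def match_ground_truth(documents: List[str], ground_truths: List[str]) -> List[Optional[str]]:
    """Hypothetical helper: return one ground truth path (or None) per document."""
    # Index ground truth files by filename stem, e.g. "invoice" for ./invoice.json
    by_stem = {Path(gt).stem: gt for gt in ground_truths}
    # Documents without a matching stem get None and would run without ground truth
    return [by_stem.get(Path(doc).stem) for doc in documents]


matched = match_ground_truth(
    ["./invoice.pdf", "./w2.pdf", "./paystub.png"],
    ["./w2.json", "./invoice.json"],
)
print(matched)
```

The order of the `-g` arguments does not matter under this scheme, since each document looks up its own stem.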

**Output behavior:**
- Single document: `-o` writes the schema to the specified file
- Batch + `-o` is a directory (or has no extension): writes one `{class_name}.json` per schema
- Batch + `-o` is a file: writes all schemas as a JSON array
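
As a rough sketch of the three output rules listed above, the following Python picks a mode from the number of schemas and the shape of the output path. The function `write_schemas` and its `(class_name, schema)` input format are assumptions made for illustration; the CLI's own logic may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_schemas(schemas: List[Tuple[str, dict]], output: str) -> List[Path]:
    """Hypothetical helper mirroring the documented -o behavior."""
    out = Path(output)
    if len(schemas) == 1:
        # Single document: write the one schema to the given file
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(schemas[0][1], indent=2))
        return [out]
    if out.is_dir() or out.suffix == "":
        # Batch + directory (or no extension): one {class_name}.json per schema
        out.mkdir(parents=True, exist_ok=True)
        written = []
        for class_name, schema in schemas:
            target = out / f"{class_name}.json"
            target.write_text(json.dumps(schema, indent=2))
            written.append(target)
        return written
    # Batch + file: all schemas as a single JSON array
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps([schema for _, schema in schemas], indent=2))
    return [out]


with tempfile.TemporaryDirectory() as tmp:
    demo = [("Invoice", {"type": "object"}), ("W2", {"type": "object"})]
    per_class = write_schemas(demo, str(Path(tmp) / "schemas"))
    combined = write_schemas(demo, str(Path(tmp) / "all-schemas.json"))
    per_class_names = [p.name for p in per_class]
    combined_count = len(json.loads(combined[0].read_text()))

print(per_class_names, combined_count)
```
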

```bash
# Single document (local mode — no stack needed)
idp-cli discover -d ./invoice.pdf

# With ground truth (matched by filename stem)
idp-cli discover -d ./invoice.pdf -g ./invoice.json

# Save schema to file
idp-cli discover -d ./form.pdf -o ./form-schema.json

# Batch with auto-matched ground truth
idp-cli discover -d ./invoice.pdf -d ./w2.pdf -g ./invoice.json -g ./w2.json

# Batch output to directory (one file per schema)
idp-cli discover -d ./invoice.pdf -d ./w2.pdf -o ./schemas/

# Batch output to single file (JSON array)
idp-cli discover -d ./invoice.pdf -d ./w2.pdf -o ./all-schemas.json

# Stack mode (saves to config)
idp-cli discover --stack-name my-stack -d ./invoice.pdf --config-version v2
```

| Option | Description |
|--------|-------------|
| `--stack-name` | CloudFormation stack name (optional — omit for local mode) |
| `-d, --document` | Path to document file (required, repeatable for batch) |
| `-g, --ground-truth` | Path to JSON ground truth file(s) (repeatable, auto-matched by filename stem) |
| `--config-version` | Config version to save to (stack mode only) |
| `-o, --output` | Output path: file (single/JSON array) or directory (one file per schema) |
| `--region` | AWS region |

---

## Troubleshooting

### Stack Not Found

docs/idp-sdk.md

Lines changed: 74 additions & 0 deletions
@@ -944,6 +944,80 @@ else:

---

## Discovery Operations

Discover document class schemas from sample documents using Amazon Bedrock.

**Two modes:**
- **Stack-connected** (with `stack_name`): Uses the stack's discovery config from DynamoDB, saves discovered schema to config
- **Local** (without `stack_name`): Uses system default Bedrock settings, returns schema without saving

### discovery.run()

Analyze a document to generate a JSON Schema definition for a document class.

```python
import json

# IDPClient is assumed to be imported as shown earlier in this guide

# Local mode: no stack needed
client = IDPClient()
result = client.discovery.run("./invoice.pdf")
print(json.dumps(result.json_schema, indent=2))

# Stack mode: uses stack config, saves schema
client = IDPClient(stack_name="my-stack")
result = client.discovery.run("./w2-form.pdf")

# With ground truth for better accuracy
result = client.discovery.run(
    "./invoice.pdf",
    ground_truth_path="./invoice-expected.json"
)

# Save to specific config version
result = client.discovery.run(
    "./form.pdf",
    config_version="v2"
)
```

**Parameters:**
- `document_path` (str, required): Local path to document file (PDF, PNG, JPG, TIFF)
- `ground_truth_path` (str, optional): Path to JSON ground truth file
- `config_version` (str, optional): Config version to save to (stack mode only)
- `stack_name` (str, optional): Stack name override

**Returns:** `DiscoveryResult` with `status`, `document_class`, `json_schema`, `config_version`, `document_path`, `error`
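
One way a caller might consume these fields is sketched below. The `FakeDiscoveryResult` class is a hypothetical stand-in that only mirrors the field names listed above, so the snippet runs without the SDK. The status strings and the convention that a non-empty `error` means failure are assumptions, not documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDiscoveryResult:
    """Hypothetical stand-in with the documented DiscoveryResult field names."""
    status: str
    document_class: Optional[str] = None
    json_schema: Optional[dict] = None
    config_version: Optional[str] = None
    document_path: Optional[str] = None
    error: Optional[str] = None


def describe(result) -> str:
    """Summarize a result; treats a non-empty `error` as failure (an assumption)."""
    if result.error:
        return f"{result.document_path}: discovery failed ({result.error})"
    fields = sorted((result.json_schema or {}).get("properties", {}))
    return f"{result.document_path}: class {result.document_class!r} with fields {fields}"


ok = FakeDiscoveryResult(
    status="SUCCESS",  # placeholder value, not a documented status string
    document_class="Invoice",
    json_schema={
        "type": "object",
        "properties": {"total": {"type": "number"}, "invoice_number": {"type": "string"}},
    },
    document_path="./invoice.pdf",
)
bad = FakeDiscoveryResult(
    status="FAILED",  # placeholder value
    document_path="./blurry.png",
    error="could not read document",
)
ok_line = describe(ok)
bad_line = describe(bad)
print(ok_line)
print(bad_line)
```
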

### discovery.run_batch()

Run discovery on multiple documents sequentially. Ground truth paths are
auto-matched to documents by filename stem.

```python
# Batch without ground truth
result = client.discovery.run_batch([
    "./invoice.pdf",
    "./w2-form.pdf",
    "./paystub.png",
])
print(f"Succeeded: {result.succeeded}/{result.total}")

# Batch with selective ground truth (matched by position)
result = client.discovery.run_batch(
    ["./invoice.pdf", "./w2.pdf"],
    ground_truth_paths=[None, "./w2.json"],
)
```

**Parameters:**
- `document_paths` (list, required): List of local file paths
- `ground_truth_paths` (list, optional): Parallel list of ground truth paths (use None for docs without GT)
- `config_version` (str, optional): Config version to save to
- `stack_name` (str, optional): Stack name override

**Returns:** `DiscoveryBatchResult` with `total`, `succeeded`, `failed`, `results`
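
A batch result could be inspected along these lines to find documents worth retrying. The sketch assumes that `results` holds one per-document object with the `document_path` and `error` fields of `DiscoveryResult`; it uses `SimpleNamespace` stand-ins instead of real SDK objects, and `failed_documents` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import List


def failed_documents(batch) -> List[str]:
    """Hypothetical helper: paths of documents whose result carries an error."""
    return [r.document_path for r in batch.results if r.error]


# Stand-in for a DiscoveryBatchResult, using only the documented field names
batch = SimpleNamespace(
    total=3,
    succeeded=2,
    failed=1,
    results=[
        SimpleNamespace(document_path="./invoice.pdf", error=None),
        SimpleNamespace(document_path="./w2-form.pdf", error=None),
        SimpleNamespace(document_path="./paystub.png", error="could not read document"),
    ],
)

retry = failed_documents(batch)
print(f"Succeeded: {batch.succeeded}/{batch.total}; retry: {retry}")
```
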

---

## Manifest Operations

Operations for manifest generation and validation.
