
Commit 8fae735

Merge branch 'develop'
2 parents d0cd8a9 + 15620aa commit 8fae735

98 files changed

Lines changed: 8479 additions & 5672 deletions

CHANGELOG.md

Lines changed: 23 additions & 0 deletions
@@ -4,7 +4,30 @@ SPDX-License-Identifier: MIT-0
44
# Changelog
55

66
## [Unreleased]
7+
8+
## [0.5.8]
9+
10+
### Added
11+
12+
- **Excluded-class feature — skip static instruction / legal / boilerplate pages** — Government forms and similar packages often bundle static informational pages (legal warnings, fee instructions, tax notices, oaths) alongside the pages that carry applicant data. Mark a document class with `x-aws-idp-exclude-from-processing: true` and all downstream stages (extraction, assessment, summarization, rule validation, evaluation) skip sections classified as that class — making **zero LLM calls** on boilerplate pages.
13+
- Optional `x-aws-idp-exclusion-reason` ("instructions", "legal", "cover-page", …) surfaces as a grey **`Skipped: <reason>`** badge in the UI Sections panel and as an **"Excluded Sections (Not Evaluated)"** table in the evaluation markdown report.
14+
- Configurable via the **UI Configuration Editor** → Document Schema → select a document-type class → "Exclude from Processing" checkbox + "Exclusion Reason" input.
15+
- New end-to-end sample config at `config_library/unified/ds11-passport-application/` with a matching DS-11 U.S. Passport Application PDF fixture and a standalone demo notebook (`notebooks/usecase-specific-examples/ds11-passport-application/`).
16+
- Additive: classes without the new flag behave exactly as before.
17+
- See `docs/classification.md#excluding-static-pages-eg-instructions-legal-boilerplate`.
18+
19+
### Changed
720

21+
- **UI dependency cleanup — eliminated 11 of 12 npm deprecation warnings** — Replaced deprecated `@aws-sdk/*` packages with `@smithy/*` equivalents, removed unused Babel plugins, migrated ESLint 8→9 (flat config), upgraded Prettier 2→3, and upgraded jsdom 26→29. Added `"type": "module"` to `package.json`. Also added `caughtErrors: 'none'` to ESLint config to stop flagging unused catch clause variables. Added `FORCE=1` arg to `make ui-lint` to force re-run despite checksum match.
22+
23+
- **Headless deployment documentation generalized** — headless mode is no longer documented as a GovCloud-only capability. New `docs/headless-deployment.md` is the canonical guide to headless deployment in both Commercial and GovCloud regions. It covers the main use cases (API-only / pipeline integrations, organizational restrictions on UI-layer services, cost optimization) and notes that headless mode remains required for GovCloud.
24+
25+
## Templates
26+
- us-west-2: `https://s3.us-west-2.amazonaws.com/aws-ml-blog-us-west-2/artifacts/genai-idp/idp-main_0.5.8.yaml`
27+
- us-east-1: `https://s3.us-east-1.amazonaws.com/aws-ml-blog-us-east-1/artifacts/genai-idp/idp-main_0.5.8.yaml`
28+
- eu-central-1: `https://s3.eu-central-1.amazonaws.com/aws-ml-blog-eu-central-1/artifacts/genai-idp/idp-main_0.5.8.yaml`
29+
30+
831
## [0.5.7]
932

1033
### Added

Makefile

Lines changed: 84 additions & 5 deletions
@@ -21,6 +21,10 @@ else
2121
PIP := $(CURDIR)/$(VENV_DIR)/bin/pip
2222
endif
2323

24+
# idp-cli invocation — uses `python -m idp_cli.cli` so it works whether or not
25+
# the virtualenv is activated (picks up $(PYTHON) which prefers .venv).
26+
IDP_CLI := $(PYTHON) -m idp_cli.cli
27+
2428
##@ General
2529
.PHONY: help
2630
help: ## Show this help message
@@ -101,7 +105,7 @@ setup-venv: ## Create .venv and install all packages into it
101105
@echo -e "$(YELLOW) To activate manually: source $(VENV_DIR)/bin/activate$(NC)"
102106

103107
##@ Code Quality
104-
lint: ruff-lint format check-arn-partitions validate-buildspec ui-lint codegen-check ## Run all linting (ruff, format, ARN checks, buildspec, UI, codegen)
108+
lint: ruff-lint format check-arn-partitions validate-buildspec ui-lint codegen-check ## Run all linting (ruff, format, ARN checks, buildspec, UI, codegen). Use FORCE=1 to force UI lint re-run despite checksum match.
105109
fastlint: ruff-lint format check-arn-partitions validate-buildspec ## Quick lint without UI checks
106110

107111
ruff-lint: ## Run ruff linting with auto-fix
@@ -251,17 +255,21 @@ endif
251255
@echo "Starting UI development server..."
252256
cd src/ui && npm run start
253257

254-
ui-lint: ## Run UI linting with checksum caching (skips if unchanged)
258+
ui-lint: ## Run UI linting with checksum caching (skips if unchanged). Use FORCE=1 to force re-run.
255259
@echo "Checking if UI lint is needed..."
256260
@CURRENT_HASH=$$($(PYTHON) -c "from publish import IDPPublisher; p = IDPPublisher(); print(p.get_directory_checksum('src/ui'))"); \
257261
STORED_HASH=$$(test -f src/ui/.checksum && cat src/ui/.checksum || echo ""); \
258-
if [ "$$CURRENT_HASH" != "$$STORED_HASH" ]; then \
259-
echo "UI code checksum changed - running lint..."; \
262+
if [ -n "$(FORCE)" ] || [ "$$CURRENT_HASH" != "$$STORED_HASH" ]; then \
263+
if [ -n "$(FORCE)" ]; then \
264+
echo "FORCE=1 set - running lint..."; \
265+
else \
266+
echo "UI code checksum changed - running lint..."; \
267+
fi; \
260268
cd src/ui && npm ci --prefer-offline --no-audit && npm run lint -- --fix && npm run typecheck || exit 1; \
261269
echo "$$CURRENT_HASH" > .checksum; \
262270
echo -e "$(GREEN)✅ UI lint and typecheck completed and checksum updated$(NC)"; \
263271
else \
264-
echo -e "$(GREEN)✅ UI code checksum unchanged - skipping lint$(NC)"; \
272+
echo -e "$(GREEN)✅ UI code checksum unchanged - skipping lint (use FORCE=1 to force re-run)$(NC)"; \
265273
fi
266274

267275
ui-build: ## Build UI for production
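The `ui-lint` change above boils down to one extra condition: run the lint when `FORCE` is non-empty or when the directory checksum differs from the stored one. A minimal Python sketch of that decision follows. `directory_checksum` is a hypothetical stand-in for `IDPPublisher.get_directory_checksum`, whose real hashing scheme is not shown in this diff, and `should_run_lint` is an illustrative name.

```python
import hashlib
from pathlib import Path


def directory_checksum(root: Path) -> str:
    """Hypothetical stand-in: hash the relative path and bytes of every file under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(str(path.relative_to(root)).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def should_run_lint(current_hash: str, stored_hash: str, force: str = "") -> bool:
    """Mirror the shell test: [ -n "$FORCE" ] || [ "$CURRENT_HASH" != "$STORED_HASH" ]."""
    return bool(force) or current_hash != stored_hash
```

Because the recipe tests `-n "$(FORCE)"`, any non-empty value forces a re-run, so under this reading `FORCE=0` behaves the same as `FORCE=1`.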
@@ -393,3 +401,74 @@ dsr-scan: ## Run DSR security scan
393401
dsr-fix: ## Run DSR interactive fix
394402
@echo "Running DSR interactive fix..."
395403
$(PYTHON) scripts/dsr/fix.py
404+
405+
##@ Deploy
406+
# Thin wrappers around `idp-cli publish` / `deploy` / `delete` for the common
407+
# 80% case. Uncommon flags can still be passed via EXTRA_ARGS="--foo --bar".
408+
# See 'docs/idp-cli.md' (or 'idp-cli <cmd> --help') for the full option list.
409+
410+
.PHONY: publish deploy delete-stack
411+
412+
# Usage examples:
413+
# make publish REGION=us-east-1
414+
# make publish REGION=us-east-1 BUCKET_BASENAME=my-idp-artifacts PREFIX=v1
415+
# make publish REGION=us-gov-west-1 HEADLESS=1
416+
# make publish REGION=us-east-1 PUBLIC=1 EXTRA_ARGS="--clean-build --verbose"
417+
publish: ## Build & publish IDP artifacts to S3 (Usage: make publish REGION=... [BUCKET_BASENAME=...] [PREFIX=...] [HEADLESS=1] [PUBLIC=1] [EXTRA_ARGS=...])
418+
ifndef REGION
419+
$(error REGION is not set. Usage: make publish REGION=us-east-1 [BUCKET_BASENAME=...] [PREFIX=...] [HEADLESS=1] [PUBLIC=1] [EXTRA_ARGS=...])
420+
endif
421+
@echo -e "$(CYAN)Running idp-cli publish (region=$(REGION))...$(NC)"
422+
$(IDP_CLI) publish \
423+
--source-dir . \
424+
--region $(REGION) \
425+
$(if $(BUCKET_BASENAME),--bucket-basename $(BUCKET_BASENAME)) \
426+
$(if $(PREFIX),--prefix $(PREFIX)) \
427+
$(if $(HEADLESS),--headless) \
428+
$(if $(PUBLIC),--public) \
429+
$(EXTRA_ARGS)
430+
431+
# Usage examples:
432+
# make deploy STACK_NAME=my-idp ADMIN_EMAIL=me@example.com # create new stack
433+
# make deploy STACK_NAME=my-idp # update existing stack
434+
# make deploy STACK_NAME=my-idp-dev ADMIN_EMAIL=me@example.com FROM_CODE=1 # build & deploy from local source
435+
# make deploy STACK_NAME=my-idp ADMIN_EMAIL=me@example.com HEADLESS=1 # headless (no UI)
436+
# make deploy STACK_NAME=my-idp CUSTOM_CONFIG=./my-config.yaml # update config on existing stack
437+
# make deploy STACK_NAME=my-idp NO_WAIT=1 # fire-and-forget (default is --wait)
438+
# make deploy STACK_NAME=my-idp EXTRA_ARGS="--max-concurrent 200 --log-level DEBUG"
439+
deploy: ## Deploy/update IDP CloudFormation stack (Usage: make deploy STACK_NAME=... [ADMIN_EMAIL=...] [REGION=...] [FROM_CODE=1] [HEADLESS=1] [CUSTOM_CONFIG=...] [TEMPLATE_URL=...] [TEMPLATE_FILE=...] [NO_WAIT=1] [EXTRA_ARGS=...])
440+
ifndef STACK_NAME
441+
$(error STACK_NAME is not set. Usage: make deploy STACK_NAME=my-stack [ADMIN_EMAIL=...] [REGION=...] [FROM_CODE=1] [HEADLESS=1] [CUSTOM_CONFIG=...] [NO_WAIT=1] [EXTRA_ARGS=...])
442+
endif
443+
@echo -e "$(CYAN)Running idp-cli deploy (stack=$(STACK_NAME))...$(NC)"
444+
$(IDP_CLI) deploy \
445+
--stack-name $(STACK_NAME) \
446+
$(if $(ADMIN_EMAIL),--admin-email $(ADMIN_EMAIL)) \
447+
$(if $(REGION),--region $(REGION)) \
448+
$(if $(FROM_CODE),--from-code .) \
449+
$(if $(HEADLESS),--headless) \
450+
$(if $(CUSTOM_CONFIG),--custom-config $(CUSTOM_CONFIG)) \
451+
$(if $(TEMPLATE_URL),--template-url $(TEMPLATE_URL)) \
452+
$(if $(TEMPLATE_FILE),--template-file $(TEMPLATE_FILE)) \
453+
$(if $(NO_WAIT),,--wait) \
454+
$(EXTRA_ARGS)
455+
456+
# Usage examples:
457+
# make delete-stack STACK_NAME=test-stack # interactive
458+
# make delete-stack STACK_NAME=test-stack FORCE=1 # skip confirmation
459+
# make delete-stack STACK_NAME=test-stack FORCE=1 EMPTY_BUCKETS=1 # empty buckets first
460+
# make delete-stack STACK_NAME=test-stack FORCE=1 FORCE_DELETE_ALL=1 # comprehensive cleanup
461+
delete-stack: ## Delete an IDP CloudFormation stack (Usage: make delete-stack STACK_NAME=... [FORCE=1] [EMPTY_BUCKETS=1] [FORCE_DELETE_ALL=1] [REGION=...] [NO_WAIT=1] [EXTRA_ARGS=...])
462+
ifndef STACK_NAME
463+
$(error STACK_NAME is not set. Usage: make delete-stack STACK_NAME=my-stack [FORCE=1] [EMPTY_BUCKETS=1] [FORCE_DELETE_ALL=1])
464+
endif
465+
@echo -e "$(YELLOW)Running idp-cli delete (stack=$(STACK_NAME))...$(NC)"
466+
$(IDP_CLI) delete \
467+
--stack-name $(STACK_NAME) \
468+
$(if $(FORCE),--force) \
469+
$(if $(EMPTY_BUCKETS),--empty-buckets) \
470+
$(if $(FORCE_DELETE_ALL),--force-delete-all) \
471+
$(if $(REGION),--region $(REGION)) \
472+
$(if $(NO_WAIT),,--wait) \
473+
$(EXTRA_ARGS)
474+
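The three wrappers share one pattern: each optional Make variable contributes a flag only when it is non-empty, and `--wait` is added unless `NO_WAIT` is set. As a rough illustration, the argument assembly for `make deploy` could be modelled in Python as below. The function name `build_deploy_args` is hypothetical; the flags are the ones visible in the recipe above (the `TEMPLATE_URL` and `TEMPLATE_FILE` options are left out to keep the sketch short).

```python
import shlex


def build_deploy_args(
    stack_name: str,
    admin_email: str = "",
    region: str = "",
    from_code: str = "",
    headless: str = "",
    custom_config: str = "",
    no_wait: str = "",
    extra_args: str = "",
) -> list[str]:
    """Assemble the `idp-cli deploy` argument list the way the Make recipe does."""
    if not stack_name:
        # Corresponds to the `ifndef STACK_NAME` guard in the Makefile.
        raise ValueError("STACK_NAME is not set")
    args = ["deploy", "--stack-name", stack_name]
    if admin_email:
        args += ["--admin-email", admin_email]
    if region:
        args += ["--region", region]
    if from_code:
        args += ["--from-code", "."]
    if headless:
        args.append("--headless")
    if custom_config:
        args += ["--custom-config", custom_config]
    if not no_wait:
        # $(if $(NO_WAIT),,--wait): waiting is the default.
        args.append("--wait")
    args += shlex.split(extra_args)
    return args
```

As with `$(if ...)` in GNU Make, the sketch treats any non-empty string as true, so `HEADLESS=0` would still add `--headless`.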

README.md

Lines changed: 1 addition & 0 deletions
@@ -174,6 +174,7 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
174174
- [Architecture](./docs/architecture.md) - Detailed component architecture and data flow
175175
- [Demo Videos](./docs/demo-videos.md) - Comprehensive collection of feature demonstration videos
176176
- [Deployment](./docs/deployment.md) - Build, publish, deploy, and test instructions
177+
- [Headless Deployment](./docs/headless-deployment.md) - Backend-only deployment (no UI/AppSync/Cognito/WAF) for API-only use cases; required for GovCloud
177178
- [IDP CLI](./docs/idp-cli.md) - Command line interface for batch processing, evaluation workflows, and interactive Agent Chat
178179
- [Web UI](./docs/web-ui.md) - Web interface features and usage
179180
- [Agent Analysis](./docs/agent-analysis.md) - Natural language analytics and data visualization feature

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
0.5.7
1+
0.5.8
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
1+
# DS-11 U.S. Passport Application Sample
2+
3+
This sample configuration demonstrates the **excluded-class feature** — a way
4+
to tell the IDP pipeline that a particular document class contains only
5+
static/boilerplate pages (instructions, legal warnings, cover pages, tax
6+
notices, etc.) and should be skipped during extraction, assessment,
7+
summarization, rule validation, and evaluation.
8+
9+
## What it demonstrates
10+
11+
`samples/DS11-USPassportApplication.pdf` is a 6-page US State Department
12+
passport application form in which:
13+
14+
| Page | Content | Nature |
15+
|------|------------------------------------------------------|------------------|
16+
| 1 | WARNING: False statements… legal warning | Static legal |
17+
| 2 | Passport fee and payment instructions | Static instructions |
18+
| 3 | DS-11 FEDERAL TAX LAW (Section 6039E) notice | Static legal |
19+
| 4 | DS-11 ACTS OR CONDITIONS affidavit | Static oath |
20+
| 5 | APPLICATION FOR A U.S. PASSPORT (form front) | Dynamic form |
21+
| 6 | Travel Plans / Permanent Address (form back) | Dynamic form |
22+
23+
This config is a **minimal override config** — it only declares `notes` +
24+
`classes`. All other settings (`classification:`, `extraction:`,
25+
`assessment:`, `summarization:`, `ocr:`, `evaluation:`) are inherited
26+
from the bundled system defaults via `merge_config_with_defaults()` at
27+
deploy time (production) or at notebook-load time (demos). You only
28+
need to declare the classes you care about.
29+
30+
With this config:
31+
32+
1. The classifier sees **two** classes, `PassportApplicationInstructions`
33+
and `PassportApplication`.
34+
2. The **primary classification mechanism** is the LLM multimodal
35+
page-level classifier: each page is sent (image + OCR text) to
36+
Bedrock and the best-matching class is chosen using the class
37+
`description` field. This is robust to form revisions, OCR quirks,
38+
and wording differences.
39+
3. The **optional regex fast-path** on the excluded class
40+
(`x-aws-idp-document-page-content-regex`) short-circuits pages whose
41+
OCR text matches a known stable boilerplate phrase. If the regex
42+
misses, the LLM still catches the page via the description. The
43+
regex is narrowly scoped to a single conservative anchor; see the
44+
comment in `config.yaml` for details.
45+
4. The document is segmented into two sections via the existing BIO-like
46+
section-boundary logic. The classification service propagates the
47+
`excluded` flag from the class config onto the `Section`.
48+
5. Downstream services (extraction, assessment, summarization, rule
49+
validation) see `section.excluded == True` and **skip** those
50+
sections. They still write a small `result.json` stub so the UI and
51+
reporting database have something to show:
52+
53+
```json
54+
{
55+
"status": "skipped_excluded_class",
56+
"stage": "extraction",
57+
"section_id": "1",
58+
"classification": "PassportApplicationInstructions",
59+
"excluded": true,
60+
"exclusion_reason": "instructions",
61+
"page_ids": ["1", "2", "3", "4"],
62+
"message": "Section 1 classified as 'PassportApplicationInstructions' …"
63+
}
64+
```
65+
66+
6. The evaluation service filters excluded sections out of the
67+
precision/recall/F1 calculation and appends an **Excluded Sections**
68+
table to the markdown report so nothing is silently dropped.
69+
70+
7. The UI renders excluded sections in the Sections panel with a grey
71+
`Skipped: instructions` badge next to the class name.
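
The skip behaviour in steps 4 and 5 can be sketched as follows. This is an illustration only, not the `idp_common` implementation: the `Section` dataclass, `skipped_stub`, and `run_stage` are hypothetical names, and the stub carries just the fields shown in the JSON sample above (the `message` field is omitted because its full text is not shown).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Section:
    """Hypothetical, simplified stand-in for the pipeline's Section model."""
    section_id: str
    classification: str
    page_ids: list[str] = field(default_factory=list)
    excluded: bool = False
    exclusion_reason: Optional[str] = None


def skipped_stub(section: Section, stage: str) -> dict:
    """Build the small result stub written in place of real stage output."""
    return {
        "status": "skipped_excluded_class",
        "stage": stage,
        "section_id": section.section_id,
        "classification": section.classification,
        "excluded": True,
        "exclusion_reason": section.exclusion_reason,
        "page_ids": list(section.page_ids),
    }


def run_stage(
    sections: list[Section],
    stage: str,
    process: Callable[[Section], dict],
) -> dict[str, dict]:
    """Run `process` on normal sections; excluded sections get a stub and no call."""
    results: dict[str, dict] = {}
    for section in sections:
        if section.excluded:
            results[section.section_id] = skipped_stub(section, stage)
        else:
            results[section.section_id] = process(section)
    return results
```

In this sketch `process` stands for the expensive LLM-backed step, so for the DS-11 sample it would be invoked once (pages 5 and 6) rather than twice.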
72+
73+
## How to try it
74+
75+
### 1. As a library / test fixture
76+
77+
```bash
78+
# From the repo root
79+
python -c "
80+
from idp_common.models import Document, Section
81+
from idp_common.section_exclusion import is_section_excluded, build_skipped_stub_result
82+
83+
doc = Document(id='ds11-demo')
84+
sec = Section(
85+
section_id='1',
86+
classification='PassportApplicationInstructions',
87+
page_ids=['1','2','3','4'],
88+
excluded=True,
89+
exclusion_reason='instructions',
90+
)
91+
assert is_section_excluded(sec)
92+
print(build_skipped_stub_result(doc, sec, stage='extraction'))
93+
"
94+
```
95+
96+
### 2. In a live deployment
97+
98+
1. Load this config into your stack:
99+
100+
```bash
101+
idp-cli configuration create \
102+
--stack-name <your-stack> \
103+
--version-name ds11 \
104+
--path config_library/unified/ds11-passport-application/config.yaml
105+
idp-cli configuration activate --stack-name <your-stack> --version-name ds11
106+
```
107+
108+
2. Upload `samples/DS11-USPassportApplication.pdf` through the web UI or
109+
CLI, and inspect the resulting sections in the Sections panel — the
110+
first section (pages 1–4) will display a **Skipped: instructions**
111+
badge and the extraction/summary panels for that section will show
112+
the skipped-stub message. Only the second section (pages 5–6) will
113+
be extracted.
114+
115+
## Key schema extensions
116+
117+
Two new class-level extensions power the feature:
118+
119+
| Key | Type | Meaning |
120+
|-----|------|---------|
121+
| `x-aws-idp-exclude-from-processing` | boolean | When `true`, downstream services skip sections classified as this class. |
122+
| `x-aws-idp-exclusion-reason` | string | Optional short reason (`"instructions"`, `"legal"`, `"cover-page"`) shown in UI badges and evaluation reports. |
123+
124+
The existing
125+
`x-aws-idp-document-page-content-regex` extension is used as a fast path
126+
so the LLM doesn't have to classify boilerplate pages that clearly
127+
contain anchor phrases from the form template.
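
To make the interplay of the three extensions concrete, here is a minimal sketch of a regex fast path with an LLM fallback, followed by a lookup of the exclusion flag. Several parts are assumptions: the `"name"` key for the class identifier, the use of `re.search`, the example pattern, and the `llm_classify` callback are all hypothetical. Only the three `x-aws-idp-*` keys come from this README.

```python
import re
from typing import Callable, Optional

REGEX_KEY = "x-aws-idp-document-page-content-regex"
EXCLUDE_KEY = "x-aws-idp-exclude-from-processing"
REASON_KEY = "x-aws-idp-exclusion-reason"

# Hypothetical class definitions; "name" and the regex are illustrative only.
CLASSES = [
    {
        "name": "PassportApplicationInstructions",
        REGEX_KEY: r"(?is)false\s+statements",
        EXCLUDE_KEY: True,
        REASON_KEY: "instructions",
    },
    {"name": "PassportApplication"},
]


def classify_page(
    ocr_text: str,
    classes: list[dict],
    llm_classify: Callable[[str], str],
) -> str:
    """Try each class's content regex first; fall back to the LLM on a miss."""
    for cls in classes:
        pattern = cls.get(REGEX_KEY)
        if pattern and ocr_text and re.search(pattern, ocr_text):
            return cls["name"]
    return llm_classify(ocr_text)


def exclusion_for(class_name: str, classes: list[dict]) -> tuple[bool, Optional[str]]:
    """Return (excluded, reason) for a class; classes without the flag are not excluded."""
    for cls in classes:
        if cls["name"] == class_name:
            excluded = cls.get(EXCLUDE_KEY) is True
            return excluded, (cls.get(REASON_KEY) if excluded else None)
    return False, None
```

A class that omits the flag falls through to `(False, None)`, which matches the additive behaviour described in the changelog.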
128+
129+
## Notes & caveats
130+
131+
- The regex fast path relies on OCR text being available. When OCR is
132+
disabled (e.g. image-only mode), the LLM still recognizes
133+
`PassportApplicationInstructions` visually thanks to the detailed
134+
class `description`.
135+
- The `properties: {}` on the excluded class is intentional — there's
136+
nothing to extract from boilerplate pages. The classifier doesn't
137+
require properties.
138+
- Regex patterns can be tuned to match additional State Department
139+
revisions of DS-11. The `(?is)` flags make matching case-insensitive
140+
and tolerant of OCR line-break artefacts.
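
The effect of the `(?is)` inline flags can be checked with Python's standard `re` module. The pattern and the OCR snippet below are made up for illustration and are not taken from `config.yaml`.

```python
import re

# Hypothetical OCR output: upper-case text with a line break inside the phrase.
ocr_text = "DS-11 FEDERAL TAX\nLAW notice"

# Hypothetical anchor phrase; "." stands for whatever separates the words.
anchor = r"federal.tax.law"

# i = ignore case, s = let "." match newlines as well.
both_flags = re.search(r"(?is)" + anchor, ocr_text)
only_ignore_case = re.search(r"(?i)" + anchor, ocr_text)
no_flags = re.search(anchor, ocr_text)

print(bool(both_flags), bool(only_ignore_case), bool(no_flags))
```

Only the first search succeeds: without `s` the dot cannot cross the line break, and without `i` the upper-case OCR text does not match the lower-case pattern.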
