| 1 | +# DS-11 U.S. Passport Application Sample |
| 2 | + |
| 3 | +This sample configuration demonstrates the **excluded-class feature** — a way |
| 4 | +to tell the IDP pipeline that a particular document class contains only |
| 5 | +static/boilerplate pages (instructions, legal warnings, cover pages, tax |
| 6 | +notices, etc.) and should be skipped during extraction, assessment, |
| 7 | +summarization, rule validation, and evaluation. |
| 8 | + |
| 9 | +## What it demonstrates |
| 10 | + |
| 11 | +`samples/DS11-USPassportApplication.pdf` is a 6-page US State Department |
| 12 | +passport application form in which: |
| 13 | + |
| 14 | +| Page | Content | Nature | |
| 15 | +|------|------------------------------------------------------|------------------| |
| 16 | +| 1 | WARNING: False statements… legal warning | Static legal | |
| 17 | +| 2 | Passport fee and payment instructions | Static instructions | |
| 18 | +| 3 | DS-11 FEDERAL TAX LAW (Section 6039E) notice | Static legal | |
| 19 | +| 4 | DS-11 ACTS OR CONDITIONS affidavit | Static oath | |
| 20 | +| 5 | APPLICATION FOR A U.S. PASSPORT (form front) | Dynamic form | |
| 21 | +| 6 | Travel Plans / Permanent Address (form back) | Dynamic form | |
| 22 | + |
| 23 | +This config is a **minimal override config** — it only declares `notes` + |
| 24 | +`classes`. All other settings (`classification:`, `extraction:`, |
| 25 | +`assessment:`, `summarization:`, `ocr:`, `evaluation:`) are inherited |
| 26 | +from the bundled system defaults via `merge_config_with_defaults()` at |
| 27 | +deploy time (production) or at notebook-load time (demos). You only |
| 28 | +need to declare the classes you care about. |
| 29 | + |
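As a rough illustration of what "minimal override" means, the sketch below layers a small override on top of a set of defaults. It is not the actual `merge_config_with_defaults()` implementation: the `deep_merge` helper, the placeholder default values, the `$id` key, and the rule that nested mappings merge while lists are replaced wholesale are all assumptions made for this example.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(defaults: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of defaults with override applied on top.

    Nested dicts are merged key by key; any other value (including lists)
    from the override replaces the default wholesale.
    """
    merged = deepcopy(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Placeholder defaults; the real bundled defaults are not reproduced here.
system_defaults = {
    "ocr": {"backend": "placeholder-ocr"},
    "extraction": {"temperature": 0.0},
    "classes": [],
}

# The override declares only what this sample cares about.
ds11_override = {
    "notes": "DS-11 excluded-class sample",
    "classes": [
        {"$id": "PassportApplicationInstructions"},
        {"$id": "PassportApplication"},
    ],
}

effective = deep_merge(system_defaults, ds11_override)
print(sorted(effective))
```

Under these assumptions the effective config keeps the inherited `ocr` and `extraction` blocks untouched and gains only the two declared classes and the `notes` entry.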
| 30 | +With this config: |
| 31 | + |
| 32 | +1. The classifier sees **two** classes, `PassportApplicationInstructions` |
| 33 | + and `PassportApplication`. |
| 34 | +2. The **primary classification mechanism** is the LLM multimodal |
| 35 | + page-level classifier: each page is sent (image + OCR text) to |
| 36 | + Bedrock and the best-matching class is chosen using the class |
| 37 | + `description` field. This is robust to form revisions, OCR quirks, |
| 38 | + and wording differences. |
| 39 | +3. The **optional regex fast-path** on the excluded class |
| 40 | + (`x-aws-idp-document-page-content-regex`) short-circuits pages whose |
| 41 | + OCR text matches a known stable boilerplate phrase. If the regex |
| 42 | + misses, the LLM still catches the page via the description. The |
| 43 | + regex is narrowly scoped to a single conservative anchor; see the |
| 44 | + comment in `config.yaml` for details. |
| 45 | +4. The document is segmented into two sections via the existing BIO-like |
| 46 | + section-boundary logic. The classification service propagates the |
| 47 | + `excluded` flag from the class config onto the `Section`. |
| 48 | +5. Downstream services (extraction, assessment, summarization, rule |
| 49 | + validation) see `section.excluded == True` and **skip** those |
| 50 | + sections. They still write a small `result.json` stub so the UI and |
| 51 | + reporting database have something to show: |
| 52 | + |
| 53 | + ```json |
| 54 | + { |
| 55 | + "status": "skipped_excluded_class", |
| 56 | + "stage": "extraction", |
| 57 | + "section_id": "1", |
| 58 | + "classification": "PassportApplicationInstructions", |
| 59 | + "excluded": true, |
| 60 | + "exclusion_reason": "instructions", |
| 61 | + "page_ids": ["1", "2", "3", "4"], |
| 62 | + "message": "Section 1 classified as 'PassportApplicationInstructions' …" |
| 63 | + } |
| 64 | + ``` |
| 65 | + |
| 66 | +6. The evaluation service filters excluded sections out of the |
| 67 | + precision/recall/F1 calculation and appends an **Excluded Sections** |
| 68 | + table to the markdown report so nothing is silently dropped. |
| 69 | + |
| 70 | +7. The UI renders excluded sections in the Sections panel with a grey |
| 71 | + `Skipped: instructions` badge next to the class name. |
| 72 | + |
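The skip-and-stub behaviour in steps 5 and 6 can be sketched roughly as follows. This is a self-contained illustration, not the pipeline's code: `DemoSection`, `skipped_stub`, `run_stage`, and `partition_for_evaluation` are hypothetical names, and the stub omits the `message` field because its full text is not shown in the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DemoSection:
    """Hypothetical stand-in for the pipeline's Section model."""
    section_id: str
    classification: str
    page_ids: List[str] = field(default_factory=list)
    excluded: bool = False
    exclusion_reason: Optional[str] = None


def skipped_stub(section: DemoSection, stage: str) -> Dict[str, object]:
    """Build a stub shaped like the result.json example (without `message`)."""
    return {
        "status": "skipped_excluded_class",
        "stage": stage,
        "section_id": section.section_id,
        "classification": section.classification,
        "excluded": True,
        "exclusion_reason": section.exclusion_reason,
        "page_ids": list(section.page_ids),
    }


def run_stage(section: DemoSection, stage: str) -> Dict[str, object]:
    """Return a stub for excluded sections; otherwise pretend to do real work."""
    if section.excluded:
        return skipped_stub(section, stage)
    return {"status": "processed", "stage": stage, "section_id": section.section_id}


def partition_for_evaluation(
    sections: List[DemoSection],
) -> Tuple[List[DemoSection], List[DemoSection]]:
    """Split sections into (scored, excluded) so excluded ones stay reportable."""
    scored = [s for s in sections if not s.excluded]
    excluded = [s for s in sections if s.excluded]
    return scored, excluded


sections = [
    DemoSection("1", "PassportApplicationInstructions", ["1", "2", "3", "4"],
                excluded=True, exclusion_reason="instructions"),
    DemoSection("2", "PassportApplication", ["5", "6"]),
]

results = [run_stage(s, "extraction") for s in sections]
scored, excluded = partition_for_evaluation(sections)
print([r["status"] for r in results])
print([s.section_id for s in scored], [s.section_id for s in excluded])
```

Keeping the excluded sections in their own list, rather than dropping them, is what would let a report list them separately from the scored ones.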
| 73 | +## How to try it |
| 74 | + |
| 75 | +### 1. As a library / test fixture |
| 76 | + |
| 77 | +```bash |
| 78 | +# From the repo root |
| 79 | +python -c " |
| 80 | +from idp_common.models import Document, Section |
| 81 | +from idp_common.section_exclusion import is_section_excluded, build_skipped_stub_result |
| 82 | + |
| 83 | +doc = Document(id='ds11-demo') |
| 84 | +sec = Section( |
| 85 | + section_id='1', |
| 86 | + classification='PassportApplicationInstructions', |
| 87 | + page_ids=['1','2','3','4'], |
| 88 | + excluded=True, |
| 89 | + exclusion_reason='instructions', |
| 90 | +) |
| 91 | +assert is_section_excluded(sec) |
| 92 | +print(build_skipped_stub_result(doc, sec, stage='extraction')) |
| 93 | +" |
| 94 | +``` |
| 95 | + |
| 96 | +### 2. In a live deployment |
| 97 | + |
| 98 | +1. Load this config into your stack: |
| 99 | + |
| 100 | + ```bash |
| 101 | + idp-cli configuration create \ |
| 102 | + --stack-name <your-stack> \ |
| 103 | + --version-name ds11 \\ |
| 104 | + --path config_library/unified/ds11-passport-application/config.yaml |
| 105 | + idp-cli configuration activate --stack-name <your-stack> --version-name ds11 |
| 106 | + ``` |
| 107 | + |
| 108 | +2. Upload `samples/DS11-USPassportApplication.pdf` through the web UI or |
| 109 | + CLI, and inspect the resulting sections in the Sections panel — the |
| 110 | + first section (pages 1–4) will display a **Skipped: instructions** |
| 111 | + badge and the extraction/summary panels for that section will show |
| 112 | + the skipped-stub message. Only the second section (pages 5–6) will |
| 113 | + be extracted. |
| 114 | + |
| 115 | +## Key schema extensions |
| 116 | + |
| 117 | +Two new class-level extensions power the feature: |
| 118 | + |
| 119 | +| Key | Type | Meaning | |
| 120 | +|-----|------|---------| |
| 121 | +| `x-aws-idp-exclude-from-processing` | boolean | When `true`, downstream services skip sections classified as this class. | |
| 122 | +| `x-aws-idp-exclusion-reason` | string | Optional short reason (`"instructions"`, `"legal"`, `"cover-page"`) shown in UI badges and evaluation reports. | |
| 123 | + |
| 124 | +The existing |
| 125 | +`x-aws-idp-document-page-content-regex` extension is used as a fast path |
| 126 | +so the LLM doesn't have to classify boilerplate pages that clearly |
| 127 | +contain anchor phrases from the form template. |
| 128 | + |
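To make the two extension keys concrete, here is a small sketch of reading them from a class definition and deriving the values that would be copied onto a section. The class entries are written as Python dicts for illustration only: the `$id` key, the description text, and the `exclusion_for` helper are assumptions, and the real class definitions live in `config.yaml`.

```python
from typing import Any, Dict, List, Optional, Tuple

EXCLUDE_KEY = "x-aws-idp-exclude-from-processing"
REASON_KEY = "x-aws-idp-exclusion-reason"

# Illustrative class entries; identifiers and descriptions are placeholders.
classes: List[Dict[str, Any]] = [
    {
        "$id": "PassportApplicationInstructions",
        "type": "object",
        "description": "Static instruction and legal-notice pages of the DS-11.",
        EXCLUDE_KEY: True,
        REASON_KEY: "instructions",
        "properties": {},  # nothing to extract from boilerplate pages
    },
    {
        "$id": "PassportApplication",
        "type": "object",
        "description": "The fillable DS-11 application form pages.",
        "properties": {},  # real extraction fields omitted in this sketch
    },
]


def exclusion_for(
    class_name: str, class_configs: List[Dict[str, Any]]
) -> Tuple[bool, Optional[str]]:
    """Hypothetical lookup: return (excluded, reason) for a class name.

    Only an explicit boolean true counts as excluded; a missing key or an
    unknown class falls back to normal processing.
    """
    for cfg in class_configs:
        if cfg.get("$id") == class_name:
            excluded = cfg.get(EXCLUDE_KEY) is True
            reason = cfg.get(REASON_KEY) if excluded else None
            return excluded, reason
    return False, None


print(exclusion_for("PassportApplicationInstructions", classes))
print(exclusion_for("PassportApplication", classes))
```

Treating a missing key as "not excluded" keeps existing configs, which do not declare either extension, on their current behaviour.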
| 129 | +## Notes & caveats |
| 130 | + |
| 131 | +- The regex fast path relies on OCR text being available. When OCR is |
| 132 | + disabled (e.g. image-only mode), the LLM still recognizes |
| 133 | + `PassportApplicationInstructions` visually thanks to the detailed |
| 134 | + class `description`. |
| 135 | +- The `properties: {}` on the excluded class is intentional — there's |
| 136 | + nothing to extract from boilerplate pages. The classifier doesn't |
| 137 | + require properties. |
| 138 | +- Regex patterns can be tuned to match additional State Department |
| 139 | + revisions of DS-11. The `(?is)` flags make matching case-insensitive |
| 140 | + and let `.` match across OCR line breaks. |
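The effect of the `(?is)` flags can be checked with Python's `re` module. The anchor pattern and the OCR text below are made up for the demonstration (the real pattern is in `config.yaml`), and the sketch assumes search semantics, meaning a match anywhere in the page text.

```python
import re

# Hypothetical anchor phrase; not the pattern shipped in config.yaml.
ANCHOR = r"federal\s+tax\s+law.*6039E"

# Made-up OCR output: mixed casing and line breaks inside the phrase.
ocr_text = "DS-11 Federal Tax\nLaw\nSection 6039e of the Internal Revenue Code"

# i = ignore case, s = let "." match newlines as well.
with_both = re.search("(?is)" + ANCHOR, ocr_text)
only_ignore_case = re.search("(?i)" + ANCHOR, ocr_text)
no_flags = re.search(ANCHOR, ocr_text)

print(bool(with_both), bool(only_ignore_case), bool(no_flags))
# → True False False
```

Without `s`, the `.*` cannot cross the newline after "Law", and without `i` the lower-case pattern never matches "Federal", so both flags are needed for this text.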