Commit 9d5b50a

Per-Class Extraction Model Override
1 parent 5d1dcd7 commit 9d5b50a

9 files changed

Lines changed: 373 additions & 2 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,8 @@ SPDX-License-Identifier: MIT-0
 
 - **Chandra OCR Lambda Hook Sample** — New `GENAIIDP-chandra-ocr-hook` sample in `samples/lambda-hook-inference/` that integrates [Datalab Chandra OCR 2](https://github.com/datalab-to/chandra) with the LambdaHook feature for high-quality OCR. Supports 90+ languages, math, tables, forms, and handwriting. Uses the Datalab hosted async API (`/api/v1/convert`) with configurable output format (markdown/json/html) and conversion mode (fast/balanced/accurate). Includes standalone SAM template, local test script, and deployment instructions. See `docs/lambda-hook-inference.md` — Chandra OCR Integration section.
 
+- **Per-Class Extraction Model Override** — New `x-aws-idp-extraction-model` JSON Schema extension allows overriding the global `extraction.model` on a per-document-class basis. Useful when certain document types benefit from a different model (e.g., a more powerful model for complex financial forms, a faster/cheaper model for simple documents). Classes without the extension continue to use the global default. Works with both traditional and agentic extraction modes. See `docs/extraction.md` — Per-Class Extraction Model Override section.
+
 - **Wildcard pattern support for delete-documents**`idp-cli delete-documents` and `client.batch.delete_documents()` now accept a `--pattern` / `pattern` parameter for fnmatch-style wildcard matching (e.g. `"batch-123/*.pdf"`, `"*invoice*"`). Combines with `--status-filter` to delete e.g. all failed invoices across batches.
 - **Prompt Preview** — New "Prompt Preview" tab in the Configuration page lets you preview the actual prompts sent to the LLM for each processing step (Classification, Extraction, Assessment, Summarization). Config-derived placeholders are filled in with real values (class names, cleaned JSON Schema), while document-specific placeholders are shown as highlighted markers. Includes token estimates, copy-to-clipboard, and a substitution details panel showing the exact schema sent to the LLM. Helps optimize document class schemas and prompt templates.
 

docs/extraction.md

Lines changed: 39 additions & 0 deletions
@@ -112,6 +112,45 @@ classes:
         description: "The date by which payment is due, typically labeled as 'Due Date', 'Payment Due', or similar"
 ```
 
+### Per-Class Extraction Model Override
+
+By default, all document classes use the model specified in `extraction.model`. You can override this on a per-class basis using the `x-aws-idp-extraction-model` extension on any class schema. This is useful when certain document types benefit from a different model — for example, using a more powerful model for complex financial forms while keeping a faster, cheaper model for simpler documents.
+
+Classes without the override continue to use the global `extraction.model`. The override works with both **traditional** and **agentic** extraction modes.
+
+```yaml
+extraction:
+  model: us.amazon.nova-pro-v1:0  # Default for most classes
+
+classes:
+  # This class uses the default model (us.amazon.nova-pro-v1:0)
+  - $schema: "https://json-schema.org/draft/2020-12/schema"
+    $id: simple-receipt
+    x-aws-idp-document-type: simple-receipt
+    type: object
+    properties:
+      total:
+        type: string
+        description: "Total amount"
+
+  # This class overrides the extraction model
+  - $schema: "https://json-schema.org/draft/2020-12/schema"
+    $id: complex-financial-form
+    x-aws-idp-document-type: complex-financial-form
+    x-aws-idp-extraction-model: us.anthropic.claude-sonnet-4-20250514-v1:0  # Override!
+    type: object
+    properties:
+      account_number:
+        type: string
+        description: "Account number"
+```
+
+When a per-class model override is active, it is logged at INFO level:
+
+```
+Using per-class extraction model override for 'complex-financial-form': us.anthropic.claude-sonnet-4-20250514-v1:0
+```
+
 ### Extraction Instructions
 
 ### Model and Prompt Configuration

lib/idp_common_pkg/idp_common/config/schema_constants.py

Lines changed: 6 additions & 0 deletions
@@ -29,6 +29,12 @@
 X_AWS_IDP_DOCUMENT_NAME_REGEX = "x-aws-idp-document-name-regex"
 X_AWS_IDP_PAGE_CONTENT_REGEX = "x-aws-idp-document-page-content-regex"
 
+# ============================================================================
+# AWS IDP Extraction Extensions
+# ============================================================================
+# Per-class model override for extraction (overrides extraction.model)
+X_AWS_IDP_EXTRACTION_MODEL = "x-aws-idp-extraction-model"
+
 # ============================================================================
 # Legacy Attribute Type Values (for migration only)
 # ============================================================================

lib/idp_common_pkg/idp_common/extraction/service.py

Lines changed: 10 additions & 2 deletions
@@ -23,6 +23,7 @@
     ID_FIELD,
     SCHEMA_PROPERTIES,
     X_AWS_IDP_DOCUMENT_TYPE,
+    X_AWS_IDP_EXTRACTION_MODEL,
 )
 from idp_common.models import Document
 from idp_common.utils.few_shot_example_builder import (
@@ -1380,8 +1381,15 @@ def _invoke_extraction_model(
             f"Extracting fields for {section_info.class_label} document, section"
         )
 
-        # Get extraction config
-        model_id = self.config.extraction.model
+        # Get extraction config — use per-class model override if specified,
+        # otherwise fall back to the global extraction model.
+        class_model_override = self._class_schema.get(X_AWS_IDP_EXTRACTION_MODEL)
+        model_id = class_model_override or self.config.extraction.model
+        if class_model_override:
+            logger.info(
+                f"Using per-class extraction model override for "
+                f"'{section_info.class_label}': {model_id}"
+            )
         temperature = self.config.extraction.temperature
         top_k = self.config.extraction.top_k
         top_p = self.config.extraction.top_p
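
The fallback in the hunk above can be sketched outside the service as follows. This is a minimal illustration, not code from the commit: the helper `effective_extraction_models`, the plain-dict config, and the `memo` class are hypothetical. Only the extension keys, the `or` fallback, and the model IDs are taken from the diff.

```python
from typing import Any

# Extension keys copied from the diff.
DOCUMENT_TYPE_KEY = "x-aws-idp-document-type"
EXTRACTION_MODEL_KEY = "x-aws-idp-extraction-model"


def effective_extraction_models(config: dict[str, Any]) -> dict[str, str]:
    """Hypothetical helper: map each class label to the model extraction would use."""
    global_model = config["extraction"]["model"]
    resolved: dict[str, str] = {}
    for class_schema in config.get("classes", []):
        label = class_schema.get(DOCUMENT_TYPE_KEY) or class_schema.get("$id", "<unnamed>")
        # Same `or` fallback as the service change: a missing, None, or
        # empty-string override resolves to the global model.
        resolved[label] = class_schema.get(EXTRACTION_MODEL_KEY) or global_model
    return resolved


example_config = {
    "extraction": {"model": "us.amazon.nova-pro-v1:0"},
    "classes": [
        {"$id": "simple-receipt", DOCUMENT_TYPE_KEY: "simple-receipt"},
        {
            "$id": "complex-financial-form",
            DOCUMENT_TYPE_KEY: "complex-financial-form",
            EXTRACTION_MODEL_KEY: "us.anthropic.claude-sonnet-4-20250514-v1:0",
        },
        # Illustrative class (not in the commit) with an explicitly empty override.
        {"$id": "memo", DOCUMENT_TYPE_KEY: "memo", EXTRACTION_MODEL_KEY: ""},
    ],
}

for label, model in effective_extraction_models(example_config).items():
    print(f"{label}: {model}")
```

Because the model is resolved with `or` rather than an `is not None` check, an empty-string override behaves like no override at all and produces no INFO log line.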

lib/idp_common_pkg/tests/unit/extraction/test_extraction_service.py

Lines changed: 208 additions & 0 deletions
@@ -448,3 +448,211 @@ def test_extract_json_simple(self, service):
         text = "No JSON here"
         result = extract_json_from_text(text)
         assert result == "No JSON here"
+
+
+@pytest.mark.unit
+class TestPerClassExtractionModelOverride:
+    """Tests for the per-class extraction model override feature (x-aws-idp-extraction-model)."""
+
+    @pytest.fixture
+    def config_with_override(self):
+        """Config where one class has x-aws-idp-extraction-model and another does not."""
+        return {
+            "classes": [
+                {
+                    "$schema": "https://json-schema.org/draft/2020-12/schema",
+                    "$id": "simple-receipt",
+                    "x-aws-idp-document-type": "simple-receipt",
+                    "type": "object",
+                    "description": "A simple receipt",
+                    "properties": {
+                        "total": {
+                            "type": "string",
+                            "description": "Total amount",
+                        },
+                    },
+                },
+                {
+                    "$schema": "https://json-schema.org/draft/2020-12/schema",
+                    "$id": "complex-form",
+                    "x-aws-idp-document-type": "complex-form",
+                    "x-aws-idp-extraction-model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+                    "type": "object",
+                    "description": "A complex financial form",
+                    "properties": {
+                        "account_number": {
+                            "type": "string",
+                            "description": "Account number",
+                        },
+                    },
+                },
+            ],
+            "extraction": {
+                "model": "us.amazon.nova-pro-v1:0",
+                "temperature": 0.0,
+                "top_k": 5,
+                "system_prompt": "You are a document extraction assistant.",
+                "task_prompt": dedent("""
+                    Extract fields from this {DOCUMENT_CLASS} document:
+                    {ATTRIBUTE_NAMES_AND_DESCRIPTIONS}
+                    Document text: {DOCUMENT_TEXT}
+                    {DOCUMENT_IMAGE}
+                """),
+            },
+        }
+
+    @pytest.fixture
+    def service_with_override(self, config_with_override):
+        """ExtractionService with per-class model override config."""
+        return ExtractionService(region="us-west-2", config=config_with_override)
+
+    @patch("idp_common.bedrock.invoke_model")
+    def test_uses_global_model_when_no_override(
+        self, mock_invoke_model, service_with_override
+    ):
+        """When the class schema has no x-aws-idp-extraction-model, use the global model."""
+        from idp_common.extraction.service import SectionInfo
+
+        # Set up context for a class WITHOUT override
+        service_with_override._class_schema = service_with_override._get_class_schema(
+            "simple-receipt"
+        )
+        service_with_override._class_label = "simple-receipt"
+        service_with_override._page_images = []
+
+        mock_invoke_model.return_value = {
+            "response": {
+                "output": {"message": {"content": [{"text": '{"total": "$42.00"}'}]}}
+            },
+            "metering": {"tokens": 100},
+        }
+
+        section_info = SectionInfo(
+            class_label="simple-receipt",
+            sorted_page_ids=["1"],
+            page_indices=[0],
+            output_bucket="bucket",
+            output_key="key",
+            output_uri="s3://bucket/key",
+            start_page=1,
+            end_page=1,
+        )
+
+        service_with_override._invoke_extraction_model(
+            content=[{"text": "test"}],
+            system_prompt="test",
+            section_info=section_info,
+        )
+
+        # Verify the global model was used
+        mock_invoke_model.assert_called_once()
+        call_kwargs = mock_invoke_model.call_args
+        assert call_kwargs.kwargs["model_id"] == "us.amazon.nova-pro-v1:0"
+
+    @patch("idp_common.bedrock.invoke_model")
+    def test_uses_override_model_when_specified(
+        self, mock_invoke_model, service_with_override
+    ):
+        """When the class schema has x-aws-idp-extraction-model, use the override model."""
+        from idp_common.extraction.service import SectionInfo
+
+        # Set up context for a class WITH override
+        service_with_override._class_schema = service_with_override._get_class_schema(
+            "complex-form"
+        )
+        service_with_override._class_label = "complex-form"
+        service_with_override._page_images = []
+
+        mock_invoke_model.return_value = {
+            "response": {
+                "output": {
+                    "message": {"content": [{"text": '{"account_number": "12345"}'}]}
+                }
+            },
+            "metering": {"tokens": 100},
+        }
+
+        section_info = SectionInfo(
+            class_label="complex-form",
+            sorted_page_ids=["1"],
+            page_indices=[0],
+            output_bucket="bucket",
+            output_key="key",
+            output_uri="s3://bucket/key",
+            start_page=1,
+            end_page=1,
+        )
+
+        service_with_override._invoke_extraction_model(
+            content=[{"text": "test"}],
+            system_prompt="test",
+            section_info=section_info,
+        )
+
+        # Verify the per-class override model was used
+        mock_invoke_model.assert_called_once()
+        call_kwargs = mock_invoke_model.call_args
+        assert (
+            call_kwargs.kwargs["model_id"]
+            == "us.anthropic.claude-sonnet-4-20250514-v1:0"
+        )
+
+    @patch("idp_common.bedrock.invoke_model")
+    def test_override_is_logged(self, mock_invoke_model, service_with_override, caplog):
+        """Verify that using a per-class model override produces an info log message."""
+        import logging
+
+        from idp_common.extraction.service import SectionInfo
+
+        service_with_override._class_schema = service_with_override._get_class_schema(
+            "complex-form"
+        )
+        service_with_override._class_label = "complex-form"
+        service_with_override._page_images = []
+
+        mock_invoke_model.return_value = {
+            "response": {
+                "output": {
+                    "message": {"content": [{"text": '{"account_number": "12345"}'}]}
+                }
+            },
+            "metering": {"tokens": 100},
+        }
+
+        section_info = SectionInfo(
+            class_label="complex-form",
+            sorted_page_ids=["1"],
+            page_indices=[0],
+            output_bucket="bucket",
+            output_key="key",
+            output_uri="s3://bucket/key",
+            start_page=1,
+            end_page=1,
+        )
+
+        with caplog.at_level(logging.INFO, logger="idp_common.extraction.service"):
+            service_with_override._invoke_extraction_model(
+                content=[{"text": "test"}],
+                system_prompt="test",
+                section_info=section_info,
+            )
+
+        assert any(
+            "per-class extraction model override" in record.message
+            and "complex-form" in record.message
+            for record in caplog.records
+        )
+
+    def test_schema_constant_exists(self):
+        """Verify the X_AWS_IDP_EXTRACTION_MODEL constant is defined."""
+        from idp_common.config.schema_constants import X_AWS_IDP_EXTRACTION_MODEL
+
+        assert X_AWS_IDP_EXTRACTION_MODEL == "x-aws-idp-extraction-model"
+
+    def test_clean_schema_removes_extraction_model(self, service_with_override):
+        """Verify that x-aws-idp-extraction-model is stripped from prompts."""
+        schema_with_override = service_with_override._get_class_schema("complex-form")
+        assert "x-aws-idp-extraction-model" in schema_with_override
+
+        cleaned = service_with_override._clean_schema_for_prompt(schema_with_override)
+        assert "x-aws-idp-extraction-model" not in cleaned
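
The last test relies on `_clean_schema_for_prompt`, whose implementation is not part of this diff. The sketch below shows one way such cleaning could work, under the assumption that every key prefixed with `x-aws-idp-` is dropped recursively. The helper `strip_idp_extensions` is hypothetical, and the real method may remove a different set of keys.

```python
from typing import Any

# Assumed prefix shared by the IDP schema extensions seen in this commit.
IDP_EXTENSION_PREFIX = "x-aws-idp-"


def strip_idp_extensions(node: Any) -> Any:
    """Hypothetical helper: return a copy of a schema fragment without x-aws-idp-* keys."""
    if isinstance(node, dict):
        return {
            key: strip_idp_extensions(value)
            for key, value in node.items()
            if not (isinstance(key, str) and key.startswith(IDP_EXTENSION_PREFIX))
        }
    if isinstance(node, list):
        return [strip_idp_extensions(item) for item in node]
    # Scalars (strings, numbers, booleans, None) are returned unchanged.
    return node


# Class schema mirroring the "complex-form" fixture in the tests above.
schema = {
    "$id": "complex-form",
    "x-aws-idp-document-type": "complex-form",
    "x-aws-idp-extraction-model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
    "type": "object",
    "properties": {
        "account_number": {"type": "string", "description": "Account number"},
    },
}

cleaned = strip_idp_extensions(schema)
print(sorted(cleaned))
```

Building a new dictionary instead of deleting keys in place leaves the original class schema intact, so the override can still be read from it after the prompt schema has been produced.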

patterns/unified/template.yaml

Lines changed: 38 additions & 0 deletions
@@ -377,6 +377,44 @@ Resources:
     type: string
     description: "Optional regex pattern to match against page content text. When matched during multi-modal page-level classification, the page will be classified as this class type without LLM processing."
     order: 2.6
+  extraction_model:
+    type: string
+    description: "Optional per-class extraction model override. When set, this model is used for extraction instead of the global extraction.model. Useful for classes that benefit from a different model."
+    order: 2.7
+    enum:
+      - ""
+      - "us.amazon.nova-lite-v1:0"
+      - "us.amazon.nova-pro-v1:0"
+      - "us.amazon.nova-premier-v1:0"
+      - "us.amazon.nova-2-lite-v1:0"
+      - "us.anthropic.claude-haiku-4-5-20251001-v1:0"
+      - "us.anthropic.claude-sonnet-4-5-20250929-v1:0"
+      - "us.anthropic.claude-sonnet-4-5-20250929-v1:0:1m"
+      - "us.anthropic.claude-sonnet-4-6"
+      - "us.anthropic.claude-sonnet-4-6:1m"
+      - "us.anthropic.claude-opus-4-5-20251101-v1:0"
+      - "us.anthropic.claude-opus-4-6-v1"
+      - "us.anthropic.claude-opus-4-6-v1:1m"
+      - "eu.amazon.nova-lite-v1:0"
+      - "eu.amazon.nova-pro-v1:0"
+      - "eu.amazon.nova-2-lite-v1:0"
+      - "eu.anthropic.claude-haiku-4-5-20251001-v1:0"
+      - "eu.anthropic.claude-sonnet-4-5-20250929-v1:0"
+      - "eu.anthropic.claude-sonnet-4-5-20250929-v1:0:1m"
+      - "eu.anthropic.claude-sonnet-4-6"
+      - "eu.anthropic.claude-sonnet-4-6:1m"
+      - "eu.anthropic.claude-opus-4-5-20251101-v1:0"
+      - "eu.anthropic.claude-opus-4-6-v1"
+      - "eu.anthropic.claude-opus-4-6-v1:1m"
+      - "global.amazon.nova-2-lite-v1:0"
+      - "global.anthropic.claude-haiku-4-5-20251001-v1:0"
+      - "global.anthropic.claude-sonnet-4-5-20250929-v1:0"
+      - "global.anthropic.claude-sonnet-4-5-20250929-v1:0:1m"
+      - "global.anthropic.claude-sonnet-4-6"
+      - "global.anthropic.claude-sonnet-4-6:1m"
+      - "global.anthropic.claude-opus-4-5-20251101-v1:0"
+      - "global.anthropic.claude-opus-4-6-v1"
+      - "global.anthropic.claude-opus-4-6-v1:1m"
   examples:
     type: array
     description: Class few-shot examples
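
The `enum` above does not list `us.anthropic.claude-sonnet-4-20250514-v1:0`, the override value used in the `docs/extraction.md` example and in the unit tests. Whether this enum is enforced for configurations edited outside the UI is not visible in this diff. A check along the following lines could flag such values. The helper `unlisted_overrides` and the abbreviated `ALLOWED_OVERRIDES` set are hypothetical and shown only as a sketch.

```python
from typing import Any

EXTRACTION_MODEL_KEY = "x-aws-idp-extraction-model"

# Abbreviated copy of the template enum; "" stands for "no override".
ALLOWED_OVERRIDES = {
    "",
    "us.amazon.nova-lite-v1:0",
    "us.amazon.nova-pro-v1:0",
    "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    "us.anthropic.claude-sonnet-4-6",
}


def unlisted_overrides(
    classes: list[dict[str, Any]], allowed: set[str]
) -> dict[str, str]:
    """Hypothetical helper: return {class id: override} for overrides missing from `allowed`."""
    problems: dict[str, str] = {}
    for class_schema in classes:
        override = class_schema.get(EXTRACTION_MODEL_KEY)
        if override is not None and override not in allowed:
            problems[class_schema.get("$id", "<unnamed>")] = override
    return problems


# Classes mirroring the documentation example in docs/extraction.md.
classes = [
    {"$id": "simple-receipt"},
    {
        "$id": "complex-financial-form",
        EXTRACTION_MODEL_KEY: "us.anthropic.claude-sonnet-4-20250514-v1:0",
    },
]

flagged = unlisted_overrides(classes, ALLOWED_OVERRIDES)
print(flagged)
```

Run against the documentation example, the check reports `complex-financial-form`, which suggests either adding that model ID to the enum or updating the example to a listed model.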
