Commit 8653a3f
Merge branch 'fix/custom-model-finetuning-mr-feedback' into 'develop'
fix: address Custom Model Fine-tuning MR feedback

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!610

2 parents 060e6b7 + fd32f6c

14 files changed: 1,082 additions and 612 deletions


CHANGELOG.md

Lines changed: 7 additions & 1 deletion
```diff
@@ -197,7 +197,13 @@ SPDX-License-Identifier: MIT-0
 
 - **Discovery accessible from CLI and SDK** — Discovery can now be run programmatically via the IDP SDK (`client.discovery.run()`) and CLI (`idp-cli discover`), enabling users with many document classes to automate schema generation without the Web UI. Supports both modes: without ground truth (exploratory) and with ground truth (optimized). ([#228](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/228))
 
-- **Custom Model Fine-tuning** — Fine-tune Amazon Nova models using processed IDP documents directly from the Web UI. Create a fine-tuning job from a Test Set (which provides documents with ground truth data), generate training data in JSONL format, submit fine-tuning jobs to Bedrock, and deploy the resulting custom models for use in extraction workflows. Includes Step Functions workflow orchestration, GraphQL API for job management, and "Custom Models" section in the navigation. See [Custom Model Fine-tuning](./docs/custom-model-finetuning.md) for details.
+- **Custom Model Fine-tuning** — Fine-tune Amazon Nova models using processed IDP documents directly from the Web UI. Create a fine-tuning job from a Test Set (which provides documents with ground truth data), generate training data in JSONL format, submit fine-tuning jobs to Bedrock, and deploy the resulting custom models for use in extraction workflows. See [Custom Model Fine-tuning](./docs/custom-model-finetuning.md) for details.
+  - **Web UI**: New "Custom Models" page with job creation form (test set selector, base model selector, train/validation split), jobs table with status tracking, and detailed job view with deployment status and configuration version creation
+  - **CLI / SDK**: `idp-cli finetuning create`, `idp-cli finetuning status`, `idp-cli finetuning list`, `idp-cli finetuning delete` commands for programmatic job management
+  - **GraphQL API**: New `createFinetuningJob`, `getFinetuningJob`, `listFinetuningJobs`, `deleteFinetuningJob` mutations/queries with `FinetuningJob` type and real-time status fields
+  - **Step Functions Workflow**: 7-Lambda orchestration pipeline — list documents, parallel document processing (Distributed Map), merge training data, create Bedrock fine-tuning job, poll job status, deploy custom model via Provisioned Throughput
+  - **CloudFormation Resources**: `FinetuningDataBucket` (S3), `FinetuningStateMachine` (Step Functions), 7 Lambda functions with IAM roles, CloudWatch log groups, and Bedrock permissions for model customization and deployment
+  - **Shared Training Data Utilities**: Common module (`idp_common.model_finetuning.training_data_utils`) for extraction field parsing, baseline formatting, PDF-to-image conversion, and document image handling — shared across Lambda functions to eliminate code duplication
 
 ### Changed
 
```
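The "poll job status" step in the workflow described above can be pictured as a thin wrapper over the Bedrock `GetModelCustomizationJob` API. A minimal sketch, assuming that API (real, from boto3) and a simple terminal-state check; this is illustrative, not the accelerator's actual Lambda code:

```python
def is_terminal(status: str) -> bool:
    """Bedrock model-customization jobs report one of: InProgress, Completed,
    Failed, Stopping, Stopped. Polling can stop on the last three."""
    return status in {"Completed", "Failed", "Stopped"}


def poll_once(job_arn: str) -> str:
    # Hypothetical single polling iteration; requires AWS credentials,
    # so boto3 is imported lazily and this function is not executed here.
    import boto3

    bedrock = boto3.client("bedrock")
    resp = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    return resp["status"]
```

A Step Functions wait/choice loop would call `poll_once` and re-enter the wait state until `is_terminal` returns true.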

docs/custom-model-finetuning.md

Lines changed: 6 additions & 6 deletions
````diff
@@ -81,7 +81,7 @@ Once a fine-tuning job completes:
 ### Validate a Test Set
 
 ```bash
-idp finetuning validate --test-set docsplit
+idp-cli finetuning validate --test-set docsplit
 ```
 
 Output:
````

````diff
@@ -109,7 +109,7 @@ Warnings:
 ### Create a Fine-Tuning Job
 
 ```bash
-idp finetuning create \
+idp-cli finetuning create \
   --test-set docsplit \
   --base-model us.amazon.nova-pro-v1:0 \
   --name my-classifier
````

````diff
@@ -122,13 +122,13 @@ Fine-tuning job created:
 Status: VALIDATING
 
 Monitor progress with:
-idp finetuning status --job-id abc123-def456-...
+idp-cli finetuning status --job-id abc123-def456-...
 ```
 
 ### Check Job Status
 
 ```bash
-idp finetuning status --job-id abc123-def456-...
+idp-cli finetuning status --job-id abc123-def456-...
 ```
 
 Output:
````

````diff
@@ -151,13 +151,13 @@ Fine-Tuning Job Status:
 ### List All Jobs
 
 ```bash
-idp finetuning list
+idp-cli finetuning list
 ```
 
 ### Delete a Job
 
 ```bash
-idp finetuning delete --job-id abc123-def456-...
+idp-cli finetuning delete --job-id abc123-def456-...
 ```
 
 ### List Available Models
````
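The status output shown in these docs can also be scripted. A sketch that polls `idp-cli finetuning status` until a terminal state; the `wait_for_job` helper shells out and is not executed here, and the terminal-state names and `Status:` line format assumed by `parse_status` follow the sample output above but are otherwise assumptions:

```python
import re
import subprocess
import time


def parse_status(output: str) -> str:
    """Pull the 'Status: X' line out of the CLI's human-readable output."""
    m = re.search(r"Status:\s*(\S+)", output)
    return m.group(1) if m else "UNKNOWN"


def wait_for_job(job_id: str, interval: int = 60) -> str:
    # Hypothetical wrapper around the documented CLI command.
    while True:
        out = subprocess.run(
            ["idp-cli", "finetuning", "status", "--job-id", job_id],
            capture_output=True, text=True, check=True,
        ).stdout
        status = parse_status(out)
        if status in {"COMPLETED", "FAILED", "STOPPED"}:
            return status
        time.sleep(interval)
```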

lib/idp_common_pkg/idp_common/model_finetuning/__init__.py

Lines changed: 12 additions & 0 deletions
```diff
@@ -12,6 +12,13 @@
     ProvisionedThroughputResult,
 )
 from idp_common.model_finetuning.service import ModelFinetuningService
+from idp_common.model_finetuning.training_data_utils import (
+    convert_pdf_to_images,
+    format_baseline_for_training,
+    get_document_images,
+    get_document_images_from_uri,
+    get_extraction_fields,
+)
 
 __all__ = [
     "ModelFinetuningService",
```

```diff
@@ -20,4 +27,9 @@
     "JobStatus",
     "ProvisionedThroughputConfig",
     "ProvisionedThroughputResult",
+    "convert_pdf_to_images",
+    "format_baseline_for_training",
+    "get_document_images",
+    "get_document_images_from_uri",
+    "get_extraction_fields",
 ]
```
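This hunk follows the common pattern of re-exporting submodule helpers at the package root and listing them in `__all__` so they become part of the star-import surface. A self-contained sketch of why `__all__` matters; the `demo_pkg` module and its members are stand-ins, not the accelerator's code:

```python
import sys
import types

# Build a throwaway module in memory to mirror the re-export pattern.
pkg = types.ModuleType("demo_pkg")


def get_extraction_fields(baseline):  # stand-in for the real helper
    return list(baseline.get("Extraction", {}))


pkg.get_extraction_fields = get_extraction_fields
pkg.internal_helper = lambda: None      # re-exported module attr, but not listed
pkg.__all__ = ["get_extraction_fields"]  # only this name is exported by *
sys.modules["demo_pkg"] = pkg

ns = {}
exec("from demo_pkg import *", ns)
print("get_extraction_fields" in ns, "internal_helper" in ns)  # True False
```

Without the `__all__` additions in the second hunk, `from idp_common.model_finetuning import *` would not pick up the new training-data helpers.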

lib/idp_common_pkg/idp_common/model_finetuning/models.py

Lines changed: 0 additions & 9 deletions
```diff
@@ -31,15 +31,6 @@ class FinetuningWorkflowStatus(Enum):
     FAILED = "FAILED"
 
 
-class CustomModelDeploymentStatus(Enum):
-    """Status of a custom model deployment."""
-
-    CREATING = "Creating"
-    ACTIVE = "Active"
-    FAILED = "Failed"
-    DELETING = "Deleting"
-
-
 @dataclass
 class FinetuningJobConfig:
     """Configuration for a fine-tuning job."""
```
lib/idp_common_pkg/idp_common/model_finetuning/training_data_utils.py (new file)

Lines changed: 249 additions & 0 deletions
```python
"""
Shared utility functions for fine-tuning training data generation.

These functions are used by both finetuning_data_generator and
finetuning_process_document Lambda functions to avoid code duplication.
"""

import base64
import io
import logging
from typing import Any, Dict, List, Tuple

logger = logging.getLogger(__name__)


def get_extraction_fields(baseline: Dict[str, Any]) -> List[str]:
    """
    Extract field names from baseline extraction data.

    Parses various baseline formats to build a flat list of field names
    for training data generation.

    Supported formats:
    - Format 1: ``{"Sections": [{"Extraction": {"field": ...}}]}``
    - Format 2: ``{"Extraction": {"field": ...}}``
    - Format 3: ``{"fields": [{"name": "field"}, ...]}``

    Args:
        baseline: The baseline extraction result dictionary.

    Returns:
        List of field name strings (empty strings filtered out).
    """
    fields: List[str] = []

    # Format 1: Sections with Extraction
    sections = baseline.get("Sections", [])
    for section in sections:
        extraction = section.get("Extraction", {})
        fields.extend(extraction.keys())

    # Format 2: Direct extraction at top level
    extraction = baseline.get("Extraction", {})
    fields.extend(extraction.keys())

    # Format 3: Fields array
    for field in baseline.get("fields", []):
        if isinstance(field, dict):
            fields.append(field.get("name", ""))
        elif isinstance(field, str):
            fields.append(field)

    return [f for f in fields if f]  # Filter empty strings


def format_baseline_for_training(baseline: Dict[str, Any]) -> Dict[str, Any]:
    """
    Format baseline extraction data into the structure expected for training.

    Maps baseline result fields to a clean ``field_name -> value`` dict.
    Handles nested value objects (``{"value": ...}`` or ``{"Value": ...}``).

    Supported formats:
    - Format 1: ``{"Sections": [{"Extraction": {"field": value}}]}``
    - Format 2: ``{"Extraction": {"field": value}}``
    - Format 3: ``{"fields": [{"name": "field", "value": "val"}]}``
    - Fallback: returns all non-metadata keys from baseline.

    Args:
        baseline: The baseline extraction result dictionary.

    Returns:
        Dictionary mapping field names to their extracted values.
    """
    result: Dict[str, Any] = {}

    # Format 1: Sections with Extraction
    sections = baseline.get("Sections", [])
    for section in sections:
        extraction = section.get("Extraction", {})
        for key, value in extraction.items():
            if isinstance(value, dict):
                result[key] = value.get("value", value.get("Value", str(value)))
            else:
                result[key] = value

    # Format 2: Direct extraction at top level
    extraction = baseline.get("Extraction", {})
    for key, value in extraction.items():
        if isinstance(value, dict):
            result[key] = value.get("value", value.get("Value", str(value)))
        else:
            result[key] = value

    # Format 3: Fields array with values
    for field in baseline.get("fields", []):
        if isinstance(field, dict):
            name = field.get("name", "")
            value = field.get("value", "")
            if name:
                result[name] = value

    # Fallback: return baseline as-is (cleaned up)
    if not result:
        for key, value in baseline.items():
            if key not in [
                "Sections",
                "metadata",
                "Metadata",
                "pageCount",
                "PageCount",
            ]:
                result[key] = value

    return result


def convert_pdf_to_images(
    pdf_bytes: bytes,
    max_pages: int = 20,
    dpi: int = 150,
) -> List[Tuple[bytes, str]]:
    """
    Convert PDF bytes to a list of PNG image byte tuples (one per page).

    Uses pypdfium2 for PDF rendering. Limits to *max_pages* to avoid
    excessive memory usage during training data generation.

    Args:
        pdf_bytes: Raw PDF file content.
        max_pages: Maximum number of pages to convert.
        dpi: Resolution for rendering (default 150).

    Returns:
        List of ``(png_bytes, "png")`` tuples, one per page.
        Returns an empty list if PDF conversion libraries are unavailable
        or an error occurs.
    """
    try:
        import pypdfium2 as pdfium
    except ImportError:
        logger.error("pypdfium2 is required for PDF to image conversion")
        return []

    images: List[Tuple[bytes, str]] = []
    try:
        pdf = pdfium.PdfDocument(pdf_bytes)
        num_pages = min(len(pdf), max_pages)

        logger.info(f"Converting PDF with {len(pdf)} pages (processing {num_pages})")

        for page_idx in range(num_pages):
            page = pdf[page_idx]
            scale = dpi / 72.0  # pypdfium2 renders at 72 dpi by default
            bitmap = page.render(scale=scale)
            pil_image = bitmap.to_pil()

            buf = io.BytesIO()
            pil_image.save(buf, format="PNG", optimize=True)
            images.append((buf.getvalue(), "png"))

        pdf.close()
    except Exception as e:
        logger.error(f"Error converting PDF to images: {e}", exc_info=True)
        return []

    return images


def get_document_images(
    s3_client: Any,
    bucket: str,
    document_key: str,
) -> List[Tuple[str, str]]:
    """
    Get base64-encoded images for a document stored in S3.

    Supports PDF files (converted to images via :func:`convert_pdf_to_images`)
    and common image formats (PNG, JPG, JPEG, TIFF, BMP, GIF, WEBP).

    Args:
        s3_client: Boto3 S3 client.
        bucket: S3 bucket name.
        document_key: S3 object key for the document.

    Returns:
        List of ``(base64_string, format)`` tuples.
        Unsupported file types are logged and treated as PNG.
    """
    response = s3_client.get_object(Bucket=bucket, Key=document_key)
    file_bytes = response["Body"].read()
    content_type = response.get("ContentType", "")

    lower_key = document_key.lower()

    if lower_key.endswith(".pdf") or "pdf" in content_type:
        page_images = convert_pdf_to_images(file_bytes)
        return [
            (base64.b64encode(img_bytes).decode("utf-8"), fmt)
            for img_bytes, fmt in page_images
        ]

    # Map extensions to format names
    extension_map = {
        ".png": "png",
        ".jpg": "jpeg",
        ".jpeg": "jpeg",
        ".gif": "gif",
        ".webp": "webp",
        ".tiff": "tiff",
        ".bmp": "bmp",
    }

    for ext, fmt in extension_map.items():
        if lower_key.endswith(ext):
            return [(base64.b64encode(file_bytes).decode("utf-8"), fmt)]

    logger.warning(f"Unsupported file type for {document_key}, treating as PNG")
    return [(base64.b64encode(file_bytes).decode("utf-8"), "png")]


def get_document_images_from_uri(
    s3_client: Any,
    s3_uri: str,
) -> List[Tuple[str, str]]:
    """
    Convenience wrapper around :func:`get_document_images` that accepts an S3 URI.

    Parses ``s3://bucket/key`` and delegates to :func:`get_document_images`.

    Args:
        s3_client: Boto3 S3 client.
        s3_uri: Full S3 URI (e.g. ``s3://my-bucket/path/to/doc.pdf``).

    Returns:
        List of ``(base64_string, format)`` tuples.

    Raises:
        ValueError: If *s3_uri* does not start with ``s3://``.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Invalid S3 URI: {s3_uri}")

    path = s3_uri[5:]
    parts = path.split("/", 1)
    bucket = parts[0]
    key = parts[1] if len(parts) > 1 else ""

    return get_document_images(s3_client, bucket, key)
```
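As a sanity check of the baseline-parsing contract in this module, here is a standalone sketch that inlines the Format 1 path of `get_extraction_fields` and `format_baseline_for_training` (inlined so it runs without the package installed; the sample invoice fields are illustrative, not from the accelerator's test data):

```python
from typing import Any, Dict, List


def get_extraction_fields(baseline: Dict[str, Any]) -> List[str]:
    # Inline copy of the Format 1 path: Sections -> Extraction keys.
    fields: List[str] = []
    for section in baseline.get("Sections", []):
        fields.extend(section.get("Extraction", {}).keys())
    return [f for f in fields if f]


def format_baseline_for_training(baseline: Dict[str, Any]) -> Dict[str, Any]:
    # Inline copy of the Format 1 path, unwrapping {"value": ...} objects.
    result: Dict[str, Any] = {}
    for section in baseline.get("Sections", []):
        for key, value in section.get("Extraction", {}).items():
            if isinstance(value, dict):
                result[key] = value.get("value", value.get("Value", str(value)))
            else:
                result[key] = value
    return result


baseline = {
    "Sections": [
        {"Extraction": {"invoice_number": {"value": "INV-001"}, "total": "42.00"}}
    ]
}
print(get_extraction_fields(baseline))         # ['invoice_number', 'total']
print(format_baseline_for_training(baseline))  # {'invoice_number': 'INV-001', 'total': '42.00'}
```

Note that field names pass through unchanged while nested `{"value": ...}` wrappers are flattened, which is what lets the training-data generator emit a clean `field -> value` JSONL record per document.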
