Commit f76e939

Merge branch 'fix/discovery-robustness' into 'develop'

fix: improve discovery robustness (markdown stripping, retry, config-version handling)

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!572

2 parents 7b35ec6 + f787841

4 files changed: 146 additions & 56 deletions


lib/idp_cli_pkg/idp_cli/cli.py

Lines changed: 2 additions & 2 deletions
@@ -4545,10 +4545,10 @@ def discover(
             )
         console.print("[green]✓ Batch discovery complete[/green]")

-        if stack_name and succeeded > 0:
+        if stack_name and config_version and succeeded > 0:
             console.print(
                 f"[green]✓ Schema(s) saved to configuration"
-                f"{' (version: ' + config_version + ')' if config_version else ''}[/green]"
+                f" (version: {config_version})[/green]"
             )

     if failed > 0:
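The guard change above can be sketched in isolation. This is a minimal, hypothetical helper (`discovery_summary` is not part of the codebase) showing why the extra `config_version` check matters: the "saved to configuration" line should only appear when a version was actually written, which per this merge request only happens when a version is explicitly specified.

```python
def discovery_summary(stack_name, config_version, succeeded, failed):
    """Hypothetical sketch of the fixed CLI summary logic: the version line
    is printed only when a config version was supplied."""
    lines = ["✓ Batch discovery complete"]
    # Schemas are only saved when an explicit version is given, so the
    # message is gated on config_version as well as success count.
    if stack_name and config_version and succeeded > 0:
        lines.append(f"✓ Schema(s) saved to configuration (version: {config_version})")
    if failed > 0:
        lines.append(f"✗ {failed} document(s) failed")
    return lines
```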

lib/idp_common_pkg/idp_common/config/system_defaults/base-discovery.yaml

Lines changed: 17 additions & 19 deletions
@@ -58,22 +58,21 @@ discovery:
       extraction and ensure consistency with expected document structure and
       field definitions.
     max_tokens: "10000"
-    top_p: "0.0"
-    temperature: "0.0"
+    top_p: "0.1"
+    temperature: "1.0"
     user_prompt: >-
-      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
       <GROUND_TRUTH_REFERENCE>
       {ground_truth_json}
       </GROUND_TRUTH_REFERENCE>
       Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
       Image may contain multiple pages, process all pages.
       Extract all field names including those without values.
       Do not change the group name and field name from ground truth in the extracted data json.
-      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
-      Add two fields document_class and document_description.
-      For document_class generate a short name based on the document content like W4, I-9, Paystub.
-      For document_description generate a description about the document in less than 50 words.
-      If the group repeats and follows table format, update the attributeType as "list".
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Make sure to fill out the top-level "$id" and "x-aws-idp-document-type" with the extracted document class, and the top-level "description" with a brief description of the document class.
+      Nesting Groups: Do not nest the groups i.e. groups within groups. All groups should be directly associated under main "properties".
+      If the group repeats and follows table format, update the attributeType as "list".
       Do not extract the values.
       Format the extracted data using the below JSON format:
       Format the extracted groups and fields using the below JSON format:
@@ -85,21 +84,20 @@ discovery:
       and organizational structure. Focus on creating comprehensive blueprints
       for document processing without extracting actual values.
     max_tokens: "10000"
-    top_p: "0.0"
-    temperature: "0.0"
+    top_p: "0.1"
+    temperature: "1.0"
     user_prompt: >-
       This image contains forms data. Analyze the form line by line.
-      Image may contains multiple pages, process all the pages.
-      Form may contain multiple name value pair in one line.
-      Extract all the names in the form including the name value pair which doesn't have value.
-      Organize them into groups, extract field_name, data_type and field description
-      Field_name should be less than 60 characters, should not have space use '-' instead of space.
-      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Image may contain multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description.
+      Field names should be less than 30 characters, use camelCase or snake_case, name should not start with number and name should not have special characters.
+      Field descriptions should include location hints (box number, line number, section).
       Field_name should be unique within the group.
-      Add two fields document_class and document_description.
-      For document_class generate a short name based on the document content like W4, I-9, Paystub.
-      For document_description generate a description about the document in less than 50 words.
+      Make sure to fill out the top-level "$id" and "x-aws-idp-document-type" with the extracted document class, and the top-level "description" with a brief description of the document class.
       Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      Nesting Groups: Do not nest the groups i.e. groups within groups. All groups should be directly associated under main "properties".
       If the group repeats and follows table format, update the attributeType as "list".
       Do not extract the values.
       Return the extracted data in JSON format.
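Note that the YAML stores the sampling parameters as strings ("10000", "1.0", "0.1"), so consuming code has to cast them before invoking the model. A minimal sketch, assuming a hypothetical `parse_inference_params` helper and the key names shown above:

```python
def parse_inference_params(mode_cfg: dict) -> dict:
    """Hypothetical helper: the discovery YAML stores numeric inference
    parameters as strings, so cast them before passing to the model client.
    Defaults here are illustrative, not taken from the codebase."""
    return {
        "max_tokens": int(mode_cfg.get("max_tokens", "4096")),
        "temperature": float(mode_cfg.get("temperature", "1.0")),
        "top_p": float(mode_cfg.get("top_p", "0.1")),
    }
```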

lib/idp_common_pkg/idp_common/discovery/classes_discovery.py

Lines changed: 11 additions & 2 deletions
@@ -3,6 +3,7 @@
 import json
 import logging
 import os
+import re
 from typing import Any, Dict, Optional, cast

 import jsonschema
@@ -246,6 +247,14 @@ def _merge_and_save_class(self, new_class: Dict[str, Any]) -> None:
             "Config", existing_custom, version=self.version
         )

+    @staticmethod
+    def _extract_json(text: str) -> str:
+        """Strip markdown code fences from LLM response before JSON parsing."""
+        match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", text, re.DOTALL)
+        if match:
+            return match.group(1)
+        return text
+
     def _validate_json_schema(self, schema: Dict[str, Any]) -> tuple[bool, str]:
         """
         Validate that the response is a valid JSON Schema.
@@ -367,7 +376,7 @@ def _extract_data_from_document(
         )

         # Parse JSON response
-        schema = json.loads(content_text)
+        schema = json.loads(self._extract_json(content_text))

         # Validate the schema
         is_valid, error_msg = self._validate_json_schema(schema)
@@ -493,7 +502,7 @@ def _extract_data_from_document_with_ground_truth(
         )

         # Parse JSON response
-        schema = json.loads(content_text)
+        schema = json.loads(self._extract_json(content_text))

         # Validate the schema
         is_valid, error_msg = self._validate_json_schema(schema)

lib/idp_sdk/idp_sdk/operations/discovery.py

Lines changed: 116 additions & 33 deletions
@@ -6,8 +6,9 @@
 import json
 import logging
 import os
+import re
 from pathlib import Path
-from typing import List, Optional
+from typing import Any, Dict, List, Optional

 from idp_sdk.exceptions import IDPConfigurationError, IDPResourceNotFoundError
 from idp_sdk.models.discovery import DiscoveryBatchResult, DiscoveryResult
@@ -116,29 +117,47 @@ def _run_with_stack(

         from idp_common.discovery.classes_discovery import ClassesDiscovery

-        # ClassesDiscovery reads config from DynamoDB but we pass file_bytes
-        # so it never reads from S3. input_prefix is used only for extension.
-        discovery = ClassesDiscovery(
-            input_bucket="local",
-            input_prefix=doc_path.name,
-            region=self._client._region,
-            version=config_version,
-        )
+        # Try loading the requested config version; fall back to active
+        # config if the version doesn't exist yet (user wants to create it).
+        try:
+            discovery = ClassesDiscovery(
+                input_bucket="local",
+                input_prefix=doc_path.name,
+                region=self._client._region,
+                version=config_version,
+            )
+        except Exception:
+            if config_version is None:
+                raise
+            logger.warning(
+                f"Config version '{config_version}' not found, "
+                f"reading from active config and saving to '{config_version}'"
+            )
+            discovery = ClassesDiscovery(
+                input_bucket="local",
+                input_prefix=doc_path.name,
+                region=self._client._region,
+                version=None,
+            )
+            discovery.version = config_version
+
+        # Only save to config if a version was explicitly specified
+        save = config_version is not None

         if gt_data:
             result = discovery.discovery_classes_with_document_and_ground_truth(
                 input_bucket="local",
                 input_prefix=doc_path.name,
                 file_bytes=file_bytes,
                 ground_truth_data=gt_data,
-                save_to_config=True,
+                save_to_config=save,
             )
         else:
             result = discovery.discovery_classes_with_document(
                 input_bucket="local",
                 input_prefix=doc_path.name,
                 file_bytes=file_bytes,
-                save_to_config=True,
+                save_to_config=save,
             )

         schema = result.get("schema")
@@ -167,6 +186,7 @@ def _run_local(
         doc_path: Path,
         file_bytes: bytes,
         gt_data: Optional[dict],
+        max_retries: int = 3,
     ) -> DiscoveryResult:
         """Local mode: uses system defaults, no stack needed, no config save."""
         try:
@@ -202,8 +222,6 @@ def _run_local(
             else:
                 user_prompt = mode_cfg.get("user_prompt") or _prompt_without_gt()

-            full_prompt = f"{user_prompt}\nFormat the extracted data using the below JSON format:\n{sample_format}"
-
             # Create content with file bytes
             file_extension = doc_path.suffix.lower().lstrip(".")
             if file_extension == "pdf":
@@ -215,35 +233,74 @@ def _run_local(
                             "source": {"bytes": file_bytes},
                         }
                     },
-                    {"text": full_prompt},
                 ]
             else:
-                image_content = image.prepare_bedrock_image_attachment(file_bytes)
-                content = [image_content, {"text": full_prompt}]
+                content = [image.prepare_bedrock_image_attachment(file_bytes)]

-            # Call Bedrock
+            # Call Bedrock with retry/validation loop
             region = self._client._region or os.environ.get("AWS_REGION", "us-west-2")
-            client = bedrock.BedrockClient(region=region)
-            response = client.invoke_model(
-                model_id=model_id,
-                system_prompt=system_prompt,
-                content=content,
-                temperature=temperature,
-                top_p=top_p,
-                max_tokens=max_tokens,
-                context="ClassesDiscoveryLocal",
-            )
+            bedrock_client = bedrock.BedrockClient(region=region)
+
+            validation_feedback = ""
+            for attempt in range(max_retries):
+                try:
+                    retry_prompt = ""
+                    if attempt > 0 and validation_feedback:
+                        retry_prompt = (
+                            f"\n\nPREVIOUS ATTEMPT FAILED: {validation_feedback}\n"
+                            f"Please fix the issue and generate a valid JSON Schema.\n\n"
+                        )
+
+                    full_prompt = (
+                        f"{retry_prompt}{user_prompt}\n"
+                        f"Format the extracted data using the below JSON format:\n{sample_format}"
+                    )

-            content_text = bedrock.extract_text_from_response(response)
-            schema = json.loads(content_text)
+                    response = bedrock_client.invoke_model(
+                        model_id=model_id,
+                        system_prompt=system_prompt,
+                        content=content + [{"text": full_prompt}],
+                        temperature=temperature,
+                        top_p=top_p,
+                        max_tokens=max_tokens,
+                        context="ClassesDiscoveryLocal",
+                    )

-            doc_class = schema.get("$id") or schema.get("x-aws-idp-document-type")
+                    content_text = bedrock.extract_text_from_response(response)
+                    schema = json.loads(_extract_json(content_text))
+
+                    is_valid, error_msg = _validate_json_schema(schema)
+                    if is_valid:
+                        logger.info(
+                            f"Successfully generated valid JSON Schema on attempt {attempt + 1}"
+                        )
+                        doc_class = schema.get("$id") or schema.get(
+                            "x-aws-idp-document-type"
+                        )
+                        return DiscoveryResult(
+                            status="SUCCESS",
+                            document_class=doc_class,
+                            json_schema=schema,
+                            document_path=str(doc_path),
+                        )
+                    else:
+                        validation_feedback = error_msg
+                        logger.warning(
+                            f"Invalid schema on attempt {attempt + 1}: {error_msg}"
+                        )
+
+                except json.JSONDecodeError as e:
+                    validation_feedback = f"Invalid JSON format: {str(e)}"
+                    logger.warning(f"JSON parse error on attempt {attempt + 1}: {e}")
+                except Exception as e:
+                    logger.error(f"Error on attempt {attempt + 1}: {e}")
+                    if attempt == max_retries - 1:
+                        raise

             return DiscoveryResult(
-                status="SUCCESS",
-                document_class=doc_class,
-                json_schema=schema,
+                status="FAILED",
                 document_path=str(doc_path),
+                error=f"Failed to generate valid schema after {max_retries} attempts",
             )

         except Exception as e:
@@ -319,6 +376,32 @@ def _get_config_table(self, stack_name: str) -> str:
         raise IDPResourceNotFoundError("ConfigurationTable not found in stack.")


+# --- Helpers for local mode ---
+
+
+def _extract_json(text: str) -> str:
+    """Strip markdown code fences from LLM response before JSON parsing."""
+    match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", text, re.DOTALL)
+    if match:
+        return match.group(1)
+    return text
+
+
+def _validate_json_schema(schema: Dict[str, Any]) -> tuple:
+    """Validate that the response is a valid JSON Schema."""
+    required_fields = ["$schema", "$id", "type", "properties"]
+    for field in required_fields:
+        if field not in schema:
+            return False, f"Missing required field: {field}"
+    if "x-aws-idp-document-type" not in schema:
+        return False, "Missing x-aws-idp-document-type field"
+    if schema.get("type") != "object":
+        return False, "Root type must be 'object'"
+    if not isinstance(schema.get("properties"), dict):
+        return False, "Properties must be an object"
+    return True, ""
+
+
 # --- Standalone prompt helpers for local mode ---