lib/idp_common_pkg/idp_common/config/system_defaults/base-discovery.yaml
Lines changed: 17 additions & 19 deletions
@@ -58,22 +58,21 @@ discovery:
 extraction and ensure consistency with expected document structure and
 field definitions.
 max_tokens: "10000"
-top_p: "0.0"
-temperature: "0.0"
+top_p: "0.1"
+temperature: "1.0"
 user_prompt: >-
-This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
 <GROUND_TRUTH_REFERENCE>
 {ground_truth_json}
 </GROUND_TRUTH_REFERENCE>
 Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
 Image may contain multiple pages, process all pages.
 Extract all field names including those without values.
 Do not change the group name and field name from ground truth in the extracted data json.
-Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
-Add two fields document_class and document_description.
-For document_class generate a short name based on the document content like W4, I-9, Paystub.
-For document_description generate a description about the document in less than 50 words.
-If the group repeats and follows table format, update the attributeType as "list".
+Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+Make sure to fill out the top-level "$id" and "x-aws-idp-document-type" with the extracted document class, and the top-level "description" with a brief description of the document class.
+Nesting Groups: Do not nest the groups i.e. groups within groups. All groups should be directly associated under main "properties".
+If the group repeats and follows table format, update the attributeType as "list".
 Do not extract the values.
 Format the extracted data using the below JSON format:
 Format the extracted groups and fields using the below JSON format:
@@ -85,21 +84,20 @@ discovery:
 and organizational structure. Focus on creating comprehensive blueprints
 for document processing without extracting actual values.
 max_tokens: "10000"
-top_p: "0.0"
-temperature: "0.0"
+top_p: "0.1"
+temperature: "1.0"
 user_prompt: >-
 This image contains forms data. Analyze the form line by line.
-Image may contains multiple pages, process all the pages.
-Form may contain multiple name value pair in one line.
-Extract all the names in the form including the name value pair which doesn't have value.
-Organize them into groups, extract field_name, data_type and field description
-Field_name should be less than 60 characters, should not have space use '-' instead of space.
-field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+Image may contain multiple pages, process all the pages.
+Form may contain multiple name value pair in one line.
+Extract all the names in the form including the name value pair which doesn't have value.
+Organize them into groups, extract field_name, data_type and field description.
+Field names should be less than 30 characters, use camelCase or snake_case, name should not start with number and name should not have special characters.
+Field descriptions should include location hints (box number, line number, section).
 Field_name should be unique within the group.
-Add two fields document_class and document_description.
-For document_class generate a short name based on the document content like W4, I-9, Paystub.
-For document_description generate a description about the document in less than 50 words.
+Make sure to fill out the top-level "$id" and "x-aws-idp-document-type" with the extracted document class, and the top-level "description" with a brief description of the document class.
 Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+Nesting Groups: Do not nest the groups i.e. groups within groups. All groups should be directly associated under main "properties".
 If the group repeats and follows table format, update the attributeType as "list".
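
The new field-name rule in the second prompt (under 30 characters, camelCase or snake_case, no leading number, no special characters) is only an instruction to the model, so a caller may want to check the returned names. The following is a minimal sketch of such a check, not code from this change: the helper name `field_name_problems` is hypothetical, and it assumes "no special characters" means ASCII letters, digits, and underscores only. It does not try to tell camelCase from snake_case.

```python
import re
from typing import List

# Assumption: the prompt's "less than 30 characters" is a strict upper bound.
MAX_FIELD_NAME_LEN = 30

# Assumption: "no special characters" is read as ASCII letters, digits, underscore.
_ALLOWED_CHARS = re.compile(r"[A-Za-z0-9_]+")


def field_name_problems(name: str) -> List[str]:
    """Return the rule violations for one field name (empty list if none)."""
    problems = []
    if len(name) >= MAX_FIELD_NAME_LEN:
        problems.append("too long")
    if name[:1].isdigit():
        problems.append("starts with a digit")
    # fullmatch also rejects the empty string, spaces, and hyphens.
    if not _ALLOWED_CHARS.fullmatch(name):
        problems.append("disallowed characters")
    return problems


if __name__ == "__main__":
    for candidate in ["employeeName", "employee_name", "1st_name", "first-name"]:
        print(candidate, field_name_problems(candidate))
```

Names that follow the old rule (spaces replaced with '-') would be flagged by this check, which matches the direction of the prompt change.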