
Commit 227061e

feat(docs): add new guide for creating custom test sets with ground truth data
1 parent a0a327b commit 227061e

3 files changed

Lines changed: 170 additions & 0 deletions


CHANGELOG.md

Lines changed: 3 additions & 0 deletions

@@ -7,6 +7,9 @@ SPDX-License-Identifier: MIT-0

### Added

- **Creating Custom Test Sets Guide** — New tutorial-style documentation (`docs/creating-custom-test-sets.md`) walking through the end-to-end workflow for creating custom test sets with ground truth data from scratch: configure for max accuracy, discover document schema, process samples, review/edit predictions, save evaluation baselines, register test sets, and run comparative test executions to evaluate cost vs. accuracy tradeoffs. Referenced from `docs/demo-videos.md`.

- **Configuration Version Tracking Across All Analytics Tables** — Added `config_version` field to all analytics tables (metering, document_evaluations, section_evaluations, attribute_evaluations, and document_sections_*) to enable comprehensive tracking and analytics per configuration version. All Glue tables now include a `config_version` column, and all Parquet files store the configuration version used for each document. Enables direct filtering and comparison queries without complex JOINs: users can query "show me W2 documents processed with config v2.1" or "compare accuracy for configs v2.0 vs v2.1" with simple WHERE clauses. Supports cost analysis, A/B testing, quality comparison, and data lineage tracking. Documents without a config version default to "default".

### Fixed

docs/creating-custom-test-sets.md

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
---
title: "Creating Custom Test Sets with Ground Truth"
---

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

# Creating Custom Test Sets with Ground Truth

This guide walks through the end-to-end workflow for creating a custom test set with ground truth (evaluation baseline) data from scratch. Once created, the test set can be used for:

- **Benchmarking** — Compare accuracy across different models and configurations
- **Cost optimization** — Find the cheapest model that meets your accuracy requirements
- **Prompt engineering** — Measure the impact of prompt and schema changes
- **Custom model training** — Provide labeled training data for fine-tuning (see [Custom Model Fine-Tuning](./custom-model-finetuning.md))

> **Pre-deployed test sets**: The accelerator ships with four ready-to-use benchmark datasets. If you just want to run tests against those, see [Test Studio — Pre-Deployed Test Sets](./test-studio.md#pre-deployed-test-sets). This guide is for creating your **own** test set from your own documents.
## Workflow Overview

```
┌─────────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐    ┌───────────────┐
│1. Configure │───▶│ 2. Discover │───▶│ 3. Process   │───▶│ 4. Review &  │───▶│ 5. Create   │───▶│ 6. Run Test   │
│    Models   │    │    Schema   │    │   Documents  │    │    Correct   │    │   Test Set  │    │   Executions  │
└─────────────┘    └─────────────┘    └──────────────┘    └──────────────┘    └─────────────┘    └───────────────┘
  Use the best      Bootstrap          Process sample     Edit predictions    Save as eval       Compare models,
  model for high    document classes   docs with your     and fix errors      baseline &         prompts, and
  accuracy          from samples       configuration      in the UI editor    register set       configurations
```
## Step 1: Configure for Maximum Accuracy

The goal of this initial run is to produce predictions that are as accurate as possible, minimizing the amount of manual editing you'll need to do later. Use the best available model for both classification and extraction.

1. Go to **Configuration** in the web UI
2. Create a new configuration version (or edit an existing one)
3. Set both the **classification model** and **extraction model** to a high-accuracy model (e.g., Claude Opus)
4. Save the configuration version

> **Tip**: You can always create a cheaper configuration later for production use. The expensive model is only used here to bootstrap high-quality ground truth.

For details on configuration management, see [Configuration](./configuration.md) and [Configuration Versions](./configuration-versions.md).
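As a rough illustration, a configuration version tuned for accuracy might capture settings like the sketch below. The key names and the model identifier are illustrative assumptions, not the accelerator's actual configuration schema; use the Configuration page in the web UI for the real structure.

```yaml
# Hypothetical sketch of a high-accuracy configuration version.
# Key names and the model identifier are assumptions for illustration only.
classification:
  model: anthropic.claude-opus        # placeholder model identifier
extraction:
  model: anthropic.claude-opus        # same high-accuracy model for extraction
```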
## Step 2: Discover the Document Schema

If you don't already have document classes defined for your document type, use Discovery to bootstrap the schema automatically.

1. Go to **Discovery** in the web UI
2. Select your high-accuracy configuration version
3. Upload a representative sample document
4. Run discovery — it will analyze the document and populate document classes and attributes

After discovery completes, verify the schema in your configuration under **Document Schema**. You should see the discovered document class with its attributes populated.

For details on discovery modes and options, see [Discovery](./discovery.md).
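For a sense of what discovery produces, a discovered class might look roughly like the sketch below (the class name, attribute names, and field layout are hypothetical; check **Document Schema** in your configuration for the real structure):

```json
{
  "name": "W2",
  "description": "IRS Form W-2 wage and tax statement",
  "attributes": [
    { "name": "employee_name", "description": "Employee's full name as printed on the form" },
    { "name": "employer_ein",  "description": "Employer identification number (box b)" },
    { "name": "wages",         "description": "Wages, tips, other compensation (box 1)" }
  ]
}
```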
## Step 3: Process Your Sample Documents

Now process a set of sample documents that will become your test set.

1. Go to **Upload Documents** in the web UI
2. Select your high-accuracy configuration version
3. Upload your sample documents
4. Wait for all documents to finish processing

> **How many documents?** For illustration, a handful of documents is fine. For a meaningful benchmark test set, aim for a larger representative sample. For custom model training, you'll need a significant number of labeled documents — see [Custom Model Fine-Tuning](./custom-model-finetuning.md) for guidance on training data requirements.
## Step 4: Review, Edit, and Save Ground Truth

This is the most important step. You'll review each document's predictions, correct any errors, and save the corrected version as the evaluation baseline (ground truth).

### Review and Edit Predictions

For each processed document:

1. Open the document from the document list
2. Click **View Data** to see the extracted information
3. Click **Edit Data** to enter edit mode
4. Review each extracted field:
   - Click on a field to highlight it in the document viewer
   - Compare the extracted value against the source document
   - Correct any errors by editing the field value directly
5. **Save** your changes — the system creates a revision history of all edits

> **Tip**: The solution generates a confidence score for each field. To save time, you can focus on reviewing lower-confidence fields first. However, for the highest-quality ground truth, review all fields.

### Save as Evaluation Baseline

Once you're confident the predictions are correct for a document:

1. Click the **Use as Evaluation Baseline** button
2. The system copies the corrected predictions to the evaluation baseline bucket

Repeat this for every document you want to include in your test set.

For details on the editing interface, see [Web UI — Edit Data](./web-ui.md#edit-data). For details on the evaluation baseline concept, see [Evaluation Framework](./evaluation.md).
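The review-prioritization tip above is easy to script if you export a document's extracted fields. A minimal sketch, assuming a flat list of field/value/confidence records (the field names and confidence scores are invented):

```python
# Order extracted fields so the least-confident ones are reviewed first.
# Field names and confidence scores are invented for illustration.
fields = [
    {"field": "employee_name", "value": "Jane Doe",   "confidence": 0.98},
    {"field": "wages",         "value": "52,340.00",  "confidence": 0.71},
    {"field": "employer_ein",  "value": "12-3456789", "confidence": 0.88},
]

review_order = sorted(fields, key=lambda f: f["confidence"])
for f in review_order:
    print(f"{f['confidence']:.2f}  {f['field']}")
```

The lowest-confidence field (`wages` here) prints first, so a reviewer's attention goes where errors are most likely.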
## Step 5: Create the Test Set

Now register a test set that references your documents and their ground truth.

1. Go to the **Test Studio** → **Test Sets** tab
2. Click **Add Test Set**
3. Give the test set a name
4. Specify the input bucket path containing your processed files
5. Verify the file count matches your expectations
6. Click **Add Test Set**

For details on test set management, see [Test Studio](./test-studio.md).
## Step 6: Run Test Executions and Compare

With your test set created, you can now run test executions to compare different configurations.

### Run a Baseline Test

1. Go to the **Test Studio** → **Test Executions** tab
2. Select your test set
3. Choose the high-accuracy configuration version you used to create the ground truth
4. Run the test

This establishes your baseline — it should show near-perfect accuracy, since the ground truth was generated from these same model predictions.

### Compare with Alternative Configurations

Create and test alternative configurations to find the best cost/accuracy balance:

1. Create a new configuration version with a cheaper model (e.g., Nova Lite)
2. Run a test execution against the same test set using the new configuration
3. Use the **comparison view** to analyze the results side-by-side

### Analyzing Results

The comparison view shows:

- **Overall accuracy** — How each configuration performed against the ground truth
- **Cost comparison** — Total processing cost for each configuration
- **Field-level metrics** — Which specific fields lost accuracy with the cheaper model

This data helps you identify:

- Whether a cheaper model meets your accuracy requirements
- Which fields need attention (e.g., improved prompts, better attribute descriptions)
- The cost/accuracy tradeoff for your specific document type

For details on evaluation metrics and reporting, see [Evaluation Framework](./evaluation.md) and [Enhanced Reporting](./evaluation-enhanced-reporting.md).
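Conceptually, field-level accuracy is a match rate per field against the baseline. The toy sketch below uses strict exact-match comparison over invented data; the actual evaluation framework is more sophisticated than this, so treat it only as an illustration of the metric:

```python
# Toy field-level accuracy: fraction of documents where the predicted value
# exactly matches the ground-truth baseline, computed per field.
# All data here is invented; the real evaluation framework is more sophisticated.
ground_truth = {
    "doc1": {"wages": "52340.00", "employer_ein": "12-3456789"},
    "doc2": {"wages": "18975.50", "employer_ein": "98-7654321"},
}
predictions = {
    "doc1": {"wages": "52340.00", "employer_ein": "12-3456789"},
    "doc2": {"wages": "18975.50", "employer_ein": "98-765432"},   # last digit dropped
}

def field_accuracy(ground_truth, predictions):
    """Per-field exact-match rate across all documents."""
    totals, matches = {}, {}
    for doc_id, truth in ground_truth.items():
        for field, expected in truth.items():
            totals[field] = totals.get(field, 0) + 1
            if predictions.get(doc_id, {}).get(field) == expected:
                matches[field] = matches.get(field, 0) + 1
    return {f: matches.get(f, 0) / totals[f] for f in totals}

print(field_accuracy(ground_truth, predictions))
```

A per-field breakdown like this is what tells you whether a cheaper model failed broadly or only on a handful of fields you can fix with better prompts.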
## Next Steps

- **Improve accuracy**: Use field-level metrics to refine your document class descriptions, attribute prompts, and few-shot examples. See [IDP Configuration Best Practices](./idp-configuration-best-practices.md) and [Few-Shot Examples](./few-shot-examples.md).
- **Train a custom model**: If your test set is large enough, use it to fine-tune a custom model. See [Custom Model Fine-Tuning](./custom-model-finetuning.md).
- **Automate with CLI/SDK**: Create and run test sets programmatically. See [IDP CLI](./idp-cli.md) and [IDP SDK](./idp-sdk.md).

## Related Documentation

- [Configuration](./configuration.md)
- [Discovery](./discovery.md)
- [Test Studio](./test-studio.md)
- [Evaluation Framework](./evaluation.md)
- [Web UI](./web-ui.md)
- [Custom Model Fine-Tuning](./custom-model-finetuning.md)

docs/demo-videos.md

Lines changed: 7 additions & 0 deletions
@@ -277,6 +277,13 @@ https://github.com/user-attachments/assets/d952fd37-1bd0-437f-8f67-5a634e9422e0

---

### Creating Custom Test Sets with Ground Truth

End-to-end workflow for creating your own test set from scratch — configure for high accuracy, discover the schema, process and review documents, save ground truth, and compare model accuracy vs. cost.

**Related Documentation**: [Creating Custom Test Sets](./creating-custom-test-sets.md)

---

## Rule Validation

### Rule Validation Demo
