
Commit 227061e

feat(docs): add new guide for creating custom test sets with ground truth data
1 parent a0a327b commit 227061e

3 files changed

Lines changed: 170 additions & 0 deletions


CHANGELOG.md

Lines changed: 3 additions & 0 deletions

@@ -7,6 +7,9 @@ SPDX-License-Identifier: MIT-0

### Added

- **Creating Custom Test Sets Guide** — New tutorial-style documentation (`docs/creating-custom-test-sets.md`) walking through the end-to-end workflow for creating custom test sets with ground truth data from scratch: configure for max accuracy, discover document schema, process samples, review/edit predictions, save evaluation baselines, register test sets, and run comparative test executions to evaluate cost vs. accuracy tradeoffs. Referenced from `docs/demo-videos.md`.

- **Configuration Version Tracking Across All Analytics Tables** — Added `config_version` field to all analytics tables (metering, document_evaluations, section_evaluations, attribute_evaluations, and document_sections_*) to enable comprehensive tracking and analytics per configuration version. All Glue tables now include a `config_version` column, and all Parquet files store the configuration version used for each document. Enables direct filtering and comparison queries without complex JOINs: users can query "show me W2 documents processed with config v2.1" or "compare accuracy for configs v2.0 vs v2.1" with simple WHERE clauses. Supports cost analysis, A/B testing, quality comparison, and data lineage tracking. Documents without a config version default to "default".

### Fixed

docs/creating-custom-test-sets.md

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
---
title: "Creating Custom Test Sets with Ground Truth"
---

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

# Creating Custom Test Sets with Ground Truth

This guide walks through the end-to-end workflow for creating a custom test set with ground truth (evaluation baseline) data from scratch. Once created, the test set can be used for:

- **Benchmarking** — Compare accuracy across different models and configurations
- **Cost optimization** — Find the cheapest model that meets your accuracy requirements
- **Prompt engineering** — Measure the impact of prompt and schema changes
- **Custom model training** — Provide labeled training data for fine-tuning (see [Custom Model Fine-Tuning](./custom-model-finetuning.md))

> **Pre-deployed test sets**: The accelerator ships with four ready-to-use benchmark datasets. If you just want to run tests against those, see [Test Studio — Pre-Deployed Test Sets](./test-studio.md#pre-deployed-test-sets). This guide is for creating your **own** test set from your own documents.
## Workflow Overview

```
┌─────────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐    ┌───────────────┐
│1. Configure │───▶│ 2. Discover │───▶│ 3. Process   │───▶│ 4. Review &  │───▶│ 5. Create   │───▶│ 6. Run Test   │
│    Models   │    │    Schema   │    │   Documents  │    │    Correct   │    │   Test Set  │    │   Executions  │
└─────────────┘    └─────────────┘    └──────────────┘    └──────────────┘    └─────────────┘    └───────────────┘
  Use the best      Bootstrap          Process sample     Edit predictions    Save as eval       Compare models,
  model for high    document classes   docs with your     and fix errors      baseline &         prompts, and
  accuracy          from samples       configuration      in the UI editor    register set       configurations
```
## Step 1: Configure for Maximum Accuracy

The goal of this initial run is to produce predictions that are as accurate as possible, minimizing the amount of manual editing you'll need to do later. Use the best available model for both classification and extraction.

1. Go to **Configuration** in the web UI
2. Create a new configuration version (or edit an existing one)
3. Set both the **classification model** and **extraction model** to a high-accuracy model (e.g., Claude Opus)
4. Save the configuration version

> **Tip**: You can always create a cheaper configuration later for production use. The expensive model is only used here to bootstrap high-quality ground truth.

For details on configuration management, see [Configuration](./configuration.md) and [Configuration Versions](./configuration-versions.md).
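As a rough illustration, a configuration version tuned for accuracy might capture settings like the sketch below. The key names and the model identifier are illustrative assumptions, not the accelerator's actual configuration schema; use the Configuration page in the web UI for the real structure.

```yaml
# Hypothetical sketch of a high-accuracy configuration version.
# Key names and the model identifier are assumptions for illustration only.
classification:
  model: anthropic.claude-opus        # placeholder model identifier
extraction:
  model: anthropic.claude-opus        # same high-accuracy model for extraction
```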
## Step 2: Discover the Document Schema

If you don't already have document classes defined for your document type, use Discovery to bootstrap the schema automatically.

1. Go to **Discovery** in the web UI
2. Select your high-accuracy configuration version
3. Upload a representative sample document
4. Run discovery — it will analyze the document and populate document classes and attributes

After discovery completes, verify the schema in your configuration under **Document Schema**. You should see the discovered document class with its attributes populated.

For details on discovery modes and options, see [Discovery](./discovery.md).
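For a sense of what discovery produces, a discovered class might look roughly like the sketch below (the class name, attribute names, and field layout are hypothetical; check **Document Schema** in your configuration for the real structure):

```json
{
  "name": "W2",
  "description": "IRS Form W-2 wage and tax statement",
  "attributes": [
    { "name": "employee_name", "description": "Employee's full name as printed on the form" },
    { "name": "employer_ein",  "description": "Employer identification number (box b)" },
    { "name": "wages",         "description": "Wages, tips, other compensation (box 1)" }
  ]
}
```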
## Step 3: Process Your Sample Documents

Now process a set of sample documents that will become your test set.

1. Go to **Upload Documents** in the web UI
2. Select your high-accuracy configuration version
3. Upload your sample documents
4. Wait for all documents to finish processing

> **How many documents?** For illustration, a handful of documents is fine. For a meaningful benchmark test set, aim for a larger representative sample. For custom model training, you'll need a significant number of labeled documents — see [Custom Model Fine-Tuning](./custom-model-finetuning.md) for guidance on training data requirements.
## Step 4: Review, Edit, and Save Ground Truth

This is the most important step. You'll review each document's predictions, correct any errors, and save the corrected version as the evaluation baseline (ground truth).

### Review and Edit Predictions

For each processed document:

1. Open the document from the document list
2. Click **View Data** to see the extracted information
3. Click **Edit Data** to enter edit mode
4. Review each extracted field:
   - Click on a field to highlight it in the document viewer
   - Compare the extracted value against the source document
   - Correct any errors by editing the field value directly
5. **Save** your changes — the system creates a revision history of all edits

> **Tip**: The solution generates a confidence score for each field. To save time, you can focus on reviewing lower-confidence fields first. However, for the highest-quality ground truth, review all fields.

### Save as Evaluation Baseline

Once you're confident the predictions are correct for a document:

1. Click the **Use as Evaluation Baseline** button
2. The system copies the corrected predictions to the evaluation baseline bucket

Repeat this for every document you want to include in your test set.

For details on the editing interface, see [Web UI — Edit Data](./web-ui.md#edit-data). For details on the evaluation baseline concept, see [Evaluation Framework](./evaluation.md).
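The review-prioritization tip above is easy to script if you export a document's extracted fields. A minimal sketch, assuming a flat list of field/value/confidence records (the field names and confidence scores are invented):

```python
# Order extracted fields so the least-confident ones are reviewed first.
# Field names and confidence scores are invented for illustration.
fields = [
    {"field": "employee_name", "value": "Jane Doe",   "confidence": 0.98},
    {"field": "wages",         "value": "52,340.00",  "confidence": 0.71},
    {"field": "employer_ein",  "value": "12-3456789", "confidence": 0.88},
]

review_order = sorted(fields, key=lambda f: f["confidence"])
for f in review_order:
    print(f"{f['confidence']:.2f}  {f['field']}")
```

The lowest-confidence field (`wages` here) prints first, so a reviewer's attention goes where errors are most likely.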
## Step 5: Create the Test Set

Now register a test set that references your documents and their ground truth.

1. Go to the **Test Studio** → **Test Sets** tab
2. Click **Add Test Set**
3. Give the test set a name
4. Specify the input bucket path containing your processed files
5. Verify the file count matches your expectations
6. Click **Add Test Set**

For details on test set management, see [Test Studio](./test-studio.md).
## Step 6: Run Test Executions and Compare

With your test set created, you can now run test executions to compare different configurations.

### Run a Baseline Test

1. Go to the **Test Studio** → **Test Executions** tab
2. Select your test set
3. Choose the high-accuracy configuration version you used to create the ground truth
4. Run the test

This establishes your baseline — it should show near-perfect accuracy, since the ground truth was generated from these same model predictions.

### Compare with Alternative Configurations

Create and test alternative configurations to find the best cost/accuracy balance:

1. Create a new configuration version with a cheaper model (e.g., Nova Lite)
2. Run a test execution against the same test set using the new configuration
3. Use the **comparison view** to analyze the results side-by-side

### Analyzing Results

The comparison view shows:

- **Overall accuracy** — How each configuration performed against the ground truth
- **Cost comparison** — Total processing cost for each configuration
- **Field-level metrics** — Which specific fields lost accuracy with the cheaper model

This data helps you identify:

- Whether a cheaper model meets your accuracy requirements
- Which fields need attention (e.g., improved prompts, better attribute descriptions)
- The cost/accuracy tradeoff for your specific document type

For details on evaluation metrics and reporting, see [Evaluation Framework](./evaluation.md) and [Enhanced Reporting](./evaluation-enhanced-reporting.md).
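Conceptually, field-level accuracy is a match rate per field against the baseline. The toy sketch below uses strict exact-match comparison over invented data; the actual evaluation framework is more sophisticated than this, so treat it only as an illustration of the metric:

```python
# Toy field-level accuracy: fraction of documents where the predicted value
# exactly matches the ground-truth baseline, computed per field.
# All data here is invented; the real evaluation framework is more sophisticated.
ground_truth = {
    "doc1": {"wages": "52340.00", "employer_ein": "12-3456789"},
    "doc2": {"wages": "18975.50", "employer_ein": "98-7654321"},
}
predictions = {
    "doc1": {"wages": "52340.00", "employer_ein": "12-3456789"},
    "doc2": {"wages": "18975.50", "employer_ein": "98-765432"},   # last digit dropped
}

def field_accuracy(ground_truth, predictions):
    """Per-field exact-match rate across all documents."""
    totals, matches = {}, {}
    for doc_id, truth in ground_truth.items():
        for field, expected in truth.items():
            totals[field] = totals.get(field, 0) + 1
            if predictions.get(doc_id, {}).get(field) == expected:
                matches[field] = matches.get(field, 0) + 1
    return {f: matches.get(f, 0) / totals[f] for f in totals}

print(field_accuracy(ground_truth, predictions))
```

A per-field breakdown like this is what tells you whether a cheaper model failed broadly or only on a handful of fields you can fix with better prompts.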
## Next Steps

- **Improve accuracy**: Use field-level metrics to refine your document class descriptions, attribute prompts, and few-shot examples. See [IDP Configuration Best Practices](./idp-configuration-best-practices.md) and [Few-Shot Examples](./few-shot-examples.md).
- **Train a custom model**: If your test set is large enough, use it to fine-tune a custom model. See [Custom Model Fine-Tuning](./custom-model-finetuning.md).
- **Automate with CLI/SDK**: Create and run test sets programmatically. See [IDP CLI](./idp-cli.md) and [IDP SDK](./idp-sdk.md).

## Related Documentation

- [Configuration](./configuration.md)
- [Discovery](./discovery.md)
- [Test Studio](./test-studio.md)
- [Evaluation Framework](./evaluation.md)
- [Web UI](./web-ui.md)
- [Custom Model Fine-Tuning](./custom-model-finetuning.md)

docs/demo-videos.md

Lines changed: 7 additions & 0 deletions
@@ -277,6 +277,13 @@ https://github.com/user-attachments/assets/d952fd37-1bd0-437f-8f67-5a634e9422e0

---

### Creating Custom Test Sets with Ground Truth

End-to-end workflow for creating your own test set from scratch — configure for high accuracy, discover the schema, process and review documents, save ground truth, and compare model accuracy vs. cost.

**Related Documentation**: [Creating Custom Test Sets](./creating-custom-test-sets.md)

---

## Rule Validation

### Rule Validation Demo
