
Commit 91a287e

Merge branch 'feature/add-config-version-to-metering' into 'develop'

Adding config version in schema provider.

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!623

2 parents a5f0224 + 5306ec6 → commit 91a287e

7 files changed

Lines changed: 190 additions & 11 deletions

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ SPDX-License-Identifier: MIT-0

 ### Added

-- **Configuration Version in Metering Database** — Added `config_version` field to the metering database to enable cost tracking and analytics per configuration version. The metering Glue table now includes a `config_version` column, and all metering Parquet files store the configuration version used for each document. Enables Athena queries to compare costs across different configurations, support A/B testing analytics, and optimize per-version costs. Documents without a config version default to "default".
+- **Configuration Version Tracking Across All Analytics Tables** — Added `config_version` field to all analytics tables (metering, document_evaluations, section_evaluations, attribute_evaluations, and document_sections_*) to enable comprehensive tracking and analytics per configuration version. All Glue tables now include a `config_version` column, and all Parquet files store the configuration version used for each document. Enables direct filtering and comparison queries without complex JOINs - users can query "show me W2 documents processed with config v2.1" or "compare accuracy for configs v2.0 vs v2.1" with simple WHERE clauses. Supports cost analysis, A/B testing, quality comparison, and data lineage tracking. Documents without a config version default to "default".

 ### Fixed
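The "simple WHERE clause" access pattern promised in the changelog entry can be sketched locally. A minimal illustration using Python's built-in sqlite3 as a stand-in for Athena — the table contents, document IDs, and version labels here are invented for the demo:

```python
import sqlite3

# Toy stand-in for the document_sections_w2 table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_sections_w2 (document_id TEXT, config_version TEXT)"
)
conn.executemany(
    "INSERT INTO document_sections_w2 VALUES (?, ?)",
    [("doc-1", "v2.0"), ("doc-2", "v2.1"), ("doc-3", "v2.1")],
)

# "Show me W2 documents processed with config v2.1" -- no JOIN required,
# because config_version is stored directly on the table.
rows = conn.execute(
    "SELECT document_id FROM document_sections_w2 "
    "WHERE config_version = 'v2.1' ORDER BY document_id"
).fetchall()
print(rows)  # → [('doc-2',), ('doc-3',)]
```

Because the column lives on every table, the same one-line filter works identically across metering and all evaluation tables.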

docs/reporting-database.md

Lines changed: 67 additions & 0 deletions
@@ -46,6 +46,7 @@ The `document_evaluations` table contains document-level evaluation metrics:
 | false_alarm_rate | double | False alarm rate (0-1) |
 | false_discovery_rate | double | False discovery rate (0-1) |
 | execution_time | double | Time taken to evaluate (seconds) |
+| config_version | string | Configuration version used for processing |

 This table is partitioned by date (YYYY-MM-DD format).

@@ -65,6 +66,7 @@ The `section_evaluations` table contains section-level evaluation metrics:
 | false_alarm_rate | double | Section false alarm rate (0-1) |
 | false_discovery_rate | double | Section false discovery rate (0-1) |
 | evaluation_date | timestamp | When the evaluation was performed |
+| config_version | string | Configuration version used for processing |

 This table is partitioned by date (YYYY-MM-DD format).

@@ -87,6 +89,7 @@ The `attribute_evaluations` table contains attribute-level evaluation metrics:
 | confidence | string | Confidence score from extraction |
 | confidence_threshold | string | Confidence threshold used |
 | evaluation_date | timestamp | When the evaluation was performed |
+| config_version | string | Configuration version used for processing |

 This table is partitioned by date (YYYY-MM-DD format).

@@ -144,6 +147,7 @@ The `metering` table captures detailed usage metrics and cost information for ea
 | unit_cost | double | Cost per unit in USD (e.g., cost per token, cost per page) |
 | estimated_cost | double | Calculated total cost in USD (value × unit_cost) |
 | timestamp | timestamp | When the operation was performed |
+| config_version | string | Configuration version used for processing |

 This table is partitioned by date (YYYY-MM-DD format).

@@ -214,6 +218,7 @@ Document sections are stored in dynamically created tables based on the section
 | section_classification | string | Type/class of the section |
 | section_confidence | double | Confidence score for the section classification |
 | timestamp | timestamp | When the document was processed |
+| config_version | string | Configuration version used for processing |

 **Dynamic Data Columns:**
 The remaining columns are dynamically inferred from the JSON extraction results and vary by section type. Common patterns include:
@@ -502,6 +507,68 @@ GROUP BY
     context
 ORDER BY
     total_cost DESC;
+
+-- Cost analysis by configuration version
+SELECT
+    config_version,
+    COUNT(DISTINCT document_id) as document_count,
+    SUM(estimated_cost) as total_cost,
+    AVG(estimated_cost) as avg_cost_per_record
+FROM
+    metering
+WHERE
+    date >= '2024-01-01'
+GROUP BY
+    config_version
+ORDER BY
+    total_cost DESC;
+```
+
+**Configuration version analysis:**
+```sql
+-- Compare accuracy across configuration versions
+SELECT
+    config_version,
+    AVG(accuracy) as avg_accuracy,
+    AVG(f1_score) as avg_f1_score,
+    COUNT(DISTINCT document_id) as document_count
+FROM
+    document_evaluations
+WHERE
+    date >= '2024-01-01'
+GROUP BY
+    config_version
+ORDER BY
+    avg_f1_score DESC;
+
+-- Filter documents by configuration version
+SELECT
+    document_id,
+    section_classification,
+    timestamp
+FROM
+    document_sections_w2
+WHERE
+    config_version = 'v2.1'
+    AND date >= '2024-01-01'
+LIMIT 100;
+
+-- Cost vs quality by configuration version
+SELECT
+    m.config_version,
+    AVG(e.weighted_overall_score) as avg_quality,
+    SUM(m.estimated_cost) as total_cost,
+    COUNT(DISTINCT m.document_id) as document_count
+FROM
+    document_evaluations e
+JOIN
+    metering m ON e.document_id = m.document_id AND e.config_version = m.config_version
+WHERE
+    e.date >= '2024-01-01'
+GROUP BY
+    m.config_version
+ORDER BY
+    avg_quality DESC;
 ```

 ### Creating Dashboards
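The "cost analysis by configuration version" query added above can be exercised against a toy dataset before pointing it at Athena. A sketch with sqlite3 standing in for Athena — the figures are invented, and Presto/Athena SQL differs from SQLite in details, but the aggregation shape is the same:

```python
import sqlite3

# Minimal in-memory mock of the metering table (invented costs).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metering (document_id TEXT, config_version TEXT, estimated_cost REAL)"
)
conn.executemany(
    "INSERT INTO metering VALUES (?, ?, ?)",
    [
        ("doc-1", "v2.0", 0.5),    # two metering rows for the same document
        ("doc-1", "v2.0", 0.25),
        ("doc-2", "v2.1", 0.125),
    ],
)

# Same shape as the documented cost-analysis query: one row per config version.
result = conn.execute(
    """
    SELECT config_version,
           COUNT(DISTINCT document_id) AS document_count,
           SUM(estimated_cost)         AS total_cost
    FROM metering
    GROUP BY config_version
    ORDER BY total_cost DESC
    """
).fetchall()
print(result)  # → [('v2.0', 1, 0.75), ('v2.1', 1, 0.125)]
```

Note that `COUNT(DISTINCT document_id)` matters: metering stores many rows per document, one per metered operation.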

lib/idp_common_pkg/idp_common/agents/analytics/assets/db_description.md

Lines changed: 5 additions & 5 deletions
@@ -5,22 +5,22 @@ The solution creates several predefined tables in the Glue Data Catalog:

 1. Document Evaluations Table (document_evaluations)
    * Contains document-level evaluation metrics
-   * Columns include: document_id, input_key, evaluation_date, accuracy, precision, recall, f1_score, false_alarm_rate, false_discovery_rate, execution_time
+   * Columns include: document_id, input_key, evaluation_date, accuracy, precision, recall, f1_score, false_alarm_rate, false_discovery_rate, execution_time, config_version
    * Partitioned by date (YYYY-MM-DD format)

 2. Section Evaluations Table (section_evaluations)
    * Contains section-level evaluation metrics
-   * Columns include: document_id, section_id, section_type, accuracy, precision, recall, f1_score, false_alarm_rate, false_discovery_rate, evaluation_date
+   * Columns include: document_id, section_id, section_type, accuracy, precision, recall, f1_score, false_alarm_rate, false_discovery_rate, evaluation_date, config_version
    * Partitioned by date (YYYY-MM-DD format)

 3. Attribute Evaluations Table (attribute_evaluations)
    * Contains attribute-level evaluation metrics
-   * Columns include: document_id, section_id, section_type, attribute_name, expected, actual, matched, score, reason, evaluation_method, confidence, confidence_threshold, evaluation_date
+   * Columns include: document_id, section_id, section_type, attribute_name, expected, actual, matched, score, reason, evaluation_method, confidence, confidence_threshold, evaluation_date, config_version
    * Partitioned by date (YYYY-MM-DD format)

 4. Metering Table (metering)
    * Captures detailed usage metrics for document processing operations
-   * Columns include: document_id, context, service_api, unit, value, number_of_pages, timestamp
+   * Columns include: document_id, context, service_api, unit, value, number_of_pages, timestamp, config_version
    * Partitioned by date (YYYY-MM-DD format)

 5. Rule Validation Summary Table (rule_validation_summary)

@@ -39,7 +39,7 @@ In addition to the predefined tables, the solution also creates dynamic tables f

 * Tables are automatically created by an AWS Glue Crawler based on the section classification
 * Each section type gets its own table (e.g., document_sections_invoice, document_sections_receipt)
-* Common columns include: section_id, document_id, section_classification, section_confidence, timestamp
+* Common columns include: section_id, document_id, section_classification, section_confidence, timestamp, config_version
 * Additional columns are dynamically inferred from the JSON extraction results
 * Tables are partitioned by date (YYYY-MM-DD format)
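A quick way to keep a schema description like this honest is a data-driven check that every analytics table carries the new column. A trivial sketch — the column lists below are abridged to the last two entries of each list in the text above:

```python
# Tail of each table's column list, as described in db_description.md.
tables = {
    "document_evaluations": ["execution_time", "config_version"],
    "section_evaluations": ["evaluation_date", "config_version"],
    "attribute_evaluations": ["evaluation_date", "config_version"],
    "metering": ["timestamp", "config_version"],
    "document_sections_*": ["timestamp", "config_version"],
}

# After this change, no analytics table should be missing config_version.
missing = [name for name, cols in tables.items() if "config_version" not in cols]
print(missing)  # → []
```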

lib/idp_common_pkg/idp_common/agents/analytics/schema_provider.py

Lines changed: 55 additions & 5 deletions
@@ -54,6 +54,7 @@ def get_metering_table_description() -> str:
 - `unit_cost` (double): Cost per unit in USD
 - `estimated_cost` (double): Calculated total cost (value × unit_cost)
 - `timestamp` (timestamp): When the operation was performed
+- `config_version` (string): Configuration version used for processing (defaults to "default" if not specified)

 **Partitioned by**: date (YYYY-MM-DD format)
@@ -89,12 +90,21 @@ def get_metering_table_description() -> str:
 ORDER BY total_cost DESC

 -- Token usage by model
-SELECT "service_api",
+SELECT "service_api",
     SUM(CASE WHEN "unit" = 'inputTokens' THEN "value" ELSE 0 END) as input_tokens,
     SUM(CASE WHEN "unit" = 'outputTokens' THEN "value" ELSE 0 END) as output_tokens
-FROM metering
+FROM metering
 WHERE "unit" IN ('inputTokens', 'outputTokens')
 GROUP BY "service_api"
+
+-- Cost by configuration version
+SELECT "config_version",
+    SUM("estimated_cost") as total_cost,
+    COUNT(DISTINCT "document_id") as document_count,
+    SUM("estimated_cost") / COUNT(DISTINCT "document_id") as avg_cost_per_doc
+FROM metering
+GROUP BY "config_version"
+ORDER BY total_cost DESC
 ```
 """

@@ -187,6 +197,7 @@ def get_evaluation_tables_description() -> str:
 - Use `document_id` to join between all three tables
 - Use `section_id` and `document_id` to join section and attribute evaluations
 - Join with metering table on `document_id` for cost vs accuracy analysis
+- `config_version` is available directly in all evaluation tables (no join needed)

 ### Sample Queries:
 ```sql
@@ -211,13 +222,40 @@ def get_evaluation_tables_description() -> str:
 WHERE "confidence" IS NOT NULL
 GROUP BY confidence_band

--- Cost per accuracy point by document type
+-- Cost per accuracy point by document type
 SELECT se."section_type",
     AVG(se."accuracy") as avg_accuracy,
     SUM(m."estimated_cost") / COUNT(DISTINCT m."document_id") as avg_cost_per_doc
 FROM section_evaluations se
-JOIN metering m ON se."document_id" = m."document_id"
+JOIN metering m ON se."document_id" = m."document_id"
 GROUP BY se."section_type"
+
+-- Filter by config_version (available directly in evaluation tables)
+SELECT "document_id",
+    "accuracy",
+    "f1_score",
+    "config_version"
+FROM document_evaluations
+WHERE "config_version" = 'your-config-version'
+
+-- Compare accuracy across configuration versions
+SELECT "config_version",
+    AVG("accuracy") as avg_accuracy,
+    AVG("f1_score") as avg_f1_score,
+    COUNT(DISTINCT "document_id") as document_count
+FROM document_evaluations
+GROUP BY "config_version"
+ORDER BY avg_f1_score DESC
+
+-- Cost vs quality analysis by config version
+SELECT e."config_version",
+    AVG(e."weighted_overall_score") as avg_quality_score,
+    SUM(m."estimated_cost") as total_cost,
+    SUM(m."estimated_cost") / AVG(e."weighted_overall_score") as cost_per_quality_point
+FROM document_evaluations e
+JOIN metering m ON e."document_id" = m."document_id"
+GROUP BY e."config_version"
+ORDER BY avg_quality_score DESC
 ```
 """

@@ -371,6 +409,7 @@ def get_dynamic_document_sections_description(
     " - `timestamp` (timestamp): When the document was processed\n"
 )
 description += " - `date` (string): Partition key in YYYY-MM-DD format\n"
+description += " - `config_version` (string): Configuration version used for processing\n"
 description += (
     " - Various `metadata.*` columns (strings): Processing metadata\n"
 )
@@ -450,6 +489,16 @@ def get_dynamic_document_sections_description(
 JOIN metering m ON ds."document_id" = m."document_id"
 WHERE ds."document_class.type" = 'W2'
 GROUP BY ds."section_classification", ds."document_class.type"
+
+-- CORRECT: Filter by configuration version
+SELECT "document_id",
+    "document_class.type",
+    "config_version",
+    "timestamp"
+FROM document_sections_w2
+WHERE "config_version" = 'fake_w2'
+    AND date >= '2024-01-01'
+ORDER BY "timestamp" DESC
 ```

 **This schema information is generated from your actual configuration and shows exactly what tables and columns exist in your deployment.**
@@ -757,12 +806,13 @@ def _get_specific_document_sections_table_info(

 #### Standard Columns (present in all document_sections tables):
 - `"document_id"` (string): Unique identifier for the document
-- `"section_id"` (string): Unique identifier for the section
+- `"section_id"` (string): Unique identifier for the section
 - `"section_classification"` (string): Type/class of the document section
 - `"section_confidence"` (string): Confidence score for classification
 - `"explainability_info"` (string): JSON with extraction field confidence scores and geometry
 - `"timestamp"` (timestamp): When document was processed in YYYY-MM-DD hh:mm:ss.ms format
 - `"date"` (string): Partition key in YYYY-MM-DD format
+- `"config_version"` (string): Configuration version used for processing

 #### Columns specific to this table:
 """

lib/idp_common_pkg/idp_common/reporting/save_reporting_data.py

Lines changed: 13 additions & 0 deletions
@@ -599,6 +599,7 @@ def save_evaluation_results(self, document: Document) -> Optional[Dict[str, Any]
             ("correctly_classified_pages", pa.int32()),
             ("correctly_split_without_order", pa.int32()),
             ("correctly_split_with_order", pa.int32()),
+            ("config_version", pa.string()),
         ]
     )

@@ -615,6 +616,7 @@ def save_evaluation_results(self, document: Document) -> Optional[Dict[str, Any]
             ("false_discovery_rate", pa.float64()),
             ("weighted_overall_score", pa.float64()),
             ("evaluation_date", pa.timestamp("ms")),
+            ("config_version", pa.string()),
         ]
     )

@@ -634,6 +636,7 @@ def save_evaluation_results(self, document: Document) -> Optional[Dict[str, Any]
             ("confidence_threshold", pa.string()),
             ("weight", pa.float64()),
             ("evaluation_date", pa.timestamp("ms")),
+            ("config_version", pa.string()),
         ]
     )

@@ -730,6 +733,7 @@ def save_evaluation_results(self, document: Document) -> Optional[Dict[str, Any]
                 if doc_split_metrics
                 else None
             ),
+            "config_version": document.config_version or "default",
         }

         # Save document metrics in Parquet format

@@ -768,6 +772,7 @@ def save_evaluation_results(self, document: Document) -> Optional[Dict[str, Any]
                 "weighted_overall_score", 0.0
             ),
             "evaluation_date": evaluation_date,  # Use document's initial_event_time
+            "config_version": document.config_version or "default",
         }
         section_records.append(section_record)

@@ -804,6 +809,7 @@ def save_evaluation_results(self, document: Document) -> Optional[Dict[str, Any]
             ),
             "weight": weight,  # Explicitly handle None values
             "evaluation_date": evaluation_date,  # Use document's initial_event_time
+            "config_version": document.config_version or "default",
         }
         attribute_records.append(attribute_record)
         logger.debug(
@@ -1292,6 +1298,9 @@ def save_document_sections(self, document: Document) -> Optional[Dict[str, Any]]
             flattened_data["section_classification"] = section.classification
             flattened_data["section_confidence"] = section.confidence
             flattened_data["timestamp"] = timestamp
+            flattened_data["config_version"] = (
+                document.config_version or "default"
+            )

             section_records.append(flattened_data)

@@ -1311,6 +1320,9 @@ def save_document_sections(self, document: Document) -> Optional[Dict[str, Any]]
             )
             flattened_item["section_confidence"] = section.confidence
             flattened_item["record_index"] = i
+            flattened_item["config_version"] = (
+                document.config_version or "default"
+            )

             section_records.append(flattened_item)
         else:

@@ -1321,6 +1333,7 @@ def save_document_sections(self, document: Document) -> Optional[Dict[str, Any]]
                 "section_classification": section.classification,
                 "section_confidence": section.confidence,
                 "value": str(extraction_data),
+                "config_version": document.config_version or "default",
             }
             section_records.append(record)

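The `document.config_version or "default"` fallback used throughout this file can be illustrated in isolation. The real `Document` model lives in idp_common; this dataclass is a hypothetical stand-in for the sketch:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    # Hypothetical stand-in for idp_common's Document model.
    id: str
    config_version: Optional[str] = None


def metering_record(doc: Document) -> dict:
    # Mirrors the fallback in save_reporting_data.py: documents processed
    # without a config version are tagged "default" rather than NULL,
    # so GROUP BY config_version never drops rows.
    return {"document_id": doc.id, "config_version": doc.config_version or "default"}


print(metering_record(Document("doc-1", "v2.1")))
# → {'document_id': 'doc-1', 'config_version': 'v2.1'}
print(metering_record(Document("doc-2")))
# → {'document_id': 'doc-2', 'config_version': 'default'}
```

One caveat with `or`: an empty-string version would also collapse to "default", which appears to be acceptable here since versions are non-empty labels.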
lib/idp_common_pkg/tests/unit/reporting/test_save_reporting_data.py

Lines changed: 43 additions & 0 deletions
@@ -446,3 +446,46 @@ def test_save_with_multiple_data_types(
         mock_save_metering.assert_called_once_with(document_with_sections)
         mock_save_sections.assert_called_once_with(document_with_sections)
         assert len(results) == 2
+
+    def test_config_version_in_schemas_and_records(self, mock_s3_client):
+        """Test that config_version is included in schemas and populated in records."""
+        import pyarrow as pa
+
+        # Test metering schema
+        metering_schema = pa.schema(
+            [
+                ("document_id", pa.string()),
+                ("context", pa.string()),
+                ("service_api", pa.string()),
+                ("unit", pa.string()),
+                ("value", pa.float64()),
+                ("number_of_pages", pa.int32()),
+                ("unit_cost", pa.float64()),
+                ("estimated_cost", pa.float64()),
+                ("timestamp", pa.timestamp("ms")),
+                ("config_version", pa.string()),
+            ]
+        )
+        assert "config_version" in [field.name for field in metering_schema]
+
+        # Test document evaluation schema (check in save_evaluation_results source)
+        # Create a test document with config_version
+        doc_with_config = Document(
+            id="test-doc", input_key="test/doc.pdf", config_version="test-v1.0"
+        )
+        doc_without_config = Document(id="test-doc2", input_key="test/doc2.pdf")
+
+        # Verify config_version fallback behavior
+        assert doc_with_config.config_version == "test-v1.0"
+        assert doc_without_config.config_version is None
+
+        # Simulate record creation with fallback
+        record_with_config = {
+            "config_version": doc_with_config.config_version or "default"
+        }
+        record_without_config = {
+            "config_version": doc_without_config.config_version or "default"
+        }
+
+        assert record_with_config["config_version"] == "test-v1.0"
+        assert record_without_config["config_version"] == "default"
