
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0

Evaluation Service

The Evaluation Service component provides functionality to evaluate document extraction results by comparing extracted attributes against expected values.

Backend Integration

The evaluation service uses Stickler as its backend evaluation engine. Stickler is an AWS open-source library that provides sophisticated comparison algorithms and flexible configuration options. The IDP evaluation service provides an abstraction layer through SticklerConfigMapper that:

  • Translates IDP evaluation extensions (x-aws-idp-evaluation-*) to Stickler format
  • Maintains backend-agnostic configuration in IDP
  • Enables seamless integration with Stickler's advanced evaluation capabilities
  • Tracks Stickler version information for compatibility and debugging

For version information and features available in the current Stickler integration, see stickler_version.py.

Features

  • Compares document extraction results with expected (ground truth) results
  • Supports multiple evaluation methods:
    • Exact match - Character-for-character comparison after normalizing whitespace and punctuation
    • Numeric exact match - Value-based comparison after normalizing numeric formats
    • Fuzzy string matching - Similarity-based matching with configurable thresholds
    • Hungarian algorithm - Optimal matching for lists of values
    • Semantic similarity - Meaning-based comparison using Bedrock Titan embeddings
    • LLM-based semantic evaluation - Advanced meaning comparison with explanation using Bedrock models
  • Smart attribute discovery and evaluation:
    • Automatically discovers attributes in the extraction results not defined in the configuration
    • Handles attributes found only in expected data, only in actual data, or in both
    • Applies default comparison method (LLM) for unconfigured attributes with clear indication
  • Assessment Confidence Integration:
    • Automatically extracts and displays confidence scores from assessment results
    • Shows the extraction confidence score for each evaluated attribute
    • Integrates with explainability_info from the assessment feature
    • Provides insights into data quality for both baseline and extraction results
  • Calculates key metrics including:
    • Precision, Recall, and F1 score
    • Accuracy and Error rates
    • False alarm rate and False discovery rate
  • Generates rich, visual evaluation reports with:
    • Color-coded status indicators
    • Performance ratings
    • Progress bar visualizations
    • Detailed attribute comparisons
    • Confidence score columns for quality analysis
  • Supports both JSON and Markdown report formats
  • Fully integrated with the Document model architecture
  • Document Split Classification Metrics:
    • Evaluates document splitting and classification accuracy
    • Calculates page-level classification accuracy
    • Measures split accuracy (with and without page order consideration)
    • Provides detailed per-page and per-section analysis
    • Generates comprehensive markdown reports with visual indicators

Usage

from idp_common.models import Document, Status
from idp_common import ocr, classification, extraction, evaluation

# Get configuration (with evaluation methods specified for all attribute types)
config = {
    "evaluation": {
        "llm_method": {
            "model": "anthropic.claude-3-sonnet-20240229-v1:0",
            "temperature": 0.0,
            "top_k": 5,
            "system_prompt": "You are an evaluator that helps determine if the predicted and expected values match...",
            "task_prompt": "I need to evaluate attribute extraction for a document of class: {DOCUMENT_CLASS}..."
        }
    },
    "classes": [
        {
            "name": "Bank Statement",
            "attributes": [
                # Simple Attributes
                {
                    "name": "Account Number",
                    "description": "Primary account identifier",
                    "attributeType": "simple",  # or omit for default
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "Statement Period",
                    "description": "Statement period (e.g., January 2024)",
                    "attributeType": "simple",
                    "evaluation_method": "FUZZY",
                    "evaluation_threshold": 0.8
                },
                
                # Group Attributes - nested object structures
                {
                    "name": "Account Holder Address",
                    "description": "Complete address information for the account holder",
                    "attributeType": "group",
                    "groupAttributes": [
                        {
                            "name": "Street Number",
                            "description": "House or building number",
                            "evaluation_method": "FUZZY",
                            "evaluation_threshold": 0.9
                        },
                        {
                            "name": "Street Name",
                            "description": "Name of the street",
                            "evaluation_method": "FUZZY",
                            "evaluation_threshold": 0.8
                        },
                        {
                            "name": "City",
                            "description": "City name",
                            "evaluation_method": "FUZZY",
                            "evaluation_threshold": 0.9
                        },
                        {
                            "name": "State",
                            "description": "State abbreviation",
                            "evaluation_method": "EXACT"
                        },
                        {
                            "name": "ZIP Code",
                            "description": "Postal code",
                            "evaluation_method": "EXACT"
                        }
                    ]
                },
                
                # List Attributes - arrays of items with consistent structure
                {
                    "name": "Transactions",
                    "description": "List of all transactions in the statement period",
                    "attributeType": "list",
                    "listItemTemplate": {
                        "itemDescription": "Individual transaction record",
                        "itemAttributes": [
                            {
                                "name": "Date",
                                "description": "Transaction date",
                                "evaluation_method": "FUZZY",
                                "evaluation_threshold": 0.9
                            },
                            {
                                "name": "Description",
                                "description": "Transaction description or merchant name",
                                "evaluation_method": "SEMANTIC",
                                "evaluation_threshold": 0.7
                            },
                            {
                                "name": "Amount",
                                "description": "Transaction amount",
                                "evaluation_method": "NUMERIC_EXACT"
                            }
                        ]
                    }
                }
            ]
        }
    ]
}

# Create evaluation service
evaluation_service = evaluation.EvaluationService(config=config)

# Evaluate documents (stores results in S3 by default)
result_document = evaluation_service.evaluate_document(
    actual_document=processed_document,
    expected_document=expected_document
)

# Access evaluation report URI
evaluation_report_uri = result_document.evaluation_report_uri

# You can also access the evaluation result directly
evaluation_result = result_document.evaluation_result
overall_metrics = evaluation_result.overall_metrics
section_results = evaluation_result.section_results

# Or skip storage if needed (for quick memory-only evaluations)
memory_only_document = evaluation_service.evaluate_document(
    actual_document=processed_document,
    expected_document=expected_document,
    store_results=False
)

Evaluation Methods

The service supports multiple evaluation methods that can be configured for each attribute:

  • EXACT: Exact string match (after normalizing whitespace and punctuation)
  • NUMERIC_EXACT: Exact match for numeric values (after normalizing currency symbols)
  • FUZZY: Fuzzy string matching with configurable evaluation_threshold
  • HUNGARIAN: Optimal matching for lists of values using the Hungarian algorithm with configurable comparator types:
    • EXACT: Default comparator for exact string matching (after normalization)
    • FUZZY: Fuzzy string matching with configurable threshold
    • NUMERIC: Numeric comparison after normalizing currency symbols and formats
  • SEMANTIC: Efficient semantic similarity comparison using Bedrock Titan embeddings (amazon.titan-embed-text-v1)
  • LLM: LLM-based evaluation using Bedrock models (Claude or Titan) for semantically comparable values with detailed explanations
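As an illustration of how a FUZZY comparison with a threshold can work, here is a minimal sketch using Python's standard-library difflib. This is not the service's actual implementation, and the normalization rules shown (lowercasing, stripping punctuation, collapsing whitespace) are assumptions for illustration:

```python
import re
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    value = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(value.split())

def fuzzy_match(expected: str, actual: str, threshold: float = 0.8):
    """Return (matched, score): similarity ratio compared against the threshold."""
    score = SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()
    return score >= threshold, score
```

With this shape, `evaluation_threshold` simply sets the minimum similarity ratio a pair of normalized strings must reach to count as a match.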

Semantic vs LLM Evaluation

The service offers two approaches for semantic evaluation:

  • SEMANTIC Method: Uses embedding-based comparison with Bedrock Titan embeddings

    • Faster and more cost-effective than LLM-based evaluation
    • Provides similarity scores without explanations
    • Great for high-volume comparisons where speed is important
    • Configurable threshold for matching sensitivity
  • LLM Method: Uses Bedrock Claude or other LLM models

    • Provides detailed reasoning for why values match or don't match
    • Better at handling implicit/explicit information differences
    • More nuanced understanding of semantic equivalence
    • Ideal for cases where understanding the rationale is important
    • Used as the default method for attributes discovered in the data but not in the configuration
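Conceptually, the SEMANTIC method reduces to comparing embedding vectors. The sketch below shows only the cosine-similarity step; the embedding call itself (a Bedrock invocation of the Titan model) is elided, and the function names here are illustrative, not the service's API:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_match(emb_expected, emb_actual, threshold: float = 0.7) -> bool:
    """Match when embedding similarity clears the configured threshold."""
    return cosine_similarity(emb_expected, emb_actual) >= threshold
```

This is why SEMANTIC is cheap per comparison (one dot product after two embedding calls) but yields only a score, whereas the LLM method can also produce an explanation.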

Output

The evaluation produces:

  1. JSON Results: Detailed evaluation results with metrics
  2. Markdown Report: Human-readable report with tables and summaries

Metrics

The evaluation calculates the following metrics:

  • Precision: Accuracy of positive predictions (TP / (TP + FP))
  • Recall: Coverage of actual positive cases (TP / (TP + FN))
  • F1 Score: Harmonic mean of precision and recall
  • Accuracy: Overall correctness (TP + TN) / (TP + TN + FP + FN)
  • False Alarm Rate (FAR): Rate of false positives among negatives (FP / (FP + TN))
    • Measures how often the system extracts information that wasn't present in the document
  • False Discovery Rate (FDR): Rate of false positives among positive predictions (FP / (FP + TP))
    • Measures what proportion of the extracted information is incorrect

These metrics are calculated at the attribute level (per field), the section level (per document class), and the document level (overall performance).
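The formulas above can be sketched directly from confusion-matrix counts (illustrative only; the function and key names are assumptions, not the service's API):

```python
def compute_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Evaluation metrics from confusion-matrix counts, guarding zero denominators."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "false_alarm_rate": ratio(fp, fp + tn),
        "false_discovery_rate": ratio(fp, fp + tp),
    }
```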

Visual Reporting

The evaluation module produces richly formatted Markdown reports with:

  1. Summary Dashboard:

    • Overall match rate with visual progress bar
    • Color-coded indicators for key metrics (🟢 Excellent, 🟡 Good, 🟠 Fair, 🔴 Poor)
    • Fraction of matched attributes (e.g., 8/10 attributes matched)
  2. Performance Tables:

    • Metrics tables with value ratings
    • First-column status indicators (✅/❌) for immediate identification of matches
    • Detailed attribution of evaluation methods used for each field, including:
      • Method types (EXACT, FUZZY, HUNGARIAN, etc.)
      • Thresholds for fuzzy and semantic matching methods
      • Comparator types for the Hungarian method
      • Combined display for HUNGARIAN with FUZZY comparator showing both comparator type and threshold
  3. Method Explanations:

    • Clear documentation of evaluation methods
    • Descriptions of scoring mechanisms
    • Guidance on interpreting results
    • Indications for attributes that were discovered in the data but not in the configuration

Examples of method display in reports:

  • EXACT - Simple exact matching
  • FUZZY (threshold: 0.8) - Fuzzy matching with threshold
  • HUNGARIAN (comparator: EXACT) - Hungarian algorithm with exact matching
  • HUNGARIAN (comparator: FUZZY, threshold: 0.7) - Hungarian with fuzzy matching and threshold
  • HUNGARIAN (comparator: NUMERIC) - Hungarian with numeric comparison

The reports are designed to provide both at-a-glance performance assessment and detailed diagnostic information.

Auto-Discovery of Attributes

The EvaluationService can automatically discover and evaluate attributes that exist in the data but are not defined in the configuration:

# Sample extracted data may have more attributes than configured
actual_results = {
    "invoice_number": "INV-12345",          # In configuration
    "amount_due": 1250.00,                  # In configuration
    "issue_date": "2023-01-15",             # Not in configuration
    "due_date": "2023-02-15"                # Not in configuration
}

expected_results = {
    "invoice_number": "INV-12345",          # In configuration
    "amount_due": "$1,250.00",              # In configuration 
    "issue_date": "01/15/2023",             # Not in configuration
    "reference_number": "REF-98765"         # Not in configuration, missing in actual
}

# The service will:
# 1. Evaluate invoice_number and amount_due using methods in configuration
# 2. Discover issue_date (in both) and evaluate using LLM (default method)
# 3. Discover due_date (only in actual) and evaluate as not matched
# 4. Discover reference_number (only in expected) and evaluate as not matched
# 5. Add "[Default method - attribute not specified in the configuration]" to reason for discovered attributes

This capability is particularly useful for:

  • Exploratory evaluation when the complete schema is not yet defined
  • Handling variations in extraction outputs that may contain additional information
  • Identifying potential new attributes to add to the configuration
  • Ensuring all extracted data is evaluated, even without explicit configuration
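The discovery step boils down to set operations over attribute names. A minimal sketch, under the assumption that attributes are flat dictionary keys (the real service also handles nested structures):

```python
def discover_attributes(configured: set, expected: dict, actual: dict) -> dict:
    """Partition attribute names into configured vs. discovered buckets."""
    all_keys = set(expected) | set(actual)
    discovered = all_keys - configured
    return {
        "configured": sorted(all_keys & configured),
        "discovered_in_both": sorted(discovered & set(expected) & set(actual)),
        "only_in_actual": sorted(discovered & (set(actual) - set(expected))),
        "only_in_expected": sorted(discovered & (set(expected) - set(actual))),
    }
```

Attributes in the "discovered" buckets would then be routed to the default LLM method (when present in both) or scored as unmatched (when present on only one side).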

Assessment Confidence Integration

The evaluation service automatically integrates with the assessment feature to display confidence scores alongside evaluation results. When extraction results include explainability_info (generated by the assessment feature), the confidence scores are automatically extracted and displayed in both JSON and Markdown reports.

Confidence Score Types

  • Confidence: Confidence score for the extraction results being evaluated

Enhanced Report Format

JSON Output with Confidence

{
  "attributes": [
    {
      "name": "invoice_number",
      "expected": "INV-2024-001",
      "actual": "INV-2024-001",
      "matched": true,
      "score": 1.0,
      "confidence": 0.92,
      "evaluation_method": "EXACT"
    }
  ]
}

Markdown Table with Confidence

| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
| :----: | --------- | -------- | ------ | :---------------: | ----- | ------ | ------ |
| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.92 | 1.00 | EXACT | Exact match |
| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.75 | 0.00 | EXACT | Values do not match |

Quality Analysis Benefits

Confidence scores provide additional insights for evaluation analysis:

  1. Extraction Quality Assessment: Low confidence highlights extraction results needing review
  2. Confidence-Accuracy Correlation: Compare confidence levels with evaluation accuracy to identify patterns
  3. Quality Prioritization: Focus improvement efforts on low-confidence, low-accuracy results

Backward Compatibility

The confidence integration is fully backward compatible:

  • Reports without assessment data show "N/A" for confidence columns
  • Evaluation logic remains unchanged when confidence data is absent
  • Existing evaluation workflows continue to work without modification

Nested Structure Support

The evaluation service fully supports nested document structures including group attributes and list attributes. The service automatically processes these complex structures by flattening them into individual evaluable fields while preserving the configured evaluation methods.

Attribute Types and Processing

Simple Attributes

Basic single-value extractions that are evaluated directly:

# Configuration
{
    "name": "Account Number",
    "attributeType": "simple",
    "evaluation_method": "EXACT"
}

# Flattened attribute name: "Account Number"
# Evaluation: Direct comparison using EXACT method

Group Attributes

Nested object structures where each sub-attribute is evaluated individually:

# Configuration
{
    "name": "Account Holder Address",
    "attributeType": "group",
    "groupAttributes": [
        {
            "name": "Street Number",
            "evaluation_method": "FUZZY",
            "evaluation_threshold": 0.9
        },
        {
            "name": "City",
            "evaluation_method": "FUZZY", 
            "evaluation_threshold": 0.9
        }
    ]
}

# Flattened attribute names:
# - "Account Holder Address.Street Number" (FUZZY evaluation)
# - "Account Holder Address.City" (FUZZY evaluation)

List Attributes

Arrays of items where each item's attributes are evaluated individually:

# Configuration
{
    "name": "Transactions",
    "attributeType": "list",
    "listItemTemplate": {
        "itemAttributes": [
            {
                "name": "Date",
                "evaluation_method": "FUZZY",
                "evaluation_threshold": 0.9
            },
            {
                "name": "Amount",
                "evaluation_method": "NUMERIC_EXACT"
            }
        ]
    }
}

# Flattened attribute names for each transaction:
# - "Transactions[0].Date" (FUZZY evaluation)
# - "Transactions[0].Amount" (NUMERIC_EXACT evaluation)
# - "Transactions[1].Date" (FUZZY evaluation)
# - "Transactions[1].Amount" (NUMERIC_EXACT evaluation)
# - And so on for each transaction...

Data Flattening Process

The evaluation service automatically flattens nested extraction results for comparison:

Input Data (Nested)

{
  "Account Number": "1234567890",
  "Account Holder Address": {
    "Street Number": "123",
    "Street Name": "Main St",
    "City": "Seattle",
    "State": "WA"
  },
  "Transactions": [
    {
      "Date": "01/15/2024",
      "Description": "Coffee Shop",
      "Amount": "-4.50"
    },
    {
      "Date": "01/16/2024", 
      "Description": "ATM Withdrawal",
      "Amount": "-20.00"
    }
  ]
}

Flattened Data (For Evaluation)

{
  "Account Number": "1234567890",
  "Account Holder Address.Street Number": "123",
  "Account Holder Address.Street Name": "Main St", 
  "Account Holder Address.City": "Seattle",
  "Account Holder Address.State": "WA",
  "Transactions[0].Date": "01/15/2024",
  "Transactions[0].Description": "Coffee Shop",
  "Transactions[0].Amount": "-4.50",
  "Transactions[1].Date": "01/16/2024",
  "Transactions[1].Description": "ATM Withdrawal", 
  "Transactions[1].Amount": "-20.00"
}
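A minimal sketch of such a flattening pass, producing the dot and `[index]` notation shown above (illustrative; the service's internal implementation may differ):

```python
def flatten(data, prefix: str = "") -> dict:
    """Flatten nested dicts/lists into dot and [index] notation keys."""
    flat = {}
    if isinstance(data, dict):
        for key, value in data.items():
            name = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(value, name))
    elif isinstance(data, list):
        for i, item in enumerate(data):
            flat.update(flatten(item, f"{prefix}[{i}]"))
    else:
        flat[prefix] = data
    return flat
```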

Evaluation Results for Nested Structures

The evaluation service provides detailed results for all flattened attributes:

Sample Evaluation Output

{
  "attributes": [
    {
      "name": "Account Number",
      "expected": "1234567890",
      "actual": "1234567890", 
      "matched": true,
      "score": 1.0,
      "confidence": 0.95,
      "evaluation_method": "EXACT"
    },
    {
      "name": "Account Holder Address.City",
      "expected": "Seattle",
      "actual": "Seattle",
      "matched": true,
      "score": 1.0,
      "confidence": 0.88,
      "evaluation_method": "FUZZY",
      "evaluation_threshold": 0.9
    },
    {
      "name": "Transactions[0].Amount",
      "expected": "-4.50",
      "actual": "-4.50",
      "matched": true,
      "score": 1.0,
      "confidence": 0.92,
      "evaluation_method": "NUMERIC_EXACT"
    },
    {
      "name": "Transactions[1].Description", 
      "expected": "ATM Withdrawal",
      "actual": "ATM Cash",
      "matched": true,
      "score": 0.85,
      "confidence": 0.87,
      "evaluation_method": "SEMANTIC",
      "evaluation_threshold": 0.7
    }
  ]
}

Markdown Report for Nested Structures

| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
| :----: | --------- | -------- | ------ | :--------: | ----- | ------ | ------ |
| ✅ | Account Number | 1234567890 | 1234567890 | 0.95 | 1.00 | EXACT | Exact match |
| ✅ | Account Holder Address.Street Number | 123 | 123 | 0.95 | 1.00 | FUZZY (threshold: 0.9) | Exact match |
| ✅ | Account Holder Address.City | Seattle | Seattle | 0.88 | 1.00 | FUZZY (threshold: 0.9) | Exact match |
| ❌ | Account Holder Address.State | WA | Washington | 0.82 | 0.00 | EXACT | Values do not match exactly |
| ✅ | Transactions[0].Date | 01/15/2024 | 01/15/2024 | 0.94 | 1.00 | FUZZY (threshold: 0.9) | Exact match |
| ✅ | Transactions[0].Amount | -4.50 | -4.50 | 0.92 | 1.00 | NUMERIC_EXACT | Exact numeric match |
| ✅ | Transactions[1].Description | ATM Withdrawal | ATM Cash | 0.87 | 0.85 | SEMANTIC (threshold: 0.7) | Semantically similar |

Benefits of Nested Structure Support

  1. Granular Analysis: Individual evaluation of each nested field provides precise insights
  2. Flexible Configuration: Different evaluation methods can be applied to different parts of nested structures
  3. Comprehensive Coverage: All attributes in complex documents are evaluated, regardless of nesting level
  4. Pattern Recognition: Identify consistent issues with specific nested attributes (e.g., address parsing problems)
  5. Scalable Processing: Handles documents with varying numbers of list items efficiently
  6. Detailed Reporting: Clear attribution of evaluation results to specific nested fields

Use Cases for Nested Evaluation

  • Bank Statements: Evaluate account details (group) and individual transactions (list)
  • Invoices: Evaluate vendor information (group) and line items (list)
  • Medical Records: Evaluate patient information (group) and procedures/medications (lists)
  • Legal Documents: Evaluate parties (group) and clauses/terms (lists)
  • Financial Reports: Evaluate company info (group) and financial line items (lists)

The nested structure support enables comprehensive evaluation of complex documents while maintaining the flexibility to apply appropriate evaluation methods to each type of data within the document.

Document Split Classification Metrics

The evaluation service provides specialized metrics for evaluating document splitting and classification accuracy. This feature is particularly useful for assessing how well the system:

  • Classifies individual pages
  • Groups pages into document sections
  • Maintains correct page order within sections

Overview

DocSplitClassificationMetrics evaluates three types of accuracy:

  1. Page Level Accuracy: Classification accuracy for individual pages
  2. Split Accuracy (Without Order): Correct page grouping regardless of order
  3. Split Accuracy (With Order): Correct page grouping with exact order

Usage

from idp_common.evaluation.doc_split_classification_metrics import DocSplitClassificationMetrics

# Initialize calculator
doc_split_calculator = DocSplitClassificationMetrics()

# Load ground truth and predicted sections
doc_split_calculator.load_sections(
    ground_truth_sections=expected_document.sections,
    predicted_sections=actual_document.sections
)

# Calculate all metrics
metrics = doc_split_calculator.calculate_all_metrics()

# Generate markdown report
report = doc_split_calculator.generate_markdown_report(metrics)

# Access individual metric types
page_level = metrics["page_level_accuracy"]
split_no_order = metrics["split_accuracy_without_order"]
split_with_order = metrics["split_accuracy_with_order"]

Metrics Explained

1. Page Level Accuracy

Evaluates classification accuracy for individual pages by comparing the document_class assigned to each page index.

Calculation:

  • For each page index in ground truth or predicted data
  • Check if the predicted document_class matches the ground truth document_class
  • Calculate: correct_pages / total_pages

Use Case: Determine if the classification model correctly identifies document types at the page level.

Example:

page_level = {
    "accuracy": 0.95,
    "total_pages": 20,
    "correct_pages": 19,
    "page_details": [
        {
            "page_index": 0,
            "ground_truth_class": "Invoice",
            "predicted_class": "Invoice",
            "correct": True
        },
        {
            "page_index": 5,
            "ground_truth_class": "W2",
            "predicted_class": "Receipt",
            "correct": False
        }
    ]
}
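The calculation itself is a straightforward per-page comparison. A sketch, assuming pages are represented as an index-to-class mapping (field names follow the example above, but this is not the class's internal code):

```python
def page_level_accuracy(ground_truth: dict, predicted: dict) -> dict:
    """Per-page classification accuracy over the union of page indices."""
    pages = sorted(set(ground_truth) | set(predicted))
    details = []
    for idx in pages:
        gt = ground_truth.get(idx)
        pred = predicted.get(idx)
        details.append({
            "page_index": idx,
            "ground_truth_class": gt,
            "predicted_class": pred,
            "correct": gt is not None and gt == pred,
        })
    correct = sum(d["correct"] for d in details)
    return {
        "accuracy": correct / len(pages) if pages else 0.0,
        "total_pages": len(pages),
        "correct_pages": correct,
        "page_details": details,
    }
```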

2. Split Accuracy (Without Order)

Evaluates whether the system correctly groups pages into sections with the right document class, regardless of page order.

Calculation:

  • For each ground truth section
  • Find a predicted section with:
    • Same set of page indices (as a set, order doesn't matter)
    • Same document_class
  • Calculate: matched_sections / total_ground_truth_sections

Use Case: Assess if the system correctly identifies which pages belong together, even if the order is different.

Example:

split_no_order = {
    "accuracy": 0.90,
    "total_sections": 10,
    "correct_sections": 9,
    "section_details": [
        {
            "section_id": "section_1",
            "ground_truth_class": "Invoice",
            "ground_truth_pages": [0, 1, 2],
            "matched": True,
            "predicted_class": "Invoice",
            "predicted_pages": [2, 0, 1]  # Different order, but same pages
        }
    ]
}

3. Split Accuracy (With Order)

Evaluates whether the system correctly groups pages into sections with the right document class, including exact page order.

Calculation:

  • For each ground truth section
  • Find a predicted section with:
    • Exact same page indices list (same order)
    • Same document_class
  • Calculate: matched_sections / total_ground_truth_sections

Use Case: Assess if the system maintains the correct page sequence within document sections.

Example:

split_with_order = {
    "accuracy": 0.85,
    "total_sections": 10,
    "correct_sections": 8,  # Lower than split_no_order due to order requirement
    "section_details": [
        {
            "section_id": "section_1",
            "ground_truth_class": "Invoice",
            "ground_truth_pages": [0, 1, 2],
            "matched": True,
            "order_matched": True,
            "predicted_class": "Invoice",
            "predicted_pages": [0, 1, 2]  # Exact match including order
        },
        {
            "section_id": "section_2",
            "ground_truth_class": "W2",
            "ground_truth_pages": [3, 4],
            "matched": False,
            "order_matched": False,
            "predicted_class": "W2",
            "predicted_pages": [4, 3]  # Wrong order
        }
    ]
}
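Both split-accuracy variants differ only in how a section's page list is compared. A sketch where each section is a `(document_class, page_indices)` pair (an assumed representation for illustration, not the class's internal data model):

```python
def split_accuracy(gt_sections, pred_sections, require_order: bool) -> float:
    """Fraction of ground-truth sections with a matching predicted section.

    Without order: same page set and same class.
    With order: identical page list (same sequence) and same class.
    """
    matched = 0
    for gt_class, gt_pages in gt_sections:
        for pred_class, pred_pages in pred_sections:
            same_pages = (gt_pages == pred_pages if require_order
                          else set(gt_pages) == set(pred_pages))
            if gt_class == pred_class and same_pages:
                matched += 1
                break
    return matched / len(gt_sections) if gt_sections else 0.0
```

Note how the W2 example above ([3, 4] vs [4, 3]) matches without order but fails with order, which is why the with-order score is the stricter of the two.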

Visual Reporting

The generate_markdown_report() method creates comprehensive visual reports with:

Summary Dashboard

## 🎯 Split Classification Summary

- **Page Level Accuracy**: 🟢 19/20 pages [███████████████████░] 95%
- **Split Accuracy (Without Order)**: 🟢 9/10 sections [██████████████████░░] 90%
- **Split Accuracy (With Order)**: 🟡 8/10 sections [████████████████░░░░] 80%

Metrics Table

| Metric | Accuracy | Rating | Correct/Total |
| ------ | :------: | :----: | :-----------: |
| Page Level Classification | 0.9500 | 🟢 Excellent | 19/20 pages |
| Document Split (Without Page Order) | 0.9000 | 🟢 Excellent | 9/10 sections |
| Document Split (With Page Order) | 0.8000 | 🟡 Good | 8/10 sections |

Combined Section Analysis

| Section Match | Page Order Match | Section ID | Expected Class | Expected Pages | Pred Class | Pred Pages | Matched Section |
| :-----------: | :--------------: | ---------- | -------------- | -------------- | ---------- | ---------- | --------------- |
| ✅ | ✅ | section_1 | Invoice | [0, 1, 2] | Invoice | [0, 1, 2] | pred_section_1 |
| ✅ | ❌ | section_2 | W2 | [3, 4] | W2 | [4, 3] | pred_section_2 |
| ❌ | ❌ | section_3 | Receipt | [5, 6] | Invoice | [5] | N/A |
| ❌ | ❌ | N/A | No Match | N/A | Receipt | [6, 7] | pred_section_4 |

Column Definitions:

  • Section Match: ✅ if pages match as a set with same class (order independent)
  • Page Order Match: ✅ if Section Match is true AND page order matches exactly
  • Matched Section: ID of the predicted section that corresponds to ground truth
  • Unmatched Predicted Sections: Rows with "N/A" for ground truth indicate over-segmentation

Color-Coded Ratings

The reports use visual indicators for quick assessment:

  • 🟢 Excellent (≥ 90% accuracy)
  • 🟡 Good (70-89% accuracy)
  • 🟠 Fair (50-69% accuracy)
  • 🔴 Poor (< 50% accuracy)
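These thresholds translate into a simple mapping; a sketch of the rating and progress-bar rendering (illustrative helpers, not the library's actual functions):

```python
def rating(accuracy: float) -> str:
    """Map an accuracy value to the report's color-coded rating."""
    if accuracy >= 0.90:
        return "🟢 Excellent"
    if accuracy >= 0.70:
        return "🟡 Good"
    if accuracy >= 0.50:
        return "🟠 Fair"
    return "🔴 Poor"

def progress_bar(accuracy: float, width: int = 20) -> str:
    """Render the filled/empty bar used in the summary dashboard."""
    filled = round(accuracy * width)
    return "[" + "█" * filled + "░" * (width - filled) + f"] {accuracy:.0%}"
```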

Integration with Evaluation Service

Document split classification metrics are automatically calculated during document evaluation when both ground truth and predicted sections are available:

# Automatic integration during evaluation
result_document = evaluation_service.evaluate_document(
    actual_document=processed_document,
    expected_document=expected_document
)

# Split classification results are included in evaluation output
if result_document.evaluation_result:
    split_metrics = result_document.evaluation_result.doc_split_metrics
    # Access page-level, split accuracy, and detailed analysis

Use Cases

  1. Model Validation: Assess classification model performance at page and document levels
  2. System Tuning: Compare different splitting algorithms or thresholds
  3. Quality Assurance: Identify systematic issues in document segmentation
  4. A/B Testing: Compare performance of different classification approaches
  5. Continuous Monitoring: Track classification accuracy over time

Best Practices

  1. Prepare Ground Truth: Ensure ground truth sections have accurate:

    • document_class assignments
    • page_indices lists
    • Consistent section identifiers
  2. Interpret Metrics Together:

    • High page-level accuracy but low split accuracy → Correct classification but poor grouping
    • Low page-level accuracy → Review classification model
    • High split accuracy without order but low with order → Page sequencing issues
  3. Use Detailed Analysis: Review page_details and section_details to identify specific problem areas

  4. Monitor Over Time: Track metrics across multiple evaluation runs to detect regression

Error Handling

The calculator gracefully handles missing or malformed data:

  • Missing extraction_result_uri: Section skipped with warning
  • Invalid page indices: Empty list with warning
  • Missing document_class: Recorded as "Unknown"
  • Errors logged in metrics["errors"] array

Example: Complete Workflow

from idp_common.evaluation.doc_split_classification_metrics import DocSplitClassificationMetrics

# Initialize
calculator = DocSplitClassificationMetrics()

# Load sections (from Document objects)
calculator.load_sections(
    ground_truth_sections=baseline_document.sections,
    predicted_sections=processed_document.sections
)

# Calculate all metrics
all_metrics = calculator.calculate_all_metrics()

# Generate and save report
report = calculator.generate_markdown_report(all_metrics)
with open("split_classification_report.md", "w") as f:
    f.write(report)

# Access specific metrics for analysis
page_acc = all_metrics["page_level_accuracy"]["accuracy"]
split_acc_no_order = all_metrics["split_accuracy_without_order"]["accuracy"]
split_acc_with_order = all_metrics["split_accuracy_with_order"]["accuracy"]

print(f"Page Classification: {page_acc:.2%}")
print(f"Split Accuracy (no order): {split_acc_no_order:.2%}")
print(f"Split Accuracy (with order): {split_acc_with_order:.2%}")

# Check for errors
if all_metrics.get("errors"):
    print(f"\nWarnings/Errors: {len(all_metrics['errors'])}")
    for error in all_metrics["errors"]:
        print(f"  - {error}")