Classification confidence hardcoded to 1.0 and prompts lack schema field names #262

@joaothomaz-built

Summary

Two issues in ClassificationService reduce classification accuracy and prevent downstream consumers from identifying uncertain matches for human review.

Both affect Pattern 2 (and the shared idp_common library used across patterns).

Issue 1: Bedrock classification confidence is hardcoded to 1.0

Bedrock models return meaningful confidence scores in their classification responses, but classify_page_bedrock discards them and hardcodes confidence=1.0:

lib/idp_common_pkg/idp_common/classification/service.py ~line 1337:

return PageClassification(
    page_id=page_id,
    classification=DocumentClassification(
        doc_type=doc_type,
        confidence=1.0,  # Default confidence
        metadata={...},
    ),
)

The same hardcoded value appears on several other paths in the same file:

| Path | Approx. line | Confidence value |
| --- | --- | --- |
| Bedrock classification | ~1337 | 1.0 |
| SageMaker/UDOP | ~1433 | 1.0 |
| Holistic/packet classification (segment pages) | ~2601 | 1.0 |
| Regex shortcut | ~1170 | 1.0 |
| Max-pages fallback | ~326 | 1.0 |
| Single-class config | ~2500 | 1.0 |
| Section splitting disabled | ~2080 | 1.0 |

The Section model also defaults to 1.0:

# lib/idp_common_pkg/idp_common/models.py ~line 57
classification: str
confidence: float = 1.0

Impact: Consumers of the classification output (UIs, evaluation pipelines, downstream automation) cannot distinguish high-confidence from low-confidence classifications. This prevents flagging uncertain sections for human review, which is critical for document processing accuracy.

Suggestion: Parse and propagate the confidence value from Bedrock's response. For paths where the model genuinely doesn't return confidence (SageMaker, regex shortcuts), the 1.0 default is reasonable, but it should be distinguished from real model confidence (e.g., via a confidence_source metadata field).
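A minimal sketch of what this suggestion could look like, not the library's actual code. It assumes the classification prompt asks the model to reply with a JSON object containing "class" and "confidence" keys; that reply shape, and the helper names parse_confidence and extract_classification, are hypothetical. Only the confidence_source metadata key comes from the suggestion above.

```python
import json
import math
from typing import Any, Dict, Optional, Tuple


def parse_confidence(raw: Any) -> Optional[float]:
    """Return a confidence clamped to [0.0, 1.0], or None if raw is unusable."""
    if isinstance(raw, bool):  # bool is a subclass of int; reject it explicitly
        return None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(value):
        return None
    return min(1.0, max(0.0, value))


def extract_classification(response_text: str) -> Tuple[str, float, Dict[str, str]]:
    """Parse a reply such as {"class": "Invoice", "confidence": 0.82}.

    Assumption: the model was prompted to answer in this JSON shape.
    Returns (doc_type, confidence, metadata).
    """
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}

    doc_type = str(payload.get("class", "")).strip()
    confidence = parse_confidence(payload.get("confidence"))
    if confidence is None:
        # Keep the existing 1.0 default, but mark it as not model-derived.
        return doc_type, 1.0, {"confidence_source": "default"}
    return doc_type, confidence, {"confidence_source": "model"}
```

With this shape, a consumer can treat a 1.0 tagged "default" differently from a 1.0 the model actually reported, and paths such as the regex shortcut could set their own source value in the same metadata key.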

Issue 2: Classification prompt lacks schema field names

The class list injected into Bedrock prompts only includes type name + description:

lib/idp_common_pkg/idp_common/classification/service.py ~line 741:

def _format_classes_list(self) -> str:
    return "\n".join(
        [
            f"{doc_type.type_name}  \t[ {doc_type.description} ]"
            for doc_type in self.document_types
        ]
    )

The holistic classification path (_format_classes_and_descriptions, ~line 2332) similarly uses only type + description in a markdown table.

Schema field names (e.g., property_address, appraised_value, inspection_date) from the JSON Schema properties are available in config.classes but are never included in the prompt. These field names are a strong semantic signal for disambiguation: when Bedrock sees page text containing "Appraised Value: $450,000" and the class entry lists fields like appraised_value, effective_date, borrower_name, it has a direct lexical match to rely on instead of only the type name and description.

Impact: When multiple document types have similar names or descriptions (e.g., "Appraisal Reports" vs "Inspection Reports"), the model lacks sufficient signal to disambiguate, resulting in misclassification.

Suggestion: Append a subset of schema field names (e.g., top 10-15) to each class entry in the prompt. This is a low-risk change since it only adds content to the prompt without changing its structure, and stays well within model context limits.
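One way this could look, as a standalone sketch rather than a patch to the service. The DocType dataclass is a hypothetical stand-in for the entries of self.document_types, and field_names_from_schema, the max_fields parameter, and the "fields:" label are illustrative assumptions. Only the type name and description format is taken from _format_classes_list above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DocType:
    """Hypothetical stand-in for one entry of self.document_types."""
    type_name: str
    description: str
    field_names: List[str] = field(default_factory=list)


def field_names_from_schema(schema: Dict[str, Any]) -> List[str]:
    """Collect top-level property names from a JSON Schema object."""
    properties = schema.get("properties", {})
    return list(properties.keys()) if isinstance(properties, dict) else []


def format_classes_list(document_types: List[DocType], max_fields: int = 12) -> str:
    """Render one line per class, appending up to max_fields field names."""
    lines = []
    for doc_type in document_types:
        # Same prefix as the existing prompt entry: name, tab, description.
        line = f"{doc_type.type_name}  \t[ {doc_type.description} ]"
        fields = doc_type.field_names[:max_fields]
        if fields:
            line += f"  fields: {', '.join(fields)}"
        lines.append(line)
    return "\n".join(lines)
```

Classes without schema properties render exactly as they do today, so the change is additive. The cap keeps each entry short when a schema has many properties.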

Environment

  • Version: v0.4.16
  • File: lib/idp_common_pkg/idp_common/classification/service.py
