[Feature Request] Add document layout analysis confidence scores #4320

@rehan243

Feature Description

Include confidence scores for each extracted element to help downstream processing decide which elements to trust.

Use Case

In our enterprise RAG pipeline, we process thousands of PDFs daily. Some elements are extracted with low confidence (rotated tables, scanned handwritten notes). Having confidence scores would let us:

  1. Filter out low-confidence extractions
  2. Route uncertain elements to manual review
  3. Weight chunk importance in retrieval

Current Behavior

All extracted elements are treated equally regardless of extraction quality.

Proposed Enhancement

element.metadata.confidence_score  # 0.0 - 1.0
element.metadata.extraction_method  # 'ocr', 'native', 'inferred'
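
A minimal sketch of how we would consume these fields downstream, covering the three uses listed under Use Case. The `Element` and `Metadata` classes are hypothetical stand-ins rather than the library's real types, the `triage` and `retrieval_weight` helpers are our own illustration, and the thresholds are arbitrary. Only `confidence_score` and `extraction_method` come from this proposal.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the library's element types. Only the two
# metadata fields below are part of this proposal.
@dataclass
class Metadata:
    confidence_score: Optional[float] = None   # proposed: 0.0 - 1.0
    extraction_method: Optional[str] = None    # proposed: 'ocr', 'native', 'inferred'


@dataclass
class Element:
    text: str
    metadata: Metadata


# Illustrative thresholds only; real values would be tuned per corpus.
DROP_BELOW = 0.3
REVIEW_BELOW = 0.7


def triage(elements, drop_below=DROP_BELOW, review_below=REVIEW_BELOW):
    """Split elements into (kept, review, dropped) by confidence score."""
    kept, review, dropped = [], [], []
    for el in elements:
        score = el.metadata.confidence_score
        if score is None:
            # No score available: stay conservative and route to review.
            review.append(el)
        elif score < drop_below:
            dropped.append(el)
        elif score < review_below:
            review.append(el)
        else:
            kept.append(el)
    return kept, review, dropped


def retrieval_weight(el, default=0.5):
    """Use the confidence score as a chunk weight, with a fallback."""
    score = el.metadata.confidence_score
    return default if score is None else score


elements = [
    Element("Quarterly revenue table", Metadata(0.95, "native")),
    Element("Rotated table", Metadata(0.55, "ocr")),
    Element("Handwritten note", Metadata(0.12, "ocr")),
]
kept, review, dropped = triage(elements)
```

Treating a missing score as "needs review" rather than "trusted" is a design choice on our side, so that elements from extraction paths that cannot report a score are not silently kept.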

This would significantly improve RAG quality for noisy document sources. Thank you!
