Picture classification and description silently skipped for DOCX/PPTX/XLSX/HTML — format_options dict only includes PDF and IMAGE #145

@naroam1

Summary

When submitting a DOCX (or PPTX/XLSX/HTML) to docling-serve with do_picture_description=true
and do_picture_classification=true, the flags are silently ignored. The resulting JSON has
meta=null, annotations=[], and captions=[] on every picture. No error, no warning.

Root cause is in docling-jobkit: the DoclingConverterManager only registers
format_options entries for InputFormat.PDF and InputFormat.IMAGE. All other formats
fall back to docling's bare WordFormatOption() / PowerpointFormatOption() / etc., which
default to do_picture_description=False and do_picture_classification=False.
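
The fallback described above can be illustrated with a small toy model. This is only a sketch of the lookup behaviour, not docling's or docling-jobkit's actual code: `Fmt`, `Options`, and `resolve` are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Fmt(Enum):
    # Hypothetical stand-in for docling's InputFormat.
    PDF = "pdf"
    IMAGE = "image"
    DOCX = "docx"


@dataclass
class Options:
    # Mirrors the defaults described above: enrichment is off unless requested.
    do_picture_description: bool = False
    do_picture_classification: bool = False


def resolve(registered: dict, fmt: Fmt) -> Options:
    # A format without an explicit entry gets default-constructed options,
    # so whatever the request asked for is lost at this point.
    return registered.get(fmt, Options())


# Options built from the request flags.
requested = Options(do_picture_description=True, do_picture_classification=True)

# Only PDF and IMAGE are registered, as in the manager today.
registered = {Fmt.PDF: requested, Fmt.IMAGE: requested}

pdf_opts = resolve(registered, Fmt.PDF)
docx_opts = resolve(registered, Fmt.DOCX)
print(pdf_opts.do_picture_description)   # True
print(docx_opts.do_picture_description)  # False
```

Adding a `Fmt.DOCX` entry that carries `requested` makes the second lookup return the requested flags as well.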

Since docling 2.52.0 (PR docling-project/docling#2251, 2025-09-11), ConvertPipeline and
BaseItemAndImageEnrichmentModel.prepare_element explicitly support enrichment for
documents without page images (DOCX/HTML). That support landed in docling about 8 months
ago, but docling-jobkit never plumbed the request flags through to the office FormatOptions.

Versions

  • docling-serve 1.13.1
  • docling-jobkit 1.11.0 (also reproduced on main at b6b2e02)
  • docling 2.74.0
  • docling-core 2.65.2

Reproduction

POST a DOCX containing embedded images to /v1/convert/file/async:

curl -X POST https://<docling-serve>/v1/convert/file/async \
  -H "X-Api-Key: ..." \
  -F "files=@with-images.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
  -F "to_formats=json" -F "to_formats=md" \
  -F "do_picture_classification=true" \
  -F "do_picture_description=true" \
  -F 'picture_description_api={"url":"https://your-vlm/v1/chat/completions","params":{"model":"..."},"headers":{"Authorization":"Bearer ..."},"prompt":"Describe the image."}'

Expected: pictures with meta.classification.predictions and meta.description set,
plus annotations array populated.

Actual: every picture has meta=null, annotations=[], processing completes in seconds
(no VLM call made).
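
A quick way to check a result for this symptom is to scan the exported JSON for pictures that carry no enrichment. The helper below is a minimal sketch: it assumes the JSON has a top-level `pictures` list whose items expose the `meta` and `annotations` fields mentioned above. The function name and the sample payload are made up for illustration, and the inner shape of `meta.description` is a placeholder.

```python
import json


def unenriched_pictures(doc: dict) -> list:
    """Return the indices of pictures that carry no enrichment output."""
    missing = []
    for index, picture in enumerate(doc.get("pictures", [])):
        has_meta = picture.get("meta") is not None
        has_annotations = bool(picture.get("annotations"))
        if not (has_meta or has_annotations):
            missing.append(index)
    return missing


# Hand-written sample, not real docling-serve output:
# picture 0 shows the symptom, picture 1 has a description attached.
sample = json.loads("""
{"pictures": [
  {"meta": null, "annotations": [], "captions": []},
  {"meta": {"description": {"text": "placeholder"}}, "annotations": [], "captions": []}
]}
""")

print(unenriched_pictures(sample))  # [0]
```

Running this against the DOCX result and the PDF control result should make the difference between the two visible at a glance.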

Control

The same request with a PDF works correctly: the classifier fills in 18 class predictions
per picture, and the VLM is called for each picture above the area threshold. This confirms
that the request flags are parsed and that docling-core's enrichment is functional. The only
difference is the input format and the FormatOption it routes through.

Bug location

docling_jobkit/convert/manager.py (lines 473–478 on main):

format_options: dict[InputFormat, FormatOption] = {
    InputFormat.PDF: pdf_format_option,
    InputFormat.IMAGE: image_format_option,
}
return DocumentConverter(format_options=format_options)

DOCX/PPTX/XLSX/HTML never get a FormatOption carrying the request's pipeline options.

Suggested fix

Build a ConvertPipelineOptions from the request flags and register entries for all
non-PDF formats docling supports as input. Sketch:

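from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ConvertPipelineOptions
from docling.document_converter import ExcelFormatOption, HTMLFormatOption
from docling.document_converter import PowerpointFormatOption, WordFormatOption

# One options object shared by every non-PDF format, built from the request flags.
convert_pipeline_options = ConvertPipelineOptions(
    do_picture_classification=request.do_picture_classification,
    do_picture_description=request.do_picture_description,
    picture_description_options=picture_description_options,
    do_chart_extraction=request.do_chart_extraction,
)
# Register the office/HTML formats before the DocumentConverter is constructed.
format_options[InputFormat.DOCX] = WordFormatOption(pipeline_options=convert_pipeline_options)
format_options[InputFormat.PPTX] = PowerpointFormatOption(pipeline_options=convert_pipeline_options)
format_options[InputFormat.XLSX] = ExcelFormatOption(pipeline_options=convert_pipeline_options)
format_options[InputFormat.HTML] = HTMLFormatOption(pipeline_options=convert_pipeline_options)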

Happy to send a PR if the maintainers confirm this is the right approach.
