Summary
When submitting a DOCX (or PPTX/XLSX/HTML) to docling-serve with do_picture_description=true
and do_picture_classification=true, the flags are silently ignored. The resulting JSON has
meta=null, annotations=[], and captions=[] on every picture. No error, no warning.
Root cause is in docling-jobkit: the DoclingConverterManager only registers
format_options entries for InputFormat.PDF and InputFormat.IMAGE. All other formats
fall back to docling's bare WordFormatOption() / PowerpointFormatOption() / etc., which
default to do_picture_description=False and do_picture_classification=False.
Since docling 2.52.0 (PR docling-project/docling#2251, 2025-09-11), ConvertPipeline and
BaseItemAndImageEnrichmentModel.prepare_element explicitly support enrichment for
documents without page images (DOCX/HTML). That support landed in docling 8 months ago, but
docling-jobkit never plumbed the request flags through to the office FormatOptions.
Versions
- docling-serve 1.13.1
- docling-jobkit 1.11.0 (also reproduced on main at b6b2e02)
- docling 2.74.0
- docling-core 2.65.2
Reproduction
POST a DOCX containing embedded images to /v1/convert/file/async:
curl -X POST https://<docling-serve>/v1/convert/file/async \
  -H "X-Api-Key: ..." \
  -F "files=@with-images.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
  -F "to_formats=json" -F "to_formats=md" \
  -F "do_picture_classification=true" \
  -F "do_picture_description=true" \
  -F 'picture_description_api={"url":"https://your-vlm/v1/chat/completions","params":{"model":"..."},"headers":{"Authorization":"Bearer ..."},"prompt":"Describe the image."}'
Expected: pictures with meta.classification.predictions and meta.description set,
plus a populated annotations array.
Actual: every picture has meta=null and annotations=[], and processing completes in seconds
(no VLM call is made).
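For anyone triaging, the symptom can be checked mechanically. The snippet below is a minimal sketch, assuming the converted document JSON has already been pulled out of the task result and parsed into a dict with a top-level pictures list. The helper name unenriched_pictures and the sample payload (including the shape under meta) are hypothetical and hand-written for illustration; they are not real docling-serve output.

```python
import json


def unenriched_pictures(doc: dict) -> list[int]:
    """Return the indices of pictures that carry no enrichment output."""
    missing = []
    for i, pic in enumerate(doc.get("pictures", [])):
        # A picture counts as unenriched when meta is null/absent and the
        # annotations list is empty, which is the symptom described above.
        if pic.get("meta") is None and not pic.get("annotations"):
            missing.append(i)
    return missing


# Hand-written stand-in for a converted document (illustrative only).
sample = json.loads("""
{"pictures": [
  {"meta": null, "annotations": [], "captions": []},
  {"meta": {"description": {"text": "a chart"}}, "annotations": []}
]}
""")

print(unenriched_pictures(sample))
```

With the bug present, a DOCX result would be expected to list every picture index here, while the equivalent PDF result would list none (or only pictures below the area threshold).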
Control
Same request with a PDF works correctly: classifier fills 18 class predictions per picture
and the VLM is called for each picture > area threshold. Confirms the request flags are
parsed and that docling-core's enrichment is functional. The only difference is the input
format and which FormatOption it routes through.
Bug location
docling_jobkit/convert/manager.py (lines 473–478 on main):
format_options: dict[InputFormat, FormatOption] = {
    InputFormat.PDF: pdf_format_option,
    InputFormat.IMAGE: image_format_option,
}
return DocumentConverter(format_options=format_options)
DOCX/PPTX/XLSX/HTML never get a FormatOption carrying the request's pipeline options.
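The mechanism can be illustrated without docling installed. This is a toy model only: ToyPipelineOptions, ToyFormatOption, and resolve are hypothetical stand-ins, and the lookup in resolve is an assumption about the general "fall back to a default-constructed option" behavior described above, not a copy of docling's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPipelineOptions:
    # Both enrichment flags default to off, mirroring the defaults named above.
    do_picture_classification: bool = False
    do_picture_description: bool = False


@dataclass
class ToyFormatOption:
    pipeline_options: ToyPipelineOptions = field(default_factory=ToyPipelineOptions)


def resolve(fmt: str, registered: dict[str, ToyFormatOption]) -> ToyFormatOption:
    # Formats without a registered entry get a default-constructed option,
    # so flags from the request never reach them.
    return registered.get(fmt, ToyFormatOption())


# The request asked for both enrichments, but only PDF and IMAGE are registered.
request_opts = ToyPipelineOptions(
    do_picture_classification=True,
    do_picture_description=True,
)
registered = {
    "pdf": ToyFormatOption(request_opts),
    "image": ToyFormatOption(request_opts),
}

print(resolve("pdf", registered).pipeline_options.do_picture_description)   # True
print(resolve("docx", registered).pipeline_options.do_picture_description)  # False
```

Registering a "docx" entry built from request_opts makes the second lookup return True, which is the shape of the fix proposed in the next section.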
Suggested fix
Build a ConvertPipelineOptions from the request flags and register entries for all
non-PDF formats docling supports as input. Sketch:
from docling.datamodel.pipeline_options import ConvertPipelineOptions
from docling.document_converter import (
    ExcelFormatOption,
    HTMLFormatOption,
    PowerpointFormatOption,
    WordFormatOption,
)

# One shared options object for formats that have no page images.
convert_pipeline_options = ConvertPipelineOptions(
    do_picture_classification=request.do_picture_classification,
    do_picture_description=request.do_picture_description,
    picture_description_options=picture_description_options,
    do_chart_extraction=request.do_chart_extraction,
)
format_options[InputFormat.DOCX] = WordFormatOption(pipeline_options=convert_pipeline_options)
format_options[InputFormat.PPTX] = PowerpointFormatOption(pipeline_options=convert_pipeline_options)
format_options[InputFormat.XLSX] = ExcelFormatOption(pipeline_options=convert_pipeline_options)
format_options[InputFormat.HTML] = HTMLFormatOption(pipeline_options=convert_pipeline_options)
Happy to send a PR if the maintainers confirm this is the right approach.
Related
ConvertPipeline adds enrichment for DOCX/HTML
response, asks the same question