
fix(msword): extract images from text boxes and fix OMML math spacing#3394

Open
riyo264 wants to merge 15 commits into
docling-project:mainfrom
riyo264:main



@riyo264 riyo264 commented May 3, 2026

Description

This PR addresses two specific issues in the Microsoft Word backend (msword_backend.py) regarding missing images and broken Markdown math rendering.

Key Changes:

  1. Image Extraction in Text Boxes: Fixed an issue where images embedded inside text boxes/drawing frames were being skipped during parsing. The backend now properly extracts these images and embeds them in the resulting Markdown.
  2. OMML Math Spacing (KaTeX compatibility): The OMML-to-LaTeX converter was leaving stray spaces before underscores (e.g., \tau _{max}) and emitting unparsed \texttimes commands. Forgiving engines like MathJax handled this, but stricter renderers (observed in VS Code, GitHub, and Obsidian) could fail to render the inline math at all. Added targeted regex cleanup to produce tighter LaTeX syntax (e.g., \tau_{max}).
  3. Updated Ground Truths: Ran the test suite with GENERATE=True to update the .md, .itxt, and .json snapshots for drawingml.docx, equations.docx, textbox.docx, etc., reflecting the newly extracted images and corrected math syntax.
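For readers who want to see the kind of cleanup item 2 describes, here is a minimal sketch. It is not the code from this PR: the helper name `tidy_latex` and the pattern are assumptions chosen for illustration, and mapping \texttimes to \times is one plausible reading of how the unparsed command was handled.

```python
import re

# Hypothetical pattern: whitespace that sits directly before '_' or '^'.
_SPACE_BEFORE_SCRIPT = re.compile(r"[ \t]+(?=[_^])")


def tidy_latex(latex: str) -> str:
    # Drop the space between a base and its sub/superscript marker,
    # so that "\tau _{max}" becomes "\tau_{max}".
    latex = _SPACE_BEFORE_SCRIPT.sub("", latex)
    # Assumption: swap the text-mode \texttimes for the math-mode \times.
    return latex.replace(r"\texttimes", r"\times")


print(tidy_latex(r"\tau _{max}"))
print(tidy_latex(r"a \texttimes b"))
```

A post-processing pass like this is easy to add, but it treats the symptom after the fact, which is the trade-off raised later in the review.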

Resolves #3314
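As a rough illustration of the text box case in item 1, the sketch below finds image references nested inside a text box using only the standard library. The XML sample is heavily simplified (real DOCX markup wraps these elements in more layers), and `textbox_image_rel_ids` is a hypothetical helper, not a function from msword_backend.py.

```python
import xml.etree.ElementTree as ET

# Standard Office Open XML namespaces.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Simplified, illustrative fragment: one image inside a text box.
SAMPLE = """<w:body
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:p>
    <w:txbxContent>
      <w:p><w:r><w:drawing><a:blip r:embed="rId7"/></w:drawing></w:r></w:p>
    </w:txbxContent>
  </w:p>
</w:body>"""


def textbox_image_rel_ids(xml_text: str) -> list[str]:
    """Return relationship ids of images found inside text box content."""
    root = ET.fromstring(xml_text)
    embed_attr = f"{{{NS['r']}}}embed"  # namespaced r:embed attribute
    rel_ids = []
    for textbox in root.iterfind(".//w:txbxContent", NS):
        for blip in textbox.iterfind(".//a:blip", NS):
            rel_id = blip.get(embed_attr)
            if rel_id is not None:
                rel_ids.append(rel_id)
    return rel_ids


print(textbox_image_rel_ids(SAMPLE))
```

Each returned id would still have to be resolved against the document's relationships to load the image bytes; that step is left out here.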

Note to reviewers: The large number of lines added (+1300) is almost entirely from the updated .md and .json groundtruth snapshots reflecting the improved extraction. The actual parser logic changes in msword_backend.py are very small.

Checklist

  • The PR title follows the Commit Message Formatting standard (fix(msword): ...).
  • Code has been formatted using pre-commit run --all-files (Ruff passed).
  • Type checking passes (mypy passed).
  • All automated tests pass locally (pytest tests/test_backend_msword.py).
  • Ground truth test snapshots have been updated to reflect parsing improvements.

Contributor

github-actions Bot commented May 3, 2026

DCO Check Passed

Thanks @riyo264, all your commits are properly signed off. 🎉

Contributor

mergify Bot commented May 3, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Signed-off-by: riyo264 <supriyodhani50@gmail.com>

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 2.63158% with 37 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/msword_backend.py | 2.63% | 37 Missing ⚠️ |


riyo264 and others added 3 commits May 20, 2026 21:47
…ithub.com>

I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: a70d870
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 58cd7c4
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 045b53e
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: d304349
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 9981db1
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: a95a6c7

Signed-off-by: riyo264 <supriyodhani50@gmail.com>
riyo264 and others added 3 commits May 22, 2026 14:37
…ithub.com>

I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: a70d870
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 58cd7c4
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 045b53e
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: d304349
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 9981db1
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: a95a6c7
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 02360fc
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: f3cc1d3

Signed-off-by: riyo264 <supriyodhani50@gmail.com>
Author

riyo264 commented May 22, 2026

Update on the CI Tests:
The test suite briefly failed on docx_rich_cells.docx.html. I realized that my local Mac environment (which has LibreOffice installed) was successfully rendering the legacy WMF images in that document, causing my local snapshot to differ from the headless CI server's output.

Since this PR strictly targets the OMML math formatting, I just reverted docx_rich_cells.docx.html to perfectly match main. The CI should be fully green now!

@PeterStaar-IBM
Member

ok, let's run the CI then

Author

riyo264 commented May 22, 2026

Hi @PeterStaar-IBM, I am leaving this PR as is for your review, but I wanted to provide context on the single failing CI test (test_e2e_docx_conversions).

The core math regex fixes are fully implemented, but they caused a minor 4-pixel width difference in one of the extracted baseline images (the CI is expecting 250px, but the new output is 254px).

Because I am developing on macOS, my local environment calculates the vector padding/bounding boxes slightly differently than the Ubuntu CI runner. Even when I run GENERATE=True locally, I cannot recreate the exact Linux-compatible snapshot the CI server is looking for.

Since I didn't want to unilaterally change verify_picture_image_v2 to add a pixel tolerance without your permission, I wanted to hand this over to you.
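To make the pixel-tolerance idea concrete, here is a minimal sketch of a size comparison that accepts small differences. `sizes_match` is a hypothetical helper, not part of verify_picture_image_v2 (whose signature is not shown in this thread), and the height value in the usage lines is made up for illustration.

```python
def sizes_match(
    expected: tuple[int, int], actual: tuple[int, int], tol: int = 0
) -> bool:
    """Return True if width and height each differ by at most `tol` pixels."""
    return all(abs(e - a) <= tol for e, a in zip(expected, actual))


# The 250px vs 254px widths come from the comment above; 120 is illustrative.
print(sizes_match((250, 120), (254, 120), tol=4))
print(sizes_match((250, 120), (254, 120)))
```

Whether a tolerance is appropriate here is a maintainer decision, since it would also mask genuine regressions of the same size.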

Member

@ceberam ceberam left a comment


Thanks @riyo264 for your contribution!

I have added some comments. Basically:

  • The PR seems to add duplicated text
  • Some OMML parsing leads to incorrect LaTeX formulas that were previously correct
  • The math spacing issue is addressed as a post-processing step (_clean_omml_latex). It would be cleaner if we could solve that spacing issue by fixing the OMML-to-LaTeX parsing directly.
  • To keep the library tidy and easier to maintain, I would not create a new test file docx_edge_cases.docx. The name is not very informative either. Since we already have .docx files for general and edge cases, I would add the OMML case to the equations.docx file and the image case to the drawingml.docx file.

Comment thread tests/data/groundtruth/docling_v2/drawingml.docx.md Outdated
Comment thread tests/data/groundtruth/docling_v2/equations.docx.md Outdated
Comment thread tests/data/groundtruth/docling_v2/equations.docx.md Outdated
Comment thread docling/backend/msword_backend.py Outdated
Author

riyo264 commented May 22, 2026

@ceberam Thank you for the detailed review, I really appreciate the feedback.

I understand your points and will address them as follows:

  • I'll update the text box logic to prevent the duplicated text.
  • I'll look into fixing the OMML-to-LaTeX parsing directly.
  • I'll move the new test cases into drawingml.docx and equations.docx.

I'll start working on these updates and ping you once the new commits are pushed.

riyo264 added 3 commits May 22, 2026 21:37
…ithub.com>

I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: a70d870
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 58cd7c4
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 045b53e
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: d304349
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 9981db1
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: a95a6c7
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 02360fc
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: f3cc1d3
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: 375f04c
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: a6d8761

Signed-off-by: riyo264 <supriyodhani50@gmail.com>
Author

riyo264 commented May 22, 2026

Hi @ceberam, thanks for the feedback! I've pivoted from regex patching to native fixes as suggested.

Summary of changes:

  • Native Math Fixes: Removed the _clean_omml_latex regex entirely. Spacing issues are now handled natively via .rstrip() in omml.py and by mapping \times in _MATH_CHAR_MAP.

  • Fixed Duplication: Removed the redundant elem_ref.extend call in _handle_textbox_content to prevent duplicate text items.

  • Repo Hygiene: Cleaned up temporary test files and merged necessary cases into the existing equations.docx and drawingml.docx ground truth files.

The logic is now cleaner, more robust, and the E2E tests for DOCX pass. Looking forward to your review!
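The "native" approach summarized above can be sketched as follows: fix the spacing where the LaTeX is assembled instead of patching the finished string. This is an illustration under assumptions, not the code in omml.py; `MATH_CHAR_MAP`, `translate_chars`, and `subscript` are hypothetical stand-ins for the real `_MATH_CHAR_MAP` and conversion routines.

```python
# Hypothetical character map: Unicode math symbols to LaTeX commands.
# The trailing space keeps the command separate from a following letter.
MATH_CHAR_MAP = {"×": r"\times "}


def translate_chars(text: str) -> str:
    """Replace mapped Unicode characters with their LaTeX equivalents."""
    return "".join(MATH_CHAR_MAP.get(ch, ch) for ch in text)


def subscript(base: str, sub: str) -> str:
    """Build base_{sub}, stripping trailing whitespace from the base first."""
    return f"{base.rstrip()}_{{{sub}}}"


print(subscript("\\tau ", "max"))
print(translate_chars("a×b"))
```

Stripping at construction time means no later pass has to guess which spaces are meaningful, which is the design point the review asked for.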

@riyo264 riyo264 requested a review from ceberam May 22, 2026 16:24

Development

Successfully merging this pull request may close these issues.

Docling cannot obtain the images inserted into the text box of docx file

3 participants