fix(msword): extract images from text boxes and fix OMML math spacing #3394
riyo264 wants to merge 15 commits into
✅ DCO Check Passed Thanks @riyo264, all your commits are properly signed off. 🎉 |
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
Waiting for
This rule is failing. When test data is updated, we require two reviewers
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Signed-off-by: riyo264 <supriyodhani50@gmail.com>
Signed-off-by: Supriyo <138874454+riyo264@users.noreply.github.com>
…ithub.com>
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: a70d870
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 58cd7c4
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 045b53e
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: d304349
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 9981db1
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: a95a6c7
Signed-off-by: riyo264 <supriyodhani50@gmail.com>
…ithub.com>
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: a70d870
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 58cd7c4
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 045b53e
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: d304349
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 9981db1
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: a95a6c7
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 02360fc
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: f3cc1d3
Signed-off-by: riyo264 <supriyodhani50@gmail.com>
|
Update on the CI Tests: Since this PR strictly targets the OMML math formatting, I just reverted |
|
ok, let's run the CI then |
|
Hi @PeterStaar-IBM, I am leaving this PR as it is for your review, but I wanted to provide context on the single failing CI test (test_e2e_docx_conversions). The core math regex fixes are fully implemented, but they caused a 4-pixel width difference in one of the extracted baseline images: the CI expects 250px, but the new output is 254px. Because I am developing on macOS, my local environment calculates the vector padding and bounding boxes slightly differently than the Ubuntu CI runner, so even when I run GENERATE=True locally I cannot reproduce the exact Linux-compatible snapshot the CI server is looking for. Since I didn't want to unilaterally change |
ceberam left a comment
Thanks @riyo264 for your contribution!
I have added some comments. Basically:
- The PR seems to add duplicated text
- Some OMML parsing leads to incorrect LaTeX formulas that were previously correct
- The math spacing issue is addressed as a post-processing step (`_clean_omml_latex`). It would be cleaner if we could solve that spacing issue by fixing the OMML-to-LaTeX parsing directly.
- To keep the library tidy and easier to maintain, I would not create a new test file `docx_edge_cases.docx`. The name is not very informative either. Since we already have `.docx` files for general and edge cases, I would add the case related to OMML in the `equations.docx` file, and the image in the `drawingml.docx` file.
|
@ceberam Thank you for the detailed review, I really appreciate the feedback. I totally understand your points
I'll start working on these updates and ping you once the new commits are pushed. |
…ithub.com>
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: a70d870
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 58cd7c4
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 045b53e
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: d304349
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 9981db1
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: a95a6c7
I, Supriyo <138874454+riyo264@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 02360fc
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: f3cc1d3
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: 375f04c
I, riyo264 <supriyodhani50@gmail.com>, hereby add my Signed-off-by to this commit: a6d8761
Signed-off-by: riyo264 <supriyodhani50@gmail.com>
|
Hi @ceberam, Thanks for the feedback! I've pivoted from regex patching to native fixes as suggested. Summary of changes:
The logic is now cleaner, more robust, and the E2E tests for DOCX pass. Looking forward to your review! |
Description
This PR addresses two specific issues in the Microsoft Word backend (`msword_backend.py`) regarding missing images and broken Markdown math rendering.

Key Changes:
- … `\tau _{max}`) and unparsed `\texttimes` commands. While forgiving engines like MathJax handled this, strict KaTeX engines (like VS Code, GitHub, and Obsidian) would completely fail to render the inline math. Added targeted regex cleanup to ensure textbook-perfect LaTeX syntax (e.g., `\tau_{max}`).
- … `GENERATE=True` to update the `.md`, `.itxt`, and `.json` snapshots for `drawingml.docx`, `equations.docx`, `textbox.docx`, etc., reflecting the newly extracted images and corrected math syntax.

Resolves #3314

Note to reviewers: The large number of lines added (+1300) is almost entirely from the updated `.md` and `.json` groundtruth snapshots reflecting the improved extraction. The actual parser logic changes in `msword_backend.py` are very small.

Checklist
- … `fix(msword): ...`).
- … `pre-commit run --all-files` (Ruff passed).
- … `mypy` passed).
- … `pytest tests/test_backend_msword.py`).
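
For readers unfamiliar with the spacing issue, the regex cleanup mentioned under Key Changes can be pictured roughly as below. This is only a minimal sketch, not the code from this PR: the helper name `tidy_latex` is hypothetical, and mapping `\texttimes` to `\times` is an assumption about how the unparsed command might be handled.

```python
import re

# Hypothetical helper for illustration only; not the PR's actual implementation.

# Whitespace that sits between a non-space character and a following
# subscript (_) or superscript (^) marker.
_SPACE_BEFORE_SCRIPT = re.compile(r"(?<=\S)[ \t]+(?=[_^])")

# The text-mode command \texttimes, not followed by another letter
# (so longer command names that merely start with it are left alone).
_TEXTTIMES = re.compile(r"\\texttimes(?![A-Za-z])")


def tidy_latex(latex: str) -> str:
    """Return `latex` with script spacing tightened and \\texttimes replaced."""
    # "\tau _{max}" -> "\tau_{max}"
    latex = _SPACE_BEFORE_SCRIPT.sub("", latex)
    # Assumption: \times is the intended math-mode equivalent.
    latex = _TEXTTIMES.sub(lambda _m: r"\times", latex)
    return latex


if __name__ == "__main__":
    print(tidy_latex(r"\tau _{max} = 2 \texttimes \sigma ^2"))
```

A post-processing pass like this is easy to bolt on, but as the review above points out, it treats the symptom; emitting the subscript without the stray space during OMML-to-LaTeX conversion avoids the need for it.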