fix: strip soft hyphen when joining merged text elements #3232

Merged
cau-git merged 2 commits into docling-project:main from Frank-Schruefer:fix/merge-elements-soft-hyphen-join
Apr 17, 2026

@Frank-Schruefer
Contributor

Problem

_merge_elements always joins two text elements with a space:

new_item.text += f" {merged_elem.text}"

For lines ending with U+00AD (SOFT HYPHEN) this produces abge­ rackerten instead of abgerackerten.

A soft hyphen marks an optional line-break point in typeset text (books, journals, typeset PDFs). The two fragments are halves of one hyphenated word and must be joined without a space, with the soft hyphen itself removed.
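The failure can be reproduced outside docling with plain strings. This is only a minimal sketch: the variable names are illustrative and stand in for the text of two elements, not for docling's actual objects.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most renderers

# Two line fragments as they might come out of a typeset PDF (illustrative values).
first_fragment = "abge" + SOFT_HYPHEN
second_fragment = "rackerten"

# Unconditional space join, mirroring the line quoted above.
naive = f"{first_fragment} {second_fragment}"

# ascii() escapes the soft hyphen so it becomes visible in the output.
print(ascii(naive))  # 'abge\xad rackerten'

# Both the soft hyphen and the inserted space survive inside the word.
print(SOFT_HYPHEN in naive, " " in naive)  # True True
```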

Fix

Check for trailing U+00AD before joining and use direct concatenation (after stripping the soft hyphen) in that case:

if new_item.text.endswith('\u00AD'):
    new_item.text = new_item.text[:-1] + merged_elem.text
    new_item.orig = new_item.orig[:-1] + merged_elem.text
else:
    new_item.text += f" {merged_elem.text}"
    new_item.orig += f" {merged_elem.text}"
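The same rule can be sketched as a standalone function to see its behaviour in isolation. The helper name join_fragments is hypothetical and not part of docling; it operates on plain strings rather than on new_item and merged_elem.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def join_fragments(left: str, right: str) -> str:
    """Join two text fragments (hypothetical helper, not docling API).

    If the left fragment ends with a soft hyphen, drop it and
    concatenate directly; otherwise join with a single space.
    """
    if left.endswith(SOFT_HYPHEN):
        return left[:-1] + right
    return f"{left} {right}"


# Soft-hyphenated word: the halves are rejoined without a space.
print(join_fragments("abge" + SOFT_HYPHEN, "rackerten"))  # abgerackerten

# Ordinary fragments keep the space-separated join.
print(join_fragments("first line", "second line"))  # first line second line
```

Only a trailing U+00AD triggers direct concatenation in this sketch. A fragment ending in an ordinary hyphen-minus (U+002D) is still joined with a space, since that character can be a real part of the word.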

When predict_merges joins two text elements, _merge_elements always
inserts a space: new_item.text += f' {merged_elem.text}'.

For lines ending with U+00AD (SOFT HYPHEN) this produces 'abge­ rackerten'
instead of 'abgerackerten'. A soft hyphen marks an optional line-break
point in typeset text — the two fragments are halves of one word and must
be joined without a space, with the soft hyphen itself removed.

Check for trailing U+00AD before joining and use direct concatenation
(after stripping the soft hyphen) in that case.

Signed-off-by: stone <frank.schruefer@t-online.de>
@github-actions
Contributor

github-actions Bot commented Apr 5, 2026

DCO Check Passed

Thanks @Frank-Schruefer, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented Apr 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Comment thread docling/models/stages/reading_order/readingorder_model.py
@codecov

codecov Bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
.../models/stages/reading_order/readingorder_model.py | 60.00% | 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@PeterStaar-IBM
Member

@Frank-Schruefer Can you run uv run pre-commit run --all-files a few times?

@cau-git
Member

cau-git commented Apr 15, 2026

@Frank-Schruefer We would love to get this in, but the linting and code-check toolchain needs to pass first. Could you apply it, please?

@PeterStaar-IBM
Member

Please rerun uv run pre-commit run --all-files

Signed-off-by: stone <frank.schruefer@t-online.de>
@cau-git cau-git merged commit 8274892 into docling-project:main Apr 17, 2026
24 of 25 checks passed

4 participants