fix: extend predict_merges patterns for soft hyphen and Unicode letters#155

Merged
nikos-livathinos merged 3 commits into docling-project:main from Frank-Schruefer:fix/predict-merges-regex-continuation
Apr 21, 2026

@Frank-Schruefer
Contributor

Problem

The merge-continuation heuristic in predict_merges uses two patterns:

  • m1: checks that the current element ends with a continuation character [a-z,\-]
  • m2: checks that the next element starts with a lowercase letter [a-z]

Two gaps prevent valid merges:

1. Soft hyphen (U+00AD) not handled

Typeset PDFs (books, journals) use U+00AD (SOFT HYPHEN) for optional line-break hyphenation. The character is normally invisible in rendered text but is present in the PDF text layer. A line ending with U+00AD is unambiguously a mid-word line break, but the original pattern [a-z,\-] does not match it, so the two halves of the hyphenated word end up as separate paragraphs in the output.
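
A minimal sketch of this first gap, using only the two m1 patterns quoted in this description. The sample line is made up for illustration and is not taken from a real PDF.

```python
import re

# m1 patterns copied from this PR; the sample text is a made-up illustration.
M1_BEFORE = r".+([a-z,\-])(\s*)"
M1_AFTER = r".+([a-z,\-\u00AD])(\s*)"

# A line whose last character is U+00AD (SOFT HYPHEN).
line = "The results of the experi\u00ad"

before = re.fullmatch(M1_BEFORE, line)
after = re.fullmatch(M1_AFTER, line)

print(before is None)     # True: U+00AD is neither in [a-z,\-] nor whitespace
print(after is not None)  # True: the extended class accepts it
```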

2. m2 rejects uppercase continuation lines

m2 only accepted [a-z], so continuation lines starting with an uppercase letter were never merged. This affects:

  • German nouns (always uppercase)
  • Lines after hard line-breaks in typeset text where the next word happens to be capitalised
  • Any mid-sentence line-break where m1 clearly indicates no sentence end (no ., !, ?)

Since m1 only matches when the previous line ends with a lowercase letter, a comma, a hyphen, or a soft hyphen (and therefore not with sentence-ending punctuation), allowing uppercase in m2 is safe.
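
A small sketch of the second gap, comparing the two m2 patterns from this description on a few made-up continuation lines. On its own, m2 accepts any line starting with a letter in its class; the guard against merging across sentence boundaries remains m1.

```python
import re

# m2 patterns copied from this PR; the sample lines are made up.
M2_BEFORE = r"(\s*[a-z])(.+)"
M2_AFTER = r"(\s*[a-zA-Z\u00C0-\u024F])(.+)"

samples = [
    "Haus am See",                # capitalised German noun
    "\u00dcbersetzung der Texte",  # starts with U+00DC (uppercase U with umlaut)
    "continues here",             # lowercase start, accepted by both patterns
]

results = {}
for text in samples:
    old = re.fullmatch(M2_BEFORE, text) is not None
    new = re.fullmatch(M2_AFTER, text) is not None
    results[text] = (old, new)
    print(f"{text!r}: before={old}, after={new}")
```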

Fix

# before
m1 = re.fullmatch(r".+([a-z,\-])(\s*)", elem.text)
m2 = re.fullmatch(r"(\s*[a-z])(.+)", sorted_elements[ind_p1].text)

# after
m1 = re.fullmatch(r".+([a-z,\-\u00AD])(\s*)", elem.text)
m2 = re.fullmatch(r"(\s*[a-zA-Z\u00C0-\u024F])(.+)", sorted_elements[ind_p1].text)
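
The two "after" patterns can be tried together in isolation. The sketch below is not the predict_merges implementation: `looks_like_continuation` is a hypothetical helper that simply requires both patterns to match, and all sample strings are invented.

```python
import re

# "After" patterns copied from this PR.
M1 = re.compile(r".+([a-z,\-\u00AD])(\s*)")
M2 = re.compile(r"(\s*[a-zA-Z\u00C0-\u024F])(.+)")


def looks_like_continuation(current: str, following: str) -> bool:
    """Hypothetical helper: True when both patterns match their line."""
    return M1.fullmatch(current) is not None and M2.fullmatch(following) is not None


# Mid-sentence break followed by a capitalised noun.
mid_sentence = looks_like_continuation("Das ist ein langes", "Haus am See")
# Previous line ends with a full stop, so m1 rejects the pair.
after_period = looks_like_continuation("Ende des Satzes.", "Neuer Satz beginnt")
# Word split at a soft hyphen.
soft_hyphen = looks_like_continuation("experi\u00ad", "ment results")

print(mid_sentence, after_period, soft_hyphen)  # True False True

# \u00C0-\u024F is a contiguous code-point range, so it also contains the
# two non-letters U+00D7 (multiplication sign) and U+00F7 (division sign).
starts_with_times = M2.fullmatch("\u00d7 3 columns") is not None
print(starts_with_times)  # True
```

Compiling the patterns once is only a convenience for the sketch; the snippet above calls `re.fullmatch` directly.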

The merge-continuation heuristic in predict_merges uses two regex patterns:
- m1 checks that the current element ends with a continuation character
- m2 checks that the next element starts with a lowercase letter

Two gaps in the original patterns prevented valid merges:

1. Soft hyphen (U+00AD): typeset PDFs use U+00AD for optional line-break
   hyphenation. The character is invisible in rendered text but present in
   the PDF text layer. A line ending with U+00AD is always a continuation
   (hyphenated word split across lines), but the original [a-z,\-] pattern
   did not match it.

2. m2 only accepted [a-z], so a continuation line starting with an
   uppercase letter (e.g. a German noun, or a word after a hard line-break
   in typeset text) was never merged even when m1 clearly indicated a
   mid-sentence break. Extended to [a-zA-Z\u00C0-\u024F] to cover
   Latin letters including accented and umlauted characters.

Signed-off-by: stone <frank.schruefer@t-online.de>
@github-actions
Contributor

github-actions Bot commented Apr 5, 2026

DCO Check Passed

Thanks @Frank-Schruefer, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented Apr 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
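
As a rough illustration, the title rule can be checked locally with Python's `re` module. This assumes that the rule's `~=` operator behaves like a regular-expression search, which is an assumption made for this sketch and not something stated in this conversation.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_RULE = r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"

title = "fix: extend predict_merges patterns for soft hyphen and Unicode letters"

title_ok = re.search(TITLE_RULE, title) is not None
# A made-up title without a conventional-commit prefix.
plain_ok = re.search(TITLE_RULE, "Extend predict_merges patterns") is not None

print(title_ok)  # True
print(plain_ok)  # False
```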

@nikos-livathinos nikos-livathinos self-requested a review April 7, 2026 08:07
Member

@nikos-livathinos nikos-livathinos left a comment

Looks good to me. But the code should be properly formatted and pass all code-checks.

@nikos-livathinos nikos-livathinos self-requested a review April 7, 2026 09:23
Signed-off-by: stone <frank.schruefer@t-online.de>
@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Signed-off-by: Nikos Livathinos <100353117+nikos-livathinos@users.noreply.github.com>
@nikos-livathinos nikos-livathinos merged commit c5e0d0f into docling-project:main Apr 21, 2026
11 checks passed