fix: extend predict_merges patterns for soft hyphen and Unicode letters #155
Merged
nikos-livathinos merged 3 commits on Apr 21, 2026
The merge-continuation heuristic in predict_merges uses two regex patterns:

- m1 checks that the current element ends with a continuation character
- m2 checks that the next element starts with a lowercase letter

Two gaps in the original patterns prevented valid merges:

1. Soft hyphen (U+00AD): typeset PDFs use U+00AD for optional line-break hyphenation. The character is invisible in rendered text but present in the PDF text layer. A line ending with U+00AD is always a continuation (hyphenated word split across lines), but the original [a-z,\-] pattern did not match it.

2. m2 only accepted [a-z], so a continuation line starting with an uppercase letter (e.g. a German noun, or a word after a hard line-break in typeset text) was never merged even when m1 clearly indicated a mid-sentence break. Extended to [a-zA-Z\u00C0-\u024F] to cover Latin letters including accented and umlauted characters.

Signed-off-by: stone <frank.schruefer@t-online.de>
✅ DCO Check Passed: Thanks @Frank-Schruefer, all your commits are properly signed off.

Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
nikos-livathinos
previously approved these changes
Apr 7, 2026
Signed-off-by: stone <frank.schruefer@t-online.de>
nikos-livathinos
previously approved these changes
Apr 17, 2026
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Signed-off-by: Nikos Livathinos <100353117+nikos-livathinos@users.noreply.github.com>
nikos-livathinos
approved these changes
Apr 21, 2026
Problem
The merge-continuation heuristic in predict_merges uses two patterns:

- m1, [a-z,\-]: the current element ends with a continuation character
- m2, [a-z]: the next element starts with a lowercase letter

Two gaps prevent valid merges:

1. Soft hyphen (U+00AD) not handled

Typeset PDFs (books, journals) use U+00AD (SOFT HYPHEN) for optional line-break hyphenation. The character is invisible in rendered text but present in the PDF text layer. A line ending with U+00AD is unambiguously a mid-word line break, but the original pattern [a-z,\-] does not match it, so the two halves of the hyphenated word end up as separate paragraphs in the output.

2. m2 rejects uppercase continuation lines

m2 only accepted [a-z], so continuation lines starting with an uppercase letter were never merged. This affects:

- German nouns
- words after a hard line break in typeset text, where the previous line does not end with sentence-ending punctuation (., !, ?)

Since m1 already ensures the previous line did not end with sentence-ending punctuation, allowing uppercase in m2 is safe.
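Both gaps can be reproduced with a short sketch. Only the character classes [a-z,\-] and [a-z] come from the description; the surrounding regex shape, the use of fullmatch, and the helper name would_merge_original are assumptions made for illustration and may differ from the actual predict_merges code.

```python
import re

# Assumed shape of the original patterns; only the character classes
# [a-z,\-] and [a-z] are taken from the PR description.
M1_ORIGINAL = re.compile(r".*[a-z,\-]\s*", re.DOTALL)  # current element ends with a continuation character
M2_ORIGINAL = re.compile(r"\s*[a-z].*", re.DOTALL)     # next element starts with a lowercase letter


def would_merge_original(current: str, nxt: str) -> bool:
    """Hypothetical helper: True if both original patterns match."""
    return bool(M1_ORIGINAL.fullmatch(current)) and bool(M2_ORIGINAL.fullmatch(nxt))


# A visible hyphen at the line end is handled.
hard_hyphen = would_merge_original("the docu-", "ment is long")

# Gap 1: a soft hyphen (U+00AD) at the line end is not matched by m1.
soft_hyphen = would_merge_original("the docu\u00ad", "ment is long")

# Gap 2: an uppercase start on the next line is not matched by m2.
uppercase_start = would_merge_original("das alte", "Haus steht leer")
```

Under these assumptions, only the first case is treated as a continuation; the soft-hyphen and uppercase cases are left as separate elements.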
Fix
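A minimal sketch of what the extended patterns could look like. The m2 class [a-zA-Z\u00C0-\u024F] is taken from the commit message; the m1 class with U+00AD added, the surrounding regex shape, and the helper name would_merge_extended are assumptions for illustration, not the merged code.

```python
import re

# m1: original class [a-z,\-] plus the soft hyphen U+00AD (assumed form).
M1_EXTENDED = re.compile(r".*[a-z,\-\u00AD]\s*", re.DOTALL)
# m2: class from the commit message, covering ASCII letters and U+00C0..U+024F.
M2_EXTENDED = re.compile(r"\s*[a-zA-Z\u00C0-\u024F].*", re.DOTALL)


def would_merge_extended(current: str, nxt: str) -> bool:
    """Hypothetical helper: True if both extended patterns match."""
    return bool(M1_EXTENDED.fullmatch(current)) and bool(M2_EXTENDED.fullmatch(nxt))


soft_hyphen = would_merge_extended("the docu\u00ad", "ment is long")    # soft hyphen at line end
uppercase_start = would_merge_extended("das alte", "Haus steht leer")   # German noun
umlaut_start = would_merge_extended("siehe die,", "Übersicht unten")    # U+00DC is inside the range
sentence_end = would_merge_extended("Ende.", "Neuer Absatz")            # m1 still rejects "."
digit_start = would_merge_extended("see section", "3 for details")      # m2 still rejects digits
```

The range U+00C0 to U+024F also contains the multiplication sign (U+00D7) and division sign (U+00F7), so a stricter variant could exclude those two code points if they matter for a given corpus.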