ICU-23440 Merge the end-of-text and Sentence_Break=Sep symbols in the sentence breaking state machine#4036

Open
eggrobin wants to merge 1 commit into unicode-org:main from eggrobin:eot=Sep
@eggrobin eggrobin commented Jun 19, 2026

Member

No functional change, but saves a few bytes. See the changes to the state machine: eggrobin/unicodetools@1665e41.
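As a rough illustration of why merging two input symbols can save space without changing behaviour, here is a toy sketch. The states, symbols, and transitions below are invented for the example and are not the actual ICU sentence-break tables; the helper names (`identical_columns`, `drop_symbol`, `run`) are hypothetical. The idea is that if two symbols lead to the same next state from every state, one column can be dropped and the dropped symbol treated as an alias of the other.

```python
from itertools import combinations

# Toy transition table (invented, NOT the ICU tables):
# state -> {input symbol -> next state}
TABLE = {
    "text":       {"letter": "text",     "term": "after_term", "sep": "boundary", "eot": "boundary"},
    "after_term": {"letter": "boundary", "term": "after_term", "sep": "boundary", "eot": "boundary"},
    "boundary":   {"letter": "text",     "term": "after_term", "sep": "boundary", "eot": "boundary"},
}


def identical_columns(table):
    """Return pairs of symbols whose transitions agree in every state."""
    symbols = sorted(next(iter(table.values())))
    return [
        (a, b)
        for a, b in combinations(symbols, 2)
        if all(row[a] == row[b] for row in table.values())
    ]


def drop_symbol(table, drop):
    """Return a copy of the table without the column for `drop`."""
    return {
        state: {sym: nxt for sym, nxt in row.items() if sym != drop}
        for state, row in table.items()
    }


def run(table, start, symbols, alias=None):
    """Feed symbols through the table, mapping aliased symbols first."""
    alias = alias or {}
    state = start
    for sym in symbols:
        state = table[state][alias.get(sym, sym)]
    return state


# In this toy table, "eot" and "sep" are the only interchangeable pair.
pairs = identical_columns(TABLE)

# Drop the "eot" column and treat "eot" as an alias of "sep" at lookup time.
merged = drop_symbol(TABLE, "eot")
inputs = ["letter", "term", "letter", "eot"]
same = run(TABLE, "boundary", inputs) == run(merged, "boundary", inputs, alias={"eot": "sep"})
```

In this sketch the merged table has one column fewer per state, while every input sequence ends in the same state as before.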

Checklist

  • Required: Issue filed: ICU-23440
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

@eggrobin eggrobin requested a review from robertbastian June 19, 2026 19:02
@markusicu

Member

The TC only glanced at this and has no opinion... it would be nice if the description in the ticket and in the PR was less heavy on obscure (to segmentation outsiders) abbreviations...

eggrobin commented Jul 2, 2026

Member Author

> The TC only glanced at this and has no opinion... it would be nice if the description in the ticket and in the PR was less heavy on obscure (to segmentation outsiders) abbreviations...

It would be very nice if the long aliases for Sentence_Break and Word_Break values were not abbreviations. Sep is the long alias for sb=SE, and it actually means something like paragraph separators.

Apparently those cryptic names were coined by one mark.davis@us.ibm.com twenty-five years ago in https://www.unicode.org/reports/tr29/tr29-1.html. (I also find Word_Break=ALetter annoying; that one first shows up in https://www.unicode.org/reports/tr29/tr29-2.html.)
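For readers outside segmentation, the alias problem can be made concrete with a small lookup table. This is a sketch: the short-to-long mapping follows the Sentence_Break entries in Unicode's PropertyValueAliases.txt, but the plain-English glosses and the `describe` helper are informal additions for illustration, not official names.

```python
# Sentence_Break value aliases: short alias -> long alias.
SENTENCE_BREAK_ALIASES = {
    "AT": "ATerm", "CL": "Close", "CR": "CR", "EX": "Extend",
    "FO": "Format", "LE": "OLetter", "LF": "LF", "LO": "Lower",
    "NU": "Numeric", "SC": "SContinue", "SE": "Sep", "SP": "Sp",
    "ST": "STerm", "UP": "Upper", "XX": "Other",
}

# Informal glosses (assumptions for illustration, not Unicode terminology).
GLOSSES = {
    "Sep": "paragraph or line separator",
    "Sp": "space",
    "ATerm": "ambiguous sentence terminator, such as a full stop",
    "STerm": "unambiguous sentence terminator",
    "OLetter": "other letter",
}


def describe(short_alias):
    """Expand a short alias such as 'SE' into a more readable description."""
    long_alias = SENTENCE_BREAK_ALIASES[short_alias]
    gloss = GLOSSES.get(long_alias, "no gloss in this sketch")
    return f"sb={short_alias} ({long_alias}): {gloss}"


description = describe("SE")
```

Even the long aliases (`Sep`, `Sp`, `ATerm`) need a gloss to be readable, which is the point being made above.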

@jira-pull-request-webhook


Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu

Member

Just saying that the PR description and ticket should be more readable to non-segmenters than “eot=Sep for sent” which means nothing to all but a handful of people.

eggrobin commented Jul 2, 2026

Member Author

Well, the ticket is more verbose. But despite its verbosity, even I find any discussion of sentence breaking quite impenetrable, because the value aliases are too short to be descriptive…

@eggrobin eggrobin changed the title from "ICU-23440 eot=Sep for sent" to "ICU-23440 Merge the end-of-text and Sentence_Break=Sep symbols in the sentence breaking state machine" on Jul 2, 2026
@markusicu

Member

Thanks for updating the description!
Also thanks for the lengthy comment in the rules files.
I restarted the stuck CI check.


3 participants