gh-130273: Add pure Python implementation of unicodedata.iter_graphemes() #148218
ambv wants to merge 6 commits into python:main
Conversation
New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, using the unicodedata.grapheme_cluster_break(), extended_pictographic(), and indic_conjunct_break() property accessors.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a pure-Python implementation of Unicode TR29 extended grapheme cluster segmentation to mirror unicodedata.iter_graphemes(), and refactors the existing grapheme-break tests so both the C and Python implementations can share the same conformance suite.
Changes:
- Introduces Lib/_py_grapheme.py, implementing TR29 Extended Grapheme Cluster segmentation using unicodedata property accessors.
- Refactors GraphemeBreakTest into a BaseGraphemeBreakTest mixin and adds PyGraphemeBreakTest to exercise the Python implementation.
- Shares the TR29 conformance test (GraphemeBreakTest.txt) across both implementations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Lib/_py_grapheme.py | New pure-Python TR29 grapheme cluster iterator returning Segment objects. |
| Lib/test/test_unicodedata.py | Test refactor into a shared base mixin + new test class targeting _py_grapheme.iter_graphemes. |
Add makegraphemedata() to Tools/unicode/makeunicodedata.py that generates Lib/_py_grapheme_db.py from the Unicode data files (GraphemeBreakProperty.txt, emoji-data.txt, DerivedCoreProperties.txt). _py_grapheme.py now imports property tables from _py_grapheme_db and uses bisect for lookups instead of calling unicodedata functions added in 3.15. This makes the module usable on Python 3.13 and 3.14 by regenerating the tables for the appropriate Unicode version. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
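As a rough illustration of the bisect-based table lookup described above: the sketch below uses an invented, tiny range table and a hypothetical `grapheme_cluster_break()` helper, not the actual layout that `makeunicodedata.py` would emit into `_py_grapheme_db.py`.

```python
import bisect

# Illustrative excerpt of a generated property table: sorted, non-overlapping
# (start, end, value) codepoint ranges.  The real generated tables would cover
# the full Unicode range; these few entries are just for demonstration.
_RANGES = [
    (0x000A, 0x000A, "LF"),
    (0x000D, 0x000D, "CR"),
    (0x0300, 0x036F, "Extend"),              # combining diacritical marks
    (0x1F1E6, 0x1F1FF, "Regional_Indicator"),
]
_STARTS = [start for start, _, _ in _RANGES]

def grapheme_cluster_break(ch):
    """Look up the Grapheme_Cluster_Break property via binary search."""
    cp = ord(ch)
    # Find the rightmost range whose start is <= cp.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0:
        start, end, value = _RANGES[i]
        if cp <= end:
            return value
    return "Other"

print(grapheme_cluster_break("\u0301"))  # Extend
print(grapheme_cluster_break("A"))       # Other
```

This keeps the lookup at O(log n) over the number of ranges with no C extension involved, which is what makes the module usable on 3.13 and 3.14.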
@StanFromIreland the exercise to see a pure Python reimplementation is interesting to me in a grander scheme, which is providing the entirety of unicodedata in a pure Python version. That way the JIT could use that instead of hitting the C extension boundary every time unicodedata is needed. So this version wouldn't only be used for the backport, but could have wider use for perf optimization in the future. In any case, I asked Serhiy on gh-142529 to decide.
The traceback issue gh-130273 has been fixed in the main branch using … Adding the pure Python …

Does the JIT already support calling a different implementation of a function? Here two modules have to be imported to call a single function (…). I'm not convinced by the JIT argument. IMO the … I suggest not to backport the traceback fix to the 3.13 and 3.14 stable branches. And I don't think that it's worth it to add a pure Python implementation of …
The argument isn't about … Me exercising the re-implementation of a subset of … Now the … I agree that we usually keep backported changes small, but it's not unheard of to backport several hundred lines of code. Here, I'd argue that … I hear your argument, let's see what Serhiy's got to say.
malemburg left a comment
The new test cases look fine, but I don't see much point in adding a pure Python version of the huge unicodedata database to Python, so -1 on those parts.
If people want to use such a pure Python implementation, they should download a package from PyPI which provides this.
serhiy-storchaka left a comment
This is an interesting exercise. BTW, I initially asked ChatGPT to write a grapheme clusterization algorithm in Python. It first provided an outdated algorithm, but after I checked the specification and pointed out errors, it started hallucinating versions close to correct (missing some details at each iteration). Many of them were simpler than the code in this PR, because they checked several previous characters.
How fast is the Python implementation? I think that it will be 2 or 3 decimal orders of magnitude slower than the C implementation. I have doubts that the JIT can speed it up 100 times. I do not think that PEP 399 requires writing every piece of code in Python. We just dropped pure Python implementations of the Base32 and Base85/Ascii85 codecs in favor of C code, which is two orders of magnitude faster and consumes two orders of magnitude less memory.
The original report contains:
the carets' positions are compatible with wide unicode characters. I believe the colors should also be compatible.
So, the reporter was satisfied with the 3.13 code for calculating the text width; the problem was only that it was not consistent for caret position and colorization. It was not always correct, but it worked in many cases. The code in this PR looks too large and too complex for a bugfix. We lived without any support for wide characters, and we lived with imperfect support. I consider the full support of the Unicode algorithm a new feature.
Note also that the Unicode 17.0 algorithm will be incorrect on a terminal emulator using the Unicode 9.0 algorithm, and vice versa. We cannot guarantee perfect output on all terminals.
```python
def _grapheme_break(prev_gcb, curr_gcb, ep_state, ri_flag, incb_state):
    """Return True if a grapheme cluster break occurs between two characters."""
    # GB3: Do not break between a CR and LF.
    if prev_gcb == GCB_CR and curr_gcb == GCB_LF:
```
You can use `in`.
Also, I think that for the Python implementation it would be more efficient to use a precomputed mapping (dict or 2-level array) `(prev_gcb, curr_gcb) -> bool | special`, where `special` is a singleton indicating that we should use `ep_state`, `ri_flag`, or `incb_state` (we can use `None`).
```python
    # Update Extended Pictographic FSM (GB11)
    ext_pict = _get_ext_pict(cp)
    if ext_pict:
```
Again, we can use a mapping `(ep_state, ext_pict) -> ep_state`.
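The suggested state-transition mapping for the GB11 machine might look like the following sketch. The state names, input classes, and helper functions are invented for illustration; the PR's actual state encoding may differ:

```python
# Invented GB11 FSM states: have we seen ExtPict, then ExtPict Extend* ZWJ?
EP_NONE, EP_PICT, EP_PICT_ZWJ = range(3)
# Invented input character classes for the FSM.
IN_PICT, IN_EXTEND, IN_ZWJ, IN_OTHER = range(4)

# Precomputed transition table: (state, input class) -> next state.
# Pairs missing from the table fall back to the default in ep_step().
_EP_NEXT = {
    (EP_NONE, IN_PICT): EP_PICT,      # saw Extended_Pictographic
    (EP_PICT, IN_EXTEND): EP_PICT,    # ExtPict Extend*
    (EP_PICT, IN_ZWJ): EP_PICT_ZWJ,   # ExtPict Extend* ZWJ
    (EP_PICT_ZWJ, IN_PICT): EP_PICT,  # GB11 match; new pictograph restarts
}

def ep_step(state, cls):
    """Advance the GB11 state machine by one character class."""
    default = EP_PICT if cls == IN_PICT else EP_NONE
    return _EP_NEXT.get((state, cls), default)

def gb11_no_break(state, cls):
    """GB11: do not break within emoji ZWJ sequences."""
    return state == EP_PICT_ZWJ and cls == IN_PICT

# Walking "pictograph, ZWJ, pictograph": the break before the second
# pictograph is suppressed by GB11.
state = EP_NONE
state = ep_step(state, IN_PICT)   # -> EP_PICT
state = ep_step(state, IN_ZWJ)    # -> EP_PICT_ZWJ
assert gb11_no_break(state, IN_PICT)
```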
```python
        ep_state = _EP_INIT
```
```python
    # Update Regional Indicator flag (GB12/GB13)
    ri_flag = not ri_flag if curr_gcb == GCB_Regional_Indicator else False
```
```diff
- ri_flag = not ri_flag if curr_gcb == GCB_Regional_Indicator else False
+ ri_flag = curr_gcb == GCB_Regional_Indicator and not ri_flag
```
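An exhaustive check (with placeholder GCB constants) confirms that the suggested one-liner is equivalent to the original conditional expression for every input:

```python
# Placeholder values; only equality with GCB_Regional_Indicator matters here.
GCB_Regional_Indicator, GCB_Other = 1, 0

for curr_gcb in (GCB_Regional_Indicator, GCB_Other):
    for ri_flag in (True, False):
        original = not ri_flag if curr_gcb == GCB_Regional_Indicator else False
        suggested = curr_gcb == GCB_Regional_Indicator and not ri_flag
        assert original == suggested, (curr_gcb, ri_flag)
print("equivalent for all inputs")
```

The rewrite reads as "the flag toggles on consecutive Regional Indicators and resets otherwise", which matches GB12/GB13's even/odd RI counting.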
```python
    # Update Indic Conjunct Break FSM (GB9c)
    curr_incb = _get_incb(cp)
    if curr_incb == InCB_Consonant:
```
Again, we can use a mapping `(incb_state, curr_incb) -> incb_state`.
```diff
- class GraphemeBreakTest(unittest.TestCase):
+ class BaseGraphemeBreakTest:
+     iter_graphemes = staticmethod(unicodedata.iter_graphemes)
```
Not needed. It is initialized in subclasses.
A slightly different exercise, backporting the C implementation to 3.14: #148247 |
New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, without relying on unicodedata.grapheme_cluster_break(), extended_pictographic(), and indic_conjunct_break(), which were also added in Python 3.15.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.