gh-130273: Add pure Python implementation of unicodedata.iter_graphemes() #148218
ambv wants to merge 6 commits into python:main
Conversation
New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, using the unicodedata.grapheme_cluster_break(), extended_pictographic(), and indic_conjunct_break() property accessors.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a pure-Python implementation of Unicode TR29 extended grapheme cluster segmentation to mirror unicodedata.iter_graphemes(), and refactors the existing grapheme-break tests so both the C and Python implementations can share the same conformance suite.
Changes:
- Introduces Lib/_py_grapheme.py, implementing TR29 Extended Grapheme Cluster segmentation using unicodedata property accessors.
- Refactors GraphemeBreakTest into a BaseGraphemeBreakTest mixin and adds PyGraphemeBreakTest to exercise the Python implementation.
- Shares the TR29 conformance test (GraphemeBreakTest.txt) across both implementations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Lib/_py_grapheme.py | New pure-Python TR29 grapheme cluster iterator returning Segment objects. |
| Lib/test/test_unicodedata.py | Test refactor into a shared base mixin + new test class targeting _py_grapheme.iter_graphemes. |
Add makegraphemedata() to Tools/unicode/makeunicodedata.py that generates Lib/_py_grapheme_db.py from the Unicode data files (GraphemeBreakProperty.txt, emoji-data.txt, DerivedCoreProperties.txt). _py_grapheme.py now imports property tables from _py_grapheme_db and uses bisect for lookups instead of calling unicodedata functions added in 3.15. This makes the module usable on Python 3.13 and 3.14 by regenerating the tables for the appropriate Unicode version. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
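As a rough illustration of the bisect-based table lookup described above: the sketch below uses an invented, tiny range table and a hypothetical `grapheme_cluster_break()` helper, not the actual layout that `makeunicodedata.py` would emit into `_py_grapheme_db.py`.

```python
import bisect

# Illustrative excerpt of a generated property table: sorted, non-overlapping
# (start, end, value) codepoint ranges.  The real generated tables would cover
# the full Unicode range; these few entries are just for demonstration.
_RANGES = [
    (0x000A, 0x000A, "LF"),
    (0x000D, 0x000D, "CR"),
    (0x0300, 0x036F, "Extend"),              # combining diacritical marks
    (0x1F1E6, 0x1F1FF, "Regional_Indicator"),
]
_STARTS = [start for start, _, _ in _RANGES]

def grapheme_cluster_break(ch):
    """Look up the Grapheme_Cluster_Break property via binary search."""
    cp = ord(ch)
    # Find the rightmost range whose start is <= cp.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0:
        start, end, value = _RANGES[i]
        if cp <= end:
            return value
    return "Other"

print(grapheme_cluster_break("\u0301"))  # Extend
print(grapheme_cluster_break("A"))       # Other
```

This keeps the lookup at O(log n) over the number of ranges with no C extension involved, which is what makes the module usable on 3.13 and 3.14.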
@StanFromIreland the exercise to see a pure Python reimplementation is interesting to me in a grander scheme, which is providing the entirety of unicodedata in a pure Python version. That way the JIT could use that instead of hitting the C extension boundary every time unicodedata is needed. So this version wouldn't only be used for the backport, but could have wider use for perf optimization in the future. In any case, I asked Serhiy on gh-142529 to decide.
The traceback issue gh-130273 has been fixed in the main branch using … Adding the pure Python …

Does the JIT already support calling a different implementation of a function? Here two modules have to be imported to call a single function (…). I'm not convinced by the JIT argument. IMO the … I suggest not to backport the traceback fix to the 3.13 and 3.14 stable branches. And I don't think that it's worth it to add a pure Python implementation of …
The argument isn't about … Me exercising the re-implementation of a subset of … Now the … I agree that we usually keep backported changes small, but it's not unheard of to backport several hundred lines of code. Here, I'd argue that … I hear your argument, let's see what Serhiy's got to say.
malemburg left a comment
The new test cases look fine, but I don't see much point in adding a pure Python version of the huge unicodedata database to Python, so -1 on those parts.
If people want to use such a pure Python implementation, they should download a package from PyPI which provides this.
serhiy-storchaka left a comment
This is an interesting exercise. BTW, I initially asked ChatGPT to write a grapheme clusterization algorithm in Python. It first provided an outdated algorithm, but after I checked the specification and pointed out errors, it started hallucinating versions close to correct (missing some details at each iteration). Many of them were simpler than the code in this PR, because they checked several previous characters.
How fast is the Python implementation? I think that it will be 2 or 3 decimal orders of magnitude slower than the C implementation. I have doubts that the JIT can speed it up 100 times. I do not think that PEP 399 requires writing every piece of code in Python. We just dropped pure Python implementations of the Base32 and Base85/Ascii85 codecs in favor of C code, which is two orders of magnitude faster and consumes two orders of magnitude less memory.
The original report contains:
the carets' positions are compatible with wide unicode characters. I believe the colors should also be compatible.
So, the reporter was satisfied with the 3.13 code for calculating the text width; the problem was only that it was not consistent for caret position and colorization. It was not always correct, but it worked in many cases. The code in this PR looks too large and too complex for a bugfix. We lived without any support for wide characters, and we lived with imperfect support. I consider the full support of the Unicode algorithm a new feature.
Note also that the Unicode 17.0 algorithm will be incorrect on a terminal emulator using the Unicode 9.0 algorithm, and vice versa. We cannot guarantee perfect output on all terminals.
```python
def _grapheme_break(prev_gcb, curr_gcb, ep_state, ri_flag, incb_state):
    """Return True if a grapheme cluster break occurs between two characters."""
    # GB3: Do not break between a CR and LF.
    if prev_gcb == GCB_CR and curr_gcb == GCB_LF:
```
You can use `in`.
Also, I think that for the Python implementation it would be more efficient to use a precomputed mapping (dict or 2-level array) `(prev_gcb, curr_gcb) -> bool | special`, where `special` is a singleton indicating that we should use `ep_state`, `ri_flag`, or `incb_state` (we can use `None`).
```python
    # Update Extended Pictographic FSM (GB11)
    ext_pict = _get_ext_pict(cp)
    if ext_pict:
```
Again, we can use a mapping `(ep_state, ext_pict) -> ep_state`.
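The suggested state-transition mapping for the GB11 machine might look like the following sketch. The state names, input classes, and helper functions are invented for illustration; the PR's actual state encoding may differ:

```python
# Invented GB11 FSM states: have we seen ExtPict, then ExtPict Extend* ZWJ?
EP_NONE, EP_PICT, EP_PICT_ZWJ = range(3)
# Invented input character classes for the FSM.
IN_PICT, IN_EXTEND, IN_ZWJ, IN_OTHER = range(4)

# Precomputed transition table: (state, input class) -> next state.
# Pairs missing from the table fall back to the default in ep_step().
_EP_NEXT = {
    (EP_NONE, IN_PICT): EP_PICT,      # saw Extended_Pictographic
    (EP_PICT, IN_EXTEND): EP_PICT,    # ExtPict Extend*
    (EP_PICT, IN_ZWJ): EP_PICT_ZWJ,   # ExtPict Extend* ZWJ
    (EP_PICT_ZWJ, IN_PICT): EP_PICT,  # GB11 match; new pictograph restarts
}

def ep_step(state, cls):
    """Advance the GB11 state machine by one character class."""
    default = EP_PICT if cls == IN_PICT else EP_NONE
    return _EP_NEXT.get((state, cls), default)

def gb11_no_break(state, cls):
    """GB11: do not break within emoji ZWJ sequences."""
    return state == EP_PICT_ZWJ and cls == IN_PICT

# Walking "pictograph, ZWJ, pictograph": the break before the second
# pictograph is suppressed by GB11.
state = EP_NONE
state = ep_step(state, IN_PICT)   # -> EP_PICT
state = ep_step(state, IN_ZWJ)    # -> EP_PICT_ZWJ
assert gb11_no_break(state, IN_PICT)
```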
```python
        ep_state = _EP_INIT
```
```python
    # Update Regional Indicator flag (GB12/GB13)
    ri_flag = not ri_flag if curr_gcb == GCB_Regional_Indicator else False
```
```diff
- ri_flag = not ri_flag if curr_gcb == GCB_Regional_Indicator else False
+ ri_flag = curr_gcb == GCB_Regional_Indicator and not ri_flag
```
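An exhaustive check (with placeholder GCB constants) confirms that the suggested one-liner is equivalent to the original conditional expression for every input:

```python
# Placeholder values; only equality with GCB_Regional_Indicator matters here.
GCB_Regional_Indicator, GCB_Other = 1, 0

for curr_gcb in (GCB_Regional_Indicator, GCB_Other):
    for ri_flag in (True, False):
        original = not ri_flag if curr_gcb == GCB_Regional_Indicator else False
        suggested = curr_gcb == GCB_Regional_Indicator and not ri_flag
        assert original == suggested, (curr_gcb, ri_flag)
print("equivalent for all inputs")
```

The rewrite reads as "the flag toggles on consecutive Regional Indicators and resets otherwise", which matches GB12/GB13's even/odd RI counting.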
```python
    # Update Indic Conjunct Break FSM (GB9c)
    curr_incb = _get_incb(cp)
    if curr_incb == InCB_Consonant:
```
Again, we can use a mapping `(incb_state, curr_incb) -> incb_state`.
```diff
- class GraphemeBreakTest(unittest.TestCase):
+ class BaseGraphemeBreakTest:
+     iter_graphemes = staticmethod(unicodedata.iter_graphemes)
```
Not needed. It is initialized in subclasses.
A slightly different exercise, backporting the C implementation to 3.14: #148247 |
New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, without relying on unicodedata.grapheme_cluster_break(), extended_pictographic(), and indic_conjunct_break(), which were also added in Python 3.15.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.