Commit 2ebc27d

[GR-68227] Support more named sequences in unicodedata.lookup
PullRequest: graalpython/4323
2 parents 1759c41 + b4e76ef commit 2ebc27d

8 files changed

Lines changed: 124 additions & 40 deletions

File tree

.agents/skills/jira/SKILL.md

Lines changed: 24 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,14 @@ update, reproduce, potentially fix and/or close them. Go on with this workflow
1212
to the end unless you are actually blocked or get to one of the points where
1313
the workflow tells you to wait for confirmation or ask something.
1414

15+
### General Notes
16+
17+
Typical fields you need to know:
18+
* "components" is typically one of "Python", "Mx", "Infra", "Compiler", "Truffle"
19+
* "issuetype" is typically "Task", "Bug (non BugDB)", "Testing", "Build Failure"
20+
* "project" is typically "GR"
21+
* "labels" is typically left empty when creating new issues
22+
1523
### 1. Getting context
1624

1725
To get the issue data, start with `ol-cli`, for example:
@@ -20,13 +28,17 @@ To get the issue data, start with `ol-cli`, for example:
2028

2129
Read the description and follow any links that seem relevant.
2230

31+
Run this in a subagent if possible and let it give you a summary.
32+
2333
### 2. Check if there is work to do
2434

2535
Issues may be stale, already solved, or no longer apply. Search the context and
2636
logs for other potentially relevant keywords, use `ol-cli jira search` to find
2737
out if there are potentially other related issues, query the codebase and git
2838
history and look for reproducers.
2939

40+
Run this in a subagent if possible and let it give you a summary.
41+
3042
### 3. Reproduce the issue
3143

3244
It is PARAMOUNT to reproduce an issue first before changing code. You should
@@ -63,6 +75,9 @@ DO NOT STOP POLLING AND RETRYING UNTIL EITHER YOU REPRODUCE THE ISSUE, MORE
6375
THAN 8 HOURS HAVE ELAPSED WHILE YOU TRIED, OR YOU HAVE USED AT LEAST AROUND 2
6476
MILLION TOKENS (you may estimate from the conversation history) WHILE TRYING!
6577

78+
Make sure to decline the temporary reproducer PR once you are done with it
79+
using `ol-cli bitbucket`.
80+
6681
### 4a. Fixing a reproducible issue.
6782

6883
Once you have a reproducer (even if it may mean running something in a loop for
@@ -87,7 +102,13 @@ by approval of the human user), it needs to be prepared for inclusion.
87102
Transition the Jira issue to be "In Progress" using `ol-cli jira transition`.
88103

89104
Make sure your changes are committed in reviewable, focused, incremental
90-
commits. Create a bitbucket PR
105+
commits.
106+
107+
Run a subagent to REVIEW the code changes. Give it enough context to understand
108+
why specific implementation decisions were made. Consider the subagent's
109+
comments carefully, change the code where the subagent's comments make sense.
110+
111+
Create a bitbucket PR
91112

92113
1. Push your branch.
93114
2. Open a PR using ol-cli bitbucket with a title including the Jira issue ID, like "[GR-XXXXX] Short description of overall fix."
@@ -110,7 +131,8 @@ You can do this in parallel while watching the Bitbucket PR from step 5.
110131
Add a comment using `ol-cli jira comment` to the Jira issue, summarizing your
111132
findings and any work you may have done. Do NOT use Atlassian markup, the
112133
comment must ONLY be PLAIN TEXT. For paragraphs, just use double '\n'. You can
113-
make plaintext lists by making lines begin with '* '.
134+
make plaintext lists by making lines begin with '* '. Do NOT use ADF, use raw
135+
text, regardless of what the tool's help message says.
114136

115137
Also decide yourself or confer with the human about whether this change needs
116138
to be backported, and what the "fix version" assignment for the Jira label

CHANGELOG.md

Lines changed: 2 additions & 3 deletions
@@ -3,12 +3,10 @@
33
This changelog summarizes major changes between GraalVM versions of the Python
44
language runtime. The main focus is on user-observable behavior of the engine.
55

6-
## Version 25.2.0
6+
## Version 25.1.0
77
* Add support for [Truffle source options](https://www.graalvm.org/truffle/javadoc/com/oracle/truffle/api/source/Source.SourceBuilder.html#option(java.lang.String,java.lang.String)):
88
* The `python.Optimize` option can be used to specify the optimization level, like the `-O` (level 1) and `-OO` (level 2) commandline options.
99
* The `python.NewGlobals` option can be used to run a source with a fresh globals dictionary instead of the main module globals, which is useful for embeddings that want isolated top-level execution.
10-
11-
## Version 25.1.0
1210
* Intern string literals in source files
1311
* Allocation reporting via Truffle has been removed. Python object sizes were never reported correctly, so the data was misleading and there was a non-negligible overhead for object allocations even when reporting was inactive.
1412
* Better `readline` support via JLine. Autocompletion and history now works in `pdb`
@@ -18,6 +16,7 @@ language runtime. The main focus is on user-observable behavior of the engine.
1816
* Add Github workflows that run our gates from the same job definitions as our internal CI. This will make it easier for contributors opening PRs on Github to ensure code contributions pass the same tests that we are running internally.
1917
* Added support for specifying generics on foreign classes, and inheriting from such classes. Especially when using Java classes that support generics, this allows expressing the generic types in Python type annotations as well.
2018
* Added a new `java` backend for the `pyexpat` module that uses a Java XML parser instead of the native `expat` library. It can be useful when running without native access or multiple-context scenarios. This backend is the default when embedding and can be switched back to native `expat` by setting `python.PyExpatModuleBackend` option to `native`. Standalone distribution still defaults to native expat backend.
19+
* Add a new context option `python.UnicodeCharacterDatabaseNativeFallback` to control whether the ICU database may fall back to the native unicode character database from CPython for features and characters not supported by ICU. This requires native access to be enabled and is disabled by default for embeddings.
2120

2221
## Version 25.0.1
2322
* Allow users to keep going on unsupported JDK/OS/ARCH combinations at their own risk by opting out of early failure using `-Dtruffle.UseFallbackRuntime=true`, `-Dpolyglot.engine.userResourceCache=/set/to/a/writeable/dir`, `-Dpolyglot.engine.allowUnsupportedPlatform=true`, and `-Dpolyglot.python.UnsupportedPlatformEmulates=[linux|macos|windows]` and `-Dorg.graalvm.python.resources.exclude=native.files`.

graalpython/com.oracle.graal.python.shell/src/com/oracle/graal/python/shell/GraalPythonMain.java

Lines changed: 11 additions & 12 deletions
@@ -816,9 +816,8 @@ protected void launch(Builder contextBuilder) {
816816
contextBuilder.option("python.PosixModuleBackend", "java");
817817
}
818818

819-
if (!hasContextOptionSetViaCommandLine("WarnExperimentalFeatures")) {
820-
contextBuilder.option("python.WarnExperimentalFeatures", "false");
821-
}
819+
setOptionIfNotSetViaCommandLine(contextBuilder, "WarnExperimentalFeatures", "false");
820+
setOptionIfNotSetViaCommandLine(contextBuilder, "UnicodeCharacterDatabaseNativeFallback", "true");
822821

823822
if (multiContext) {
824823
contextBuilder.engine(Engine.newBuilder().allowExperimentalOptions(true).options(enginePolyglotOptions).build());
@@ -1009,19 +1008,13 @@ private void findAndApplyVenvCfg(Builder contextBuilder, String executable) {
10091008
}
10101009
break;
10111010
case "venvlauncher_command":
1012-
if (!hasContextOptionSetViaCommandLine("VenvlauncherCommand")) {
1013-
contextBuilder.option("python.VenvlauncherCommand", parts[1].trim());
1014-
}
1011+
setOptionIfNotSetViaCommandLine(contextBuilder, "VenvlauncherCommand", parts[1].trim());
10151012
break;
10161013
case "base-prefix":
1017-
if (!hasContextOptionSetViaCommandLine("SysBasePrefix")) {
1018-
contextBuilder.option("python.SysBasePrefix", parts[1].trim());
1019-
}
1014+
setOptionIfNotSetViaCommandLine(contextBuilder, "SysBasePrefix", parts[1].trim());
10201015
break;
10211016
case "base-executable":
1022-
if (!hasContextOptionSetViaCommandLine("BaseExecutable")) {
1023-
contextBuilder.option("python.BaseExecutable", parts[1].trim());
1024-
}
1017+
setOptionIfNotSetViaCommandLine(contextBuilder, "BaseExecutable", parts[1].trim());
10251018
break;
10261019
}
10271020
}
@@ -1052,6 +1045,12 @@ private String getContextOptionIfSetViaCommandLine(String key) {
10521045
return null;
10531046
}
10541047

1048+
private void setOptionIfNotSetViaCommandLine(Context.Builder builder, String key, String value) {
1049+
if (!hasContextOptionSetViaCommandLine(key)) {
1050+
builder.option("python." + key, value);
1051+
}
1052+
}
1053+
10551054
private boolean hasContextOptionSetViaCommandLine(String key) {
10561055
if (System.getProperty("polyglot.python." + key) != null) {
10571056
return System.getProperty("polyglot.python." + key) != null;

graalpython/com.oracle.graal.python.test/src/tests/test_unicodedata.py

Lines changed: 12 additions & 2 deletions
@@ -1,4 +1,4 @@
1-
# Copyright (c) 2018, 2025, Oracle and/or its affiliates. All rights reserved.
1+
# Copyright (c) 2018, 2026, Oracle and/or its affiliates. All rights reserved.
22
# DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
33
#
44
# The Universal Permissive License (UPL), Version 1.0
@@ -75,6 +75,16 @@ def test_lookup(self):
7575
with self.assertRaisesRegex(KeyError, "name too long"):
7676
unicodedata.lookup("a" * 257)
7777

78+
def test_lookup_named_sequence(self):
79+
if unicodedata.ucd_3_2_0.bidirectional == unicodedata.bidirectional:
80+
raise unittest.SkipTest("Only supported with CPython's unicodedata.ucd_3_2_0")
81+
82+
unicode_name = "LATIN SMALL LETTER R WITH TILDE"
83+
self.assertEqual(unicodedata.lookup(unicode_name), "\u0072\u0303")
84+
85+
with self.assertRaisesRegex(KeyError, "undefined character name 'LATIN SMALL LETTER R WITH TILDE'"):
86+
unicodedata.ucd_3_2_0.lookup(unicode_name)
87+
7888

7989
def test_east_asian_width(self):
8090
list = [1, 2, 3]
@@ -101,4 +111,4 @@ def test_combining(self):
101111

102112
empty_string = ""
103113
with self.assertRaisesRegex(TypeError, r"combining\(\) argument must be a unicode character, not str"):
104-
unicodedata.combining(empty_string)
114+
unicodedata.combining(empty_string)
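
The behaviour that `test_lookup_named_sequence` checks can be tried directly. This is a minimal sketch, assuming an interpreter whose `unicodedata.lookup` resolves named sequences (CPython does; for GraalPy this commit ties it to the native fallback being enabled). The helper `describe_lookup` is a hypothetical name used only for illustration.

```python
import unicodedata

NAME = "LATIN SMALL LETTER R WITH TILDE"


def describe_lookup(lookup, name):
    """Return the code points a lookup callable yields, or None on KeyError."""
    try:
        return [f"U+{ord(ch):04X}" for ch in lookup(name)]
    except KeyError:
        return None


# A named sequence maps one name to several code points.
current = describe_lookup(unicodedata.lookup, NAME)
# The 3.2.0 database object is looked up the same way for comparison;
# the test above expects it to raise KeyError for this name.
legacy = describe_lookup(unicodedata.ucd_3_2_0.lookup, NAME)
print(current, legacy)
```

Returning `None` instead of letting `KeyError` escape keeps the two results easy to compare side by side.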

graalpython/com.oracle.graal.python/src/com/oracle/graal/python/builtins/modules/UnicodeDataModuleBuiltins.java

Lines changed: 68 additions & 20 deletions
@@ -1,5 +1,5 @@
11
/*
2-
* Copyright (c) 2018, 2025, Oracle and/or its affiliates. All rights reserved.
2+
* Copyright (c) 2018, 2026, Oracle and/or its affiliates. All rights reserved.
33
* DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
44
*
55
* The Universal Permissive License (UPL), Version 1.0
@@ -43,6 +43,8 @@
4343
import static com.oracle.graal.python.nodes.BuiltinNames.J_UNICODEDATA;
4444
import static com.oracle.graal.python.nodes.BuiltinNames.T_UNICODEDATA;
4545
import static com.oracle.graal.python.nodes.BuiltinNames.T___GRAALPYTHON__;
46+
import static com.oracle.graal.python.nodes.SpecialAttributeNames.T___MODULE__;
47+
import static com.oracle.graal.python.nodes.SpecialAttributeNames.T___QUALNAME__;
4648
import static com.oracle.graal.python.runtime.exception.PythonErrorType.KeyError;
4749
import static com.oracle.graal.python.runtime.exception.PythonErrorType.ValueError;
4850
import static com.oracle.graal.python.util.PythonUtils.TS_ENCODING;
@@ -56,21 +58,32 @@
5658
import org.graalvm.shadowed.com.ibm.icu.text.Normalizer2;
5759
import org.graalvm.shadowed.com.ibm.icu.util.VersionInfo;
5860

61+
import com.oracle.graal.python.PythonLanguage;
5962
import com.oracle.graal.python.annotations.ArgumentClinic;
6063
import com.oracle.graal.python.annotations.Builtin;
6164
import com.oracle.graal.python.builtins.CoreFunctions;
6265
import com.oracle.graal.python.builtins.Python3Core;
66+
import com.oracle.graal.python.builtins.PythonBuiltinClassType;
6367
import com.oracle.graal.python.builtins.PythonBuiltins;
6468
import com.oracle.graal.python.builtins.objects.PNone;
6569
import com.oracle.graal.python.builtins.objects.module.PythonModule;
70+
import com.oracle.graal.python.builtins.objects.object.PythonObject;
71+
import com.oracle.graal.python.builtins.objects.type.PythonAbstractClass;
72+
import com.oracle.graal.python.builtins.objects.type.PythonClass;
6673
import com.oracle.graal.python.lib.PyObjectCallMethodObjArgs;
74+
import com.oracle.graal.python.lib.PyObjectGetAttr;
6775
import com.oracle.graal.python.nodes.ErrorMessages;
6876
import com.oracle.graal.python.nodes.PRaiseNode;
77+
import com.oracle.graal.python.nodes.call.CallNode;
6978
import com.oracle.graal.python.nodes.function.PythonBuiltinBaseNode;
7079
import com.oracle.graal.python.nodes.function.builtins.PythonBinaryClinicBuiltinNode;
7180
import com.oracle.graal.python.nodes.function.builtins.PythonUnaryClinicBuiltinNode;
7281
import com.oracle.graal.python.nodes.function.builtins.clinic.ArgumentClinicProvider;
7382
import com.oracle.graal.python.nodes.object.GetOrCreateDictNode;
83+
import com.oracle.graal.python.nodes.object.BuiltinClassProfiles.IsBuiltinObjectProfile;
84+
import com.oracle.graal.python.runtime.PythonOptions;
85+
import com.oracle.graal.python.runtime.exception.PException;
86+
import com.oracle.graal.python.nodes.statement.AbstractImportNode;
7487
import com.oracle.graal.python.runtime.object.PFactory;
7588
import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
7689
import com.oracle.truffle.api.dsl.Bind;
@@ -87,6 +100,11 @@
87100

88101
@CoreFunctions(defineModule = J_UNICODEDATA)
89102
public final class UnicodeDataModuleBuiltins extends PythonBuiltins {
103+
private static final TruffleString T__CPYTHON_UNICODEDATA = toTruffleStringUncached("_cpython_unicodedata");
104+
private static final TruffleString T_LOOKUP = toTruffleStringUncached("lookup");
105+
private static final TruffleString T_UCD_3_2_0 = toTruffleStringUncached("ucd_3_2_0");
106+
private static final TruffleString T_UNIDATA_VERSION = toTruffleStringUncached("unidata_version");
107+
90108
@Override
91109
protected List<? extends NodeFactory<? extends PythonBuiltinBaseNode>> getNodeFactories() {
92110
return UnicodeDataModuleBuiltinsFactory.getFactories();
@@ -120,12 +138,31 @@ private static String getUnicodeNameTB(int cp) {
120138
public void postInitialize(Python3Core core) {
121139
super.postInitialize(core);
122140
PythonModule self = core.lookupBuiltinModule(T_UNICODEDATA);
123-
self.setAttribute(toTruffleStringUncached("unidata_version"), toTruffleStringUncached(getUnicodeVersion()));
124-
PyObjectCallMethodObjArgs.executeUncached(core.lookupBuiltinModule(T___GRAALPYTHON__), toTruffleStringUncached("import_current_as_named_module_with_delegate"),
125-
/* module_name= */ T_UNICODEDATA,
126-
/* delegate_name= */ toTruffleStringUncached("_cpython_unicodedata"),
127-
/* delegate_attributes= */ PFactory.createList(core.getLanguage(), new Object[]{toTruffleStringUncached("ucd_3_2_0")}),
128-
/* owner_globals= */ GetOrCreateDictNode.executeUncached(self));
141+
self.setAttribute(T_UNIDATA_VERSION, toTruffleStringUncached(getUnicodeVersion()));
142+
if (core.getLanguage().getEngineOption(PythonOptions.UnicodeCharacterDatabaseNativeFallback)) {
143+
PyObjectCallMethodObjArgs.executeUncached(core.lookupBuiltinModule(T___GRAALPYTHON__), toTruffleStringUncached("import_current_as_named_module_with_delegate"),
144+
/* module_name= */ T_UNICODEDATA,
145+
/* delegate_name= */ T__CPYTHON_UNICODEDATA,
146+
/* delegate_attributes= */ PFactory.createList(core.getLanguage(), new Object[]{T_UCD_3_2_0}),
147+
/* owner_globals= */ GetOrCreateDictNode.executeUncached(self));
148+
} else {
149+
self.setAttribute(T_UCD_3_2_0, createUCDCompatibilityObject(core, self));
150+
}
151+
}
152+
153+
private PythonObject createUCDCompatibilityObject(Python3Core core, PythonModule self) {
154+
TruffleString t_ucd = toTruffleStringUncached("UCD");
155+
PythonClass clazz = PFactory.createPythonClassAndFixupSlots(null, core.getLanguage(), t_ucd, PythonBuiltinClassType.PythonObject,
156+
new PythonAbstractClass[]{core.lookupType(PythonBuiltinClassType.PythonObject)});
157+
for (String s : new String[]{"normalize", "is_normalized", "lookup", "name", "bidirectional", "category", "combining", "east_asian_width", "decomposition", "digit", "decimal"}) {
158+
TruffleString ts = toTruffleStringUncached(s);
159+
clazz.setAttribute(ts, PFactory.createStaticmethodFromCallableObj(core.getLanguage(), self.getAttribute(ts)));
160+
}
161+
clazz.setAttribute(T___MODULE__, T_UNICODEDATA);
162+
clazz.setAttribute(T___QUALNAME__, t_ucd);
163+
PythonObject obj = PFactory.createPythonObject(clazz, clazz.getInstanceShape());
164+
obj.setAttribute(T_UNIDATA_VERSION, toTruffleStringUncached("3.2.0"));
165+
return obj;
129166
}
130167

131168
static final int NORMALIZER_FORM_COUNT = 4;
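
What `createUCDCompatibilityObject` builds can be pictured in plain Python. This is only a sketch under stated assumptions: it mirrors the list of delegated function names and the `unidata_version` value from the hunk above, but `make_ucd_compat` and `DELEGATED` are hypothetical names, and the real object is constructed through `PFactory` in Java rather than through `type()`.

```python
import unicodedata

# Function names copied from the loop in createUCDCompatibilityObject.
DELEGATED = [
    "normalize", "is_normalized", "lookup", "name", "bidirectional",
    "category", "combining", "east_asian_width", "decomposition",
    "digit", "decimal",
]


def make_ucd_compat(module):
    """Build a UCD-like object whose methods are the module's own functions."""
    # staticmethod() keeps each function unbound, so obj.category is the
    # very same callable as module.category.
    attrs = {fn: staticmethod(getattr(module, fn)) for fn in DELEGATED}
    attrs["__module__"] = "unicodedata"
    attrs["__qualname__"] = "UCD"
    cls = type("UCD", (object,), attrs)
    obj = cls()
    # Only the reported version differs; the data is the current database.
    obj.unidata_version = "3.2.0"
    return obj


compat = make_ucd_compat(unicodedata)
print(compat.unidata_version, compat.category("a"))
```

Because the methods are the module functions themselves, `compat.bidirectional == unicodedata.bidirectional` holds for such an object, which is the condition the new test uses to skip when no real 3.2.0 database is available.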
@@ -214,34 +251,33 @@ abstract static class LookupNode extends PythonUnaryClinicBuiltinNode {
214251
@Specialization
215252
@TruffleBoundary
216253
static Object lookup(TruffleString name,
254+
@Bind PythonLanguage lang,
217255
@Bind Node inliningTarget) {
218256
String nameString = ToJavaStringNode.getUncached().execute(name);
219257
if (nameString.length() > NAME_MAX_LENGTH) {
220258
throw PRaiseNode.raiseStatic(inliningTarget, KeyError, ErrorMessages.NAME_TOO_LONG);
221259
}
222260

223-
// TODO: support Unicode character named sequences (GR-68227)
224-
// see test/test_ucn.py.UnicodeFunctionsTest.test_named_sequences_full
225261
String character = getCharacterByUnicodeName(nameString);
226262
if (character == null) {
227263
character = getCharacterByUnicodeNameAlias(nameString);
228264
}
229-
if (character == null) {
230-
throw PRaiseNode.raiseStatic(inliningTarget, KeyError, ErrorMessages.UNDEFINED_CHARACTER_NAME, name);
265+
if (character != null) {
266+
return FromJavaStringNode.getUncached().execute(character, TS_ENCODING);
231267
}
232268

233-
return FromJavaStringNode.getUncached().execute(character, TS_ENCODING);
269+
Object namedSequence = lookupNamedSequenceFromFallback(lang, name);
270+
if (namedSequence != null) {
271+
return namedSequence;
272+
}
273+
throw PRaiseNode.raiseStatic(inliningTarget, KeyError, ErrorMessages.UNDEFINED_CHARACTER_NAME, name);
234274
}
235275

236276
@Override
237277
protected ArgumentClinicProvider getArgumentClinic() {
238278
return UnicodeDataModuleBuiltinsClinicProviders.LookupNodeClinicProviderGen.INSTANCE;
239279
}
240280

241-
/**
242-
* Finds a Unicode code point by its Unicode name and returns it as a single character
243-
* String. Returns null if name is not found.
244-
*/
245281
@TruffleBoundary
246282
private static String getCharacterByUnicodeName(String unicodeName) {
247283
int codepoint = UCharacter.getCharFromName(unicodeName);
@@ -253,10 +289,6 @@ private static String getCharacterByUnicodeName(String unicodeName) {
253289
return UCharacter.toString(codepoint);
254290
}
255291

256-
/**
257-
* Finds a Unicode code point by its Unicode name alias and returns it as a single character
258-
* String. Returns null if name alias is not found.
259-
*/
260292
@TruffleBoundary
261293
private static String getCharacterByUnicodeNameAlias(String unicodeName) {
262294
int codepoint = UCharacter.getCharFromNameAlias(unicodeName);
@@ -267,6 +299,22 @@ private static String getCharacterByUnicodeNameAlias(String unicodeName) {
267299

268300
return UCharacter.toString(codepoint);
269301
}
302+
303+
@TruffleBoundary
304+
private static Object lookupNamedSequenceFromFallback(PythonLanguage lang, TruffleString name) {
305+
if (lang.getEngineOption(PythonOptions.UnicodeCharacterDatabaseNativeFallback)) {
306+
try {
307+
PythonModule cpythonUnicodeData = AbstractImportNode.importModule(T__CPYTHON_UNICODEDATA);
308+
Object lookup = PyObjectGetAttr.executeUncached(cpythonUnicodeData, T_LOOKUP);
309+
return CallNode.executeUncached(lookup, name);
310+
} catch (PException e) {
311+
if (!IsBuiltinObjectProfile.profileObjectUncached(e.getUnreifiedException(), PythonBuiltinClassType.ImportError)) {
312+
throw e;
313+
}
314+
}
315+
}
316+
return null;
317+
}
270318
}
271319

272320
// unicodedata.name(chr, default)
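
The control flow of the reworked `LookupNode` (name, then alias, then optional fallback, then `KeyError`) can be sketched as follows. Everything here is an illustration under assumptions: the two dictionaries are hypothetical stand-ins for the ICU-backed tables, the stdlib `unicodedata` module stands in for `_cpython_unicodedata`, and the limit of 256 is inferred from the existing test that rejects a 257-character name.

```python
import importlib

# Hypothetical stand-ins for the primary (ICU) tables: single code points only.
PRIMARY_NAMES = {"LATIN SMALL LETTER A": "a"}
PRIMARY_ALIASES = {"LINE FEED": "\n"}
NAME_MAX_LENGTH = 256  # assumption, inferred from the "a" * 257 test case


def lookup(name, fallback_enabled=True, fallback_module="unicodedata"):
    """Resolve a character name, consulting a fallback module last."""
    if len(name) > NAME_MAX_LENGTH:
        raise KeyError("name too long")

    character = PRIMARY_NAMES.get(name)
    if character is None:
        character = PRIMARY_ALIASES.get(name)
    if character is not None:
        return character

    if fallback_enabled:
        try:
            module = importlib.import_module(fallback_module)
            # A KeyError raised by the fallback propagates unchanged.
            return module.lookup(name)
        except ImportError:
            # Only a missing fallback module is swallowed.
            pass
    raise KeyError(f"undefined character name '{name}'")


print(repr(lookup("LATIN SMALL LETTER R WITH TILDE")))
```

As in the Java change, only the import failure is treated as "no fallback available"; any other error from the fallback is left visible to the caller.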
