@@ -1609,14 +1609,14 @@ def convert_i32_to_char(cx, i):
 ```
 
 Strings are loaded from two `i32` values: a pointer (offset in linear memory)
-and a number of bytes. There are three supported string encodings in [`canonopt`]:
-[UTF-8], [UTF-16] and `latin1+utf16`. This last options allows a *dynamic*
-choice between [Latin-1] and UTF-16, indicated by the high bit of the second
-`i32`. String values include their original encoding and byte length as a
-"hint" that enables `store_string` (defined below) to make better up-front
-allocation size choices in many cases. Thus, the value produced by
-`load_string` isn't simply a Python `str`, but a *tuple* containing a `str`,
-the original encoding and the original byte length.
+and a number of [code units]. There are three supported string encodings in
+[`canonopt`]: [UTF-8], [UTF-16] and `latin1+utf16`. This last option allows a
+*dynamic* choice between [Latin-1] and UTF-16, indicated by the high bit of the
+second `i32`. String values include their original encoding and length in
+tagged code units as a "hint" that enables `store_string` (defined below) to
+make better up-front allocation size choices in many cases. Thus, the value
+produced by `load_string` isn't simply a Python `str`, but a *tuple* containing
+a `str`, the original encoding and the number of source code units.
 ```python
 String = tuple[str, str, int]
 
@@ -1910,8 +1910,8 @@ approach to update the allocation size during the single copy. A blind
 `realloc` approach would normally suffer from multiple reallocations per string
 (e.g., using the standard doubling-growth strategy). However, as already shown
 in `load_string` above, string values come with two useful hints: their
-original encoding and byte length. From this hint data, `store_string` can do a
-much better job minimizing the number of reallocations.
+original encoding and number of source [code units]. From this hint data,
+`store_string` can do a much better job minimizing the number of reallocations.
 
 We start with a case analysis to enumerate all the meaningful encoding
 combinations, subdividing the `latin1+utf16` encoding into either `latin1` or
@@ -1975,7 +1975,7 @@ def store_string_copy(cx, src, src_code_units, dst_code_unit_size, dst_alignment
   return (ptr, src_code_units)
 ```
 The choice of `MAX_STRING_BYTE_LENGTH` constant ensures that the high bit of a
-string's byte length is never set, keeping it clear for `UTF16_BIT`.
+string's number of code units is never set, keeping it clear for `UTF16_BIT`.
 
 The 2 cases of transcoding into UTF-8 share an algorithm that starts by
 optimistically assuming that each code unit of the source string fits in a
@@ -3903,6 +3903,7 @@ def canon_thread_available_parallelism():
 [Latin-1]: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
 [Unicode Scalar Value]: https://unicode.org/glossary/#unicode_scalar_value
 [Unicode Code Point]: https://unicode.org/glossary/#code_point
+[Code Units]: https://www.unicode.org/glossary/#code_unit
 [Surrogate]: https://unicode.org/faq/utf_bom.html#utf16-2
 [Name Mangling]: https://en.wikipedia.org/wiki/Name_mangling
 [Fibers]: https://en.wikipedia.org/wiki/Fiber_(computer_science)
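
To illustrate why the wording moves from "byte length" to "code units", here is a minimal sketch of how a tagged code-unit count might be interpreted. It is not the spec's `load_string`. The function name `load_string_sketch` is hypothetical, and the values given to `UTF16_BIT` and `MAX_STRING_BYTE_LENGTH` are assumptions inferred from the "high bit of the second `i32`" wording rather than quoted from the file.

```python
# Assumed constants: the diff only says the tag occupies the high bit of the
# second i32 and that MAX_STRING_BYTE_LENGTH keeps that bit clear.
UTF16_BIT = 1 << 31
MAX_STRING_BYTE_LENGTH = (1 << 31) - 1

def load_string_sketch(memory, ptr, encoding, tagged_code_units):
    """Hypothetical helper: decode a string given a pointer and a
    (possibly tagged) code-unit count, returning (str, encoding, hint)."""
    if encoding == 'utf8':
        codec, unit_size, code_units = 'utf-8', 1, tagged_code_units
    elif encoding == 'utf16':
        codec, unit_size, code_units = 'utf-16-le', 2, tagged_code_units
    elif encoding == 'latin1+utf16':
        if tagged_code_units & UTF16_BIT:
            # High bit set: the payload is UTF-16; clear the tag to get the count.
            codec, unit_size = 'utf-16-le', 2
            code_units = tagged_code_units ^ UTF16_BIT
        else:
            # High bit clear: the payload is Latin-1, one byte per code unit.
            codec, unit_size, code_units = 'latin-1', 1, tagged_code_units
    else:
        raise ValueError(f'unknown encoding: {encoding}')

    # The byte length is derived from the code-unit count, not stored directly.
    byte_length = code_units * unit_size
    if byte_length > MAX_STRING_BYTE_LENGTH:
        raise ValueError('string too long')
    if ptr + byte_length > len(memory):
        raise ValueError('string out of bounds')

    s = memory[ptr : ptr + byte_length].decode(codec)
    # Keep the original encoding and the tagged count as the "hint".
    return (s, encoding, tagged_code_units)

# Usage: the same 5-character string is 5 code units in either dynamic
# encoding, but 10 bytes as UTF-16 and 5 bytes as Latin-1.
mem = bytearray(b'\x00\x00' + 'héllo'.encode('utf-16-le') + 'héllo'.encode('latin-1'))
as_utf16 = load_string_sketch(mem, 2, 'latin1+utf16', UTF16_BIT | 5)
as_latin1 = load_string_sketch(mem, 12, 'latin1+utf16', 5)
```

In this sketch the count passed in is always a number of code units, so the UTF-16 case multiplies by 2 to find the byte range. That conversion is what the old "byte length" wording left ambiguous.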