Commit 1e457a1

Update CanonicalABI.md prose to match Python regarding code units
Closes #509
1 parent 31256b8 commit 1e457a1

1 file changed

Lines changed: 12 additions & 11 deletions

design/mvp/CanonicalABI.md

@@ -1609,14 +1609,14 @@ def convert_i32_to_char(cx, i):
 ```

 Strings are loaded from two `i32` values: a pointer (offset in linear memory)
-and a number of bytes. There are three supported string encodings in [`canonopt`]:
-[UTF-8], [UTF-16] and `latin1+utf16`. This last options allows a *dynamic*
-choice between [Latin-1] and UTF-16, indicated by the high bit of the second
-`i32`. String values include their original encoding and byte length as a
-"hint" that enables `store_string` (defined below) to make better up-front
-allocation size choices in many cases. Thus, the value produced by
-`load_string` isn't simply a Python `str`, but a *tuple* containing a `str`,
-the original encoding and the original byte length.
+and a number of [code units]. There are three supported string encodings in
+[`canonopt`]: [UTF-8], [UTF-16] and `latin1+utf16`. This last options allows a
+*dynamic* choice between [Latin-1] and UTF-16, indicated by the high bit of the
+second `i32`. String values include their original encoding and length in
+tagged code units as a "hint" that enables `store_string` (defined below) to
+make better up-front allocation size choices in many cases. Thus, the value
+produced by `load_string` isn't simply a Python `str`, but a *tuple* containing
+a `str`, the original encoding and the number of source code units.
 ```python
 String = tuple[str, str, int]

@@ -1910,8 +1910,8 @@ approach to update the allocation size during the single copy. A blind
 `realloc` approach would normally suffer from multiple reallocations per string
 (e.g., using the standard doubling-growth strategy). However, as already shown
 in `load_string` above, string values come with two useful hints: their
-original encoding and byte length. From this hint data, `store_string` can do a
-much better job minimizing the number of reallocations.
+original encoding and number of source [code units]. From this hint data,
+`store_string` can do a much better job minimizing the number of reallocations.

 We start with a case analysis to enumerate all the meaningful encoding
 combinations, subdividing the `latin1+utf16` encoding into either `latin1` or
@@ -1975,7 +1975,7 @@ def store_string_copy(cx, src, src_code_units, dst_code_unit_size, dst_alignment
   return (ptr, src_code_units)
 ```
 The choice of `MAX_STRING_BYTE_LENGTH` constant ensures that the high bit of a
-string's byte length is never set, keeping it clear for `UTF16_BIT`.
+string's number of code units is never set, keeping it clear for `UTF16_BIT`.

 The 2 cases of transcoding into UTF-8 share an algorithm that starts by
 optimistically assuming that each code unit of the source string fits in a
@@ -3903,6 +3903,7 @@ def canon_thread_available_parallelism():
 [Latin-1]: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
 [Unicode Scalar Value]: https://unicode.org/glossary/#unicode_scalar_value
 [Unicode Code Point]: https://unicode.org/glossary/#code_point
+[Code Units]: https://www.unicode.org/glossary/#code_unit
 [Surrogate]: https://unicode.org/faq/utf_bom.html#utf16-2
 [Name Mangling]: https://en.wikipedia.org/wiki/Name_mangling
 [Fibers]: https://en.wikipedia.org/wiki/Fiber_(computer_science)
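
The prose change above replaces "byte length" with "code units" because the second `i32` is not a byte count in every encoding. The sketch below illustrates that distinction and is not the code from `CanonicalABI.md`. The helper name `decode_tagged_code_units` is hypothetical, and the value `UTF16_BIT = 1 << 31` is an assumption based on the diff's statement that the tag lives in the high bit of the second `i32`.

```python
# Assumed value: the diff only says the tag is the high bit of the second i32.
UTF16_BIT = 1 << 31

def decode_tagged_code_units(string_encoding: str, tagged_code_units: int):
    """Hypothetical helper: split the second i32 into
    (python_codec, code_units, byte_length)."""
    if string_encoding == 'utf8':
        # UTF-8 code units are single bytes, so both counts agree.
        return ('utf-8', tagged_code_units, tagged_code_units)
    if string_encoding == 'utf16':
        # UTF-16 code units are 2 bytes each.
        return ('utf-16-le', tagged_code_units, 2 * tagged_code_units)
    if string_encoding == 'latin1+utf16':
        if tagged_code_units & UTF16_BIT:
            # High bit set: the string is UTF-16; clear the tag to get the count.
            code_units = tagged_code_units ^ UTF16_BIT
            return ('utf-16-le', code_units, 2 * code_units)
        # High bit clear: the string is Latin-1, one byte per code unit.
        return ('latin-1', tagged_code_units, tagged_code_units)
    raise ValueError(f'unknown string encoding: {string_encoding}')
```

Under these assumptions, a count of 5 means 5 bytes for `utf8` but 10 bytes for `utf16`, which is why describing the value as a byte length was inaccurate.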
