Commit 2fa33e1
* Unigram lattice walks Unicode scalars, not grapheme clusters (#352, Bug 3)
`UnigramTokenizer.tokenize(text:)` and `TokenLattice` indexed the input by Swift
`Character` (extended grapheme clusters). SentencePiece Unigram vocabularies
are scalar-indexed, so an input grapheme that spans multiple scalars never gets
exposed as its constituent scalars to the trie walk and the vocab lookup. For
example `"1️⃣"` is one grapheme but three scalars (digit `1`, VS-16 U+FE0F, combining
enclosing keycap U+20E3). HF Python emits `▁1 <unk> </s>` for it, while Swift previously
returned `▁ <unk> </s>`: the digit was silently dropped because the entire
keycap grapheme occupied a single lattice slot that didn't match any vocab key.
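
To make the grapheme/scalar mismatch concrete, here is a minimal standalone sketch (not taken from the diff) that inspects the keycap sequence through both views. The variable names are illustrative only.

```swift
// The keycap emoji written with explicit escapes so the scalar
// sequence is unambiguous: digit 1, VS-16, combining enclosing keycap.
let keycap = "1\u{FE0F}\u{20E3}"

// `count` on String counts Characters (extended grapheme clusters).
let graphemeCount = keycap.count

// The unicodeScalars view exposes each scalar individually.
let scalarCount = keycap.unicodeScalars.count
let scalarValues = keycap.unicodeScalars.map {
    String($0.value, radix: 16, uppercase: true)
}

print(graphemeCount, scalarCount, scalarValues)
```

Indexing by `Character` therefore gives the lattice a single slot for the whole sequence, while the scalar view gives it three.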
Switch the iteration unit to `Unicode.Scalar`:
- `UnigramTokenizer.trie` is now `Trie<Unicode.Scalar>` and is fed each vocab
entry's `unicodeScalars` view.
- `tokenize(text:)` walks `Array(text.unicodeScalars)` and reconstructs token
strings from `String.UnicodeScalarView` slices.
- `TokenLattice.chars` is `[Unicode.Scalar]`. The convenience
`init(sentence:)` and the `piece(_:)` reconstruction follow the same
convention. Lattice offsets and lengths are now in scalar units; the
Viterbi algorithm is unchanged (it only sees integer positions).
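
The effect of changing the iteration unit can be sketched without the real trie or lattice. The example below is an assumption-laden toy: `pieces(startingAt:in:vocab:render:)` and the three-entry vocabulary are hypothetical stand-ins, and a linear prefix scan replaces the actual `Trie` lookup. It only illustrates which candidate pieces become visible at position 0 under each unit.

```swift
/// Hypothetical helper: returns every vocab piece that equals a prefix of
/// `units[start...]`, where `units` is the input split into some unit type.
func pieces<Unit>(
    startingAt start: Int,
    in units: [Unit],
    vocab: Set<String>,
    render: (ArraySlice<Unit>) -> String
) -> [String] {
    var found: [String] = []
    guard start < units.count else { return found }
    for end in (start + 1)...units.count {
        let candidate = render(units[start..<end])
        if vocab.contains(candidate) { found.append(candidate) }
    }
    return found
}

// Toy vocabulary (an assumption); real Unigram vocabularies are far larger.
let vocab: Set<String> = ["▁", "▁1", "1"]
let text = "▁1\u{FE0F}\u{20E3}"

// Grapheme-indexed: the keycap sequence is one indivisible unit.
let byCharacter = pieces(startingAt: 0, in: Array(text), vocab: vocab) {
    String($0)
}

// Scalar-indexed: the digit is exposed on its own, and pieces are rebuilt
// from slices of scalars via String.UnicodeScalarView.
let byScalar = pieces(startingAt: 0, in: Array(text.unicodeScalars), vocab: vocab) {
    String(String.UnicodeScalarView($0))
}

print(byCharacter, byScalar)
```

With `Character` units only `▁` is found, which matches the old `▁ <unk>` output; with scalar units `▁1` is also a candidate, which matches the HF Python result described above.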
Adds a regression test using google-t5/t5-small over the keycap emoji.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Address review: tone down comments per @pcuenca
Per @pcuenca's review on #356:
- Replace the multi-line TokenLattice.chars docstring with the shorter
three-line version @pcuenca suggested.
- Remove the in-body comment blocks at the trie-build site in
`UnigramTokenizer.init` and at the top of `UnigramTokenizer.tokenize(text:)`.
The remaining TokenLattice docstring plus the regression test in
TokenizerTests carry the explanation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* I hate linters
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
1 parent eb08337 commit 2fa33e1
3 files changed
Lines changed: 30 additions & 13 deletions
File tree
- Sources/Tokenizers
- Tests/TokenizersTests