Commit 42532af

Kabir08Kabiraldehir
authored
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110)
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.

This mirrors the Qwen2 fix (commit 0d049d6), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows.

Closes #21919.

* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks

* cont : remove trailing whitespace

---------

Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
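
To illustrate why the letter-run rule changes from \p{L}+ to [\p{L}\p{M}]+, here is a minimal sketch of a single forward scan over code points. It is not the code from this commit: `is_letter_sketch`, `is_mark_sketch`, and `letter_run_length` are hypothetical names, and the classifiers cover only ASCII, part of the Latin range, and the Combining Diacritical Marks block, whereas a real handler would consult full Unicode category tables.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, deliberately tiny stand-in for \p{L}: ASCII letters plus
// U+00C0..U+024F, excluding the multiplication and division signs.
static bool is_letter_sketch(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= 0x00C0 && c <= 0x024F && c != 0x00D7 && c != 0x00F7);
}

// Hypothetical stand-in for \p{M}: only the Combining Diacritical Marks block.
static bool is_mark_sketch(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Length of the run starting at pos that matches \p{L}+ (with_marks == false)
// or [\p{L}\p{M}]+ (with_marks == true). The scan only moves forward, so it
// needs no recursion and no backtracking regardless of input length.
static size_t letter_run_length(const std::u32string & cpts, size_t pos, bool with_marks) {
    assert(pos <= cpts.size());
    size_t end = pos;
    while (end < cpts.size() &&
           (is_letter_sketch(cpts[end]) || (with_marks && is_mark_sketch(cpts[end])))) {
        end++;
    }
    return end - pos;
}
```

With the decomposed spelling `U"re\u0301sume\u0301"` (each é written as e followed by U+0301), the letters-only scan stops at the first combining mark, while the scan that also accepts marks consumes the whole word as one run.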
1 parent dbe7901 commit 42532af

5 files changed

Lines changed: 304 additions & 0 deletions


models/ggml-vocab-qwen35.gguf

5.65 MB
Binary file not shown.

models/ggml-vocab-qwen35.gguf.inp

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
1+
ied 4 ½ months
2+
__ggml_vocab_test__
3+
Äpfel
4+
__ggml_vocab_test__
5+
6+
__ggml_vocab_test__
7+
8+
__ggml_vocab_test__
9+
10+
__ggml_vocab_test__
11+
12+
__ggml_vocab_test__
13+
14+
__ggml_vocab_test__
15+
16+
17+
__ggml_vocab_test__
18+
19+
20+
21+
__ggml_vocab_test__
22+
23+
24+
25+
26+
__ggml_vocab_test__
27+
28+
29+
__ggml_vocab_test__
30+
Hello world
31+
__ggml_vocab_test__
32+
Hello world
33+
__ggml_vocab_test__
34+
Hello World
35+
__ggml_vocab_test__
36+
Hello World
37+
__ggml_vocab_test__
38+
Hello World!
39+
__ggml_vocab_test__
40+
Hello, world!
41+
__ggml_vocab_test__
42+
Hello, world!
43+
__ggml_vocab_test__
44+
this is 🦙.cpp
45+
__ggml_vocab_test__
46+
w048 7tuijk dsdfhu
47+
__ggml_vocab_test__
48+
нещо на Български
49+
__ggml_vocab_test__
50+
កាន់តែពិសេសអាចខលចេញ
51+
__ggml_vocab_test__
52+
🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)
53+
__ggml_vocab_test__
54+
Hello
55+
__ggml_vocab_test__
56+
Hello
57+
__ggml_vocab_test__
58+
Hello
59+
__ggml_vocab_test__
60+
Hello
61+
__ggml_vocab_test__
62+
Hello
63+
__ggml_vocab_test__
64+
Hello
65+
Hello
66+
__ggml_vocab_test__
67+
(
68+
__ggml_vocab_test__
69+
70+
=
71+
__ggml_vocab_test__
72+
' era
73+
__ggml_vocab_test__
74+
Hello, y'all! How are you 😁 ?我想在apple工作1314151天~
75+
__ggml_vocab_test__
76+
!!!!!!
77+
__ggml_vocab_test__
78+
3
79+
__ggml_vocab_test__
80+
33
81+
__ggml_vocab_test__
82+
333
83+
__ggml_vocab_test__
84+
3333
85+
__ggml_vocab_test__
86+
33333
87+
__ggml_vocab_test__
88+
333333
89+
__ggml_vocab_test__
90+
3333333
91+
__ggml_vocab_test__
92+
33333333
93+
__ggml_vocab_test__
94+
333333333
95+
__ggml_vocab_test__
96+
Cửa Việt
97+
__ggml_vocab_test__
98+
discards
99+
__ggml_vocab_test__
100+
101+
102+
103+
104+
105+
106+
107+
108+
109+
110+
111+
🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL
112+
__ggml_vocab_test__
113+
114+
__ggml_vocab_test__
115+
résumé
116+
__ggml_vocab_test__
117+
àààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààà
118+
__ggml_vocab_test__
119+
Vieết Nam
120+
__ggml_vocab_test__
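
The input file above appears to hold one test case per block, each terminated by a line containing only `__ggml_vocab_test__`. A minimal sketch of how such a file could be split into cases follows. The function name `split_vocab_test_cases` is hypothetical, and the exact separator string (marker surrounded by newlines) is an assumption inferred from the file layout, not taken from the test harness.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits the contents of a *.inp file into individual test inputs.
// Assumption: every case is followed by "\n__ggml_vocab_test__\n";
// anything after the last marker is ignored.
static std::vector<std::string> split_vocab_test_cases(const std::string & contents) {
    const std::string sep = "\n__ggml_vocab_test__\n";
    std::vector<std::string> cases;
    size_t pos = 0;
    while (true) {
        const size_t next = contents.find(sep, pos);
        if (next == std::string::npos) {
            break;
        }
        // Whitespace inside a case is significant, so nothing is trimmed.
        cases.push_back(contents.substr(pos, next - pos));
        pos = next + sep.size();
    }
    return cases;
}
```

Under this reading, an empty block between two markers is a valid case (the empty string), which would be consistent with the 120 input lines mapping to the 50 lines of expected token IDs.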

models/ggml-vocab-qwen35.gguf.out

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
1122 220 19 220 26062 3951
2+
86975 15897 301
3+
4+
220
5+
256
6+
262
7+
197
8+
198
9+
271
10+
1406
11+
1572
12+
9707 1879
13+
21927 1879
14+
9707 4337
15+
21927 4337
16+
21927 4337 0
17+
9707 11 1879 0
18+
21927 11 1879 0
19+
419 374 11162 99 247 13 10821
20+
86 15 19 23 220 22 83 1963 41808 11472 2940 16739
21+
78762 14144 1456 13073 63471 33594 3038 133178 79012
22+
146394 97529 241 44258 233 146568 44258 224 147603 20879 115 146280 44258 223 146280 147272 97529 227 147805 148301 147270 44258 223 146848
23+
145836 320 8252 8 26525 114 378 235 149921 30543 320 35673 99066 97534 8 25521 227 320 3243 42365 429 702 1181 1828 3950 8
24+
9707
25+
21927
26+
220 21927
27+
256 21927
28+
262 21927
29+
262 21927 198 262 21927
30+
320
31+
198 284
32+
6 11385
33+
9707 11 379 64848 0 2585 525 498 26525 223 937 104100 18493 22377 99257 16 18 16 19 16 20 16 35727 21216
34+
17085 2928
35+
18
36+
18 18
37+
18 18 18
38+
18 18 18 18
39+
18 18 18 18 18
40+
18 18 18 18 18 18
41+
18 18 18 18 18 18 18
42+
18 18 18 18 18 18 18 18
43+
18 18 18 18 18 18 18 18 18
44+
34 90063 128324
45+
2560 2347
46+
198 4710 14731 65497 7847 1572 2303 78672 10947 145836 320 8252 8 26525 114 378 235 149921 30543 320 35673 99066 97534 8 25521 227 11162 99 247 149955 220 18 220 18 18 220 18 18 18 220 18 18 18 18 220 18 18 18 18 18 220 18 18 18 18 18 18 220 18 18 18 18 18 18 18 220 18 18 18 18 18 18 18 18 220 18 13 18 220 18 496 18 220 18 1112 18 220 146394 97529 241 44258 233 146568 44258 224 147603 20879 115 146280 44258 223 146280 147272 97529 227 144534 937 104100 18493 22377 99257 16 18 16 19 16 20 16 35727 21216 55460 53237 18658 14144 1456 13073 63471 33594 3038 133178 79012 3355 4605 4605 13874 13874 73594 3014 3014 28149 17085 2928 26610 7646 358 3003 1012 364 83 813 566 594 1052 11 364 787 498 2704 30 364 44 537 2704 358 3278 1281 432 11 364 35 498 1075 1045 15243 30 1205 6 42612 264 63866 43
47+
68 53839
48+
265 53839 31323 53839
49+
64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549
50+
53 645 51580 29974

src/unicode.cpp

Lines changed: 133 additions & 0 deletions
@@ -605,6 +605,136 @@ static std::vector<size_t> unicode_regex_split_custom_qwen2(const std::string &
     return bpe_offsets;
 }
 
+// Qwen3.5 system regex: "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
+// Compared to Qwen2, letter-runs also consume Unicode combining marks (\p{M}): [\p{L}\p{M}]+ instead of \p{L}+
+static std::vector<size_t> unicode_regex_split_custom_qwen35(const std::string & text, const std::vector<size_t> & offsets) {
+    std::vector<size_t> bpe_offsets; // store the offset of each word
+    bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
+
+    const auto cpts = unicode_cpts_from_utf8(text);
+
+    size_t start = 0;
+    for (auto offset : offsets) {
+        const size_t offset_ini = start;
+        const size_t offset_end = start + offset;
+        assert(offset_end <= cpts.size());
+        start = offset_end;
+
+        static const uint32_t OUT_OF_RANGE = 0xFFFFFFFF;
+        auto _get_cpt = [&] (const size_t pos) -> uint32_t {
+            return (offset_ini <= pos && pos < offset_end) ? cpts[pos] : OUT_OF_RANGE;
+        };
+
+        auto _get_flags = [&] (const size_t pos) -> unicode_cpt_flags {
+            return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags_from_cpt(cpts[pos]) : unicode_cpt_flags{};
+        };
+
+        size_t _prev_end = offset_ini;
+        auto _add_token = [&] (const size_t end) -> size_t {
+            assert(_prev_end <= end && end <= offset_end);
+            size_t len = end - _prev_end;
+            if (len > 0) {
+                bpe_offsets.push_back(len);
+            }
+            _prev_end = end;
+            return len;
+        };
+
+        for (size_t pos = offset_ini; pos < offset_end; /*pos++*/ ) {
+            const uint32_t cpt = _get_cpt(pos);
+            const auto flags = _get_flags(pos);
+
+            // regex: (?i:'s|'t|'re|'ve|'m|'ll|'d) // case insensitive
+            if (cpt == '\'' && pos+1 < offset_end) {
+                uint32_t cpt_next = unicode_tolower(_get_cpt(pos+1));
+                if (cpt_next == 's' || cpt_next == 't' || cpt_next == 'm' || cpt_next == 'd') {
+                    pos += _add_token(pos+2);
+                    continue;
+                }
+                if (pos+2 < offset_end) {
+                    uint32_t cpt_next_next = unicode_tolower(_get_cpt(pos+2));
+                    if ((cpt_next == 'r' && cpt_next_next == 'e') ||
+                        (cpt_next == 'v' && cpt_next_next == 'e') ||
+                        (cpt_next == 'l' && cpt_next_next == 'l')) {
+                        pos += _add_token(pos+3);
+                        continue;
+                    }
+                }
+            }
+
+            // regex: [^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+
+            if (!(cpt == '\r' || cpt == '\n' || flags.is_number)) {
+                if (flags.is_letter || flags.is_accent_mark || _get_flags(pos + 1).is_accent_mark || _get_flags(pos+1).is_letter) {
+                    pos++;
+                    while (_get_flags(pos).is_letter || _get_flags(pos).is_accent_mark) {
+                        pos++;
+                    }
+                    _add_token(pos);
+                    continue;
+                }
+            }
+
+            // regex: \p{N}
+            if (flags.is_number) {
+                pos++;
+                _add_token(pos);
+                continue;
+            }
+
+            // regex: <space>?[^\s\p{L}\p{M}\p{N}]+[\r\n]*
+            auto flags2 = (cpt == ' ' ? _get_flags(pos+1) : flags);
+            if (!(flags2.is_whitespace | flags2.is_letter | flags2.is_accent_mark | flags2.is_number) && flags.as_uint()) {
+                pos += (cpt == ' ');
+                while (!(flags2.is_whitespace | flags2.is_letter | flags2.is_accent_mark | flags2.is_number) && flags2.as_uint()) {
+                    flags2 = _get_flags(++pos);
+                }
+                uint32_t cpt2 = _get_cpt(pos);
+                while (cpt2 == '\r' || cpt2 == '\n') {
+                    cpt2 = _get_cpt(++pos);
+                }
+                _add_token(pos);
+                continue;
+            }
+
+            size_t num_whitespaces = 0;
+            size_t last_end_r_or_n = 0;
+            while (_get_flags(pos+num_whitespaces).is_whitespace) {
+                uint32_t cpt2 = _get_cpt(pos+num_whitespaces);
+                if (cpt2 == '\r' || cpt2 == '\n') {
+                    last_end_r_or_n = pos + num_whitespaces + 1;
+                }
+                num_whitespaces++;
+            }
+
+            // regex: \s*[\r\n]+
+            if (last_end_r_or_n > 0) {
+                pos = last_end_r_or_n;
+                _add_token(pos);
+                continue;
+            }
+
+            // regex: \s+(?!\S)
+            if (num_whitespaces > 1 && _get_cpt(pos+num_whitespaces) != OUT_OF_RANGE) {
+                pos += num_whitespaces - 1;
+                _add_token(pos);
+                continue;
+            }
+
+            // regex: \s+
+            if (num_whitespaces > 0) {
+                pos += num_whitespaces;
+                _add_token(pos);
+                continue;
+            }
+
+            // no matches
+            _add_token(++pos);
+        }
+    }
+
+    return bpe_offsets;
+}
+
 template <typename CharT>
 static std::vector<size_t> unicode_regex_split_stl(const std::basic_string<CharT> & text, const std::basic_string<CharT> & regex, const std::vector<size_t> & offsets) {
     using BidirIt = typename std::basic_string<CharT>::const_iterator;
@@ -929,6 +1059,9 @@ static std::vector<size_t> unicode_regex_split_custom(const std::string & text,
     } else if (
             regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
         bpe_offsets = unicode_regex_split_custom_qwen2(text, offsets);
+    } else if (
+            regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
+        bpe_offsets = unicode_regex_split_custom_qwen35(text, offsets);
     } else if (regex_expr == "\\p{Han}+") {
         // K2's first pattern - handle all K2 patterns together
         bpe_offsets = unicode_regex_split_custom_kimi_k2(text, offsets);
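
The least obvious step in the handler above is how the `\s+(?!\S)` alternative is emulated without lookahead: when a run of whitespace is followed by more text, all but the last whitespace character are emitted, so the final one can attach to the next word. The sketch below reproduces only that behavior under strong simplifications: it treats ASCII space as the only whitespace, treats every other byte as a "word" character, and ignores newlines, digits, punctuation, and contractions. `split_spaces_sketch` is a hypothetical name, not a function from this commit.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns chunk lengths (not absolute positions), mirroring how the handler
// reports its result. Simplified rules, tried in order at each position:
//   " ?[^ ]+"   an optional single leading space followed by a word
//   " +(?! \S)" a space run followed by text gives up its last space
//   " +"        a space run at the end of the input is taken whole
static std::vector<size_t> split_spaces_sketch(const std::string & text) {
    std::vector<size_t> lens;
    size_t pos = 0;
    while (pos < text.size()) {
        const size_t begin = pos;
        // Where a word would start if one leading space is consumed.
        const size_t word = (text[pos] == ' ') ? pos + 1 : pos;
        if (word < text.size() && text[word] != ' ') {
            pos = word;
            while (pos < text.size() && text[pos] != ' ') {
                pos++;
            }
        } else {
            // Count the run of spaces starting at pos (always at least one).
            size_t n = 0;
            while (pos + n < text.size() && text[pos + n] == ' ') {
                n++;
            }
            // If text follows the run, leave the last space for the next word.
            pos += (n > 1 && pos + n < text.size()) ? n - 1 : n;
        }
        lens.push_back(pos - begin);
    }
    return lens;
}
```

For the input `"a   b"` this yields chunks of lengths 1, 2, 2 (`"a"`, two spaces, `" b"`), while `"a   "` yields 1, 3 because nothing follows the trailing run. Each decision inspects a bounded window ahead of `pos` and never rewinds, which is the property that avoids the deep recursion the commit message attributes to std::regex on long inputs.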

tests/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -131,6 +131,7 @@ llama_test(test-tokenizer-0 NAME test-tokenizer-0-llama-spm ARGS ${PROJE
 llama_test(test-tokenizer-0 NAME test-tokenizer-0-mpt ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-mpt.gguf)
 llama_test(test-tokenizer-0 NAME test-tokenizer-0-phi-3 ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-phi-3.gguf)
 llama_test(test-tokenizer-0 NAME test-tokenizer-0-qwen2 ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-qwen2.gguf)
+llama_test(test-tokenizer-0 NAME test-tokenizer-0-qwen35 ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-qwen35.gguf)
 llama_test(test-tokenizer-0 NAME test-tokenizer-0-refact ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-refact.gguf)
 llama_test(test-tokenizer-0 NAME test-tokenizer-0-starcoder ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-starcoder.gguf)
