Commit 78694d5
Add Arabic char tokenizer and Japanese-English katakana support (#15614)
Apply isort and black reformatting
Fix the Hindi char tokenizer to use case='upper', and prevent duplicate spaces in the Japanese G2P fallback paths.
Apply isort and black reformatting
Fix HindiCharsTokenizer backward compat and add Arabic dialect tests
Apply isort and black reformatting
Add Arabic tokenizer test coverage: diacritics, dialects, punctuation, unknown chars
Expand Arabic tokenizer tests: parametrize diacritics, dialects
Apply isort and black reformatting
Add comprehensive test coverage.
fix: add back-compatibility, case=mixed, ascii_letters.
fix: add charset_version to Hindi/Arabic tokenizers for backward compatibility
Introduce a charset_version parameter in HindiCharsTokenizer and
ArabicCharsTokenizer so old models (v1: case='mixed') keep working
while new models train with the corrected charset (v2: case='upper').
- Define CASELESS_SCRIPT_TOKENIZER_TARGETS and DEFAULT_CHARSET_VERSION
constants in tts_tokenizers.py
- Persist charset_version into the OmegaConf config during training
(setup_tokenizers) so .nemo archives record which version was used
- Add _migrate_charset_version() helper in magpietts inference utils
to pin charset_version=1 for old checkpoints that lack the field,
preventing a silent vocabulary mismatch at inference time
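
The migration step described in the last bullet can be sketched roughly as follows. This is a minimal illustration over plain dicts, not the code from this commit: the function name `migrate_charset_version`, the `LEGACY_CHARSET_VERSION` constant, the `_target_` key layout, and the contents of `CASELESS_SCRIPT_TOKENIZER_TARGETS` are all assumptions made for the example. The real helper presumably operates on the OmegaConf config loaded from the checkpoint.

```python
from copy import deepcopy

# Assumed contents: the commit names this constant but its value is not shown here.
CASELESS_SCRIPT_TOKENIZER_TARGETS = (
    "HindiCharsTokenizer",
    "ArabicCharsTokenizer",
)
# Hypothetical constant: v1 is the old charset (case='mixed').
LEGACY_CHARSET_VERSION = 1


def migrate_charset_version(tokenizer_cfgs: dict) -> dict:
    """Return a copy of the tokenizer configs in which affected tokenizers
    that lack a charset_version field are pinned to the legacy version."""
    migrated = deepcopy(tokenizer_cfgs)
    for cfg in migrated.values():
        target = cfg.get("_target_", "")
        is_caseless = target.endswith(CASELESS_SCRIPT_TOKENIZER_TARGETS)
        if is_caseless and "charset_version" not in cfg:
            # Old checkpoint: it was trained before the field existed, so it
            # must keep the v1 vocabulary to avoid a silent mismatch.
            cfg["charset_version"] = LEGACY_CHARSET_VERSION
    return migrated
```

Under these assumptions, a config that already records a charset_version is left untouched, and tokenizers for other scripts never gain the field, so only pre-existing Hindi and Arabic checkpoints are affected.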
bugfix: L2_TTS_Fast_dev_runs_Magpietts_OnlineCFGDistillation.sh
Signed-off-by: quapham <quapham@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
1 parent 2223816 commit 78694d5
7 files changed
Lines changed: 347 additions & 21 deletions
File tree
- nemo/collections
  - common/tokenizers/text_to_speech
  - tts
    - data
    - g2p/models
    - modules/magpietts_inference
- tests
  - collections/common/tokenizers/text_to_speech
  - functional_tests
Lines changed: 18 additions & 1 deletion
Lines changed: 133 additions & 9 deletions
Lines changed: 18 additions & 1 deletion
Lines changed: 20 additions & 0 deletions