Commit b5fbfe1

KO TN fixes (DCO remediation + MRC updates) (NVIDIA#389)
* Add Korean TN support for cardinal numbers and postprocessing (NVIDIA#285)
* Add Korean TN support for cardinal numbers and postprocessing
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Refactor Korean TN cardinal and postprocessing logic based on review feedback
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Add __init__.py to ko/data directory
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Update KO_TN_CACHE to trigger Korean CI run
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
---------
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>

* Korean Ordinal TN support (NVIDIA#286)
* Add Korean TN support for cardinal numbers and postprocessing
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Refactor Korean TN cardinal and postprocessing logic based on review feedback
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Add Korean Ordinal TN logic and test cases
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Refactor ordinal logic (1-39, 40+) and add word tagger and verbalizer
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Refactor ordinal logic (1-39, 40+) and add word tagger and verbalizer
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Add support for 0 in ordinal tagger
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Update ordinal.py to exclude digit 1 in code and remove unnecessary TSV file
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Remove .far files
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix(ko/ordinal): update ordinal FST based on review feedback
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>

* Korean TN Decimal Support (NVIDIA#303)
* feat(ko/decimal): add Korean decimal TN support
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* feat(ko): Add fraction tagger and verbalizer with tests
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix(ko): Update decimal and fraction taggers
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>

* Korean TN for Date and Time (NVIDIA#316)
* feat(ko/date): Add date TN taggers, verbalizers, test cases, and post-processing fixes
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix(ko/date): update date tagger and sparrowhawk test
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* ko(TN): Date TN fixes & cleanup
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* ko(TN): Add Time tagger/verbalizer + tests
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* ko(TN): Date — strict YYYY for delimited formats; define single-year 1–4 digit behavior
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>

* Korean TN for Money and Telephone (NVIDIA#324)
* feat(ko/money): Korean Money TN only; add data & tests; wire tagger/verbalizer
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix(ko/money): polish tagger/verbalizer & expand tests
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* ko: add Telephone TN (tagger+verbalizer) + wire + tests; include money/test updates
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* ko: refactor money/telephone taggers & verbalizers
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* ko/money: use NEMO_NOT_QUOTE, lowercase space helper, trim mid optimizes
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* ko: update money/telephone taggers and telephone verbalizer
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* ko: update telephone taggers
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
---------
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>

* Korean TN for Measure and Electronic (NVIDIA#353)
* Add: Korean Measure & Electronic TN (taggers, verbalizers, tests, data)
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Update KO electronic & measure taggers/verbalizers and test cases
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Edited as per review feedback
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>

* Korean TN fixes: cardinal, decimal, fraction, date
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Add ko electronic extensions and improve electronic/telephone normalization
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Fix Korean TN issues and update test cases
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Fix Korean TN electronic and post-processing issues
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Fix Korean TN spacing and electronic/cardinal handling
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Fix optional token separator and remove redundant whitespace normalization
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Remove unused KO post_processing and update exporter
  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* Add native counting support for number+counter in Korean TN
  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
1 parent fb7c0db commit b5fbfe1

4 files changed: +61 -5 lines changed

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+
+
+
+마리
+송이
+포기
+사람
+자루
+
+켤레
+그루
+
+
+
+
+
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+1
+2
+3
+4
+5 다섯
+6 여섯
+7 일곱
+8 여덟
+9 아홉

nemo_text_processing/text_normalization/ko/taggers/cardinal.py

Lines changed: 27 additions & 1 deletion
@@ -274,6 +274,32 @@ def __init__(self, deterministic: bool = True):
             graph_zero,
         ).optimize()
 
+        # ----------------------------
+        # Native counting + counters
+        # e.g., 3개, 2명, 10살
+        #
+        # In Korean, counters require native numeral forms
+        # for small numbers (한/두/세…, 열/스무/서른…).
+        counter_suffix = pynini.string_file(get_abs_path("data/number/counter_suffix.tsv"))
+        counter_suffix_accep = pynini.project(counter_suffix, "input").optimize()
+
+        native_ones = pynini.string_file(get_abs_path("data/number/native_ones.tsv"))  # 1~9: 한/두/세/...
+        ordinal_tens = pynini.string_file(get_abs_path("data/ordinal/tens.tsv"))  # 10=열, 20=스무, 30=서른
+        ordinal_tens_prefix = pynini.string_file(get_abs_path("data/ordinal/tens_prefix.tsv"))  # 열/스물/서른
+
+        native_11_to_39 = (ordinal_tens_prefix + native_ones).optimize()
+        native_1_to_39 = pynini.union(native_ones, ordinal_tens, native_11_to_39).optimize()
+
+        # Compose number + counter as one cardinal token
+        counter_case = (
+            pynutil.insert('integer: "')
+            + native_1_to_39
+            + pynutil.insert('" ')
+            + pynutil.insert('counter: "')
+            + counter_suffix_accep
+            + pynutil.insert('"')
+        ).optimize()
+
         # Sign and final formatting
         # Build the integer token (integer: "...")
         integer_token = pynutil.insert('integer: "') + graph_num + pynutil.insert('"')
@@ -293,7 +319,7 @@ def __init__(self, deterministic: bool = True):
         signed_integer = (minus_prefix | plus_prefix).ques + integer_token
 
         # Prefer accounting-form first, then signed form
-        final_graph = paren_negative | signed_integer
+        final_graph = paren_negative | signed_integer | counter_case
 
         # Wrap with class tokens and finalize
         final_graph = self.add_tokens(final_graph)
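
The native-counting rule added to the tagger can be illustrated without pynini. What follows is a minimal pure-Python sketch under stated assumptions, not the FST from this commit. The lookup tables are assumed: 한/두/세, 열/스무/서른, 열/스물/서른 and the counters 개/명/살 come from the code comments, 네 is assumed for 4, and the remaining counters are the ones visible in the data file. The names `to_native` and `tag_counter` are hypothetical.

```python
import re
from typing import Optional

# Assumed contents of the data files (partly not visible in this capture).
NATIVE_ONES = {
    "1": "한", "2": "두", "3": "세", "4": "네", "5": "다섯",
    "6": "여섯", "7": "일곱", "8": "여덟", "9": "아홉",
}
NATIVE_TENS = {"10": "열", "20": "스무", "30": "서른"}   # standalone tens
TENS_PREFIX = {"1": "열", "2": "스물", "3": "서른"}      # tens before a ones digit
COUNTERS = ("개", "명", "살", "마리", "송이", "포기", "사람", "자루", "켤레", "그루")

# One or two ASCII digits followed directly by a known counter.
_NUMBER_COUNTER = re.compile(r"([0-9]{1,2})(" + "|".join(COUNTERS) + r")")


def to_native(digits: str) -> Optional[str]:
    """Return the native-Korean reading for 1..39, or None if out of range."""
    if digits in NATIVE_ONES:
        return NATIVE_ONES[digits]
    if digits in NATIVE_TENS:
        return NATIVE_TENS[digits]
    if len(digits) == 2 and digits[0] in TENS_PREFIX and digits[1] in NATIVE_ONES:
        return TENS_PREFIX[digits[0]] + NATIVE_ONES[digits[1]]
    return None


def tag_counter(text: str) -> Optional[str]:
    """Mimic the tagged fields produced by counter_case for number+counter input."""
    match = _NUMBER_COUNTER.fullmatch(text)
    if match is None:
        return None
    native = to_native(match.group(1))
    if native is None:
        return None
    return f'integer: "{native}" counter: "{match.group(2)}"'


print(tag_counter("3개"))
print(tag_counter("21마리"))
```

Inputs outside 1 to 39, such as `40개`, return `None` in this sketch, mirroring the fact that `native_1_to_39` only covers that range; in the real grammar such inputs would presumably be handled by other branches.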

nemo_text_processing/text_normalization/ko/verbalizers/cardinal.py

Lines changed: 9 additions & 4 deletions
@@ -38,11 +38,16 @@ def __init__(self, deterministic: bool = True):
 
         self.optional_sign = pynini.closure(self.optional_sign + delete_space, 0, 1)
 
-        integer = pynini.closure(NEMO_NOT_QUOTE)
+        # quoted: parses a quoted string value like "십", "명"
+        quoted = delete_space + pynutil.delete('"') + pynini.closure(NEMO_NOT_QUOTE) + pynutil.delete('"')
 
-        self.integer = delete_space + pynutil.delete("\"") + integer + pynutil.delete("\"")
-        integer = pynutil.delete("integer:") + self.integer
+        # integer: mandatory field
+        integer = pynutil.delete("integer:") + quoted
+
+        # counter: optional field (e.g., 명/개/살).
+        counter = pynutil.delete("counter:") + quoted
+        counter = pynini.closure(delete_space + counter, 0, 1)
+        self.numbers = self.optional_sign + integer + counter
 
-        self.numbers = self.optional_sign + integer
         delete_tokens = self.delete_tokens(self.numbers)
         self.fst = delete_tokens.optimize()
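
The verbalizer change above amounts to reading a mandatory `integer:` field followed by an optional `counter:` field and concatenating their quoted values. Here is a rough pure-Python sketch of that behavior, using a regular expression in place of the FST. The function name `verbalize_fields` is hypothetical, and the sign field and the surrounding `cardinal { ... }` wrapper are deliberately left out.

```python
import re
from typing import Optional

# integer: "<value>" is required; counter: "<value>" may follow it.
_FIELDS = re.compile(r'integer:\s*"([^"]*)"(?:\s*counter:\s*"([^"]*)")?')


def verbalize_fields(fields: str) -> Optional[str]:
    """Join the integer value with the counter value, if a counter is present."""
    match = _FIELDS.fullmatch(fields.strip())
    if match is None:
        return None
    integer = match.group(1)
    counter = match.group(2) or ""  # group is None when the field is absent
    return integer + counter


print(verbalize_fields('integer: "세" counter: "개"'))
print(verbalize_fields('integer: "십"'))
```

Because the counter group is optional, plain cardinals without a counter still verbalize as before, which matches the `pynini.closure(..., 0, 1)` wrapping in the diff.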
