Commit 7795e7b

Korean TN for Measure and Electronic (NVIDIA#353)
* Add: Korean Measure & Electronic TN (taggers, verbalizers, tests, data)

  Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Update KO electronic & measure taggers/verbalizers and test cases

  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Edited as per review feedback

  Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

---------

Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
1 parent 34bfecf commit 7795e7b

18 files changed: +731 -1 lines changed

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+카드 끝자리	카드 끝자리
+카드 마지막 네자리	카드 마지막 네자리
+카드 마지막 4자리	카드 마지막 4자리
+신용카드 번호	신용카드 번호
+신용카드	신용카드
+체크카드 번호	체크카드 번호
+체크카드	체크카드
+카드번호	카드번호
+결제 카드	결제 카드
+결제카드	결제카드
+카드	카드
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+.com	닷컴
+.org	닷 오알지
+.gov	닷 거브
+.edu	닷 에듀
+.net	닷 넷
+.ai	닷 에이아이
+.io	닷 아이오
+.dev	닷 데브
+.app	닷 앱
+.cloud	닷 클라우드
+.shop	닷 샵
+.store	닷 스토어
+.co	닷 씨오
+.me	닷 미
+.kr	닷 케이알
+.co.kr	닷 씨오 닷 케이알
+.ac.kr	닷 에이씨 닷 케이알
+.or.kr	닷 오알 닷 케이알
+.go.kr	닷 지오 닷 케이알
+.re.kr	닷 알이 닷 케이알
+.jp	닷 제이피
+.cn	닷 씨엔
+.fr	닷 에프알
+.de	닷 디이
+.it	닷 아이티
+.uk	닷 유케이
+.br	닷 비알
+.in	닷 아이엔
+.ru	닷 알유
+.jpg	닷 제이피지
+.png	닷 피엔지
+.pdf	닷 피디에프
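The table above maps domain and file suffixes to Korean readings, and it contains overlapping entries (`.kr`, `.co.kr`). The tagger in this commit only uses the input side of the table, and how the verbalizer consumes it is not shown here. The following is a minimal pure-Python sketch, assuming a longest-suffix lookup; `load_lexicon`, `split_known_suffix`, and the three-row `DOMAIN_TSV` sample are hypothetical names used for illustration only and are not part of the commit.

```python
# Hypothetical three-row sample in the same tab-separated layout as the table above.
DOMAIN_TSV = ".com\t닷컴\n.kr\t닷 케이알\n.co.kr\t닷 씨오 닷 케이알\n"


def load_lexicon(tsv_text: str) -> dict:
    """Parse two-column, tab-separated text into a dict (key -> reading)."""
    table = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        key, value = line.split("\t", 1)
        table[key] = value
    return table


def split_known_suffix(host: str, table: dict):
    """Return (remaining host, reading) for the longest known suffix.

    Trying longer suffixes first makes ".co.kr" win over ".kr".
    Returns (host, None) when no suffix in the table matches.
    """
    for suffix in sorted(table, key=len, reverse=True):
        if host.endswith(suffix):
            return host[: -len(suffix)], table[suffix]
    return host, None
```

With this sample table, `split_known_suffix("nvidia.co.kr", load_lexicon(DOMAIN_TSV))` picks the `.co.kr` row rather than `.kr`, which is the reason for sorting suffixes by length.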
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+@	골뱅이
+.
+/	슬래시
+-	대시
+_	언더바
+:	콜론
+?	물음표
+=	이퀄
+&	앰퍼샌드
+%	퍼센트
++	플러스
+#
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+kg	킬로그램
+g	그램
+km	킬로미터
+m	미터
+cm	센티미터
+mm	밀리미터
+L	리터
+l	리터
+mL	밀리리터
+ml	밀리리터
+h	시간
+s
+N	뉴턴
+W	와트
+Hz	헤르츠
+°
+%	퍼센트
+rpm	분당회전수
Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import pynini
+from pynini.lib import pynutil
+
+from nemo_text_processing.text_normalization.ko.graph_utils import (
+    NEMO_ALPHA,
+    NEMO_DIGIT,
+    NEMO_NOT_SPACE,
+    NEMO_SIGMA,
+    NEMO_SPACE,
+    GraphFst,
+    delete_space,
+    insert_space,
+)
+from nemo_text_processing.text_normalization.ko.utils import get_abs_path
+
+
+class ElectronicFst(GraphFst):
+    """
+    Finite state transducer (FST) for classifying **electronic expressions** such as
+    email addresses, URLs, and domain names in Korean.
+
+    Example conversions:
+        - abc@nvidia.co.kr → electronic { username: "abc" domain: "nvidia.co.kr" }
+        - www.nvidia.com → electronic { domain: "www.nvidia.com" }
+        - https://nvidia.com → electronic { protocol: "HTTPS colon slash slash" domain: "nvidia.com" }
+        - 1234-5678-9012-3456 → electronic { protocol: "credit card" domain: "1234567890123456" }
+
+    Args:
+        cardinal: FST for digit/number verbalization (used for numeric parts if non-deterministic).
+        deterministic: If True, provides a single transduction path; otherwise allows multiple.
+    """
+
+    def __init__(self, cardinal: GraphFst, deterministic: bool = True):
+        super().__init__(name="electronic", kind="classify", deterministic=deterministic)
+
+        # ---------- Basic character ranges and symbols ----------
+        LOWER = pynini.union(*[pynini.accep(c) for c in "abcdefghijklmnopqrstuvwxyz"])
+        UPPER = pynini.union(*[pynini.accep(c) for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"])
+        ASCII_LETTER = (LOWER | UPPER).optimize()
+        ASCII_ALNUM = (ASCII_LETTER | NEMO_DIGIT).optimize()
+
+        HYPHEN = pynini.accep("-")
+        DOT = pynini.accep(".")
+        SLASH = pynini.accep("/")
+        AT = pynini.accep("@")
+
+        # Handle numeric reading mode (only for non-deterministic mode)
+        numbers = (
+            NEMO_DIGIT
+            if deterministic
+            else (pynutil.insert(NEMO_SPACE) + cardinal.long_numbers + pynutil.insert(NEMO_SPACE))
+        )
+
+        # ---------- Load resources ----------
+        cc_cues = pynini.string_file(get_abs_path("data/electronic/cc_cues.tsv"))
+        accepted_symbols = pynini.project(pynini.string_file(get_abs_path("data/electronic/symbol.tsv")), "input")
+        accepted_common_domains = pynini.project(
+            pynini.string_file(get_abs_path("data/electronic/domain.tsv")), "input"
+        )
+        graph_symbols = pynini.string_file(get_abs_path("data/electronic/symbol.tsv")).optimize()
+
+        # ---------- Username ----------
+        # Exclude '@' from username
+        username_symbols = pynini.difference(accepted_symbols, AT)
+        # Start with alphanumeric and allow symbols/numbers repeatedly
+        username_core = ASCII_ALNUM + pynini.closure(ASCII_ALNUM | numbers | username_symbols)
+        username = pynutil.insert('username: "') + username_core + pynutil.insert('"') + pynini.cross("@", NEMO_SPACE)
+
+        # ---------- Domain ----------
+        # Simplified RFC: label = [A-Za-z0-9-]+ , TLD = '.' [A-Za-z0-9]{2,}
+        label = pynini.closure(ASCII_ALNUM | HYPHEN, 1)
+        tld = DOT + pynini.closure(ASCII_ALNUM, 2)
+        # Domain can be (label + TLD) or TLD only (e.g., ".com")
+        domain_core = (label + pynini.closure(tld, 1)) | tld
+
+        # Optional path after domain (e.g., /path)
+        path_segment = pynini.closure(NEMO_NOT_SPACE, 1)  # at least one non-space character
+        path = SLASH + path_segment  # /<segment>
+        optional_path = pynini.closure(path, 0, 1)  # optional path
+
+        domain_with_opt_path = domain_core + optional_path
+
+        domain_graph_with_class_tags = (
+            pynutil.insert('domain: "') + domain_with_opt_path.optimize() + pynutil.insert('"')
+        )
+
+        # ---------- protocol ----------
+        protocol_symbols = pynini.closure((graph_symbols | pynini.cross(":", "colon")) + pynutil.insert(NEMO_SPACE))
+        protocol_start = (pynini.cross("https", "HTTPS ") | pynini.cross("http", "HTTP ")) + (
+            pynini.accep("://") @ protocol_symbols
+        )
+        protocol_file_start = pynini.accep("file") + insert_space + (pynini.accep(":///") @ protocol_symbols)
+        protocol_end = pynutil.add_weight(pynini.cross("www", "WWW ") + pynini.accep(".") @ protocol_symbols, -1000)
+        protocol = protocol_file_start | protocol_start | protocol_end | (protocol_start + protocol_end)
+        protocol = pynutil.insert('protocol: "') + protocol + pynutil.insert('"')
+
+        # ---------- Combine all graphs ----------
+        graph = pynini.Fst()  # empty
+
+        # (1) Email pattern
+        email_guard = NEMO_SIGMA + AT + NEMO_SIGMA + DOT + NEMO_SIGMA
+        graph |= pynini.compose(email_guard, username + domain_graph_with_class_tags)
+
+        # (2) Domain only (without protocol)
+        # Exclude '$' (conflict with money FST) and '@' (email)
+        dollar_accep = pynini.accep("$")
+        excluded_symbols = DOT | dollar_accep | AT
+        filtered_symbols = pynini.difference(accepted_symbols, excluded_symbols)
+        accepted_characters = ASCII_ALNUM | filtered_symbols
+        # Domain core graph
+        graph_domain = (pynutil.insert('domain: "') + domain_core + pynutil.insert('"')).optimize()
+        graph |= graph_domain
+
+        # (3) URL with protocol
+        graph |= protocol + insert_space + domain_graph_with_class_tags
+
+        # (4) Credit card pattern: cue + 4–16 digits
+        if deterministic:
+            cc_digits = pynini.closure(NEMO_DIGIT, 4, 16)
+            cc_phrases = (
+                pynutil.insert('protocol: "')
+                + cc_cues
+                + pynutil.insert('" domain: "')
+                + delete_space
+                + cc_digits
+                + pynutil.insert('"')
+            )
+            graph |= cc_phrases
+
+            four = pynini.closure(NEMO_DIGIT, 4, 4)
+            sep_token = pynini.union(HYPHEN, NEMO_SPACE)
+            sep_del = pynutil.delete(pynini.closure(sep_token, 1))  # allow mix of - or space
+
+            cc16_grouped = four + sep_del + four + sep_del + four + sep_del + four
+
+            cc16_no_cue = (
+                pynutil.insert('protocol: "신용카드 " ')
+                + pynutil.insert('domain: "')
+                + cc16_grouped
+                + pynutil.insert('"')
+            )
+
+            # Give it higher priority over Date FST
+            cc16_no_cue = pynutil.add_weight(cc16_no_cue.optimize(), -1.0)
+
+            graph |= cc16_no_cue
+
+        final_graph = self.add_tokens(graph)
+        self.fst = final_graph.optimize()
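The comments in the tagger above describe a simplified domain grammar (label, one or more TLD parts, optional path) and a 16-digit card pattern in four groups. The following is a rough pure-Python sketch of those two shapes using `re`. It is an approximation for reading the diff, not the pynini graph itself: `is_domain` and `strip_card_separators` are hypothetical helper names, and `\S` stands in for `NEMO_NOT_SPACE`.

```python
import re

# Approximation of the grammar named in the tagger comments:
#   label = [A-Za-z0-9-]+ ,  tld = "." [A-Za-z0-9]{2,} ,  optional "/<non-space>+"
LABEL = r"[A-Za-z0-9-]+"
TLD = r"\.[A-Za-z0-9]{2,}"
# Either label followed by one or more TLD parts, or a bare TLD such as ".com".
DOMAIN_CORE = rf"(?:{LABEL}(?:{TLD})+|{TLD})"
DOMAIN = re.compile(rf"{DOMAIN_CORE}(?:/\S+)?")

# Four groups of four digits, each pair separated by one or more hyphens or spaces.
CARD16 = re.compile(r"(\d{4})[- ]+(\d{4})[- ]+(\d{4})[- ]+(\d{4})")


def is_domain(text: str) -> bool:
    """True if the whole string fits the simplified domain shape."""
    return DOMAIN.fullmatch(text) is not None


def strip_card_separators(text: str):
    """Return the 16 digits with separators removed, or None if the shape differs."""
    match = CARD16.fullmatch(text)
    return "".join(match.groups()) if match else None
```

Under this sketch, `www.nvidia.com` and `.com` are accepted as domains while a bare `nvidia` is not, and `1234-5678-9012-3456` collapses to `1234567890123456`, mirroring the docstring examples. An unseparated 16-digit string returns `None` because the grouped pattern requires at least one separator between groups.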

nemo_text_processing/text_normalization/ko/taggers/fraction.py

Lines changed: 3 additions & 0 deletions
@@ -67,6 +67,8 @@ def __init__(self, cardinal: GraphFst, deterministic: bool = True):
             + slash
             + pynutil.insert(NEMO_SPACE)
             + denominator_component
+            + pynutil.insert(NEMO_SPACE)
+            + pynutil.insert('morphosyntactic_features: "분의"')
         )
 
         # Format 2: Korean native "4분의3" style
@@ -90,5 +92,6 @@ def __init__(self, cardinal: GraphFst, deterministic: bool = True):
 
         # Combine full graph
         graph = pynini.closure(optional_sign, 0, 1) + (graph_fraction_slash | graph_fraction_word)
+        self.graph = graph.optimize()
         final_graph = self.add_tokens(graph)
         self.fst = final_graph.optimize()
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pynini
+from pynini.lib import pynutil
+
+from nemo_text_processing.text_normalization.ko.graph_utils import GraphFst, delete_space, insert_space
+from nemo_text_processing.text_normalization.ko.utils import get_abs_path
+
+
+class MeasureFst(GraphFst):
+    """
+    Finite state transducer for classifying Korean measure expressions.
+        - 1kg → measure { cardinal { integer: "일" } units: "킬로그램" }
+        - 12.5km → measure { decimal { integer_part: "십이" fractional_part: "오" } units: "킬로미터" }
+        - 2/3m → measure { fraction { numerator: "이" denominator: "삼" } units: "미터" }
+        - 60km/h → measure { cardinal { integer: "육십" } units: "킬로미터 퍼 시간" }
+
+    This FST attaches measurement units (e.g., "킬로미터", "그램") to numeric expressions
+    classified by the `cardinal`, `decimal`, or `fraction` subgraphs.
+
+    Args:
+        cardinal: FST handling integer (cardinal) numbers.
+        decimal: FST handling decimal numbers (optional).
+        fraction: FST handling fractional numbers (optional).
+        deterministic: If True, provides a single transduction path; otherwise allows multiple.
+    """
+
+    def __init__(
+        self,
+        cardinal: GraphFst,
+        decimal: GraphFst = None,
+        fraction: GraphFst = None,
+        deterministic: bool = True,
+    ):
+        super().__init__(name="measure", kind="classify", deterministic=deterministic)
+
+        # Numeric subgraphs
+        graph_cardinal = cardinal.graph
+
+        # Unit lexicon
+        graph_unit = pynini.string_file(get_abs_path("data/measure/unit.tsv"))
+
+        # Per-expression handling (e.g., km/h, m/s)
+        opt_space = pynini.closure(delete_space, 0, 1)
+        per = pynini.cross("/", "퍼") + opt_space + insert_space + graph_unit
+        optional_per = pynini.closure(opt_space + insert_space + per, 0, 1)
+
+        # Final unit FST produces either "<unit>" or "<unit> 퍼 <unit>"
+        unit = pynutil.insert('units: "') + (graph_unit + optional_per | per) + pynutil.insert('"')
+
+        minus_as_field = pynutil.insert('negative: "마이너스" ')
+        consume_minus = pynini.cross("-", "") | pynini.cross("마이너스", "")
+
+        # Optional minus field + removal of actual sign symbol or word
+        optional_minus = pynini.closure(minus_as_field + consume_minus + opt_space, 0, 1)
+
+        # Combine numeric and unit components
+        pieces = []
+
+        # 1) Cardinal form: e.g., "12kg"
+        sub_cardinal = (
+            pynutil.insert("cardinal { ")
+            + pynutil.insert('integer: "')
+            + graph_cardinal
+            + delete_space
+            + pynutil.insert('" } ')
+            + unit
+        )
+        pieces.append(sub_cardinal)
+
+        # 2) Decimal form: e.g., "12.5km"
+        if decimal is not None:
+            sub_decimal = (
+                pynutil.insert("decimal { ")
+                + optional_minus
+                + decimal.just_decimal
+                + delete_space
+                + pynutil.insert(" } ")
+                + unit
+            )
+            pieces.append(sub_decimal)
+
+        # 3) Fraction form: e.g., "2/3m" or "삼분의 이 미터"
+        if fraction is not None:
+            sub_fraction = pynutil.insert("fraction { ") + fraction.graph + delete_space + pynutil.insert(" } ") + unit
+            pieces.append(sub_fraction)
+
+        # Union all supported numeric forms (cardinal | decimal | fraction)
+        graph = pieces[0]
+        for p in pieces[1:]:
+            graph |= p
+
+        # Final wrapping into tokens { measure { ... } }
+        self.fst = self.add_tokens(graph).optimize()
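The `unit` graph above accepts `<unit>`, `<unit>/<unit>`, or a bare `/<unit>`, and reads the slash as "퍼". The following is a small pure-Python sketch of that composition, assuming a dictionary lookup in place of the pynini lexicon. `UNITS` is a hypothetical five-entry subset of the unit table in this commit, and `read_unit` is an illustrative name, not part of the change.

```python
# Hypothetical subset of the unit lexicon (abbreviation -> Korean reading).
UNITS = {"kg": "킬로그램", "g": "그램", "km": "킬로미터", "m": "미터", "h": "시간"}


def read_unit(unit_text: str):
    """Return '<unit>', '<unit> 퍼 <unit>', or '퍼 <unit>'; None if a part is unknown."""
    head, slash, tail = unit_text.partition("/")
    head, tail = head.strip(), tail.strip()

    if not head and not slash:
        return None  # nothing to read
    if head and head not in UNITS:
        return None  # unknown unit before the slash
    if slash and tail not in UNITS:
        return None  # a slash must be followed by a known unit

    words = []
    if head:
        words.append(UNITS[head])
    if slash:
        words += ["퍼", UNITS[tail]]
    return " ".join(words)
```

For example, `read_unit("km/h")` yields `킬로미터 퍼 시간`, which matches the `60km/h` case in the class docstring. The real graph works on transducers rather than string splitting, so whitespace handling and ambiguity resolution may differ from this sketch.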
