Commit b5d525d

mgrafu, RajanPutty, and pre-commit-ci[bot] authored
Add hi_en Code Switched (#415) (#426)
* Add hi_en Code Switched (#415)
* Add hi_en Code Switched
  Signed-off-by: RajanPutty <rputty@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Address PR #415 review: restore ko, dedupe whitelists, expand hi_en tests, add hi_en CI
  Signed-off-by: Rajan Putty <rputty@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Signed-off-by: RajanPutty <rputty@nvidia.com>
Signed-off-by: Rajan Putty <rputty@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update jenkins
  Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
* Add hi_en Code Switched (#415)
* Add hi_en Code Switched
  Signed-off-by: RajanPutty <rputty@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Address PR #415 review: restore ko, dedupe whitelists, expand hi_en tests, add hi_en CI
  Signed-off-by: Rajan Putty <rputty@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Signed-off-by: RajanPutty <rputty@nvidia.com>
Signed-off-by: Rajan Putty <rputty@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update jenkins
  Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
* fix jenkins bug
  Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
* separate cache dirs
  Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
* refresh cache dirs
  Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
* refresh cache dir ko
  Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

---------

Signed-off-by: RajanPutty <rputty@nvidia.com>
Signed-off-by: Rajan Putty <rputty@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Co-authored-by: RajanPutty <rputty@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 6e5601d commit b5d525d

41 files changed

Lines changed: 1642 additions & 8 deletions

Jenkinsfile

Lines changed: 14 additions & 5 deletions
@@ -12,9 +12,10 @@ pipeline {
   environment {
     AR_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/04-24-24-0'
     DE_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/10-23-24-0'
-    EN_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/09-25-25-0'
+    EN_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/06-04-26-3'
     ES_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/09-25-24-0'
     ES_EN_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/08-30-24-0'
+    HI_EN_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/06-04-26-4'
     FR_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/04-07-25-0'
     HU_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/07-16-24-0'
     PT_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/05-01-26-1'
@@ -27,8 +28,8 @@ pipeline {
     HY_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/03-12-24-0'
     MR_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/03-12-24-1'
     JA_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/10-17-24-1'
-    KO_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/04-23-26-0'
-    HI_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/04-23-26-0'
+    HI_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/06-04-26-5'
+    KO_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/06-04-25-6'
     DEFAULT_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/06-08-23-0'
   }
   stages {
@@ -104,7 +105,11 @@ pipeline {
             sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --lang=hi --text="एक" --cache_dir ${HI_TN_CACHE}'
           }
         }
-
+        stage('L0: Codeswitched HI/EN ITN grammars') {
+          steps {
+            sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --lang=hi_en --text="एक" --cache_dir ${HI_EN_TN_CACHE}'
+          }
+        }
       }
     }
 
@@ -168,7 +173,6 @@ pipeline {
             sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --lang=ar --text="اثنان " --cache_dir ${AR_TN_CACHE}'
           }
         }
-
       }
     }
 
@@ -409,6 +413,11 @@ pipeline {
             sh 'CUDA_VISIBLE_DEVICES="" pytest tests/nemo_text_processing/es_en/ -m "not pleasefixme" --cpu --tn_cache_dir ${ES_EN_TN_CACHE}'
           }
         }
+        stage('L1: Run all Codeswitched HI/EN TN/ITN tests (restore grammars from cache)') {
+          steps {
+            sh 'CUDA_VISIBLE_DEVICES="" pytest tests/nemo_text_processing/hi_en/ -m "not pleasefixme" --cpu --tn_cache_dir ${HI_EN_TN_CACHE}'
+          }
+        }
         stage('L1: Run all AR TN/ITN tests (restore grammars from cache)') {
           steps {
             sh 'CUDA_VISIBLE_DEVICES="" pytest tests/nemo_text_processing/ar/ -m "not pleasefixme" --cpu --tn_cache_dir ${AR_TN_CACHE}'
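The new L0 stage invokes the inverse-normalization script once with `--lang=hi_en` and the `HI_EN_TN_CACHE` directory. As a rough sketch, the same command line can be assembled in Python. `build_itn_command` is a hypothetical helper that is not part of the repository, the flags are copied from the stage above, and the cache path in the usage lines is only a placeholder.

```python
import shlex
from typing import List

# Script path as it appears in the Jenkinsfile stage.
ITN_SCRIPT = "nemo_text_processing/inverse_text_normalization/inverse_normalize.py"


def build_itn_command(lang: str, text: str, cache_dir: str) -> List[str]:
    """Return the argv list for one ITN run (hypothetical helper, illustration only)."""
    return [
        "python",
        ITN_SCRIPT,
        f"--lang={lang}",
        f"--text={text}",
        "--cache_dir",
        cache_dir,
    ]


if __name__ == "__main__":
    # Placeholder cache directory; the CI job uses ${HI_EN_TN_CACHE} instead.
    cmd = build_itn_command("hi_en", "एक", "/tmp/hi_en_cache")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps the quoting of the Devanagari test input out of the caller's hands.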
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from nemo_text_processing.inverse_text_normalization.hi_en.taggers.tokenize_and_classify import ClassifyFst
+from nemo_text_processing.inverse_text_normalization.hi_en.verbalizers.verbalize import VerbalizeFst
+from nemo_text_processing.inverse_text_normalization.hi_en.verbalizers.verbalize_final import VerbalizeFinalFst
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+import pynini
+from pynini.lib import pynutil
+
+from nemo_text_processing.inverse_text_normalization.en.taggers.cardinal import CardinalFst as EnCardinalFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.date import DateFst as EnDateFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.decimal import DecimalFst as EnDecimalFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.electronic import ElectronicFst as EnElectronicFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.measure import MeasureFst as EnMeasureFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.money import MoneyFst as EnMoneyFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.ordinal import OrdinalFst as EnOrdinalFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.punctuation import PunctuationFst as EnPunctuationFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.telephone import TelephoneFst as EnTelephoneFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.time import TimeFst as EnTimeFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.whitelist import WhiteListFst as EnWhiteListFst
+from nemo_text_processing.inverse_text_normalization.en.taggers.word import WordFst as EnWordFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.cardinal import CardinalFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.date import DateFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.decimal import DecimalFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.fraction import FractionFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.measure import MeasureFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.money import MoneyFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.ordinal import OrdinalFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.punctuation import PunctuationFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.telephone import TelephoneFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.time import TimeFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.whitelist import WhiteListFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.word import WordFst
+from nemo_text_processing.text_normalization.en.graph_utils import (
+    INPUT_LOWER_CASED,
+    GraphFst,
+    delete_extra_space,
+    delete_space,
+    generator_main,
+)
+from nemo_text_processing.utils.logging import logger
+
+
+class ClassifyFst(GraphFst):
+    """
+    Final class that composes all other classification grammars. This class can process an entire sentence, that is lower cased.
+    For deployment, this grammar will be compiled and exported to OpenFst Finite State Archive (FAR) File.
+    More details to deployment at NeMo/tools/text_processing_deployment.
+
+    Args:
+        cache_dir: path to a dir with .far grammar file. Set to None to avoid using cache.
+        overwrite_cache: set to True to overwrite .far files
+        whitelist: path to a file with Hindi whitelist replacements. If None, defaults to the Hindi whitelist at
+            nemo_text_processing/inverse_text_normalization/hi/data/whitelist/whitelist.tsv
+        en_whitelist: path to a file with English whitelist replacements. If None, defaults to the English whitelist at
+            nemo_text_processing/inverse_text_normalization/en/data/whitelist.tsv
+        input_case: accepting either "lower_cased" or "cased" input.
+    """
+
+    def __init__(
+        self,
+        cache_dir: str = None,
+        overwrite_cache: bool = False,
+        whitelist: str = None,
+        en_whitelist: str = None,
+        input_case: str = INPUT_LOWER_CASED,
+    ):
+        super().__init__(name="tokenize_and_classify", kind="classify")
+
+        far_file = None
+        if cache_dir is not None and cache_dir != "None":
+            os.makedirs(cache_dir, exist_ok=True)
+            far_file = os.path.join(cache_dir, f"hi_en_itn_{input_case}.far")
+        if not overwrite_cache and far_file and os.path.exists(far_file):
+            self.fst = pynini.Far(far_file, mode="r")["tokenize_and_classify"]
+            logger.info(f"ClassifyFst.fst was restored from {far_file}.")
+        else:
+            logger.info(f"Creating ClassifyFst grammars.")
+
+            cardinal = CardinalFst()
+            cardinal_graph = cardinal.fst
+
+            ordinal = OrdinalFst(cardinal)
+            ordinal_graph = ordinal.fst
+
+            decimal = DecimalFst(cardinal)
+            decimal_graph = decimal.fst
+
+            fraction = FractionFst(cardinal)
+            fraction_graph = fraction.fst
+
+            measure_graph = MeasureFst(cardinal=cardinal, decimal=decimal).fst
+            date_graph = DateFst(cardinal, ordinal).fst
+            word_graph = WordFst().fst
+            time_graph = TimeFst(cardinal).fst
+            money_graph = MoneyFst(cardinal=cardinal, decimal=decimal).fst
+            whitelist_graph = WhiteListFst(input_file=whitelist).fst
+            punct_graph = PunctuationFst().fst
+            telephone_graph = TelephoneFst(cardinal).fst
+
+            en_cardinal = EnCardinalFst(input_case=input_case)
+            en_cardinal_graph = en_cardinal.fst
+
+            en_ordinal = EnOrdinalFst(cardinal=en_cardinal, input_case=input_case)
+            en_ordinal_graph = en_ordinal.fst
+
+            en_decimal = EnDecimalFst(cardinal=en_cardinal, input_case=input_case)
+            en_decimal_graph = en_decimal.fst
+
+            en_measure_graph = EnMeasureFst(cardinal=en_cardinal, decimal=en_decimal, input_case=input_case).fst
+            en_date_graph = EnDateFst(ordinal=en_ordinal, input_case=input_case).fst
+            en_word_graph = EnWordFst().fst
+            en_time_graph = EnTimeFst(input_case=input_case).fst
+            en_money_graph = EnMoneyFst(cardinal=en_cardinal, decimal=en_decimal, input_case=input_case).fst
+            en_whitelist_graph = EnWhiteListFst(input_file=en_whitelist, input_case=input_case).fst
+            en_punct_graph = EnPunctuationFst().fst
+            en_electronic_graph = EnElectronicFst(input_case=input_case).fst
+            en_telephone_graph = EnTelephoneFst(cardinal=en_cardinal, input_case=input_case).fst
+
+            classify = (
+                pynutil.add_weight(whitelist_graph, 1.01)
+                | pynutil.add_weight(en_whitelist_graph, 1.01)
+                | pynutil.add_weight(time_graph, 1.1)
+                | pynutil.add_weight(en_time_graph, 1.1)
+                | pynutil.add_weight(date_graph, 1.09)
+                | pynutil.add_weight(en_date_graph, 1.09)
+                | pynutil.add_weight(decimal_graph, 1.09)
+                | pynutil.add_weight(en_decimal_graph, 1.09)
+                | pynutil.add_weight(fraction_graph, 1.09)
+                | pynutil.add_weight(measure_graph, 1.6)
+                | pynutil.add_weight(en_measure_graph, 1.1)
+                | pynutil.add_weight(cardinal_graph, 1.6)
+                | pynutil.add_weight(en_cardinal_graph, 1.1)
+                | pynutil.add_weight(ordinal_graph, 1.6)
+                | pynutil.add_weight(en_ordinal_graph, 1.09)
+                | pynutil.add_weight(money_graph, 1.6)
+                | pynutil.add_weight(en_money_graph, 1.1)
+                | pynutil.add_weight(telephone_graph, 1.6)
+                | pynutil.add_weight(en_telephone_graph, 1.1)
+                | pynutil.add_weight(en_electronic_graph, 1.1)
+                | pynutil.add_weight(word_graph, 100)
+                | pynutil.add_weight(en_word_graph, 120)
+            )
+
+            punct = pynutil.insert("tokens { ") + pynutil.add_weight(punct_graph, weight=1.1) + pynutil.insert(" }")
+            en_punct = (
+                pynutil.insert("tokens { ") + pynutil.add_weight(en_punct_graph, weight=1.3) + pynutil.insert(" }")
+            )
+            token = pynutil.insert("tokens { ") + classify + pynutil.insert(" }")
+            token_plus_punct = (
+                pynini.closure(punct + pynutil.insert(" "))
+                + token
+                + pynini.closure(pynutil.insert(" ") + punct | en_punct)
+            )
+
+            graph = token_plus_punct + pynini.closure(delete_extra_space + token_plus_punct)
+            graph = delete_space + graph + delete_space
+
+            self.fst = graph.optimize()
+
+            if far_file:
+                generator_main(far_file, {"tokenize_and_classify": self.fst})
+                logger.info(f"ClassifyFst grammars are saved to {far_file}.")
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
import os
16+
17+
18+
def get_abs_path(rel_path):
19+
"""
20+
Get absolute path
21+
22+
Args:
23+
rel_path: relative path to this file
24+
25+
Returns absolute path
26+
"""
27+
return os.path.dirname(os.path.abspath(__file__)) + '/' + rel_path
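`get_abs_path` resolves a data file relative to the module that defines it by concatenating the module directory, a literal `/`, and the relative path. A minimal sketch of the same idea using `os.path.join` follows; `get_abs_path_from` and its `anchor_file` parameter are hypothetical and exist only so the behaviour can be exercised without the package's `__file__`.

```python
import os


def get_abs_path_from(anchor_file: str, rel_path: str) -> str:
    """Resolve rel_path against the directory that contains anchor_file.

    Hypothetical variant of get_abs_path: the anchor is passed in explicitly
    instead of being read from __file__.
    """
    base_dir = os.path.dirname(os.path.abspath(anchor_file))
    # os.path.join picks the separator for the current platform.
    return os.path.join(base_dir, rel_path)


if __name__ == "__main__":
    # Illustrative relative path only; it is not taken from the commit.
    print(get_abs_path_from(__file__, os.path.join("data", "whitelist.tsv")))
```

On POSIX systems both forms produce the same string; the `os.path.join` variant differs only in using the platform's separator.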
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
