Commit 2fe04a2 (parent cb53beb)

update doc and fix alignment for itn (#47)

* save
* save
* extend alignment for itn

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

4 files changed: 44 additions, 46 deletions

.github/PULL_REQUEST_TEMPLATE.md (14 additions, 25 deletions)
````diff
@@ -2,38 +2,27 @@
 
 Add a one line overview of what this PR aims to accomplish.
 
-**Collection**: [Note which collection this PR will affect]
-
-
-# Changelog
-- Add specific line by line info of high level changes in this PR.
-
-# Usage
-* You can potentially add a usage example below
-
-```python
-# Add a code snippet demonstrating how to use this
-```
 
 # Before your PR is "Ready for review"
 **Pre checks**:
-- [ ] Make sure you read and followed [Contributor guidelines](https://github.com/NVIDIA/NeMo/blob/main/CONTRIBUTING.md)
-- [ ] Did you write any new necessary tests?
-- [ ] Did you add or update any necessary documentation?
-- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
-- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
+- [ ] Have you signed your commits? Use ``git commit -s`` to sign.
+- [ ] Do all unittests finish successfully before sending PR?
+  1) ``pytest`` or (if your machine does not have GPU) ``pytest --cpu`` from the root folder (given you marked your test cases accordingly `@pytest.mark.run_only_on('CPU')`).
+  2) Sparrowhawk tests ``bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...``
+- [ ] If you are adding a new feature: Have you added test cases for both `pytest` and Sparrowhawk [here](tests/nemo_text_processing).
+- [ ] Have you added ``__init__.py`` for every folder and subfolder, including `data` folder which has .TSV files?
+- [ ] Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box)?
+- [ ] Have you added the correct license header `Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.` to all newly added Python files?
+- [ ] If you copied [nemo_text_processing/text_normalization/en/graph_utils.py](nemo_text_processing/text_normalization/en/graph_utils.py) your header's second line should be `Copyright 2015 and onwards Google, Inc.`. See an example [here](https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/en/graph_utils.py#L2).
+- [ ] Remove import guards (`try import: ... except: ...`) if not already done.
+- [ ] If you added a new language or a new feature please update the [NeMo documentation](https://github.com/NVIDIA/NeMo/blob/main/docs/source/nlp/text_normalization/wfst/wfst_text_normalization.rst) (lives in different repo).
+- [ ] Have you added your language support to [tools/text_processing_deployment/pynini_export.py](tools/text_processing_deployment/pynini_export.py).
 
+
+
 **PR Type**:
 - [ ] New Feature
 - [ ] Bugfix
 - [ ] Documentation
 
 If you haven't finished some of the above items you can still open "Draft" PR.
-
-
-## Who can review?
-
-Anyone in the community is free to review the PR once the checks have passed.
-
-[Contributor guidelines](https://github.com/NVIDIA/NeMo/blob/main/CONTRIBUTING.md) contains specific people who can review PRs to various areas.
-
-# Additional Information
-* Related to # (issue)
````
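The unittest item in the new checklist references a CPU-only marker. As a hedged sketch (the `run_only_on` marker name and the `pytest --cpu` flag come from the checklist itself; the test name and body are invented placeholders, and the marker would still need to be registered and honoured by the repository's pytest configuration), marking a test case might look like:

```python
import pytest


# Hypothetical test; only the marker usage mirrors the checklist.
@pytest.mark.run_only_on('CPU')
def test_cardinal_normalization_sketch():
    # placeholder assertion standing in for a real grammar test
    assert "1".isdigit()
```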

README.md (0 additions, 2 deletions)
```diff
@@ -1,8 +1,6 @@
 **NeMo Text Processing**
 ==========================
 
-**This repository is under development, please refer to https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing for full functionality. See [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html) for details.**
-
 Introduction
 ------------
 
```

nemo_text_processing/fst_alignment/alignment.cpp (2 additions, 2 deletions)
```diff
@@ -30,7 +30,7 @@ typedef StdArcLookAheadFst LookaheadFst;
 // Usage:
 
 // g++ -std=gnu++11 -I<path to env>/include/ alignment.cpp -lfst -lthrax -ldl -L<path to env>/lib
-// ./a.out <fst file> "tokenize_and_classify" "2615 Forest Av, 1 Aug 2016" 22 26
+// ./a.out <fst file> "TOKENIZE_AND_CLASSIFY" "2615 Forest Av, 1 Aug 2016" 22 26
 
 // Output:
 // inp string: |2615 Forest Av, 1 Aug 2016|
@@ -42,7 +42,7 @@ typedef StdArcLookAheadFst LookaheadFst;
 // Disclaimer: The heuristic algorithm relies on monotonous alignment and can fail in certain situations,
 // e.g. when word pieces are reordered by the fst, e.g.
 
-// ./a.out <fst file> "tokenize_and_classify" "$1" 0 1
+// ./a.out <fst file> "TOKENIZE_AND_CLASSIFY" "$1" 0 1
 // inp string: |$1|
 // out string: |one dollar|
 // inp indices: [0:1] out indices: [0:3]
```

nemo_text_processing/fst_alignment/alignment.py (28 additions, 17 deletions)
```diff
@@ -30,7 +30,7 @@
 
 Usage:
 
-python alignment.py --fst=<fst file> --text=\"2615 Forest Av, 1 Aug 2016\" --rule=\"tokenize_and_classify\" --start=22 --end=26
+python alignment.py --fst=<fst file> --text="2615 Forest Av, 1 Aug 2016" --rule=TOKENIZE_AND_CLASSIFY --start=22 --end=26 --grammar=TN
 
 Output:
 inp string: |2615 Forest Av, 1 Aug 2016|
@@ -40,7 +40,7 @@
 in: |2016| out: |twenty sixteen|
 
 
-python alignment.py --fst=<fst file> --text=\"2615 Forest Av, 1 Aug 2016\" --rule=\"tokenize_and_classify\"
+python alignment.py --fst=<fst file> --text="2615 Forest Av, 1 Aug 2016" --rule=TOKENIZE_AND_CLASSIFY
 
 Output:
 inp string: |2615 Forest Av, 1 Aug 2016|
```
```diff
@@ -74,6 +74,9 @@
 def parse_args():
     args = ArgumentParser("map substring to output with FST")
     args.add_argument("--fst", help="FAR file containing FST", type=str, required=True)
+    args.add_argument(
+        "--grammar", help="tn or itn", type=str, required=False, choices=[ITN_MODE, TN_MODE], default=TN_MODE
+    )
     args.add_argument(
         "--rule",
         help="rule name in FAR file containing FST",
@@ -94,6 +97,8 @@ def parse_args():
 
 EPS = "<eps>"
 WHITE_SPACE = "\u23B5"
+ITN_MODE = "itn"
+TN_MODE = "tn"
 
 
 def get_word_segments(text: str) -> List[List[int]]:
```
```diff
@@ -142,9 +147,10 @@ def get_string_alignment(fst: pynini.Fst, input_text: str, symbol_table: pynini.
 
     ilabels = paths.ilabels()
     olabels = paths.olabels()
-    logging.debug(paths.istring())
-    logging.debug(paths.ostring())
+    logging.debug("input: " + paths.istring())
+    logging.debug("output: " + paths.ostring())
     output = list(zip([symbol_table.find(x) for x in ilabels], [symbol_table.find(x) for x in olabels]))
+    logging.debug(f"alignment: {output}")
     paths.next()
     assert paths.done()
     output_str = "".join(map(remove, [x[1] for x in output]))
```
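`get_string_alignment` assembles the output string from the `(input, output)` label pairs by dropping epsilons and mapping the visible-whitespace placeholder back to a space. A minimal self-contained sketch of that step; the `EPS`/`WHITE_SPACE` constants and the `remove` lambda mirror the script, while the alignment pairs are invented for illustration (an ITN-style alignment of "two" to "2"):

```python
EPS = "<eps>"
WHITE_SPACE = "\u23B5"  # visible-whitespace placeholder used by the script

# drop epsilon outputs, restore real spaces, keep everything else
remove = lambda x: "" if x == EPS else " " if x == WHITE_SPACE else x

# hypothetical (input_symbol, output_symbol) pairs for "two " -> "2 "
output = [("t", "2"), ("w", EPS), ("o", EPS), (WHITE_SPACE, WHITE_SPACE)]
output_str = "".join(map(remove, [x[1] for x in output]))
print(output_str)  # "2 "
```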
```diff
@@ -184,52 +190,57 @@ def _get_original_index(alignment, aligned_index):
 remove = lambda x: "" if x == EPS else " " if x == WHITE_SPACE else x
 
 
-def indexed_map_to_output(alignment: List[tuple], start: int, end: int):
+def indexed_map_to_output(alignment: List[tuple], start: int, end: int, mode: str):
     """
     Given input start and end index of contracted substring return corresponding output start and end index
 
     Args:
         alignment: alignment generated by FST with shortestpath, is longer than original string since including eps transitions
         start: inclusive start position in input string
         end: exclusive end position in input string
+        mode: grammar type for either tn or itn
 
     Returns:
         output_og_start_index: inclusive start position in output string
         output_og_end_index: exclusive end position in output string
     """
     # get aligned start and end of input substring
+
     aligned_start = _get_aligned_index(alignment, start)
     aligned_end = _get_aligned_index(alignment, end - 1)  # inclusive
 
     logging.debug(f"0: |{list(map(remove, [x[0] for x in alignment[aligned_start:aligned_end+1]]))}|")
 
     # extend aligned_start to left
+
     while (
         aligned_start - 1 > 0
         and alignment[aligned_start - 1][0] == EPS
-        and (alignment[aligned_start - 1][1].isalpha() or alignment[aligned_start - 1][1] == EPS)
+        and (alignment[aligned_start - 1][1].isalnum() or alignment[aligned_start - 1][1] == EPS)
     ):
         aligned_start -= 1
 
     while (
         aligned_end + 1 < len(alignment)
         and alignment[aligned_end + 1][0] == EPS
-        and (alignment[aligned_end + 1][1].isalpha() or alignment[aligned_end + 1][1] == EPS)
+        and (alignment[aligned_end + 1][1].isalnum() or alignment[aligned_end + 1][1] == EPS)
     ):
         aligned_end += 1
 
-    while (aligned_end + 1) < len(alignment) and (
-        alignment[aligned_end + 1][1].isalpha() or alignment[aligned_end + 1][1] == EPS
-    ):
-        aligned_end += 1
+    if mode == TN_MODE:
+        while (aligned_end + 1) < len(alignment) and (
+            alignment[aligned_end + 1][1].isalnum() or alignment[aligned_end + 1][1] == EPS
+        ):
+            aligned_end += 1
 
     output_og_start_index = _get_original_index(alignment=alignment, aligned_index=aligned_start)
     output_og_end_index = _get_original_index(alignment=alignment, aligned_index=aligned_end + 1)
+
     return output_og_start_index, output_og_end_index
 
 
 if __name__ == '__main__':
-    logging.setLevel(logging.INFO)
+    logging.getLogger().setLevel(logging.INFO)
     args = parse_args()
     fst = Far(args.fst, mode='r')
     try:
```
```diff
@@ -240,14 +251,14 @@ def indexed_map_to_output(alignment: List[tuple], start: int, end: int):
 
     table = create_symbol_table()
     alignment, output_text = get_string_alignment(fst=fst, input_text=input_text, symbol_table=table)
-    print(f"inp string: |{args.text}|")
-    print(f"out string: |{output_text}|")
+    logging.info(f"inp string: |{args.text}|")
+    logging.info(f"out string: |{output_text}|")
 
     if args.start is None:
         indices = get_word_segments(input_text)
     else:
         indices = [(args.start, args.end)]
     for x in indices:
-        start, end = indexed_map_to_output(start=x[0], end=x[1], alignment=alignment)
-        print(f"inp indices: [{x[0]}:{x[1]}] out indices: [{start}:{end}]")
-        print(f"in: |{input_text[x[0]:x[1]]}| out: |{output_text[start:end]}|")
+        start, end = indexed_map_to_output(start=x[0], end=x[1], alignment=alignment, mode=args.grammar)
+        logging.info(f"inp indices: [{x[0]}:{x[1]}] out indices: [{start}:{end}]")
+        logging.info(f"in: |{input_text[x[0]:x[1]]}| out: |{output_text[start:end]}|")
```
