dynamic pruning and substitution restrictions (#60)
* massive speed improvement, with some WER improvements too in large complicated files with lots of deletions
* adding a flexible beam, adjustable via command line parameters
* adding --strict-punctuation mode, that will allow punctuation marks to be substituted only within themselves
* adding support for strict punctuation and favouring same words in alignments
* fixing the default value of the beam to 50
* adding new test case file
* adding new result for std composition
* fixes, and new test cases updates
* bumping version and adding release notes to the readme
README.md: 92 additions, 0 deletions
@@ -3,6 +3,7 @@
 
 # fstalign
 - [Overview](#Overview)
+- [What's new in 2.0](#What's-new-in-2.0)
 - [Installation](#Installation)
   * [Dependencies](#Dependencies)
   * [Build](#Build)
@@ -14,6 +15,97 @@
 
 Due to its use of OpenFST and lazy algorithms for text-based alignment, `fstalign` is efficient for calculating WER while also providing significant flexibility for different measurement features and error analysis.
 
+## What's new in 2.0
+
+Version 2.0 introduces two major changes:
+1. A new method to traverse the composition graph, which dramatically improves the overall speed, especially when the sequences are long and contain many errors.
+   Some files that previously took 25 minutes to align now take about 7 seconds. This is especially noticeable with the adapted composition (the default).
+2. Smarter alignment rules when `--use-case` and `--use-punctuation` are enabled.
+   By default, punctuation symbols can now only be substituted by other punctuation symbols (or deleted/inserted).
+   Also, words that differ only by the case of their first letter are preferred for substitution.
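To give a rough feel for why pruning helps on long, error-heavy sequences, here is a minimal sketch of a beam-pruned edit distance. It is an illustration only: `fstalign` traverses an OpenFST composition graph, whereas this sketch swaps in a plain dynamic-programming lattice. The function name `beam_edit_distance`, the unit edit costs, and the per-row pruning rule are all assumptions made for the example; only the beam default of 50 comes from the text above.

```python
INF = float("inf")


def beam_edit_distance(ref, hyp, beam=50.0):
    """Edit distance between two token lists with per-row beam pruning.

    Hypothetical sketch: cells whose cost exceeds the cheapest cell of the
    same row by more than `beam` are dropped and never expanded again.
    Unit costs for insertion, deletion and substitution are assumed.
    """
    # Row 0: an empty reference prefix needs one insertion per hyp token.
    prev = {j: float(j) for j in range(len(hyp) + 1) if j <= beam}
    for i in range(1, len(ref) + 1):
        cur = {}
        row_min = INF
        j = min(prev)          # leftmost surviving cell of the previous row
        last_fed = max(prev) + 1  # last column fed directly by the previous row
        while j <= len(hyp):
            best = prev.get(j, INF) + 1.0  # delete ref[i-1]
            if j > 0:
                sub = 0.0 if ref[i - 1] == hyp[j - 1] else 1.0
                best = min(
                    best,
                    prev.get(j - 1, INF) + sub,  # match or substitution
                    cur.get(j - 1, INF) + 1.0,   # insert hyp[j-1]
                )
            cur[j] = best
            row_min = min(row_min, best)
            # Past the columns fed by the previous row, only insertions
            # remain, so costs can only grow: stop once outside the beam.
            if j >= last_fed and best > row_min + beam:
                break
            j += 1
        prev = {k: c for k, c in cur.items() if c <= row_min + beam}
    return prev.get(len(hyp), INF)
```

With a wide beam the result matches the exact edit distance; a narrow beam visits far fewer cells per row but may return a higher cost, or `inf` if the final cell was pruned.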
+
+
+Here's an example of the 1.x behavior compared with the 2.0 behavior:
+```
+==> v1.x sbs.txt <==
+ref_token  hyp_token  IsErr  Class  Wer_Tag_Entities
+Welcome    Welcome                  ###322_###|
+back       back
+to         to
+another    another
+episode    episode                  ###323_###|
+of         of
+Podcasts   Podcast    ERR           ###324_###|
+in         and        ERR
+Color      Color                    ###167_###|###325_###|
+:          of         ERR
+The        the        ERR
+Podcast    Podcast                  ###168_###|###326_###|
+.          .
+I          I
+
+==> v2.0 sbs.txt <==
+ref_token  hyp_token  IsErr  Class  Wer_Tag_Entities
+Welcome    Welcome                  ###322_###|
+back       back
+to         to
+another    another
+episode    episode                  ###323_###|
+of         of
+Podcasts   Podcast    ERR           ###324_###|
+in         and        ERR
+Color      Color                    ###167_###|###325_###|
+<ins>      of         ERR
+:          <del>      ERR
+The        the        ERR
+Podcast    Podcast                  ###168_###|###326_###|
+```
+The confusion between `:` and `of` is no longer allowed.
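The effect of the punctuation restriction can be reproduced with a small sketch: a textbook edit-distance aligner in which a substitution between a punctuation token and a word is given infinite cost. This is not how `fstalign` implements it (it works on FST compositions); the helper names `is_punct`, `sub_cost` and `align`, the unit edit costs, and the ASCII-only punctuation test are assumptions made for illustration.

```python
import string

INF = float("inf")


def is_punct(token):
    """Assumed heuristic: a token is punctuation if all chars are ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def sub_cost(ref_tok, hyp_tok, strict_punctuation=True):
    if ref_tok == hyp_tok:
        return 0.0
    if strict_punctuation and is_punct(ref_tok) != is_punct(hyp_tok):
        return INF  # punctuation may not be substituted by a word, or vice versa
    return 1.0


def align(ref, hyp, strict_punctuation=True):
    """Return (ref_token, hyp_token) pairs, using <ins>/<del> placeholders."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = float(i)
    for j in range(1, m + 1):
        cost[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1], strict_punctuation),
                cost[i - 1][j] + 1.0,  # deletion
                cost[i][j - 1] + 1.0,  # insertion
            )
    # Trace back from the final cell, preferring match/substitution, then deletion.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub_cost(
            ref[i - 1], hyp[j - 1], strict_punctuation
        ):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1.0:
            pairs.append((ref[i - 1], "<del>"))
            i -= 1
        else:
            pairs.append(("<ins>", hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


ref = ["Color", ":", "The", "Podcast"]
hyp = ["Color", "of", "the", "Podcast"]
strict = align(ref, hyp, strict_punctuation=True)
loose = align(ref, hyp, strict_punctuation=False)
```

On this fragment the strict variant yields a separate `<ins> of` and `: <del>`, while the unrestricted variant pairs `:` with `of`. Note that the strict alignment has one more error than the unrestricted one, so the restriction can change the WER.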
+
+Also, here's what favoring (or not favoring) a substitution based on a case-insensitive comparison looks like; the substitution is still counted as an error:
+```
+==> v1.x sbs.txt <==
+ref_token  hyp_token  IsErr  Class  Wer_Tag_Entities
+shorten    shorten                  ###801_###|
+It's       it's       ERR
+Berry      Barry      ERR           ###785_###|###788_###|###802_###|
+.          .
+Just       Just
+Yeah       like       ERR           ###805_###|
+.          <del>      ERR
+Like       <del>      ERR
+,          <del>      ERR
+I          I                        ###809_###|
+have       have
+a          a
+nickname   nickname
+
+==> v2.0 sbs.txt <==
+ref_token  hyp_token  IsErr  Class  Wer_Tag_Entities
+It's       it's       ERR
+Berry      Barry      ERR           ###785_###|###788_###|###802_###|
+.          .
+Just       Just
+Yeah       <del>      ERR           ###805_###|
+.          <del>      ERR
+Like       like       ERR
+,          <del>      ERR
+I          I                        ###809_###|
+have       have
+a          a
+nickname   nickname
+```
+Here, the `Like <-> like` substitution is favored. While this generally won't change the WER value itself (although it can), it will improve the timing alignments.
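A short sketch shows why a reduced substitution cost breaks the tie between the two alignments of the `Yeah . Like ,` segment. The function names, the unit cost for ordinary edits, and the full case-insensitive comparison are assumptions made for illustration; only the favored cost of 0.1 is taken from the documented default.

```python
def substitution_cost(ref_tok, hyp_tok, favored_sub_cost=0.1, favor_case_subs=True):
    """Hypothetical cost model: case-only differences are cheap but not free."""
    if ref_tok == hyp_tok:
        return 0.0
    if favor_case_subs and ref_tok.lower() == hyp_tok.lower():
        return favored_sub_cost
    return 1.0  # assumed unit cost for an ordinary substitution


def path_cost(pairs, **kwargs):
    """Total cost of an alignment given as (ref_token, hyp_token) pairs."""
    total = 0.0
    for ref_tok, hyp_tok in pairs:
        if ref_tok == "<ins>" or hyp_tok == "<del>":
            total += 1.0  # assumed unit cost for insertions and deletions
        else:
            total += substitution_cost(ref_tok, hyp_tok, **kwargs)
    return total


def error_count(pairs):
    """Number of rows that would be flagged ERR (any non-identical pair)."""
    return sum(1 for ref_tok, hyp_tok in pairs if ref_tok != hyp_tok)


# The two candidate alignments from the side-by-side output above.
v1_path = [("Yeah", "like"), (".", "<del>"), ("Like", "<del>"), (",", "<del>")]
v2_path = [("Yeah", "<del>"), (".", "<del>"), ("Like", "like"), (",", "<del>")]
```

Under these assumptions both paths contain four errors, so the WER is the same, but with favoring enabled the second path costs about 3.1 against 4.0 for the first, so the aligner picks it. With `favor_case_subs=False` the two paths tie at 4.0 and the choice is arbitrary.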
+
+
+These behaviors, as well as the beam size (which defaults to 50.0), can be controlled with the following new parameters:
+```
+--disable-strict-punctuation
+                            Disable strict punctuation alignment (which prevents punctuation aligning with words).
+--disable-favored-subs      Disable favored substitutions (which makes alignment favor substitutions between words which differ only by case).
+--favored-sub-cost FLOAT    Cost for favored substitutions (e.g., case diff). Default: 0.1