Commit dfa14bd
feat: rule-based sentence segmentation — fix abbreviation splitting
Replace unicode_segmentation::unicode_sentences(), which splits on
abbreviation periods (Mr., p.m., H.G., etc.), inflating the 1-2 word
sentence bin to 22% and deflating the mean sentence length to 6.8 words.
New forward-scanning splitter with:
- 3-tier abbreviation classification (strong/internal/weak)
- Mid-dotted-sequence detection (e.g., p.m., H.G., B.B.C.)
- Decimal number handling (42.50)
- Ellipsis handling (split only before uppercase)
- Dialog tag detection ("Go!" he said — no split at !)
- 20 unit tests covering all edge cases
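The rules listed above can be illustrated with a minimal sketch. This is not the code from this commit: the names `ABBREVIATIONS`, `is_boundary`, and `split_sentences` are hypothetical, the three abbreviation tiers are collapsed into one flat list, and a single letter before a period is always treated as an initial or dotted-sequence tail, which is a simplification.

```rust
/// Hypothetical abbreviation list (assumption); the commit's actual
/// strong/internal/weak tiers are not reproduced here.
const ABBREVIATIONS: &[&str] = &["mr", "mrs", "ms", "dr", "prof", "st", "vs", "etc"];

/// Decides whether the terminator run `chars[i..end]` ends a sentence.
fn is_boundary(chars: &[char], i: usize, end: usize) -> bool {
    // First non-whitespace character after the run; none means end of text.
    let next = chars[end..].iter().copied().find(|c| !c.is_whitespace());
    let Some(next) = next else { return true; };
    // No whitespace after the run: decimals (42.50) or mid-dotted sequences (p.m).
    if !chars[end].is_whitespace() {
        return false;
    }
    // Ellipsis: split only before an uppercase letter.
    if end - i >= 3 && chars[i..i + 3].iter().all(|&c| c == '.') {
        return next.is_uppercase();
    }
    // Dialog tag: closing quote followed by lowercase ("Go!" he said).
    if matches!(chars[end - 1], '"' | '”') && next.is_lowercase() {
        return false;
    }
    if chars[i] == '.' {
        // Alphabetic word immediately before the period.
        let word_start = chars[..i]
            .iter()
            .rposition(|c| !c.is_alphabetic())
            .map_or(0, |p| p + 1);
        let word = chars[word_start..i].iter().collect::<String>().to_lowercase();
        // Single letters are treated as initials or dotted-sequence tails (H.G., p.m.).
        if word.chars().count() == 1 || ABBREVIATIONS.iter().any(|&a| a == word) {
            return false;
        }
    }
    true
}

/// Forward-scanning splitter: walks the text once and cuts at accepted boundaries.
pub fn split_sentences(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let (mut start, mut i) = (0, 0);
    while i < chars.len() {
        if !matches!(chars[i], '.' | '!' | '?') {
            i += 1;
            continue;
        }
        // Extend over the whole terminator run plus closing double quotes.
        let mut end = i + 1;
        while end < chars.len() && matches!(chars[end], '.' | '!' | '?' | '"' | '”') {
            end += 1;
        }
        if is_boundary(&chars, i, end) {
            let sentence: String = chars[start..end].iter().collect();
            sentences.push(sentence.trim().to_string());
            start = end;
        }
        i = end;
    }
    // Keep trailing text that has no terminator.
    let tail: String = chars[start..].iter().collect();
    if !tail.trim().is_empty() {
        sentences.push(tail.trim().to_string());
    }
    sentences
}
```

Under these assumptions, `split_sentences("Mr. Smith paid 42.50 at 5 p.m. today. He left.")` yields two sentences instead of splitting after `Mr.` or inside `p.m.`.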
Fingerprint after fix: mean 6.8→12.6, 1-2 bin 21.9%→9.8%,
long sentences (18+ words) 0%→22.3%, p95 14→35.1
parent 8c17be3
4 files changed
Lines changed: 466 additions & 5 deletions
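For reference, the fingerprint figures quoted in the commit message (mean words per sentence, share of 1-2 word sentences, share of 18+ word sentences, p95) could be computed roughly as follows. This is a sketch under stated assumptions: the `Fingerprint` struct and `fingerprint` function are hypothetical names, words are counted by whitespace splitting, and p95 uses linear interpolation between ranks. The non-integer p95 of 35.1 suggests some interpolation, but the project's exact method is not shown.

```rust
/// Summary statistics over per-sentence word counts (illustrative names).
#[derive(Debug)]
pub struct Fingerprint {
    pub mean_words: f64,
    pub short_share: f64, // fraction of sentences with 1-2 words
    pub long_share: f64,  // fraction of sentences with 18+ words
    pub p95_words: f64,   // 95th percentile, linearly interpolated (assumption)
}

/// Computes the fingerprint; returns None if there are no non-empty sentences.
pub fn fingerprint(sentences: &[&str]) -> Option<Fingerprint> {
    // Word count per sentence, ignoring empty sentences.
    let mut counts: Vec<usize> = sentences
        .iter()
        .map(|s| s.split_whitespace().count())
        .filter(|&n| n > 0)
        .collect();
    if counts.is_empty() {
        return None;
    }
    counts.sort_unstable();
    let n = counts.len() as f64;

    // Fraction of sentences whose word count lies in [lo, hi].
    let share = |lo: usize, hi: usize| {
        counts.iter().filter(|&&c| c >= lo && c <= hi).count() as f64 / n
    };

    // Linear interpolation between the two ranks surrounding the 95% position.
    let pos = 0.95 * (n - 1.0);
    let (lo, hi) = (pos.floor() as usize, pos.ceil() as usize);
    let p95 = counts[lo] as f64 + (pos - lo as f64) * (counts[hi] as f64 - counts[lo] as f64);

    Some(Fingerprint {
        mean_words: counts.iter().sum::<usize>() as f64 / n,
        short_share: share(1, 2),
        long_share: share(18, usize::MAX),
        p95_words: p95,
    })
}
```

Running this over the splitter's output before and after a change gives a quick check on whether short fragments are being over-produced.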