A .NET 9 / C# 13 class library designed to normalize text containing emojis, currency symbols, numbers, abbreviations, and other non-standard elements, making it suitable for consistent and natural-sounding Text-to-Speech (TTS) synthesis across different engines (e.g., System.Speech, KokoroSharp). Specifically tailored for scenarios involving user-generated content like Twitch/YouTube chat and donations.
6
+
A .NET 9 / C# 13 class library designed to normalize text containing emojis, currency symbols, numbers, URLs, abbreviations, and other non-standard elements, making it suitable for consistent and natural-sounding Text-to-Speech (TTS) synthesis across different engines (e.g., System.Speech, KokoroSharp). Specifically tailored for scenarios involving user-generated content like Twitch/YouTube chat and donations.
7
7
8
8
## Problem Solved
9
9
@@ -13,31 +13,41 @@ TTS engines often struggle with or produce inconsistent results when encounterin
13
13
* Currency symbols and codes from various locales (e.g., $, £, €, USD, JPY, BRL)
14
14
* Different number formats (cardinals, ordinals, decimals, version numbers)
15
15
* Common chat/gaming abbreviations and slang (e.g., lol, brb, gg, afk)
16
+
* URLs (e.g., https://example.com, www.test.org)
16
17
* Excessive punctuation or letter repetitions (e.g., !!!, ???, sooooo)
17
-
* URLs or non-standard characters
18
+
* Non-standard characters
18
19
19
20
This library preprocesses input text using a configurable pipeline of rules to replace or adjust these elements *before* sending the text to the TTS engine, leading to a more predictable, consistent, and pleasant listening experience.
20
21
21
22
## Features
22
23
23
-
* **Emoji Normalization:** Replaces Unicode emojis (including flags, ZWJ sequences) with descriptive text (e.g., ✨ -> `sparkles`, 👍 -> `thumbs up`, 🇬🇧 -> `flag United Kingdom`) using an up-to-date emoji dataset processed by a source generator.
24
-
* **Currency Normalization:** Detects currency symbols and ISO codes known to the .NET runtime. Replaces amounts with spoken text using locale-aware mappings (e.g., `$10.50` -> `ten dollars fifty cents`, `€100` -> `one hundred euros`, `500 JPY` -> `five hundred yen`). Uses Humanizer for number-to-word conversion. Requires manual mapping for TTS spoken names per ISO code.
25
-
* **Number Normalization:** Handles standalone cardinals ("123" -> `one hundred and twenty-three`), ordinals ("1st" -> `first`), decimals ("1.5" -> `one point five`), and basic version-like numbers ("1.2.3" -> `one point two point three`) using Humanizer and custom Regex.
26
-
* **Abbreviation/Acronym Expansion:** Expands a comprehensive list of common chat, gaming, and streaming abbreviations (e.g., `lol` -> `laughing out loud`, `gg` -> `good game`, `afk` -> `away from keyboard`). Case-insensitive and whole-word matching.
27
-
* **Basic Text Sanitization:** Normalizes line breaks, removes common problematic control/formatting characters, and replaces non-standard "fancy" punctuation (smart quotes, dashes, ellipsis) with ASCII equivalents.
24
+
* **Emoji Normalization:** Replaces Unicode emojis (including flags, ZWJ sequences) with descriptive text (e.g., ✨ -> `sparkles`).
* **Currency Normalization:** Detects currency symbols and ISO codes. Replaces amounts with spoken text using locale-aware mappings (e.g., `$10.50` -> `ten US dollars fifty cents`). Uses Humanizer.
27
+
* **Number Normalization:** Handles standalone cardinals ("123" -> `one hundred and twenty-three`), ordinals ("1st" -> `first`), decimals ("1.5" -> `one point five`), and version-like numbers ("1.2.3" -> `one point two point three`). Uses Humanizer.
28
+
* **URL Normalization:** Replaces detected URLs (http, https, www) with a placeholder (default: " link ").
29
+
    * *Configurable:* Specify custom placeholder text via `UrlRuleOptions`.
30
+
* **Abbreviation/Acronym Expansion:** Expands a comprehensive list of common chat, gaming, and streaming abbreviations (e.g., `lol` -> `laughing out loud`). Case-insensitive and whole-word matching.
31
+
    * *Configurable:* Add custom abbreviations or completely replace the default list via `AbbreviationRuleOptions`.
32
+
* **Basic Text Sanitization:** Normalizes line breaks, removes common problematic control/formatting characters, and replaces non-standard "fancy" punctuation with ASCII equivalents.
    * Reduces excessive letter repetitions (`soooo` -> `soo`).
31
-
* **Whitespace Normalization:** Trims leading/trailing whitespace, collapses multiple internal whitespace characters to a single space, and normalizes spacing around common punctuation (removes space before, ensures space after).
32
-
* **Extensibility:** Designed around a pipeline of `ITextNormalizationRule` instances, easily configurable via Dependency Injection. Custom rules can be created by implementing the interface.
33
-
* **Performance:** Optimized using modern .NET features like source generators (Regex, Emoji data), `FrozenDictionary` for lookups, and efficient string handling where possible.
36
+
* **Whitespace Normalization:** Trims leading/trailing whitespace, collapses multiple internal whitespace characters to a single space, and normalizes spacing around common punctuation.
37
+
* **Extensibility & Configuration:**
38
+
    * Designed around a pipeline of `ITextNormalizationRule` instances.
39
+
    * Easily configurable via Dependency Injection using `AddTextNormalization`.
40
+
    * Rule execution order can be overridden during registration.
41
+
    * Specific rules offer configuration via the standard .NET Options pattern (`IOptions<T>`).
42
+
    * Custom rules can be created by implementing the `ITextNormalizationRule` interface.
43
+
* **Performance:** Optimized using modern .NET features like source generators (Regex, Emoji data), `FrozenDictionary` for lookups, `IOptions`, and efficient string handling where possible.
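
The rule pipeline described in the feature list can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: `ISketchRule`, the three rule classes, the `Order` values, and `SketchPipeline` are hypothetical names invented for illustration, and the real `ITextNormalizationRule` interface may look different. It only shows the general idea of applying ordered rules in sequence, using URL replacement, repetition reduction, and whitespace collapsing as examples.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Rules are passed in arbitrary order; the pipeline sorts them by Order.
var rules = new ISketchRule[]
{
    new WhitespaceSketchRule(),
    new UrlSketchRule(),
    new RepetitionSketchRule(),
};
Console.WriteLine(SketchPipeline.Normalize("soooo   good, see https://example.com now", rules));
// prints "soo good, see link now"

// Hypothetical stand-in for a normalization rule (assumption, not the library's interface).
public interface ISketchRule
{
    int Order { get; }            // lower values run first
    string Apply(string input);
}

// Replaces http/https/www URLs with a placeholder (default " link ", as in the feature list).
public sealed class UrlSketchRule : ISketchRule
{
    private static readonly Regex Url =
        new(@"\b(?:https?://|www\.)\S+", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    private readonly string _placeholder;
    public UrlSketchRule(string placeholder = " link ") => _placeholder = placeholder;
    public int Order => 10;
    public string Apply(string input) => Url.Replace(input, _placeholder);
}

// Reduces a letter repeated three or more times to two occurrences (soooo -> soo).
public sealed class RepetitionSketchRule : ISketchRule
{
    private static readonly Regex Repeats = new(@"(\p{L})\1{2,}", RegexOptions.Compiled);
    public int Order => 20;
    public string Apply(string input) => Repeats.Replace(input, "$1$1");
}

// Collapses whitespace runs to a single space and trims both ends.
public sealed class WhitespaceSketchRule : ISketchRule
{
    private static readonly Regex Spaces = new(@"\s+", RegexOptions.Compiled);
    public int Order => 100;
    public string Apply(string input) => Spaces.Replace(input, " ").Trim();
}

public static class SketchPipeline
{
    // Applies every rule in ascending Order, feeding each rule the previous rule's output.
    public static string Normalize(string input, IEnumerable<ISketchRule> rules) =>
        rules.OrderBy(r => r.Order).Aggregate(input, (text, rule) => rule.Apply(text));
}
```

Ordering matters in this sketch: the URL rule runs before the repetition rule so that `www` is replaced as part of a URL rather than shortened to `ww`, and whitespace collapsing runs last to clean up the padding around the placeholder.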
34
44
35
45
## Technology
36
46
37
-
* **C# 13 / .NET 9 (Preview)**: Leverages the latest language and runtime features.
47
+
* **C# 13 / .NET 9**: Leverages the latest language and runtime features.
38
48
* **Source Generators:** Used for generating optimized Regex patterns and embedding up-to-date Emoji data at compile time.
39
49
* **Humanizer:** Used for robust number-to-words and ordinal conversion.
40
-
* **Core .NET Libraries:** `System.Text.RegularExpressions`, `System.Globalization`, `System.Collections.Frozen`, `System.Text.Json` (in generator).
50
+
* **Core .NET Libraries:** `System.Text.RegularExpressions`, `System.Globalization`, `System.Collections.Frozen`, `System.Text.Json` (in generator), `Microsoft.Extensions.Options`.
41
51
* **Dependency Injection:** Designed for easy integration using `Microsoft.Extensions.DependencyInjection`.
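
As a rough illustration of how `System.Collections.Frozen` and `System.Text.RegularExpressions` can combine for case-insensitive, whole-word abbreviation expansion, consider the sketch below. `AbbreviationSketch` and its small map are hypothetical and written only for this example; the library's actual rule, default list, and options class are not shown here and may work differently.

```csharp
using System;
using System.Collections.Frozen;
using System.Collections.Generic;
using System.Text.RegularExpressions;

Console.WriteLine(AbbreviationSketch.Expand("LOL, gg"));
// prints "laughing out loud, good game"

// Hypothetical helper (assumption, not part of the library).
public static class AbbreviationSketch
{
    // Built once, read many times: a frozen dictionary with a case-insensitive comparer.
    private static readonly FrozenDictionary<string, string> Map =
        new Dictionary<string, string>
        {
            ["lol"] = "laughing out loud",
            ["gg"] = "good game",
            ["afk"] = "away from keyboard",
        }.ToFrozenDictionary(StringComparer.OrdinalIgnoreCase);

    // Matches whole words only, so "egg" is never treated as containing "gg".
    private static readonly Regex Word = new(@"\b\w+\b", RegexOptions.Compiled);

    // Replaces each word found in the map with its expansion; other words pass through unchanged.
    public static string Expand(string input) =>
        Word.Replace(input, m => Map.TryGetValue(m.Value, out var full) ? full : m.Value);
}
```

Looking up each whole word in the dictionary, rather than building one large alternation pattern, keeps the regex independent of the abbreviation list, which is one plausible way to let callers add or replace entries at configuration time.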
42
52
43
53
## Getting Started
@@ -57,27 +67,57 @@ Or install `Agash.TTSTextNormalization` via the NuGet Package Manager in Visual
57
67
using Microsoft.Extensions.DependencyInjection;
58
68
using TTSTextNormalization.Abstractions; // For ITextNormalizer
59
69
using TTSTextNormalization.DependencyInjection; // For extension methods
70
+
using TTSTextNormalization.Rules; // For rule options classes
71
+
using System.Collections.Frozen; // For FrozenDictionary
60
72
61
73
// ... other using statements
62
74
63
75
var services = new ServiceCollection();
64
76
65
-
// Configure the normalization pipeline with desired rules