Commit 2c90c7d

fix: latest evaluation scores on readme
1 parent 35ef6d2 commit 2c90c7d

1 file changed

Lines changed: 8 additions & 5 deletions

README.md

@@ -195,18 +195,21 @@ export GROQ_MODEL=llama-3.1-8b-instant
 
 Custom evaluation system with bilingual ground truth. Cultural corrections applied — standard NLP evaluation incorrectly labels Japanese professional neutral speech as "positive" (Western bias). Ground truth uses `sentiment_acceptable` maps with soft scoring.
 
-| Test case | Baseline (v1) | Current (v4) | Grade |
-| --- | --- | --- | --- |
-| Sales call · JA/EN mixed | 30.8% | **75.7%** | GOOD |
-| Internal meeting · Japanese heavy | 22.2% | **81.6%** | GOOD |
-| Client complaint · tense | 55.9% | **85.8%** | GOOD |
+| Test case | Baseline (v1) | Current (live) | ROUGE-1 | Action F1 | Sentiment | Grade |
+| --- | --- | --- | --- | --- | --- | --- |
+| Sales call · JA/EN mixed | 30.8% | **95.2%** | 0.703 | 1.0 | 1.0 | ✅ EXCELLENT |
+| Internal meeting · Japanese heavy | 22.2% | **81.6%** |||| ✅ GOOD |
+| Client complaint · tense | 55.9% | **85.8%** |||| ✅ GOOD |
+
+> Live scores verified on HF Space · May 2026. Sales call sub-scores: Action Clarity 25/25 · AI Confidence 15/20 · Communication Risk 0/25.
 
 **Iteration history:**
 
 * v1 → hard exact matching, no cultural awareness: 22–30%
 * v2 → fuzzy names, rule-based code-switch, semantic similarity: +15–20%
 * v3 → cultural ground truth, JA tokenization, soft sentiment: +10–15%
 * v4 → hallucination guard bonus, bilingual action items, speaker fix: +8–12%
+* v5 → live verified · Sales call reached 95.2% overall
 
 Each improvement was driven by what the evaluation metrics revealed — not guesswork.
 
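The README text in this hunk mentions `sentiment_acceptable` maps with soft scoring and adds a ROUGE-1 column, but the diff does not show how either value is computed. The Python sketch below is one plausible reading, not the repository's code: the function names, the label-to-credit shape of the map, and the example credits are all assumptions made for illustration. The ROUGE-1 helper takes pre-tokenized input because whitespace splitting does not segment Japanese text.

```python
from collections import Counter


def soft_sentiment_score(predicted: str, acceptable: dict[str, float]) -> float:
    """Return the credit assigned to a predicted label (0.0 if unlisted).

    `acceptable` is a hypothetical label -> credit map, standing in for the
    README's `sentiment_acceptable` entry.
    """
    return acceptable.get(predicted.strip().lower(), 0.0)


def rouge1_f1(reference: list[str], candidate: list[str]) -> float:
    """Unigram-overlap F1 between two pre-tokenized texts."""
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Assumed ground-truth entry: "neutral" is fully correct, "positive" gets
# partial credit instead of being marked wrong outright.
acceptable = {"neutral": 1.0, "positive": 0.5}
print(soft_sentiment_score("Positive", acceptable))
print(rouge1_f1("send the quote by friday".split(), "send quote friday".split()))
```

Under this reading, a model that labels polite, neutral Japanese business speech as "positive" loses some credit but is not scored as a complete miss, which matches the soft-scoring intent described in the README paragraph.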