The combined v2.4 dataset contains 7,718 exact/normalized duplicate
rows (5.4% of 145,811). Under the training split in config.py
(test_size=0.2, random_state=42), 7.4% of the test set (2,172 rows)
also appears in the training set — so reported accuracy is partly
measuring memorization, not generalization.
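A minimal sketch of how this overlap can be measured, using only the standard library. The normalization rule (NFKC, case-folding, whitespace collapsing) and the names `normalize_text` and `leakage` are assumptions for illustration, not the repo's actual code; the split itself is assumed to come from the existing config.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """NFKC-normalize, case-fold, and collapse whitespace (assumed rule)."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def leakage(train_texts, test_texts):
    """Return (count, fraction) of test rows whose normalized text
    also occurs somewhere in the training set."""
    seen = {normalize_text(t) for t in train_texts}
    hits = sum(1 for t in test_texts if normalize_text(t) in seen)
    fraction = hits / len(test_texts) if test_texts else 0.0
    return hits, fraction
```

Calling `leakage(train_texts, test_texts)` on the two halves of the split gives the overlapping row count and its share of the test set.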
Also found: ~80 machine-translation artifacts mislabeled as SMS
(e.g. "Sorry, I cannot provide a translation...") and a few
empty/symbol-only rows.
I have a reproducible cleaning script (dedup + artifact removal +
unicode normalization) and verified leakage drops to 0% after
cleaning. Happy to submit a PR: would a deduplicated v2.4.1 dataset
plus a cleaning utility be useful?
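For reference, a rough sketch of what such a cleaning pass could look like. This is not the actual script: `ARTIFACT_PREFIXES`, `_key`, and `clean_rows` are hypothetical names, the prefix list holds only the one example quoted above, and rows are assumed to be `(text, label)` pairs.

```python
import re
import unicodedata

# Hypothetical artifact markers; only the example quoted in this issue is listed.
ARTIFACT_PREFIXES = ("sorry, i cannot provide a translation",)


def _key(text: str) -> str:
    """Normalized form used for comparison (NFKC, case-fold, whitespace)."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def clean_rows(rows):
    """Drop empty/symbol-only rows, translation artifacts, and normalized
    duplicates. The first occurrence of each message is kept as-is."""
    seen = set()
    kept = []
    for text, label in rows:
        key = _key(text)
        # Empty or symbol-only: no letter or digit survives normalization.
        # Note this also drops emoji-only messages.
        if not any(ch.isalnum() for ch in key):
            continue
        if key.startswith(ARTIFACT_PREFIXES):
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append((text, label))
    return kept
```

Deduplicating before the train/test split means no normalized message can land on both sides of it. Duplicates with conflicting labels are resolved here by keeping the first row, which is a simplification.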