Dataset v2.4 has duplicate rows causing train/test leakage

The combined v2.4 dataset contains 7,718 duplicate rows (exact matches
or matches after normalization), 5.4% of 145,811. Under the train/test
split in config.py (test_size=0.2, random_state=42), 7.4% of the test
set (2,172 rows) also appears in the training set, so reported accuracy
is partly measuring memorization, not generalization.
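
For anyone who wants to check the overlap themselves, here is a minimal sketch of the measurement. The normalization rule (NFKC, casefolding, whitespace collapsing) is an assumption about what counts as a "normalized" duplicate, and `normalize` and `leakage_rate` are hypothetical helper names, not code from the repository. It expects the train and test texts that the split in config.py produces.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, casefold, collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())


def leakage_rate(train_texts, test_texts):
    """Return (count, fraction) of test rows whose normalized text is also in train."""
    train_keys = {normalize(t) for t in train_texts}
    leaked = sum(1 for t in test_texts if normalize(t) in train_keys)
    fraction = leaked / len(test_texts) if test_texts else 0.0
    return leaked, fraction
```

Comparing on a set of normalized keys keeps the check linear in the number of rows, which matters little at this size but avoids a pairwise comparison.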

Also found: ~80 machine-translation artifacts mislabeled as SMS
(e.g. "Sorry, I cannot provide a translation...") and a few
empty/symbol-only rows.
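
A rough sketch of how rows like these could be flagged. The pattern list is an assumption: only the "cannot provide a translation" phrase comes from the example above, and a real list would need more entries. `is_artifact` and `is_empty_or_symbol_only` are hypothetical names.

```python
import re

# Assumed marker phrases for machine-translation refusals.
# Only this one is taken from the example quoted in the issue.
ARTIFACT_PATTERNS = [
    re.compile(r"\bcannot provide a translation\b", re.IGNORECASE),
]


def is_artifact(text: str) -> bool:
    """True if the text matches any known translation-artifact phrase."""
    return any(p.search(text) for p in ARTIFACT_PATTERNS)


def is_empty_or_symbol_only(text: str) -> bool:
    """True if the text contains no letters or digits at all."""
    return not any(ch.isalnum() for ch in text)
```

Flagged rows are probably worth reviewing by hand before deletion, since a genuine SMS could in principle contain one of these phrases.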

I have a reproducible cleaning script (dedup + artifact removal +
Unicode normalization) and verified that leakage drops to 0% after
cleaning. Happy to submit a PR. Would a deduplicated v2.4.1 dataset
plus a cleaning utility be useful?
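
To make the proposal concrete, a minimal sketch of what the dedup step could look like. This is an illustration under assumptions, not the actual script: rows are taken to be `(text, label)` pairs, the first occurrence of each normalized text is kept, and `dedup_key` and `deduplicate` are hypothetical names.

```python
import unicodedata


def dedup_key(text: str) -> str:
    """Assumed comparison key: NFKC, casefold, collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def deduplicate(rows):
    """Keep the first (text, label) row for each normalized text, preserving order."""
    seen = set()
    kept = []
    for text, label in rows:
        key = dedup_key(text)
        if key in seen:
            continue  # a row with the same normalized text was already kept
        seen.add(key)
        kept.append((text, label))
    return kept
```

Running the dedup before the split means no normalized text can land in both train and test. Note that this sketch silently keeps the first label when duplicates disagree, which a real utility might want to report instead.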

Dataset v2.4 has duplicate rows causing train/test leakage #194

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Dataset v2.4 has duplicate rows causing train/test leakage #194

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions