Dataset v2.4 has duplicate rows causing train/test leakage #194

@Frhnfaya

The combined v2.4 dataset contains 7,718 duplicate rows, counting both
exact duplicates and rows that match after normalization (about 5.3%
of 145,811). Under the training split in config.py
(test_size=0.2, random_state=42), 7.4% of the test set (2,172 rows)
also appears in the training set — so reported accuracy is partly
measuring memorization, not generalization.
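
For anyone who wants to check the overlap themselves, here is a minimal sketch of one way to measure it. It is not the actual script: the `normalize` and `leakage_rate` helpers are hypothetical names, and the normalization (NFKC, whitespace collapsing, case-folding) is an assumption about what "normalized duplicate" means here. It takes the train and test texts as plain lists, however the split in config.py produces them.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, collapse whitespace, case-fold."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().casefold()


def leakage_rate(train_texts, test_texts) -> float:
    """Fraction of test rows whose normalized text also occurs in train."""
    train_keys = {normalize(t) for t in train_texts}
    test_texts = list(test_texts)
    if not test_texts:
        return 0.0
    leaked = sum(normalize(t) in train_keys for t in test_texts)
    return leaked / len(test_texts)
```

Running `leakage_rate` on the two halves of the split before and after cleaning should show the drop described above, assuming the same normalization is used for both the dedup step and the check.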

Also found: ~80 machine-translation artifacts mislabeled as SMS
(e.g. "Sorry, I cannot provide a translation...") and a few
empty/symbol-only rows.

I have a reproducible cleaning script (dedup + artifact removal +
unicode normalization) and verified leakage drops to 0% after
cleaning. Happy to submit a PR: does a deduplicated v2.4.1 dataset
plus a cleaning utility sound useful?
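
To make the proposal concrete, a rough sketch of what such a cleaning pass could look like is below. This is an illustration under assumptions, not the script mentioned above: `clean_rows`, `ARTIFACT_PREFIXES`, and the `(text, label)` row shape are hypothetical, and only the first artifact phrase is taken from this issue. Rows are deduplicated on normalized text, keeping the first occurrence.

```python
import re
import unicodedata

# Assumed marker for machine-translation artifacts. Only this phrase is
# quoted in the issue; a real list would need to be built from the data.
ARTIFACT_PREFIXES = ("sorry, i cannot provide a translation",)


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, collapse whitespace, case-fold."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().casefold()


def clean_rows(rows):
    """Drop empty/symbol-only rows, artifact rows, and normalized duplicates.

    rows: iterable of (text, label) pairs (hypothetical shape).
    """
    seen = set()
    kept = []
    for text, label in rows:
        key = normalize(text)
        # Empty or symbol-only: no letter or digit survives normalization.
        if not any(ch.isalnum() for ch in key):
            continue
        if key.startswith(ARTIFACT_PREFIXES):
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append((text, label))
    return kept
```

One design choice to flag: this deduplicates on text alone, so if the same message appears with conflicting labels, only the first label survives. Whether that is acceptable, or whether such rows should be dropped or reviewed, is an open question for the PR.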
