All sentences in ELRC-{3056-,}wikipedia_health zh-en end with spaces, possibly duplicates #4

@jelmervdl

I noticed this by chance: in every data format provided for this dataset, each line ends with a trailing space. The original source files, from https://www.elrc-share.eu/repository/browse/covid-19-health-wikipedia-dataset-bilingual-en-zh/c6236d148de811ea913100155d026706c2a9a16f8fc74d0487006e8379d322a0/, don't seem to have this issue.
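For anyone who wants to reproduce the check, here is a minimal sketch that counts how many lines end with a space. The function name `trailing_space_stats` and the file name in the usage note are made up for illustration; this is not part of any existing tooling.

```python
def trailing_space_stats(lines):
    """Return (lines_ending_with_space, total_lines) for an iterable of lines."""
    total = 0
    with_space = 0
    for line in lines:
        # Strip only the line terminator so a trailing space stays visible.
        content = line.rstrip("\r\n")
        total += 1
        if content.endswith(" "):
            with_space += 1
    return with_space, total
```

Usage, assuming a plain-text download saved under a hypothetical name: `with open("wikipedia_health.en", encoding="utf-8") as f: print(trailing_space_stats(f))`. If both numbers are equal, every line in the file ends with a space.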

Also, ELRC-3056-wikipedia_health and ELRC-wikipedia_health might be duplicates of each other. The samples are different, but the en-zh TMX files are exactly the same except for the creation header:

I haven't checked all the other imported ELRC datasets, but another en-zh dataset didn't seem to have this issue.
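To check the possible duplication without being distracted by the header, something like the following sketch could compare only the translation units of two TMX files. It assumes the standard TMX layout (`tu` / `tuv` / `seg` elements, with the language in `xml:lang` or `lang`) and that the only intended difference sits in the `header` element. The helper names `tmx_units` and `same_body` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_units(tmx_text):
    """Return the translation units of a TMX document as a list of tuples.

    Each unit is a tuple of (language, segment_text) pairs. The header is
    never read, so differences there (e.g. a creation date) are ignored.
    """
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = []
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            # itertext() also keeps text nested inside inline markup.
            text = "".join(seg.itertext()) if seg is not None else ""
            unit.append((tuv.get(XML_LANG) or tuv.get("lang"), text))
        units.append(tuple(unit))
    return units


def same_body(tmx_a, tmx_b):
    """True if both TMX documents contain identical units in the same order."""
    return tmx_units(tmx_a) == tmx_units(tmx_b)
```

Because the segment text is compared verbatim, trailing spaces count as differences, so this also shows whether both copies carry the same whitespace problem.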
