
Added opensubtitles-eu #70

Merged
haideraltahan merged 1 commit into main from OpenSubtitles on Jul 1, 2026
@haideraltahan haideraltahan commented May 21, 2026

Collaborator

Summary

I added OpenSubtitles Multi40 translation as two lm-evaluation-harness groups, opensubtitles-eu-en-xx (English→EU) and opensubtitles-eu-xx-en (EU→English). Both run 0-shot and are scored with BLEU (chrF is also computed), using the Helsinki-NLP/OpenSubtitles2024-40-langs-15-movies dataset. OpenSubtitles isn't in lm-eval-harness or lighteval, so I vendored a custom loader and per-pair tasks under custom_lm_eval_tasks/opensubtitles_multi40/. This is one of the evals from #89.

Language coverage — EU subset only

The dataset ships 40 languages; 25 of them are in our EU set, and I include all 25 (each in both directions):

Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Serbian, Turkish, Ukrainian, Norwegian

EU languages the dataset does not ship are omitted: Irish, Maltese, Catalan, Basque, Galician, Bosnian, Georgian, Macedonian, Albanian, Icelandic.
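
To make the coverage concrete, here is a minimal sketch of how 25 languages in both directions come out to 50 translation pairs. The two-letter codes are the ISO 639-1 codes for the languages listed above; whether the dataset's own config names match them exactly is an assumption, and `expand_pairs` is a hypothetical helper, not something from this PR.

```python
# ISO 639-1 codes for the 25 languages listed above.
# Assumption: the dataset's own config names may differ from these.
EU_LANGS = [
    "bg", "hr", "cs", "da", "nl", "et", "fi", "fr", "de", "el",
    "hu", "it", "lv", "lt", "pl", "pt", "ro", "sk", "sl", "es",
    "sv", "sr", "tr", "uk", "no",
]


def expand_pairs(langs, direction):
    """Hypothetical helper: return (source, target) pairs for one direction."""
    if direction == "en-xx":
        return [("en", lang) for lang in langs]
    if direction == "xx-en":
        return [(lang, "en") for lang in langs]
    raise ValueError(f"unknown direction: {direction!r}")


en_xx = expand_pairs(EU_LANGS, "en-xx")  # English -> EU language
xx_en = expand_pairs(EU_LANGS, "xx-en")  # EU language -> English

print(len(en_xx), len(xx_en), len(en_xx) + len(xx_en))
```
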

Both groups are {lang} templates with valid_langs. Since these are translation pairs, each task resolves to its non-English side (mirroring the flores200 handling in task_groups.py), so opensubtitles-eu-xx-en[deu_Latn] correctly selects de→en. I also added the hr→hrv_Latn and no→nob_Latn aliases so Croatian and Norwegian resolve.
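
The resolution rule described above can be sketched roughly as follows. This is an illustration only, not the code in task_groups.py: the pair-name format ("de-en"), the `ALIASES` table, and all three function names are hypothetical. Only the hr and no mappings and the deu_Latn example come from this description.

```python
# Hypothetical alias table: two-letter code -> FLORES-style code.
# hr and no are the aliases named in this PR; de is implied by the
# deu_Latn example. A real table would cover every language.
ALIASES = {"hr": "hrv_Latn", "no": "nob_Latn", "de": "deu_Latn"}


def non_english_side(pair: str) -> str:
    """Return the non-English code of a pair such as 'de-en' or 'en-de'."""
    src, tgt = pair.split("-")
    others = [code for code in (src, tgt) if code != "en"]
    if len(others) != 1:
        raise ValueError(f"expected exactly one non-English side in {pair!r}")
    return others[0]


def task_language(pair: str) -> str:
    """Map a pair to the language code used for bracket scoping."""
    code = non_english_side(pair)
    return ALIASES.get(code, code)


def select(pairs, lang):
    """Keep only the pairs whose non-English side resolves to `lang`."""
    return [pair for pair in pairs if task_language(pair) == lang]


xx_en_pairs = ["de-en", "hr-en", "no-en"]
print(select(xx_en_pairs, "deu_Latn"))  # ['de-en']
print(task_language("en-hr"))           # hrv_Latn
```

Without the alias entries, "hr" and "no" would fall through unchanged and never match hrv_Latn or nob_Latn, which is why the two aliases are needed for Croatian and Norwegian to resolve.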

Metric

bleu (sacrebleu), matching the custom task's metric_list.

Vendor the custom opensubtitles_multi40 loader + per-pair tasks and wire two
{lang}-templated groups (English<->EU, both directions) over the 25 EU
languages the dataset ships. Resolve translation-pair task names to their
non-English side (mirroring flores200) and add the hr/no two-letter aliases
so bracket scoping and language filtering work.
@haideraltahan haideraltahan merged commit 45fdc18 into main Jul 1, 2026
3 checks passed
@haideraltahan haideraltahan deleted the OpenSubtitles branch July 1, 2026 01:07