Commit f46bc88

LibriBrain100: align manifest with finalized HF upload
Fixes from the on-VM fingerprint test now that pnpl/LibriBrain2 has finished uploading sub-0 across every corpus:

- Podcasts task token is "TheMoth" on disk (not "Podcasts"). The user-facing corpus="podcasts" alias is unchanged; only the internal task token / HF folder name moves.
- Sherlock9 starts at ses-0 (the LibriVox preface track), giving 13 sessions instead of 12.
- TIMIT has 14 sessions (not 10). Per the paper, TIMIT splits are utterance-level (24-speaker core test, 50-speaker Kaldi dev), so every TIMIT session is assigned to train; finer splits will surface from event-level filtering once the per-utterance metadata in events.tsv is finalized.

After this change, the on-VM fingerprint reports 100% manifest / upload parity for sub-0 across MOCHATIMIT / Sherlock8 / Sherlock9 / TIMIT / TheMoth. The 32 broad-subject Sherlock1 ses-11/ses-12 records are still pending; the loader skips them with a single grouped warning.
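The "single grouped warning" behavior described above can be sketched as follows. This is a minimal illustration, not the actual pnpl loader API: the `filter_available` helper, the `(task, session)` tuple shape, and the warning text are all assumptions.

```python
import warnings


def filter_available(records, available):
    """Keep (task, session) records present in the upload; warn once for the rest.

    Hypothetical sketch -- the real pnpl loader's record type and API may differ.
    """
    kept = [rec for rec in records if rec in available]
    missing = [rec for rec in records if rec not in available]
    if missing:
        # Emit one grouped warning instead of one warning per missing session.
        details = ", ".join(f"{task} ses-{ses}" for task, ses in missing)
        warnings.warn(f"Skipping {len(missing)} pending records: {details}")
    return kept
```

Grouping the skips keeps the console readable when a whole block of sessions (like Sherlock1 ses-11/ses-12) is still uploading.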
1 parent f74795a commit f46bc88

2 files changed

Lines changed: 14 additions & 30 deletions

pnpl/datasets/libribrain100/constants.py

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@

 TIMIT_TASK = "TIMIT"
 MOCHATIMIT_TASK = "MOCHATIMIT"
-PODCASTS_TASK = "Podcasts"
+PODCASTS_TASK = "TheMoth"  # HF folder name; user-facing corpus is "podcasts"

 # Map task token → corpus.
 TASK_TO_CORPUS: dict[str, str] = {
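The token/alias split that this hunk relies on can be illustrated with a small sketch. Only `PODCASTS_TASK = "TheMoth"` and the `TASK_TO_CORPUS` name appear in the diff; the corpus strings, the `CORPUS_TO_TASK` inverse, and the `resolve_task` helper are assumptions for illustration.

```python
# Illustrative sketch of internal HF task tokens vs. user-facing corpus aliases.
# Only PODCASTS_TASK and TASK_TO_CORPUS come from the diff; the rest is assumed.
TIMIT_TASK = "TIMIT"
MOCHATIMIT_TASK = "MOCHATIMIT"
PODCASTS_TASK = "TheMoth"  # HF folder name; user-facing corpus is "podcasts"

TASK_TO_CORPUS: dict[str, str] = {
    TIMIT_TASK: "timit",
    MOCHATIMIT_TASK: "mochatimit",
    PODCASTS_TASK: "podcasts",  # alias survives the on-disk rename
}

# Hypothetical inverse map so user code never needs the on-disk token.
CORPUS_TO_TASK = {corpus: task for task, corpus in TASK_TO_CORPUS.items()}


def resolve_task(corpus: str) -> str:
    """Map a user-facing corpus name to the internal HF task token."""
    return CORPUS_TO_TASK[corpus.lower()]
```

With this indirection, callers keep writing `corpus="podcasts"` while the loader quietly resolves it to the `TheMoth` folder on HF.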

pnpl/datasets/libribrain100/manifest.py

Lines changed: 13 additions & 29 deletions
@@ -135,18 +135,14 @@ def _build_libribrain_sherlock_records() -> list[RunRecord]:
 # ---------------------------------------------------------------------------
 # Sherlock8 / Sherlock9 — sub-0, LibriBrain2 release.
 #
-# The HF tree shows Sherlock8 with ses-1..10 currently. Sherlock9 is
-# expected to follow the same one-session-per-LibriVox-track pattern as
-# the other books in the canon; we approximate with ses-1..12 and
-# tolerate gaps at runtime. Both are entirely train (all-canon Sherlock
-# val/test live in book 1 already).
+# Sherlock8 has ses-1..10. Sherlock9 starts at ses-0 (the book's preface
+# track on LibriVox) and runs through ses-12. Both are entirely train
+# (all-canon Sherlock val/test live in book 1 already).
 # ---------------------------------------------------------------------------

 _LIBRIBRAIN2_SHERLOCK_SESSION_RUNS: dict[str, tuple[tuple[str, str], ...]] = {
     "Sherlock8": tuple((str(i), "1") for i in range(1, 11)),
-    # Sherlock9 session count is approximate; will be reconciled with the
-    # final upload. Missing sessions are skipped at load time.
-    "Sherlock9": tuple((str(i), "1") for i in range(1, 13)),
+    "Sherlock9": tuple((str(i), "1") for i in range(0, 13)),
 }


@@ -171,32 +167,20 @@ def _build_libribrain2_sherlock_records() -> list[RunRecord]:
 # ---------------------------------------------------------------------------
 # TIMIT — sub-0, LibriBrain2 release.
 #
-# Per the paper (Sec. 3.2), the standard split follows TIMIT's official
-# core test set + Kaldi 50-speaker dev set, applied at the utterance
-# level. The MEG-side session count is not yet visible on HF; we
-# placeholder with ses-1..10 and assign one session each to val/test.
-# Final per-utterance filtering will happen at the events level once
-# the upload exposes the per-utterance metadata.
+# 14 sessions on HF. Per the paper (Sec. 3.2), the standard TIMIT split
+# is utterance-level (24-speaker core test, 50-speaker Kaldi dev), so
+# session-level partitioning here would be too coarse. We assign every
+# session to ``train`` and rely on event-level filtering (in events.tsv
+# rows) to surface the standard val/test subsets — to be wired up once
+# the per-utterance metadata is finalized in the released events files.
 # ---------------------------------------------------------------------------

-_TIMIT_TRAIN_SESSIONS = tuple(str(i) for i in range(1, 9))
-_TIMIT_VAL_SESSIONS = ("9",)
-_TIMIT_TEST_SESSIONS = ("10",)
-
-
-def _timit_partition(session: str) -> str:
-    if session in _TIMIT_VAL_SESSIONS:
-        return PARTITION_VALIDATION
-    if session in _TIMIT_TEST_SESSIONS:
-        return PARTITION_TEST
-    return PARTITION_TRAIN
+_TIMIT_SESSIONS = tuple(str(i) for i in range(1, 15))


 def _build_timit_records() -> list[RunRecord]:
     out: list[RunRecord] = []
-    for ses in (
-        _TIMIT_TRAIN_SESSIONS + _TIMIT_VAL_SESSIONS + _TIMIT_TEST_SESSIONS
-    ):
+    for ses in _TIMIT_SESSIONS:
         out.append(
             RunRecord(
                 subject=DEEP_SUBJECT,
@@ -205,7 +189,7 @@ def _build_timit_records() -> list[RunRecord]:
                 run="1",
                 corpus=CORPUS_TIMIT,
                 repo=REPO_KEY_LIBRIBRAIN2,
-                partition=_timit_partition(ses),
+                partition=PARTITION_TRAIN,
             )
         )
     return out
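The event-level filtering the TIMIT comment defers to could look roughly like this once the per-utterance metadata lands. The `timit_split` column name and its label values are assumptions about the not-yet-finalized events.tsv schema, not the released format.

```python
import csv
import io

# Hypothetical events.tsv fragment; the "timit_split" column and its labels
# are assumptions about the per-utterance metadata, which is not finalized.
EVENTS_TSV = (
    "onset\tduration\ttimit_split\n"
    "0.0\t2.1\ttrain\n"
    "2.5\t1.8\tcore_test\n"
    "4.6\t2.0\tkaldi_dev\n"
)

# Partition name -> hypothetical per-utterance split label.
SPLIT_LABEL = {"train": "train", "validation": "kaldi_dev", "test": "core_test"}


def events_for_partition(tsv_text: str, partition: str) -> list[dict]:
    """Return only the event rows whose utterance belongs to the partition."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in rows if row["timit_split"] == SPLIT_LABEL[partition]]
```

Filtering at the event level keeps every session in `train` at the manifest layer while still letting a dataset class carve out the standard 24-speaker core test and 50-speaker Kaldi dev subsets utterance by utterance.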
