Standardize audio manifest container paths#1437
Conversation
Add a shared /data-oriented audio path helper and thread --audio-prefix through ASR Leaderboard, FLEURS, CoVoST2, ContextASR-Bench, and LibriSpeech-PC preparation. Prepared manifests now consistently write in-container paths like /data/<benchmark>/... to match the burst_eval_vllm <data_dir>:/data mount convention. BREAKING: contextasr-bench --audio-prefix now means the global container root. Old usage --audio-prefix /data/contextasr-bench should become --audio-prefix /data. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Allow prepare.py to write under <data_dir>/asr-leaderboard so it matches the other large audio benchmark prepare scripts and avoids relying on package-directory writes. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Route AudioBench, MMAU-Pro, MUSAN, and Numb3rs prepared audio paths through the shared container-root helpers so their manifests follow the same /data/<benchmark>/... convention. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>
📝 WalkthroughWalkthroughThis PR standardizes audio path handling across nine dataset preparation scripts by introducing shared container audio root utilities and applying them consistently. All datasets now support configurable in-container audio prefixes via ChangesContainer Audio Path Standardization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/dataset/mmau-pro/prepare.py (1)
72-108:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHonor
with_audioso--no-audioactually disables audio fieldsLine 164 threads
with_audio, but Line 98 ignores it and still writes audio metadata wheneveraudio_pathexists. That breaks the--no-audiocontract by emitting audio references even when audio download is skipped.Suggested fix
def format_entry(entry, with_audio=False, audio_root="/data"): @@ - if entry.get("audio_path"): + if with_audio and entry.get("audio_path"): audio_path = entry["audio_path"] @@ - elif isinstance(audio_path, str): + elif isinstance(audio_path, str): audio_path = _normalize_audio_path(audio_path, audio_root) formatted_entry["audio_path"] = audio_path user_message["audio"] = {"path": audio_path, "duration": 10.0}As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
Also applies to: 164-164
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/mmau-pro/prepare.py` around lines 72 - 108, The format_entry function currently adds audio metadata and formatted_entry["audio_path"] whenever entry.get("audio_path") exists, ignoring the with_audio flag; update format_entry to check with_audio and only normalize/set formatted_entry["audio_path"] and user_message["audio"/"audios"] when with_audio is True, otherwise skip adding any audio-related keys (both in the single-string branch and the list branch) so the --no-audio behavior is honored; ensure you still keep the audio_path raw value out of the output when with_audio is False and use the existing helper _normalize_audio_path for normalization when allowed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/dataset/librispeech-pc/prepare.py`:
- Around line 217-220: The file prepare.py failed CI due to ruff-format
formatting changes; run the formatter (ruff format) on this file and commit the
result so the formatting of the block around download_manifests(...) and the
process_split(...) generator expression matches the project's ruff settings;
specifically ensure the call to download_manifests and the splits/total
assignment (where process_split is used) are formatted per ruff (line breaks,
spacing, and parentheses) to satisfy the linter.
In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Around line 112-114: The file has formatting issues flagged by ruff-format
around the audio writing calls (see the audio_file_path variable and
sf.write(...) usages at the block with audio_file_path = audio_dir /
audio_filename and the similar block around lines where sf.write is called
again), so run the code formatter (ruff format) on
nemo_skills/dataset/numb3rs/prepare.py and commit the resulting changes so the
spacing/line breaks and import/order style conform to the project's ruff config;
ensure the two locations referencing audio_file_path and sf.write are updated by
the formatter and no manual code changes are introduced.
In `@tests/test_audio_path_prefix.py`:
- Around line 110-117: Run the ruff formatter on tests/test_audio_path_prefix.py
and commit the changes so pre-commit CI passes; specifically reformat the
assertion block that references formatted["messages"][1]["audio"]["path"] and
the subsequent call to _assert_clean_audio_paths(formatted,
"/data/asr-leaderboard/") (also apply the ruff-format fixes around the later
lines ~133-138 mentioned in the workflow), then stage and commit the updated
file.
---
Outside diff comments:
In `@nemo_skills/dataset/mmau-pro/prepare.py`:
- Around line 72-108: The format_entry function currently adds audio metadata
and formatted_entry["audio_path"] whenever entry.get("audio_path") exists,
ignoring the with_audio flag; update format_entry to check with_audio and only
normalize/set formatted_entry["audio_path"] and user_message["audio"/"audios"]
when with_audio is True, otherwise skip adding any audio-related keys (both in
the single-string branch and the list branch) so the --no-audio behavior is
honored; ensure you still keep the audio_path raw value out of the output when
with_audio is False and use the existing helper _normalize_audio_path for
normalization when allowed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 61635a67-8952-428f-b951-8cb90e7af773
📒 Files selected for processing (12)
docs/evaluation/speech-audio.mdnemo_skills/dataset/asr-leaderboard/prepare.pynemo_skills/dataset/audiobench/prepare.pynemo_skills/dataset/contextasr-bench/prepare.pynemo_skills/dataset/covost2/prepare.pynemo_skills/dataset/fleurs/prepare.pynemo_skills/dataset/librispeech-pc/prepare.pynemo_skills/dataset/mmau-pro/prepare.pynemo_skills/dataset/musan/prepare.pynemo_skills/dataset/numb3rs/prepare.pynemo_skills/dataset/utils.pytests/test_audio_path_prefix.py
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
nemo_skills/dataset/mmau-pro/prepare.py (1)
77-77: ⚡ Quick win
with_audiois threaded but unused.
with_audiois passed through the call chain but never used informat_entry, which makes behavior ambiguous and adds dead API surface. Either remove it, or explicitly gate audio-field emission with it.♻️ Proposed fix (remove dead parameter)
-def format_entry(entry, with_audio=False, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT): +def format_entry(entry, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT): @@ - formatted_entry = format_entry(entry, with_audio=not args.no_audio, audio_root=audio_root) + formatted_entry = format_entry(entry, audio_root=audio_root)As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
Also applies to: 169-169
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/mmau-pro/prepare.py` at line 77, The parameter with_audio on format_entry is unused; remove it from the function signature (def format_entry(entry, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT)) and delete any callers that pass a with_audio argument (also remove the identical unused parameter at the other occurrence flagged). If audio emission was intended, instead implement explicit gating: only add audio-related fields inside format_entry when with_audio is True and keep the parameter; otherwise prefer removing the parameter and all propagation of with_audio so the API no longer accepts an unused argument. Ensure you update all call sites that passed with_audio to match the new signature.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@nemo_skills/dataset/mmau-pro/prepare.py`:
- Line 77: The parameter with_audio on format_entry is unused; remove it from
the function signature (def format_entry(entry,
audio_root=DEFAULT_CONTAINER_AUDIO_ROOT)) and delete any callers that pass a
with_audio argument (also remove the identical unused parameter at the other
occurrence flagged). If audio emission was intended, instead implement explicit
gating: only add audio-related fields inside format_entry when with_audio is
True and keep the parameter; otherwise prefer removing the parameter and all
propagation of with_audio so the API no longer accepts an unused argument.
Ensure you update all call sites that passed with_audio to match the new
signature.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2b7b11f7-b0de-4523-8573-78d75b0fa050
📒 Files selected for processing (9)
docs/evaluation/speech-audio.mdnemo_skills/dataset/asr-leaderboard/prepare.pynemo_skills/dataset/audiobench/prepare.pynemo_skills/dataset/covost2/prepare.pynemo_skills/dataset/fleurs/prepare.pynemo_skills/dataset/librispeech-pc/prepare.pynemo_skills/dataset/mmau-pro/prepare.pynemo_skills/dataset/musan/prepare.pynemo_skills/dataset/numb3rs/prepare.py
🚧 Files skipped from review as they are similar to previous changes (6)
- nemo_skills/dataset/audiobench/prepare.py
- nemo_skills/dataset/musan/prepare.py
- nemo_skills/dataset/fleurs/prepare.py
- nemo_skills/dataset/librispeech-pc/prepare.py
- nemo_skills/dataset/covost2/prepare.py
- nemo_skills/dataset/asr-leaderboard/prepare.py
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/dataset/librispeech-pc/prepare.py (1)
135-140:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast on required manifest keys instead of silently skipping malformed rows.
audio_filepathandtextare required fields for a valid sample, but using.get(..., "")with a skip path can hide manifest corruption and silently reduce dataset size. Use direct key access so missing keys fail loudly.Proposed fix
- audio_filepath = entry.get("audio_filepath", "") - text = entry.get("text", "") + audio_filepath = entry["audio_filepath"] + text = entry["text"] if not audio_filepath or not text: skipped += 1 continueAs per coding guidelines "Don't use
.get()for accessing dictionary keys if the code expects them to be present; use direct accessdata[key_name]to fail with a clear error instead of silently corrupting data".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/librispeech-pc/prepare.py` around lines 135 - 140, Replace the silent .get(...) checks for audio_filepath and text with direct key access so missing manifest keys fail loudly: use entry["audio_filepath"] and entry["text"] instead of entry.get("...", "") in the block handling each manifest row (the variables audio_filepath and text), remove the skipped += 1 / continue path for missing keys, and either let the KeyError propagate or catch it and raise a new error with contextual information (e.g., include the manifest line/index) so malformed rows do not silently disappear.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@nemo_skills/dataset/librispeech-pc/prepare.py`:
- Around line 135-140: Replace the silent .get(...) checks for audio_filepath
and text with direct key access so missing manifest keys fail loudly: use
entry["audio_filepath"] and entry["text"] instead of entry.get("...", "") in the
block handling each manifest row (the variables audio_filepath and text), remove
the skipped += 1 / continue path for missing keys, and either let the KeyError
propagate or catch it and raise a new error with contextual information (e.g.,
include the manifest line/index) so malformed rows do not silently disappear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5a083dd1-29f7-4813-8352-5f6bb38bfe71
📒 Files selected for processing (3)
nemo_skills/dataset/librispeech-pc/prepare.pynemo_skills/dataset/numb3rs/prepare.pytests/test_audio_path_prefix.py
🚧 Files skipped from review as they are similar to previous changes (2)
- nemo_skills/dataset/numb3rs/prepare.py
- tests/test_audio_path_prefix.py
|
From the Covost and Fleurs side, LGTM! |
|
@DongjiGao please recreate from a branch, so that tests can run |
Summary
/databy default.--audio-prefixthrough audio benchmark prepare scripts so manifests consistently use/data/<benchmark>/...instead of mixed/datasetor host paths./data.Behavior change
contextasr-bench --audio-prefixnow means the global in-container audio root. For example, use--audio-prefix /data; the script appendscontextasr-benchitself.Known existing issue / TODO
numb3rs/prepare.pyissue: the currentnvidia/Numb3rstest split rows do not have acategorykey, so full real-data Numb3rs prep fails withKeyError: 'category'. This is unrelated to the audio-path change and remains tracked as a separate TODO; this PR only covers the path convention.Test plan
python -m pytest tests/test_audio_path_prefix.py -q(13 passed)python -m compileall -qon changed Python filesgit diff --check/datawith no stale/datasetpaths.ruff/pre-commitlocally because neither command is installed in the active environment.Made with Cursor
Summary by CodeRabbit
Documentation
New Features
Tests