Standardize audio manifest container paths by DongjiGao · Pull Request #1437 · NVIDIA-NeMo/Skills

DongjiGao · 2026-05-08T21:56:16Z

Summary

Add shared helpers for writing prepared audio manifests with in-container paths rooted at /data by default.
Thread --audio-prefix through audio benchmark prepare scripts so manifests consistently use /data/<benchmark>/... instead of mixed /dataset or host paths.
Document the prepare/eval mount contract and update stale examples to use /data.

Behavior change

contextasr-bench --audio-prefix now means the global in-container audio root. For example, use --audio-prefix /data; the script appends contextasr-bench itself.

Known existing issue / TODO

Real Draco prep surfaced a pre-existing numb3rs/prepare.py issue: the current nvidia/Numb3rs test split rows do not have a category key, so full real-data Numb3rs prep fails with KeyError: 'category'. This is unrelated to the audio-path change and remains tracked as a separate TODO; this PR only covers the path convention.

Test plan

python -m pytest tests/test_audio_path_prefix.py -q (13 passed)
python -m compileall -q on changed Python files
git diff --check
Real prepared-data checks on Draco for completed benchmarks: ASR leaderboard, CoVoST2, and MMAU-Pro paths resolve under /data with no stale /dataset paths.
Not run: ruff / pre-commit locally because neither command is installed in the active environment.

Made with Cursor

Summary by CodeRabbit

Documentation
- Standardized "Audio path convention" guidance, clarified --audio-prefix behavior and fallback to environment or /data, and updated examples to use /data.
New Features
- Added consistent, configurable in-container audio-root handling across dataset prepare tools and a shared audio-path helper.
- Added --audio-prefix option to dataset CLIs; manifests now reference consistent container-relative audio locations.
Tests
- Added tests validating manifest audio paths are clean and correctly prefixed.

Add a shared /data-oriented audio path helper and thread --audio-prefix through ASR Leaderboard, FLEURS, CoVoST2, ContextASR-Bench, and LibriSpeech-PC preparation. Prepared manifests now consistently write in-container paths like /data/<benchmark>/... to match the burst_eval_vllm <data_dir>:/data mount convention. BREAKING: contextasr-bench --audio-prefix now means the global container root. Old usage --audio-prefix /data/contextasr-bench should become --audio-prefix /data. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>

Allow prepare.py to write under <data_dir>/asr-leaderboard so it matches the other large audio benchmark prepare scripts and avoids relying on package-directory writes. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>

Route AudioBench, MMAU-Pro, MUSAN, and Numb3rs prepared audio paths through the shared container-root helpers so their manifests follow the same /data/<benchmark>/... convention. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai · 2026-05-08T22:00:37Z

📝 Walkthrough

Walkthrough

This PR standardizes audio path handling across nine dataset preparation scripts by introducing shared container audio root utilities and applying them consistently. All datasets now support configurable in-container audio prefixes via --audio-prefix, defaulting to environment variable NEMO_SKILLS_AUDIO_ROOT or /data, replacing previous hardcoded paths like /dataset/.... Documentation is updated to describe the new convention.

Changes

Container Audio Path Standardization

Layer / File(s)	Summary
Shared Container Audio Utilities `nemo_skills/dataset/utils.py`	Adds `DEFAULT_CONTAINER_AUDIO_ROOT` constant, `get_container_audio_root(audio_prefix)` to resolve in-container root from CLI arg or environment, and `build_container_audio_path(benchmark, *parts)` to construct normalized manifest paths.
Audio Path Conventions & Documentation `docs/evaluation/speech-audio.md`	New "Audio path convention" section explains in-container root semantics and `--audio-prefix` behavior; evaluation examples updated to use `/data` instead of `/dataset`.
ASR-Leaderboard `nemo_skills/dataset/asr-leaderboard/prepare.py`	Adds `audio_root` parameter to `save_audio_and_format_entry()` and `prepare_dataset()`, replaces hardcoded `/dataset/asr-leaderboard/...` with `build_container_audio_path()`, adds `--audio-prefix` and `--data_dir` CLI options.
Audiobench `nemo_skills/dataset/audiobench/prepare.py`	Extends `create_manifest_entry()` and `process_dataset()` with `audio_root`, replaces hardcoded manifest paths with helper-based construction, adds `--audio-prefix` CLI option.
ContextASR-Bench `nemo_skills/dataset/contextasr-bench/prepare.py`	Adds `BENCHMARK_NAME` constant and `resolve_audio_prefix()` helper that delegates to `build_container_audio_path()`, updates CLI prefix resolution and sample audio verification.
CoVoST2 `nemo_skills/dataset/covost2/prepare.py`	Updates `get_container_audio_path()` to delegate to `build_container_audio_path()`, extends `prepare_covost2()` with `audio_root`, adds `--audio-prefix` CLI option.
FLEURS `nemo_skills/dataset/fleurs/prepare.py`	Updates `get_container_audio_path()` wrapper and `prepare_fleurs()` with `audio_root` parameter, delegates path construction to shared utility, adds `--audio-prefix` CLI option.
LibriSpeech-PC `nemo_skills/dataset/librispeech-pc/prepare.py`	Extends `process_split()` with `audio_root`, replaces environment variable lookup and string formatting with `build_container_audio_path()`, adds `--audio-prefix` CLI option.
MMAU-Pro `nemo_skills/dataset/mmau-pro/prepare.py`	Adds `_normalize_audio_path()` helper to convert dataset paths to container paths, extends `format_entry()` with `audio_root`, updates manifest and message audio handling, adds `--audio-prefix` CLI option.
MUSAN `nemo_skills/dataset/musan/prepare.py`	Updates `create_manifest_entry()` to accept `audio_root` and use `build_container_audio_path()`, extends `process_category_from_files()` and `process_category()`, adds `--audio-prefix` CLI option, threads `audio_root` through both file and HuggingFace processing paths.
Numb3rs `nemo_skills/dataset/numb3rs/prepare.py`	Renames `audio_prefix` to `audio_root` in `save_audio_and_format_entry()` and `prepare_category()`, replaces hardcoded path formatting with `build_container_audio_path()`, updates CLI `--audio-prefix` to default `None` and derive root via helper.
Test Coverage `tests/test_audio_path_prefix.py`	New test module with shared utilities for path validation, dependency stubbing, and end-to-end tests covering all nine datasets; verifies clean audio paths without stale `/dataset/` references.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 55.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'Standardize audio manifest container paths' accurately and concisely describes the main change: refactoring audio path handling in dataset manifests to use a standardized container root convention.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_skills/dataset/mmau-pro/prepare.py (1)
72-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor with_audio so --no-audio actually disables audio fields

Line 164 threads with_audio, but Line 98 ignores it and still writes audio metadata whenever audio_path exists. That breaks the --no-audio contract by emitting audio references even when audio download is skipped.
Suggested fix
 def format_entry(entry, with_audio=False, audio_root="/data"):
@@
-    if entry.get("audio_path"):
+    if with_audio and entry.get("audio_path"):
         audio_path = entry["audio_path"]
@@
-        elif isinstance(audio_path, str):
+        elif isinstance(audio_path, str):
             audio_path = _normalize_audio_path(audio_path, audio_root)
             formatted_entry["audio_path"] = audio_path
             user_message["audio"] = {"path": audio_path, "duration": 10.0}
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

Also applies to: 164-164
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/mmau-pro/prepare.py` around lines 72 - 108, The
format_entry function currently adds audio metadata and
formatted_entry["audio_path"] whenever entry.get("audio_path") exists, ignoring
the with_audio flag; update format_entry to check with_audio and only
normalize/set formatted_entry["audio_path"] and user_message["audio"/"audios"]
when with_audio is True, otherwise skip adding any audio-related keys (both in
the single-string branch and the list branch) so the --no-audio behavior is
honored; ensure you still keep the audio_path raw value out of the output when
with_audio is False and use the existing helper _normalize_audio_path for
normalization when allowed.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/librispeech-pc/prepare.py`:
- Around line 217-220: The file prepare.py failed CI due to ruff-format
formatting changes; run the formatter (ruff format) on this file and commit the
result so the formatting of the block around download_manifests(...) and the
process_split(...) generator expression matches the project's ruff settings;
specifically ensure the call to download_manifests and the splits/total
assignment (where process_split is used) are formatted per ruff (line breaks,
spacing, and parentheses) to satisfy the linter.

In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Around line 112-114: The file has formatting issues flagged by ruff-format
around the audio writing calls (see the audio_file_path variable and
sf.write(...) usages at the block with audio_file_path = audio_dir /
audio_filename and the similar block around lines where sf.write is called
again), so run the code formatter (ruff format) on
nemo_skills/dataset/numb3rs/prepare.py and commit the resulting changes so the
spacing/line breaks and import/order style conform to the project's ruff config;
ensure the two locations referencing audio_file_path and sf.write are updated by
the formatter and no manual code changes are introduced.

In `@tests/test_audio_path_prefix.py`:
- Around line 110-117: Run the ruff formatter on tests/test_audio_path_prefix.py
and commit the changes so pre-commit CI passes; specifically reformat the
assertion block that references formatted["messages"][1]["audio"]["path"] and
the subsequent call to _assert_clean_audio_paths(formatted,
"/data/asr-leaderboard/") (also apply the ruff-format fixes around the later
lines ~133-138 mentioned in the workflow), then stage and commit the updated
file.

---

Outside diff comments:
In `@nemo_skills/dataset/mmau-pro/prepare.py`:
- Around line 72-108: The format_entry function currently adds audio metadata
and formatted_entry["audio_path"] whenever entry.get("audio_path") exists,
ignoring the with_audio flag; update format_entry to check with_audio and only
normalize/set formatted_entry["audio_path"] and user_message["audio"/"audios"]
when with_audio is True, otherwise skip adding any audio-related keys (both in
the single-string branch and the list branch) so the --no-audio behavior is
honored; ensure you still keep the audio_path raw value out of the output when
with_audio is False and use the existing helper _normalize_audio_path for
normalization when allowed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 61635a67-8952-428f-b951-8cb90e7af773

📥 Commits

Reviewing files that changed from the base of the PR and between 4701f10 and 1f9c0cd.

📒 Files selected for processing (12)

docs/evaluation/speech-audio.md
nemo_skills/dataset/asr-leaderboard/prepare.py
nemo_skills/dataset/audiobench/prepare.py
nemo_skills/dataset/contextasr-bench/prepare.py
nemo_skills/dataset/covost2/prepare.py
nemo_skills/dataset/fleurs/prepare.py
nemo_skills/dataset/librispeech-pc/prepare.py
nemo_skills/dataset/mmau-pro/prepare.py
nemo_skills/dataset/musan/prepare.py
nemo_skills/dataset/numb3rs/prepare.py
nemo_skills/dataset/utils.py
tests/test_audio_path_prefix.py

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

nemo_skills/dataset/mmau-pro/prepare.py (1)
77-77: ⚡ Quick win

with_audio is threaded but unused.

with_audio is passed through the call chain but never used in format_entry, which makes behavior ambiguous and adds dead API surface. Either remove it, or explicitly gate audio-field emission with it.
♻️ Proposed fix (remove dead parameter)
-def format_entry(entry, with_audio=False, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT):
+def format_entry(entry, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT):
@@
-            formatted_entry = format_entry(entry, with_audio=not args.no_audio, audio_root=audio_root)
+            formatted_entry = format_entry(entry, audio_root=audio_root)
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

Also applies to: 169-169
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/mmau-pro/prepare.py` at line 77, The parameter with_audio
on format_entry is unused; remove it from the function signature (def
format_entry(entry, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT)) and delete any
callers that pass a with_audio argument (also remove the identical unused
parameter at the other occurrence flagged). If audio emission was intended,
instead implement explicit gating: only add audio-related fields inside
format_entry when with_audio is True and keep the parameter; otherwise prefer
removing the parameter and all propagation of with_audio so the API no longer
accepts an unused argument. Ensure you update all call sites that passed
with_audio to match the new signature.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@nemo_skills/dataset/mmau-pro/prepare.py`:
- Line 77: The parameter with_audio on format_entry is unused; remove it from
the function signature (def format_entry(entry,
audio_root=DEFAULT_CONTAINER_AUDIO_ROOT)) and delete any callers that pass a
with_audio argument (also remove the identical unused parameter at the other
occurrence flagged). If audio emission was intended, instead implement explicit
gating: only add audio-related fields inside format_entry when with_audio is
True and keep the parameter; otherwise prefer removing the parameter and all
propagation of with_audio so the API no longer accepts an unused argument.
Ensure you update all call sites that passed with_audio to match the new
signature.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b7b11f7-b0de-4523-8573-78d75b0fa050

📥 Commits

Reviewing files that changed from the base of the PR and between 1f9c0cd and f5900a3.

📒 Files selected for processing (9)

docs/evaluation/speech-audio.md
nemo_skills/dataset/asr-leaderboard/prepare.py
nemo_skills/dataset/audiobench/prepare.py
nemo_skills/dataset/covost2/prepare.py
nemo_skills/dataset/fleurs/prepare.py
nemo_skills/dataset/librispeech-pc/prepare.py
nemo_skills/dataset/mmau-pro/prepare.py
nemo_skills/dataset/musan/prepare.py
nemo_skills/dataset/numb3rs/prepare.py

🚧 Files skipped from review as they are similar to previous changes (6)

nemo_skills/dataset/audiobench/prepare.py
nemo_skills/dataset/musan/prepare.py
nemo_skills/dataset/fleurs/prepare.py
nemo_skills/dataset/librispeech-pc/prepare.py
nemo_skills/dataset/covost2/prepare.py
nemo_skills/dataset/asr-leaderboard/prepare.py

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_skills/dataset/librispeech-pc/prepare.py (1)
135-140: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on required manifest keys instead of silently skipping malformed rows.

audio_filepath and text are required fields for a valid sample, but using .get(..., "") with a skip path can hide manifest corruption and silently reduce dataset size. Use direct key access so missing keys fail loudly.
Proposed fix
-            audio_filepath = entry.get("audio_filepath", "")
-            text = entry.get("text", "")
+            audio_filepath = entry["audio_filepath"]
+            text = entry["text"]

             if not audio_filepath or not text:
                 skipped += 1
                 continue
As per coding guidelines "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/librispeech-pc/prepare.py` around lines 135 - 140,
Replace the silent .get(...) checks for audio_filepath and text with direct key
access so missing manifest keys fail loudly: use entry["audio_filepath"] and
entry["text"] instead of entry.get("...", "") in the block handling each
manifest row (the variables audio_filepath and text), remove the skipped += 1 /
continue path for missing keys, and either let the KeyError propagate or catch
it and raise a new error with contextual information (e.g., include the manifest
line/index) so malformed rows do not silently disappear.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@nemo_skills/dataset/librispeech-pc/prepare.py`:
- Around line 135-140: Replace the silent .get(...) checks for audio_filepath
and text with direct key access so missing manifest keys fail loudly: use
entry["audio_filepath"] and entry["text"] instead of entry.get("...", "") in the
block handling each manifest row (the variables audio_filepath and text), remove
the skipped += 1 / continue path for missing keys, and either let the KeyError
propagate or catch it and raise a new error with contextual information (e.g.,
include the manifest line/index) so malformed rows do not silently disappear.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a083dd1-29f7-4813-8352-5f6bb38bfe71

📥 Commits

Reviewing files that changed from the base of the PR and between f5900a3 and 967f6ba.

📒 Files selected for processing (3)

nemo_skills/dataset/librispeech-pc/prepare.py
nemo_skills/dataset/numb3rs/prepare.py
tests/test_audio_path_prefix.py

🚧 Files skipped from review as they are similar to previous changes (2)

nemo_skills/dataset/numb3rs/prepare.py
tests/test_audio_path_prefix.py

naymaraq · 2026-05-09T14:27:25Z

From the Covost and Fleurs side, LGTM!
Just update the branch, since there might be some changes I made to the Covost/Fleurs benchmarks.

Kipok · 2026-05-12T00:22:06Z

@DongjiGao please recreate from a branch, so that tests can run

DongjiGao and others added 3 commits May 6, 2026 15:22

asr-leaderboard: support custom prepare output dir

6c96a8a

Allow prepare.py to write under <data_dir>/asr-leaderboard so it matches the other large audio benchmark prepare scripts and avoids relying on package-directory writes. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

Comment thread nemo_skills/dataset/librispeech-pc/prepare.py Outdated

Comment thread nemo_skills/dataset/numb3rs/prepare.py

Comment thread tests/test_audio_path_prefix.py

Use shared audio root defaults in prepare docs

f5900a3

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

DongjiGao force-pushed the main branch from 1f9c0cd to f5900a3 Compare May 8, 2026 22:15

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

Apply Ruff formatting to audio path changes

967f6ba

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

naymaraq self-requested a review May 9, 2026 12:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardize audio manifest container paths#1437

Standardize audio manifest container paths#1437
DongjiGao wants to merge 5 commits into
NVIDIA-NeMo:mainfrom
DongjiGao:main

DongjiGao commented May 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 8, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

naymaraq commented May 9, 2026

Uh oh!

Kipok commented May 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

DongjiGao commented May 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Behavior change

Known existing issue / TODO

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

naymaraq commented May 9, 2026

Uh oh!

Kipok commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

DongjiGao commented May 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 8, 2026 •

edited

Loading

Kipok commented May 12, 2026 •

edited

Loading