Skip to content

Standardize audio manifest container paths#1437

Open
DongjiGao wants to merge 5 commits into
NVIDIA-NeMo:mainfrom
DongjiGao:main
Open

Standardize audio manifest container paths#1437
DongjiGao wants to merge 5 commits into
NVIDIA-NeMo:mainfrom
DongjiGao:main

Conversation

@DongjiGao
Copy link
Copy Markdown
Collaborator

@DongjiGao DongjiGao commented May 8, 2026

Summary

  • Add shared helpers for writing prepared audio manifests with in-container paths rooted at /data by default.
  • Thread --audio-prefix through audio benchmark prepare scripts so manifests consistently use /data/<benchmark>/... instead of mixed /dataset or host paths.
  • Document the prepare/eval mount contract and update stale examples to use /data.

Behavior change

  • contextasr-bench --audio-prefix now means the global in-container audio root. For example, use --audio-prefix /data; the script appends contextasr-bench itself.

Known existing issue / TODO

  • Real Draco prep surfaced a pre-existing numb3rs/prepare.py issue: the current nvidia/Numb3rs test split rows do not have a category key, so full real-data Numb3rs prep fails with KeyError: 'category'. This is unrelated to the audio-path change and remains tracked as a separate TODO; this PR only covers the path convention.

Test plan

  • python -m pytest tests/test_audio_path_prefix.py -q (13 passed)
  • python -m compileall -q on changed Python files
  • git diff --check
  • Real prepared-data checks on Draco for completed benchmarks: ASR leaderboard, CoVoST2, and MMAU-Pro paths resolve under /data with no stale /dataset paths.
  • Not run: ruff / pre-commit locally because neither command is installed in the active environment.

Made with Cursor

Summary by CodeRabbit

  • Documentation

    • Standardized "Audio path convention" guidance, clarified --audio-prefix behavior and fallback to environment or /data, and updated examples to use /data.
  • New Features

    • Added consistent, configurable in-container audio-root handling across dataset prepare tools and a shared audio-path helper.
    • Added --audio-prefix option to dataset CLIs; manifests now reference consistent container-relative audio locations.
  • Tests

    • Added tests validating manifest audio paths are clean and correctly prefixed.

DongjiGao and others added 3 commits May 6, 2026 15:22
Add a shared /data-oriented audio path helper and thread --audio-prefix through ASR Leaderboard, FLEURS, CoVoST2, ContextASR-Bench, and LibriSpeech-PC preparation. Prepared manifests now consistently write in-container paths like /data/<benchmark>/... to match the burst_eval_vllm <data_dir>:/data mount convention.

BREAKING: contextasr-bench --audio-prefix now means the global container root. Old usage --audio-prefix /data/contextasr-bench should become --audio-prefix /data.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Allow prepare.py to write under <data_dir>/asr-leaderboard so it matches the other large audio benchmark prepare scripts and avoids relying on package-directory writes.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Route AudioBench, MMAU-Pro, MUSAN, and Numb3rs prepared audio paths through the shared container-root helpers so their manifests follow the same /data/<benchmark>/... convention.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 8, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR standardizes audio path handling across nine dataset preparation scripts by introducing shared container audio root utilities and applying them consistently. All datasets now support configurable in-container audio prefixes via --audio-prefix, defaulting to environment variable NEMO_SKILLS_AUDIO_ROOT or /data, replacing previous hardcoded paths like /dataset/.... Documentation is updated to describe the new convention.

Changes

Container Audio Path Standardization

Layer / File(s) Summary
Shared Container Audio Utilities
nemo_skills/dataset/utils.py
Adds DEFAULT_CONTAINER_AUDIO_ROOT constant, get_container_audio_root(audio_prefix) to resolve in-container root from CLI arg or environment, and build_container_audio_path(benchmark, *parts) to construct normalized manifest paths.
Audio Path Conventions & Documentation
docs/evaluation/speech-audio.md
New "Audio path convention" section explains in-container root semantics and --audio-prefix behavior; evaluation examples updated to use /data instead of /dataset.
ASR-Leaderboard
nemo_skills/dataset/asr-leaderboard/prepare.py
Adds audio_root parameter to save_audio_and_format_entry() and prepare_dataset(), replaces hardcoded /dataset/asr-leaderboard/... with build_container_audio_path(), adds --audio-prefix and --data_dir CLI options.
Audiobench
nemo_skills/dataset/audiobench/prepare.py
Extends create_manifest_entry() and process_dataset() with audio_root, replaces hardcoded manifest paths with helper-based construction, adds --audio-prefix CLI option.
ContextASR-Bench
nemo_skills/dataset/contextasr-bench/prepare.py
Adds BENCHMARK_NAME constant and resolve_audio_prefix() helper that delegates to build_container_audio_path(), updates CLI prefix resolution and sample audio verification.
CoVoST2
nemo_skills/dataset/covost2/prepare.py
Updates get_container_audio_path() to delegate to build_container_audio_path(), extends prepare_covost2() with audio_root, adds --audio-prefix CLI option.
FLEURS
nemo_skills/dataset/fleurs/prepare.py
Updates get_container_audio_path() wrapper and prepare_fleurs() with audio_root parameter, delegates path construction to shared utility, adds --audio-prefix CLI option.
LibriSpeech-PC
nemo_skills/dataset/librispeech-pc/prepare.py
Extends process_split() with audio_root, replaces environment variable lookup and string formatting with build_container_audio_path(), adds --audio-prefix CLI option.
MMAU-Pro
nemo_skills/dataset/mmau-pro/prepare.py
Adds _normalize_audio_path() helper to convert dataset paths to container paths, extends format_entry() with audio_root, updates manifest and message audio handling, adds --audio-prefix CLI option.
MUSAN
nemo_skills/dataset/musan/prepare.py
Updates create_manifest_entry() to accept audio_root and use build_container_audio_path(), extends process_category_from_files() and process_category(), adds --audio-prefix CLI option, threads audio_root through both file and HuggingFace processing paths.
Numb3rs
nemo_skills/dataset/numb3rs/prepare.py
Renames audio_prefix to audio_root in save_audio_and_format_entry() and prepare_category(), replaces hardcoded path formatting with build_container_audio_path(), updates CLI --audio-prefix to default None and derive root via helper.
Test Coverage
tests/test_audio_path_prefix.py
New test module with shared utilities for path validation, dependency stubbing, and end-to-end tests covering all nine datasets; verifies clean audio paths without stale /dataset/ references.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Standardize audio manifest container paths' accurately and concisely describes the main change: refactoring audio path handling in dataset manifests to use a standardized container root convention.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/dataset/mmau-pro/prepare.py (1)

72-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor with_audio so --no-audio actually disables audio fields

Line 164 threads with_audio, but Line 98 ignores it and still writes audio metadata whenever audio_path exists. That breaks the --no-audio contract by emitting audio references even when audio download is skipped.

Suggested fix
 def format_entry(entry, with_audio=False, audio_root="/data"):
@@
-    if entry.get("audio_path"):
+    if with_audio and entry.get("audio_path"):
         audio_path = entry["audio_path"]
@@
-        elif isinstance(audio_path, str):
+        elif isinstance(audio_path, str):
             audio_path = _normalize_audio_path(audio_path, audio_root)
             formatted_entry["audio_path"] = audio_path
             user_message["audio"] = {"path": audio_path, "duration": 10.0}

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

Also applies to: 164-164

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/mmau-pro/prepare.py` around lines 72 - 108, The
format_entry function currently adds audio metadata and
formatted_entry["audio_path"] whenever entry.get("audio_path") exists, ignoring
the with_audio flag; update format_entry to check with_audio and only
normalize/set formatted_entry["audio_path"] and user_message["audio"/"audios"]
when with_audio is True, otherwise skip adding any audio-related keys (both in
the single-string branch and the list branch) so the --no-audio behavior is
honored; ensure you still keep the audio_path raw value out of the output when
with_audio is False and use the existing helper _normalize_audio_path for
normalization when allowed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/librispeech-pc/prepare.py`:
- Around line 217-220: The file prepare.py failed CI due to ruff-format
formatting changes; run the formatter (ruff format) on this file and commit the
result so the formatting of the block around download_manifests(...) and the
process_split(...) generator expression matches the project's ruff settings;
specifically ensure the call to download_manifests and the splits/total
assignment (where process_split is used) are formatted per ruff (line breaks,
spacing, and parentheses) to satisfy the linter.

In `@nemo_skills/dataset/numb3rs/prepare.py`:
- Around line 112-114: The file has formatting issues flagged by ruff-format
around the audio writing calls (see the audio_file_path variable and
sf.write(...) usages at the block with audio_file_path = audio_dir /
audio_filename and the similar block around lines where sf.write is called
again), so run the code formatter (ruff format) on
nemo_skills/dataset/numb3rs/prepare.py and commit the resulting changes so the
spacing/line breaks and import/order style conform to the project's ruff config;
ensure the two locations referencing audio_file_path and sf.write are updated by
the formatter and no manual code changes are introduced.

In `@tests/test_audio_path_prefix.py`:
- Around line 110-117: Run the ruff formatter on tests/test_audio_path_prefix.py
and commit the changes so pre-commit CI passes; specifically reformat the
assertion block that references formatted["messages"][1]["audio"]["path"] and
the subsequent call to _assert_clean_audio_paths(formatted,
"/data/asr-leaderboard/") (also apply the ruff-format fixes around the later
lines ~133-138 mentioned in the workflow), then stage and commit the updated
file.

---

Outside diff comments:
In `@nemo_skills/dataset/mmau-pro/prepare.py`:
- Around line 72-108: The format_entry function currently adds audio metadata
and formatted_entry["audio_path"] whenever entry.get("audio_path") exists,
ignoring the with_audio flag; update format_entry to check with_audio and only
normalize/set formatted_entry["audio_path"] and user_message["audio"/"audios"]
when with_audio is True, otherwise skip adding any audio-related keys (both in
the single-string branch and the list branch) so the --no-audio behavior is
honored; ensure you still keep the audio_path raw value out of the output when
with_audio is False and use the existing helper _normalize_audio_path for
normalization when allowed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 61635a67-8952-428f-b951-8cb90e7af773

📥 Commits

Reviewing files that changed from the base of the PR and between 4701f10 and 1f9c0cd.

📒 Files selected for processing (12)
  • docs/evaluation/speech-audio.md
  • nemo_skills/dataset/asr-leaderboard/prepare.py
  • nemo_skills/dataset/audiobench/prepare.py
  • nemo_skills/dataset/contextasr-bench/prepare.py
  • nemo_skills/dataset/covost2/prepare.py
  • nemo_skills/dataset/fleurs/prepare.py
  • nemo_skills/dataset/librispeech-pc/prepare.py
  • nemo_skills/dataset/mmau-pro/prepare.py
  • nemo_skills/dataset/musan/prepare.py
  • nemo_skills/dataset/numb3rs/prepare.py
  • nemo_skills/dataset/utils.py
  • tests/test_audio_path_prefix.py

Comment thread nemo_skills/dataset/librispeech-pc/prepare.py Outdated
Comment thread nemo_skills/dataset/numb3rs/prepare.py
Comment thread tests/test_audio_path_prefix.py
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
nemo_skills/dataset/mmau-pro/prepare.py (1)

77-77: ⚡ Quick win

with_audio is threaded but unused.

with_audio is passed through the call chain but never used in format_entry, which makes behavior ambiguous and adds dead API surface. Either remove it, or explicitly gate audio-field emission with it.

♻️ Proposed fix (remove dead parameter)
-def format_entry(entry, with_audio=False, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT):
+def format_entry(entry, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT):
@@
-            formatted_entry = format_entry(entry, with_audio=not args.no_audio, audio_root=audio_root)
+            formatted_entry = format_entry(entry, audio_root=audio_root)

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

Also applies to: 169-169

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/mmau-pro/prepare.py` at line 77, The parameter with_audio
on format_entry is unused; remove it from the function signature (def
format_entry(entry, audio_root=DEFAULT_CONTAINER_AUDIO_ROOT)) and delete any
callers that pass a with_audio argument (also remove the identical unused
parameter at the other occurrence flagged). If audio emission was intended,
instead implement explicit gating: only add audio-related fields inside
format_entry when with_audio is True and keep the parameter; otherwise prefer
removing the parameter and all propagation of with_audio so the API no longer
accepts an unused argument. Ensure you update all call sites that passed
with_audio to match the new signature.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@nemo_skills/dataset/mmau-pro/prepare.py`:
- Line 77: The parameter with_audio on format_entry is unused; remove it from
the function signature (def format_entry(entry,
audio_root=DEFAULT_CONTAINER_AUDIO_ROOT)) and delete any callers that pass a
with_audio argument (also remove the identical unused parameter at the other
occurrence flagged). If audio emission was intended, instead implement explicit
gating: only add audio-related fields inside format_entry when with_audio is
True and keep the parameter; otherwise prefer removing the parameter and all
propagation of with_audio so the API no longer accepts an unused argument.
Ensure you update all call sites that passed with_audio to match the new
signature.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b7b11f7-b0de-4523-8573-78d75b0fa050

📥 Commits

Reviewing files that changed from the base of the PR and between 1f9c0cd and f5900a3.

📒 Files selected for processing (9)
  • docs/evaluation/speech-audio.md
  • nemo_skills/dataset/asr-leaderboard/prepare.py
  • nemo_skills/dataset/audiobench/prepare.py
  • nemo_skills/dataset/covost2/prepare.py
  • nemo_skills/dataset/fleurs/prepare.py
  • nemo_skills/dataset/librispeech-pc/prepare.py
  • nemo_skills/dataset/mmau-pro/prepare.py
  • nemo_skills/dataset/musan/prepare.py
  • nemo_skills/dataset/numb3rs/prepare.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • nemo_skills/dataset/audiobench/prepare.py
  • nemo_skills/dataset/musan/prepare.py
  • nemo_skills/dataset/fleurs/prepare.py
  • nemo_skills/dataset/librispeech-pc/prepare.py
  • nemo_skills/dataset/covost2/prepare.py
  • nemo_skills/dataset/asr-leaderboard/prepare.py

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/dataset/librispeech-pc/prepare.py (1)

135-140: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on required manifest keys instead of silently skipping malformed rows.

audio_filepath and text are required fields for a valid sample, but using .get(..., "") with a skip path can hide manifest corruption and silently reduce dataset size. Use direct key access so missing keys fail loudly.

Proposed fix
-            audio_filepath = entry.get("audio_filepath", "")
-            text = entry.get("text", "")
+            audio_filepath = entry["audio_filepath"]
+            text = entry["text"]

             if not audio_filepath or not text:
                 skipped += 1
                 continue

As per coding guidelines "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/librispeech-pc/prepare.py` around lines 135 - 140,
Replace the silent .get(...) checks for audio_filepath and text with direct key
access so missing manifest keys fail loudly: use entry["audio_filepath"] and
entry["text"] instead of entry.get("...", "") in the block handling each
manifest row (the variables audio_filepath and text), remove the skipped += 1 /
continue path for missing keys, and either let the KeyError propagate or catch
it and raise a new error with contextual information (e.g., include the manifest
line/index) so malformed rows do not silently disappear.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@nemo_skills/dataset/librispeech-pc/prepare.py`:
- Around line 135-140: Replace the silent .get(...) checks for audio_filepath
and text with direct key access so missing manifest keys fail loudly: use
entry["audio_filepath"] and entry["text"] instead of entry.get("...", "") in the
block handling each manifest row (the variables audio_filepath and text), remove
the skipped += 1 / continue path for missing keys, and either let the KeyError
propagate or catch it and raise a new error with contextual information (e.g.,
include the manifest line/index) so malformed rows do not silently disappear.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a083dd1-29f7-4813-8352-5f6bb38bfe71

📥 Commits

Reviewing files that changed from the base of the PR and between f5900a3 and 967f6ba.

📒 Files selected for processing (3)
  • nemo_skills/dataset/librispeech-pc/prepare.py
  • nemo_skills/dataset/numb3rs/prepare.py
  • tests/test_audio_path_prefix.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • nemo_skills/dataset/numb3rs/prepare.py
  • tests/test_audio_path_prefix.py

@naymaraq naymaraq self-requested a review May 9, 2026 12:46
@naymaraq
Copy link
Copy Markdown
Collaborator

naymaraq commented May 9, 2026

From the Covost and Fleurs side, LGTM!
Just update the branch, since there might be some changes I made to the Covost/Fleurs benchmarks.

@Kipok
Copy link
Copy Markdown
Collaborator

Kipok commented May 12, 2026

@DongjiGao please recreate from a branch, so that tests can run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants