Skip to content

Add audio JSONL reader and tarred ASR dataset support#1780

Open
ssh-meister wants to merge 16 commits intomainfrom
ameister/reader
Open

Add audio JSONL reader and tarred ASR dataset support#1780
ssh-meister wants to merge 16 commits intomainfrom
ameister/reader

Conversation

@ssh-meister
Copy link
Copy Markdown

@ssh-meister ssh-meister commented Apr 9, 2026

Description

Adds audio-focused JSONL reading and tarred ASR dataset support to NeMo Curator.

This PR extends JsonlReader with task_type="audio" so JSONL manifests can now emit AudioTask objects directly instead of only DocumentBatch. The new audio reader path supports one-manifest-line-per-task fanout and also preserves stable _curator_dedup_id assignment via _generate_ids / _assign_ids.

In addition, this PR introduces a bridge for NeMo-style tarred audio datasets:

  • TarredAudioManifestReader reads sharded manifests and matches them to tar shards by shard id
  • MaterializeTarredAudioStage extracts only the needed tar member into a temporary local file just before path-based audio stages
  • CleanupTemporaryAudioStage removes those temporary files afterwards and restores the original manifest-style audio_filepath

This avoids eager extraction of the whole tar dataset, keeps compatibility with existing path-based ASR stages, and supports both strict and permissive handling of manifest entries missing from tar shards via skip_missing_entries.

TarredAudioManifestReader also follows the transport ideas used in NeMo's nemo_adapters.py, including support for fsspec-based remote access and pipe:-style specifiers, so the same reader can work with local files, S3-backed storage, and AIS-style commands such as pipe:ais get ....

Usage

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio import (
    TarredAudioManifestReader,
    MaterializeTarredAudioStage,
    CleanupTemporaryAudioStage,
)
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.io import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="tarred_asr_pipeline")

pipeline.add_stage(
    TarredAudioManifestReader(
        manifest_paths="/path/to/manifests/manifest__OP_0..255_CL_.json",
        tar_paths="/path/to/audio_shards/audio__OP_0..255_CL_.tar",
        skip_missing_entries=True,
    )
)
pipeline.add_stage(MaterializeTarredAudioStage())
pipeline.add_stage(InferenceAsrNemoStage(model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"))
pipeline.add_stage(GetPairwiseWerStage(text_key="text", pred_text_key="pred_text", wer_key="wer"))
pipeline.add_stage(CleanupTemporaryAudioStage())
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="/path/to/output"))

UPD

This PR now also adds the first building blocks for audio-stage checkpointing and resume.

A stable sample_key has been introduced for AudioTask as a per-sample identity that is independent of runtime-specific task_id values. This key is now populated by both audio reader paths:

  • JsonlReader(task_type="audio") / JsonlAudioReaderStage
  • TarredAudioManifestReaderStage

If an input manifest entry already contains sample_key, it is preserved as-is. Otherwise, a deterministic key is derived from stable sample identity fields such as audio_filepath, tar shard/member metadata, offset, duration, and dataset_name.

In addition, MaterializeTarredAudioStage now supports optional durable materialization via:

  • materialization_dir: str | None = None

Behavior is:

  • if materialization_dir is None, the stage keeps the previous behavior and writes extracted audio to temporary local files
  • if materialization_dir is set, the stage writes extracted or segmented audio to a deterministic durable path derived from sample_key

This makes the tarred-audio bridge more checkpoint/resume-friendly without changing the default behavior for existing pipelines. CleanupTemporaryAudioStage continues to clean up true temporary files, while durable materialized files are left intact and audio_filepath is still restored back to the original manifest-style path after cleanup.

These changes do not yet implement full pipeline-level checkpoint orchestration, but they establish the two core primitives needed for that next step:

  • a stable per-sample identifier (sample_key)
  • an optional persistent materialization location (materialization_dir)

Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
@ssh-meister ssh-meister requested a review from a team as a code owner April 9, 2026 14:46
@ssh-meister ssh-meister requested review from huvunvidia and removed request for a team April 9, 2026 14:46
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ssh-meister ssh-meister enabled auto-merge (squash) April 9, 2026 14:50
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 9, 2026

Greptile Summary

This PR introduces TarredAudioManifestReader / MaterializeTarredAudioStage / CleanupTemporaryAudioStage for NeMo-style sharded tarred audio datasets, extends JsonlReader with a task_type="audio" path via JsonlAudioReaderStage, and adds a stable per-sample sample_key to AudioTask. Previously raised concerns (_AttrDict dead code, _should_segment silently ignoring duration, SIGPIPE error on early pipe: break, and limit not propagated through the composite reader) are all resolved in this revision.

Confidence Score: 5/5

Safe to merge; all previously raised P1 issues are resolved and only minor style/performance P2 items remain.

All blocking issues from prior review rounds have been addressed: _AttrDict dead-code is fixed, _should_segment now checks duration is not None, _PipeStream tolerates SIGPIPE with allow_sigpipe, and limit is propagated through TarredAudioManifestReader.decompose(). The three remaining comments are P2 style/performance suggestions that do not affect correctness.

nemo_curator/stages/audio/io/tarred.py — minor redundant _should_segment call, O(N²) segment classification, and double tar stream are worth a follow-up pass but do not block merge.

Important Files Changed

Filename Overview
nemo_curator/stages/audio/io/tarred.py New 613-line module implementing TarredAudioManifestReader composite, MaterializeTarredAudioStage, CleanupTemporaryAudioStage, and helpers for pipe:/fsspec transport. Previously-flagged SIGPIPE, limit propagation, and _should_segment duration issues are resolved; minor redundant _should_segment call, O(N²) membership test, and double tar-stream remain.
nemo_curator/stages/text/io/reader/jsonl.py Extends JsonlReader with task_type="audio" path via new JsonlAudioReaderStage. Generic output type updated to DocumentBatch
nemo_curator/tasks/audio_task.py Adds sample_key field and build_audio_sample_key helper for stable per-sample identity. _AttrDict.setattr/delattr dead-code issue from prior review is fixed — methods are properly at class scope.
tests/stages/audio/io/test_tarred.py 471-line test file covering partition, reader, materialization, durable materialization, pipe transport, SIGPIPE handling, shared-member decode optimization, and end-to-end round-trip. Coverage is thorough.
tests/stages/text/io/reader/test_jsonl.py Adds TestJsonlAudioReader with 8 test cases covering field filtering, sample_key preservation, composite decomposition, pipeline integration, and ID generation/assignment.
tests/tasks/test_audio_task.py Adds tests for explicit sample_key propagation from data and stability of build_audio_sample_key hash.
nemo_curator/stages/audio/init.py Exports the four new public symbols from the audio sub-package; straightforward.
nemo_curator/stages/audio/io/init.py New init.py wiring up the io sub-package exports; straightforward.

Sequence Diagram

sequenceDiagram
    participant P as Pipeline
    participant TAMR as TarredAudioManifestReader
    participant TAMPS as TarredAudioManifestPartitionStage
    participant TAMRS as TarredAudioManifestReaderStage
    participant MTAS as MaterializeTarredAudioStage
    participant DS as Downstream ASR/WER Stage
    participant CTAS as CleanupTemporaryAudioStage

    P->>TAMR: process(_EmptyTask)
    TAMR->>TAMPS: decompose → partition manifest paths
    TAMPS-->>TAMR: list[FileGroupTask]
    TAMR->>TAMRS: process(FileGroupTask)
    TAMRS->>TAMRS: stream tar → enumerate member names
    TAMRS->>TAMRS: stream manifest → match entries to members
    TAMRS-->>TAMR: list[AudioTask] (with _tar_path, _tar_member, sample_key)
    TAMR-->>P: list[AudioTask]

    P->>MTAS: process_batch(list[AudioTask])
    MTAS->>MTAS: group tasks by (tar_path, tar_member)
    MTAS->>MTAS: stream tar → extract raw bytes once per member
    MTAS->>MTAS: decode waveform once per shared member (segment tasks)
    MTAS-->>P: list[AudioTask] (audio_filepath → local temp/durable path)

    P->>DS: process_batch(list[AudioTask])
    DS-->>P: list[AudioTask] (with pred_text, wer, …)

    P->>CTAS: process(AudioTask)
    CTAS->>CTAS: unlink temp file (if temporary)
    CTAS->>CTAS: restore audio_filepath to manifest path
    CTAS-->>P: AudioTask (cleaned)
Loading

Reviews (9): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

Comment thread nemo_curator/stages/audio/io/tarred.py Outdated
Comment thread nemo_curator/stages/text/io/reader/jsonl.py Outdated
Comment thread nemo_curator/stages/audio/io/tarred.py
Comment thread nemo_curator/stages/audio/io/tarred.py Outdated
Signed-off-by: Sasha Meister <ameister@nvidia.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 9, 2026

Tip:

Greploop — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

@Jorjeous
Copy link
Copy Markdown
Member

Jorjeous commented Apr 9, 2026

@ssh-meister plz fix linter and cherry pick this to dev branch with #1780 hashtag

Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Comment thread nemo_curator/stages/audio/io/tarred.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants