Add audio JSONL reader and tarred ASR dataset support#1780
Add audio JSONL reader and tarred ASR dataset support#1780ssh-meister wants to merge 16 commits intomainfrom
Conversation
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Greptile SummaryThis PR introduces Confidence Score: 5/5Safe to merge; all previously raised P1 issues are resolved and only minor style/performance P2 items remain. All blocking issues from prior review rounds have been addressed: _AttrDict dead-code is fixed, _should_segment now checks duration is not None, _PipeStream tolerates SIGPIPE with allow_sigpipe, and limit is propagated through TarredAudioManifestReader.decompose(). The three remaining comments are P2 style/performance suggestions that do not affect correctness. nemo_curator/stages/audio/io/tarred.py — minor redundant _should_segment call, O(N²) segment classification, and double tar stream are worth a follow-up pass but do not block merge. Important Files Changed
Sequence DiagramsequenceDiagram
participant P as Pipeline
participant TAMR as TarredAudioManifestReader
participant TAMPS as TarredAudioManifestPartitionStage
participant TAMRS as TarredAudioManifestReaderStage
participant MTAS as MaterializeTarredAudioStage
participant DS as Downstream ASR/WER Stage
participant CTAS as CleanupTemporaryAudioStage
P->>TAMR: process(_EmptyTask)
TAMR->>TAMPS: decompose → partition manifest paths
TAMPS-->>TAMR: list[FileGroupTask]
TAMR->>TAMRS: process(FileGroupTask)
TAMRS->>TAMRS: stream tar → enumerate member names
TAMRS->>TAMRS: stream manifest → match entries to members
TAMRS-->>TAMR: list[AudioTask] (with _tar_path, _tar_member, sample_key)
TAMR-->>P: list[AudioTask]
P->>MTAS: process_batch(list[AudioTask])
MTAS->>MTAS: group tasks by (tar_path, tar_member)
MTAS->>MTAS: stream tar → extract raw bytes once per member
MTAS->>MTAS: decode waveform once per shared member (segment tasks)
MTAS-->>P: list[AudioTask] (audio_filepath → local temp/durable path)
P->>DS: process_batch(list[AudioTask])
DS-->>P: list[AudioTask] (with pred_text, wer, …)
P->>CTAS: process(AudioTask)
CTAS->>CTAS: unlink temp file (if temporary)
CTAS->>CTAS: restore audio_filepath to manifest path
CTAS-->>P: AudioTask (cleaned)
Reviews (9): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile |
Signed-off-by: Sasha Meister <ameister@nvidia.com>
|
Tip: Greploop — Automatically fix all review issues by running Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal. |
|
@ssh-meister plz fix linter and cherry pick this to dev branch with #1780 hashtag |
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
Description
Adds audio-focused JSONL reading and tarred ASR dataset support to NeMo Curator.
This PR extends
JsonlReaderwithtask_type="audio"so JSONL manifests can now emitAudioTaskobjects directly instead of onlyDocumentBatch. The new audio reader path supports one-manifest-line-per-task fanout and also preserves stable_curator_dedup_idassignment via_generate_ids/_assign_ids.In addition, this PR introduces a bridge for NeMo-style tarred audio datasets:
TarredAudioManifestReaderreads sharded manifests and matches them to tar shards by shard idMaterializeTarredAudioStageextracts only the needed tar member into a temporary local file just before path-based audio stagesCleanupTemporaryAudioStageremoves those temporary files afterwards and restores the original manifest-styleaudio_filepathThis avoids eager extraction of the whole tar dataset, keeps compatibility with existing path-based ASR stages, and supports both strict and permissive handling of manifest entries missing from tar shards via
skip_missing_entries.TarredAudioManifestReaderalso follows the transport ideas used in NeMo's nemo_adapters.py, including support forfsspec-based remote access andpipe:-style specifiers, so the same reader can work with local files, S3-backed storage, and AIS-style commands such aspipe:ais get ....Usage
UPD
This PR now also adds the first building blocks for audio-stage checkpointing and resume.
A stable
sample_keyhas been introduced forAudioTaskas a per-sample identity that is independent of runtime-specifictask_idvalues. This key is now populated by both audio reader paths:JsonlReader(task_type="audio")/JsonlAudioReaderStageTarredAudioManifestReaderStageIf an input manifest entry already contains
sample_key, it is preserved as-is. Otherwise, a deterministic key is derived from stable sample identity fields such asaudio_filepath, tar shard/member metadata,offset,duration, anddataset_name.In addition,
MaterializeTarredAudioStagenow supports optional durable materialization via:materialization_dir: str | None = NoneBehavior is:
materialization_dir is None, the stage keeps the previous behavior and writes extracted audio to temporary local filesmaterialization_diris set, the stage writes extracted or segmented audio to a deterministic durable path derived fromsample_keyThis makes the tarred-audio bridge more checkpoint/resume-friendly without changing the default behavior for existing pipelines.
CleanupTemporaryAudioStagecontinues to clean up true temporary files, while durable materialized files are left intact andaudio_filepathis still restored back to the original manifest-style path after cleanup.These changes do not yet implement full pipeline-level checkpoint orchestration, but they establish the two core primitives needed for that next step:
sample_key)materialization_dir)