Audio Curation Tutorials

Hands-on tutorials for curating audio data with NeMo Curator. Complete working examples with detailed explanations.

Quick Start

New to audio curation? Start with the Audio Getting Started Guide for setup and basic concepts.

System Dependencies

Audio pipelines require ffmpeg for resampling and format conversion. Install it before running any audio tutorial:

# Ubuntu / Debian
sudo apt-get install -y ffmpeg
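To confirm the dependency is actually visible on PATH before launching a pipeline, a small check like the one below can help. This is only a sketch: `require_cmd` is a hypothetical helper name and is not part of NeMo Curator.

```shell
# Hypothetical helper (not part of NeMo Curator): report whether a command is on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Print a hint instead of failing hard when ffmpeg is absent.
require_cmd ffmpeg || echo "Install ffmpeg before running the audio tutorials."
```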

Available Tutorials

| Tutorial | Description | Files |
|----------|-------------|-------|
| FLEURS Dataset | Complete pipeline for multilingual speech data | pipeline.py, run.py, pipeline.yaml |
| Audio Tagging | Label raw audio for TTS/ASR via diarization, alignment, and quality metrics | main.py, tts_pipeline.yaml, asr_pipeline.yaml |
| ALM Data Pipeline | Create training windows for Audio Language Models | main.py, pipeline.yaml |

Documentation Links

| Category | Links |
|----------|-------|
| Setup | Installation, Configuration |
| Concepts | Architecture, Data Loading |
| Advanced | Custom Pipelines, Execution Backends, NeMo ASR Integration |

Known Issues

SIGSEGV in Ray StageWorker during model loading

In some environments, and under certain timing conditions, Ray workers may crash with a SIGSEGV during GPU model initialization. This is not a NeMo Curator code issue: it comes from a thread-safety problem in the gRPC version bundled with Ray. Any GPU pipeline (audio, text, image, or video) that loads models through Ray actors can hit the same failure.

The OpenTelemetry SDK starts a PeriodicExportingMetricReader background thread that periodically calls OtlpGrpcMetricExporter::Export() over gRPC; a getenv() call on that path can race with NeMo/PyTorch model initialization in another thread. Disabling OpenTelemetry for the process prevents Ray’s OpenTelemetry background exporter from starting and removes that race. NeMo Curator does not use OpenTelemetry for its own functionality, so disabling it has no functional impact on Curator workflows.

Container scope: This has been observed with the nemo-curator:26.04.rc0 image (and similar 26.04-era builds). The race was fixed upstream in gRPC ≥ 1.60; it should stop being relevant once the bundled gRPC in the container is upgraded accordingly.

Workaround: Set these environment variables before running the pipeline:

export OTEL_SDK_DISABLED=true
export OTEL_METRICS_EXPORTER=none
export OTEL_TRACES_EXPORTER=none
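If exporting these variables into the whole shell session is undesirable, the same settings can be scoped to a single command. The wrapper below is a minimal sketch: `run_without_otel` is a hypothetical name, and it assumes the pipeline command starts its own local Ray processes so that the workers inherit the environment. A Ray cluster that is already running would not pick these values up.

```shell
# Hypothetical wrapper: run one command with the OpenTelemetry workaround applied.
# The variables are set only for the wrapped command and its child processes.
run_without_otel() {
    env OTEL_SDK_DISABLED=true \
        OTEL_METRICS_EXPORTER=none \
        OTEL_TRACES_EXPORTER=none \
        "$@"
}

# Sanity check: show what the wrapped command sees.
run_without_otel sh -c 'echo "OTEL_SDK_DISABLED=$OTEL_SDK_DISABLED"'
```

To use it for a real run, pass the tutorial's own launch command as the arguments in place of the `sh -c` check.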

Support

Documentation: Main Docs, API Reference, GitHub Discussions