Hands-on tutorials for curating audio data with NeMo Curator. Complete working examples with detailed explanations.
New to audio curation? Start with the Audio Getting Started Guide for setup and basic concepts.
Audio pipelines require ffmpeg for resampling and format conversion. Install it before running any audio tutorial:
# Ubuntu / Debian
sudo apt-get install -y ffmpeg
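Before launching a tutorial, it can help to confirm that ffmpeg is actually visible on PATH from the Python environment that will run the pipeline. The following is a minimal sketch using only the standard library; the helper name `find_ffmpeg` is hypothetical and not part of NeMo Curator.

```python
import shutil
from typing import Optional


def find_ffmpeg(binary: str = "ffmpeg") -> Optional[str]:
    """Return the path of the binary found on PATH, or None if it is missing."""
    return shutil.which(binary)


# Report the result without raising, so the check can be pasted into a notebook.
path = find_ffmpeg()
if path is None:
    print("ffmpeg not found on PATH; install it before running the audio tutorials")
else:
    print(f"ffmpeg found at {path}")
```

The check only confirms that an executable named `ffmpeg` can be located; it does not verify which codecs that build supports.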
| Tutorial | Description | Files |
|---|---|---|
| FLEURS Dataset | Complete pipeline for multilingual speech data | pipeline.py, run.py, pipeline.yaml |
| Audio Tagging | Label raw audio for TTS/ASR via diarization, alignment, and quality metrics | main.py, tts_pipeline.yaml, asr_pipeline.yaml |
| ALM Data Pipeline | Create training windows for Audio Language Models | main.py, pipeline.yaml |
| Category | Links |
|---|---|
| Setup | Installation • Configuration |
| Concepts | Architecture • Data Loading |
| Advanced | Custom Pipelines • Execution Backends • NeMo ASR Integration |
In some environments, and under certain timing conditions, Ray workers may crash with a SIGSEGV during GPU model initialization. This is not a NeMo Curator code issue: it comes from a thread-safety problem in the gRPC version bundled with Ray. Any GPU pipeline (audio, text, image, or video) that loads models through Ray actors can hit the same failure.
The OpenTelemetry SDK starts a PeriodicExportingMetricReader background thread that periodically calls OtlpGrpcMetricExporter::Export() over gRPC; a getenv() call on that path can race with NeMo/PyTorch model initialization in another thread. Disabling OpenTelemetry for the process prevents Ray’s OpenTelemetry background exporter from starting and removes that race. NeMo Curator does not use OpenTelemetry for its own functionality, so disabling it has no functional impact on Curator workflows.
Container scope: This has been observed with the nemo-curator:26.04.rc0 image (and similar 26.04-era builds). The race was fixed upstream in gRPC ≥ 1.60; it should stop being relevant once the bundled gRPC in the container is upgraded accordingly.
Workaround: Set these environment variables before running the pipeline:
export OTEL_SDK_DISABLED=true
export OTEL_METRICS_EXPORTER=none
export OTEL_TRACES_EXPORTER=none

Documentation: Main Docs • API Reference • GitHub Discussions
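When the pipeline is started from a Python launcher rather than a shell, the same three variables from the workaround can be set in-process. The sketch below assumes the launcher itself starts Ray, so that processes spawned afterwards inherit the environment; the names `OTEL_DISABLE_VARS` and `disable_opentelemetry` are illustrative and not part of NeMo Curator or Ray.

```python
import os

# Values copied from the workaround's export statements.
OTEL_DISABLE_VARS = {
    "OTEL_SDK_DISABLED": "true",
    "OTEL_METRICS_EXPORTER": "none",
    "OTEL_TRACES_EXPORTER": "none",
}


def disable_opentelemetry(env=os.environ):
    """Write the workaround variables into env and return that mapping."""
    env.update(OTEL_DISABLE_VARS)
    return env


# Run this at the very top of the launcher, before importing Ray or building
# the pipeline, so that child processes started later inherit the values.
disable_opentelemetry()
```

This only affects the current process and the processes it starts afterwards. A Ray cluster that is already running would most likely need the variables set in its own environment, as with the shell exports.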