ci(speech): split L0_Unit_Tests_GPU_ASR into 5 parallel buckets#15654

Open
ko3n1g wants to merge 4 commits into `main` from `ko3n1g/feat/split-asr-l0-unit-tests`

Conversation


@ko3n1g commented Apr 29, 2026

Claude summary

L0_Unit_Tests_GPU_ASR was a single job running the entire tests/collections/asr/ directory sequentially, taking ~40 minutes wall-clock. This PR splits it into 5 parallel jobs by grouping test files based on observed durations from run 25112807335.

| Job | Contents | Est. time |
|-----|----------|-----------|
| L0_Unit_Tests_GPU_ASR_1 | confidence/ + fast decoding tests | ~6.6m |
| L0_Unit_Tests_GPU_ASR_2 | streaming/rnnt decoding + inference/ + k2/ + mixins/ | ~9.8m |
| L0_Unit_Tests_GPU_ASR_3 | numba/ + hybrid/interctc/local-attn models | ~8.6m |
| L0_Unit_Tests_GPU_ASR_4 | test_asr_multitask_model_bpe.py (single file, irreducible) | ~10.6m ⚠️ |
| L0_Unit_Tests_GPU_ASR_5 | RNNT encoder models + remaining small tests + utils/ | ~4.9m |

Note: Bucket 4 (test_asr_multitask_model_bpe.py) marginally exceeds the 10-minute target. Shrinking it further would require splitting within the test class.

Each job timeout is set to 15 minutes (down from 60).

The monolithic job ran 40+ min total. Each bucket targets ≤10 min,
distributed by observed wall-clock time from run 25112807335.

Bucket mapping (approx times):
  1 (~6.6m): confidence/ + fast decoding tests
  2 (~9.8m): streaming/rnnt decoding + inference/ + k2/ + mixins/
  3 (~8.6m): numba/ + hybrid/interctc/local_attn models
  4 (~10.6m): test_asr_multitask_model_bpe.py (single file, irreducible)
  5 (~4.9m): rnnt encoder models + remaining small tests + utils/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
… flaky batchnorm test

ASR_1/ASR_2 were failing because the split scripts were missing
TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1, which is required by all other
model-loading test scripts (Core, Common, TTS). The .nemo checkpoints
on /home/TestData use a legacy PyTorch storage format incompatible with
weights_only=True in PyTorch 2.6+.
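
The failure mode described above can be reproduced in isolation with any checkpoint that pickles a non-tensor object (a minimal sketch, not the actual .nemo loading path; `LegacyConfig` is a hypothetical stand-in for the config objects stored inside old checkpoints):

```python
import io
import torch

class LegacyConfig:
    """Hypothetical stand-in for a non-tensor object pickled into an old checkpoint."""
    def __init__(self, value):
        self.value = value

buf = io.BytesIO()
torch.save({"weights": torch.ones(3), "cfg": LegacyConfig(42)}, buf)

# weights_only=True (the PyTorch 2.6+ default) rejects arbitrary pickled classes
buf.seek(0)
try:
    torch.load(buf, weights_only=True)
    strict_load_ok = True
except Exception:
    strict_load_ok = False  # raises: LegacyConfig is not an allowlisted global

# TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1 effectively restores this permissive path
buf.seek(0)
state = torch.load(buf, weights_only=False)
```

Setting the environment variable in the split scripts is equivalent to passing `weights_only=False` at every internal `torch.load` call site, which is what the legacy checkpoints on /home/TestData require.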

ASR_5 was failing because test_from_batchnorm is order-dependent: the
monolithic run consumed random state from ~1641 prior tests, whereas
the isolated split starts fresh. Fix: add torch.manual_seed(0) for
determinism and use atol=1e-5 to reflect the float32 rounding
difference between fused (x*W+B) and standard ((x-mean)/std*w+b)
formulations.
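
The two formulations the commit refers to can be compared directly (a small illustrative sketch, not the actual test; shapes and parameter values are arbitrary):

```python
import torch

torch.manual_seed(0)  # fixed seed, mirroring the determinism fix described above

x = torch.randn(8, 16)
mean = torch.randn(16)
var = torch.rand(16) + 0.1
gamma = torch.randn(16)
beta = torch.randn(16)
eps = 1e-5

# standard batchnorm inference: (x - mean) / std * w + b
std = torch.sqrt(var + eps)
y_standard = (x - mean) / std * gamma + beta

# fused form: fold mean/std into a single scale W and bias B, then x * W + B
W = gamma / std
B = beta - mean * W
y_fused = x * W + B

# mathematically identical, but the operation order differs, so bitwise
# equality can fail in float32; atol=1e-5 absorbs the rounding gap
assert torch.allclose(y_standard, y_fused, atol=1e-5)
```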

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
ko3n1g and others added 2 commits April 29, 2026 22:06
…ripts

The original L0_Unit_Tests_GPU_ASR.sh launched with:
  python -c "from nemo.collections.asr.models import ASRModel" && ...
which performs the required module initialization before running the test
suite. All five splits were missing this prefix, causing failures in
model-loading tests (kenlm, RNNT decoding) that depend on the import
side-effects triggered by ASRModel initialization.
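
The launch pattern above relies on shell short-circuiting: pytest only starts if the import, and therefore its side effects, succeeded. A runnable sketch of the pattern (using `json` as a stand-in, since NeMo itself is not assumed installed here; the original scripts invoke `python` rather than `python3`):

```shell
# Original prefix the split scripts were missing:
#   python -c "from nemo.collections.asr.models import ASRModel" && <pytest command>
# `json` stands in for the NeMo import so this sketch runs anywhere.
python3 -c "import json" && echo "import succeeded; tests would run now"
python3 -c "import not_a_real_module" 2>/dev/null || echo "import failed; tests skipped"
```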

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>