
feat: add voxcpm validate CLI for pre-flight training data checks#253

Open
SuperMarioYL wants to merge 2 commits into OpenBMB:main from SuperMarioYL:feat/validate-training-data

Conversation

@SuperMarioYL

Summary

Adds a new validate subcommand to the VoxCPM CLI that performs pre-flight validation of JSONL training manifests without requiring GPU or model loading. This catches data format issues and missing files before starting expensive fine-tuning jobs.

Motivation

Users starting multi-hour GPU fine-tuning runs often discover data format issues only after the model is loaded and training begins. The validate command lets users check their manifests upfront, saving time and compute.

Related: #185 (validation problems during fine-tuning)

What it does

# Basic validation
voxcpm validate --manifest train.jsonl

# With options
voxcpm validate --manifest train.jsonl --sample-rate 16000 --max-samples 100 --verbose

Checks performed:

  • JSONL format validity (each line must be valid JSON)
  • Required columns present (text, audio)
  • Audio file existence and readability
  • Duration statistics (min, max, mean, median, total hours)
  • Text length statistics
  • Optional ref_audio column validation
  • Warnings for very short (<0.3s) or very long (>30s) audio samples

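For reference, the heart of these checks could be sketched with stdlib only. This is an illustrative outline, not the PR's actual code (the real module in src/voxcpm/training/validate.py also gathers duration and text-length statistics); `validate_manifest_sketch` is a hypothetical name, and this version checks file existence only:

```python
import json
from pathlib import Path

def validate_manifest_sketch(manifest_path):
    """Hypothetical sketch: JSONL format, required columns, audio existence."""
    manifest_path = Path(manifest_path)
    valid, errors = 0, []
    for lineno, line in enumerate(
        manifest_path.read_text(encoding="utf-8").splitlines(), start=1
    ):
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        missing = [col for col in ("text", "audio") if col not in row]
        if missing:
            errors.append(f"line {lineno}: missing columns {missing}")
            continue
        # Audio paths are resolved relative to the manifest directory.
        audio = manifest_path.parent / row["audio"]
        if not audio.is_file():
            errors.append(f"line {lineno}: audio file not found: {audio}")
            continue
        valid += 1
    return valid, errors
```

Because it touches only `json` and `pathlib`, a loop like this runs on CPU-only machines with no model dependencies.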
Example output:

============================================================
  VoxCPM Training Data Validation Report
============================================================
  Manifest : train.jsonl
  Samples  : 1000/1000 valid

  Audio Duration Statistics:
    Total    : 2.34 hours
    Range    : 0.50s - 25.3s
    Mean     : 8.42s
    Median   : 7.15s

  Text Length Statistics (characters):
    Range    : 5 - 512
    Mean     : 87

============================================================
  PASSED: Manifest is valid for training.
============================================================

Design decisions

  • No GPU/model dependency: validation runs on CPU using only stdlib + optional soundfile
  • Lazy imports: soundfile is imported only when needed, so the module works in minimal environments
  • Relative path resolution: audio paths are resolved relative to the manifest directory
  • Truncated error reporting: shows first 5 errors of each type, then summarizes the rest
  • Exit code: returns 1 on failure for CI integration

Files changed

File Change
src/voxcpm/training/validate.py New validation module (~200 lines of logic)
src/voxcpm/cli.py Add validate subparser and cmd_validate handler
src/voxcpm/training/__init__.py Export validate_manifest and ValidationResult
tests/test_validate.py 11 unit tests covering all validation paths

Testing

All 11 new tests pass:

tests/test_validate.py - 11 passed in 0.03s

Add a new `validate` subcommand that checks JSONL training manifests
before starting expensive fine-tuning jobs. This catches format issues,
missing audio files, and data quality problems early.

The validator performs:
- JSONL format validation (each line must be valid JSON)
- Required column checks (text, audio)
- Audio file existence and readability verification
- Duration and text length statistics (min, max, mean, median)
- Optional ref_audio column validation
- Warnings for very short (<0.3s) or very long (>30s) audio samples

Usage:
  voxcpm validate --manifest train.jsonl
  voxcpm validate --manifest train.jsonl --sample-rate 16000 --verbose

The module uses lazy imports for soundfile, so it works even in
minimal environments. Includes 11 unit tests covering all validation
paths.
@a710128
Contributor

a710128 commented Apr 21, 2026

Thanks for putting this together — the new validate command is useful, and I like the overall direction of catching manifest problems before users start expensive training runs.

I pulled the PR locally and tested the implementation in an isolated checkout. I found a few behavior gaps that I think are worth fixing before merge:

  1. valid_samples can include invalid rows
    In validate_manifest(), rows are only excluded from valid_samples when required columns are missing. If the row has an invalid audio path (or the audio cannot be read), an error is recorded, but the row still increments valid_samples.

I verified this with a manifest containing a nonexistent audio path. The result had total_samples = 1, valid_samples = 1, and an audio file not found error. The CLI output also showed Samples: 1/1 valid while still printing an error below. That makes the report misleading even though the final exit code is failure. I think any row that produces a validation error should not be counted as valid.

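The counting rule being suggested can be sketched as follows (hypothetical names; `audio_exists` is a predicate injected here purely for testability, not part of the PR's API):

```python
def count_valid_rows(rows, audio_exists):
    """Sketch of the suggested rule: a row counts as valid only when
    every check on it passed, not just the required-column check."""
    valid_samples = 0
    errors = []
    for i, row in enumerate(rows):
        has_error = False
        for col in ("text", "audio"):
            if col not in row:
                errors.append(f"row {i}: missing column '{col}'")
                has_error = True
        if not has_error and not audio_exists(row["audio"]):
            errors.append(f"row {i}: audio file not found: {row['audio']}")
            has_error = True  # without this, the row still counted as valid
        if not has_error:
            valid_samples += 1
    return valid_samples, errors
```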
  2. --sample-rate is exposed but not actually validated
    The CLI accepts --sample-rate, and the PR description also suggests sample-rate validation is part of the pre-flight checks. But the current implementation never compares the actual file sample rate against the expected one.

I tested this with an 8000 Hz WAV while passing --sample-rate 16000, and the validator still returned no errors, no warnings, and is_valid = True. So at the moment this flag is effectively unused. Either the validator should enforce this check, or the option/help text/PR description should be adjusted so users are not misled.

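A sketch of the missing check. The eventual fix reportedly compares against `sf.info(path).samplerate`; this standalone version uses the stdlib `wave` module instead so it runs without soundfile, and therefore only handles WAV files:

```python
import wave

def check_sample_rate(path, expected_rate):
    """Return an error string when the file's rate differs from the
    expected one, or None when it matches (or no rate was requested)."""
    with wave.open(path, "rb") as wav:
        actual = wav.getframerate()
    if expected_rate is not None and actual != expected_rate:
        return f"{path}: sample rate {actual} Hz != expected {expected_rate} Hz"
    return None
```

An 8000 Hz file checked against `--sample-rate 16000` should produce an error rather than silently passing.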
  3. Missing ref_audio can be silently skipped
    For ref_audio, the warning path currently depends on whether has_ref_audio is still 0. That means once the validator has already seen one valid ref_audio, later missing ref_audio files no longer generate warnings.

I reproduced this with one row containing a valid ref_audio and a second row containing a missing ref_audio. The result counted the valid reference audio, but did not warn about the missing one. I think each invalid ref_audio should be handled independently, otherwise mixed-quality manifests can hide real issues.

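The per-entry behaviour being requested can be sketched like this (hypothetical names; `audio_exists` is again an injected predicate):

```python
def ref_audio_warnings(rows, audio_exists):
    """Sketch: every missing ref_audio warns independently, with no
    guard on whether a valid one was already seen."""
    has_ref_audio = 0
    warnings = []
    for i, row in enumerate(rows):
        ref = row.get("ref_audio")
        if ref is None:
            continue  # column is optional
        if audio_exists(ref):
            has_ref_audio += 1
        else:
            # No `elif has_ref_audio == 0` guard here: warn every time.
            warnings.append(f"row {i}: ref_audio not found: {ref}")
    return has_ref_audio, warnings
```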
  4. Tests are missing coverage for the above cases
    The current tests pass, but they do not cover invalid audio rows not incrementing valid_samples, sample-rate mismatch behavior, mixed ref_audio validity, or CLI-level output and exit code for the validate command.

Overall, the feature itself looks good and the CLI integration seems straightforward. My main concern is that the current reporting can give users a false sense that bad rows are valid, and that --sample-rate appears supported without being enforced.

- Invalid audio rows (bad path or sample-rate mismatch) no longer
  increment valid_samples; has_error is now set on any audio failure
- _check_audio_file now enforces the expected sample rate when soundfile
  is available, making --sample-rate actually useful
- ref_audio missing-file warning is emitted for every invalid entry
  independently, not only before the first valid one is seen
- New tests cover each of the four corrected behaviours: invalid audio
  count, sample-rate mismatch, mixed ref_audio, and CLI exit code
@SuperMarioYL
Author

Thanks for the careful testing — all four issues are now fixed in the latest commit.

1. valid_samples including invalid rows
has_error is now set whenever an audio file fails validation (not only for missing required columns). Any row with a bad audio path or sample-rate mismatch will no longer increment valid_samples.

2. --sample-rate not enforced
_check_audio_file now compares sf.info(path).samplerate against the expected rate when soundfile is available and returns an error on mismatch.

3. Missing ref_audio silently skipped
Removed the elif result.has_ref_audio == 0 guard. Each missing ref_audio entry now emits a warning independently, regardless of whether earlier entries were valid.

4. Missing test coverage
Added four new test cases:

  • test_invalid_audio_not_counted_as_valid — confirms bad-audio rows don't inflate valid_samples
  • test_sample_rate_mismatch — verifies the rate check fires (skipped automatically when soundfile is absent)
  • test_mixed_ref_audio_warns_for_each_missing — catches the mixed-validity scenario you described
  • test_cli_validate_exit_code — ensures the CLI exits non-zero on validation failure

All 14 tests pass locally. Happy to adjust anything further.

Comment thread: tests/test_validate.py
_write_manifest(manifest, [{"text": "hi", "audio": "/nonexistent/x.wav"}])

proc = subprocess.run(
    [sys.executable, "-m", "voxcpm.cli", "validate", str(manifest)],
Contributor


I think this test is currently exercising argument parsing failure rather than validation failure. The validate subcommand requires --manifest or -m, but this invocation passes the manifest path as a positional argument, so argparse exits before cmd_validate runs. Because the assertion only checks returncode != 0, the test can pass for the wrong reason. Could we change this to pass --manifest explicitly and ideally assert the expected exit code or validation output as well?
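A minimal stand-in parser (an assumed shape, not the actual src/voxcpm/cli.py code) illustrates why the positional form exits during argument parsing, before any validation runs:

```python
import argparse

def build_parser():
    """Hypothetical stand-in for the CLI's `validate` subparser:
    --manifest/-m is required and there is no positional argument."""
    parser = argparse.ArgumentParser(prog="voxcpm")
    sub = parser.add_subparsers(dest="command", required=True)
    validate = sub.add_parser("validate")
    validate.add_argument("--manifest", "-m", required=True)
    return parser

def try_parse(argv):
    """Return the parsed namespace, or the exit code argparse raised."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code
```

Passing the manifest positionally makes argparse exit with status 2 (its usage-error code), so a test asserting only `returncode != 0` passes without ever reaching `cmd_validate`.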

