
feat: add voxcpm validate CLI for pre-flight training data checks#253

Open
SuperMarioYL wants to merge 2 commits into OpenBMB:main from SuperMarioYL:feat/validate-training-data

Conversation

@SuperMarioYL

Summary

Adds a new validate subcommand to the VoxCPM CLI that performs pre-flight validation of JSONL training manifests without requiring GPU or model loading. This catches data format issues and missing files before starting expensive fine-tuning jobs.

Motivation

Users starting multi-hour GPU fine-tuning runs often discover data format issues only after the model is loaded and training begins. The validate command lets users check their manifests upfront, saving time and compute.

Related: #185 (validation problems during fine-tuning)

What it does

# Basic validation
voxcpm validate --manifest train.jsonl

# With options
voxcpm validate --manifest train.jsonl --sample-rate 16000 --max-samples 100 --verbose

Checks performed:

  • JSONL format validity (each line must be valid JSON)
  • Required columns present (text, audio)
  • Audio file existence and readability
  • Duration statistics (min, max, mean, median, total hours)
  • Text length statistics
  • Optional ref_audio column validation
  • Warnings for very short (<0.3s) or very long (>30s) audio samples

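For reference, the heart of these checks could be sketched with stdlib only. This is an illustrative outline, not the PR's actual code (the real module in src/voxcpm/training/validate.py also gathers duration and text-length statistics); `validate_manifest_sketch` is a hypothetical name, and this version checks file existence only:

```python
import json
from pathlib import Path

def validate_manifest_sketch(manifest_path):
    """Hypothetical sketch: JSONL format, required columns, audio existence."""
    manifest_path = Path(manifest_path)
    valid, errors = 0, []
    for lineno, line in enumerate(
        manifest_path.read_text(encoding="utf-8").splitlines(), start=1
    ):
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        missing = [col for col in ("text", "audio") if col not in row]
        if missing:
            errors.append(f"line {lineno}: missing columns {missing}")
            continue
        # Audio paths are resolved relative to the manifest directory.
        audio = manifest_path.parent / row["audio"]
        if not audio.is_file():
            errors.append(f"line {lineno}: audio file not found: {audio}")
            continue
        valid += 1
    return valid, errors
```

Because it touches only `json` and `pathlib`, a loop like this runs on CPU-only machines with no model dependencies.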
Example output:

============================================================
  VoxCPM Training Data Validation Report
============================================================
  Manifest : train.jsonl
  Samples  : 1000/1000 valid

  Audio Duration Statistics:
    Total    : 2.34 hours
    Range    : 0.50s - 25.3s
    Mean     : 8.42s
    Median   : 7.15s

  Text Length Statistics (characters):
    Range    : 5 - 512
    Mean     : 87

============================================================
  PASSED: Manifest is valid for training.
============================================================

Design decisions

  • No GPU/model dependency: validation runs on CPU using only stdlib + optional soundfile
  • Lazy imports: soundfile is imported only when needed, so the module works in minimal environments
  • Relative path resolution: audio paths are resolved relative to the manifest directory
  • Truncated error reporting: shows first 5 errors of each type, then summarizes the rest
  • Exit code: returns 1 on failure for CI integration

Files changed

File Change
src/voxcpm/training/validate.py New validation module (~200 lines of logic)
src/voxcpm/cli.py Add validate subparser and cmd_validate handler
src/voxcpm/training/__init__.py Export validate_manifest and ValidationResult
tests/test_validate.py 11 unit tests covering all validation paths

Testing

All 11 new tests pass:

tests/test_validate.py - 11 passed in 0.03s

Add a new `validate` subcommand that checks JSONL training manifests
before starting expensive fine-tuning jobs. This catches format issues,
missing audio files, and data quality problems early.

The validator performs:
- JSONL format validation (each line must be valid JSON)
- Required column checks (text, audio)
- Audio file existence and readability verification
- Duration and text length statistics (min, max, mean, median)
- Optional ref_audio column validation
- Warnings for very short (<0.3s) or very long (>30s) audio samples

Usage:
  voxcpm validate --manifest train.jsonl
  voxcpm validate --manifest train.jsonl --sample-rate 16000 --verbose

The module uses lazy imports for soundfile, so it works even in
minimal environments. Includes 11 unit tests covering all validation
paths.
@a710128
Contributor

a710128 commented Apr 21, 2026

Thanks for putting this together — the new validate command is useful, and I like the overall direction of catching manifest problems before users start expensive training runs.

I pulled the PR locally and tested the implementation in an isolated checkout. I found a few behavior gaps that I think are worth fixing before merge:

  1. valid_samples can include invalid rows
    In validate_manifest(), rows are only excluded from valid_samples when required columns are missing. If the row has an invalid audio path (or the audio cannot be read), an error is recorded, but the row still increments valid_samples.

I verified this with a manifest containing a nonexistent audio path. The result had total_samples = 1, valid_samples = 1, and an audio file not found error. The CLI output also showed Samples: 1/1 valid while still printing an error below. That makes the report misleading even though the final exit code is failure. I think any row that produces a validation error should not be counted as valid.

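The counting rule being suggested can be sketched as follows (hypothetical names; `audio_exists` is a predicate injected here purely for testability, not part of the PR's API):

```python
def count_valid_rows(rows, audio_exists):
    """Sketch of the suggested rule: a row counts as valid only when
    every check on it passed, not just the required-column check."""
    valid_samples = 0
    errors = []
    for i, row in enumerate(rows):
        has_error = False
        for col in ("text", "audio"):
            if col not in row:
                errors.append(f"row {i}: missing column '{col}'")
                has_error = True
        if not has_error and not audio_exists(row["audio"]):
            errors.append(f"row {i}: audio file not found: {row['audio']}")
            has_error = True  # without this, the row still counted as valid
        if not has_error:
            valid_samples += 1
    return valid_samples, errors
```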
  2. --sample-rate is exposed but not actually validated
    The CLI accepts --sample-rate, and the PR description also suggests sample-rate validation is part of the pre-flight checks. But the current implementation never compares the actual file sample rate against the expected one.

I tested this with an 8000 Hz WAV while passing --sample-rate 16000, and the validator still returned no errors, no warnings, and is_valid = True. So at the moment this flag is effectively unused. Either the validator should enforce this check, or the option/help text/PR description should be adjusted so users are not misled.

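A sketch of the missing check. The eventual fix reportedly compares against `sf.info(path).samplerate`; this standalone version uses the stdlib `wave` module instead so it runs without soundfile, and therefore only handles WAV files:

```python
import wave

def check_sample_rate(path, expected_rate):
    """Return an error string when the file's rate differs from the
    expected one, or None when it matches (or no rate was requested)."""
    with wave.open(path, "rb") as wav:
        actual = wav.getframerate()
    if expected_rate is not None and actual != expected_rate:
        return f"{path}: sample rate {actual} Hz != expected {expected_rate} Hz"
    return None
```

An 8000 Hz file checked against `--sample-rate 16000` should produce an error rather than silently passing.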
  3. Missing ref_audio can be silently skipped
    For ref_audio, the warning path currently depends on whether has_ref_audio is still 0. That means once the validator has already seen one valid ref_audio, later missing ref_audio files no longer generate warnings.

I reproduced this with one row containing a valid ref_audio and a second row containing a missing ref_audio. The result counted the valid reference audio, but did not warn about the missing one. I think each invalid ref_audio should be handled independently, otherwise mixed-quality manifests can hide real issues.

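The per-entry behaviour being requested can be sketched like this (hypothetical names; `audio_exists` is again an injected predicate):

```python
def ref_audio_warnings(rows, audio_exists):
    """Sketch: every missing ref_audio warns independently, with no
    guard on whether a valid one was already seen."""
    has_ref_audio = 0
    warnings = []
    for i, row in enumerate(rows):
        ref = row.get("ref_audio")
        if ref is None:
            continue  # column is optional
        if audio_exists(ref):
            has_ref_audio += 1
        else:
            # No `elif has_ref_audio == 0` guard here: warn every time.
            warnings.append(f"row {i}: ref_audio not found: {ref}")
    return has_ref_audio, warnings
```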
  4. Tests are missing coverage for the above cases
    The current tests pass, but they do not cover invalid audio rows not incrementing valid_samples, sample-rate mismatch behavior, mixed ref_audio validity, or CLI-level output and exit code for the validate command.

Overall, the feature itself looks good and the CLI integration seems straightforward. My main concern is that the current reporting can give users a false sense that bad rows are valid, and that --sample-rate appears supported without being enforced.

- Invalid audio rows (bad path or sample-rate mismatch) no longer
  increment valid_samples; has_error is now set on any audio failure
- _check_audio_file now enforces the expected sample rate when soundfile
  is available, making --sample-rate actually useful
- ref_audio missing-file warning is emitted for every invalid entry
  independently, not only before the first valid one is seen
- New tests cover each of the four corrected behaviours: invalid audio
  count, sample-rate mismatch, mixed ref_audio, and CLI exit code
@SuperMarioYL
Author

Thanks for the careful testing — all four issues are now fixed in the latest commit.

1. valid_samples including invalid rows
has_error is now set whenever an audio file fails validation (not only for missing required columns). Any row with a bad audio path or sample-rate mismatch will no longer increment valid_samples.

2. --sample-rate not enforced
_check_audio_file now compares sf.info(path).samplerate against the expected rate when soundfile is available and returns an error on mismatch.

3. Missing ref_audio silently skipped
Removed the elif result.has_ref_audio == 0 guard. Each missing ref_audio entry now emits a warning independently, regardless of whether earlier entries were valid.

4. Missing test coverage
Added four new test cases:

  • test_invalid_audio_not_counted_as_valid — confirms bad-audio rows don't inflate valid_samples
  • test_sample_rate_mismatch — verifies the rate check fires (skipped automatically when soundfile is absent)
  • test_mixed_ref_audio_warns_for_each_missing — catches the mixed-validity scenario you described
  • test_cli_validate_exit_code — ensures the CLI exits non-zero on validation failure

All 14 tests pass locally. Happy to adjust anything further.

Comment thread: tests/test_validate.py
_write_manifest(manifest, [{"text": "hi", "audio": "/nonexistent/x.wav"}])

proc = subprocess.run(
    [sys.executable, "-m", "voxcpm.cli", "validate", str(manifest)],
Contributor


I think this test is currently exercising argument parsing failure rather than validation failure. The validate subcommand requires --manifest or -m, but this invocation passes the manifest path as a positional argument, so argparse exits before cmd_validate runs. Because the assertion only checks returncode != 0, the test can pass for the wrong reason. Could we change this to pass --manifest explicitly and ideally assert the expected exit code or validation output as well?
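A minimal stand-in parser (an assumed shape, not the actual src/voxcpm/cli.py code) illustrates why the positional form exits during argument parsing, before any validation runs:

```python
import argparse

def build_parser():
    """Hypothetical stand-in for the CLI's `validate` subparser:
    --manifest/-m is required and there is no positional argument."""
    parser = argparse.ArgumentParser(prog="voxcpm")
    sub = parser.add_subparsers(dest="command", required=True)
    validate = sub.add_parser("validate")
    validate.add_argument("--manifest", "-m", required=True)
    return parser

def try_parse(argv):
    """Return the parsed namespace, or the exit code argparse raised."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code
```

Passing the manifest positionally makes argparse exit with status 2 (its usage-error code), so a test asserting only `returncode != 0` passes without ever reaching `cmd_validate`.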

