feat: add voxcpm validate CLI for pre-flight training data checks#253
feat: add voxcpm validate CLI for pre-flight training data checks#253SuperMarioYL wants to merge 2 commits intoOpenBMB:mainfrom
Conversation
Add a new `validate` subcommand that checks JSONL training manifests before starting expensive fine-tuning jobs. This catches format issues, missing audio files, and data quality problems early. The validator performs: - JSONL format validation (each line must be valid JSON) - Required column checks (text, audio) - Audio file existence and readability verification - Duration and text length statistics (min, max, mean, median) - Optional ref_audio column validation - Warnings for very short (<0.3s) or very long (>30s) audio samples Usage: voxcpm validate --manifest train.jsonl voxcpm validate --manifest train.jsonl --sample-rate 16000 --verbose The module uses lazy imports for soundfile, so it works even in minimal environments. Includes 11 unit tests covering all validation paths.
|
Thanks for putting this together — the new validate command is useful, and I like the overall direction of catching manifest problems before users start expensive training runs. I pulled the PR locally and tested the implementation in an isolated checkout. I found a few behavior gaps that I think are worth fixing before merge:
I verified this with a manifest containing a nonexistent audio path. The result had total_samples = 1, valid_samples = 1, and an audio file not found error. The CLI output also showed Samples: 1/1 valid while still printing an error below. That makes the report misleading even though the final exit code is failure. I think any row that produces a validation error should not be counted as valid.
I tested this with an 8000 Hz WAV while passing --sample-rate 16000, and the validator still returned no errors, no warnings, and is_valid = True. So at the moment this flag is effectively unused. Either the validator should enforce this check, or the option/help text/PR description should be adjusted so users are not misled.
I reproduced this with one row containing a valid ref_audio and a second row containing a missing ref_audio. The result counted the valid reference audio, but did not warn about the missing one. I think each invalid ref_audio should be handled independently, otherwise mixed-quality manifests can hide real issues.
Overall, the feature itself looks good and the CLI integration seems straightforward. My main concern is that the current reporting can give users a false sense that bad rows are valid, and that --sample-rate appears supported without being enforced. |
- Invalid audio rows (bad path or sample-rate mismatch) no longer increment valid_samples; has_error is now set on any audio failure - _check_audio_file now enforces the expected sample rate when soundfile is available, making --sample-rate actually useful - ref_audio missing-file warning is emitted for every invalid entry independently, not only before the first valid one is seen - New tests cover each of the four corrected behaviours: invalid audio count, sample-rate mismatch, mixed ref_audio, and CLI exit code
|
Thanks for the careful testing — all four issues are now fixed in the latest commit. 1. valid_samples including invalid rows 2. --sample-rate not enforced 3. Missing ref_audio silently skipped 4. Missing test coverage
All 14 tests pass locally. Happy to adjust anything further. |
| _write_manifest(manifest, [{"text": "hi", "audio": "/nonexistent/x.wav"}]) | ||
|
|
||
| proc = subprocess.run( | ||
| [sys.executable, "-m", "voxcpm.cli", "validate", str(manifest)], |
There was a problem hiding this comment.
I think this test is currently exercising argument parsing failure rather than validation failure. The validate subcommand requires --manifest or -m, but this invocation passes the manifest path as a positional argument, so argparse exits before cmd_validate runs. Because the assertion only checks returncode != 0, the test can pass for the wrong reason. Could we change this to pass --manifest explicitly and ideally assert the expected exit code or validation output as well?
Summary
Adds a new
validatesubcommand to the VoxCPM CLI that performs pre-flight validation of JSONL training manifests without requiring GPU or model loading. This catches data format issues and missing files before starting expensive fine-tuning jobs.Motivation
Users starting multi-hour GPU fine-tuning runs often discover data format issues only after the model is loaded and training begins. The
validatecommand lets users check their manifests upfront, saving time and compute.Related: #185 (validation problems during fine-tuning)
What it does
Checks performed:
text,audio)ref_audiocolumn validationExample output:
Design decisions
soundfileis imported only when needed, so the module works in minimal environmentsFiles changed
src/voxcpm/training/validate.pysrc/voxcpm/cli.pyvalidatesubparser andcmd_validatehandlersrc/voxcpm/training/__init__.pyvalidate_manifestandValidationResulttests/test_validate.pyTesting
All 11 new tests pass: