Commit adc5539

beveradb, makhlwf, and claude authored
feat: multi-model ensemble separation with 9 community-curated presets (#265)
* add ensembler

* refactor(ensembler): fix state mutation, handle mono input, and add fallback writer

* try fix test

* review comments

* review comments

* fix test

* fix: resolve ensemble PR review issues — CLI compat, state bugs, test coverage

- Revert -m to single value, add --extra_models for ensemble (fixes CLI breaking change)
- Initialize model_filename/model_filenames in __init__ (prevents AttributeError)
- Fix list reference copy in load_model (use list() instead of shared reference)
- Move original_output_dir capture outside per-model loop (state mutation fix)
- Extract stem name map to module-level STEM_NAME_MAP constant
- Preserve mono channel count through ensemble (avoid fake stereo)
- Add trailing newlines to all files
- Add 8 new unit tests: median/min/max_fft, uvr_max/min_spec, invalid algo, weight mismatch
- Add 3 CLI tests: --extra_models, single model string compat, old syntax backward compat
- Update README ensemble examples for new --extra_models flag

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add ensemble preset system with 9 community-curated presets

Add a JSON-based ensemble preset system that lets users select known-good model combinations by name instead of specifying every detail manually. Presets are sourced from deton24's community-maintained audio separation guide and cover instrumental (4), vocal (4), and karaoke (1) use cases.

New features:
- ensemble_presets.json with 9 presets (instrumental_clean/full/balanced/low_resource, vocal_balanced/clean/full/rvc, karaoke)
- --ensemble_preset CLI flag and Separator(ensemble_preset=...) Python API
- --list_presets CLI flag to show available presets
- Preset algorithm/weights can be overridden by explicit user args
- ensemble_algorithm parameter now accepts None (defaults to avg_wave)
- 10 new unit tests for preset loading, validation, override, JSON validity
- 2 new CLI tests for --ensemble_preset and --list_presets
- README updated with preset documentation and usage examples

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: correct stem labeling for ensemble — swap mismatched target_instrument, map "other" to "Instrumental"

Three fixes for stem name handling in ensemble mode:

1. common_separator.py: When a model's target_instrument doesn't match instruments[0], swap primary/secondary stem names so the model's prediction gets the correct label. Fixes bs_roformer_instrumental_resurrection_unwa whose "vocals" output was actually instrumental.
2. separator.py: In _separate_ensemble, when a model produces exactly 2 stems and one is vocal-like, map "other" to "Instrumental" instead of keeping it as a separate group. This ensures all 2-stem models contribute to the same Vocals/Instrumental ensemble regardless of whether they label their non-vocal stem "Instrumental" or "other".
3. separator.py: Use preset name in ensemble output filenames (preset_<name>) and descriptive slugs for manual ensembles (custom_ensemble_<slug1>_<slug2>).

Also adds tests/utils_audio_verification.py — a content verification utility that correlates output stems against known references to detect label mismatches programmatically.

Verified: all 9 presets now produce exactly 2 correctly-labeled stems (18/18 OK, 0 mismatches).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add e2e integration tests for all 9 ensemble presets with reference spectrograms

- 36 reference spectrogram/waveform PNGs for 9 presets × 2 stems each
- test_ensemble_integration.py: parametrized test that for each preset:
  1. Runs the preset separation on mardy20s.flac
  2. Verifies stems contain correct content (correlation-based)
  3. Compares spectrograms against committed references (SSIM)
- generate_reference_images_ensemble.py: script to regenerate references
- utils_audio_verification.py: content verification utility (already committed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add P0/P1 tests for stem swap, preset validation, and filename logic

- 5 tests for CommonSeparator stem name swap (target_instrument mismatch, no swap when matching, edge cases)
- 2 tests for STEM_NAME_MAP completeness and lowercase invariant
- 2 tests for ensemble output filename format (preset and custom slugs)
- 5 tests for preset validation edge cases (bad weights length, bad algorithm, single model, weights applied, weights override)

Total: 233 unit tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add ensemble_preset to Python API parameter reference in README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* revert: restore original arg order in integration tests

The nargs="+" change on -m was reverted in favor of --extra_models, so the old CLI arg order (audio-separator -m model audio.wav) works again. No need to change these tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add on-demand regression test to verify stem labels for all 163 models

Runs every supported model on mardy20s.flac and verifies each output stem's label matches its actual content using correlation against known vocal and instrumental references.

Usage:
  pytest tests/regression/test_all_models_stem_verification.py -v -s
  pytest ... -k "VR" (single architecture)
  pytest ... -k "resurrection" (single model)
  STEM_VERIFY_REPORT_ONLY=1 pytest ... (report without failing)

Handles:
- Vocal/Instrumental stems: verified via Pearson correlation (>0.7 threshold)
- Sub-stems (drums, bass, guitar, piano): verified not-full-mix; near-silence OK
- Full mix detection: any stem with >0.95 correlation to original mix fails
- Demucs 6-stem models: sub-stems like Piano can be legitimately silent

Not run in CI — requires downloading all models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: handle utility models and sub-stems in stem verification test

- Utility models (de-echo, de-noise, de-reverb, BVE) get relaxed verification — their stems don't follow standard vocal/instrumental patterns on clean source audio
- Sub-stems (drums, bass, guitar, "No X" variants) skip the full-mix check since "No X" is legitimately ≈ the mix when X isn't present
- Partial vocal stems (backing/lead vocals) skip full-vocal correlation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add missing stem types to verification test (drumsep, gender, aspiration, etc.)

Full 163-model run revealed stem types not yet in SUB_STEMS or UTILITY_STEMS:
- Drumsep: kick, snare, toms, hh, ride, crash
- Gender split: male, female
- Specialized: aspiration, bleed, no bleed
- Utility: noreverb

160 passed, 0 real failures, 3 skipped (download failures).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add 4 multi-instrument test clips for stem verification

New test input audio clips with diverse instrumentation for testing instrument-specific separation models:
- levee_drums.flac (20s, 24-bit) — Led Zeppelin, drums+guitar+vocals
- clocks_piano.flac (20s, 16-bit) — Coldplay, piano+instruments+vocals
- sing_sing_sing_brass.flac (25s, 16-bit) — Benny Goodman, drums+brass+wind
- only_time_reverb.flac (25s, 16-bit) — Enya, reverb-heavy vocal+synths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add multi-stem integration test framework with 30 reference stems

New integration test suite verifying instrument-specific separation models across 4 test clips with diverse instrumentation:

Test matrix:
- Vocal/Instrumental: resurrection model on all 4 clips
- 4-stem (drums/bass/other/vocals): htdemucs_ft on levee + clocks
- DrumSep pipeline: mix → htdemucs_ft drums → drumsep kit parts
- Karaoke: aufr33/viperx model on levee + clocks
- Wind/Brass: 17_HP-Wind on sing_sing_sing
- De-reverb pipeline: mix → resurrection vocals → dereverb

30 reference stems generated by best-in-class models, committed as tests/inputs/reference/ref_*.flac. Tests verify new model outputs correlate > 0.70 with references.

Includes generate_multi_stem_references.py for regenerating references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refine: karaoke test verifies output differs from standard vocal split

Karaoke models remove lead vocals while preserving backing vocals. The test now additionally checks that karaoke vocal output differs from standard vocal output (correlating < 0.95), confirming the model is doing karaoke-specific extraction, not just a generic split.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add Under Pressure clip for karaoke backing vocal verification

Queen & David Bowie — Under Pressure 1:35-1:55 (20s, 16-bit). Section has clear lead vocal over dense backing harmonies, making karaoke vs standard vocal separation measurably different (0.740 correlation vs 0.961 for Clocks which lacks strong backing vocals).

Karaoke test now runs on 3 clips: levee, clocks, under_pressure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version to 0.42.0 for ensemble feature release

New minor version for:
- Multi-model ensemble separation
- 9 community-curated ensemble presets
- Stem label fixes (target_instrument swap, contextual "other" mapping)
- New CLI flags: --extra_models, --ensemble_preset, --list_presets
- Multi-stem integration test framework

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add meaningful ensemble integration tests

Three tests verifying ensemble presets produce semantically correct output:

1. test_vocal_ensemble_matches_best_single_model: vocal_balanced ensemble output should correlate >0.90 with the best single model (Resurrection), confirming ensemble doesn't degrade quality.
2. test_karaoke_ensemble_extracts_lead_only: On Under Pressure (prominent backing harmonies), karaoke ensemble vocals should differ from standard vocal extraction (<0.90 correlation), confirming it extracts only lead.
3. test_karaoke_on_vocals_produces_lead_backing_split: Pipeline test — mix → vocal model → karaoke model should produce distinct lead and backing vocal stems (both non-silent, correlation <0.50).

Includes 9 new reference stems for these tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: fix find_stem pipeline matching and nan correlation in integration tests

find_stem() matched the first _(StemName) group in filenames, which broke pipeline tests where the input filename already contained a parenthesized stem from a prior step. Now uses the last match.

Also handle near-silent stems (e.g. vocals from instrumental-only audio) returning nan correlation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: makhlwf <altrhwnyashrf1@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
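
The correlation-based label check described in the stem verification commits above can be sketched roughly as follows. This is a minimal, pure-Python illustration under stated assumptions, not the code in `tests/utils_audio_verification.py` (which is not shown on this page): the names `pearson` and `check_stem` are hypothetical, signals are plain lists of samples, and only the 0.7 match threshold, the 0.95 full-mix threshold, and the nan result for near-silent stems come from the commit message.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sample lists (nan if either is constant)."""
    if len(a) != len(b) or not a:
        raise ValueError("signals must be non-empty and equal in length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    # A silent (constant) signal has zero variance, so the correlation is undefined.
    return cov / denom if denom > 0 else math.nan


def check_stem(stem, reference, mix, match_threshold=0.7, mix_threshold=0.95):
    """Classify a stem as 'full_mix', 'silent', 'ok', or 'mismatch' (hypothetical labels)."""
    r_mix = pearson(stem, mix)
    if not math.isnan(r_mix) and r_mix > mix_threshold:
        return "full_mix"  # the stem is essentially the unseparated input
    r_ref = pearson(stem, reference)
    if math.isnan(r_ref):
        return "silent"  # near-silent stems cannot be correlated
    return "ok" if r_ref > match_threshold else "mismatch"


# Synthetic usage: two sine tones stand in for vocals and accompaniment.
n = 1000
vocals = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
backing = [math.sin(2 * math.pi * 3 * k / n) for k in range(n)]
mix = [v + b for v, b in zip(vocals, backing)]
print(check_stem(vocals, vocals, mix), check_stem(backing, vocals, mix))
```

Treating nan as its own outcome, rather than as a failure, matches the note that some stems can be legitimately silent.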
1 parent 4efec66 commit adc5539

103 files changed

Lines changed: 30440 additions & 14 deletions

README.md

Lines changed: 100 additions & 2 deletions
@@ -1,5 +1,5 @@
 <div align="center">
-
+
 # 🎶 Audio Separator 🎶

 [![PyPI version](https://badge.fury.io/py/audio-separator.svg)](https://badge.fury.io/py/audio-separator)
@@ -318,6 +318,101 @@ The chunking feature supports all model types:

 Chunks are concatenated without crossfading, which may result in minor artifacts at chunk boundaries in rare cases. For most use cases, these are not noticeable. The simple concatenation approach keeps processing time minimal while solving out-of-memory issues.

+### Ensembling Multiple Models
+
+You can combine the results of multiple models to improve separation quality. This will run each model and then combine their outputs using a specified algorithm.
+
+#### CLI Usage
+
+Use `-m` for the primary model and `--extra_models` for additional models. You can also specify the ensemble algorithm using `--ensemble_algorithm`.
+
+```sh
+# Ensemble two models using the default 'avg_wave' algorithm
+audio-separator audio.wav -m model1.ckpt --extra_models model2.onnx
+
+# Ensemble multiple models using a specific algorithm
+audio-separator audio.wav -m model1.ckpt --extra_models model2.onnx model3.ckpt --ensemble_algorithm max_fft
+
+# With custom weights (must match the number of models)
+audio-separator audio.wav -m model1.ckpt --extra_models model2.onnx --ensemble_weights 2.0 1.0
+```
+
+#### Python API Usage
+
+```python
+from audio_separator.separator import Separator
+
+# Initialize the Separator class with custom parameters
+separator = Separator(
+    output_dir='output',
+    ensemble_algorithm='avg_wave'
+)
+
+# List of models to ensemble
+# Note: These models will be downloaded automatically if not present
+models = [
+    'UVR-MDX-NET-Inst_HQ_3.onnx',
+    'UVR_MDXNET_KARA_2.onnx'
+]
+
+# Specify multiple models for ensembling
+separator.load_model(model_filename=models)
+
+# Perform separation
+output_files = separator.separate('audio.wav')
+```
+
+#### Supported Ensemble Algorithms
+- `avg_wave`: Weighted average of waveforms (default)
+- `median_wave`: Median of waveforms
+- `min_wave`: Minimum of waveforms
+- `max_wave`: Maximum of waveforms
+- `avg_fft`: Weighted average of spectrograms
+- `median_fft`: Median of spectrograms
+- `min_fft`: Minimum of spectrograms
+- `max_fft`: Maximum of spectrograms
+- `uvr_max_spec`: UVR-based maximum spectrogram ensemble
+- `uvr_min_spec`: UVR-based minimum spectrogram ensemble
+- `ensemble_wav`: UVR-based least noisy chunk ensemble
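
To make the first two waveform-domain entries in the list above concrete, here is a small sketch of what a weighted average and a per-sample median over several model outputs could look like. It is an illustration only: the function names mirror the algorithm names but are hypothetical stand-ins, waveforms are plain equal-length lists of samples, and normalizing by the sum of the weights is an assumption rather than something the README states.

```python
from statistics import median


def avg_wave(waveforms, weights=None):
    """Weighted per-sample average of several equal-length waveforms."""
    if weights is None:
        weights = [1.0] * len(waveforms)  # equal weights by default
    if len(weights) != len(waveforms):
        raise ValueError("weights must match the number of models")
    total = sum(weights)
    # zip(*waveforms) yields, for each sample index, one value per model.
    return [
        sum(w * s for w, s in zip(weights, samples)) / total
        for samples in zip(*waveforms)
    ]


def median_wave(waveforms):
    """Per-sample median across model outputs."""
    return [median(samples) for samples in zip(*waveforms)]


# Usage: two short "model outputs", the first weighted twice as heavily.
model_a = [1.0, 0.0, -1.0]
model_b = [0.0, 1.0, -1.0]
print(avg_wave([model_a, model_b], weights=[2.0, 1.0]))
print(median_wave([model_a, model_b, [0.5, 0.5, 0.5]]))
```

The length check mirrors the README note that custom weights must match the number of models.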
+
+#### Ensemble Presets
+
+Instead of specifying models and algorithms manually, you can use curated presets based on community-tested combinations:
+
+```sh
+# List available presets
+audio-separator --list_presets
+
+# Use a preset (models and algorithm are configured automatically)
+audio-separator audio.wav --ensemble_preset vocal_balanced
+
+# Override a preset's algorithm
+audio-separator audio.wav --ensemble_preset vocal_balanced --ensemble_algorithm max_fft
+```
+
+**Python API:**
+```python
+separator = Separator(output_dir='output', ensemble_preset='vocal_balanced')
+separator.load_model()  # Uses preset's models automatically
+output_files = separator.separate('audio.wav')
+```
+
+Available presets:
+
+| Preset | Use Case | Models | Algorithm |
+|--------|----------|--------|-----------|
+| `instrumental_clean` | Cleanest instrumentals, minimal vocal bleed | 2 | `uvr_max_spec` |
+| `instrumental_full` | Maximum instrument preservation | 2 | `uvr_max_spec` |
+| `instrumental_balanced` | Good noise/fullness balance | 2 | `uvr_max_spec` |
+| `instrumental_low_resource` | Fast, low VRAM | 2 | `avg_fft` |
+| `vocal_balanced` | Best overall vocal quality | 2 | `avg_fft` |
+| `vocal_clean` | Minimal instrument bleed | 2 | `min_fft` |
+| `vocal_full` | Maximum vocal capture | 2 | `max_fft` |
+| `vocal_rvc` | Optimized for RVC/AI training | 2 | `avg_wave` |
+| `karaoke` | Lead vocal removal | 3 | `avg_wave` |
+
+Presets are defined in `audio_separator/ensemble_presets.json` — contributions welcome via PR!
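
The override behaviour in the CLI examples above (an explicit `--ensemble_algorithm` taking precedence over the preset's own algorithm) can be pictured with a short sketch. `resolve_preset` is a hypothetical helper, not part of the library; it assumes a preset dict shaped like the entries in `ensemble_presets.json` and falls back to `avg_wave` when neither the caller nor the preset names an algorithm.

```python
def resolve_preset(preset, algorithm=None, weights=None):
    """Merge explicit user arguments over a preset's defaults (illustrative only)."""
    models = list(preset["models"])  # copy so the preset itself is never mutated
    # An explicit algorithm wins; otherwise use the preset's, then the documented default.
    resolved_algorithm = algorithm or preset.get("algorithm") or "avg_wave"
    # Explicit weights win; None means "keep whatever the preset specifies".
    resolved_weights = weights if weights is not None else preset.get("weights")
    return models, resolved_algorithm, resolved_weights


# Usage with the vocal_balanced entry from this commit's preset file.
vocal_balanced = {
    "models": [
        "bs_roformer_vocals_resurrection_unwa.ckpt",
        "melband_roformer_big_beta6x.ckpt",
    ],
    "algorithm": "avg_fft",
    "weights": None,
}
models, algorithm, weights = resolve_preset(vocal_balanced, algorithm="max_fft")
print(algorithm, weights)
```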
+
 ### Full command-line interface options

 ```sh
@@ -525,6 +620,9 @@ You can also rename specific stems:
 - **`vr_params`:** (Optional) VR Architecture Specific Attributes & Defaults. `Default: {"batch_size": 1, "window_size": 512, "aggression": 5, "enable_tta": False, "enable_post_process": False, "post_process_threshold": 0.2, "high_end_process": False}`
 - **`demucs_params`:** (Optional) Demucs Architecture Specific Attributes & Defaults. `Default: {"segment_size": "Default", "shifts": 2, "overlap": 0.25, "segments_enabled": True}`
 - **`mdxc_params`:** (Optional) MDXC Architecture Specific Attributes & Defaults. `Default: {"segment_size": 256, "override_model_segment_size": False, "batch_size": 1, "overlap": 8, "pitch_shift": 0}`
+- **`ensemble_algorithm`:** (Optional) Algorithm to use for ensembling multiple models. `Default: 'avg_wave'`
+- **`ensemble_weights`:** (Optional) Weights for each model in the ensemble. `Default: None` (equal weights)
+- **`ensemble_preset`:** (Optional) Named ensemble preset (e.g. `'vocal_balanced'`, `'karaoke'`). Sets models, algorithm, and weights automatically. Use `Separator(info_only=True).list_ensemble_presets()` to see all. `Default: None`

 ## Remote API Usage 🌐

@@ -653,4 +751,4 @@ For questions or feedback, please raise an issue or reach out to @beveradb ([And
 <img src="https://contrib.rocks/image?repo=nomadkaraoke/python-audio-separator" />
 </a>

-</div>
+</div>
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
+{
+  "version": 1,
+  "presets": {
+    "instrumental_clean": {
+      "name": "Instrumental Clean",
+      "description": "Cleanest instrumentals with minimal vocal bleed — Fv7z (bleedless 44.61) + Resurrection Inst (SDR 17.25)",
+      "models": [
+        "mel_band_roformer_instrumental_fv7z_gabox.ckpt",
+        "bs_roformer_instrumental_resurrection_unwa.ckpt"
+      ],
+      "algorithm": "uvr_max_spec",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    },
+    "instrumental_full": {
+      "name": "Instrumental Full",
+      "description": "Maximum instrument preservation — v1e+ (fullness 37.89) + becruily inst (SOTA SDR 17.55)",
+      "models": [
+        "melband_roformer_inst_v1e_plus.ckpt",
+        "mel_band_roformer_instrumental_becruily.ckpt"
+      ],
+      "algorithm": "uvr_max_spec",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    },
+    "instrumental_balanced": {
+      "name": "Instrumental Balanced",
+      "description": "Good balance of noise and fullness — Gabox INSTV8 + Resurrection Inst",
+      "models": [
+        "mel_band_roformer_instrumental_instv8_gabox.ckpt",
+        "bs_roformer_instrumental_resurrection_unwa.ckpt"
+      ],
+      "algorithm": "uvr_max_spec",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    },
+    "instrumental_low_resource": {
+      "name": "Instrumental Low Resource",
+      "description": "Fast ensemble for low VRAM — Resurrection Inst (200MB) + MDX HQ_5 (ONNX, very fast)",
+      "models": [
+        "bs_roformer_instrumental_resurrection_unwa.ckpt",
+        "UVR-MDX-NET-Inst_HQ_5.onnx"
+      ],
+      "algorithm": "avg_fft",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    },
+    "vocal_balanced": {
+      "name": "Vocal Balanced",
+      "description": "Best overall vocal quality — Resurrection (SDR 11.34) + Beta 6X (SDR 11.12) averaged",
+      "models": [
+        "bs_roformer_vocals_resurrection_unwa.ckpt",
+        "melband_roformer_big_beta6x.ckpt"
+      ],
+      "algorithm": "avg_fft",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    },
+    "vocal_clean": {
+      "name": "Vocal Clean",
+      "description": "Minimal instrument bleed in vocals — Revive 2 (bleedless 40.07) + FT2 bleedless (39.30) with min FFT",
+      "models": [
+        "bs_roformer_vocals_revive_v2_unwa.ckpt",
+        "mel_band_roformer_kim_ft2_bleedless_unwa.ckpt"
+      ],
+      "algorithm": "min_fft",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    },
+    "vocal_full": {
+      "name": "Vocal Full",
+      "description": "Maximum vocal capture including harmonies — Revive 3e (fullness 21.43) + becruily vocal with max FFT",
+      "models": [
+        "bs_roformer_vocals_revive_v3e_unwa.ckpt",
+        "mel_band_roformer_vocals_becruily.ckpt"
+      ],
+      "algorithm": "max_fft",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    },
+    "vocal_rvc": {
+      "name": "Vocal RVC",
+      "description": "Optimized for RVC/AI voice training data — Beta 6X + Gabox voc_fv4 averaged",
+      "models": [
+        "melband_roformer_big_beta6x.ckpt",
+        "mel_band_roformer_vocals_fv4_gabox.ckpt"
+      ],
+      "algorithm": "avg_wave",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    },
+    "karaoke": {
+      "name": "Karaoke",
+      "description": "Lead vocal removal — 3-model karaoke ensemble reaches SDR ~10.6 vs ~10.2 single model",
+      "models": [
+        "mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt",
+        "mel_band_roformer_karaoke_gabox_v2.ckpt",
+        "mel_band_roformer_karaoke_becruily.ckpt"
+      ],
+      "algorithm": "avg_wave",
+      "weights": null,
+      "contributor": "deton24 community guide"
+    }
+  }
+}
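
A preset file with this shape can be checked with a few lines of standard-library Python. The sketch below is a hedged illustration, not the project's loader: `validate_presets` and `SUPPORTED_ALGORITHMS` are hypothetical names, the algorithm set is copied from the README list in this commit, and the specific checks (non-empty model list, known algorithm, weights either null or one per model) are assumptions about what counts as valid.

```python
import json

# Algorithm names as listed in the README section of this commit.
SUPPORTED_ALGORITHMS = {
    "avg_wave", "median_wave", "min_wave", "max_wave",
    "avg_fft", "median_fft", "min_fft", "max_fft",
    "uvr_max_spec", "uvr_min_spec", "ensemble_wav",
}


def validate_presets(raw_json):
    """Parse preset JSON text and raise ValueError on the first malformed preset."""
    presets = json.loads(raw_json)["presets"]
    for preset_id, preset in presets.items():
        models = preset.get("models")
        if not models or not all(isinstance(m, str) for m in models):
            raise ValueError(f"{preset_id}: 'models' must be a non-empty list of filenames")
        if preset.get("algorithm") not in SUPPORTED_ALGORITHMS:
            raise ValueError(f"{preset_id}: unknown algorithm {preset.get('algorithm')!r}")
        weights = preset.get("weights")
        if weights is not None and len(weights) != len(models):
            raise ValueError(f"{preset_id}: expected one weight per model")
    return presets


# A one-preset excerpt in the same shape as the file above.
EXAMPLE = """
{
  "version": 1,
  "presets": {
    "vocal_clean": {
      "name": "Vocal Clean",
      "models": [
        "bs_roformer_vocals_revive_v2_unwa.ckpt",
        "mel_band_roformer_kim_ft2_bleedless_unwa.ckpt"
      ],
      "algorithm": "min_fft",
      "weights": null
    }
  }
}
"""

presets = validate_presets(EXAMPLE)
print(sorted(presets))
```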

audio_separator/separator/common_separator.py

Lines changed: 12 additions & 2 deletions
@@ -103,8 +103,18 @@ def __init__(self, config):
         if "training" in self.model_data and "instruments" in self.model_data["training"]:
             instruments = self.model_data["training"]["instruments"]
             if instruments:
-                self.primary_stem_name = instruments[0]
-                self.secondary_stem_name = instruments[1] if len(instruments) > 1 else self.secondary_stem(self.primary_stem_name)
+                target_instrument = self.model_data["training"].get("target_instrument")
+
+                # When target_instrument is set and doesn't match instruments[0],
+                # the model's prediction would be labeled with the wrong stem name.
+                # Swap so primary_stem_name always matches the model's actual target output.
+                if target_instrument and len(instruments) >= 2 and instruments[0] != target_instrument and instruments[1] == target_instrument:
+                    self.logger.debug(f"Swapping stem names: target_instrument '{target_instrument}' doesn't match instruments[0] '{instruments[0]}'")
+                    self.primary_stem_name = instruments[1]
+                    self.secondary_stem_name = instruments[0]
+                else:
+                    self.primary_stem_name = instruments[0]
+                    self.secondary_stem_name = instruments[1] if len(instruments) > 1 else self.secondary_stem(self.primary_stem_name)

         if self.primary_stem_name is None:
             self.primary_stem_name = self.model_data.get("primary_stem", "Vocals")
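
The swap rule in this hunk can be exercised in isolation. The following sketch restates it as a pure function, assuming a non-empty `instruments` list; `resolve_stem_names` is a hypothetical name, and where the real code falls back to `self.secondary_stem(...)` for single-instrument models, this version simply returns `None` for the secondary stem.

```python
def resolve_stem_names(instruments, target_instrument=None):
    """Return (primary, secondary) stem names, swapping when the target is listed second."""
    swap = (
        target_instrument
        and len(instruments) >= 2
        and instruments[0] != target_instrument
        and instruments[1] == target_instrument
    )
    if swap:
        # The model's prediction is the second listed instrument, so label it as primary.
        return instruments[1], instruments[0]
    primary = instruments[0]
    # The real code derives a fallback name here; None is a placeholder in this sketch.
    secondary = instruments[1] if len(instruments) > 1 else None
    return primary, secondary


# Usage: a config that lists "vocals" first but actually targets "other".
print(resolve_stem_names(["vocals", "other"], target_instrument="other"))
print(resolve_stem_names(["vocals", "other"], target_instrument="vocals"))
```

Note that the swap only fires when the target is exactly the second entry, so models with three or more instruments and a target further down the list keep their original order.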
