nomadkaraoke
diff --git a/README.md b/README.md
Lines changed: 100 additions & 2 deletions
diff --git a/audio_separator/ensemble_presets.json b/audio_separator/ensemble_presets.json
Lines changed: 105 additions & 0 deletions
diff --git a/audio_separator/separator/common_separator.py b/audio_separator/separator/common_separator.py
Lines changed: 12 additions & 2 deletions
@@ -1,5 +1,5 @@
 <div align="center">
- 
+
 # 🎶 Audio Separator 🎶
 
 [![PyPI version](https://badge.fury.io/py/audio-separator.svg)](https://badge.fury.io/py/audio-separator)
@@ -318,6 +318,101 @@ The chunking feature supports all model types:
 
 Chunks are concatenated without crossfading, which may result in minor artifacts at chunk boundaries in rare cases. For most use cases, these are not noticeable. The simple concatenation approach keeps processing time minimal while solving out-of-memory issues.
 
+### Ensembling Multiple Models
+
+You can combine the results of multiple models to improve separation quality. This will run each model and then combine their outputs using a specified algorithm.
+
+#### CLI Usage
+
+Use `-m` for the primary model and `--extra_models` for additional models. You can also specify the ensemble algorithm using `--ensemble_algorithm`.
+
+```sh
+# Ensemble two models using the default 'avg_wave' algorithm
+audio-separator audio.wav -m model1.ckpt --extra_models model2.onnx
+
+# Ensemble multiple models using a specific algorithm
+audio-separator audio.wav -m model1.ckpt --extra_models model2.onnx model3.ckpt --ensemble_algorithm max_fft
+
+# With custom weights (must match the number of models)
+audio-separator audio.wav -m model1.ckpt --extra_models model2.onnx --ensemble_weights 2.0 1.0
+```
+
+#### Python API Usage
+
+```python
+from audio_separator.separator import Separator
+
+# Initialize the Separator class with custom parameters
+separator = Separator(
+    output_dir='output',
+    ensemble_algorithm='avg_wave'
+)
+
+# List of models to ensemble
+# Note: These models will be downloaded automatically if not present
+models = [
+    'UVR-MDX-NET-Inst_HQ_3.onnx',
+    'UVR_MDXNET_KARA_2.onnx'
+]
+
+# Specify multiple models for ensembling
+separator.load_model(model_filename=models)
+
+# Perform separation
+output_files = separator.separate('audio.wav')
+```
+
+#### Supported Ensemble Algorithms
+- `avg_wave`: Weighted average of waveforms (default)
+- `median_wave`: Median of waveforms
+- `min_wave`: Minimum of waveforms
+- `max_wave`: Maximum of waveforms
+- `avg_fft`: Weighted average of spectrograms
+- `median_fft`: Median of spectrograms
+- `min_fft`: Minimum of spectrograms
+- `max_fft`: Maximum of spectrograms
+- `uvr_max_spec`: UVR-based maximum spectrogram ensemble
+- `uvr_min_spec`: UVR-based minimum spectrogram ensemble
+- `ensemble_wav`: UVR-based least noisy chunk ensemble
+
+#### Ensemble Presets
+
+Instead of specifying models and algorithms manually, you can use curated presets based on community-tested combinations:
+
+```sh
+# List available presets
+audio-separator --list_presets
+
+# Use a preset (models and algorithm are configured automatically)
+audio-separator audio.wav --ensemble_preset vocal_balanced
+
+# Override a preset's algorithm
+audio-separator audio.wav --ensemble_preset vocal_balanced --ensemble_algorithm max_fft
+```
+
+**Python API:**
+```python
+separator = Separator(output_dir='output', ensemble_preset='vocal_balanced')
+separator.load_model()  # Uses preset's models automatically
+output_files = separator.separate('audio.wav')
+```
+
+Available presets:
+
+| Preset | Use Case | Models | Algorithm |
+|--------|----------|--------|-----------|
+| `instrumental_clean` | Cleanest instrumentals, minimal vocal bleed | 2 | `uvr_max_spec` |
+| `instrumental_full` | Maximum instrument preservation | 2 | `uvr_max_spec` |
+| `instrumental_balanced` | Good noise/fullness balance | 2 | `uvr_max_spec` |
+| `instrumental_low_resource` | Fast, low VRAM | 2 | `avg_fft` |
+| `vocal_balanced` | Best overall vocal quality | 2 | `avg_fft` |
+| `vocal_clean` | Minimal instrument bleed | 2 | `min_fft` |
+| `vocal_full` | Maximum vocal capture | 2 | `max_fft` |
+| `vocal_rvc` | Optimized for RVC/AI training | 2 | `avg_wave` |
+| `karaoke` | Lead vocal removal | 3 | `avg_wave` |
+
+Presets are defined in `audio_separator/ensemble_presets.json` — contributions welcome via PR!
+
 ### Full command-line interface options
 
 ```sh
@@ -525,6 +620,9 @@ You can also rename specific stems:
 - **`vr_params`:** (Optional) VR Architecture Specific Attributes & Defaults. `Default: {"batch_size": 1, "window_size": 512, "aggression": 5, "enable_tta": False, "enable_post_process": False, "post_process_threshold": 0.2, "high_end_process": False}`
 - **`demucs_params`:** (Optional) Demucs Architecture Specific Attributes & Defaults. `Default: {"segment_size": "Default", "shifts": 2, "overlap": 0.25, "segments_enabled": True}`
 - **`mdxc_params`:** (Optional) MDXC Architecture Specific Attributes & Defaults. `Default: {"segment_size": 256, "override_model_segment_size": False, "batch_size": 1, "overlap": 8, "pitch_shift": 0}`
+- **`ensemble_algorithm`:** (Optional) Algorithm to use for ensembling multiple models. `Default: 'avg_wave'`
+- **`ensemble_weights`:** (Optional) Weights for each model in the ensemble. `Default: None` (equal weights)
+- **`ensemble_preset`:** (Optional) Named ensemble preset (e.g. `'vocal_balanced'`, `'karaoke'`). Sets models, algorithm, and weights automatically. Use `Separator(info_only=True).list_ensemble_presets()` to see all. `Default: None`
 
 ## Remote API Usage 🌐
 
@@ -653,4 +751,4 @@ For questions or feedback, please raise an issue or reach out to @beveradb ([And
   <img src="https://contrib.rocks/image?repo=nomadkaraoke/python-audio-separator" />
 </a>
 
-</div>
+</div>
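
Review note on the README hunk above: the algorithm list describes `avg_wave` as a weighted average of waveforms and `median_wave` as a median of waveforms. The sketch below illustrates those two descriptions on plain Python lists. It is not the code from this PR. The function names mirror the algorithm names only for readability, the outputs are assumed to be equal-length and already aligned, and normalizing by the sum of the weights is an assumption about what "weighted average" means here.

```python
from statistics import median


def avg_wave(waves, weights=None):
    """Weighted per-sample average of equal-length waveforms.

    Assumption: weights are normalized by their sum, and a missing
    weight list means equal weights (the README's stated default).
    """
    if weights is None:
        weights = [1.0] * len(waves)
    if len(weights) != len(waves):
        raise ValueError("need exactly one weight per model output")
    total = sum(weights)
    return [
        sum(w * wave[i] for w, wave in zip(weights, waves)) / total
        for i in range(len(waves[0]))
    ]


def median_wave(waves):
    """Per-sample median across equal-length waveforms."""
    return [median(samples) for samples in zip(*waves)]


# Two toy "model outputs" of four samples each (illustrative values).
model_a = [0.0, 0.5, 1.0, -1.0]
model_b = [0.0, 0.25, 0.5, -0.25]

equal = avg_wave([model_a, model_b])
weighted = avg_wave([model_a, model_b], weights=[3.0, 1.0])
```

With weights `[3.0, 1.0]` the result sits closer to `model_a`, which is the effect `--ensemble_weights` is documented to control.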
@@ -0,0 +1,105 @@
+{
+    "version": 1,
+    "presets": {
+        "instrumental_clean": {
+            "name": "Instrumental Clean",
+            "description": "Cleanest instrumentals with minimal vocal bleed — Fv7z (bleedless 44.61) + Resurrection Inst (SDR 17.25)",
+            "models": [
+                "mel_band_roformer_instrumental_fv7z_gabox.ckpt",
+                "bs_roformer_instrumental_resurrection_unwa.ckpt"
+            ],
+            "algorithm": "uvr_max_spec",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        },
+        "instrumental_full": {
+            "name": "Instrumental Full",
+            "description": "Maximum instrument preservation — v1e+ (fullness 37.89) + becruily inst (SOTA SDR 17.55)",
+            "models": [
+                "melband_roformer_inst_v1e_plus.ckpt",
+                "mel_band_roformer_instrumental_becruily.ckpt"
+            ],
+            "algorithm": "uvr_max_spec",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        },
+        "instrumental_balanced": {
+            "name": "Instrumental Balanced",
+            "description": "Good balance of noise and fullness — Gabox INSTV8 + Resurrection Inst",
+            "models": [
+                "mel_band_roformer_instrumental_instv8_gabox.ckpt",
+                "bs_roformer_instrumental_resurrection_unwa.ckpt"
+            ],
+            "algorithm": "uvr_max_spec",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        },
+        "instrumental_low_resource": {
+            "name": "Instrumental Low Resource",
+            "description": "Fast ensemble for low VRAM — Resurrection Inst (200MB) + MDX HQ_5 (ONNX, very fast)",
+            "models": [
+                "bs_roformer_instrumental_resurrection_unwa.ckpt",
+                "UVR-MDX-NET-Inst_HQ_5.onnx"
+            ],
+            "algorithm": "avg_fft",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        },
+        "vocal_balanced": {
+            "name": "Vocal Balanced",
+            "description": "Best overall vocal quality — Resurrection (SDR 11.34) + Beta 6X (SDR 11.12) averaged",
+            "models": [
+                "bs_roformer_vocals_resurrection_unwa.ckpt",
+                "melband_roformer_big_beta6x.ckpt"
+            ],
+            "algorithm": "avg_fft",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        },
+        "vocal_clean": {
+            "name": "Vocal Clean",
+            "description": "Minimal instrument bleed in vocals — Revive 2 (bleedless 40.07) + FT2 bleedless (39.30) with min FFT",
+            "models": [
+                "bs_roformer_vocals_revive_v2_unwa.ckpt",
+                "mel_band_roformer_kim_ft2_bleedless_unwa.ckpt"
+            ],
+            "algorithm": "min_fft",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        },
+        "vocal_full": {
+            "name": "Vocal Full",
+            "description": "Maximum vocal capture including harmonies — Revive 3e (fullness 21.43) + becruily vocal with max FFT",
+            "models": [
+                "bs_roformer_vocals_revive_v3e_unwa.ckpt",
+                "mel_band_roformer_vocals_becruily.ckpt"
+            ],
+            "algorithm": "max_fft",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        },
+        "vocal_rvc": {
+            "name": "Vocal RVC",
+            "description": "Optimized for RVC/AI voice training data — Beta 6X + Gabox voc_fv4 averaged",
+            "models": [
+                "melband_roformer_big_beta6x.ckpt",
+                "mel_band_roformer_vocals_fv4_gabox.ckpt"
+            ],
+            "algorithm": "avg_wave",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        },
+        "karaoke": {
+            "name": "Karaoke",
+            "description": "Lead vocal removal — 3-model karaoke ensemble reaches SDR ~10.6 vs ~10.2 single model",
+            "models": [
+                "mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt",
+                "mel_band_roformer_karaoke_gabox_v2.ckpt",
+                "mel_band_roformer_karaoke_becruily.ckpt"
+            ],
+            "algorithm": "avg_wave",
+            "weights": null,
+            "contributor": "deton24 community guide"
+        }
+    }
+}
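
Review note on the preset file above: since the README invites preset contributions via PR, a small consistency check may be useful to contributors. The sketch below parses a trimmed copy of one entry and checks it against what this PR documents: the algorithm must be one of the names listed in the README, and any `weights` list must have one entry per model. The helper `check_preset` is hypothetical and not part of the package, and the "at least two models" rule is an assumption about what counts as an ensemble.

```python
import json

# Trimmed copy of one entry from the preset file in this PR.
PRESETS_JSON = """
{
    "version": 1,
    "presets": {
        "vocal_balanced": {
            "name": "Vocal Balanced",
            "models": [
                "bs_roformer_vocals_resurrection_unwa.ckpt",
                "melband_roformer_big_beta6x.ckpt"
            ],
            "algorithm": "avg_fft",
            "weights": null
        }
    }
}
"""

# Algorithm names as listed in the README section of this PR.
KNOWN_ALGORITHMS = {
    "avg_wave", "median_wave", "min_wave", "max_wave",
    "avg_fft", "median_fft", "min_fft", "max_fft",
    "uvr_max_spec", "uvr_min_spec", "ensemble_wav",
}


def check_preset(preset_id, preset):
    """Return a list of problems found in one preset entry (empty if none)."""
    problems = []
    models = preset.get("models") or []
    # Assumption: a preset with fewer than two models is not an ensemble.
    if len(models) < 2:
        problems.append(f"{preset_id}: needs at least two models")
    if preset.get("algorithm") not in KNOWN_ALGORITHMS:
        problems.append(f"{preset_id}: unknown algorithm {preset.get('algorithm')!r}")
    weights = preset.get("weights")
    # JSON null (Python None) means equal weights, per the README.
    if weights is not None and len(weights) != len(models):
        problems.append(f"{preset_id}: weights must match the number of models")
    return problems


data = json.loads(PRESETS_JSON)
problems = [
    problem
    for preset_id, preset in data["presets"].items()
    for problem in check_preset(preset_id, preset)
]
```

An empty `problems` list means every entry passed the checks. Pointing `json.loads` at the real file contents instead of the inline string would apply the same checks to all presets.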
@@ -103,8 +103,18 @@ def __init__(self, config):
         if "training" in self.model_data and "instruments" in self.model_data["training"]:
             instruments = self.model_data["training"]["instruments"]
             if instruments:
-                self.primary_stem_name = instruments[0]
-                self.secondary_stem_name = instruments[1] if len(instruments) > 1 else self.secondary_stem(self.primary_stem_name)
+                target_instrument = self.model_data["training"].get("target_instrument")
+
+                # When target_instrument is set and doesn't match instruments[0],
+                # the model's prediction would be labeled with the wrong stem name.
+                # Swap so primary_stem_name always matches the model's actual target output.
+                if target_instrument and len(instruments) >= 2 and instruments[0] != target_instrument and instruments[1] == target_instrument:
+                    self.logger.debug(f"Swapping stem names: target_instrument '{target_instrument}' doesn't match instruments[0] '{instruments[0]}'")
+                    self.primary_stem_name = instruments[1]
+                    self.secondary_stem_name = instruments[0]
+                else:
+                    self.primary_stem_name = instruments[0]
+                    self.secondary_stem_name = instruments[1] if len(instruments) > 1 else self.secondary_stem(self.primary_stem_name)
 
         if self.primary_stem_name is None:
             self.primary_stem_name = self.model_data.get("primary_stem", "Vocals")
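
Review note on the `common_separator.py` hunk above: the new branch can be read as a pure function, which makes its edge cases easier to see. The sketch below restates the condition from the diff outside the class. `resolve_stem_names` is a hypothetical name, and the `"No <stem>"` fallback is only a stand-in for `self.secondary_stem(...)`, whose real behavior is not shown in this diff.

```python
def resolve_stem_names(instruments, target_instrument=None):
    """Return (primary, secondary) stem names, mirroring the patched branch.

    Assumes `instruments` is non-empty, as guaranteed by the
    `if instruments:` guard in the original code.
    """
    if (
        target_instrument
        and len(instruments) >= 2
        and instruments[0] != target_instrument
        and instruments[1] == target_instrument
    ):
        # The model's target is listed second: swap so the primary stem
        # carries the name of what the model actually predicts.
        return instruments[1], instruments[0]

    primary = instruments[0]
    if len(instruments) > 1:
        secondary = instruments[1]
    else:
        # Hypothetical stand-in for self.secondary_stem(primary).
        secondary = f"No {primary}"
    return primary, secondary


swapped = resolve_stem_names(["Vocals", "Instrumental"], "Instrumental")
unchanged = resolve_stem_names(["Vocals", "Instrumental"], "Vocals")
no_target = resolve_stem_names(["Vocals", "Instrumental"])
third_slot = resolve_stem_names(["Drums", "Bass", "Other"], "Other")
```

Note that `third_slot` is not swapped: the condition only fires when the target sits at index 1, so a target at index 2 or later keeps the original order.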