Remove librosa dependency from audio loading

Acelogic · Acelogic · commit d29dbccb9092 · 2026-01-05T09:30:08.000-05:00
Replaced the librosa calls in load_audio_infer with numpy (channel averaging for mono conversion) and scipy (signal.resample for resampling), removing librosa from the audio loading path and improving performance. Also added mx.eval after MLX model weight loading in infer_mlx.py and rmvpe.py so that the lazily loaded weights are evaluated at load time rather than on first inference. Updated context.md with new benchmarks and a detailed TODO list for future optimizations.
diff --git a/context.md b/context.md
@@ -37,5 +37,36 @@ Replaced librosa CPU-based mel spectrogram (645ms first call) with GPU-accelerat
 ### Backend Selection
 | Backend | Description | Performance |
 |---------|-------------|-------------|
-| `torch` | PyTorch with MPS | 3.14s |
-| `mlx` | Full MLX inference | **3.12s** (-0.5%) |
+| `torch` | PyTorch with MPS | 2.81s |
+| `mlx` | Full MLX inference | **2.91s** |
+
+## 🚀 TODO / Future Optimizations
+
+### 1. Batch Processing in RMVPE mel2hidden
+- [ ] Optimize `mel2hidden` to process the mel spectrogram in chunks for better GPU cache utilization and throughput.
+
+### 2. Fused Operations in Hubert
+- [ ] Profile and possibly fuse transformer blocks (Q/K/V projections, softmax, output projection) in Hubert using `mx.compile` more strategically or custom kernels.
+
+### 3. Cache Warmup on Model Load
+- [ ] Run a single dummy inference iteration immediately after loading models to trigger all MLX kernel compilation (JIT). This shifts the one-time "first run" penalty to the startup phase.
+
+### 4. Proper End-to-End float16 Support
+Currently, float16 causes a slowdown because of constant casting between float32 (audio/mel) and float16 (model). To fix:
+- [ ] Convert input audio to `float16` immediately after loading.
+- [ ] Update `mel_spectrogram` to output `float16`.
+- [ ] Implement `tree_map` to cast all model parameters to `float16` at load time.
+- [ ] Ensure the entire pipeline operates in `float16`, only casting back to `float32` for final storage.
+
+### 5. Streaming Synthesis
+- [ ] Implement overlapping chunk processing for the Synthesizer to reduce peak memory usage and potentially enable real-time/streaming output.
+
+### 6. Remove librosa Dependency Entirely
+- [x] Replaced librosa `to_mono` and `resample` with `scipy`/`numpy` in `load_audio_infer`.
+- [ ] Investigate moving audio loading entirely to a more lightweight solution if `soundfile`/`scipy` overhead is still noticeable.
+
+### 7. Custom Metal Kernels
+- [ ] For absolute peak performance, write optimized Metal shaders for the most compute-intensive operations if `mx.compile` isn't sufficient.
+
+### 8. Quantization (INT8/INT4)
+- [ ] Explore `mlx.nn.QuantizedLinear` for the Synthesizer model to reduce memory bandwidth requirements.
diff --git a/rvc/infer/infer_mlx.py b/rvc/infer/infer_mlx.py
@@ -208,7 +208,7 @@ def get_vc(self, weight_root, sid):
             )
             # Use load_weights assuming flattened structure
             # self.mlx_model.load_weights(list(renamed_weights.items())) -- expects file
-            self.mlx_model.update(renamed_weights) 
+            self.mlx_model.update(renamed_weights)
             # MX eval to ensure weights loaded/cached
             mx.eval(self.mlx_model.parameters())
             
diff --git a/rvc/lib/mlx/rmvpe.py b/rvc/lib/mlx/rmvpe.py
@@ -438,6 +438,7 @@ def __init__(self, model_path=None, weights_path=None, device=None):
             print(f"RMVPE MLX weights not found at {weights_path}")
         else:
             self.model.load_weights(weights_path)
+            mx.eval(self.model.parameters())
             
         # Constants for decode
         N_CLASS = 360
diff --git a/rvc/lib/utils.py b/rvc/lib/utils.py
@@ -71,12 +71,17 @@ def load_audio_infer(
         if not os.path.isfile(file):
             raise FileNotFoundError(f"File not found: {file}")
         audio, sr = sf.read(file)
+        
+        # Convert to mono using numpy (no librosa)
         if len(audio.shape) > 1:
-            audio = librosa.to_mono(audio.T)
+            audio = np.mean(audio, axis=1)  # Average channels for mono
+            
+        # Resample using scipy (no librosa)
         if sr != sample_rate:
-            audio = librosa.resample(
-                audio, orig_sr=sr, target_sr=sample_rate, res_type="soxr_vhq"
-            )
+            from scipy import signal
+            num_samples = int(len(audio) * sample_rate / sr)
+            audio = signal.resample(audio, num_samples)
+            
         if formant_shifting:
             formant_qfrency = kwargs.get("formant_qfrency", 0.8)
             formant_timbre = kwargs.get("formant_timbre", 0.8)
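
The mono conversion and resampling in the hunk above can be sketched in isolation as follows. This is a minimal illustration, not the repository's code: the helper names `to_mono`, `resample_fft`, and `resample_polyphase` are hypothetical. Note that `scipy.signal.resample` is FFT-based and treats the signal as periodic, so its output is not expected to match librosa's previous `soxr_vhq` result sample for sample. `scipy.signal.resample_poly` is shown only as a possible alternative that the commit does not use.

```python
from math import gcd

import numpy as np
from scipy import signal


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average channels of a (frames, channels) array, the layout sf.read returns."""
    if audio.ndim > 1:
        return audio.mean(axis=1)
    return audio


def resample_fft(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """FFT-based resampling, mirroring the approach taken in the commit."""
    if orig_sr == target_sr:
        return audio
    num_samples = int(len(audio) * target_sr / orig_sr)
    return signal.resample(audio, num_samples)


def resample_polyphase(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Polyphase FIR resampling (an alternative, NOT what the commit uses)."""
    if orig_sr == target_sr:
        return audio
    g = gcd(orig_sr, target_sr)
    return signal.resample_poly(audio, target_sr // g, orig_sr // g)


# One second of a 440 Hz tone at 48 kHz, duplicated into two channels.
sr_in, sr_out = 48000, 16000
t = np.arange(sr_in) / sr_in
stereo = np.stack([np.sin(2 * np.pi * 440 * t)] * 2, axis=1)

mono = to_mono(stereo)
fft_out = resample_fft(mono, sr_in, sr_out)
poly_out = resample_polyphase(mono, sr_in, sr_out)
```

The FFT method is simple and exact for signals that are periodic over the buffer, but it can ring at the edges of non-periodic audio. Whether that matters for inference quality here would need to be checked against the previous librosa output.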
