Commit eeb2950

Author: Alex J Lennon

Training: --yes flag, MFA --copy-only, skip successful steps

- Add --yes/-y to train.py to auto-confirm download/prepare (no EOF in non-interactive runs)
- Add --copy-only to run_mfa_alignment_prepared.sh to copy alignment from cache only (fixes Step 5 with 100k+ files via find -print0)
- Dataset manager: run MFA with --copy-only when alignment cache exists; require JSONs for 'prepared'; skip corpus creation when WAV+LAB already exist
- QUICKSTART: AMD GPU (ROCm) section, note about uv run reverting to CUDA wheel
- SSH multiplexing rule for ai-tools LXC

Made-with: Cursor
1 parent f43a35a commit eeb2950
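The `--yes/-y` flag mentioned in the commit message might be wired up roughly like this. This is a sketch, not the actual train.py source: the flag name matches the commit, and the `interactive` plumbing matches the `data_pipeline.py` diff, but `build_parser` and the exact argparse layout are assumptions.

```python
# Hypothetical sketch of how train.py could wire --yes/-y into the
# dataset-preparation call; names other than --yes and `interactive`
# are illustrative.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Train viseme model")
    parser.add_argument("--config", required=True, help="Path to recipe TOML")
    parser.add_argument(
        "-y", "--yes", action="store_true",
        help="Auto-confirm dataset download/prepare (non-interactive runs)",
    )
    return parser

args = build_parser().parse_args(["--config", "recipe.toml", "--yes"])
# Downstream, the flag becomes interactive=False for the dataset manager,
# so prompts are auto-confirmed instead of hitting EOF in CI/cron runs.
interactive = not args.yes
print(interactive)
```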

7 files changed

Lines changed: 274 additions & 39 deletions


QUICKSTART.md

Lines changed: 69 additions & 3 deletions
@@ -83,16 +83,82 @@ To train for **UK English** (British phoneme set and viseme mapping):
 
 The UK recipe uses `training/configs/viseme_map_en_uk_mfa.json`, which maps the UK MFA phone set (IPA-style symbols) to the same 15 visemes. When prompted to download/prepare data, answer **`y`**; alignment will run with the UK model.
 
+## 4c. Full training (production ONNX)
+
+The quick recipes (4 and 4b) use **dev-clean** only and produce a small ONNX suitable for testing. For a **production-quality** model you need to train on the **full** LibriSpeech training sets and then export to ONNX.
+
+**Data:** Full training uses LibriSpeech **train-clean-100** (~6GB), **train-clean-360** (~23GB), and **train-other-500** (~30GB). The first time you run, the script will prompt to download and prepare these; preparation (WAV + MFA alignment) takes a long time per split. **GPU optional:** the recipes default to `device = "cpu"` so training runs without a GPU; for much faster training set `[hardware] device = "cuda"` in the recipe (or `mps` on Apple Silicon).
+
+**US English (full):**
+
+```bash
+uv run python training/train.py --config training/recipes/tcn_config.toml
+```
+
+When prompted to download and prepare missing datasets, answer **`y`**. Each split (train-clean-100, train-clean-360, train-other-500) will be downloaded, converted, and aligned with MFA (US) in turn. Training runs for up to 100 epochs with early stopping.
+
+**UK English (full):**
+
+1. Install UK MFA models (see 4b).
+2. Set MFA env vars and run the full UK recipe:
+
+```bash
+export MFA_ACOUSTIC_MODEL=english_mfa MFA_DICTIONARY_MODEL=english_uk_mfa
+uv run python training/train.py --config training/recipes/tcn_full_uk.toml
+```
+
+Answer **`y`** when asked to download and prepare datasets. Alignment will use the UK dictionary.
+
+**Export to ONNX:** After training, export the best checkpoint so the realtime harness and C# app can use it:
+
+```bash
+uv run python training/tools/export_onnx.py --list
+uv run python training/tools/export_onnx.py --run <run_name> --checkpoint best
+```
+
+`--list` shows available runs under `training/runs/`. Use the run name (e.g. `tcn_full_uk_2026-02-21_12-00-00`) with `--run`. The export writes to `export/<run_name>/` (model.onnx and config.json). The realtime script and C# app pick the newest `export/*/model.onnx` by default.
+
+**Smaller full run:** To try full training with less data, edit the recipe and set e.g. `splits = ["train-clean-100"]` (100h only). Use `training/recipes/tcn_config.toml` (US) or `training/recipes/tcn_full_uk.toml` (UK).
+
 ## 5. Optional: use GPU
 
-Edit `training/recipes/tcn_quick_laptop.toml` and set:
+Edit the recipe (e.g. `training/recipes/tcn_quick_laptop.toml` or `tcn_config.toml`) and set:
 
 ```toml
 [hardware]
-device = "cuda"   # or "mps" on Apple Silicon
+device = "cuda"   # NVIDIA GPU, or AMD GPU with ROCm (same API)
+# device = "mps"  # Apple Silicon
 ```
 
-If CUDA/MPS isn’t available, the trainer falls back to CPU and logs a warning.
+If CUDA/ROCm/MPS isn’t available, the trainer falls back to CPU and logs a warning.
+
+### 5b. AMD GPU (ROCm)
+
+The default `uv sync` installs PyTorch built for **NVIDIA CUDA**. On a machine with an **AMD GPU** (e.g. Radeon RX 7700/7800, Navi 32), you need PyTorch built for **ROCm** so that `torch.cuda.is_available()` is True (ROCm uses the same `torch.cuda` API).
+
+**1. Ensure the GPU is visible**
+
+- Kernel driver: `/dev/kfd` and `/dev/dri/renderD*` should exist (amdgpu driver).
+- Your user must be in the `render` (and usually `video`) group so the process can open those devices:
+  `groups` should list `render`; if not, add with `sudo usermod -aG render,video $USER` and log in again.
+
+**2. Install PyTorch with ROCm**
+
+From the project root, override the default torch/torchaudio with the ROCm wheels. Use the index that matches your ROCm version (see [PyTorch get-started](https://pytorch.org/get-started/locally/) and choose Linux → Pip → ROCm). Example for ROCm 6.3:
+
+```bash
+uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
+```
+
+If your distro uses a different ROCm version, use the matching index (e.g. `rocm5.6`, `rocm6.2`). Python 3.13 may not have ROCm wheels on all indices; if so, try the [AMD ROCm docs](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) or PyTorch “Previous versions” for a compatible wheel.
+
+**3. Use the GPU in training**
+
+In the recipe set `device = "cuda"` (same as for NVIDIA). Then run training as usual; the trainer will use the AMD GPU via ROCm.
+
+**Verify:** `uv run python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else '')"` should print `True` and the GPU name.
+
+**Note:** If you use `uv` and install ROCm via `uv pip install ... --index-url ...rocm6.3`, then `uv run` will re-sync from the lock file and can revert to the default CUDA wheel. To keep using the GPU, run training with the venv Python directly, e.g. `.venv/bin/python training/train.py --config ...`, or a wrapper script that calls `.venv/bin/python`.
 
 ## 6. Where outputs go
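The CPU-fallback behaviour described in this section (request `cuda`/`mps`, warn and fall back to CPU when unavailable) can be sketched as pure logic. This is an assumed illustration, not the trainer's actual code; in the real trainer the availability flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
# Minimal sketch of the device-fallback rule, with availability passed in
# so it can be shown without importing torch. With ROCm builds of PyTorch,
# torch.cuda.is_available() reports the AMD GPU, so "cuda" covers ROCm too.
import logging

logger = logging.getLogger("trainer")

def resolve_device(requested: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Return the device to use, falling back to CPU if unavailable."""
    if requested == "cuda" and cuda_ok:
        return "cuda"
    if requested == "mps" and mps_ok:
        return "mps"
    if requested != "cpu":
        logger.warning("%s not available, falling back to CPU", requested)
    return "cpu"

print(resolve_device("cuda", cuda_ok=False, mps_ok=False))  # cpu
```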

README.md

Lines changed: 4 additions & 1 deletion
@@ -27,7 +27,10 @@ mfa model download g2p english_us_arpa
 
 Dataset Download is now integrated in the training script.
 
-```python training/train.py --config training/recipes/tcn_config.toml```
+**Quick (laptop) training:**
+`uv run python training/train.py --config training/recipes/tcn_quick_laptop.toml`
+
+**Full training (production ONNX):** See [QUICKSTART.md](QUICKSTART.md) section 4c. Use `training/recipes/tcn_config.toml` (US) or `training/recipes/tcn_full_uk.toml` (UK), then export with `training/tools/export_onnx.py --run <run_name> --checkpoint best`.
 
 
 This project uses the [LibriSpeech ASR corpus](https://openslr.org/12/) (CC BY 4.0 license).

run_mfa_alignment_prepared.sh

Lines changed: 42 additions & 6 deletions
@@ -40,12 +40,21 @@ command_exists() {
 }
 
 # Check arguments
-# Usage: DATASET [MODEL] or DATASET ACOUSTIC DICTIONARY (for UK: english_mfa english_uk_mfa)
+# Usage: DATASET [MODEL] or DATASET ACOUSTIC DICTIONARY or DATASET [MODEL] --copy-only
+COPY_ONLY=0
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --copy-only) COPY_ONLY=1; shift ;;
+        *) break ;;
+    esac
+done
+
 if [ $# -lt 1 ] || [ $# -gt 3 ]; then
-    print_error "Usage: $0 DATASET_NAME [MFA_MODEL]"
-    print_error "   or: $0 DATASET_NAME ACOUSTIC_MODEL DICTIONARY_MODEL"
+    print_error "Usage: $0 DATASET_NAME [MFA_MODEL] [--copy-only]"
+    print_error "   or: $0 DATASET_NAME ACOUSTIC_MODEL DICTIONARY_MODEL [--copy-only]"
     print_error "Example: $0 test-clean"
     print_error "Example (UK): $0 dev-clean english_mfa english_uk_mfa"
+    print_error "Example (copy only, skip MFA when alignment cache exists): $0 train-clean-360 --copy-only"
     exit 1
 fi
 
@@ -113,6 +122,33 @@ fi
 
 print_status "Found ${WAV_COUNT} WAV files and ${LAB_COUNT} LAB files"
 
+# --- Copy-only mode: only copy existing alignment from cache to prepared (skip MFA) ---
+if [ "${COPY_ONLY}" -eq 1 ]; then
+    if [ ! -d "${TEMP_OUT_ALIGN}" ]; then
+        print_error "Copy-only mode: alignment output not found at ${TEMP_OUT_ALIGN}"
+        print_error "Run without --copy-only to perform full MFA alignment first."
+        exit 1
+    fi
+    JSON_COUNT=$(find "${TEMP_OUT_ALIGN}" -name "*.json" -not -name "alignment_analysis*" | wc -l)
+    if [ "${JSON_COUNT}" -eq 0 ]; then
+        print_error "Copy-only mode: no JSON alignment files in ${TEMP_OUT_ALIGN}"
+        exit 1
+    fi
+    print_status "Copy-only: copying ${JSON_COUNT} alignment files to prepared dataset..."
+    ALIGNED_COUNT=0
+    while IFS= read -r -d '' json_file; do
+        base_name=$(basename "$json_file" .json)
+        dest_file="${PREPARED_DIR}/${base_name}.json"
+        if [ -f "${PREPARED_DIR}/${base_name}.wav" ]; then
+            cp "$json_file" "$dest_file"
+            ALIGNED_COUNT=$((ALIGNED_COUNT + 1))
+        fi
+    done < <(find "${TEMP_OUT_ALIGN}" -name "*.json" -not -name "alignment_analysis*" -print0)
+    print_success "Copied ${ALIGNED_COUNT} alignment files to ${PREPARED_DIR}"
+    print_success "Done (copy-only). No cleanup - cache left at ${TEMP_OUT_ALIGN}"
+    exit 0
+fi
+
 # Create necessary directories
 mkdir -p "${TEMP_CORPUS}"
 mkdir -p "${MFA_DIR}"
@@ -220,8 +256,8 @@ if [ $? -eq 0 ]; then
     print_status "Step 5: Copying alignment results to prepared dataset..."
 
     ALIGNED_COUNT=0
-    # Find all JSON files in speaker subdirectories (skip alignment_analysis.csv)
-    for json_file in $(find "${TEMP_OUT_ALIGN}" -name "*.json" -not -name "alignment_analysis*"); do
+    # Use a null-delimited while-read loop (find -print0) to avoid command-line length limits with 100k+ files
+    while IFS= read -r -d '' json_file; do
        base_name=$(basename "$json_file" .json)
        dest_file="${PREPARED_DIR}/${base_name}.json"
 
@@ -231,7 +267,7 @@ if [ $? -eq 0 ]; then
        else
            print_warning "No corresponding WAV file for alignment: ${base_name}"
        fi
-    done
+    done < <(find "${TEMP_OUT_ALIGN}" -name "*.json" -not -name "alignment_analysis*" -print0)
 
    print_success "Copied ${ALIGNED_COUNT} alignment files to prepared dataset"
 
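The switch above from `for json_file in $(find ...)` to a null-delimited `while read` loop matters for two reasons: the `$(...)` expansion builds one huge word list (which word-splits on spaces and can blow up with 100k+ files), while `find -print0` streams NUL-delimited paths safely. A small self-contained demo of the robust pattern (paths here are temporary, not the script's real directories):

```shell
#!/usr/bin/env bash
# Demo: count *.json files robustly, including names with spaces,
# using the same -print0 / read -d '' pattern as the script above.
set -euo pipefail
tmp=$(mktemp -d)
touch "$tmp/plain.json" "$tmp/with space.json"

count=0
while IFS= read -r -d '' f; do
    count=$((count + 1))
done < <(find "$tmp" -name "*.json" -print0)
echo "$count"   # 2 — the space-containing name is handled as one path

# The old pattern, `for f in $(find "$tmp" -name "*.json")`, would split
# "with space.json" into two words and miscount.
rm -rf "$tmp"
```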

training/modules/data_pipeline.py

Lines changed: 12 additions & 6 deletions
@@ -320,7 +320,7 @@ class LibriSpeechDataset(Dataset):
     """
 
     def __init__(self, config: TrainingConfiguration, split: str,
-                 is_training: bool = True, data_root: Optional[str] = None):
+                 is_training: bool = True, data_root: Optional[str] = None, interactive: bool = True):
         """
         Initialize LibriSpeech dataset
 
@@ -329,6 +329,7 @@ def __init__(self, config: TrainingConfiguration, split: str,
             split: Dataset split (e.g., "train-clean-100", "dev-clean")
             is_training: Whether this is for training (affects augmentation)
             data_root: Root directory for LibriSpeech data
+            interactive: If False, auto-confirm dataset download/prepare (--yes)
         """
         self.config = config
         self.split = split
@@ -347,7 +348,7 @@ def __init__(self, config: TrainingConfiguration, split: str,
         self.dataset_manager = DatasetManager()
 
         # Ensure dataset is prepared
-        if not self.dataset_manager.prepare_datasets([split], interactive=True):
+        if not self.dataset_manager.prepare_datasets([split], interactive=interactive):
             raise RuntimeError(f"Failed to prepare dataset: {split}")
 
         # Load prepared data file list
@@ -688,14 +689,16 @@ def collate_audio_samples(batch: List[AudioSample]) -> Dict[str, torch.Tensor]:
 
 def create_data_loaders(config: TrainingConfiguration,
                         data_root: Optional[str] = None,
-                        pin_memory: Optional[bool] = None) -> Tuple[DataLoader, DataLoader, DataLoader]:
+                        pin_memory: Optional[bool] = None,
+                        interactive: bool = True) -> Tuple[DataLoader, DataLoader, DataLoader]:
     """
     Create training, validation, and test data loaders
 
     Args:
         config: Training configuration
        data_root: Root directory for data (optional)
        pin_memory: Override pin_memory setting (optional)
+       interactive: If False, auto-confirm dataset download/prepare (--yes)
 
     Returns:
        Tuple of (train_loader, val_loader, test_loader)
@@ -710,7 +713,8 @@ def create_data_loaders(config: TrainingConfiguration,
             config=config,
             split=split,
             is_training=True,
-            data_root=data_root
+            data_root=data_root,
+            interactive=interactive
         )
         train_datasets.append(dataset)
 
@@ -722,14 +726,16 @@ def create_data_loaders(config: TrainingConfiguration,
         config=config,
         split=config.data.val_split,
         is_training=False,
-        data_root=data_root
+        data_root=data_root,
+        interactive=interactive
     )
 
     test_dataset = LibriSpeechDataset(
         config=config,
         split=config.data.test_split,
         is_training=False,
-        data_root=data_root
+        data_root=data_root,
+        interactive=interactive
     )
 
     # Create data loaders
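The pattern in this diff — threading a new keyword through a factory with a default that preserves existing call sites — can be shown with a toy sketch (names here are illustrative, not the project API):

```python
# Toy illustration: add `interactive` with default True so old callers are
# unchanged, while train.py --yes can pass interactive=False explicitly.
def make_dataset(split: str, interactive: bool = True) -> dict:
    return {"split": split, "interactive": interactive}

def create_loaders(splits, interactive: bool = True):
    return [make_dataset(s, interactive=interactive) for s in splits]

print(create_loaders(["dev-clean"])[0]["interactive"])                      # True
print(create_loaders(["dev-clean"], interactive=False)[0]["interactive"])   # False
```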

training/modules/dataset_manager.py

Lines changed: 41 additions & 20 deletions
@@ -99,18 +99,21 @@ def _check_aligned_data(self, dataset: str) -> bool:
         return False
 
     def _check_prepared_data(self, dataset: str) -> bool:
-        """Check if prepared data exists (WAV + LAB in flat structure).
-
-        JSON alignment files are optional and may be added later.
-        """
+        """Check if prepared data exists and is ready for training (WAV + LAB + alignment JSONs)."""
         prepared_dataset_dir = self.prepared_dir / dataset
         if not prepared_dataset_dir.exists():
             return False
-
-        # Require matching WAV and LAB pairs; JSONs are not required
+
         wav_files = set(f.stem for f in prepared_dataset_dir.glob("*.wav"))
         lab_files = set(f.stem for f in prepared_dataset_dir.glob("*.lab"))
-        return len(wav_files) > 0 and wav_files == lab_files
+        if len(wav_files) == 0 or wav_files != lab_files:
+            return False
+
+        # Require at least one alignment JSON so we don't treat "MFA copy failed" as ready
+        json_stems = set(f.stem for f in prepared_dataset_dir.glob("*.json"))
+        if not json_stems or not json_stems.intersection(wav_files):
+            return False
+        return True
 
     def prepare_datasets(self, datasets: List[str], interactive: bool = True) -> bool:
         """
@@ -163,8 +166,12 @@ def prepare_datasets(self, datasets: List[str], interactive: bool = True) -> boo
                 logger.error("Cannot proceed without required datasets")
                 return False
             else:
-                logger.error(f"Missing datasets: {', '.join(missing_datasets)}")
-                return False
+                # Non-interactive (e.g. --yes): auto-confirm download and prepare
+                logger.info(f"Auto-confirming download/prepare for: {', '.join(missing_datasets)}")
+                if not self._download_datasets(missing_datasets):
+                    logger.error("Failed to download datasets")
+                    return False
+                needs_preparation.extend(missing_datasets)
 
         # Handle datasets that need preparation
         if needs_preparation:
@@ -224,9 +231,15 @@ def _download_datasets(self, datasets: List[str]) -> bool:
     def _prepare_single_dataset(self, dataset: str) -> bool:
         """Prepare a single dataset through the full pipeline"""
         logger.info(f"Preparing dataset: {dataset}")
-
-        # Step 1: Create prepared dataset (WAV + LAB) directly
-        if not self._check_prepared_data(dataset):
+        prepared_dataset_dir = self.prepared_dir / dataset
+
+        # Step 1: Create prepared dataset (WAV + LAB) only if missing
+        has_wav_lab = (
+            prepared_dataset_dir.exists()
+            and len(list(prepared_dataset_dir.glob("*.wav"))) > 0
+            and len(list(prepared_dataset_dir.glob("*.lab"))) > 0
+        )
+        if not has_wav_lab and not self._check_prepared_data(dataset):
             logger.info("Creating prepared dataset...")
             if not self._create_corpus(dataset):
                 return False
@@ -279,22 +292,30 @@ def _create_corpus(self, dataset: str) -> bool:
             return False
 
     def _run_alignment(self, dataset: str) -> bool:
-        """Run MFA alignment using the prepared dataset MFA script"""
+        """Run MFA alignment using the prepared dataset MFA script.
+        If alignment output already exists in cache, runs with --copy-only to skip MFA.
+        """
         try:
-            # Path to the MFA alignment script
             mfa_script = self.project_root / "run_mfa_alignment_prepared.sh"
-
             if not mfa_script.exists():
                 logger.error(f"MFA alignment script not found: {mfa_script}")
                 return False
-
-            logger.info(f"Running MFA alignment for {dataset}...")
-            cmd = [str(mfa_script), dataset]
-
+
+            cache_align_dir = self.cache_dir / f"out_align_{dataset}"
+            copy_only = cache_align_dir.is_dir() and any(
+                cache_align_dir.rglob("*.json")
+            )
+            if copy_only:
+                logger.info(f"Alignment cache exists for {dataset}, running copy-only...")
+                cmd = [str(mfa_script), dataset, "--copy-only"]
+            else:
+                logger.info(f"Running MFA alignment for {dataset}...")
+                cmd = [str(mfa_script), dataset]
+
             result = subprocess.run(cmd, check=True, capture_output=False)
             logger.info(f"MFA alignment completed for {dataset}")
             return True
-
+
         except subprocess.CalledProcessError as e:
             logger.error(f"MFA alignment failed for {dataset}: {e}")
             return False
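The readiness rule this diff adds to `_check_prepared_data` — matching WAV/LAB stems plus at least one alignment JSON overlapping the WAVs — can be exercised standalone. The function name below is illustrative, not the project API:

```python
# Standalone sketch of the "prepared and ready" check: WAV and LAB stems
# must match, and at least one alignment JSON must overlap the WAVs, so a
# failed MFA copy isn't mistaken for a ready dataset.
from pathlib import Path
import tempfile

def prepared_ready(prepared_dir: Path) -> bool:
    if not prepared_dir.exists():
        return False
    wavs = {f.stem for f in prepared_dir.glob("*.wav")}
    labs = {f.stem for f in prepared_dir.glob("*.lab")}
    if not wavs or wavs != labs:
        return False
    jsons = {f.stem for f in prepared_dir.glob("*.json")}
    return bool(jsons & wavs)

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "utt1.wav").touch()
    (root / "utt1.lab").touch()
    print(prepared_ready(root))   # False: WAV+LAB present but no alignment JSON yet
    (root / "utt1.json").touch()
    print(prepared_ready(root))   # True
```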
