Easy XTTS Trainer

Easy XTTS Trainer is a CLI for preparing speech datasets and fine-tuning XTTS models. It is designed to run inside the Pandrator ecosystem but also supports standalone use.

What is new

Pixi-first WhisperX runtime selection (auto, pixi, conda) with a legacy Conda fallback.
Optional, lazily loaded DeepFilterNet: --denoise soft-fails (warns and continues) when DeepFilterNet is not installed.
More robust --source-text correction for full-book .txt and .epub sources when the input audio covers only part of the book.
Python baseline moved to 3.13, with updated Torch/Coqui compatibility pins.

Requirements

Windows (primary target); an NVIDIA GPU is recommended for training.
Python >=3.13,<3.14.
For source-text alignment: the optional ctc-forced-aligner dependencies (ffmpeg on PATH; MSVC Build Tools if building from source on Windows).

Installation

Pandrator-first (recommended)

Install and launch through Pandrator. Pandrator provides the Pixi environment layout used by this trainer.

Standalone

pip install -r requirements.txt

Optional features:

# Optional: denoise support for --denoise (only where DeepFilterNet wheels exist for your Python version and platform)
pip install DeepFilterNet==0.5.6 DeepFilterLib==0.5.6

# Optional: breath removal support for --breath
pip install "breath-removal @ git+https://github.com/lukaszliniewicz/breath-removal.git@4f3cda4936ae09ec50af010853577fdd4472fd6a"

# Optional: source-text alignment (requires ffmpeg on PATH); prefer the vendored prebuilt wheel (CPython 3.13, Windows x64)
pip install ./vendor/ctc_forced_aligner-0.3.0-cp313-cp313-win_amd64.whl

# Optional fallback: build ctc-forced-aligner from source at a pinned MahmoudAshraf97 commit
pip install "ctc-forced-aligner @ git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git@264e7a1f81bff9ff5e787a5537020c2ad0b0df02"

Usage

Run help:

python -m easy_xtts_trainer --help

Or, if installed as a package:

easy-xtts-trainer --help

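Example (the trailing \ line continuations assume a POSIX-style shell such as Git Bash; in PowerShell end each line with a backtick instead, and in cmd.exe with ^):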

python -m easy_xtts_trainer \
  --source-language en \
  --input "D:\\audiobook" \
  --session "xtts-finetune-mybook" \
  --sample-method mixed \
  --method-proportion 6_4

WhisperX runtime modes

--whisperx-runner controls how transcription is executed:

auto (default) tries, in order:
1. explicit CLI Pixi arguments,
2. WHISPERX_PIXI_EXE + WHISPERX_PIXI_MANIFEST environment variables,
3. Pandrator layout discovery (bin/pixi.exe + envs/whisperx_installer/pixi.toml),
4. Conda fallback.
pixi: require the Pixi runtime.
conda: use the Conda flow only.

Related options:

--whisperx-pixi-exe
--whisperx-pixi-manifest
legacy fallback options: -conda_env, -conda_path

Source-text correction (`--source-text`)

When source text is provided, the pipeline:

1. Runs a baseline WhisperX transcription.
2. Builds query words from the Whisper output.
3. Retrieves the top text candidates from the full source (works with whole-book .txt/.epub).
4. Validates the candidates with CTC alignment.
5. Keeps the best candidate only when confidence is high; otherwise keeps the WhisperX output.
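Step 3 (candidate retrieval) could look roughly like the sketch below. The trainer's actual retrieval method is not described here; this swaps in a simple bag-of-words overlap over sliding word windows, and `top_candidates` with its `k`/`slack` parameters is hypothetical. Windows are returned as raw source substrings so the book's casing and punctuation (for example, proper nouns) survive.

```python
"""Sketch: find source-text windows that best match a Whisper segment."""
import re
from collections import Counter

WORD = re.compile(r"[\w']+")


def top_candidates(whisper_text: str, source_text: str, k: int = 3, slack: int = 5) -> list[str]:
    """Return up to k non-overlapping source windows sharing the most words with whisper_text."""
    query = Counter(m.group().lower() for m in WORD.finditer(whisper_text))
    words = list(WORD.finditer(source_text))  # keep offsets into the original text
    if not query or not words:
        return []
    size = min(len(words), sum(query.values()) + slack)

    scored = []
    for start in range(len(words) - size + 1):
        window = Counter(m.group().lower() for m in words[start:start + size])
        overlap = sum((window & query).values())  # multiset word overlap
        scored.append((overlap, start))
    scored.sort(key=lambda item: (-item[0], item[1]))  # best first, earliest on ties

    chosen: list[int] = []
    for overlap, start in scored:
        if overlap == 0 or len(chosen) == k:
            break
        if all(abs(start - other) >= size for other in chosen):  # skip overlapping windows
            chosen.append(start)
    # Slice the raw source so casing and punctuation are preserved.
    return [source_text[words[s].start():words[s + size - 1].end()] for s in chosen]
```

Scanning every window costs roughly book length times window size, which is fine for a sketch; a real implementation would more likely index the source first.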

This guards against destructive over-correction while still fixing proper nouns and wording where good source context exists.
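The final keep-or-fallback decision (step 5) can be sketched as a threshold gate. The trainer scores candidates with CTC alignment against the audio; here `difflib.SequenceMatcher` similarity to the Whisper text is used purely as a stand-in scorer, and `choose_transcript` and the 0.85 threshold are hypothetical.

```python
"""Sketch: keep a source-text candidate only when confidence clears a threshold."""
from collections.abc import Callable, Iterable
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in confidence in [0, 1]; the trainer uses CTC alignment instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def choose_transcript(
    whisper_text: str,
    candidates: Iterable[str],
    score: Callable[[str, str], float] = similarity,
    threshold: float = 0.85,  # hypothetical cut-off
) -> str:
    best_text, best_score = whisper_text, float("-inf")
    for cand in candidates:
        s = score(whisper_text, cand)
        if s > best_score:
            best_text, best_score = cand, s
    # Fall back to the WhisperX output unless the best candidate is confident enough.
    return best_text if best_score >= threshold else whisper_text
```

Passing a different `score` callable (for example, one backed by an aligner) changes the confidence measure without touching the fallback logic.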

Key CLI arguments

| Argument | Description | Default |
| --- | --- | --- |
| `--source-language` | Source language code for transcription/training | required |
| `--input` | Input folder or single audio file | optional when reusing a session |
| `--session` | Session folder name/path | `xtts-finetune-YYYY-MM-DD-HH-MM` |
| `--whisper-model` | Whisper model | `large-v3` |
| `--whisperx-runner` | WhisperX runtime mode (`auto`, `pixi`, `conda`) | `auto` |
| `--source-text` | `.txt` or `.epub` source text for correction/alignment | off |
| `--chapter-per-audio` | EPUB chapter grouping size before full-source merge | `1` |
| `--sample-method` | Segmentation strategy (`maximise-punctuation`, `punctuation-only`, `mixed`) | `maximise-punctuation` |
| `--method-proportion` | Mixed segmentation split (`N_M`) | `6_4` |
| `--training-proportion` | Train/validation split (`N_M`) | `8_2` |
| `--max-audio-time` | Max segment duration (seconds) | `11` |
| `--max-text-length` | Max segment text length (characters) | `200` |
| `--denoise` | Enable DeepFilterNet denoising (optional dependency) | off |
| `--breath` | Enable breath-removal preprocessing (optional dependency) | off |
| `--dess` | Enable de-essing | off |
| `--normalize` | Target LUFS for loudness normalization | unset (`-16.0` when the flag is given without a value) |
| `--compress` | Compression profile (`male`, `female`, `neutral`) | off |
| `--prepare_dataset` | Prepare the dataset only, skip training | off |
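Several defaults in the table (the `N_M` proportions, the optional-value `--normalize`, the fixed choice lists) map naturally onto `argparse`. The sketch below is a hypothetical declaration of a subset of these flags, not the trainer's actual parser; it assumes an `N_M` value means the fractions N/(N+M) and M/(N+M).

```python
"""Sketch: declaring a subset of the CLI table with argparse (hypothetical)."""
import argparse


def proportion(value: str) -> tuple[float, float]:
    """Parse an `N_M` split such as `8_2` into the fractions (0.8, 0.2)."""
    try:
        n, m = (int(part) for part in value.split("_"))
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected N_M, got {value!r}") from None
    if n < 0 or m < 0 or n + m == 0:
        raise argparse.ArgumentTypeError(f"invalid proportion {value!r}")
    return n / (n + m), m / (n + m)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="easy-xtts-trainer-sketch")
    parser.add_argument("--source-language", required=True)
    parser.add_argument("--whisperx-runner", choices=["auto", "pixi", "conda"], default="auto")
    parser.add_argument(
        "--sample-method",
        choices=["maximise-punctuation", "punctuation-only", "mixed"],
        default="maximise-punctuation",
    )
    # String defaults are passed through `type`, so these become tuples as well.
    parser.add_argument("--method-proportion", type=proportion, default="6_4")
    parser.add_argument("--training-proportion", type=proportion, default="8_2")
    # Absent -> None (no normalization); bare flag -> -16.0 LUFS; or an explicit target.
    parser.add_argument("--normalize", type=float, nargs="?", const=-16.0, default=None)
    parser.add_argument("--denoise", action="store_true")
    return parser
```

With `nargs="?"`, `default=None` covers the flag being absent and `const=-16.0` covers the bare flag, matching the behaviour the table describes for `--normalize`.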

Testing

python -m pytest

Notes

ctc-forced-aligner is optional by design; it is only needed for --source-text alignment.
If DeepFilterNet is unavailable, --denoise prints a warning and processing continues.
If breath-removal is unavailable, --breath prints a warning and processing continues.
For Pandrator installations, leave the WhisperX runner mode on auto.
