VODER is a professional-grade voice processing tool that provides 10 distinct audio transformation modes in a unified CLI. This skill gives AI agents the knowledge needed to run complex audio processing workflows that would be difficult or impossible without it.
The ten modes are: TTS (Text-to-Speech with optional voice cloning), STS (Speech-to-Speech voice conversion), TTM (Text-to-Music with optional voice cloning), STT (Speech-to-Text transcription), SE (Speech Enhancement), SFX (Sound Effects generation), STT+TTS (transcribe → edit → resynthesize), SVS (Source/Track Vocal Separation), SLC (Spoken Language Conversion — dubbing), and SS (Speaker Separation).
Core Philosophy: VODER prioritizes quality over speed. There are no "fast" or "degraded" model options. The tool uses the best available models (Whisper large-v3-turbo / large-v3, Qwen3-TTS, Seed-VC, ACE-Step XL-Turbo / XL-Base / 1.5, BS-RoFormer Resurrection, VibeVoice ASR, Pyannote, UniSE, TangoFlux) to produce professional-quality output.
VODER is not a single AI model — it is an orchestration layer that coordinates multiple state-of-the-art AI models to perform audio transformations. Understanding this architecture is crucial for combining features effectively.
| Model | Purpose | Used In Modes |
|---|---|---|
| Whisper large-v3-turbo | Fast speech-to-text transcription | STT, STT+TTS, Dialogue Source Analysis |
| Whisper large-v3 | High-accuracy transcription + translation to English | STT with translate flag, SLC (translation step) |
| Qwen3-TTS VoiceDesign | Generate speech from voice descriptions | TTS (voice design path) |
| Qwen3-TTS Base | Text-to-speech with built-in voice cloning | TTS (voice clone path via target), STT+TTS, SLC (resynthesis step) |
| Seed-VC v2 | Voice conversion (22.05kHz speech) | STS, TTM with vc flag |
| Seed-VC v1 | Voice conversion (44.1kHz music) | STS with music flag (MSTS, music voice conversion) |
| ACE-Step XL-Turbo | Enhanced music generation (highest quality) | TTM with overdose flag |
| ACE-Step XL-Base | Music generation (complete-mode sub-tasks) | TTM (complete, extract, lego) |
| ACE-Step 1.5 | Music generation (legacy / background music) | TTM (default), Background Music (dialogue music param) |
| BS-RoFormer Resurrection | Vocal/music separation (stem extraction) | SVS, STS (auto vocal extraction), STT (pre-cleanup), TTS (voice clone cleanup), TTM bgm (strip music + reference cleanup) |
| VibeVoice ASR | Advanced ASR with native speaker diarization | STT with overdose flag, SS |
| Pyannote | Speaker diarization (who spoke when) | STT with dialogue flag |
| EasyOCR | Text extraction from images | STT with image input |
| UniSE | Speech enhancement/denoising | SE |
| TangoFlux | Text-to-audio sound effects | SFX |
INPUT TYPES:
┌─────────────────────────────────────────────────────────────────┐
│ Text ──────────────────► TTS, TTM, SFX │
│ Audio ─────────────────► STS, STT, STT+TTS, SE, SVS, SLC, SS │
│ Video ─────────────────► STS, STT, SE, SVS, SS (auto-extract) │
│ Image ─────────────────► STT (OCR text extraction) │
│ YouTube/URL ───────────► STT, STS, TTM, SVS, SLC (auto-dl) │
└─────────────────────────────────────────────────────────────────┘
OUTPUT TYPES:
┌─────────────────────────────────────────────────────────────────┐
│ Audio Output: TTS, STS, TTM, SE, SFX, SVS, SLC │
│ Audio Stems: SVS (voice + instrumental) │
│ Audio Files: SS (per-speaker segments) │
│ Text Output: STT, SS (transcript) │
│ Interactive: STT+TTS (requires text editing step) │
└─────────────────────────────────────────────────────────────────┘
Understanding how data flows through VODER helps you chain operations:
TEXT INPUT PATH (TTS - Voice Design):
Text + Voice Description → Qwen3-TTS VoiceDesign → [Speech with Designed Voice]
TEXT INPUT PATH (TTS - Voice Cloning):
Text + Reference Audio → Qwen3-TTS Base (extract voice embedding → synthesize with clone) → [Speech with Cloned Voice]
AUDIO INPUT PATH (Voice Conversion - STS):
Source Audio + Target Voice Audio → Seed-VC → [Converted Audio]
AUDIO INPUT PATH (Transcription):
Audio → Whisper (large-v3-turbo) → [Transcript Text]
AUDIO INPUT PATH (Translation):
Audio → Whisper large-v3 → [English Transcript Text]
AUDIO INPUT PATH (Overdose Transcription):
Audio → VibeVoice ASR → [Transcript with Native Diarization]
MUSIC GENERATION PATH (Standard):
Lyrics + Style → ACE-Step 1.5 → [Music]
MUSIC GENERATION PATH (Overdose):
Lyrics + Style → ACE-Step XL-Turbo → [Enhanced Quality Music]
MUSIC GENERATION PATH (Voice Clone):
Lyrics + Style → ACE-Step → [Music with Vocals] → Seed-VC Voice Clone → Final Music
ENHANCEMENT PATH:
Degraded Audio → UniSE → [Clean Audio at 16kHz]
SEPARATION PATH (SVS):
Mixed Audio → BS-RoFormer → [Vocals] + [Instrumental]
LANGUAGE CONVERSION PATH (SLC):
Source Audio → Whisper Translate → English Text → Qwen3-TTS (with voice ref) → Translated Audio
SPEAKER SEPARATION PATH (SS):
Multi-Speaker Audio → VibeVoice ASR → Speaker Segments → Individual Audio Files
BGM REPLACEMENT PATH (TTM BGM):
Source Audio/Video → SVS Voice Pipe (strip music) → Detect Duration → ACE-Step (generate new bgm) → Mix at level → [Re-mux if video]
VODER uses three types of parameters:
| Type | Description | Examples |
|---|---|---|
| Positional | Mode name comes first, input files follow | stt "audio.wav" |
| Named | Key-value pairs with space separation | voice "male" duration 30 |
| Flags | Standalone keywords that enable features | timestamp dialogue music translate overdose mimic vc |
Some parameters accept multiple values (dialogue mode), others accept single values:
| Parameter | Single Value | Multiple Values | Mode |
|---|---|---|---|
| `script` | `"Hello world"` | `"James: Hello" "Sarah: Hi"` | TTS |
| `voice` | `"male voice"` | `"James: male" "Sarah: female"` | TTS |
| `target` | `"voice.wav"` | `"James: james.wav" "Sarah: sarah.wav"` | TTS, STS, SLC |
| `music` | `"ambient"` | (single only) | TTS (dialogue) |
| `level` | `"35"` | (single only) | TTS (dialogue) |
| `reference` | `"ref.wav"` | (single only) | TTS (dialogue bgm) |
| `lyrics` | `"..."` | (single only) | TTM |
| `styling` | `"pop"` | (single only) | TTM |
| `stem` | `"voice"` | (single only) | SVS |
| `sound` | `"rain"` | (single only) | SFX |
| `steps` | `30` | (single only) | SFX |
| `guide` | `4.5` | (single only) | SFX |
- Mode comes first: `tts`, `stt`, `sts`, `ttm`, `svs`, `slc`, `ss`, etc.
- Required parameters follow: `script`, `voice`, `target`, `base`, `lyrics`, `styling`, etc.
- Optional parameters come after: `music`, `level`, `result`, `vc`, `stem`, `task`, etc.
- Flags can appear anywhere after mode: `timestamp`, `dialogue`, `music` (STS), `mimic` (STS), `translate` (STT), `overdose` (STT, TTM), `vc` (TTM)
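The ordering rules above can be sketched as a small argv builder. This is an illustration only: `build_voder_args` is a hypothetical helper, not part of VODER, and it assumes the `python src/voder.py` entry point used in the examples throughout this document.

```python
from __future__ import annotations

from typing import Mapping, Sequence, Union

Value = Union[str, int, float, Sequence[str]]


def build_voder_args(
    mode: str,
    named: Mapping[str, Value],
    flags: Sequence[str] = (),
) -> list[str]:
    """Assemble an argv list: mode first, named parameters next, flags last."""
    args = ["python", "src/voder.py", mode]
    for key, value in named.items():
        args.append(key)
        if isinstance(value, (list, tuple)):
            # Multi-value parameter, e.g. several "Name: text" dialogue strings.
            args.extend(str(item) for item in value)
        else:
            args.append(str(value))
    args.extend(flags)
    return args


args = build_voder_args(
    "sts",
    {"base": "source.wav", "target": "voice.wav"},
    flags=["music"],
)
print(args)
```

Building a list rather than a single string means the result can be handed to `subprocess.run` without shell quoting, which matters for dialogue values that contain spaces and colons.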
| Mode | Section | Input Type | Output Type | One-Liner Support |
|---|---|---|---|---|
| TTS | 2.1 | Text [ + Audio ] | Audio | ✅ Full (single + dialogue, voice cloning via target) |
| STS | 2.2 | Audio/Video + Audio | Audio/Video | ✅ Single only |
| TTM | 2.3 | Text [ + Audio ] | Audio | ✅ Single only (voice cloning via vc + clone) |
| STT | 2.4 | Audio/Video/Image/URL | Text | ✅ Full (single + batch) |
| SE | 2.5 | Audio/Video | Audio/Video | ✅ Full |
| SFX | 2.6 | Text | Audio | ✅ Full |
| STT+TTS | 2.7 | Audio + Audio | Audio | ❌ Interactive only |
| SVS | 2.8 | Audio/Video/URL | Audio (stems) | ✅ Full |
| SLC | 2.9 | Audio [ + Audio ] | Audio | ✅ Full |
| SS | 2.10 | Audio/Video/URL | Audio + Text | ✅ Full |
Note: `tts+vc` and `ttm+vc` are no longer accepted as commands and will produce an error. Use `tts` with `target` for voice cloning, and `ttm` with `vc` + `clone` for voice conversion in TTM.
TTS mode generates human-like speech from text input. It supports two synthesis paths in a single unified mode:
- Voice Design (`voice` parameter): Creates voices from scratch based on natural language descriptions using Qwen3-TTS VoiceDesign. You can describe voices that don't exist in any database, such as a "weathered old sailor with a gravelly voice" or a "cheerful AI assistant with a slight metallic quality."
- Voice Cloning (`target` parameter): Generates speech that sounds like a specific real person from a reference audio file using Qwen3-TTS Base's built-in cloning capability. The reference audio can be a recording of anyone (with ethical consent), and the output will match their voice characteristics.
Both paths can be mixed in the same dialogue using the cross-use feature — some characters designed, others cloned.
Note: The old `tts+vc` command is no longer accepted. Use `tts` with the `target` parameter instead.
Voice Design Path:
- Voice Prompt Interpretation: The model parses your voice description to extract characteristics (age, gender, tone, pace, accent)
- Speech Synthesis: Text is converted to mel-spectrograms based on the voice characteristics
- Audio Generation: Spectrograms are converted to waveform audio
Voice Cloning Path (IMPORTANT: Uses Qwen3-TTS Base Built-in Cloning):
Voice cloning does NOT use Seed-VC. It uses Qwen3-TTS Base's built-in voice cloning capability:
- Voice Embedding Extraction: Qwen3-TTS Base's `create_voice_clone_prompt()` method analyzes the reference audio and extracts a voice embedding (x-vector) using `x_vector_only_mode=True`
- Direct Synthesis with Clone: The `generate_voice_clone()` method synthesizes the text directly with the cloned voice characteristics embedded. This is NOT a two-step process (synthesis then conversion), but a single integrated process
- Consistency Optimization: In dialogue mode, the voice embedding is extracted once per character at the start and reused for all their lines
Why This Matters: Unlike a two-stage process (synthesize → convert), Qwen3-TTS Base's integrated cloning produces more natural results because the voice characteristics are considered during the entire synthesis process, not applied as a transformation afterward.
Shared Path:
4. Optional Music Addition: If music parameter is provided, ACE-Step generates background music that matches the dialogue duration
| Scenario | Voice Design (`voice`) | Voice Cloning (`target`) |
|---|---|---|
| Fictional characters | ✅ Ideal | ❌ No reference exists |
| Brand-consistent content | ✅ If voice profile defined | ✅ If reference available |
| Localization | ✅ Possible | ✅ Better — preserves identity |
| Accessibility | ❌ No reference | ✅ Use familiar voice |
| Podcast/narration | ✅ Full control | ✅ Match existing host |
| Testing/prototyping | ✅ Fast iteration | ❌ Need reference first |
# Minimal command
python src/voder.py tts script "Your text here" voice "voice description"
# With output routing
python src/voder.py tts script "Your text here" voice "voice description" result "/output/file.wav"
# Full command with music
python src/voder.py tts script "Your text here" voice "voice description" music "music description" level "volume" result "/output/file.wav"
# OCR input (image to narration)
python src/voder.py tts ocr "path/to/image.png" voice "text: professional male narrator"
python src/voder.py tts ocr "script_screenshot.jpg" voice "text: warm female voice"

# Voice cloning with target parameter
python src/voder.py tts script "Your text here" target "voice_reference.wav"
# With output routing
python src/voder.py tts script "Your text here" target "voice_reference.wav" result "/output/file.wav"
# OCR input with voice clone
python src/voder.py tts ocr "path/to/image.png" target "text: voice_reference.wav"

# Two characters
python src/voder.py tts script "Character1: line1" "Character2: line2" voice "Character1: voice prompt1" "Character2: voice prompt2"
# Three+ characters
python src/voder.py tts script "A: line" "B: line" "C: line" voice "A: prompt" "B: prompt" "C: prompt"
# Dialogue with background music
python src/voder.py tts script "A: line1" "B: line2" voice "A: prompt1" "B: prompt2" music "ambient description"
# Dialogue with music and volume control
python src/voder.py tts script "A: line1" "B: line2" voice "A: prompt1" "B: prompt2" music "ambient description" level "35"
# Dialogue with SFX lines embedded
python src/voder.py tts script "A: Hello" "sfx: door bell /duration:3" "B: Who's there?" voice "A: male" "B: female"
# Full dialogue command with all features
python src/voder.py tts script "A: Welcome /time:0" "sfx: intro /duration:5 /level:40 /time:0" "B: Hello! /time:6" voice "A: deep male" "B: bright female" music "soft ambient" level "0:30-60:20" result "/output/podcast.wav"
# Dialogue with background music and reference for style guidance
python src/voder.py tts script "A: line1" "B: line2" voice "A: prompt" "B: prompt" music "ambient" reference "style_ref.wav"

# Two characters with cloned voices
python src/voder.py tts script "James: line1" "Sarah: line2" target "James: /path/to/james.wav" "Sarah: /path/to/sarah.wav"
# With background music
python src/voder.py tts script "J: Hello" "S: Hi" target "J: james.wav" "S: sarah.wav" music "jazz background" level "30"

# Mix designed and cloned voices in the same dialogue
python src/voder.py tts script \
"James: Welcome to our podcast!" \
"Sarah: Thanks for having me!" \
voice "James: deep male voice, authoritative" \
target "Sarah: /path/to/sarah_voice_reference.wav"
# Cross-use: James cloned, Sarah designed
python src/voder.py tts script \
"James: Let me share my screen." \
"Sarah: Go ahead, I'm ready." \
target "James: /path/to/james_voice.wav" \
voice "Sarah: bright female voice, enthusiastic"
# Three characters: mixed approach
python src/voder.py tts script \
"Host: Welcome to the debate!" \
"Guest1: Thank you for having me." \
"Guest2: Pleasure to be here." \
voice "Host: professional broadcaster, neutral accent" \
target "Guest1: /path/to/guest1.wav" "Guest2: /path/to/guest2.wav"

| Parameter | Required | Purpose | Single Mode | Dialogue Mode |
|---|---|---|---|---|
| `script` | Yes | Text to synthesize | Single text string | Multiple "Char: text" strings |
| `voice` | Yes* | Voice description | Single prompt | "Char: prompt" per character |
| `target` | No* | Voice reference file | Single path | "Char: /path/to/file.wav" |
| `music` | No | Background music style | Ignored | Single description |
| `level` | No | Music volume | Ignored | Volume specification |
| `reference` | No | Reference audio for bgm style guidance | Ignored | Single path (processed via SVS music pipe) |
| `result` | No | Output destination | Path | Path |
*Either voice or target required for non-SFX lines. Can mix both using cross-use feature. If target is provided without voice, voice cloning path is used automatically.
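The cross-use rule in the footnote can be sketched as a small resolver that decides, per character, whether the design or the clone path applies. `split_prefixed` and `resolve_voice_paths` are hypothetical helpers written for illustration, not VODER internals. The sketch assumes that a character listed under both `voice` and `target` is cloned, which the text above does not specify.

```python
from __future__ import annotations


def split_prefixed(value: str) -> tuple[str, str]:
    """Split 'James: deep male' into ('James', 'deep male') at the first colon."""
    name, sep, rest = value.partition(":")
    if not sep:
        raise ValueError(f"expected 'Name: value', got {value!r}")
    return name.strip(), rest.strip()


def resolve_voice_paths(script, voice=(), target=()) -> dict:
    """Map each speaking character to ('design', prompt) or ('clone', path)."""
    designed = dict(split_prefixed(item) for item in voice)
    cloned = dict(split_prefixed(item) for item in target)
    plan = {}
    for line in script:
        speaker, _text = split_prefixed(line)
        if speaker.lower() == "sfx":
            continue  # SFX lines need neither voice nor target
        if speaker in cloned:
            # Assumption of this sketch: target wins when both are given.
            plan[speaker] = ("clone", cloned[speaker])
        elif speaker in designed:
            plan[speaker] = ("design", designed[speaker])
        else:
            raise ValueError(f"{speaker!r} has neither a voice nor a target")
    return plan


plan = resolve_voice_paths(
    ["James: Welcome!", "sfx: door bell /duration:3", "Sarah: Thanks!"],
    voice=["James: deep male voice"],
    target=["Sarah: /path/to/sarah.wav"],
)
print(plan)
```

Running a check like this before invoking the CLI catches a character with no voice source early, before any model is loaded.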
Voice prompts are natural language descriptions. The model extracts semantic meaning, so order doesn't matter:
"adult male, deep voice, authoritative tone, British accent, measured pace"
"young female, energetic, fast-paced, cheerful, American accent"
"elderly male, gravelly voice, slow and deliberate, storytelling quality"
Effective Elements to Include:
- Age: young adult, middle-aged, elderly
- Gender: male, female, androgynous
- Tone: warm, cold, friendly, authoritative, dramatic
- Pace: fast-paced, measured, slow, deliberate
- Quality: clear, gravelly, breathy, resonant
- Accent: British, American, Southern, neutral
- Context: professional, casual, broadcast, conversational
| Factor | Requirement | Why |
|---|---|---|
| Duration | 10-30 seconds optimal | Enough data for voice extraction; longer doesn't help |
| Quality | Clear, minimal noise | Noise interferes with voice feature extraction |
| Content | Continuous speech | Silence or music doesn't contribute voice data |
| Speaker | Single speaker only | Mixed speakers confuse the extraction |
| Format | WAV preferred, MP3 supported | WAV preserves audio fidelity |
Pro Tip: Run noisy reference audio through SE (Speech Enhancement) before using for voice cloning. VODER can also use BS-RoFormer to extract clean vocals from a mixed recording before cloning.
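A quick pre-flight check of the duration guideline from the table can be written with Python's standard `wave` module. This is a sketch under stated assumptions: it reads PCM WAV files only, the function names are hypothetical, and the 10 and 30 second thresholds are taken from the table above rather than from any VODER validation logic.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_warnings(path: str, min_s: float = 10.0, max_s: float = 30.0) -> list:
    """Return human-readable warnings when a reference falls outside the guideline."""
    duration = wav_duration_seconds(path)
    warnings = []
    if duration < min_s:
        warnings.append(f"{duration:.1f}s is shorter than the {min_s:.0f}s guideline")
    elif duration > max_s:
        warnings.append(f"{duration:.1f}s exceeds {max_s:.0f}s; extra length is unlikely to help")
    return warnings
```

The check covers duration only. Noise, silence, and multiple speakers still need listening or a pass through SE or SVS as described above.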
VODER extracts voice characteristics once per character at the start of dialogue processing. This means:
- All lines from "James" use the same extracted voice profile
- No variation between the 1st and 10th line of the same character
- Professional-quality consistency throughout long dialogues
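The extract-once, reuse-everywhere behavior described above amounts to a per-character cache. The sketch below illustrates the idea with a stand-in extractor; `VoiceProfileCache` and `fake_extract` are hypothetical names, and the real extraction call (Qwen3-TTS Base's `create_voice_clone_prompt()`) is deliberately not invoked because its exact signature is not given here.

```python
from typing import Any, Callable, Dict


class VoiceProfileCache:
    """Extract a voice profile once per character and reuse it for every line."""

    def __init__(self, extract: Callable[[str], Any]) -> None:
        self._extract = extract
        self._profiles: Dict[str, Any] = {}

    def get(self, character: str, reference_path: str) -> Any:
        if character not in self._profiles:
            # First line for this character: run the (expensive) extraction.
            self._profiles[character] = self._extract(reference_path)
        return self._profiles[character]


calls = []


def fake_extract(path: str) -> str:
    """Stand-in for the real embedding extraction; records each call."""
    calls.append(path)
    return f"embedding({path})"


cache = VoiceProfileCache(fake_extract)
first = cache.get("James", "james.wav")
tenth = cache.get("James", "james.wav")
print(first is tenth, len(calls))
```

Because every line of a character receives the same cached object, the first and tenth lines are synthesized from an identical profile.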
STS mode transforms the voice in source audio to sound like a different person, while preserving everything else — the words, emotion, timing, prosody, pauses, and delivery style. Only the speaker identity changes.
STS supports audio and video input/output. When given a video file, the audio track is extracted, processed, and optionally re-attached to the video.
- Input Processing: Audio extracted from video if needed; optionally BS-RoFormer auto-extracts vocals from mixed audio
- Content Extraction: Seed-VC extracts the linguistic and prosodic content from source audio (what was said, how it was said)
- Voice Extraction: The target voice reference is analyzed for speaker characteristics
- Voice Transfer: The content is re-synthesized with the target voice characteristics
- Output Assembly: Converted audio is written (or re-attached to video container)
- Sample Rate Handling: v2 model outputs at 22.05kHz (speech), v1 at 44.1kHz (music)
When the source audio contains music or background noise, BS-RoFormer can automatically extract the vocal track before voice conversion. This produces cleaner results by separating the voice from interference before Seed-VC processes it. This is particularly useful for:
- Converting vocals in songs (use with `music` flag)
- Cleaning up interview recordings before conversion
- Processing audio with significant background noise
| Scenario | Use STS When... | Use TTS When... |
|---|---|---|
| Input | You have audio you want to preserve | You have text you want to speak |
| Delivery | You want to keep original emotion/timing | You want fresh synthesis |
| Content | Content is fixed (what was said) | You can edit the text |
| Source | Performance matters (acting, singing) | Text-only workflow |
# Basic command
python src/voder.py sts base "source_audio.wav" target "voice_reference.wav"
# With output routing
python src/voder.py sts base "source.wav" target "voice.wav" result "/output/converted.wav"
# From video file (audio auto-extracted, output re-attached to video)
python src/voder.py sts base "presentation.mp4" target "voice_actor.wav" result "/output/output.mp4"
# Audio-only output from video
python src/voder.py sts base "presentation.mp4" target "voice_actor.wav" result "/output/output.wav"

# For songs/musical content - uses 44.1kHz model
python src/voder.py sts base "song.wav" target "singer_voice.wav" music
# Convert singing voice in a song
python src/voder.py sts base "original_song.wav" target "new_singer.wav" music result "/output/cover.wav"
# From video (music video voice conversion)
python src/voder.py sts base "music_video.mp4" target "new_singer.wav" music result "/output/cover.mp4"

# Transfer voice timbre AND accent/emotion/style from target
python src/voder.py sts base "source.wav" target "character.wav" mimic
# This is invalid - mimic and music cannot be combined
python src/voder.py sts base "source.wav" target "reference.wav" mimic music
# Error: music and mimic cannot be used together

Mimic Language Quality Note: When using mimic for cross-language voice conversion (e.g., converting Spanish speech to an English speaker's voice), quality may vary. Mimic transfers timbre and style but does not translate content. For language conversion, use SLC mode (2.9) instead.
| Flag | Model | Sample Rate | Use Case |
|---|---|---|---|
| (none) | Seed-VC v2 | 22.05kHz | Speech, podcasts, interviews |
| `music` | Seed-VC v1 | 44.1kHz | Songs, musical content, singing |
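The flag-to-model mapping in the table, together with the rule that `music` and `mimic` are mutually exclusive, can be sketched as a lookup. `select_sts_model` is a hypothetical helper. Which model `mimic` uses on its own is not stated in this section, so the sketch assumes it falls through to the default speech model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StsModel:
    name: str
    sample_rate_hz: int


def select_sts_model(music: bool = False, mimic: bool = False) -> StsModel:
    """Pick the Seed-VC variant implied by the STS flags."""
    if music and mimic:
        raise ValueError("music and mimic cannot be used together")
    if music:
        return StsModel("Seed-VC v1", 44100)
    # Assumption: mimic alone uses the default speech model.
    return StsModel("Seed-VC v2", 22050)


print(select_sts_model())
print(select_sts_model(music=True))
```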
| Input | Output | Behavior |
|---|---|---|
| Audio (WAV, MP3, FLAC) | Audio (WAV) | Standard processing |
| Video (MP4, MKV, AVI) | Audio (WAV) | Audio extracted, processed, output as audio |
| Video (MP4, MKV, AVI) | Video (MP4) | Audio extracted, processed, re-attached to original video |
Tip: The output format (audio vs video) is determined by the `result` file extension. Use `.wav` for audio-only, `.mp4` for video output.
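The extension rule in the tip can be sketched as follows. `output_kind` is a hypothetical helper; only `.wav` and `.mp4` are named in this section, so the sketch rejects anything else rather than guessing how other extensions are handled.

```python
from pathlib import Path


def output_kind(result_path: str) -> str:
    """Return 'audio' or 'video' based on the result file extension."""
    suffix = Path(result_path).suffix.lower()
    if suffix == ".wav":
        return "audio"
    if suffix == ".mp4":
        return "video"
    raise ValueError(f"extension {suffix!r} is not covered by this sketch")


print(output_kind("/output/converted.wav"))
print(output_kind("/output/output.mp4"))
```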
TTM mode generates complete musical compositions from lyrics and style descriptions using the ACE-Step model family. The model creates both the instrumental arrangement AND the vocal performance. You provide lyrics, describe the musical style, specify duration, and receive a fully produced song.
With the vc flag and clone parameter, TTM also supports voice cloning — generating music where the vocalist sounds like a specific real person.
Note: The old `ttm+vc` command is no longer accepted. Use `ttm vc` with `clone "path"` for voice clone source. The `target` parameter is reserved for optional music references (`target voice "path"` / `target music "path"`).
- Lyrics Processing: Lyrics are parsed into vocal melody and rhythm
- Style Interpretation: Style prompt guides instrumentation, genre, mood, tempo
- Music Generation: ACE-Step model creates aligned instrumental and vocal tracks
- Duration Matching: Output is stretched/compressed to hit target duration
- Optional Voice Conversion (with `vc` flag): ACE-Step is offloaded from memory, Seed-VC converts the vocal track to match reference voice, converted vocals are mixed back with instrumental
TTM mode automatically selects the best ACE-Step model based on the task and flags:
| Tier | Model | Quality | Speed | When Used |
|---|---|---|---|---|
| Turbo | ACE-Step XL-Turbo | Highest | Slowest | overdose flag — maximum quality generation |
| Base | ACE-Step XL-Base | High | Medium | complete, extract, lego sub-tasks |
| Legacy | ACE-Step 1.5 | Standard | Fastest | Default (no task specified), background music in dialogue |
TTM supports multiple sub-tasks via the task parameter:
| Sub-Task | CLI Keyword | Description | Model |
|---|---|---|---|
| Complete | `complete` | Add missing tracks to existing audio | XL-Base |
| Lego | `lego` | Build/generate individual instrument tracks | XL-Base |
| Extract | `extract` | Extract individual tracks from audio | XL-Base |
| Remix | `remix` | Style transfer (cover) with bias control; supports reference for additional guidance | XL-Turbo (overdose) or Legacy |
| Repaint | `repaint` | Restyle a specific time range of a song; supports reference for additional guidance | XL-Turbo (overdose) or Legacy |
| BGM | `bgm` | Replace background music in existing audio/video; strips music, generates new bgm, mixes at level | 1.5 Turbo (standard) or XL-Turbo (overdose) |
| Overdose | (flag) | Maximum quality full generation | XL-Turbo |
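The model column of the sub-task table can be read as a small decision function. The sketch below is an interpretation of that table, not VODER's selection code: `select_ttm_model` is a hypothetical name, and it assumes that `overdose` has no effect on the XL-Base sub-tasks because the table lists a single model for them.

```python
from typing import Optional

XL_BASE_TASKS = {"complete", "lego", "extract"}
OVERDOSE_AWARE_TASKS = {"remix", "repaint", "bgm"}


def select_ttm_model(task: Optional[str] = None, overdose: bool = False) -> str:
    """Return the ACE-Step variant the table associates with a sub-task."""
    if task is not None and task not in XL_BASE_TASKS | OVERDOSE_AWARE_TASKS:
        raise ValueError(f"unknown sub-task: {task!r}")
    if task in XL_BASE_TASKS:
        return "ACE-Step XL-Base"  # assumption: overdose does not change this
    if overdose:
        return "ACE-Step XL-Turbo"
    if task == "bgm":
        return "ACE-Step 1.5 Turbo"
    return "ACE-Step 1.5"  # legacy tier: default generation, remix, repaint


print(select_ttm_model())
print(select_ttm_model("remix", overdose=True))
```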
The 12 available tracks are used by lego, extract, and complete sub-tasks:
| Track | Name | Description |
|---|---|---|
| `drums` | Drums | Drum kit, percussion backbone |
| `bass` | Bass | Bass guitar, synth bass, upright bass |
| `guitar` | Guitar | Electric guitar (lead/rhythm) |
| `keyboard` | Keyboard | Piano, organ, synthesizer keys |
| `strings` | Strings | Violin, cello, string ensemble |
| `brass` | Brass | Trumpet, trombone, horn section |
| `woodwinds` | Woodwinds | Flute, clarinet, saxophone |
| `synth` | Synthesizer | Synth leads, pads, arpeggios |
| `percussion` | Percussion | Hand percussion, shakers, congas |
| `fx` | FX / Sound Design | Sound effects, textures, atmospheric elements |
| `vocals` | Vocals | Lead vocal track |
| `backing_vocals` | Backing Vocals | Background vocals, harmonies |
Shortcuts:
- `everything` = all 12 tracks
- `instruments` = first 10 (drums through fx, non-voice)
- `voices` = `vocals` + `backing_vocals`
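The shortcuts expand to fixed subsets of the 12 tracks, which can be sketched as below. `expand_tracks` is a hypothetical helper; whether the CLI accepts shortcuts mixed with individual track names in one list is an assumption of this sketch.

```python
from __future__ import annotations

TRACKS = [
    "drums", "bass", "guitar", "keyboard", "strings", "brass",
    "woodwinds", "synth", "percussion", "fx", "vocals", "backing_vocals",
]
SHORTCUTS = {
    "everything": TRACKS,
    "instruments": TRACKS[:10],  # drums through fx
    "voices": TRACKS[10:],       # vocals + backing_vocals
}


def expand_tracks(spec: str) -> list[str]:
    """Expand a space-separated track list, resolving shortcuts and duplicates."""
    selected: list[str] = []
    for token in spec.split():
        for name in SHORTCUTS.get(token, [token]):
            if name not in TRACKS:
                raise ValueError(f"unknown track: {name!r}")
            if name not in selected:
                selected.append(name)
    return selected


print(expand_tracks("voices drums"))
```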
Using "..." as lyrics generates purely instrumental music with no vocals. This is how background music for dialogue is created internally.
# With lyrics (song with vocals) — uses ACE-Step 1.5 (legacy, fast)
python src/voder.py ttm lyrics "Verse 1:\nLyrics here\n\nChorus:\nChorus lyrics" styling "pop, upbeat, female vocals" duration 60
# Instrumental only (no vocals)
python src/voder.py ttm lyrics "..." styling "cinematic orchestral, dramatic" duration 90
# With output routing
python src/voder.py ttm lyrics "..." styling "ambient electronic, chill" duration 120 result "/output/background.wav"
# Short jingle
python src/voder.py ttm lyrics "..." styling "upbeat corporate, bright" duration 15 result "/output/jingle.wav"

# Overdose: highest quality generation using ACE-Step XL-Turbo
python src/voder.py ttm lyrics "Verse 1:\nLyrics here\n\nChorus:\nChorus lyrics" styling "pop, upbeat, female vocals" duration 60 overdose
# Overdose instrumental
python src/voder.py ttm lyrics "..." styling "cinematic orchestral, dramatic" duration 90 overdose result "/output/high_quality.wav"

# Complete: add missing tracks to existing audio
python src/voder.py ttm complete "base_track.wav" add "drums bass" styling "rock ballad" result "/output/completed.wav"
# Lego: build/generate individual instrument tracks
python src/voder.py ttm lego "..." make "drums bass strings" styling "jazz trio" duration 120 result "/output/stems.wav"
# Extract: extract individual tracks from audio
python src/voder.py ttm extract "existing_song.wav" stems "vocals drums bass" result "/output/extracted/"
# Remix: style transfer (cover) with bias control
python src/voder.py ttm remix "input.wav" styling "jazz" bias 40 result "/output/remix.wav"
# Remix with reference (voice extraction from reference for guidance)
python src/voder.py ttm remix "input.wav" styling "jazz" reference voice "ref.wav" result "/output/remix.wav"
# Remix with reference (music extraction from reference)
python src/voder.py ttm remix "input.wav" styling "jazz" reference music "ref.wav" result "/output/remix.wav"
# Remix with reference (used as-is, no extraction)
python src/voder.py ttm remix "input.wav" styling "jazz" reference "ref.wav" result "/output/remix.wav"
# Overdose remix with reference
python src/voder.py ttm overdose remix "input.wav" styling "jazz" reference voice "ref.wav" result "/output/remix.wav"
# Repaint: restyle a specific time range of a song
python src/voder.py ttm repaint "source.wav" time:20-80 styling "more energetic" result "/output/repainted.wav"
# Repaint with reference (voice extraction from reference for guidance)
python src/voder.py ttm repaint "source.wav" time:20-80 styling "more energetic" reference voice "ref.wav" result "/output/repainted.wav"
# Repaint with reference (used as-is)
python src/voder.py ttm repaint "source.wav" time:20-80 styling "more energetic" reference "ref.wav" result "/output/repainted.wav"
# Overdose repaint with reference
python src/voder.py ttm overdose repaint "source.wav" time:20-80 styling "more energetic" reference music "ref.wav" result "/output/repainted.wav"

# Generate song with cloned vocalist (vc flag + clone)
python src/voder.py ttm vc lyrics "Verse 1:\nMy lyrics here" styling "rock ballad, emotional" duration 60 clone "singer_reference.wav"
# Instrumental backing + cloned voice
python src/voder.py ttm vc lyrics "..." styling "acoustic guitar backing" duration 180 clone "voice.wav" result "/output/backing.wav"
# With output routing
python src/voder.py ttm vc lyrics "Chorus:\nThis is our moment" styling "pop anthem" duration 45 clone "artist.wav" result "/output/song.wav"
# With optional music reference
python src/voder.py ttm vc lyrics "Chorus:\nThis is our moment" styling "pop" duration 30 clone "singer.wav" target music "backing_track.wav"

python src/voder.py ttm overdose vc lyrics "content" styling "prompt" duration 20 clone "path/link" target music "path/link" result "path"

# Replace background music (standard quality, ACE-Step 1.5 Turbo)
python src/voder.py ttm bgm "podcast.wav" music "soft ambient piano" level 30
# Replace background music (overdose quality, ACE-Step XL-Turbo)
python src/voder.py ttm overdose bgm "video.mp4" music "cinematic orchestral" level 50
# Replace background music with reference for style guidance
python src/voder.py ttm bgm "podcast.wav" music "upbeat electronic" level 35 reference "style_ref.wav"
# From YouTube URL with result routing
python src/voder.py ttm bgm "https://youtube.com/watch?v=..." music "ambient chill" level 25 result "/output/new_bgm.wav"

BGM Pipeline: Source → SVS voice pipe (strip existing music) → detect duration → ACE-Step generate new bgm in 250-300s chunks → [optional SVS music pipe on reference] → mix at level → re-mux to video if needed
BGM Output Naming: voder_ttm_bgm_{original-name}_{timestamp}.wav (audio) or .mp4 (video)
BGM Key Rules:
- `bgm` cannot be combined with `vc`, `remix`, `repaint`, `complete`, `lego`, or `extract`
- Source supports audio, video, and URL inputs
- Standard uses ACE-Step 1.5 Turbo; overdose uses ACE-Step XL-Turbo
- Default volume level is 35
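The key rules can be checked before a long-running job starts. The sketch below is an illustration under stated assumptions: `validate_bgm` is a hypothetical helper, and the 0-100 range for `level` is taken from the TTM parameter table rather than from the rules above.

```python
from typing import Iterable, Optional

BGM_EXCLUSIVE = {"vc", "remix", "repaint", "complete", "lego", "extract"}
DEFAULT_BGM_LEVEL = 35


def validate_bgm(options: Iterable[str], level: Optional[int] = None) -> int:
    """Reject conflicting options and return the effective music level."""
    conflicts = sorted(set(options) & BGM_EXCLUSIVE)
    if conflicts:
        raise ValueError("bgm cannot be combined with: " + ", ".join(conflicts))
    effective = DEFAULT_BGM_LEVEL if level is None else level
    if not 0 <= effective <= 100:
        raise ValueError("level must be between 0 and 100")
    return effective


print(validate_bgm(["overdose"]))
print(validate_bgm([], level=50))
```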
Verse 1:
First line of verse
Second line of verse
Chorus:
Chorus lyrics here
More chorus lyrics
Verse 2:
Second verse content
Bridge:
Bridge section lyrics
Outro:
Final lines
| Element | Examples |
|---|---|
| Genre | pop, rock, electronic, jazz, classical, hip-hop, folk |
| Mood | upbeat, melancholic, dramatic, peaceful, energetic |
| Instrumentation | piano and strings, heavy guitars, synthesizer, acoustic guitar |
| Tempo | slow ballad, mid-tempo, fast-paced |
| Vocals | female vocals, male vocals, choir, no vocals |
| Duration | Best For | Quality |
|---|---|---|
| 10-30s | Jingles, transitions, intros | Very consistent |
| 30-60s | Verses, choruses | Consistent |
| 60-120s | Complete short songs | Generally consistent |
| 120-300s | Full compositions | May have variation |
The automatic model offloading between ACE-Step and Seed-VC stages means voice clone mode uses less peak memory than running TTM and STS separately.
| Parameter | Required | Purpose | Default |
|---|---|---|---|
| `lyrics` | Yes** | Song lyrics or `"..."` for instrumental | — |
| `styling` | Yes** | Musical style description | — |
| `duration` | Yes** | Target duration in seconds | — |
| `clone` | No* | Voice clone source path (required when `vc` is set) | — |
| `target` | No | Music reference audio (optional, with type prefix: `target voice "path"` or `target music "path"`) | — |
| `remix` | No | Source audio for remix style transfer | — |
| `repaint` | No | Source audio for section repaint | — |
| `time:start-end` | No† | Time range (for repaint, required) | — |
| `bias` | No | Cover strength 0-100 (for remix/repaint) | 40 |
| `reference` | No | Reference audio for remix/repaint guidance (`reference voice "path"`, `reference music "path"`, or `reference "path"` for as-is) | — |
| `complete` | No | Complete sub-task flag | Off |
| `lego` | No | Lego sub-task flag | Off |
| `extract` | No | Extract sub-task flag | Off |
| `add` | No | Instrument list for complete (e.g., `add "drums bass"`) | All instruments |
| `make` | No | Instrument list for lego (e.g., `make "drums bass"`) | All instruments |
| `stems` | No | Instrument list for extract (e.g., `stems "vocals drums"`) | All stems |
| `only` | No | Extract single track only (no mix) | Off |
| `mix` | No | Mix extracted lego/extract tracks back | Off |
| `blend` | No | Blend mode for lego | — |
| `voice` | No | Use vocals category (for complete/lego) | Off |
| `music` | No | Use instruments category (for complete/lego) | Off |
| `video` | No | Output video (for complete) | Off |
| `bgm` | No | Replace background music in source (audio/video/URL) | — |
| `level` | No | Music volume for bgm sub-task (0-100) | 35 |
| `reference` | No | Reference audio for remix/repaint/bgm guidance | — |
| `vc` | No | Enable voice cloning on vocalist | Off |
| `overdose` | No | Use XL-Turbo for maximum quality | Off |
| `result` | No | Output destination | Auto-generated |
*clone required when vc is set. target is optional music reference only.
**Required for generation tasks (default, overdose).
†Required when using repaint sub-task.
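The `time:start-end` token used by repaint can be parsed with a short regular expression. This is a sketch, not VODER's parser: `parse_time_range` is a hypothetical helper, and it assumes the values are seconds and that fractional values are acceptable, neither of which is stated above.

```python
from __future__ import annotations

import re

_TIME_RANGE = re.compile(r"^time:(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)$")


def parse_time_range(token: str) -> tuple[float, float]:
    """Parse 'time:20-80' into (20.0, 80.0), rejecting empty or reversed ranges."""
    match = _TIME_RANGE.match(token)
    if match is None:
        raise ValueError(f"expected 'time:start-end', got {token!r}")
    start, end = float(match.group(1)), float(match.group(2))
    if end <= start:
        raise ValueError("end must be greater than start")
    return start, end


print(parse_time_range("time:20-80"))
```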
STT mode converts audio, video, images, and URLs into text. It uses Whisper for transcription and can optionally translate to English, identify who spoke when using Pyannote speaker diarization, or use VibeVoice ASR for advanced transcription with native diarization. This is the only mode that produces text output as its primary output (SS also produces text as a secondary output).
- Input Processing: Audio extracted from video; text extracted from images via OCR; URLs downloaded via yt-dlp
- Pre-Cleanup (optional): BS-RoFormer can separate vocals from music/noise before transcription for cleaner results
- Transcription: Whisper transcribes with word-level timestamps (or VibeVoice ASR with overdose)
- Translation (optional): With the translate flag, Whisper large-v3 translates non-English audio to English
- Optional Diarization: Pyannote identifies speaker segments (or VibeVoice with overdose)
- Alignment: Transcription and diarization are aligned using three-tier overlap matching
- Output: Text file saved to results/ directory
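The alignment step can be pictured with a small sketch. The source does not spell out what the three tiers are, so the scheme below (majority overlap, then any overlap, then nearest turn) is an assumption for illustration only, and `assign_speaker`, `overlap` and `Turn` are hypothetical names rather than VODER internals.

```python
from typing import List, Optional, Tuple

Turn = Tuple[float, float, str]  # (start_seconds, end_seconds, speaker_label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(seg_start: float, seg_end: float,
                   turns: List[Turn]) -> Tuple[Optional[str], int]:
    """Return (speaker, tier) for one transcript segment.

    The tiers are an assumption, not VODER's documented behaviour:
      1 = one diarization turn covers at least half of the segment
      2 = some turn overlaps the segment at all
      3 = no overlap, fall back to the nearest turn in time
    """
    if not turns:
        return None, 0
    duration = max(seg_end - seg_start, 1e-9)  # guard against zero-length segments
    best = max(turns, key=lambda t: overlap(seg_start, seg_end, t[0], t[1]))
    best_overlap = overlap(seg_start, seg_end, best[0], best[1])
    if best_overlap / duration >= 0.5:
        return best[2], 1
    if best_overlap > 0:
        return best[2], 2
    # Distance from the segment to each turn; 0 would mean they touch.
    nearest = min(turns, key=lambda t: max(t[0] - seg_end, seg_start - t[1], 0.0))
    return nearest[2], 3
```

Returning the tier alongside the label lets a caller treat tier 3 matches as low-confidence, for example when a word falls into a silence between two speaker turns.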
| Input Type | How It's Processed |
|---|---|
| Audio file (WAV, MP3, FLAC, etc.) | Direct transcription |
| Video file (MP4, MKV, AVI, etc.) | Audio track extracted, then transcribed |
| Image file (PNG, JPG, etc.) | Text extracted via EasyOCR |
| YouTube URL | Audio downloaded via yt-dlp, then transcribed |
| Bilibili URL | Audio downloaded via yt-dlp, then transcribed |
| TikTok URL | Audio downloaded via yt-dlp, then transcribed |
STT supports two advanced flags that cannot be used together (mutually exclusive):
| Flag | Model Used | What It Does |
|---|---|---|
| translate | Whisper large-v3 | Transcribes AND translates non-English audio to English text |
| overdose | VibeVoice ASR | Advanced transcription with native speaker diarization built-in |
translate flag: Uses Whisper large-v3 (not turbo) for maximum translation accuracy. The output is English text regardless of the source language. Useful for subtitling foreign content, translating meetings, or processing multilingual media.
overdose flag: Uses VibeVoice ASR which provides superior transcription quality with native speaker diarization — no separate Pyannote step needed. Ideal for challenging audio (multiple speakers, overlapping speech, noisy environments). Note: overdose implies diarization; the dialogue flag is redundant and ignored when overdose is active.
Mutual Exclusivity: overdose and translate cannot be combined. If both are specified, an error is raised. Choose one based on your need:
- Need English text from foreign audio? → translate
- Need best transcription + speaker labels? → overdose
- Need standard transcription? → Neither flag (uses Whisper large-v3-turbo)
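The decision rules above can be written down as a small lookup. This is a sketch of the documented behaviour, not VODER's actual code; the function name `select_stt_model` and the returned dictionary shape are hypothetical.

```python
from typing import Dict, Optional


def select_stt_model(translate: bool = False, overdose: bool = False,
                     dialogue: bool = False) -> Dict[str, Optional[str]]:
    """Pick the transcription model and diarization source for a flag set."""
    if translate and overdose:
        # The documentation states that this combination raises an error.
        raise ValueError("overdose and translate cannot be combined")
    if overdose:
        # VibeVoice ASR diarizes natively, so the dialogue flag adds nothing.
        return {"model": "VibeVoice ASR", "diarization": "native"}
    model = "Whisper large-v3" if translate else "Whisper large-v3-turbo"
    return {"model": model, "diarization": "Pyannote" if dialogue else None}
```

For example, `select_stt_model(dialogue=True)` selects Whisper large-v3-turbo with Pyannote diarization, which is the combination that needs an HF_TOKEN.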
For audio with significant background music or noise, STT can internally use BS-RoFormer (SVS mode) to extract clean vocals before transcription. This is triggered automatically when the audio is detected to have high noise/music content, or can be manually invoked by running SVS first:
# Manual pre-cleanup: separate vocals, then transcribe
python src/voder.py svs "noisy_recording.wav" stem voice result "/clean/vocals.wav"
python src/voder.py stt "/clean/vocals.wav" timestamp

# Single audio file
python src/voder.py stt "audio.wav"
# Video file (audio auto-extracted)
python src/voder.py stt "video.mp4"
# Image file (OCR text extraction)
python src/voder.py stt "screenshot.png"
# YouTube URL
python src/voder.py stt "https://www.youtube.com/watch?v=VIDEO_ID"
# Bilibili URL
python src/voder.py stt "https://www.bilibili.com/video/BV1xx411c7mD"
# TikTok URL
python src/voder.py stt "https://www.tiktok.com/@user/video/123456789"

python src/voder.py stt "audio.wav" timestamp

python src/voder.py stt "audio.wav" dialogue

# Translate non-English audio to English
python src/voder.py stt "spanish_interview.mp3" translate
# Translate with timestamps
python src/voder.py stt "french_meeting.wav" translate timestamp
# Translate YouTube video
python src/voder.py stt "https://youtube.com/watch?v=VIDEO_ID" translate result "/output/english_transcript.txt"

# Advanced transcription with native diarization
python src/voder.py stt "noisy_meeting.wav" overdose
# Overdose with timestamps
python src/voder.py stt "podcast_episode.wav" overdose timestamp
# Overdose for YouTube
python src/voder.py stt "https://youtube.com/watch?v=VIDEO_ID" overdose result "/output/overdose_transcript.txt"

python src/voder.py stt "audio.wav" timestamp dialogue result "/output/transcript.txt"

# Multiple files
python src/voder.py stt "file1.wav" "file2.mp3" "file3.mp4"
# Batch with timestamps and diarization
python src/voder.py stt "meeting1.wav" "meeting2.wav" timestamp dialogue result "/output/transcripts/"
# Batch with translation
python src/voder.py stt "spanish_ep1.wav" "spanish_ep2.wav" translate result "/output/translations/"

| Flags | Output Format | Example |
|---|---|---|
| (none) | Plain text | Hello everyone welcome to today's meeting |
| timestamp | Timestamped segments | [00:00.000 → 00:03.500] Hello everyone |
| dialogue | Speaker-labeled | Speaker 1: Hello everyone |
| timestamp dialogue | Combined | [00:00.000 → 00:03.500] Speaker 1: Hello everyone |
| translate | English text | Hello everyone welcome to today's meeting (translated) |
| translate timestamp | Translated + timestamps | [00:00.000 → 00:03.500] Hello everyone |
| overdose | Enhanced + speaker-labeled | Speaker 1 (00:00): Hello everyone |
| overdose timestamp | Enhanced + timestamps + speakers | [00:00.000 → 00:03.500] Speaker 1: Hello everyone |
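The timestamped and speaker-labeled formats can be reproduced with a few lines of Python, which is handy when post-processing or re-generating transcripts. This is a minimal sketch that mirrors the `[MM:SS.mmm → MM:SS.mmm]` layout shown in the examples; `format_timestamp` and `render_segment` are hypothetical helpers, and how VODER prints times beyond 59 minutes is not documented, so the sketch simply lets the minutes field grow.

```python
from typing import Optional


def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.mmm (minutes are not wrapped into hours)."""
    total_ms = int(round(seconds * 1000))
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def render_segment(text: str, start: Optional[float] = None,
                   end: Optional[float] = None,
                   speaker: Optional[str] = None) -> str:
    """Build one transcript line; timestamps and speaker label are optional."""
    prefix = ""
    if start is not None and end is not None:
        prefix = f"[{format_timestamp(start)} → {format_timestamp(end)}] "
    label = f"{speaker}: " if speaker else ""
    return f"{prefix}{label}{text}"
```

Calling `render_segment("Hello everyone", 0.0, 3.5, "Speaker 1")` yields the same line as the "timestamp dialogue" row in the table.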
Speaker diarization (dialogue flag) requires:
- HuggingFace account
- Token from https://huggingface.co/settings/tokens
- Accept conditions at https://huggingface.co/pyannote/speaker-diarization-community-1
- Token in HF_TOKEN.txt file or HF_TOKEN environment variable

Note: The overdose flag uses VibeVoice ASR and does NOT require HF_TOKEN for diarization. Use overdose if you don't have a HF_TOKEN but still need speaker identification.
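Before launching a long diarization job, an agent can check whether a token is available at all. The sketch below looks in the two places the requirements name; the precedence (environment variable first) and the assumption that HF_TOKEN.txt sits in the working directory are guesses, and `find_hf_token` is a hypothetical helper rather than part of VODER.

```python
import os
from pathlib import Path
from typing import Optional


def find_hf_token(token_file: Path = Path("HF_TOKEN.txt")) -> Optional[str]:
    """Return a HuggingFace token from HF_TOKEN or HF_TOKEN.txt, else None."""
    env_value = os.environ.get("HF_TOKEN", "").strip()
    if env_value:
        return env_value
    if token_file.is_file():
        file_value = token_file.read_text(encoding="utf-8").strip()
        return file_value or None  # treat an empty file as "no token"
    return None
```

If this returns `None`, the dialogue flag cannot work, and the overdose flag is the documented alternative for speaker labels.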
SE mode improves audio quality by removing noise, reducing reverberation, and restoring speech clarity. It's designed specifically for speech content — not music.
- Audio Analysis: UniSE model separates speech from noise/reverb
- Noise Reduction: Background noise is suppressed
- Dereverberation: Room echo and reverb are reduced
- Restoration: Speech frequencies are enhanced for clarity
- Output: Clean audio at 16kHz sample rate
- Cannot recover severely corrupted audio
- Not designed for music (will degrade musical content)
- Cannot fix very low sample rate recordings
- Cannot restore missing frequencies
# Basic enhancement
python src/voder.py se "noisy_audio.wav"
# From video file (enhanced audio re-attached to video)
python src/voder.py se "recording.mp4"
# Audio-only output from video
python src/voder.py se "recording.mp4" result "/output/clean.wav"
# With output routing
python src/voder.py se "audio.wav" result "/output/clean.wav"
# Enhance before using for voice cloning
python src/voder.py se "noisy_reference.wav" result "/clean/reference.wav"

- Noisy meeting recordings
- Distant microphone recordings
- Room echo removal
- Pre-processing before voice cloning
- Cleaning up field recordings
SFX mode generates custom sound effects from text descriptions using TangoFlux. Any sound you can describe, you can generate — natural sounds, mechanical sounds, ambient environments, impacts, transitions, sci-fi effects.
- Text Encoding: The sound description is encoded into a semantic representation
- Diffusion Process: Audio is generated through iterative denoising
- Duration Control: Output is trimmed/looped to match requested duration
- Quality Scaling: More steps = higher quality but slower generation
# Basic sound effect
python src/voder.py sfx sound "thunder rumbling in the distance" duration 10
# With quality parameters
python src/voder.py sfx sound "rain on a tin roof" duration 15 steps 50 guide 3.5
# With output routing
python src/voder.py sfx sound "footsteps on gravel" duration 8 result "/output/footsteps.wav"
# Short transition sound
python src/voder.py sfx sound "swoosh transition" duration 2 steps 20 result "/sfx/swoosh.wav"
# Ambient environment
python src/voder.py sfx sound "busy coffee shop with clinking cups and muffled conversations" duration 30 result "/sfx/cafe.wav"

| Parameter | Range | Default | Effect |
|---|---|---|---|
| sound | any text | required | Description of the sound |
| duration | 1-30 | required | Length in seconds |
| steps | 1-100 | 30 | Higher = better quality, slower |
| guide | 1.0-10.0 | 4.5 | Higher = stricter adherence to prompt |
| result | path | optional | Output destination |
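When SFX commands are assembled programmatically, it helps to check the parameter ranges before spawning the process. The following sketch validates against the ranges in the table and returns an argument list suitable for `subprocess.run`; `build_sfx_command` is a hypothetical helper, and treating `duration` and `steps` as integers is an assumption.

```python
from typing import List, Optional


def build_sfx_command(sound: str, duration: int, steps: int = 30,
                      guide: float = 4.5,
                      result: Optional[str] = None) -> List[str]:
    """Validate SFX parameters and build the argv for the VODER CLI."""
    if not sound.strip():
        raise ValueError("sound description must not be empty")
    if not 1 <= duration <= 30:
        raise ValueError("duration must be between 1 and 30 seconds")
    if not 1 <= steps <= 100:
        raise ValueError("steps must be between 1 and 100")
    if not 1.0 <= guide <= 10.0:
        raise ValueError("guide must be between 1.0 and 10.0")
    argv = ["python", "src/voder.py", "sfx", "sound", sound,
            "duration", str(duration), "steps", str(steps), "guide", str(guide)]
    if result is not None:
        argv += ["result", result]
    return argv
```

Passing the list form to `subprocess.run(argv, check=True)` avoids shell quoting problems with descriptions that contain quotes or commas.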
| Sound Type | Prompt Strategy |
|---|---|
| Natural | Include environment: "rain on metal roof in a forest" |
| Impacts | Specify intensity and reverb: "heavy punch impact with long reverb tail" |
| Ambient | Layer elements: "forest at night with crickets and distant owl" |
| Transitions | Describe movement: "whoosh from left to right" |
| Mechanical | Include rhythm: "old clock ticking steadily" |
| Sci-fi | Mix familiar and unfamiliar: "futuristic laser with digital distortion" |
STT+TTS mode transcribes audio to text, allows editing of the text, then re-synthesizes with a target voice. This enables content modification while maintaining the general structure of the original.
The text editing step requires user interaction. You must:
- Review the transcription
- Edit the text (fix errors, change words, modify content)
- Approve for synthesis
# Interactive mode only
python src/voder.py cli
# Then select STT+TTS from the menu

SVS mode separates mixed audio into individual stems using BS-RoFormer Resurrection. The most common separation is vocals vs instrumental, but the model can also separate other track components. SVS is also used internally by other modes: STS (auto vocal extraction before voice conversion), STT (pre-cleanup before transcription), and TTS (voice clone cleanup to extract clean vocals from mixed reference audio).
- Input Processing: Audio extracted from video if needed; URLs downloaded via yt-dlp
- Stem Separation: BS-RoFormer Resurrection analyzes the audio spectrogram and separates it into requested stems
- Output: Individual stem files saved (or merged based on request)
| Stem Value | Output | Description |
|---|---|---|
| voice | Vocal track | Isolated vocals, singing, speech |
| music | Instrumental track | Everything except vocals |
| both | Two files (sequential) | Extracts voice stem first, then music stem |
SVS can directly download and process audio from YouTube, Bilibili, and TikTok URLs. The audio is downloaded via yt-dlp and then separated.
# Separate vocals from instrumental (outputs both stems)
python src/voder.py svs "mixed_audio.wav"
# Get only the vocal stem
python src/voder.py svs "mixed_audio.wav" stem voice
# Get only the instrumental stem
python src/voder.py svs "mixed_audio.wav" stem music
# Extract both stems sequentially (voice first, then music)
python src/voder.py svs "mixed_audio.wav" stem both
# From video file (audio auto-extracted)
python src/voder.py svs "music_video.mp4" stem voice result "/output/vocals.wav"
# From YouTube URL
python src/voder.py svs "https://www.youtube.com/watch?v=VIDEO_ID"
# From YouTube with specific stem and output routing
python src/voder.py svs "https://www.youtube.com/watch?v=VIDEO_ID" stem music result "/output/instrumental.wav"
# With output routing
python src/voder.py svs "song.wav" result "/output/separated/"
# Batch processing
python src/voder.py svs "song1.wav" "song2.mp3" "song3.flac" result "/output/stems/"

| Parameter | Required | Purpose | Default |
|---|---|---|---|
| stem | No | Which stem to extract: voice, music | Both stems |
| result | No | Output destination | Auto-generated |
| Mode | How SVS Is Used |
|---|---|
| STS | Before voice conversion, extracts clean vocals from mixed source audio |
| STT | Pre-cleanup: separates vocals for cleaner transcription of music-heavy audio |
| TTS | When target reference contains background noise/music, extracts clean voice |
- Creating karaoke tracks (extract instrumental from songs)
- Isolating vocals for voice cloning reference
- Pre-cleaning audio before STT transcription
- Extracting acapella for remixing
- Podcast noise removal (separate speech from background)
SLC mode translates spoken content from one language to another and re-synthesizes it with speech. It combines Whisper's translation capability with Qwen3-TTS's voice synthesis to produce dubbed audio — the content is translated but the voice character is preserved (or replaced).
- Source Transcription + Translation: Whisper large-v3 transcribes the source audio and translates to English
- Voice Extraction (if no target): The source audio's voice characteristics are analyzed
- Re-Synthesis: Qwen3-TTS Base synthesizes the English text with the extracted or target voice
- Output: Translated audio file
| Mode | Parameter | Result |
|---|---|---|
| Same-Voice Translation | No target | Translated in the original speaker's voice |
| Different-Voice Translation | With target | Translated in a different person's voice |
Same-Voice Translation (No Target): When no target is provided, SLC extracts the voice characteristics from the source audio and uses them for synthesis. The result sounds like the original speaker speaking English — ideal for dubbing where you want to preserve speaker identity.
Different-Voice Translation (With Target): When a target audio file is provided, the translation is synthesized in the target voice. Useful for creating localized content with a specific voice actor.
To preserve specific words or phrases in the original language (e.g., names, technical terms, cultural expressions), wrap them in {original} markers within the translation:
# The translator preserves text inside { } braces
# Example: Source is Japanese, and you want to keep names in Japanese
"{Tanaka-san} visited the {shrine} yesterday."
This is handled at the translation step — Whisper detects these markers and passes them through without translation.
# Same-voice translation (preserves original speaker's voice)
python src/voder.py slc "foreign_speech.wav"
# Different-voice translation
python src/voder.py slc "foreign_speech.wav" target "english_voice.wav"
# From video file (audio auto-extracted)
python src/voder.py slc "foreign_movie.mp4" target "dub_actor.wav" result "/output/dubbed.mp4"
# From YouTube URL
python src/voder.py slc "https://www.youtube.com/watch?v=VIDEO_ID"
# With output routing
python src/voder.py slc "spanish_interview.mp3" target "narrator.wav" result "/output/english_version.wav"
# Same-voice with language preservation
python src/voder.py slc "japanese_speech.wav" result "/output/english_dub.wav"

| Parameter | Required | Purpose | Default |
|---|---|---|---|
| target | No | Voice reference for dubbing voice | (uses source speaker's voice) |
| result | No | Output destination | Auto-generated |
- Source language is auto-detected; output is always English
- Same-voice quality depends on how distinct the source voice features are
- Very short audio segments (< 3 seconds) may produce lower quality voice matching
- Heavy background noise reduces voice extraction accuracy
- Dubbing foreign language video content
- Translating interviews while preserving speaker identity
- Creating English versions of non-English podcasts
- Localizing training materials with specific voice actors
SS mode takes multi-speaker audio and separates it into individual audio files — one per identified speaker, along with a full transcript. It uses VibeVoice ASR which provides native speaker diarization — identifying who spoke when and extracting each speaker's segments into separate files.
- Input Processing: Audio extracted from video if needed; URLs downloaded via yt-dlp
- Speaker Identification: VibeVoice ASR analyzes the audio and identifies distinct speakers
- Segment Extraction: Each speaker's segments are extracted and concatenated into individual files
- Transcript Generation: A full transcript with speaker labels and timestamps is generated
- Output: Individual speaker audio files + combined transcript text file
| Requirement | Details |
|---|---|
| Model | VibeVoice ASR (bundled with VODER) |
| HF_TOKEN | Not required (VibeVoice handles diarization natively) |
| Audio Quality | Clearer audio produces better separation |
| Minimum Speakers | 2 (single-speaker audio is returned as-is) |
| Maximum Speakers | No hard limit, but accuracy decreases beyond ~8 speakers |
If VibeVoice ASR fails or is unavailable, SS falls back to a two-step process:
- Pyannote performs speaker diarization (identifies who spoke when)
- Audio segmentation extracts each speaker's segments based on diarization timestamps
The fallback requires HF_TOKEN for Pyannote (see STT section for setup instructions). The fallback produces slightly lower quality segmentation because Pyannote only provides timestamps (not VibeVoice's enhanced speaker embeddings).
# Basic speaker separation
python src/voder.py ss "multi_speaker_audio.wav"
# From video file (audio auto-extracted)
python src/voder.py ss "panel_discussion.mp4" result "/output/speakers/"
# From YouTube URL
python src/voder.py ss "https://www.youtube.com/watch?v=VIDEO_ID"
# With output routing
python src/voder.py ss "podcast_episode.wav" result "/output/separated/"
# With timestamp flag (adds timestamps to transcript)
python src/voder.py ss "interview.wav" timestamp result "/output/interview/"
# Batch processing
python src/voder.py ss "ep1.wav" "ep2.wav" "ep3.wav" result "/output/all_episodes/"
# With overdose (VibeVoice ASR for higher quality)
python src/voder.py ss "meeting.wav" overdose
python src/voder.py ss "podcast.mp4" overdose result "/output/speakers/"

When result is /output/separated/, SS creates:
/output/separated/
├── transcript.txt # Full transcript with speaker labels
├── speaker_0.wav # Speaker 0's audio segments concatenated
├── speaker_1.wav # Speaker 1's audio segments concatenated
├── speaker_2.wav # Speaker 2's audio segments concatenated
└── ...
The transcript format:
[00:00.000 → 00:05.200] Speaker 0: Welcome everyone to today's discussion.
[00:05.500 → 00:08.300] Speaker 1: Thank you for having me.
[00:09.000 → 00:15.100] Speaker 0: Let's start with the first topic.
[00:15.500 → 00:22.800] Speaker 2: I have some thoughts on that.
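Because the transcript has a regular line format, it is easy to post-process, for instance to total each speaker's talk time. The sketch below parses the timestamped form shown above; `parse_line` and `speaking_time` are hypothetical helpers, and the regular expression assumes exactly the `[MM:SS.mmm → MM:SS.mmm] Speaker N: text` layout from the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

_LINE_RE = re.compile(
    r"\[(\d+):(\d{2})\.(\d{3}) → (\d+):(\d{2})\.(\d{3})\] ([^:]+): (.*)"
)


def _to_seconds(minutes: str, seconds: str, millis: str) -> float:
    return (int(minutes) * 60_000 + int(seconds) * 1000 + int(millis)) / 1000


def parse_line(line: str) -> Optional[Tuple[float, float, str, str]]:
    """Return (start, end, speaker, text), or None if the line does not match."""
    match = _LINE_RE.fullmatch(line.strip())
    if match is None:
        return None
    start = _to_seconds(*match.group(1, 2, 3))
    end = _to_seconds(*match.group(4, 5, 6))
    return start, end, match.group(7), match.group(8)


def speaking_time(lines: Iterable[str]) -> Dict[str, float]:
    """Sum segment durations per speaker, skipping lines that do not parse."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            start, end, speaker, _text = parsed
            totals[speaker] += end - start
    return dict(totals)
```

The per-speaker totals give a quick sanity check on the separated files: a speaker_N.wav that is much shorter or longer than its total here suggests a diarization problem.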
| Parameter | Required | Purpose | Default |
|---|---|---|---|
| timestamp | No | Include timestamps in transcript | Off (speaker labels only) |
| result | No | Output directory | Auto-generated |
- Separating podcast guests for individual processing
- Extracting individual speaker audio for voice cloning references
- Pre-processing interviews before transcription
- Creating speaker-specific training data
- Analyzing multi-speaker recordings
Script directives are special commands embedded inside dialogue lines that control how that specific line is processed. They allow fine-grained control over timing, volume, and duration at the per-line level.
Without directives, all dialogue lines are:
- Concatenated sequentially (no gaps)
- At uniform volume (100%)
- With duration determined by text length
Directives break these constraints, enabling:
- Overlapping audio (multiple lines at same time position)
- Volume variation (background lines at lower volume)
- SFX duration control (sound effects have fixed duration)
- Audio layering (SFX playing under speech)
| Directive | Format | Purpose | Applies To |
|---|---|---|---|
| /time:nn | /time:5 | Position line at 5 seconds from start | All lines |
| /time:nn-nn | /time:10-3 | Position at 10s, cut 3s from end | All lines |
| /time:nn+nn | /time:5+2 | Position at 5s, cut 2s from start | All lines |
| /time:nn-nn+nn | /time:10-3+2 | Position at 10s, cut 3s from end AND cut 2s from start | All lines |
| /level:0-100 | /level:75 | Volume percentage for this line | All lines |
| /duration:1-30 | /duration:10 | Duration in seconds | SFX lines (required) |
Without /time: With /time:
┌────────────────────┐ ┌────────────────────┐
│ Line 1 (plays now) │ │ Line 1 /time:0 │
│ Line 2 (after 1) │ │ Line 2 /time:0 │ ← overlaps with Line 1
│ Line 3 (after 2) │ │ Line 3 /time:5 │ ← starts at 5 seconds
└────────────────────┘ └────────────────────┘
Sequential Controlled positioning
The /time: directive uses a flexible syntax that combines three operations in any order:
/time:<position>[-<cut_from_end>][+<cut_from_start>]
- Position (plain number): When the line should start (in seconds from the beginning of the output)
- -nn (minus prefix): Cut this many seconds from the END of the generated audio
- +nn (plus prefix): Cut this many seconds from the START (beginning) of the generated audio
The cutting terminology can be confusing. Here's how to think about it:
- -nn (cut from end): Removes audio from the tail. Think of it as "trim off the last N seconds"
- +nn (cut from start): Removes audio from the head. Think of it as "skip the first N seconds"
Original generated audio (10 seconds total):
┌────────────────────────────────────┐
│ 0s 5s 10s │
│ [=========AUDIO CONTENT=========] │
└────────────────────────────────────┘
/time:5-3 (start at 5s, cut 3s from end):
┌──────────────┐
│ 5s 12s │ (plays 0s-7s of original, positioned at 5s in output)
│ [=========] │ (last 3 seconds removed)
└──────────────┘
/time:5+2 (start at 5s, cut 2s from start):
┌──────────────────────┐
│ 5s 13s │
│ [=============] │ (first 2 seconds skipped, plays 2s-10s of original)
└──────────────────────┘
/time:5-3+2 (start at 5s, cut 3s from end AND 2s from start):
┌────────────┐
│ 5s 10s │
│ [====] │ (first 2s and last 3s removed, plays 2s-7s of original)
└────────────┘
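The arithmetic behind these diagrams can be checked with a short sketch. It parses a `/time:` directive using the syntax given above and computes which part of the generated clip is kept and where it lands in the output; `TimeDirective`, `parse_time` and `placement` are hypothetical names, and accepting decimal values follows the `/time:0-1+0.5` style used elsewhere in this document.

```python
import re
from dataclasses import dataclass
from typing import Tuple

_TIME_RE = re.compile(r"/time:(\d+(?:\.\d+)?)((?:[+-]\d+(?:\.\d+)?)*)")


@dataclass
class TimeDirective:
    position: float         # where the line starts in the output (seconds)
    cut_end: float = 0.0    # seconds trimmed from the tail (-nn)
    cut_start: float = 0.0  # seconds skipped at the head (+nn)


def parse_time(directive: str) -> TimeDirective:
    """Parse '/time:<position>[-<cut_from_end>][+<cut_from_start>]'."""
    match = _TIME_RE.fullmatch(directive)
    if match is None:
        raise ValueError(f"not a valid /time directive: {directive!r}")
    parsed = TimeDirective(position=float(match.group(1)))
    # The -nn and +nn parts may appear in either order.
    for sign, value in re.findall(r"([+-])(\d+(?:\.\d+)?)", match.group(2)):
        if sign == "-":
            parsed.cut_end = float(value)
        else:
            parsed.cut_start = float(value)
    return parsed


def placement(d: TimeDirective, clip_len: float) -> Tuple[float, float, float, float]:
    """Return (kept_from, kept_to, out_start, out_end) in seconds."""
    kept_from = d.cut_start
    kept_to = clip_len - d.cut_end
    if kept_to <= kept_from:
        raise ValueError("cuts remove the entire clip")
    return kept_from, kept_to, d.position, d.position + (kept_to - kept_from)
```

For a 10-second clip, `placement(parse_time("/time:5-3+2"), 10)` keeps seconds 2 to 7 of the original and places them at 5s to 10s in the output, matching the last diagram.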
Scenario 1: Remove intro/outro padding
- Generated audio often has a slight intro breath or outro silence
- /time:0-1+0.5 removes the half-second intro breath and 1-second outro tail
Scenario 2: Tight dialogue timing
- Two speakers' lines should slightly overlap for natural conversation flow
- Line 1: "A: Hello there!" /time:0-0.5 (trim tail to make room)
- Line 2: "B: Hi!" /time:1.5 (starts before Line 1 fully ends, creating overlap)
Scenario 3: SFX that's too long
- Generated SFX might be 10 seconds but you only need the middle section
- "sfx: engine revving /duration:10 /time:0-2+1" keeps seconds 1-8 (removes 1s intro, 2s outro)
# Podcast intro: music fades in under host speech
python src/voder.py tts script \
"sfx: upbeat podcast intro theme /duration:15 /level:40 /time:0-2" \
"Host: Welcome back to the show! /time:2" \
voice "Host: warm male voice"
# The SFX has its last 2 seconds trimmed so the transition feels cleaner
# Dialogue overlap for natural conversation
python src/voder.py tts script \
"Alice: I was thinking about what you said... /time:0-0.8" \
"Bob: And? /time:3.5" \
"Alice: I think you're right. /time:4.5" \
voice "Alice: female, thoughtful" "Bob: male, curious"
# Alice's first line is trimmed at the end, Bob's response starts before she fully finishes
# SFX with precise timing - remove intro breath and outro decay
python src/voder.py tts script \
"sfx: thunder rumble /duration:8 /level:60 /time:5-2+1" \
"Narrator: The storm was approaching. /time:0" \
voice "Narrator: deep voice"
# Thunder starts at 5s mark, but we remove 1s intro and 2s outro, keeping the "meat" of the sound

python src/voder.py tts script \
"Host: Welcome to the show! /time:0" \
"sfx: intro music /duration:10 /level:40 /time:0" \
"Host: Today we have a special guest. /time:10" \
voice "Host: male broadcaster"

python src/voder.py tts script \
"Narrator: The scene opens on a quiet street. /level:100" \
"sfx: distant traffic /duration:20 /level:20" \
"Narrator: A car approaches slowly. /level:100" \
"sfx: car engine /duration:5 /level:40" \
voice "Narrator: deep male voice"

python src/voder.py tts script \
"sfx: rain and thunder /duration:30 /level:30 /time:0" \
"Character: What a terrible night... /time:5 /level:90" \
"sfx: door creaking /duration:3 /level:50 /time:10" \
"Character: Who's there? /time:13 /level:100" \
voice "Character: nervous male voice" \
music "tense atmospheric horror" level "25"

SFX lines are a special type of dialogue line where the "character" is sfx: (case-insensitive). Instead of speech synthesis, VODER generates a sound effect matching the description.
Before SFX lines, you had to:
- Generate dialogue audio
- Generate SFX audio separately
- Use audio editing software to mix them
- Manually align timing and adjust volumes
With SFX lines, everything happens in one command — VODER generates speech and SFX, positions them correctly, adjusts volumes, and produces the final mixed output.
"sfx: sound description /duration:nn /level:nn"
Required:
- Character must be sfx: (case-insensitive)
- /duration:nn must be present (1-30 seconds)

Optional:
- /level:nn for volume (0-100, default 100)
- /time:nn for positioning
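These rules are easy to check before a script is sent to VODER. The sketch below splits one script line into character, text and directives, then applies the SFX requirements listed above; `parse_script_line` is a hypothetical helper and says nothing about how VODER itself tokenizes lines.

```python
import re
from typing import Any, Dict

_DIRECTIVE_RE = re.compile(r"\s*/(time|level|duration):(\S+)")


def parse_script_line(line: str) -> Dict[str, Any]:
    """Split 'Character: text /directive:value ...' and validate SFX rules."""
    character, _sep, rest = line.partition(":")
    directives = dict(_DIRECTIVE_RE.findall(rest))
    text = _DIRECTIVE_RE.sub("", rest).strip()
    is_sfx = character.strip().lower() == "sfx"  # case-insensitive match
    if is_sfx:
        if "duration" not in directives:
            raise ValueError("SFX lines require /duration:nn")
        if not 1 <= float(directives["duration"]) <= 30:
            raise ValueError("/duration must be between 1 and 30 seconds")
    level = float(directives.get("level", 100))  # documented default is 100
    if not 0 <= level <= 100:
        raise ValueError("/level must be between 0 and 100")
    return {"character": character.strip(), "text": text, "is_sfx": is_sfx,
            "directives": directives, "level": level}
```

Running each line of a script through this check catches a missing /duration before any audio is generated, which matters because generation is slow.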
python src/voder.py tts script \
"James: Hello, who's at the door?" \
"sfx: door bell ringing /duration:3" \
"Sarah: That must be the pizza!" \
voice "James: male" "Sarah: female"

python src/voder.py tts script \
"Narrator: The forest was alive with sounds." \
"sfx: birds chirping and rustling leaves /duration:15 /level:30" \
"Narrator: But something else was watching." \
voice "Narrator: deep male storytelling voice"

python src/voder.py tts script \
"sfx: ambient cafe noise /duration:30 /level:25 /time:0" \
"Barista: What can I get you today? /time:5" \
"Customer: I'll have a large coffee, please. /time:8" \
"sfx: coffee machine grinding /duration:5 /level:40 /time:12" \
"Barista: Coming right up! /time:18" \
voice "Barista: cheerful female" "Customer: casual male"

Cross-use allows mixing generated voices (via voice parameter) and cloned voices (via target parameter) in the same dialogue. This works in TTS mode (which now includes voice cloning via target).
Before the TTS merge, cross-use required switching between TTS and TTS+VC modes. Now everything is in one mode:
- Some characters with designed voices (voice), others with cloned voices (target)
- Perfect for scenarios where you have reference audio for some speakers but not others
- Mix known voices with new character voices
- Each character must use EITHER voice OR target, not both
- Character names must match between script and parameter
- Case-insensitive matching (James = james = JAMES)
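The cross-use rules can be verified up front with a small resolver. This sketch maps every speaking character to exactly one voice source and reports violations; `resolve_voices` is a hypothetical helper, and skipping `sfx` lines is an assumption based on SFX lines not needing a voice.

```python
from typing import Dict, Iterable, Tuple


def _to_map(entries: Iterable[str]) -> Dict[str, str]:
    """Turn 'Name: value' strings into a dict keyed by lowercase name."""
    mapping = {}
    for entry in entries:
        name, _sep, value = entry.partition(":")
        mapping[name.strip().lower()] = value.strip()
    return mapping


def resolve_voices(script_lines: Iterable[str], voices: Iterable[str],
                   targets: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Return {character: ('voice' | 'target', value)} or raise ValueError."""
    voice_map, target_map = _to_map(voices), _to_map(targets)
    clash = set(voice_map) & set(target_map)
    if clash:
        raise ValueError(f"characters with both voice and target: {sorted(clash)}")
    plan: Dict[str, Tuple[str, str]] = {}
    for line in script_lines:
        name = line.partition(":")[0].strip().lower()
        if name == "sfx":
            continue  # sound-effect lines need no voice
        if name in voice_map:
            plan[name] = ("voice", voice_map[name])
        elif name in target_map:
            plan[name] = ("target", target_map[name])
        else:
            raise ValueError(f"no voice or target given for {name!r}")
    return plan
```

Only the first colon in each entry is treated as the separator, so reference paths that contain colons stay intact.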
python src/voder.py tts script \
"James: Welcome to our podcast!" \
"Sarah: Thanks for having me!" \
voice "James: deep male voice, authoritative" \
target "Sarah: /path/to/sarah_voice_reference.wav"

# This now works in tts mode too; cross-use is the default behavior
python src/voder.py tts script \
"James: Let me share my screen." \
"Sarah: Go ahead, I'm ready." \
target "James: /path/to/james_voice.wav" \
voice "Sarah: bright female voice, enthusiastic"

python src/voder.py tts script \
"Host: Welcome to the debate!" \
"Guest1: Thank you for having me." \
"Guest2: Pleasure to be here." \
voice "Host: professional broadcaster, neutral accent" \
target "Guest1: /path/to/guest1.wav" "Guest2: /path/to/guest2.wav"

When using music parameter in dialogue mode, VODER automatically:
- Generates all dialogue segments
- Measures total dialogue duration
- Creates music matching that exact duration
- Mixes music at specified volume level
- Outputs final file with _m suffix
Dialogue Lines → Speech Synthesis → Concatenation → Duration Measurement
↓
Music Description → ACE-Step 1.5 (lyrics: "...") → Duration-Matched Music
↓
Mix (Dialogue + Music at Level %)
↓
Final Output (_m suffix)
The music parameter internally uses lyrics "..." for ACE-Step, which tells the model to generate instrumental-only music with no vocals. This is specifically designed for background/ambient use. The legacy ACE-Step 1.5 model is used for background music generation because it is faster and sufficient for ambient/background quality.
| Format | Meaning | Use Case |
|---|---|---|
| "35" | Constant 35% volume | Simple ambient background |
| "50" | Constant 50% volume | More prominent music |
| "0:30-60:50" | 30% at 0s, 50% at 60s | Fade in over time |
| "0:50-30:20+10" | Fade from 50% to 20% over 10s starting at 0s | Intro fade out |
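To make the level formats concrete, the sketch below evaluates the music volume at a given moment for the constant form and the two-point `time:level-time:level` form. Linear interpolation between the points is an assumption, the `+nn` form in the last row is deliberately not handled because its exact semantics are not clear from the table, and `parse_level` and `level_at` are hypothetical names.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time_seconds, volume_percent)


def parse_level(spec: str) -> List[Point]:
    """Parse '35' or '0:30-60:50' into a sorted list of (time, level) points."""
    if "+" in spec:
        raise ValueError("the '+nn' form is not covered by this sketch")
    if ":" not in spec:
        return [(0.0, float(spec))]
    points = []
    for part in spec.split("-"):
        time_text, _sep, level_text = part.partition(":")
        points.append((float(time_text), float(level_text)))
    return sorted(points)


def level_at(points: List[Point], t: float) -> float:
    """Volume at time t, assuming linear ramps between points."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return points[-1][1]  # hold the last level after the final point
```

With `"0:30-60:50"`, the halfway point at 30 seconds evaluates to 40% under the linear assumption.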
python src/voder.py tts script \
"Host: Welcome to our show!" \
"Guest: Great to be here!" \
voice "Host: male" "Guest: female" \
music "soft jazz background"

python src/voder.py tts script \
"A: Let's discuss the topic." \
"B: I have some thoughts." \
voice "A: male" "B: female" \
music "ambient electronic, chill" \
level "25"

python src/voder.py tts script \
"Intro: Welcome to the podcast!" \
"Host: Today we'll explore..." \
voice "Intro: energetic" "Host: professional" \
music "upbeat intro music" \
level "0:50-30:20"

Not all features work together. This section maps out exactly what combinations are possible and in what order parameters should appear.
| Feature | TTS | STS | TTM | STT | SE | SFX | STT+TTS | SVS | SLC | SS |
|---|---|---|---|---|---|---|---|---|---|---|
| Single mode | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| Dialogue mode | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| voice param | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| target param | ✅ | ✅ | ✅† | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
| Cross-use | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| music param | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| level param | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| SFX lines | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Script directives | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| timestamp flag | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| dialogue flag | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| translate flag | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| overdose flag | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| clone param | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| mimic flag | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| vc flag | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| music flag (STS) | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| task param (TTM) | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| stems param | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| stem param (SVS) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| steps param | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| guide param | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| result param | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
*clone for TTM requires vc flag.
†target in TTM is optional music reference only (use target voice or target music prefix).
| Rule | Modes Affected | Details |
|---|---|---|
| overdose XOR translate | STT | Cannot use both; choose one based on need |
| overdose XOR dialogue | STT | overdose includes native diarization; dialogue is redundant |
| mimic XOR music | STS | Cannot transfer style and switch to music model simultaneously |
| remix XOR vc | TTM | Remix and voice cloning are mutually exclusive |
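An agent that composes commands can encode the rules above as data and check a planned flag set before running anything. The sketch below is table-driven; `EXCLUSIVE_PAIRS` and `find_conflicts` are hypothetical names, and whether VODER rejects or merely ignores each pair (the STT section says dialogue is ignored under overdose) is left to the caller.

```python
from typing import Dict, Iterable, List, Tuple

# Mutually exclusive flag pairs per mode, copied from the rules table.
EXCLUSIVE_PAIRS: Dict[str, List[Tuple[str, str]]] = {
    "stt": [("overdose", "translate"), ("overdose", "dialogue")],
    "sts": [("mimic", "music")],
    "ttm": [("remix", "vc")],
}


def find_conflicts(mode: str, flags: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every exclusive pair for which both flags are present."""
    present = {flag.lower() for flag in flags}
    return [pair for pair in EXCLUSIVE_PAIRS.get(mode.lower(), [])
            if pair[0] in present and pair[1] in present]
```

An empty result means the flag set passes these four rules; it does not prove the command is otherwise valid.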
python src/voder.py tts script "text" [script "text2" ...] [voice "prompt" [voice "prompt2" ...]] [target "path" [target "Char: path2" ...]] [music "description"] [level "spec"] [result "path"]
python src/voder.py sts base "source.wav" [source2.wav ...] target "voice.wav" [music] [mimic] [result "path"]
python src/voder.py ttm [lyrics "lyrics text"] styling "style prompt" duration N [vc] [clone "path"] [target music "path"] [overdose] [result "path"]
python src/voder.py ttm complete "source.wav" [add "instruments"] styling "style" [result "path"]
python src/voder.py ttm lego "..." [make "instruments"] styling "style" duration N [result "path"]
python src/voder.py ttm extract "source.wav" [stems "instruments"] [result "path"]
python src/voder.py ttm remix "source.wav" styling "style" [bias N] [result "path"]
python src/voder.py ttm repaint "source.wav" time:start-end styling "style" [bias N] [result "path"]
python src/voder.py stt "file1" ["file2" ...] [timestamp] [dialogue] [translate] [overdose] [result "path"]
python src/voder.py se "input.wav" [result "path"]
python src/voder.py sfx sound "description" duration N [steps N] [guide N.N] [result "path"]
python src/voder.py svs "input.wav" [stem voice|music|both] [result "path"]
python src/voder.py slc "input.wav" [target "voice.wav"] [result "path"]
python src/voder.py ss "input.wav" ["input2.wav" ...] [timestamp] [overdose] [result "path"]
Mode: TTS Features: Dialogue mode + SFX lines + music param + level param
python src/voder.py tts script \
"sfx: intro jingle /duration:5 /level:50 /time:0" \
"Host: Welcome to our show!" \
"sfx: applause /duration:3 /level:40 /time:3" \
"Guest: Thanks for having me!" \
voice "Host: male broadcaster" "Guest: female, enthusiastic" \
music "upbeat podcast intro music" \
level "0:50-30:30"

Mode: TTS Features: Dialogue mode + voice + target (cross-use) + music
python src/voder.py tts script \
"James: Let's start the interview." \
"Sarah: I'm ready when you are." \
target "James: /path/to/james_voice.wav" \
voice "Sarah: bright female voice" \
music "soft ambient electronic"

Mode: STT Features: timestamp + dialogue + result
python src/voder.py stt "podcast_episode.wav" timestamp dialogue result "/output/transcripts/episode1.txt"

Mode: STT Features: translate + timestamp + result
python src/voder.py stt "spanish_interview.mp3" translate timestamp result "/output/english_translation.txt"

Mode: STT Features: overdose + timestamp + result
python src/voder.py stt "noisy_panel.wav" overdose timestamp result "/output/overdose_transcript.txt"

Mode: STT Features: Multiple files + timestamp + dialogue + result
python src/voder.py stt "ep1.wav" "ep2.wav" "ep3.wav" timestamp dialogue result "/output/transcripts/"

Mode: STT Features: URL input + timestamp + dialogue
python src/voder.py stt "https://youtube.com/watch?v=VIDEO_ID" timestamp dialogue result "/output/video_transcript.txt"

Mode: STS Features: music flag + result (video output)
python src/voder.py sts base "original_song.mp4" target "new_singer_voice.wav" music result "/output/cover.mp4"

Mode: TTM Features: lyrics + styling + duration + vc + clone
python src/voder.py ttm vc lyrics "Verse 1:\nMy custom lyrics\n\nChorus:\nChorus text" styling "pop ballad, emotional" duration 90 clone "artist_voice.wav" result "/output/custom_song.wav"

Mode: TTM Features: lyrics + styling + duration + overdose
# TTM overdose (highest quality music generation)
python src/voder.py ttm overdose lyrics "Verse:\nAmazing lyrics" styling "epic orchestral, cinematic" duration 120 result "/output/high_quality.wav"
# TTM overdose with voice cloning
python src/voder.py ttm overdose vc lyrics "Chorus:\nWe are one" styling "stadium rock" duration 30 clone "singer.wav"

Mode: TTM Features: lego + make + styling + duration
python src/voder.py ttm lego "..." make "keyboard bass drums saxophone" styling "jazz combo" duration 180 result "/output/jazz_stems.wav"

Mode: SVS then STT (two commands) Features: Vocal separation + transcription
python src/voder.py svs "noisy_recording.wav" stem voice result "/clean/vocals.wav"
python src/voder.py stt "/clean/vocals.wav" timestamp result "/output/clean_transcript.txt"

Mode: SE then TTS (two commands) Features: Enhancement + voice cloning
python src/voder.py se "noisy_reference.wav" result "/clean/reference.wav"
python src/voder.py tts script "Hello, this is a voice clone test." target "/clean/reference.wav" result "/output/cloned_speech.wav"

Mode: SVS then STS (two commands) Features: Vocal separation + voice conversion
python src/voder.py svs "mixed_song.wav" stem voice result "/clean/vocals.wav"
python src/voder.py sts base "/clean/vocals.wav" target "new_singer.wav" result "/output/converted.wav"

Mode: SS then TTS (two commands) Features: Speaker separation + voice cloning per speaker
python src/voder.py ss "interview.wav" result "/output/speakers/"
# Then clone each speaker's voice:
python src/voder.py tts script "Speaker 0's lines here..." target "/output/speakers/speaker_0.wav" voice "text: professional narrator" result "/output/narration.wav"

Mode: SLC Features: Foreign audio + target voice
python src/voder.py slc "foreign_speech.wav" target "english_actor.wav" result "/output/dubbed.wav"

Mode: STT then TTS (two commands) Features: Image OCR + text-to-speech
python src/voder.py stt "script_screenshot.png" result "/output/extracted_text.txt"
# Parse the text file, then:
python src/voder.py tts script "[extracted text content]" voice "professional narrator" result "/output/audio.wav"

Mode: TTS Features: Dialogue + SFX + directives + music + level + result
python src/voder.py tts script \
"sfx: podcast intro with music /duration:10 /level:60 /time:0" \
"Host: Welcome to Tech Talk, episode forty-two! /time:0 /level:100" \
"sfx: transition swoosh /duration:2 /level:40 /time:10" \
"Host: Today we're diving deep into AI. /time:12" \
"Guest: Excited to share my research! /time:18" \
"sfx: typing on keyboard /duration:5 /level:25 /time:25" \
"Host: Let's start with the basics. /time:30" \
voice "Host: adult male, warm conversational, podcast style" "Guest: adult female, academic, clear pronunciation" \
music "soft lo-fi beats, chill, minimal" \
level "0:30-60:25-180:15" \
result "/output/episode42.wav"

| Mode | RAM | VRAM (if GPU) | Notes |
|---|---|---|---|
| TTS — voice design (single/dialogue) | 12GB | 4GB | Qwen VoiceDesign model |
| TTS — voice clone (single/dialogue) | 12GB | 4GB | Qwen Base model |
| TTS + music | 23GB | 15-16GB | Adds ACE-Step 1.5 |
| STS | 13GB | 14GB | Seed-VC (+ BS-RoFormer if auto-extract) |
| STS + video I/O | 13GB | 14GB | Same as STS, FFmpeg for muxing |
| TTM (legacy/1.5) | 23GB | 15-16GB | ACE-Step 1.5 |
| TTM (XL-Base sub-tasks) | 26GB | 18-20GB | ACE-Step XL-Base |
| TTM (XL-Turbo overdose) | 30GB | 22-24GB | ACE-Step XL-Turbo |
| TTM + vc (voice clone) | 26GB | 18-20GB | Auto-offloads between stages |
| STT | 12GB | N/A (CPU) | Whisper large-v3-turbo |
| STT + translate | 12GB | N/A (CPU) | Whisper large-v3 |
| STT + overdose | 14GB | N/A (CPU) | VibeVoice ASR |
| STT + diarization | 15GB | N/A (CPU) | Whisper + Pyannote |
| SE | 11GB | 4GB | UniSE |
| SFX | 12GB | 4GB | TangoFlux |
| SVS | 14GB | 8GB | BS-RoFormer |
| SLC | 14GB | 4GB | Whisper + Qwen3-TTS |
| SS | 14GB | N/A (CPU) | VibeVoice ASR |
When chaining operations, you don't need to sum all requirements — models are offloaded between operations. Plan for the peak memory of the most demanding step.
Step 1: STT (12GB peak) → offloaded
Step 2: TTS voice clone with music (23GB peak) → offloaded
Step 3: Done
Total memory needed: 23GB (not 35GB)

Step 1: SE (11GB peak) → offloaded
Step 2: TTM + vc (26GB peak) → offloaded
Step 3: Done
Total memory needed: 26GB

Step 1: SVS — extract vocals (14GB peak) → offloaded
Step 2: SLC — dub to English (14GB peak) → offloaded
Step 3: Done
Total memory needed: 14GB

Step 1: SS — separate speakers (14GB peak) → offloaded
Step 2: SVS — clean each speaker (14GB peak per file) → offloaded
Step 3: STT — transcribe each (12GB peak per file) → offloaded
Step 4: Done
Total memory needed: 14GB
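The planning rule for chained operations reduces to taking the maximum over the steps rather than the sum. A minimal sketch, using RAM figures copied from the requirements table; the `RAM_GB` keys and both function names are illustrative and not part of VODER:

```python
# RAM figures (GB) copied from the requirements table; keys are illustrative labels.
RAM_GB = {
    "stt": 12,
    "se": 11,
    "svs": 14,
    "slc": 14,
    "ss": 14,
    "tts+music": 23,
    "ttm+vc": 26,
}


def peak_ram_gb(steps):
    """RAM to plan for: the largest single step, since models are offloaded between steps."""
    if not steps:
        raise ValueError("need at least one step")
    return max(RAM_GB[step] for step in steps)


def summed_ram_gb(steps):
    """Naive sum, shown only for contrast with the peak."""
    return sum(RAM_GB[step] for step in steps)


print(peak_ram_gb(["stt", "tts+music"]))    # 23
print(summed_ram_gb(["stt", "tts+music"]))  # 35
print(peak_ram_gb(["ss", "svs", "stt"]))    # 14
```

The same comparison applies to VRAM if a second dictionary is filled in from the VRAM column.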
| Issue | Cause | Solution |
|---|---|---|
| Out of memory | Insufficient RAM/VRAM | Check requirements table; close other apps |
| FFmpeg not found | Missing system dependency | Install FFmpeg to PATH |
| Slow processing | CPU-only operation | Normal for CPU; GPU speeds up certain modes |
| Diarization fails | Missing/invalid HF_TOKEN | Set up HF_TOKEN.txt with valid token |
| YouTube download fails | Network/availability | Check video exists and is public |
| Poor voice cloning | Bad reference audio | Use 10-30s clear speech, single speaker; run SE first |
| SFX quality issues | Insufficient steps | Increase steps parameter |
| Music doesn't generate | Single mode used | music only works in dialogue mode |
| SFX line ignored | Missing /duration | Add /duration:nn directive |
| Cross-use conflict | Both voice and target for same character | Use one or the other per character |
| overdose + translate error | Mutually exclusive flags | Use one or the other, not both |
| SS fallback to Pyannote | VibeVoice unavailable | Install VibeVoice model; or set up HF_TOKEN for fallback |
| SLC poor voice match | Noisy source audio | Run SE or SVS on source before SLC |
| SVS incomplete separation | Very mixed audio | Try SE first to clean up, then SVS |
| TTM overdose too slow | XL-Turbo is resource-intensive | Use standard TTM (1.5) for faster results |
| Video output has no audio | FFmpeg muxing issue | Ensure FFmpeg is installed and in PATH |
| TTM lego missing stems | Invalid stem names | Use only the 12 supported instrument track names |
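Some of the issues in the table can be caught before a long generation run starts. The sketch below checks a dialogue script for `sfx:` lines that lack a `/duration` directive, the cause listed for ignored SFX lines. The directive syntax (`/key:value` after the line text) is inferred from the examples in this document; `parse_script_line` and `sfx_lines_missing_duration` are hypothetical helpers, and VODER's own parser may behave differently.

```python
import re

# Matches " /key:value" directives as they appear in the example script lines.
DIRECTIVE = re.compile(r"\s/(\w+):(\S+)")


def parse_script_line(line):
    """Split 'Speaker: text /key:value ...' into (speaker, text, directives)."""
    speaker, _, rest = line.partition(":")
    directives = dict(DIRECTIVE.findall(rest))
    text = DIRECTIVE.sub("", rest).strip()
    return speaker.strip(), text, directives


def sfx_lines_missing_duration(script_lines):
    """Return the sfx lines that carry no /duration directive."""
    missing = []
    for line in script_lines:
        speaker, _text, directives = parse_script_line(line)
        if speaker.lower() == "sfx" and "duration" not in directives:
            missing.append(line)
    return missing


script = [
    "sfx: intro jingle /duration:5 /level:50 /time:0",
    "Host: Welcome to our show!",
    "sfx: applause /level:40 /time:3",  # deliberately missing /duration
]
print(sfx_lines_missing_duration(script))
```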
- Enhance before cloning: Run SE on noisy reference audio before using for voice cloning
- Separate before cloning: Run SVS to extract clean vocals from mixed reference audio
- Test with short samples: Generate 5-10 second tests before full production
- Layer with time positioning: Use `/time:0` for overlapping SFX and speech
- Fade background music: Use level `"0:50-30:20"` for intro-to-content transitions
- Batch STT for efficiency: Process multiple files in one command
- Auto-clone for testing: Use same file for STT analysis and voice reference to test pipeline
- MSTS for songs: Always use `music` flag when converting singing voice
- Instrumental TTM: Use `lyrics "..."` for backing tracks
- Result routing: Always use `result` for automated workflows
- Check memory first: Ensure 23GB RAM for any workflow involving music; 30GB for overdose
- Use overdose for final output: Generate with standard TTM for testing, switch to overdose for final production
- SS before STT for multi-speaker: Run SS first to identify and separate speakers, then transcribe individually for cleaner results
- SVS before STS for mixed audio: Auto vocal extraction in STS handles most cases, but manual SVS → STS gives more control
- SLC preserves speaker identity: For dubbing, use no-target mode to keep the original speaker's voice in English
- TTS is unified now: Don't think in terms of TTS vs TTS+VC; just use `tts` with `voice` or `target` (or both via cross-use)
- TTM is unified now: Don't think in terms of TTM vs TTM+VC; just use `ttm` with `vc` + `clone` when you need voice cloning
- Legos for custom arrangements: Use `lego` with specific `make` stems to build custom instrumental arrangements
- Extract for remixing: Use `extract` to pull individual stems from existing songs
- Remix for style transfer: Use `remix` with `styling` and `bias` to create cover versions with adjustable style strength
- Repaint for section editing: Use `repaint` with `time:start-end` to restyle specific sections of a song
- Overdose XOR translate: Remember these STT flags are mutually exclusive; pick based on whether you need translation or enhanced transcription
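Several of the tips above chain two modes, with the first command's `result` path feeding the second. A minimal sketch of such a chain for the enhance-before-cloning tip, using the SE and TTS commands from the examples in this document. `run_pipeline` is a hypothetical helper, and it assumes VODER exits with a non-zero status when a step fails, which this document does not confirm.

```python
import subprocess


def run_pipeline(commands, runner=subprocess.run):
    """Run each argv list in order and stop at the first failure.

    Returns the number of commands that completed successfully.
    Assumes a non-zero return code signals failure.
    """
    completed = 0
    for argv in commands:
        proc = runner(argv)
        if proc.returncode != 0:
            break
        completed += 1
    return completed


# Enhance the reference first, then clone from the cleaned file.
se_then_tts = [
    ["python", "src/voder.py", "se", "noisy_reference.wav",
     "result", "/clean/reference.wav"],
    ["python", "src/voder.py", "tts", "script",
     "Hello, this is a voice clone test.",
     "target", "/clean/reference.wav",
     "result", "/output/cloned_speech.wav"],
]
```

Calling `run_pipeline(se_then_tts)` runs both steps in sequence. The `runner` parameter exists so the chain can be dry-run or tested with a stand-in function before any models are loaded.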
This skill provides a comprehensive understanding of VODER's architecture, a complete CLI command catalog for all 10 modes, feature compatibility rules, and combo possibilities. AI agents can use this knowledge to construct complex audio processing workflows that would be impossible or extremely difficult to build without a detailed understanding of how the tool works.