- Project: Text2AudioBook
- Document Type: Product Requirements Document (PRD)
- Status: Draft for stakeholder review
- Author: Cline
- Last Updated: 2026-05-21
- Implementation Status: Planning only; no production code changes approved yet
Text2AudioBook is a small desktop utility that converts text files into speech using OpenAI text-to-speech, then optionally combines MP3s into a video. The current codebase is functional but dated in several important areas: it uses a legacy-style raw HTTP integration for TTS, hardcoded voice/model defaults, simplistic text chunking, weak resilience for retries/rate limits, and a brittle audio/video pipeline.
This PRD proposes a modernization pass that keeps the project intentionally simple while upgrading it to current best practices. The plan focuses on:
- migrating to the official OpenAI Python SDK
- adding provider abstraction so OpenAI remains the primary hosted option and several local providers (Ollama, Kokoro, VibeVoice) become optional
- making model selection config-driven rather than hardcoded
- improving chunking and reliability without adding heavy dependencies
- adding automated tests so features can be validated safely
- keeping the GUI and general workflow familiar to existing users, with only small targeted UX improvements
Two HuggingFace TTS engines are added as opt-in local providers:
- Kokoro-82M (`hexgrad/Kokoro-82M`, Apache 2.0): 82M-param StyleTTS 2 model, CPU-capable, 8 languages, 54 voices, 24 kHz WAV output. Lightweight, permissive license, good fit for single-narrator audiobook flow.
- VibeVoice-1.5B (`microsoft/VibeVoice-1.5B`, MIT weights but research/development use only per upstream guidance): multi-speaker (up to 4), long-form (up to 90 min), English/Chinese, ~3B params BF16, GPU required. Suited to multi-character dialogue and podcast-style output. Includes upstream-baked AI watermark and audible disclaimer that must not be stripped.
This document is designed to be approved before any code changes are made.
- Modernize the TTS integration to use current OpenAI client patterns.
- Make model and voice upgrades easy without changing core code each time.
- Improve output quality and reliability without overcomplicating the project.
- Add optional local model support via Ollama where feasible.
- Add optional HuggingFace local providers: Kokoro-82M for lightweight narration and VibeVoice-1.5B for multi-speaker long-form audio.
- Add meaningful automated testing coverage for key logic.
- Preserve the project's small size and approachable structure.
The following are not in scope for this modernization unless approved later:
- turning the project into a web app or SaaS
- adding a database
- adding user accounts, authentication, or cloud sync
- building a plugin architecture
- adding advanced NLP dependencies unless clearly justified
- redesigning the UI into a new framework
- Small, understandable Python codebase
- Minimal user workflow
- Useful core utility for text-to-audio conversion
- Existing GUI reduces adoption friction for non-technical users
- Uses `requests.post()` directly against `/v1/audio/speech`
- Hardcodes model `tts-1`
- Makes future model changes require code edits
- Has no provider abstraction, making future local or alternate-provider support awkward
- API key is read from `key.txt` at import time
- Voices are hardcoded in the GUI
- No clean separation between defaults and runtime selection
- No discovery mechanism for available provider models/voices
- Current punctuation-based chunking is naive
- Can split awkwardly around abbreviations, dialogue, or long paragraphs
- Logged “starting sentence” metadata is simplistic
- Thread pool concurrency has no explicit, deliberately chosen limit
- No retry/backoff behavior for transient API failures or rate limits
- Failure reporting is minimal
- `combine_and_convert.py` mixes `ImageClip` and `VideoFileClip` in a way that appears error-prone
- Current approach may be heavier than necessary for static-image videos
- Dependency footprint is larger than needed
- No unit tests
- No integration test strategy
- No documented validation checklist for human-only verification
- Individual creators converting text into audiobook-style audio
- Users who prefer a desktop GUI over code/scripts
- Small-scale content producers who want a simple OpenAI TTS workflow
- Select a text file and generate a high-quality MP3 audiobook.
- Choose voice and quality mode without dealing with technical model names.
- Process larger documents safely through chunking.
- Optionally combine generated MP3 chunks into a single MP3 and/or simple video.
- Switch between hosted OpenAI TTS and local Ollama-backed generation where supported.
- The app shall use the official OpenAI Python SDK for TTS requests.
- The app shall support configurable model selection.
- The app shall preserve the current text-to-MP3 workflow.
- The app shall support more than one TTS provider without requiring GUI rewrites.
- OpenAI shall be the default hosted provider.
- Ollama-backed local support shall be included as an optional provider path.
- Kokoro-82M (HuggingFace, Apache 2.0) shall be included as an optional local provider for lightweight narration.
- VibeVoice-1.5B (HuggingFace, MIT weights, upstream-restricted to research/dev use) shall be included as an optional local provider for multi-speaker long-form output.
- Provider-specific capabilities (max speakers, supported languages, output sample rate/format, GPU requirement) shall be surfaced cleanly and used to gate UI options.
- HuggingFace local providers shall download model weights via `huggingface_hub` with pinned revisions/commit hashes to prevent silent model swaps.
- The app shall cache models to a configurable directory (defaulting to the standard HF cache).
- The app shall verify required system dependencies (`espeak-ng` for Kokoro) on first use of the relevant provider and produce a clear actionable error if missing.
- The app shall not strip, alter, or disable upstream-baked safety artifacts (e.g. VibeVoice's audible AI disclaimer or imperceptible watermark).
- The app shall expose VibeVoice's research/development-only license guidance to the user before first use and require explicit opt-in.
- The app shall convert provider-native output (Kokoro: WAV 24 kHz) into the project's standard MP3 output via the existing pydub path.
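The capability gating described in the provider requirements above could be modelled with a small record per provider. The following is a minimal sketch under stated assumptions, not the project's code: `ProviderCapabilities`, `CAPABILITIES`, and `ui_gates` are hypothetical names, the values are taken from the provider descriptions in this PRD, and treating Kokoro as single-speaker is an assumption based on its single-narrator positioning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderCapabilities:
    """Hypothetical per-provider capability record used to gate UI options."""
    max_speakers: int
    requires_gpu: bool
    native_format: Optional[str]    # None = not stated in this PRD
    sample_rate_hz: Optional[int]   # None = not stated in this PRD


CAPABILITIES = {
    # Kokoro: CPU-capable, 24 kHz WAV output; single speaker is an assumption.
    "kokoro": ProviderCapabilities(1, False, "wav", 24000),
    # VibeVoice: up to 4 speakers, GPU required; output format not stated here.
    "vibevoice": ProviderCapabilities(4, True, None, None),
}


def ui_gates(provider, gpu_available):
    """Return which UI options should be enabled for the given provider."""
    caps = CAPABILITIES[provider]
    return {
        # A GPU-only provider is selectable only when a GPU was detected.
        "selectable": gpu_available or not caps.requires_gpu,
        # Speaker-slot controls only make sense for multi-speaker providers.
        "multi_speaker": caps.max_speakers > 1,
    }
```

Keeping the capability table as plain data means the GUI only asks `ui_gates(...)` and never needs provider-specific branches.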
- The app shall first check `OPENAI_API_KEY` from the environment.
- The app shall support `key.txt` only as a fallback/backup option.
- The app shall surface clear errors when no API key is available.
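The key-lookup order in the three requirements above could be sketched as follows. This is an illustration only: `load_api_key`, `ApiKeyError`, and the injectable `env` and `key_file` parameters are hypothetical names chosen to keep the function easy to test.

```python
import os
from pathlib import Path


class ApiKeyError(RuntimeError):
    """Raised when no API key can be found (hypothetical name)."""


def load_api_key(env=None, key_file="key.txt"):
    """Return the API key, preferring OPENAI_API_KEY over key.txt."""
    env = os.environ if env is None else env
    # 1) Environment variable wins when it is set and non-empty.
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    # 2) key.txt is consulted only as a fallback.
    path = Path(key_file)
    if path.is_file():
        key = path.read_text(encoding="utf-8").strip()
        if key:
            return key
    # 3) Neither source produced a key: fail with an actionable message.
    raise ApiKeyError(
        "No OpenAI API key found. Set the OPENAI_API_KEY environment "
        f"variable or put the key in {path}."
    )
```

Calling this at conversion time rather than at import time would also address the import-time read noted in the current-state findings.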
- The app shall support a simple quality preset strategy.
- The app shall also support direct model selection for users who want control.
- The app shall allow voice selection from a maintained supported list.
- The app may optionally expose speed if it does not complicate the UI.
- The app shall support refreshing available models from providers where technically feasible.
- The app shall cache or fall back to a curated supported list when live discovery is unavailable or unreliable.
- The app shall support local model discovery for Ollama via its local API.
- The app shall split large inputs safely while preferring paragraph and sentence boundaries.
- The app shall avoid forced hard cuts unless no clean split exists.
- The app shall preserve chunk ordering metadata.
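One dependency-free way to meet the chunking requirements above is to pick each split point by trying paragraph breaks first, then sentence ends, then whitespace, and only then a hard cut. The sketch below assumes hypothetical names (`Chunk`, `chunk_text`, `_find_split`) and an assumed default of 4000 characters per chunk; the real limit should come from the provider's documentation.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int   # order of the chunk within the document
    start: int   # character offset of the chunk in the source text
    text: str


def _find_split(text, start, limit):
    """Return the end offset of the chunk that begins at `start`."""
    end = min(start + limit, len(text))
    if end == len(text):
        return end
    window = text[start:end]
    # 1) Prefer the last paragraph break inside the window.
    cut = window.rfind("\n\n")
    if cut > 0:
        return start + cut + 2
    # 2) Otherwise the last sentence end that is followed by whitespace.
    ends = list(re.finditer(r"[.!?][\"')\]]*\s", window))
    if ends:
        return start + ends[-1].end()
    # 3) Otherwise the last whitespace; a hard cut is the final resort.
    cut = max(window.rfind(" "), window.rfind("\n"))
    if cut > 0:
        return start + cut + 1
    return end


def chunk_text(text, max_chars=4000):
    """Split text into ordered chunks of at most max_chars characters."""
    chunks, start = [], 0
    while start < len(text):
        end = _find_split(text, start, max_chars)
        piece = text[start:end]
        if piece.strip():  # skip whitespace-only pieces
            chunks.append(Chunk(len(chunks), start, piece))
        start = end
    return chunks
```

This sketch does not handle abbreviations such as "Dr." specially, so a sentence-end match can still land on one; that case is a natural candidate for the unit tests this PRD calls for.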
- The app shall use bounded concurrency for API calls.
- The app shall retry transient failures with backoff.
- The app shall log chunk-level success/failure information.
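The three reliability requirements above fit together roughly as shown below. This is a sketch with assumed names and values: `TransientError` is a hypothetical stand-in for whatever rate-limit or timeout errors the chosen SDK raises, and the defaults (4 attempts, 1 second base delay, 3 workers) are illustrative rather than approved settings.

```python
import logging
import random
import time
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("text2audiobook.sketch")


class TransientError(Exception):
    """Hypothetical stand-in for rate-limit or timeout errors."""


def with_retry(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller record the failure
            # 1s, 2s, 4s, ... plus jitter so workers do not retry in lockstep
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.25))


def synthesize_all(chunks, synthesize, max_concurrency=3, sleep=time.sleep):
    """Run synthesize(chunk) with at most max_concurrency workers.

    Returns (index, result, error) tuples in the original chunk order.
    """
    def run(item):
        index, chunk = item
        try:
            result = with_retry(lambda: synthesize(chunk), sleep=sleep)
            log.info("chunk %d succeeded", index)
            return index, result, None
        except Exception as exc:
            log.error("chunk %d failed: %s", index, exc)
            return index, None, exc

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(run, enumerate(chunks)))
```

Returning the error alongside the index, instead of raising on the first failure, lets the caller report exactly which chunks need to be regenerated.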
- The app shall still produce a final merged MP3.
- The app shall still save chunk position metadata.
- The app shall sanitize or validate output file names.
- The project shall either simplify static-image video generation or clearly isolate it as a secondary feature.
- This shall not block the core text-to-audio modernization.
- Keep runtime dependencies lightweight.
- Keep modules small and understandable.
- Prefer deterministic, testable functions.
- Maintain Windows-first usability while avoiding OS-specific breakage where possible.
- Avoid hidden magic; use configuration defaults instead of sprawling settings.
Keep the current Tkinter desktop UI, but modernize the behavior under the hood.
Recommended user-facing simplification:
- Quality Preset: `Best Quality`, `Balanced`, `Fast` (applies to OpenAI; non-OpenAI providers ignore the preset and use direct model/voice selection)
- Provider Dropdown: `OpenAI` / `Ollama (Local)` / `Kokoro (Local)` / `VibeVoice (Local)`
- Model Dropdown: manually selectable, populated from provider discovery when available
- Voice Dropdown: supported voices only, filtered by provider (Kokoro: 54 voices across 8 languages; VibeVoice: speaker-slot mapping for up to 4 speakers)
- Optional Advanced Settings: collapsed or deferred unless needed
- First-use notice: selecting VibeVoice shows a one-time dialog summarising the research/dev-only license and the upstream safety features (watermark + audible disclaimer) before any download starts
Rationale:
- avoids overwhelming users with model names
- allows model upgrades later without changing the UI contract
- preserves simplicity while still feeling modern
The stakeholder approved small UX improvements while keeping the UI nearly identical. Recommended low-risk improvements are:
- disable the Start button while processing
- add a small status/progress label (for example: `Reading file`, `Chunk 3/12`, `Merging audio`, `Creating video`)
- add provider and model dropdowns without redesigning the layout
- add a `Refresh Models` button so the latest available models can be fetched where supported
- improve validation messages for missing API key, invalid paths, and unsupported local provider state
- preserve the existing single-window flow and overall field order as much as possible
Do not hardcode a forever-model into the architecture.
Instead:
- define a default recommended model in config/constants
- allow a fallback model
- map UI presets to model settings internally
Example strategy:
- `Best Quality` → latest recommended high-quality speech model
- `Balanced` → stable general-use model
- `Fast` → lower-latency or lower-cost option
Final exact model names should be confirmed against current OpenAI docs during implementation.
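As a rough illustration of the preset strategy, the mapping can live in one table with a fallback per preset. Everything here is hypothetical: the `PLACEHOLDER-*` strings are deliberately not real model names (those are to be confirmed against current OpenAI docs), and `ModelChoice`, `PRESET_MODELS`, and `resolve_model` are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    model: str      # preferred model for the preset
    fallback: str   # used when the preferred model is not available


# Placeholder identifiers only; real names are confirmed at implementation time.
PRESET_MODELS = {
    "Best Quality": ModelChoice("PLACEHOLDER-high-quality", "PLACEHOLDER-stable"),
    "Balanced": ModelChoice("PLACEHOLDER-stable", "PLACEHOLDER-stable"),
    "Fast": ModelChoice("PLACEHOLDER-fast", "PLACEHOLDER-stable"),
}
DEFAULT_PRESET = "Balanced"


def resolve_model(preset=None, explicit_model=None, available=None):
    """Map a UI preset (or an explicit override) to a concrete model name."""
    if explicit_model:
        return explicit_model  # direct model selection always wins
    choice = PRESET_MODELS.get(preset or DEFAULT_PRESET,
                               PRESET_MODELS[DEFAULT_PRESET])
    # If a discovery result is supplied, fall back when the model is absent.
    if available is not None and choice.model not in available:
        return choice.fallback
    return choice.model
```

Because the GUI only ever passes preset labels, swapping a model later means editing this table, not the UI.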
The stakeholder requested support for pulling the latest model names where possible.
Recommended approach:
- OpenAI: attempt provider-backed model discovery if the SDK/API exposes suitable listing capabilities for the intended TTS endpoint; otherwise use a maintained allowlist curated in config and updated during releases
- Ollama: query the local Ollama API to list installed local models
- UI behavior: provide a `Refresh Models` action and store the last successful result for the current session
Important note: model listing and model usability are not always the same thing. Even if a provider returns a large model list, the app should still filter to models known or configured to support the project’s use case.
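The discovery-then-filter behaviour could look roughly like this for Ollama, whose local API lists installed models at `/api/tags`. The helper names (`parse_ollama_tags`, `fetch_ollama_models`, `usable_models`) and the idea of passing the allowlist and the last successful result as arguments are assumptions made for this sketch.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # Ollama's default local address


def parse_ollama_tags(payload):
    """Extract model names from an Ollama /api/tags JSON response."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def fetch_ollama_models(base_url=OLLAMA_BASE_URL, timeout=3.0):
    """Query the local Ollama API for installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_ollama_tags(json.load(resp))


def usable_models(discover, allowlist, cached=None):
    """Return models to show in the UI.

    `discover` is any zero-argument callable returning model names.
    On failure, use the last successful result if there is one,
    otherwise the curated allowlist.
    """
    try:
        discovered = discover()
    except Exception:
        return list(cached) if cached else list(allowlist)
    # Listing is not the same as usability: keep only configured models.
    return [m for m in discovered if m in allowlist]
```

A `Refresh Models` handler could call `usable_models(fetch_ollama_models, allowlist, cached=last_result)` and store the return value for the session.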
Minimal config hierarchy:
- explicit UI choices
- environment variables
- local config file defaults
- code defaults as last resort
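The four-level hierarchy above amounts to "first source that has a value wins". A minimal sketch, assuming each layer has already been loaded into a plain dictionary keyed by setting name; `resolve_setting` and the values in `CODE_DEFAULTS` are hypothetical.

```python
# Illustrative last-resort defaults; the actual values are not yet decided.
CODE_DEFAULTS = {
    "provider": "openai",
    "max_concurrency": 3,
    "ollama_base_url": "http://localhost:11434",
}


def resolve_setting(name, ui=None, env=None, file_config=None,
                    defaults=CODE_DEFAULTS):
    """Return the value of `name` from the highest-priority source.

    Priority: explicit UI choice > environment > config file > code defaults.
    A missing key, None, or an empty string means "not set in this layer".
    """
    for source in (ui, env, file_config, defaults):
        if source and source.get(name) not in (None, ""):
            return source[name]
    raise KeyError(f"no value configured for {name!r}")
```

For example, `resolve_setting("provider", ui={"provider": None}, env={"provider": "ollama"})` falls through the empty UI layer and returns the environment value.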
Suggested configurable items:
- provider
- API key source
- default voice
- default quality preset
- optional explicit model override
- default output directory
- max concurrency
- local Ollama base URL
- official OpenAI SDK integration
- provider abstraction layer
- config-driven model selection
- env var API key with `key.txt` fallback
- bounded concurrency
- retry/backoff for transient failures
- improved logging
- improved text chunking
- optional speed control if supported and simple
- provider/model refresh UX
- GUI progress state / disable button during work
- clearer user-facing error messages
- refactor or simplify `combine_and_convert.py`
- reduce unnecessary dependency complexity
- improve README/setup docs
- Ollama integration for local model discovery and invocation, if the selected local model/provider path can support the desired workflow
- local provider setup checks and user guidance
- provider-specific validation and fallback behavior
- HuggingFace download/caching via `huggingface_hub` with pinned revision
- espeak-ng presence check with actionable install guidance (Windows .msi link, Linux apt hint)
- Kokoro `KPipeline` invocation per chunk
- WAV→MP3 conversion via pydub to keep pipeline output uniform
- voice/language dropdown population from Kokoro's supported list
- license/safety opt-in dialog on first selection
- HuggingFace download/caching via `huggingface_hub` with pinned revision (~6 GB BF16)
- GPU availability detection; clear failure if no compatible GPU
- multi-speaker input support (script with speaker tags) — design or defer
- preservation of upstream watermark and audible AI disclaimer in output
- Confirm PRD approval from stakeholder
- Confirm preferred simplicity level for UI changes
- Confirm whether video generation remains in active scope or is secondary
- Confirm acceptable dependency changes
- Confirm baseline Python version target
- PRD approved
- open product questions answered or accepted as assumptions
- Add a settings/config helper module or equivalent lightweight utility
- Add a provider abstraction layer for OpenAI and Ollama
- Define config precedence: UI > env > config file > defaults
- Add support for `OPENAI_API_KEY` with `key.txt` fallback
- Define supported voice list in one place
- Define quality preset mapping in one place
- Move model defaults out of hardcoded request payloads
- Define provider-specific capabilities and model filtering rules
- configuration behavior documented and testable
- model and voice defaults no longer buried in GUI/business logic
- Replace raw HTTP TTS calls with official OpenAI SDK calls
- Add bounded concurrency
- Add retry/backoff for transient request failures
- Add chunk-level logging and error reporting
- Ensure file outputs remain deterministic and ordered
- TTS pipeline works with current SDK
- failure handling is predictable
- ordering and output naming remain stable
- Add model discovery for OpenAI where feasible and safe
- Add local model discovery for Ollama
- Add `Refresh Models` UI action
- Add fallback curated model list when live discovery fails
- Validate that only usable models are exposed for each provider path
- model selection is reliable even when discovery partially fails
- users can choose either presets or explicit models
- Refactor chunking to prefer paragraph boundaries
- Add sentence-aware fallback splitting
- Preserve chunk position metadata accurately
- Improve logged sentence preview behavior
- Keep implementation dependency-light
- chunking is measurably cleaner on realistic text
- edge cases are covered by unit tests
- Disable start button while processing
- Add visible progress/status updates
- Improve path and filename validation
- Improve missing-key and API-failure messaging
- Keep UI changes minimal and understandable
- GUI behavior is clearer during processing
- preventable user errors are surfaced early
- Audit `combine_and_convert.py` behavior
- Keep MP3-to-video in active scope as approved by stakeholder
- Decide whether to keep MoviePy or replace parts with FFmpeg calls
- Fix image/video handling ambiguity
- Add tests around non-GUI helper logic where feasible
- Update documentation for video generation usage
- optional video feature is reliable or explicitly deprioritized
- Add Ollama connectivity checks
- Add local provider invocation path
- Add provider-specific settings such as base URL / model listing
- Define fallback behavior when a selected local model is unavailable
- Document limitations of local model quality/capability compared with hosted TTS
- local provider path can be selected and validated
- unsupported local configurations fail clearly and safely
- Add `kokoro>=0.9.4`, `soundfile`, `huggingface_hub` to project dependencies
- Document espeak-ng install steps (Windows .msi, Linux apt) in README
- Add startup espeak-ng probe with actionable error message on miss
- Implement `_write_kokoro_speech` helper in tts_conversion.py
- Pin Kokoro model revision in settings.py
- Add Kokoro voice list (54 voices across 8 langs) to supported-voices config
- Add WAV→MP3 conversion step (pydub) for Kokoro chunks
- Default `max_concurrency` to 1 for local providers (GPU/CPU memory-bound)
- Unit tests covering provider dispatch, voice validation, WAV→MP3 conversion
- Kokoro can be selected, downloaded once, and used to produce MP3 chunks
- espeak-ng absence is detected and surfaced before any chunk fails
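The espeak-ng probe in the Kokoro tasks above could be as small as a PATH lookup performed before any chunk is sent to the provider. This sketch assumes the executable is discoverable on `PATH` under the name `espeak-ng`, which may not hold for every Windows install; `check_espeak_ng` and `MissingDependencyError` are hypothetical names.

```python
import shutil


class MissingDependencyError(RuntimeError):
    """Raised when a required system dependency is absent (hypothetical name)."""


def check_espeak_ng(which=shutil.which):
    """Raise an actionable error if espeak-ng cannot be found on PATH.

    `which` is injectable so the check can be unit tested without
    depending on what is installed on the test machine.
    """
    if which("espeak-ng"):
        return
    raise MissingDependencyError(
        "Kokoro requires espeak-ng, which was not found on PATH.\n"
        "Windows: install the .msi from "
        "https://github.com/espeak-ng/espeak-ng/releases\n"
        "Linux: apt-get install espeak-ng"
    )
```

Running the check when the Kokoro provider is first selected, rather than at application start, keeps OpenAI-only users from seeing an irrelevant error.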
- Add `transformers`, `torch`, `accelerate`, `huggingface_hub` to project dependencies (pin versions from upstream `pyproject.toml`)
- Confirm whether upstream's "inference request logging" applies to local invocation; document findings
- Implement first-use license/safety opt-in dialog
- Implement `_write_vibevoice_speech` helper in tts_conversion.py
- Pin VibeVoice model revision in settings.py
- Implement GPU detection and clear failure mode when unavailable
- Design multi-speaker input format OR scope to single-speaker for v1
- Verify upstream watermark and audible disclaimer remain in output (do not strip)
- Unit tests covering provider dispatch, GPU detection, opt-in gating
- VibeVoice selectable only after explicit opt-in
- runs end-to-end on a GPU-equipped machine; fails clearly on non-GPU
- output retains upstream safety artifacts
- Add automated unit test suite
- Add focused integration tests with mocks/stubs
- Add real API smoke tests for approved hosted-provider scenarios
- Document human-only validation checklist
- Update README and setup instructions
- Produce release validation summary
- automated tests pass locally
- manual validation checklist completed for non-automatable areas
Testing is not optional in this modernization effort. Every implementation phase must include explicit test work, test evidence, and a pass/fail status before the phase can be considered complete.
Required enforcement rules:
- no feature is considered complete until its required tests are implemented and run
- no phase may be marked complete unless its testing exit criteria are satisfied
- any bug found during testing must either be fixed in the same phase or explicitly logged as a deferred issue with approval
- real smoke tests must be tracked separately from mocked/integration tests
- human validation items must be recorded as pending until explicitly confirmed by the stakeholder
Implementation must maintain the following test tracking artifacts:
- a test matrix mapping each feature to unit, integration, smoke, and manual validation coverage
- a phase validation checklist showing pass/fail status for each phase
- a defect log for failed tests, regressions, and deferred issues
- a release validation summary documenting what was tested, what passed, what was skipped, and why
Each tracked test item should use one of these statuses:
- `Not Started`
- `In Progress`
- `Passed`
- `Failed`
- `Blocked`
- `Deferred (Approved)`
- A feature cannot move to `Done` without required automated test coverage.
- A phase cannot move to `Done` while any required test item is `Failed` or `Blocked` unless the stakeholder explicitly approves an exception.
- Release readiness cannot be declared until required smoke tests pass and all human validation items are either completed or explicitly waived.
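If the test matrix is kept in a machine-readable form, the phase gate above can be checked mechanically. The sketch below is one interpretation, not an approved rule set: `ItemStatus` and `phase_can_close` are hypothetical names, and treating `Not Started` and `In Progress` as always blocking is an assumption drawn from the rule that required tests must be implemented and run.

```python
from enum import Enum


class ItemStatus(Enum):
    NOT_STARTED = "Not Started"
    IN_PROGRESS = "In Progress"
    PASSED = "Passed"
    FAILED = "Failed"
    BLOCKED = "Blocked"
    DEFERRED_APPROVED = "Deferred (Approved)"


def phase_can_close(statuses, exception_approved=False):
    """Return True if a phase with these required test items may move to Done."""
    statuses = list(statuses)
    # Assumption: unfinished items always block, even with an exception.
    unfinished = {ItemStatus.NOT_STARTED, ItemStatus.IN_PROGRESS}
    if any(s in unfinished for s in statuses):
        return False
    # Failed/Blocked items block unless the stakeholder approved an exception.
    blocking = {ItemStatus.FAILED, ItemStatus.BLOCKED}
    if any(s in blocking for s in statuses):
        return exception_approved
    return True
```

A check like this could run as part of producing the phase validation block, so the "Approved to Exit Phase" field is never filled in by hand against a failing matrix.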
The project currently has no tests. This modernization will add a pragmatic test strategy that validates core behavior without introducing excessive infrastructure.
Principles:
- maximize coverage of deterministic logic with unit tests
- isolate API dependencies via mocks
- reserve human validation only for areas automation cannot reliably judge
- avoid expensive or flaky test design
Unit tests should cover:
- reading text file success/failure
- chunking under max length
- chunking over max length
- paragraph-aware splitting
- sentence-aware fallback splitting
- forced split fallback when no punctuation exists
- position metadata correctness
- sentence preview metadata correctness
- env var API key loading
- fallback to `key.txt`
- error when both are missing
- config precedence behavior
- default quality preset selection
- provider selection and capability gating
- model discovery fallback logic
- request payload/model selection mapping
- output filename generation
- retry decision logic
- chunk ordering preservation
- bounded concurrency configuration
- provider-specific invocation path selection
- concatenation order behavior
- empty list handling
- invalid path handling where feasible
- only for pure helper functions that can be isolated cleanly
- do not attempt heavy multimedia integration in basic unit tests
- Ollama availability detection
- local model list parsing
- provider fallback and unsupported-capability behavior
Integration tests should remain lightweight and mostly mocked.
Recommended integration coverage:
- text file → chunk generation → mocked TTS outputs → merged MP3 path creation
- GUI-adjacent logic extracted into testable helper functions where possible
- config loading + TTS selection flow
- provider switching between OpenAI and Ollama using mocks/stubs
These should be limited and carefully scoped.
Recommended smoke tests:
- run conversion on a very small sample input using mocked API responses
- verify chunk files and final output file creation
- run a real OpenAI smoke test with a tiny approved sample because the stakeholder explicitly requested real validation
- optionally run an Ollama smoke test if a compatible local model is installed and available
The human should only be asked to validate what cannot be reliably judged through automation.
Human validation required for:
- subjective voice quality preference
- whether the generated speech sounds natural enough for target use
- whether selected default voices feel appropriate
- real-cost tolerance for chosen model/preset
- full GUI usability preferences
- optional GPU/FFmpeg/video behavior on the user’s machine
- whether local Ollama output quality is acceptable relative to the hosted provider
- Voice Quality Review
  - Listen to sample output across selected voices
  - Confirm preferred default voice
- Model / Preset Review
  - Compare `Best Quality` vs `Balanced` vs `Fast`
  - Confirm chosen defaults meet budget and speed expectations
- Desktop UX Review
  - Validate field labels are understandable
  - Confirm progress messaging is clear
  - Confirm startup flow is acceptable for non-technical users
- Optional Video Validation
  - Confirm output plays correctly on target system
  - Confirm video/image behavior is acceptable
- `pytest` for test runner
- `unittest.mock` or `pytest-mock` for mocking
- temporary file fixtures for filesystem tests
- avoid real network calls in default automated tests
Testing is considered complete when:
- all unit tests pass locally
- mocked integration tests pass locally
- required smoke tests pass locally or are explicitly approved as deferred
- the test matrix is up to date for all in-scope features
- the defect log is updated for all failures encountered during validation
- documentation for manual validation is written
- stakeholder completes human-only validation checklist items relevant to approved scope
The implementation must maintain a feature-to-test mapping similar to the following.
| Feature / Area | Unit Tests | Integration Tests | Real Smoke Test | Human Validation | Status |
|---|---|---|---|---|---|
| Config loading and precedence | Required | Optional | No | No | Not Started |
| API key env var + key.txt fallback | Required | Required | No | No | Not Started |
| OpenAI SDK TTS flow | Required | Required | Required | Optional listening check | Not Started |
| OpenAI model discovery | Required | Required | Optional | No | Not Started |
| Ollama local model discovery | Required | Required | Optional | No | Not Started |
| Ollama generation path | Required | Required | Optional/Required if supported locally | Yes, quality review | Not Started |
| Kokoro download and revision pinning | Required | Required | Optional | No | Not Started |
| Kokoro espeak-ng probe and error path | Required | Optional | No | No | Not Started |
| Kokoro generation path + WAV→MP3 | Required | Required | Required (small sample) | Yes, quality review | Not Started |
| VibeVoice license opt-in gate | Required | Optional | No | Yes, consent flow review | Not Started |
| VibeVoice download and revision pinning | Required | Required | Optional | No | Not Started |
| VibeVoice GPU detection and failure mode | Required | Required | No | No | Not Started |
| VibeVoice generation path (preserves watermark) | Required | Required | Required (small sample, GPU-equipped) | Yes, quality + safety review | Not Started |
| Text chunking logic | Required | Optional | No | Optional output quality spot check | Not Started |
| Audio concatenation | Required | Required | Optional | Optional | Not Started |
| MP3-to-video pipeline | Required where practical | Required | Optional | Yes | Not Started |
| GUI progress and validation states | Required for helper logic | Optional | Optional | Yes | Not Started |
| Error handling / retry behavior | Required | Required | Optional | No | Not Started |
Each failed or blocked test must be captured in a defect log with at least:
- defect ID
- phase
- feature area
- failing test name
- date found
- severity
- current status
- disposition (`fix now`, `defer`, `won't fix`)
- stakeholder approval status if deferred
Each phase must include a validation block during implementation:
- Phase:
- Implemented Items:
- Unit Tests: Passed / Failed / Blocked
- Integration Tests: Passed / Failed / Blocked
- Smoke Tests: Passed / Failed / Blocked / N/A
- Human Validation Needed: Yes / No
- Defects Open:
- Approved to Exit Phase: Yes / No
No phase should be closed without completing this validation block.
This section defines what the implementation agent should test directly versus what should be escalated to the human.
- static review of module boundaries
- run unit tests locally
- run mocked integration tests locally
- run real hosted-provider smoke tests with approved small input and credentials/environment setup
- validate config precedence behavior
- validate chunking behavior across sample inputs
- validate error handling for missing files / missing keys / mocked API failures
- verify output files are created in test environments
- verify deterministic ordering of chunks and merged outputs
- live API billing/cost acceptability
- subjective speech quality and voice preference
- whether latency is acceptable in real-world use
- GUI polish preferences
- machine-specific FFmpeg/GPU behavior
- final acceptance of default presets and defaults
Before release, require:
- automated tests green
- core text-to-audio workflow manually spot-checked by human
- at least one real audio sample approved by human
- documentation reviewed for setup accuracy
| Risk | Impact | Mitigation |
|---|---|---|
| OpenAI model names or SDK APIs evolve | Medium | Keep model selection config-driven and isolate SDK usage |
| Live model discovery returns unusable or overly broad results | Medium | Filter discovered models through provider-specific allowlists/capability rules |
| Too many UI settings complicate the app | High | Use quality presets and keep advanced options optional or deferred |
| Rate limits or transient failures break multi-chunk conversion | High | Add retries, bounded concurrency, and chunk-level reporting |
| Chunking changes cause regressions | Medium | Add sample-based unit tests and preserve deterministic behavior |
| Video pipeline remains brittle | Medium | Make it a secondary scope item and simplify if retained |
| Local Ollama models may not provide TTS-quality output or compatible interfaces | High | Treat provider capabilities explicitly, document limitations, and fail clearly when unsupported |
| VibeVoice upstream restricts to research/dev use; commercial fit unclear | High | Surface license/intent dialog before first use; require explicit opt-in; document limitation in README |
| VibeVoice model is ~6 GB and requires GPU; many users will lack capability | Medium | GPU detection up front; clear failure path; keep Kokoro as the recommended local default |
| Kokoro requires external espeak-ng system dep (manual install on Windows) | Medium | Detect missing dep at startup of the provider path; show actionable install link |
| HF model revisions could drift between project versions | Medium | Pin commit hash/revision in settings.py; fail loudly on mismatch |
| Stripping or muting upstream safety artifacts (watermark, AI disclaimer) | High | Architectural prohibition; covered by code review and explicit test |
| Dependency bloat increases setup difficulty | Medium | Prefer stdlib + minimal libs; remove obsolete dependencies where possible; gate heavy deps (torch, transformers) to optional extras where feasible |
- keep: `openai`, `pydub`
- add: `pytest`, `huggingface_hub`
- add (Kokoro provider): `kokoro>=0.9.4`, `soundfile`
- add (VibeVoice provider): `transformers`, `torch`, `accelerate` (versions pinned from upstream `pyproject.toml` during implementation)
- system dependency (Kokoro): `espeak-ng` (Windows: `.msi` from https://github.com/espeak-ng/espeak-ng/releases; Linux: `apt-get install espeak-ng`)
- possibly remove or reduce dependence on: `requests`
- re-evaluate: `moviepy`
- add or integrate with: Ollama local API client approach (lightweight HTTP or maintained Python package, depending on implementation choice)
- consider gating heavy deps (`torch`, `transformers`) behind a `vibevoice` extras group so default install stays light
- `pytest`
- optionally `pytest-mock` if helpful
These are the approved implementation assumptions for final planning:
- Ollama support: connect to the local Ollama instance and load locally available models as selectable options in the UI.
- OpenAI model discovery: use fully dynamic listing where possible, while still filtering to valid/current models appropriate for this app’s workflow.
- Real API smoke tests: use a conservative default validation budget chosen by implementation planning.
- Ollama behavior: the app should query the local Ollama API and show local models as available options. The implementation should still validate capabilities and warn or block when a selected local model cannot support the required generation flow. [Confirmed 2026-05-21]
- OpenAI discovery behavior: the app should dynamically retrieve current model options where technically feasible. Because dynamic listings can be broad, the implementation should apply validation rules so only valid/current models relevant to the app are shown or are clearly labeled. [Confirmed 2026-05-21]
- Real smoke test budget/time cap: default target should be under $1 per validation run and under 5 minutes total runtime. Smoke tests should use tiny sample inputs and minimize generated audio length. [Confirmed 2026-05-21]
- GPU availability: does the target machine have a CUDA-capable GPU with sufficient VRAM for VibeVoice-1.5B (BF16, ~3B params)? If not, Phase 6.3 is skipped or deferred. Decision (2026-05-21): No GPU assumed in v0.1 target machines. Phase 6.3 (VibeVoice) deferred to v0.2. (resolution mode: bulk-accept-all-defaults; see `.paul/phases/00-discovery-and-approval/00-01-APPROVAL-PACKET.md`)
- VibeVoice license acceptance: stakeholder confirmation that the upstream "research and development only" guidance is acceptable for this project's distribution model, OR explicit decision to keep VibeVoice out of scope. Decision (2026-05-21): Accept research/dev-use license; require first-run opt-in dialog; do NOT ship binaries that include weights. Moot for v0.1 because Phase 6.3 is deferred; position locked for v0.2 re-use. (deferred: VibeVoice work moved to v0.2)
- Disk budget: ~6 GB for VibeVoice weights + a few hundred MB for Kokoro. Confirm acceptable cache footprint. Decision (2026-05-21): Accept ~6 GB for VibeVoice (only when Phase 6.3 is enabled in v0.2); Kokoro ~500 MB always OK in v0.1.
- HF cache location: default to standard HF cache (`~/.cache/huggingface`) or project-local cache directory? Decision (2026-05-21): Default to standard `~/.cache/huggingface`; expose `HF_HOME` env var override. Implementation in Phase 6.2 (Kokoro).
- Multi-speaker UX: for VibeVoice, do we expose multi-speaker scripting (e.g. `[S1] line / [S2] line`) in v1, or default to single-speaker mode? Decision (2026-05-21): Defer multi-speaker scripting to v0.2. Ship VibeVoice as single-speaker first if shipped at all. (deferred: tied to Phase 6.3)
- Provider default: with four providers available, confirm the recommended default remains OpenAI for hosted-mode users and Kokoro for offline-mode users. Decision (2026-05-21): OpenAI is the default hosted provider; Kokoro is the recommended offline default. Drives Phase 4 GUI provider-default behavior.
- Keep the Tkinter UI.
- Allow small targeted UX improvements without redesigning the workflow.
- Keep both preset-based quality selection and direct model selection.
- Keep `key.txt` fallback, but recommend `OPENAI_API_KEY` first.
- Keep video generation in active scope.
- Include real API smoke tests in addition to mocked automated tests.
- Add Ollama as an optional local provider, with clear capability/quality caveats.
- Add Kokoro-82M as a recommended lightweight local provider (Apache 2.0, CPU-capable).
- Add VibeVoice-1.5B as an opt-in, GPU-only local provider, with explicit research/dev license dialog and preservation of upstream safety artifacts.
- Query Ollama dynamically for locally available models.
- Use dynamic OpenAI model discovery where feasible, with filtering/validation.
- Pin HF model revisions for reproducibility.
- Keep real smoke tests short and inexpensive by default (< $1, < 5 minutes).
The modernization project is complete when all of the following are true:
- TTS uses the official OpenAI SDK
- model/voice behavior is config-driven
- chunking is improved and covered by tests
- retries and bounded concurrency are implemented
- automated unit tests exist and pass locally
- manual validation checklist exists for human-only review areas
- README/setup docs are updated (including espeak-ng setup, HF cache notes, VibeVoice license acknowledgement)
- Kokoro-82M provider works end-to-end with pinned revision and produces valid MP3 output
- VibeVoice-1.5B provider (if in approved scope) works end-to-end on GPU with watermark + audible disclaimer preserved
- stakeholder approves the resulting workflow and defaults
This project is a strong candidate for a lightweight modernization because the codebase is small and its weaknesses are concentrated in a few clear places. The best path is not a rewrite; it is a focused refactor with tests, config cleanup, and reliability improvements.
No code changes should begin until the stakeholder reviews this PRD and answers the open questions above.