Text2AudioBook Modernization PRD

1. Document Status

Project: Text2AudioBook
Document Type: Product Requirements Document (PRD)
Status: Draft for stakeholder review
Author: Cline
Last Updated: 2026-05-21
Implementation Status: Planning only; no production code changes approved yet

2. Executive Summary

Text2AudioBook is a small desktop utility that converts text files into speech using OpenAI text-to-speech, then optionally combines MP3s into a video. The current codebase is functional but dated in several important areas: it uses a legacy-style raw HTTP integration for TTS, hardcoded voice/model defaults, simplistic text chunking, weak resilience for retries/rate limits, and a brittle audio/video pipeline.

This PRD proposes a modernization pass that keeps the project intentionally simple while upgrading it to current best practices. The plan focuses on:

migrating to the official OpenAI Python SDK
adding a provider abstraction so that OpenAI remains the primary hosted option while local providers (Ollama, Kokoro, VibeVoice) are available as opt-in alternatives
making model selection config-driven rather than hardcoded
improving chunking and reliability without adding heavy dependencies
adding automated tests so features can be validated safely
keeping the GUI and general workflow familiar to existing users, with only small targeted UX improvements

Two HuggingFace TTS engines are added as opt-in local providers:

Kokoro-82M (hexgrad/Kokoro-82M, Apache 2.0): 82M-param StyleTTS 2 model, CPU-capable, 8 languages, 54 voices, 24 kHz WAV output. Lightweight, permissive license, good fit for single-narrator audiobook flow.
VibeVoice-1.5B (microsoft/VibeVoice-1.5B, MIT weights but research/development use only per upstream guidance): multi-speaker (up to 4), long-form (up to 90 min), English/Chinese, ~3B params BF16, GPU required. Suited to multi-character dialogue and podcast-style output. Includes upstream-baked AI watermark and audible disclaimer that must not be stripped.

This document is designed to be approved before any code changes are made.

3. Product Goals

3.1 Primary Goals

Modernize the TTS integration to use current OpenAI client patterns.
Make model and voice upgrades easy without changing core code each time.
Improve output quality and reliability without overcomplicating the project.
Add optional local model support via Ollama where feasible.
Add optional HuggingFace local providers: Kokoro-82M for lightweight narration and VibeVoice-1.5B for multi-speaker long-form audio.
Add meaningful automated testing coverage for key logic.
Preserve the project's small size and approachable structure.

3.2 Non-Goals

The following are not in scope for this modernization unless approved later:

turning the project into a web app or SaaS
adding a database
adding user accounts, authentication, or cloud sync
building a plugin architecture
adding advanced NLP dependencies unless clearly justified
redesigning the UI into a new framework

4. Current State Assessment

4.1 Strengths

Small, understandable Python codebase
Minimal user workflow
Useful core utility for text-to-audio conversion
Existing GUI reduces adoption friction for non-technical users

4.2 Key Problems / Outdated Areas

A. TTS integration is outdated

Uses requests.post() directly against /v1/audio/speech
Hardcodes model tts-1
Makes future model changes require code edits
Has no provider abstraction, making future local or alternate-provider support awkward

B. Configuration is brittle

API key is read from key.txt at import time
Voices are hardcoded in the GUI
No clean separation between defaults and runtime selection
No discovery mechanism for available provider models/voices

C. Chunking quality is basic

Current punctuation-based chunking is naive
Can split awkwardly around abbreviations, dialogue, or long paragraphs
Logged “starting sentence” metadata is simplistic

D. Reliability is limited

Thread pool concurrency has no explicitly configured limit
No retry/backoff behavior for transient API failures or rate limits
Failure reporting is minimal

E. Video pipeline is fragile

combine_and_convert.py mixes ImageClip and VideoFileClip in a way that appears error-prone
Current approach may be heavier than necessary for static-image videos
Dependency footprint is larger than needed

F. Test coverage is absent

No unit tests
No integration test strategy
No documented validation checklist for human-only verification

5. Users and Use Cases

5.1 Primary Users

Individual creators converting text into audiobook-style audio
Users who prefer a desktop GUI over code/scripts
Small-scale content producers who want a simple OpenAI TTS workflow

5.2 Primary Use Cases

Select a text file and generate a high-quality MP3 audiobook.
Choose voice and quality mode without dealing with technical model names.
Process larger documents safely through chunking.
Optionally combine generated MP3 chunks into a single MP3 and/or simple video.
Switch between hosted OpenAI TTS and local Ollama-backed generation where supported.

6. Product Requirements

6.1 Functional Requirements

FR-1: Modern OpenAI TTS integration

The app shall use the official OpenAI Python SDK for TTS requests.
The app shall support configurable model selection.
The app shall preserve the current text-to-MP3 workflow.
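
A minimal sketch of what FR-1 could look like, assuming the official SDK's streaming speech interface. The helper name write_speech_chunk, the injected client parameter, and the default voice "alloy" are illustrative assumptions rather than existing project code; tts-1 is the model the current code hardcodes.

```python
from pathlib import Path


def write_speech_chunk(client, text, out_path, model="tts-1", voice="alloy"):
    """Request speech for one text chunk and stream it to an MP3 file.

    `client` is expected to be an `openai.OpenAI()` instance. It is passed in
    (rather than created here) so tests can substitute a stub.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Streaming variant of the speech endpoint; the context manager closes
    # the HTTP response once the audio has been written to disk.
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
    ) as response:
        response.stream_to_file(out_path)
    return out_path


# Intended usage (not executed here, needs the openai package and an API key):
#   from openai import OpenAI
#   write_speech_chunk(OpenAI(), "Hello, world.", "out/chunk_0001.mp3")
```

Passing the client in keeps the helper free of import-time side effects, which is the main behavioral change from the current key.txt-at-import approach.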

FR-1A: Multi-provider architecture

The app shall support more than one TTS provider without requiring GUI rewrites.
OpenAI shall be the default hosted provider.
Ollama-backed local support shall be included as an optional provider path.
Kokoro-82M (HuggingFace, Apache 2.0) shall be included as an optional local provider for lightweight narration.
VibeVoice-1.5B (HuggingFace, MIT weights, upstream-restricted to research/dev use) shall be included as an optional local provider for multi-speaker long-form output.
Provider-specific capabilities (max speakers, supported languages, output sample rate/format, GPU requirement) shall be surfaced cleanly and used to gate UI options.
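
One possible shape for the capability surface in FR-1A is sketched below. All class, field, and function names are hypothetical. Capability values are taken from this PRD where it states them and are marked as assumptions in the comments where it does not.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class ProviderCapabilities:
    max_speakers: int
    requires_gpu: bool
    native_format: str                    # e.g. "mp3" or "wav"
    sample_rate_hz: Optional[int] = None  # None when not fixed or unknown


class TTSProvider(Protocol):
    """Interface each provider adapter would implement (hypothetical)."""

    capabilities: ProviderCapabilities

    def synthesize(self, text: str, voice: str, out_path: str) -> str:
        """Write speech for `text` to `out_path` and return the path."""
        ...


CAPABILITIES = {
    "kokoro": ProviderCapabilities(
        max_speakers=1,        # assumption: single-narrator use
        requires_gpu=False,    # PRD: CPU-capable
        native_format="wav",   # PRD: 24 kHz WAV output
        sample_rate_hz=24000,
    ),
    "vibevoice": ProviderCapabilities(
        max_speakers=4,        # PRD: up to 4 speakers
        requires_gpu=True,     # PRD: GPU required
        native_format="wav",   # assumption: not stated in this PRD
    ),
}


def gate_ui_options(caps: ProviderCapabilities, gpu_available: bool) -> dict:
    """Derive which UI controls to offer from a provider's capabilities."""
    return {
        "provider_enabled": gpu_available or not caps.requires_gpu,
        "speaker_slots": list(range(1, caps.max_speakers + 1)),
        "needs_mp3_conversion": caps.native_format != "mp3",
    }
```

With this shape the GUI only reads the dictionary returned by gate_ui_options, so adding a provider would mean adding a capabilities entry and an adapter rather than editing GUI code.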

FR-1B: HuggingFace model handling

HuggingFace local providers shall download model weights via huggingface_hub with pinned revisions/commit hashes to prevent silent model swaps.
The app shall cache models to a configurable directory (defaulting to the standard HF cache).
The app shall verify required system dependencies (espeak-ng for Kokoro) on first use of the relevant provider and produce a clear actionable error if missing.
The app shall not strip, alter, or disable upstream-baked safety artifacts (e.g. VibeVoice's audible AI disclaimer or imperceptible watermark).
The app shall expose VibeVoice's research/development-only license guidance to the user before first use and require explicit opt-in.
The app shall convert provider-native output (Kokoro: WAV 24 kHz) into the project's standard MP3 output via the existing pydub path.
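
The revision-pinning requirement could be enforced with a small wrapper like the sketch below. It assumes huggingface_hub's snapshot_download function; the wrapper name download_pinned and the rule that only full 40-character commit hashes are accepted are illustrative assumptions. The actual hash would come from settings.py and is deliberately not shown here.

```python
import re

# A full Git commit hash. Branch names such as "main" are rejected because
# they can silently move to a different set of weights.
_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def download_pinned(repo_id, revision, cache_dir=None, downloader=None):
    """Download a model snapshot at an exact commit and return its local path.

    `downloader` exists only so tests can inject a stub; by default the real
    `huggingface_hub.snapshot_download` is imported lazily, which keeps the
    dependency optional for users who never select a local provider.
    """
    if not _COMMIT_HASH.fullmatch(revision):
        raise ValueError(
            f"revision for {repo_id} must be a 40-character commit hash, "
            f"got {revision!r}"
        )
    if downloader is None:
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    # cache_dir=None lets huggingface_hub use its standard cache location.
    return downloader(repo_id=repo_id, revision=revision, cache_dir=cache_dir)
```

Rejecting branch names up front turns an accidental "main" in configuration into an immediate error instead of a silent model swap.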

FR-2: Backward-compatible credential loading

The app shall first check OPENAI_API_KEY from the environment.
The app shall support key.txt only as a fallback/backup option.
The app shall surface clear errors when no API key is available.
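
The lookup order in FR-2 can be sketched as follows. The function name load_api_key and the exception name MissingApiKeyError are hypothetical; OPENAI_API_KEY and key.txt come from the requirement itself.

```python
import os
from pathlib import Path


class MissingApiKeyError(RuntimeError):
    """Raised when neither the environment nor key.txt provides a key."""


def load_api_key(env=None, key_file="key.txt"):
    """Return the API key, preferring the environment over key.txt."""
    env = os.environ if env is None else env
    key = (env.get("OPENAI_API_KEY") or "").strip()
    if key:
        return key
    path = Path(key_file)
    if path.is_file():
        key = path.read_text(encoding="utf-8").strip()
        if key:
            return key
    raise MissingApiKeyError(
        "No OpenAI API key found. Set the OPENAI_API_KEY environment "
        f"variable or put the key in {path}."
    )
```

Because the function is called explicitly instead of running at import time, the GUI can catch MissingApiKeyError and show the message to the user.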

FR-3: Simpler modern settings

The app shall support a simple quality preset strategy.
The app shall also support direct model selection for users who want control.
The app shall allow voice selection from a maintained supported list.
The app may optionally expose speed if it does not complicate the UI.

FR-3A: Model and voice discovery

The app shall support refreshing available models from providers where technically feasible.
The app shall cache or fall back to a curated supported list when live discovery is unavailable or unreliable.
The app shall support local model discovery for Ollama via its local API.
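
For the Ollama part of FR-3A, discovery with a fallback might look like the sketch below. It assumes Ollama's local REST endpoint GET /api/tags on the default port 11434, which returns a JSON object with a "models" list. The function names and the injectable opener are illustrative. Filtering the result down to models that can actually serve the TTS workflow is left out.

```python
import json
import urllib.request


def parse_ollama_models(payload):
    """Extract sorted model names from an /api/tags response body."""
    if not isinstance(payload, dict):
        return []
    models = payload.get("models") or []
    return sorted(
        m["name"] for m in models if isinstance(m, dict) and m.get("name")
    )


def list_ollama_models(base_url="http://localhost:11434", fallback=(),
                       timeout=3.0, opener=urllib.request.urlopen):
    """Return installed model names, or `fallback` when Ollama is unreachable.

    `fallback` would be the curated or last-known list; `opener` is only a
    seam for tests.
    """
    url = f"{base_url.rstrip('/')}/api/tags"
    try:
        with opener(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts (URLError is a
        # subclass); ValueError covers malformed JSON.
        return list(fallback)
    return parse_ollama_models(payload)
```

A Refresh Models action could call list_ollama_models with the last successful result as fallback, so a stopped Ollama instance degrades to the cached list instead of an empty dropdown.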

FR-4: Better chunking

The app shall split large inputs safely while preferring paragraph and sentence boundaries.
The app shall avoid forced hard cuts unless no clean split exists.
The app shall preserve chunk ordering metadata.
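
A dependency-free sketch of the split preference in FR-4 follows: paragraph break first, then sentence end, then any whitespace, and a hard cut only as a last resort. The names Chunk and chunk_text and the default max_chars value are assumptions. The sentence pattern is deliberately simple and will still split after abbreviations such as "Dr.", so it illustrates the fallback order rather than a finished splitter.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int   # ordering metadata
    start: int   # offset of the chunk in the source text
    end: int     # exclusive end offset
    text: str


_PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
_SENTENCE_END = re.compile(r"[.!?][\"')\]]*\s+")
_WHITESPACE = re.compile(r"\s+")


def _last_match_end(pattern, window):
    """Return the end offset of the last match in `window`, or 0 if none."""
    end = 0
    for match in pattern.finditer(window):
        end = match.end()
    return end


def chunk_text(text, max_chars=4000):
    """Split `text` into chunks of at most `max_chars` characters."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks = []
    pos = 0
    while pos < len(text):
        if len(text) - pos <= max_chars:
            cut = len(text)
        else:
            window = text[pos:pos + max_chars]
            offset = 0
            # Try the cleanest boundary first and stop at the first hit.
            for pattern in (_PARAGRAPH_BREAK, _SENTENCE_END, _WHITESPACE):
                offset = _last_match_end(pattern, window)
                if offset:
                    break
            cut = pos + (offset or max_chars)  # hard cut if nothing matched
        chunks.append(Chunk(len(chunks), pos, cut, text[pos:cut]))
        pos = cut
    return chunks
```

Chunks keep their trailing whitespace, so joining all chunk texts reproduces the input exactly and the start and end offsets can be used directly as position metadata.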

FR-5: Reliability controls

The app shall use bounded concurrency for API calls.
The app shall retry transient failures with backoff.
The app shall log chunk-level success/failure information.
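
The three controls in FR-5 could be combined roughly as sketched below. TransientError is a placeholder exception; a real integration would map the SDK's rate-limit and timeout errors onto it. All names, the default of four workers, and the backoff constants are assumptions.

```python
import logging
import random
import time
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("text2audiobook.tts")


class TransientError(Exception):
    """Placeholder for retryable failures (rate limits, timeouts, 5xx)."""


def call_with_retry(fn, *args, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying transient failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args)
        except TransientError as exc:
            if attempt == attempts:
                raise
            # Exponential backoff plus jitter so parallel workers do not
            # all retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            log.warning("attempt %d/%d failed (%s); retrying in %.2fs",
                        attempt, attempts, exc, delay)
            sleep(delay)


def convert_chunks(chunks, synthesize, max_workers=4, **retry_kwargs):
    """Run synthesize(text) per chunk with bounded concurrency.

    Returns a list of (index, result, error) tuples in input order.
    """
    def run(item):
        index, text = item
        try:
            result = call_with_retry(synthesize, text, **retry_kwargs)
            log.info("chunk %d succeeded", index)
            return index, result, None
        except Exception as exc:
            log.error("chunk %d failed: %s", index, exc)
            return index, None, exc

    # max_workers is the explicit bound; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, enumerate(chunks)))
```

Returning the error alongside each index, instead of raising on the first failure, lets the caller report exactly which chunks need to be retried or regenerated.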

FR-6: Output handling

The app shall still produce a final merged MP3.
The app shall still save chunk position metadata.
The app shall sanitize or validate output file names.
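
Because the project is Windows-first, filename sanitization would need to handle characters and device names that Windows rejects. The sketch below is one conservative approach; the function name, the replacement character, and the length cap are assumptions.

```python
import re

# Characters Windows does not allow in file names, plus control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Device names Windows reserves regardless of extension.
_RESERVED_NAMES = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def sanitize_filename(name, default="output", max_len=120):
    """Return a file name that is safe to create on Windows and POSIX."""
    # Replace invalid characters, then drop leading/trailing spaces and dots.
    cleaned = _INVALID_CHARS.sub("_", name).strip(" .")
    if not cleaned:
        cleaned = default
    # "CON.mp3" is as unusable as "CON", so check the part before the dot.
    if cleaned.split(".")[0].upper() in _RESERVED_NAMES:
        cleaned = "_" + cleaned
    return cleaned[:max_len].rstrip(" .")
```

The function sanitizes a single name component only; directory paths chosen by the user would still be validated separately.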

FR-7: Optional video pipeline cleanup

The project shall either simplify static-image video generation or clearly isolate it as a secondary feature.
This shall not block the core text-to-audio modernization.

6.2 Non-Functional Requirements

Keep runtime dependencies lightweight.
Keep modules small and understandable.
Prefer deterministic, testable functions.
Maintain Windows-first usability while avoiding OS-specific breakage where possible.
Avoid hidden magic; use configuration defaults instead of sprawling settings.

7. Proposed Product Direction

7.1 Recommended UX Direction

Keep the current Tkinter desktop UI, but modernize the behavior under the hood.

Recommended user-facing simplification:

Quality Preset: Best Quality, Balanced, Fast (applies to OpenAI; non-OpenAI providers ignore the preset and use direct model/voice selection)
Provider Dropdown: OpenAI / Ollama (Local) / Kokoro (Local) / VibeVoice (Local)
Model Dropdown: manually selectable, populated from provider discovery when available
Voice Dropdown: supported voices only, filtered by provider (Kokoro: 54 voices across 8 languages; VibeVoice: speaker-slot mapping for up to 4 speakers)
Optional Advanced Settings: collapsed or deferred unless needed
First-use notice: selecting VibeVoice shows a one-time dialog summarising the research/dev-only license and the upstream safety features (watermark + audible disclaimer) before any download starts

Rationale:

avoids overwhelming users with model names
allows model upgrades later without changing the UI contract
preserves simplicity while still feeling modern

7.1A Approved small UX improvements

The stakeholder approved small UX improvements while keeping the UI nearly identical. Recommended low-risk improvements are:

disable the Start button while processing
add a small status/progress label (for example: Reading file, Chunk 3/12, Merging audio, Creating video)
add provider and model dropdowns without redesigning the layout
add a Refresh Models button so the latest available models can be fetched where supported
improve validation messages for missing API key, invalid paths, and unsupported local provider state
preserve the existing single-window flow and overall field order as much as possible

7.2 Recommended Model Strategy

Do not hardcode a forever-model into the architecture.

Instead:

define a default recommended model in config/constants
allow a fallback model
map UI presets to model settings internally

Example strategy:

Best Quality → latest recommended high-quality speech model
Balanced → stable general-use model
Fast → lower-latency or lower-cost option

Final exact model names should be confirmed against current OpenAI docs during implementation.
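
The preset strategy above can be sketched as a single lookup table plus a resolver. The model names other than tts-1 are placeholders chosen for illustration only; as noted, the real mapping has to be confirmed against current OpenAI documentation. The function and constant names are hypothetical.

```python
DEFAULT_PRESET = "Balanced"
FALLBACK_MODEL = "tts-1"  # the model the current code hardcodes

# PLACEHOLDER mapping: confirm the model names before shipping.
PRESET_TO_MODEL = {
    "Best Quality": "gpt-4o-mini-tts",
    "Balanced": "tts-1-hd",
    "Fast": "tts-1",
}


def resolve_model(preset=None, explicit_model=None, available=None):
    """Pick the model to request.

    An explicit model override wins over the preset. When `available` is
    given (for example from model discovery) and the preset's model is not
    in it, the fallback model is used instead.
    """
    if explicit_model:
        return explicit_model
    model = PRESET_TO_MODEL.get(preset or DEFAULT_PRESET, FALLBACK_MODEL)
    if available is not None and model not in available:
        return FALLBACK_MODEL
    return model
```

Upgrading to a newer model then means editing one dictionary entry, while the UI keeps showing the same three preset labels.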

7.2A Auto-pulling latest models

The stakeholder requested support for pulling the latest model names where possible.

Recommended approach:

OpenAI: attempt provider-backed model discovery if the SDK/API exposes suitable listing capabilities for the intended TTS endpoint; otherwise use a maintained allowlist curated in config and updated during releases
Ollama: query the local Ollama API to list installed local models
UI behavior: provide a Refresh Models action and store the last successful result for the current session

Important note: model listing and model usability are not always the same thing. Even if a provider returns a large model list, the app should still filter to models known or configured to support the project’s use case.

7.3 Recommended Configuration Strategy

Minimal config hierarchy:

explicit UI choices
environment variables
local config file defaults
code defaults as last resort

Suggested configurable items:

provider
API key source
default voice
default quality preset
optional explicit model override
default output directory
max concurrency
local Ollama base URL
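
The four-level hierarchy can be expressed with collections.ChainMap, where the first mapping that contains a key wins. In the sketch below the T2AB_ environment prefix, the setting names, and the default values are all hypothetical.

```python
from collections import ChainMap

ENV_PREFIX = "T2AB_"  # hypothetical prefix for this app's settings

CODE_DEFAULTS = {
    "provider": "openai",
    "max_concurrency": 4,
    "ollama_base_url": "http://localhost:11434",
}


def env_overrides(environ, defaults):
    """Read T2AB_<KEY> variables, coerced to the type of the code default."""
    found = {}
    for key, default in defaults.items():
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            found[key] = type(default)(raw)  # e.g. "2" -> 2 for int defaults
    return found


def resolve_settings(ui=None, environ=None, file_cfg=None, defaults=None):
    """Merge settings with precedence: UI > env > config file > defaults."""
    defaults = CODE_DEFAULTS if defaults is None else defaults
    # Empty UI fields mean "not chosen", so they must not mask lower levels.
    ui = {k: v for k, v in (ui or {}).items() if v not in (None, "")}
    env = env_overrides(environ or {}, defaults)
    return dict(ChainMap(ui, env, file_cfg or {}, defaults))
```

For example, a max_concurrency of 8 in the config file is overridden by T2AB_MAX_CONCURRENCY=2 in the environment, which in turn would be overridden by an explicit UI choice.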

8. Scope by Release

Release phase numbers in this section are independent of the implementation phase numbers in Section 9: release Phases 4, 4B, and 4C correspond to tracker Phases 6, 6B, and 6C.

8.1 Phase 1 — Core Modernization (Must Have)

official OpenAI SDK integration
provider abstraction layer
config-driven model selection
env var API key with key.txt fallback
bounded concurrency
retry/backoff for transient failures
improved logging

8.2 Phase 2 — Output Quality and UX (Should Have)

improved text chunking
optional speed control if supported and simple
provider/model refresh UX
GUI progress state / disable button during work
clearer user-facing error messages

8.3 Phase 3 — Video / Packaging Cleanup (Could Have)

refactor or simplify combine_and_convert.py
reduce unnecessary dependency complexity
improve README/setup docs

8.4 Phase 4 — Local Provider Support (Must Have per stakeholder)

Ollama integration for local model discovery and invocation, if the selected local model/provider path can support the desired workflow
local provider setup checks and user guidance
provider-specific validation and fallback behavior

8.5 Phase 4B — Kokoro-82M Local Provider (Should Have)

HuggingFace download/caching via huggingface_hub with pinned revision
espeak-ng presence check with actionable install guidance (Windows .msi link, Linux apt hint)
Kokoro KPipeline invocation per chunk
WAV→MP3 conversion via pydub to keep pipeline output uniform
voice/language dropdown population from Kokoro's supported list
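
The espeak-ng presence check could start as a PATH lookup, as sketched below. This is a heuristic under the assumption that the espeak-ng executable is on PATH; a Windows .msi install may not add it there, so the real check might need to look in the install directory as well. Function and exception names are hypothetical. The install hints come from the dependency notes in this PRD.

```python
import shutil
import sys


class MissingSystemDependency(RuntimeError):
    """Raised when a required external program is not installed."""


def require_espeak_ng(which=shutil.which, platform=sys.platform):
    """Return the espeak-ng executable path or raise with install guidance.

    `which` and `platform` are parameters only so tests can simulate
    different machines.
    """
    path = which("espeak-ng")
    if path:
        return path
    if platform.startswith("win"):
        hint = ("Install the .msi from "
                "https://github.com/espeak-ng/espeak-ng/releases")
    elif platform.startswith("linux"):
        hint = "Run: apt-get install espeak-ng"
    else:
        hint = "Install espeak-ng with your system package manager"
    raise MissingSystemDependency(
        f"Kokoro needs espeak-ng, which was not found on PATH. {hint}."
    )
```

Running this once when the Kokoro provider is first selected surfaces the problem before any chunk is processed.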

8.6 Phase 4C — VibeVoice-1.5B Local Provider (Could Have, gated on GPU)

license/safety opt-in dialog on first selection
HuggingFace download/caching via huggingface_hub with pinned revision (~6 GB BF16)
GPU availability detection; clear failure if no compatible GPU
multi-speaker input support (script with speaker tags) — design or defer
preservation of upstream watermark and audible AI disclaimer in output

9. Detailed Phase Tracker

Phase 0 — Discovery and Approval

Confirm PRD approval from stakeholder
Confirm preferred simplicity level for UI changes
Confirm whether video generation remains in active scope or is secondary
Confirm acceptable dependency changes
Confirm baseline Python version target

Exit Criteria

PRD approved
open product questions answered or accepted as assumptions

Phase 1 — Architecture and Configuration

Add a settings/config helper module or equivalent lightweight utility
Add a provider abstraction layer for OpenAI and Ollama
Define config precedence: UI > env > config file > defaults
Add support for OPENAI_API_KEY with key.txt fallback
Define supported voice list in one place
Define quality preset mapping in one place
Move model defaults out of hardcoded request payloads
Define provider-specific capabilities and model filtering rules

Exit Criteria

configuration behavior documented and testable
model and voice defaults no longer buried in GUI/business logic

Phase 2 — TTS Engine Modernization

Replace raw HTTP TTS calls with official OpenAI SDK calls
Add bounded concurrency
Add retry/backoff for transient request failures
Add chunk-level logging and error reporting
Ensure file outputs remain deterministic and ordered

Exit Criteria

TTS pipeline works with current SDK
failure handling is predictable
ordering and output naming remain stable

Phase 2A — Model Discovery and Selection

Add model discovery for OpenAI where feasible and safe
Add local model discovery for Ollama
Add Refresh Models UI action
Add fallback curated model list when live discovery fails
Validate that only usable models are exposed for each provider path

Exit Criteria

model selection is reliable even when discovery partially fails
users can choose either presets or explicit models

Phase 3 — Text Processing Improvements

Refactor chunking to prefer paragraph boundaries
Add sentence-aware fallback splitting
Preserve chunk position metadata accurately
Improve logged sentence preview behavior
Keep implementation dependency-light

Exit Criteria

chunking is measurably cleaner on realistic text
edge cases are covered by unit tests

Phase 4 — GUI Reliability and UX

Disable start button while processing
Add visible progress/status updates
Improve path and filename validation
Improve missing-key and API-failure messaging
Keep UI changes minimal and understandable

Exit Criteria

GUI behavior is clearer during processing
preventable user errors are surfaced early

Phase 5 — Audio/Video Cleanup

Audit combine_and_convert.py behavior
Keep MP3-to-video in active scope as approved by stakeholder
Decide whether to keep MoviePy or replace parts with FFmpeg calls
Fix image/video handling ambiguity
Add tests around non-GUI helper logic where feasible
Update documentation for video generation usage

Exit Criteria

optional video feature is reliable or explicitly deprioritized

Phase 6 — Local Provider Support

Add Ollama connectivity checks
Add local provider invocation path
Add provider-specific settings such as base URL / model listing
Define fallback behavior when a selected local model is unavailable
Document limitations of local model quality/capability compared with hosted TTS

Exit Criteria

local provider path can be selected and validated
unsupported local configurations fail clearly and safely

Phase 6B — Kokoro-82M Local Provider

Add kokoro>=0.9.4, soundfile, huggingface_hub to project dependencies
Document espeak-ng install steps (Windows .msi, Linux apt) in README
Add an espeak-ng probe that runs when the Kokoro provider is first used, with an actionable error message when espeak-ng is missing
Implement _write_kokoro_speech helper in tts_conversion.py
Pin Kokoro model revision in settings.py
Add Kokoro voice list (54 voices across 8 langs) to supported-voices config
Add WAV→MP3 conversion step (pydub) for Kokoro chunks
Default max_concurrency to 1 for local providers (GPU/CPU memory-bound)
Unit tests covering provider dispatch, voice validation, WAV→MP3 conversion

Exit Criteria

Kokoro can be selected, downloaded once, and used to produce MP3 chunks
espeak-ng absence is detected and surfaced before any chunk fails

Phase 6C — VibeVoice-1.5B Local Provider

Add transformers, torch, accelerate, huggingface_hub to project dependencies (pin versions from upstream pyproject.toml)
Confirm whether upstream's "inference request logging" applies to local invocation; document findings
Implement first-use license/safety opt-in dialog
Implement _write_vibevoice_speech helper in tts_conversion.py
Pin VibeVoice model revision in settings.py
Implement GPU detection and clear failure mode when unavailable
Design multi-speaker input format OR scope to single-speaker for v1
Verify upstream watermark and audible disclaimer remain in output (do not strip)
Unit tests covering provider dispatch, GPU detection, opt-in gating

Exit Criteria

VibeVoice selectable only after explicit opt-in
runs end-to-end on a GPU-equipped machine; fails clearly on non-GPU
output retains upstream safety artifacts

Phase 7 — Testing, Validation, and Docs

Add automated unit test suite
Add focused integration tests with mocks/stubs
Add real API smoke tests for approved hosted-provider scenarios
Document human-only validation checklist
Update README and setup instructions
Produce release validation summary

Exit Criteria

automated tests pass locally
manual validation checklist completed for non-automatable areas

10. Testing Strategy

10.0 Testing is Mandatory and Tracked

Testing is not optional in this modernization effort. Every implementation phase must include explicit test work, test evidence, and a pass/fail status before the phase can be considered complete.

Required enforcement rules:

no feature is considered complete until its required tests are implemented and run
no phase may be marked complete unless its testing exit criteria are satisfied
any bug found during testing must either be fixed in the same phase or explicitly logged as a deferred issue with approval
real smoke tests must be tracked separately from mocked/integration tests
human validation items must be recorded as pending until explicitly confirmed by the stakeholder

10.0A Required test tracking artifacts

Implementation must maintain the following test tracking artifacts:

a test matrix mapping each feature to unit, integration, smoke, and manual validation coverage
a phase validation checklist showing pass/fail status for each phase
a defect log for failed tests, regressions, and deferred issues
a release validation summary documenting what was tested, what passed, what was skipped, and why

10.0B Required status values

Each tracked test item should use one of these statuses:

Not Started
In Progress
Passed
Failed
Blocked
Deferred (Approved)

10.0C Gating policy

A feature cannot move to Done without required automated test coverage.
A phase cannot move to Done while any required test item is Failed or Blocked unless the stakeholder explicitly approves an exception.
Release readiness cannot be declared until required smoke tests pass and all human validation items are either completed or explicitly waived.
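
If the tracking artifacts are kept in a machine-readable form, the phase gate could be checked automatically. The sketch below encodes the status values from 10.0B and one reading of the gating policy: unfinished items always block, while Failed or Blocked items block unless an exception is approved. That interpretation, and all names, are assumptions.

```python
from enum import Enum


class ItemStatus(Enum):
    NOT_STARTED = "Not Started"
    IN_PROGRESS = "In Progress"
    PASSED = "Passed"
    FAILED = "Failed"
    BLOCKED = "Blocked"
    DEFERRED_APPROVED = "Deferred (Approved)"


_UNFINISHED = {ItemStatus.NOT_STARTED, ItemStatus.IN_PROGRESS}
_NEEDS_EXCEPTION = {ItemStatus.FAILED, ItemStatus.BLOCKED}


def phase_can_close(required_items, exception_approved=False):
    """Return True when a phase may move to Done.

    `required_items` holds the statuses of the phase's required test items.
    """
    statuses = list(required_items)
    if not statuses:
        return False  # every phase must carry explicit test work
    if any(s in _UNFINISHED for s in statuses):
        return False  # unfinished work cannot be waived by an exception
    if any(s in _NEEDS_EXCEPTION for s in statuses):
        return exception_approved
    return True
```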

10.1 Testing Principles

The project currently has no tests. This modernization will add a pragmatic test strategy that validates core behavior without introducing excessive infrastructure.

Principles:

maximize coverage of deterministic logic with unit tests
isolate API dependencies via mocks
reserve human validation only for areas automation cannot reliably judge
avoid expensive or flaky test design

10.2 Test Pyramid

A. Unit Tests (highest priority)

Unit tests should cover:

Text Processing

reading text file success/failure
chunking under max length
chunking over max length
paragraph-aware splitting
sentence-aware fallback splitting
forced split fallback when no punctuation exists
position metadata correctness
sentence preview metadata correctness

Config / Settings

env var API key loading
fallback to key.txt
error when both are missing
config precedence behavior
default quality preset selection
provider selection and capability gating
model discovery fallback logic

TTS Helper Logic

request payload/model selection mapping
output filename generation
retry decision logic
chunk ordering preservation
bounded concurrency configuration
provider-specific invocation path selection

Audio Utilities

concatenation order behavior
empty list handling
invalid path handling where feasible

Video Helper Logic

only for pure helper functions that can be isolated cleanly
do not attempt heavy multimedia integration in basic unit tests

Local Provider Logic

Ollama availability detection
local model list parsing
provider fallback and unsupported-capability behavior

B. Integration Tests

Integration tests should remain lightweight and mostly mocked.

Recommended integration coverage:

text file → chunk generation → mocked TTS outputs → merged MP3 path creation
GUI-adjacent logic extracted into testable helper functions where possible
config loading + TTS selection flow
provider switching between OpenAI and Ollama using mocks/stubs

C. End-to-End / Smoke Tests

These should be limited and carefully scoped.

Recommended smoke tests:

run conversion on a very small sample input using mocked API responses
verify chunk files and final output file creation
run a real OpenAI smoke test with a tiny approved sample because the stakeholder explicitly requested real validation
optionally run an Ollama smoke test if a compatible local model is installed and available

10.3 Human-Only Validation Checklist

The human should only be asked to validate what cannot be reliably judged through automation.

Human validation required for:

subjective voice quality preference
whether the generated speech sounds natural enough for target use
whether selected default voices feel appropriate
real-cost tolerance for chosen model/preset
full GUI usability preferences
optional GPU/FFmpeg/video behavior on the user’s machine
whether local Ollama output quality is acceptable relative to the hosted provider

Manual Validation Sections

Voice Quality Review
- Listen to sample output across selected voices
- Confirm preferred default voice
Model / Preset Review
- Compare Best Quality vs Balanced vs Fast
- Confirm chosen defaults meet budget and speed expectations
Desktop UX Review
- Validate field labels are understandable
- Confirm progress messaging is clear
- Confirm startup flow is acceptable for non-technical users
Optional Video Validation
- Confirm output plays correctly on target system
- Confirm video/image behavior is acceptable

10.4 Suggested Test Tooling

pytest for test runner
unittest.mock or pytest-mock for mocking
temporary file fixtures for filesystem tests
avoid real network calls in default automated tests

10.5 Definition of Test Completion

Testing is considered complete when:

all unit tests pass locally
mocked integration tests pass locally
required smoke tests pass locally or are explicitly approved as deferred
the test matrix is up to date for all in-scope features
the defect log is updated for all failures encountered during validation
documentation for manual validation is written
stakeholder completes human-only validation checklist items relevant to approved scope

10.6 Required Test Matrix

The implementation must maintain a feature-to-test mapping similar to the following.

Feature / Area	Unit Tests	Integration Tests	Real Smoke Test	Human Validation	Status
Config loading and precedence	Required	Optional	No	No	Not Started
API key env var + key.txt fallback	Required	Required	No	No	Not Started
OpenAI SDK TTS flow	Required	Required	Required	Optional listening check	Not Started
OpenAI model discovery	Required	Required	Optional	No	Not Started
Ollama local model discovery	Required	Required	Optional	No	Not Started
Ollama generation path	Required	Required	Required if supported locally; otherwise optional	Yes, quality review	Not Started
Kokoro download and revision pinning	Required	Required	Optional	No	Not Started
Kokoro espeak-ng probe and error path	Required	Optional	No	No	Not Started
Kokoro generation path + WAV→MP3	Required	Required	Required (small sample)	Yes, quality review	Not Started
VibeVoice license opt-in gate	Required	Optional	No	Yes, consent flow review	Not Started
VibeVoice download and revision pinning	Required	Required	Optional	No	Not Started
VibeVoice GPU detection and failure mode	Required	Required	No	No	Not Started
VibeVoice generation path (preserves watermark)	Required	Required	Required (small sample, GPU-equipped)	Yes, quality + safety review	Not Started
Text chunking logic	Required	Optional	No	Optional output quality spot check	Not Started
Audio concatenation	Required	Required	Optional	Optional	Not Started
MP3-to-video pipeline	Required where practical	Required	Optional	Yes	Not Started
GUI progress and validation states	Required for helper logic	Optional	Optional	Yes	Not Started
Error handling / retry behavior	Required	Required	Optional	No	Not Started

10.7 Required Defect Log Format

Each failed or blocked test must be captured in a defect log with at least:

defect ID
phase
feature area
failing test name
date found
severity
current status
disposition (fix now, defer, won't fix)
stakeholder approval status if deferred

10.8 Phase Validation Tracker

Each phase must include a validation block during implementation:

Validation Block Template

Phase:
Implemented Items:
Unit Tests: Passed / Failed / Blocked
Integration Tests: Passed / Failed / Blocked
Smoke Tests: Passed / Failed / Blocked / N/A
Human Validation Needed: Yes / No
Defects Open:
Approved to Exit Phase: Yes / No

No phase should be closed without completing this validation block.

11. Full Validation Plan

This section defines what the implementation agent should test directly versus what should be escalated to the human.

11.1 Validation the implementation agent can do

static review of module boundaries
run unit tests locally
run mocked integration tests locally
run real hosted-provider smoke tests with approved small input and credentials/environment setup
validate config precedence behavior
validate chunking behavior across sample inputs
validate error handling for missing files / missing keys / mocked API failures
verify output files are created in test environments
verify deterministic ordering of chunks and merged outputs

11.2 Validation requiring the human

live API billing/cost acceptability
subjective speech quality and voice preference
whether latency is acceptable in real-world use
GUI polish preferences
machine-specific FFmpeg/GPU behavior
final acceptance of quality presets and other default settings

11.3 Acceptance Gate Before Release

Before release, require:

automated tests green
core text-to-audio workflow manually spot-checked by human
at least one real audio sample approved by human
documentation reviewed for setup accuracy

12. Risks and Mitigations

Risk	Impact	Mitigation
OpenAI model names or SDK APIs evolve	Medium	Keep model selection config-driven and isolate SDK usage
Live model discovery returns unusable or overly broad results	Medium	Filter discovered models through provider-specific allowlists/capability rules
Too many UI settings complicate the app	High	Use quality presets and keep advanced options optional or deferred
Rate limits or transient failures break multi-chunk conversion	High	Add retries, bounded concurrency, and chunk-level reporting
Chunking changes cause regressions	Medium	Add sample-based unit tests and preserve deterministic behavior
Video pipeline remains brittle	Medium	Make it a secondary scope item and simplify if retained
Local Ollama models may not provide TTS-quality output or compatible interfaces	High	Treat provider capabilities explicitly, document limitations, and fail clearly when unsupported
VibeVoice upstream restricts to research/dev use; commercial fit unclear	High	Surface license/intent dialog before first use; require explicit opt-in; document limitation in README
VibeVoice model is ~6 GB and requires GPU; many users will lack capability	Medium	GPU detection up front; clear failure path; keep Kokoro as the recommended local default
Kokoro requires external espeak-ng system dependency (manual install on Windows)	Medium	Detect missing dependency at startup of the provider path; show actionable install link
HF model revisions could drift between project versions	Medium	Pin commit hash/revision in settings.py; fail loudly on mismatch
Stripping or muting upstream safety artifacts (watermark, AI disclaimer)	High	Architectural prohibition; covered by code review and explicit test
Dependency bloat increases setup difficulty	Medium	Prefer stdlib + minimal libs; remove obsolete dependencies where possible; gate heavy deps (torch, transformers) to optional extras where feasible

13. Dependencies and Tooling Changes

Likely Dependency Direction

keep: openai, pydub
add: pytest, huggingface_hub
add (Kokoro provider): kokoro>=0.9.4, soundfile
add (VibeVoice provider): transformers, torch, accelerate (versions pinned from upstream pyproject.toml during impl)
system dependency (Kokoro): espeak-ng (Windows: .msi from https://github.com/espeak-ng/espeak-ng/releases; Linux: apt-get install espeak-ng)
possibly remove or reduce dependence on: requests
re-evaluate: moviepy
add or integrate with: Ollama local API client approach (lightweight HTTP or maintained Python package, depending on implementation choice)
consider gating heavy deps (torch, transformers) behind a vibevoice extras group so default install stays light

Proposed Development Dependencies

pytest
optionally pytest-mock if helpful

14. Open Questions for Stakeholder Approval

These are the approved implementation assumptions for final planning:

Ollama support: connect to the local Ollama instance and load locally available models as selectable options in the UI.
OpenAI model discovery: use fully dynamic listing where possible, while still filtering to valid/current models appropriate for this app’s workflow.
Real API smoke tests: use a conservative default validation budget chosen by implementation planning.

14.1 Final planning assumptions

Ollama behavior: the app should query the local Ollama API and show local models as available options. The implementation should still validate capabilities and warn or block when a selected local model cannot support the required generation flow. [Confirmed 2026-05-21]
OpenAI discovery behavior: the app should dynamically retrieve current model options where technically feasible. Because dynamic listings can be broad, the implementation should apply validation rules so only valid/current models relevant to the app are shown or are clearly labeled. [Confirmed 2026-05-21]
Real smoke test budget/time cap: default target should be under $1 per validation run and under 5 minutes total runtime. Smoke tests should use tiny sample inputs and minimize generated audio length. [Confirmed 2026-05-21]
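
The two discovery behaviors above can be sketched as follows. This is a hedged illustration: the filter rule (keep IDs containing "tts") is an assumption standing in for whatever validation rules implementation planning settles on, and the function names are hypothetical. The Ollama portion assumes the default local endpoint and the `/api/tags` listing route, which returns a JSON object with a `models` array.

```python
import json
from urllib.request import urlopen


def filter_tts_models(model_ids):
    """Keep only IDs that look like speech models.

    Assumed rule for illustration: the ID contains 'tts'.
    """
    return sorted(m for m in model_ids if "tts" in m.lower())


def parse_ollama_tags(payload: dict) -> list:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 3.0) -> list:
    """Query the local Ollama instance; return [] if it is unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_ollama_tags(json.load(resp))
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts;
        # ValueError covers a malformed JSON body.
        return []
```

Listing a local model does not show that it can generate speech, so the capability validation described above would still run on whatever `list_ollama_models` returns before a model is offered as usable.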

14.2 Open questions added by HuggingFace provider expansion

GPU availability: does the target machine have a CUDA-capable GPU with sufficient VRAM for VibeVoice-1.5B (BF16, ~3B params)? If not, Phase 6C is skipped or deferred. Decision (2026-05-21): No GPU assumed in v0.1 target machines. Phase 6.3 (VibeVoice) deferred to v0.2. (resolution mode: bulk-accept-all-defaults; see .paul/phases/00-discovery-and-approval/00-01-APPROVAL-PACKET.md)
VibeVoice license acceptance: stakeholder confirmation that the upstream "research and development only" guidance is acceptable for this project's distribution model, OR explicit decision to keep VibeVoice out of scope. Decision (2026-05-21): Accept research/dev-use license; require first-run opt-in dialog; do NOT ship binaries that include weights. Moot for v0.1 because Phase 6.3 is deferred; position locked for v0.2 re-use. (deferred: VibeVoice work moved to v0.2)
Disk budget: ~6 GB for VibeVoice weights + a few hundred MB for Kokoro. Confirm acceptable cache footprint. Decision (2026-05-21): Accept ~6 GB for VibeVoice (only when Phase 6.3 is enabled in v0.2); Kokoro ~500 MB always OK in v0.1.
HF cache location: default to standard HF cache (~/.cache/huggingface) or project-local cache directory? Decision (2026-05-21): Default to standard ~/.cache/huggingface; expose HF_HOME env var override. Implementation in Phase 6.2 (Kokoro).
Multi-speaker UX: for VibeVoice, do we expose multi-speaker scripting (e.g. [S1] line / [S2] line) in v1, or default to single-speaker mode? Decision (2026-05-21): Defer multi-speaker scripting to v0.2. Ship VibeVoice as single-speaker first if shipped at all. (deferred: tied to Phase 6.3)
Provider default: with four providers available, confirm the recommended default remains OpenAI for hosted-mode users and Kokoro for offline-mode users. Decision (2026-05-21): OpenAI is the default hosted provider; Kokoro is the recommended offline default. Drives Phase 4 GUI provider-default behavior.
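
The cache-location and provider-default decisions above reduce to two small rules, sketched here under stated assumptions. `resolve_hf_home` and `default_provider` are hypothetical helper names; the `env` parameter exists only to make the rule testable without touching the real environment.

```python
import os
from pathlib import Path


def resolve_hf_home(env=None) -> Path:
    """Return the HuggingFace cache root.

    HF_HOME wins when set; otherwise fall back to ~/.cache/huggingface.
    """
    env = os.environ if env is None else env
    override = env.get("HF_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "huggingface"


def default_provider(offline: bool) -> str:
    """OpenAI for hosted mode, Kokoro for offline mode (per the decision above)."""
    return "kokoro" if offline else "openai"
```

The `huggingface_hub` library already honors `HF_HOME` on its own, so a helper like this would mainly serve to display the effective cache path in the UI or README rather than to redirect downloads.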

15. Recommended Approval Path

Recommended default decisions if you want the simplest modernization

Keep the Tkinter UI.
Allow small targeted UX improvements without redesigning the workflow.
Keep both preset-based quality selection and direct model selection.
Keep key.txt fallback, but recommend OPENAI_API_KEY first.
Keep video generation in active scope.
Include real API smoke tests in addition to mocked automated tests.
Add Ollama as an optional local provider, with clear capability/quality caveats.
Add Kokoro-82M as a recommended lightweight local provider (Apache 2.0, CPU-capable).
Add VibeVoice-1.5B as an opt-in, GPU-only local provider, with explicit research/dev license dialog and preservation of upstream safety artifacts.
Query Ollama dynamically for locally available models.
Use dynamic OpenAI model discovery where feasible, with filtering/validation.
Pin HF model revisions for reproducibility.
Keep real smoke tests short and inexpensive by default (< $1, < 5 minutes).
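
The smoke-test cap in the last recommendation could be expressed as a guard like the one below. This is only a sketch: `SmokeBudget` and `within_budget` are hypothetical names, and the per-character rate is deliberately a caller-supplied parameter because this PRD does not state provider pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokeBudget:
    """Default caps from the PRD: under $1 and under 5 minutes per run."""
    max_usd: float = 1.00
    max_seconds: float = 300.0


def within_budget(total_chars: int, usd_per_million_chars: float,
                  elapsed_seconds: float,
                  budget: SmokeBudget = SmokeBudget()) -> bool:
    """Return True while the run stays under both caps.

    usd_per_million_chars is an assumed, caller-supplied rate;
    it must come from current provider pricing, not from this sketch.
    """
    estimated_usd = total_chars / 1_000_000 * usd_per_million_chars
    return (estimated_usd < budget.max_usd
            and elapsed_seconds < budget.max_seconds)
```

A smoke-test runner could call `within_budget` before each request and abort the run as soon as it returns False, which keeps an accidental large input from exceeding the agreed limits.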

16. Proposed Definition of Done

The modernization project is complete when all of the following are true:

TTS uses the official OpenAI SDK
model/voice behavior is config-driven
chunking is improved and covered by tests
retries and bounded concurrency are implemented
automated unit tests exist and pass locally
manual validation checklist exists for human-only review areas
README/setup docs are updated (including espeak-ng setup, HF cache notes, VibeVoice license acknowledgement)
Kokoro-82M provider works end-to-end with pinned revision and produces valid MP3 output
VibeVoice-1.5B provider (if in approved scope) works end-to-end on GPU with watermark + audible disclaimer preserved
stakeholder approves the resulting workflow and defaults
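
For the "retries and bounded concurrency" criterion, a minimal stdlib-only sketch might look like this. The names `with_retries` and `synthesize_all` are hypothetical, `synthesize` stands in for whichever provider call turns one text chunk into audio, and catching every `Exception` is a simplification that a real implementation would narrow to rate-limit and transient network errors.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(fn, *, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure, wait with exponential backoff plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # simplification: narrow this in real code
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay grows base_delay * 1, 2, 4, ... with random jitter added.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


def synthesize_all(chunks, synthesize, *, max_workers=3, attempts=3, base_delay=0.5):
    """Process chunks with at most max_workers in flight; results keep input order."""
    def run(chunk):
        return with_retries(lambda: synthesize(chunk),
                            attempts=attempts, base_delay=base_delay)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(run, chunks))
```

Because `Executor.map` preserves input order, the resulting audio segments can be concatenated directly, and `max_workers` is the single knob that bounds how many requests hit the provider at once.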

17. Implementation Readiness Summary

This project is a strong candidate for a lightweight modernization because the codebase is small and its weaknesses are concentrated in a few clear places. The best path is not a rewrite; it is a focused refactor with tests, config cleanup, and reliability improvements.

No code changes should begin until the stakeholder reviews this PRD and answers the open questions above.
