Commit da7900a

add(tts): expose chunk planning and segment metadata
1 parent 63c9a78 commit da7900a

16 files changed

Lines changed: 917 additions & 94 deletions

Sources/AgentRunKit/Documentation.docc/AgentRunKit.md

Lines changed: 6 additions & 0 deletions
@@ -164,8 +164,14 @@ For a complete walkthrough, see <doc:GettingStarted>.
 - ``TTSProvider``
 - ``TTSProviderConfig``
 - ``TTSAudioFormat``
+- ``TTSAudioEncoding``
 - ``OpenAITTSProvider``
 - ``TTSSegment``
+- ``TTSSegmentTiming``
+- ``TTSChunk``
+- ``TTSChunkContext``
+- ``TTSManifestEntry``
+- ``TTSConcatenationResult``
 - ``TTSOptions``
 
 ### MCP Integration

Sources/AgentRunKit/Documentation.docc/Articles/MultimodalAndAudio.md

Lines changed: 27 additions & 5 deletions
@@ -111,13 +111,14 @@ let tts = TTSClient(provider: provider, maxConcurrent: 4)
 
 ### Generating Audio
 
-Three methods cover different use cases:
+These methods cover different use cases:
 
 | Method | Returns | Behavior |
 |---|---|---|
 | `generate(text:voice:options:)` | `Data` | Single request, no chunking |
 | `stream(text:voice:options:)` | `AsyncThrowingStream<TTSSegment, Error>` | Chunked, yields ordered ``TTSSegment`` values as they complete |
 | `generateAll(text:voice:options:)` | `Data` | Chunked, concatenates all segments into one `Data` |
+| `chunks(for:)` | `[TTSChunk]` | The chunk plan this client will use, without invoking the provider |
 
 ```swift
 // Single generation
@@ -126,11 +127,15 @@ let audio = try await tts.generate(text: "Hello, world.", voice: "nova")
 // Streaming segments
 for try await segment in tts.stream(text: longArticle) {
     player.play(segment.audio)
-    print("chunk \(segment.index + 1)/\(segment.total) bytes \(segment.sourceRange): \(segment.text)")
+    let chunk = segment.chunk
+    print("chunk \(chunk.index + 1)/\(chunk.total) bytes \(chunk.sourceRange): \(chunk.text)")
 }
 
 // Full concatenated output
 let fullAudio = try await tts.generateAll(text: longArticle, options: TTSOptions(speed: 1.25))
+
+// Forecast the chunk plan without generating audio
+let plan = tts.chunks(for: longArticle)
 ```
 
 ### TTSOptions
@@ -144,19 +149,30 @@ let fullAudio = try await tts.generateAll(text: longArticle, options: TTSOptions
 
 The chunker splits input text on sentence boundaries using `NLTokenizer`. Sentences are packed into chunks up to the provider's `maxChunkCharacters` limit. Oversized sentences fall back to word-level, then character-level splitting. ``TTSClient`` dispatches up to `maxConcurrent` chunk requests in parallel using a task group. Results are buffered and yielded in original order.
 
-Each ``TTSSegment`` carries the chunker's output `text` and a `sourceRange` of UTF-8 byte offsets into the original input. For force-split chunks, `text` normalizes whitespace to single spaces while `sourceRange` covers the discontiguous span of the words it contains, preserving left-to-right monotonicity for caller-side highlighting and forced alignment.
+Each ``TTSSegment`` aggregates a ``TTSChunk`` (the unit of input text), a ``TTSAudioEncoding`` (the encoding ``TTSClient`` requested from the provider), a ``TTSSegmentTiming`` (audio-time metadata; both fields are `nil` until the framework computes them), and the audio bytes. The chunk, encoding, and timing are the canonical access path; flat computed properties on ``TTSSegment`` (`index`, `total`, `text`, `sourceRange`) forward to the chunk for log statements that need only those fields. For force-split chunks, `text` normalizes whitespace to single spaces while `sourceRange` covers the discontiguous span of the words it contains, preserving left-to-right monotonicity for caller-side highlighting and forced alignment.
+
+``TTSClient/chunks(for:)`` returns the same ``TTSChunk`` values the stream will emit, without calling the provider. Use it to forecast chunk identity before generation or to drive offline planning.
+
+``TTSConcatenationResult`` and ``TTSManifestEntry`` describe the shape of a manifest-aware concatenation that pairs the audio bytes with a per-segment manifest of chunk, encoding, and timing.
 
 For MP3 output, the concatenator strips ID3v2 headers, Xing/Info frames, and ID3v1 tails from interior segments for clean concatenation.
 
 ### Custom Providers
 
-Conform to ``TTSProvider`` to use any speech synthesis backend:
+Conform to ``TTSProvider`` to use any speech synthesis backend. ``TTSClient`` delivers a ``TTSChunkContext`` carrying the chunk plan and requested encoding alongside each call. Providers should treat `context.encoding` as the authoritative source for the format to produce, and can additionally use it for logging or request correlation:
 
 ```swift
 struct MyTTSProvider: TTSProvider {
     let config: TTSProviderConfig
 
-    func generate(text: String, voice: String, options: TTSOptions) async throws -> Data {
+    func generate(
+        text: String,
+        voice: String,
+        options: TTSOptions,
+        context: TTSChunkContext
+    ) async throws -> Data {
+        let chunkID = "\(context.chunk.index + 1)/\(context.chunk.total)"
+        log("synthesizing \(chunkID) as \(context.encoding.mimeType)")
         // Call your speech API and return audio bytes
     }
 }
@@ -181,4 +197,10 @@ let tts = TTSClient(provider: provider)
 - ``TTSProvider``
 - ``OpenAITTSProvider``
 - ``TTSSegment``
+- ``TTSSegmentTiming``
+- ``TTSChunk``
+- ``TTSChunkContext``
+- ``TTSAudioEncoding``
+- ``TTSManifestEntry``
+- ``TTSConcatenationResult``
 - ``TTSOptions``
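
The article above describes sentences being packed into chunks up to the provider's `maxChunkCharacters` limit. The greedy packing step can be pictured with the minimal sketch below. It is not the repository's `SentenceChunker`: `packSentences` is a hypothetical name, the input is assumed to be already split into sentences (the real chunker uses `NLTokenizer`), and oversized sentences are passed through unsplit instead of falling back to word-level or character-level splitting.

```swift
/// Hypothetical sketch: greedily packs pre-split sentences into chunks whose
/// character count stays within `maxCharacters` where possible.
func packSentences(_ sentences: [String], maxCharacters: Int) -> [String] {
    var chunks: [String] = []
    var current = ""
    for sentence in sentences {
        // Try to extend the current chunk with a single separating space.
        let candidate = current.isEmpty ? sentence : current + " " + sentence
        if candidate.count <= maxCharacters {
            current = candidate
        } else {
            // Close the current chunk and start a new one with this sentence.
            // Assumption: an oversized sentence becomes its own chunk here.
            if !current.isEmpty { chunks.append(current) }
            current = sentence
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}
```

With a limit of 10 characters, `["One.", "Two.", "Three."]` packs into `["One. Two.", "Three."]`, because adding the third sentence would push the first chunk to 16 characters.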

Sources/AgentRunKit/TTS/OpenAITTSProvider.swift

Lines changed: 19 additions & 4 deletions
@@ -31,7 +31,12 @@ public struct OpenAITTSProvider: TTSProvider, Sendable {
         )
     }
 
-    public func generate(text: String, voice: String, options: TTSOptions) async throws -> Data {
+    public func generate(
+        text: String,
+        voice: String,
+        options: TTSOptions,
+        context: TTSChunkContext
+    ) async throws -> Data {
         if let speed = options.speed {
             guard (0.25 ... 4.0).contains(speed) else {
                 throw TTSError.invalidConfiguration(
@@ -40,7 +45,12 @@ public struct OpenAITTSProvider: TTSProvider, Sendable {
             }
         }
 
-        let urlRequest = try buildURLRequest(text: text, voice: voice, options: options)
+        let urlRequest = try buildURLRequest(
+            text: text,
+            voice: voice,
+            options: options,
+            encoding: context.encoding
+        )
 
         do {
             let (data, _) = try await HTTPRetry.performData(
@@ -56,7 +66,12 @@ public struct OpenAITTSProvider: TTSProvider, Sendable {
         }
     }
 
-    func buildURLRequest(text: String, voice: String, options: TTSOptions) throws -> URLRequest {
+    func buildURLRequest(
+        text: String,
+        voice: String,
+        options: TTSOptions,
+        encoding: TTSAudioEncoding
+    ) throws -> URLRequest {
         let url = baseURL.appendingPathComponent("audio/speech")
         var urlRequest = URLRequest(url: url)
         urlRequest.httpMethod = "POST"
@@ -67,7 +82,7 @@ public struct OpenAITTSProvider: TTSProvider, Sendable {
             model: model,
             input: text,
             voice: voice,
-            responseFormat: (options.responseFormat ?? config.defaultFormat).rawValue,
+            responseFormat: encoding.format.rawValue,
             speed: options.speed
         )
         urlRequest.httpBody = try JSONEncoder().encode(body)

Sources/AgentRunKit/TTS/SentenceChunker.swift

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ enum SentenceChunker {
         return (lowerOffset + trimShift) ..< (upperOffset + trimShift)
     }
 
-    private static func trimByteOffset(in original: String) -> Int {
+    static func trimByteOffset(in original: String) -> Int {
         guard let firstNonWS = original.unicodeScalars.firstIndex(where: {
             !CharacterSet.whitespacesAndNewlines.contains($0)
         }) else { return 0 }
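
This hunk drops `private` from `trimByteOffset(in:)`, and elsewhere in the commit `TTSClient.generate` uses it to build a single-chunk `sourceRange` as `leadingShift ..< (leadingShift + trimmed.utf8.count)`. The rest of the function body is not visible in this diff, so the sketch below is only an assumption about the idea: count the UTF-8 bytes of leading whitespace, returning 0 for all-whitespace input as the visible `guard` does. `leadingTrimByteOffset` and `trimmedSourceRange` are hypothetical names.

```swift
import Foundation

/// Hypothetical stand-in: UTF-8 byte count of the leading whitespace,
/// or 0 when the string contains no non-whitespace scalar.
func leadingTrimByteOffset(in original: String) -> Int {
    var offset = 0
    for scalar in original.unicodeScalars {
        if !CharacterSet.whitespacesAndNewlines.contains(scalar) {
            return offset
        }
        // Each scalar contributes its own UTF-8 width to the offset.
        offset += String(scalar).utf8.count
    }
    return 0
}

/// Byte range of the trimmed text inside the original input, mirroring the
/// `leadingShift ..< (leadingShift + trimmed.utf8.count)` expression.
func trimmedSourceRange(of original: String) -> Range<Int> {
    let trimmed = original.trimmingCharacters(in: .whitespacesAndNewlines)
    let shift = leadingTrimByteOffset(in: original)
    return shift ..< (shift + trimmed.utf8.count)
}
```

For the input `"  Hello "`, the trimmed text is `Hello` and the range is `2..<7`, so the offsets still point into the untrimmed string the caller passed in.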
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+import Foundation
+
+/// The audio encoding ``TTSClient`` requests from a provider for a segment.
+public struct TTSAudioEncoding: Sendable, Equatable, Hashable, Codable {
+    public let format: TTSAudioFormat
+    public let mimeType: String
+    public let fileExtension: String
+    public let sampleRate: Int?
+    public let channels: Int?
+    public let bitsPerSample: Int?
+
+    public init(
+        format: TTSAudioFormat,
+        mimeType: String,
+        fileExtension: String,
+        sampleRate: Int? = nil,
+        channels: Int? = nil,
+        bitsPerSample: Int? = nil
+    ) {
+        self.format = format
+        self.mimeType = mimeType
+        self.fileExtension = fileExtension
+        self.sampleRate = sampleRate
+        self.channels = channels
+        self.bitsPerSample = bitsPerSample
+    }
+
+    public init(
+        _ format: TTSAudioFormat,
+        sampleRate: Int? = nil,
+        channels: Int? = nil,
+        bitsPerSample: Int? = nil
+    ) {
+        self.init(
+            format: format,
+            mimeType: format.mimeType,
+            fileExtension: format.fileExtension,
+            sampleRate: sampleRate,
+            channels: channels,
+            bitsPerSample: bitsPerSample
+        )
+    }
+}
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+import Foundation
+
+/// An audio container or codec the orchestrator can request from a ``TTSProvider``.
+public enum TTSAudioFormat: String, Sendable, Codable, CaseIterable {
+    case mp3, opus, aac, flac, wav, pcm
+
+    public var mimeType: String {
+        switch self {
+        case .mp3:
+            "audio/mpeg"
+        case .opus:
+            "audio/opus"
+        case .aac:
+            "audio/aac"
+        case .flac:
+            "audio/flac"
+        case .wav:
+            "audio/wav"
+        case .pcm:
+            "audio/L16"
+        }
+    }
+
+    public var fileExtension: String {
+        rawValue
+    }
+}
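
`TTSAudioEncoding` carries optional `sampleRate` and `channels` alongside the MIME type, and `.pcm` maps to `audio/L16`. As far as I know, the `audio/L16` registration (RFC 2586) defines a `rate` parameter and an optional `channels` parameter, so a caller serving raw PCM might want to fold those fields into a `Content-Type` value. The sketch below shows one way to do that. It uses trimmed stand-in types (`SketchAudioFormat`, `SketchAudioEncoding`) so it compiles on its own, and `contentType(for:)` is a hypothetical helper that is not part of this commit.

```swift
// Trimmed stand-ins for the commit's TTSAudioFormat and TTSAudioEncoding,
// reduced to the fields this sketch needs.
enum SketchAudioFormat: String {
    case mp3, wav, pcm

    var mimeType: String {
        switch self {
        case .mp3: return "audio/mpeg"
        case .wav: return "audio/wav"
        case .pcm: return "audio/L16"
        }
    }
}

struct SketchAudioEncoding {
    let format: SketchAudioFormat
    var sampleRate: Int? = nil
    var channels: Int? = nil
}

/// Hypothetical helper: builds a Content-Type value, appending `rate` and
/// `channels` parameters only for raw PCM and only when they are known.
func contentType(for encoding: SketchAudioEncoding) -> String {
    var value = encoding.format.mimeType
    guard encoding.format == .pcm else { return value }
    if let rate = encoding.sampleRate { value += ";rate=\(rate)" }
    if let channels = encoding.channels { value += ";channels=\(channels)" }
    return value
}
```

Container formats such as MP3 or WAV describe their own sample rate inside the file, which is why this sketch leaves their MIME type unparameterized.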
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+import Foundation
+
+/// A unit of input text that ``TTSClient`` synthesizes as one provider call.
+public struct TTSChunk: Sendable, Equatable, Hashable, Codable {
+    public let index: Int
+    public let total: Int
+    public let text: String
+    /// UTF-8 byte offsets into the original input string passed to the ``TTSClient`` call.
+    public let sourceRange: Range<Int>
+
+    public init(index: Int, total: Int, text: String, sourceRange: Range<Int>) {
+        self.index = index
+        self.total = total
+        self.text = text
+        self.sourceRange = sourceRange
+    }
+}
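
Because `sourceRange` is expressed in UTF-8 byte offsets rather than `String.Index` or character counts, callers that want to highlight the source text need to map the range back themselves. The following is a minimal sketch of that mapping, assuming a hypothetical helper named `slice(_:utf8Range:)`; it returns `nil` when the range is out of bounds or cuts through a multi-byte character.

```swift
import Foundation

/// Hypothetical helper: recovers the text covered by a UTF-8 byte range.
/// Returns nil if the range is out of bounds or is not valid UTF-8 on its own.
func slice(_ input: String, utf8Range: Range<Int>) -> String? {
    let bytes = Array(input.utf8)
    guard utf8Range.lowerBound >= 0, utf8Range.upperBound <= bytes.count else {
        return nil
    }
    // String(bytes:encoding:) fails on truncated multi-byte sequences.
    return String(bytes: bytes[utf8Range], encoding: .utf8)
}
```

For `"héllo wörld"`, the range `0..<6` yields `héllo` (the `é` takes two bytes), while `0..<2` yields `nil` because it ends in the middle of `é`. Note that for force-split chunks the documentation says `text` has normalized whitespace, so the recovered slice may differ from `chunk.text` in spacing.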
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+import Foundation
+
+/// The chunk and requested encoding ``TTSClient`` delivers to a provider for one synthesis call.
+public struct TTSChunkContext: Sendable, Equatable, Codable {
+    public let chunk: TTSChunk
+    public let encoding: TTSAudioEncoding
+
+    public init(chunk: TTSChunk, encoding: TTSAudioEncoding) {
+        self.chunk = chunk
+        self.encoding = encoding
+    }
+}

Sources/AgentRunKit/TTS/TTSClient.swift

Lines changed: 45 additions & 16 deletions
@@ -22,11 +22,30 @@ public struct TTSClient<P: TTSProvider>: Sendable {
         guard !trimmed.isEmpty else {
             throw TTSError.emptyText
         }
+        let encoding = TTSAudioEncoding(options.responseFormat ?? provider.config.defaultFormat)
+        let leadingShift = SentenceChunker.trimByteOffset(in: text)
+        let chunk = TTSChunk(
+            index: 0,
+            total: 1,
+            text: trimmed,
+            sourceRange: leadingShift ..< (leadingShift + trimmed.utf8.count)
+        )
+        let context = TTSChunkContext(chunk: chunk, encoding: encoding)
         return try await provider.generate(
             text: trimmed,
             voice: voice ?? provider.config.defaultVoice,
-            options: options
+            options: options,
+            context: context
+        )
+    }
+
+    /// The chunk plan this client will use for a given input, without invoking the provider.
+    public func chunks(for text: String) -> [TTSChunk] {
+        let internalChunks = SentenceChunker.chunk(
+            text: text,
+            maxCharacters: provider.config.maxChunkCharacters
         )
+        return Self.makePublicChunks(internalChunks)
     }
 
     public func stream(
@@ -35,25 +54,28 @@ public struct TTSClient<P: TTSProvider>: Sendable {
         options: TTSOptions = TTSOptions()
     ) -> AsyncThrowingStream<TTSSegment, Error> {
         let resolvedVoice = voice ?? provider.config.defaultVoice
-        let chunks = SentenceChunker.chunk(
+        let internalChunks = SentenceChunker.chunk(
             text: text,
             maxCharacters: provider.config.maxChunkCharacters
         )
 
-        guard !chunks.isEmpty else {
+        guard !internalChunks.isEmpty else {
             return AsyncThrowingStream { $0.finish(throwing: TTSError.emptyText) }
         }
 
+        let publicChunks = Self.makePublicChunks(internalChunks)
+        let encoding = TTSAudioEncoding(options.responseFormat ?? provider.config.defaultFormat)
         let provider = provider
         let maxConcurrent = maxConcurrent
 
         return AsyncThrowingStream { continuation in
             let task = Task {
                 do {
                     try await Self.executeChunks(
-                        chunks,
+                        publicChunks,
                         voice: resolvedVoice,
                         options: options,
+                        encoding: encoding,
                         provider: provider,
                         maxConcurrent: maxConcurrent,
                         continuation: continuation
@@ -89,10 +111,18 @@ public struct TTSClient<P: TTSProvider>: Sendable {
         return result
     }
 
+    private static func makePublicChunks(_ internalChunks: [SentenceChunker.Chunk]) -> [TTSChunk] {
+        let total = internalChunks.count
+        return internalChunks.enumerated().map { index, chunk in
+            TTSChunk(index: index, total: total, text: chunk.text, sourceRange: chunk.sourceRange)
+        }
+    }
+
     private static func executeChunks(
-        _ chunks: [SentenceChunker.Chunk],
+        _ chunks: [TTSChunk],
         voice: String,
         options: TTSOptions,
+        encoding: TTSAudioEncoding,
         provider: P,
         maxConcurrent: Int,
         continuation: AsyncThrowingStream<TTSSegment, Error>.Continuation
@@ -107,28 +137,29 @@ public struct TTSClient<P: TTSProvider>: Sendable {
 
             while nextToYield < totalChunks {
                 while activeTasks < maxConcurrent, nextToSend < totalChunks {
-                    let chunkIndex = nextToSend
-                    let chunk = chunks[chunkIndex]
+                    let chunk = chunks[nextToSend]
+                    let context = TTSChunkContext(chunk: chunk, encoding: encoding)
                     group.addTask {
                         do {
                             let data = try await provider.generate(
                                 text: chunk.text,
                                 voice: voice,
-                                options: options
+                                options: options,
+                                context: context
                             )
-                            return (chunkIndex, data)
+                            return (chunk.index, data)
                         } catch is CancellationError {
                             throw CancellationError()
                         } catch let error as TransportError {
                             throw TTSError.chunkFailed(
-                                index: chunkIndex,
+                                index: chunk.index,
                                 total: totalChunks,
                                 sourceRange: chunk.sourceRange,
                                 error
                             )
                         } catch {
                             throw TTSError.chunkFailed(
-                                index: chunkIndex,
+                                index: chunk.index,
                                 total: totalChunks,
                                 sourceRange: chunk.sourceRange,
                                 TransportError.other(String(describing: error))
@@ -144,12 +175,10 @@ public struct TTSClient<P: TTSProvider>: Sendable {
                     buffer[index] = data
 
                     while let audio = buffer.removeValue(forKey: nextToYield) {
-                        let chunk = chunks[nextToYield]
                         continuation.yield(TTSSegment(
-                            index: nextToYield,
-                            total: totalChunks,
-                            text: chunk.text,
-                            sourceRange: chunk.sourceRange,
+                            chunk: chunks[nextToYield],
+                            encoding: encoding,
+                            timing: .uncomputed,
                             audio: audio
                         ))
                         nextToYield += 1
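
The last hunk shows the pattern that keeps segments in order even though chunk requests finish out of order: results land in a dictionary keyed by chunk index, and the loop drains every entry that continues the run starting at `nextToYield`. The sketch below isolates that pattern in a synchronous form so it can be read without the task group around it. `ReorderBuffer` is a hypothetical type for illustration and does not appear in the commit.

```swift
/// Hypothetical sketch of the buffer-and-drain ordering used in `executeChunks`.
struct ReorderBuffer<Value> {
    private var pending: [Int: Value] = [:]
    private var nextToYield = 0

    init() {}

    /// Stores a result that arrived for `index` and returns every value that
    /// is now ready to be emitted, in original order.
    mutating func insert(_ value: Value, at index: Int) -> [Value] {
        pending[index] = value
        var ready: [Value] = []
        // Drain the contiguous run that begins at the next expected index.
        while let next = pending.removeValue(forKey: nextToYield) {
            ready.append(next)
            nextToYield += 1
        }
        return ready
    }
}
```

If chunk 2 finishes first, nothing is emitted. When chunk 0 arrives it is emitted alone, and when chunk 1 arrives both 1 and 2 are released together. The cost of this design is that a slow early chunk delays every later one, which is the usual tradeoff for ordered streaming output.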
