Commit e2ea254

add(tts): return concatenation manifests and retry helper
1 parent a3ae80a commit e2ea254

13 files changed

Lines changed: 968 additions & 176 deletions

Sources/AgentRunKit/Documentation.docc/AgentRunKit.md

Lines changed: 1 addition & 0 deletions
@@ -104,6 +104,7 @@ For a complete walkthrough, see <doc:GettingStarted>.
 - ``VertexGoogleClient``
 - ``ResponsesAPIClient``
 - ``RetryPolicy``
+- ``HTTPDataRetry``
 - ``GoogleAuthService``
 
 ### Provider Capabilities

Sources/AgentRunKit/Documentation.docc/Articles/MultimodalAndAudio.md

Lines changed: 47 additions & 4 deletions
@@ -118,6 +118,7 @@ These methods cover different use cases:
 | `generate(text:voice:options:)` | `Data` | Single request, no chunking |
 | `stream(text:voice:options:)` | `AsyncThrowingStream<TTSSegment, Error>` | Chunked, yields ordered ``TTSSegment`` values as they complete |
 | `generateAll(text:voice:options:)` | `Data` | Chunked, concatenates all segments into one `Data` |
+| `generateWithManifest(text:voice:options:)` | ``TTSConcatenationResult`` | Like `generateAll` but also returns a per-segment manifest of chunk, encoding, and timing |
 | `chunks(for:)` | `[TTSChunk]` | The chunk plan this client will use, without invoking the provider |
 
 ```swift
@@ -134,10 +135,25 @@ for try await segment in tts.stream(text: longArticle) {
 // Full concatenated output
 let fullAudio = try await tts.generateAll(text: longArticle, options: TTSOptions(speed: 1.25))
 
+// Concatenated audio plus a per-segment manifest
+let result = try await tts.generateWithManifest(text: longArticle, options: TTSOptions(responseFormat: .pcm))
+for entry in result.manifest {
+    if let range = entry.timing.byteRangeInConcatenatedAudio {
+        print("chunk \(entry.chunk.index): bytes \(range) of result.audio")
+    }
+}
+
 // Forecast the chunk plan without generating audio
 let plan = tts.chunks(for: longArticle)
 ```
 
+`generateWithManifest` populates ``TTSSegmentTiming/byteRangeInConcatenatedAudio`` for `pcm`
+output today. Other formats leave it `nil` until the framework can compute byte ranges defensibly.
+`generateAll` is implemented on top of the same path and returns `result.audio`.
+
+`stream` segments always carry ``TTSSegmentTiming/uncomputed`` timing. Per-segment audio is the
+raw chunk bytes, and final container offsets are only meaningful after concatenation.
+
 ### TTSOptions
 
 ``TTSOptions`` controls per-request parameters:
@@ -147,13 +163,26 @@ let plan = tts.chunks(for: longArticle)
 
 ### How Chunking Works
 
-The chunker splits input text on sentence boundaries using `NLTokenizer`. Sentences are packed into chunks up to the provider's `maxChunkCharacters` limit. Oversized sentences fall back to word-level, then character-level splitting. ``TTSClient`` dispatches up to `maxConcurrent` chunk requests in parallel using a task group. Results are buffered and yielded in original order.
+The chunker splits input text on sentence boundaries using `NLTokenizer`. Sentences are packed up to
+the provider's `maxChunkCharacters` limit. Oversized sentences fall back to word-level, then
+character-level splitting.
+
+``TTSClient`` dispatches up to `maxConcurrent` chunk requests in parallel. Results are buffered and
+yielded in original order.
 
-Each ``TTSSegment`` aggregates a ``TTSChunk`` (the unit of input text), a ``TTSAudioEncoding`` (the encoding ``TTSClient`` requested from the provider), a ``TTSSegmentTiming`` (audio-time metadata; both fields are `nil` until the framework computes them), and the audio bytes. The chunk, encoding, and timing are the canonical access path; flat computed properties on ``TTSSegment`` (`index`, `total`, `text`, `sourceRange`) forward to the chunk for log statements that need only those fields. For force-split chunks, `text` normalizes whitespace to single spaces while `sourceRange` covers the discontiguous span of the words it contains, preserving left-to-right monotonicity for caller-side highlighting and forced alignment.
+Each ``TTSSegment`` carries a ``TTSChunk``, a ``TTSAudioEncoding``, a ``TTSSegmentTiming``, and the
+audio bytes. The chunk, encoding, and timing fields are the canonical access path; flat properties
+on ``TTSSegment`` forward to the chunk for compact logging.
 
-``TTSClient/chunks(for:)`` returns the same ``TTSChunk`` values the stream will emit, without calling the provider. Use it to forecast chunk identity before generation or to drive offline planning.
+For force-split chunks, `text` normalizes whitespace to single spaces while `sourceRange` covers the
+span of the words it contains. That keeps ranges monotonic for caller-side highlighting and forced
+alignment.
 
-``TTSConcatenationResult`` and ``TTSManifestEntry`` describe the shape of a manifest-aware concatenation that pairs the audio bytes with a per-segment manifest of chunk, encoding, and timing.
+``TTSClient/chunks(for:)`` returns the same ``TTSChunk`` values the stream will emit, without calling
+the provider. Use it to forecast chunk identity before generation or to drive offline planning.
+
+``TTSConcatenationResult`` and ``TTSManifestEntry`` pair concatenated audio bytes with a per-segment
+manifest of chunk, encoding, and timing.
 
 For MP3 output, the concatenator strips ID3v2 headers, Xing/Info frames, and ID3v1 tails from interior segments for clean concatenation.
 
@@ -185,6 +214,19 @@ let provider = MyTTSProvider(config: TTSProviderConfig(
 let tts = TTSClient(provider: provider)
 ```
 
+For HTTP-backed providers, ``HTTPDataRetry`` exposes the same retry primitive
+``OpenAITTSProvider`` uses: exponential backoff with jitter and `Retry-After`-aware handling of
+429 responses. Pass a ``RetryPolicy`` and receive `(Data, HTTPURLResponse)` on success or a
+``TransportError`` on failure; cancellation propagates through `CancellationError`.
+
+```swift
+let (data, response) = try await HTTPDataRetry.perform(
+    urlRequest: request,
+    session: .shared,
+    retryPolicy: .default
+)
+```
+
 ## See Also
 
 - <doc:AgentAndChat>
- <doc:AgentAndChat>
@@ -204,3 +246,4 @@ let tts = TTSClient(provider: provider)
 - ``TTSManifestEntry``
 - ``TTSConcatenationResult``
 - ``TTSOptions``
+- ``HTTPDataRetry``
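
The packing step described under "How Chunking Works" can be illustrated with a minimal sketch. It is not the library's chunker: `packSentences` is a hypothetical name, the sentences are assumed to be split already (the documented chunker uses `NLTokenizer` for that), and oversized sentences go straight to character-level splitting, skipping the word-level fallback.

```swift
import Foundation

/// Hypothetical greedy packer: joins consecutive sentences with a single space
/// until adding the next one would exceed `limit` characters.
func packSentences(_ sentences: [String], limit: Int) -> [String] {
    precondition(limit > 0, "limit must be positive")
    var chunks: [String] = []
    var current = ""

    // Emits the chunk being built, if any, and starts a fresh one.
    func flush() {
        if !current.isEmpty {
            chunks.append(current)
            current = ""
        }
    }

    for sentence in sentences {
        if sentence.count > limit {
            // Simplification: hard-split oversized sentences by character count.
            flush()
            var rest = Substring(sentence)
            while !rest.isEmpty {
                chunks.append(String(rest.prefix(limit)))
                rest = rest.dropFirst(limit)
            }
            continue
        }
        let joined = current.isEmpty ? sentence : current + " " + sentence
        if joined.count > limit {
            flush()
            current = sentence
        } else {
            current = joined
        }
    }
    flush()
    return chunks
}

let packed = packSentences(["One.", "Two.", "Three is long."], limit: 10)
```

Every chunk stays within the limit, and chunks come out in input order, which is the property that lets results be yielded in original order after parallel dispatch.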
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+import Foundation
+
+/// A public retry facade shared by ``OpenAITTSProvider`` and custom HTTP-backed providers.
+public enum HTTPDataRetry: Sendable {
+    /// Sends the request with retry, throwing bare ``TransportError`` or `CancellationError`.
+    public static func perform(
+        urlRequest: URLRequest,
+        session: URLSession,
+        retryPolicy: RetryPolicy
+    ) async throws -> (Data, HTTPURLResponse) {
+        do {
+            return try await HTTPRetry.performData(
+                urlRequest: urlRequest,
+                session: session,
+                retryPolicy: retryPolicy
+            )
+        } catch is CancellationError {
+            throw CancellationError()
+        } catch let AgentError.llmError(transportError) {
+            throw transportError
+        }
+    }
+}

Sources/AgentRunKit/LLM/Transport/HTTPRetry.swift

Lines changed: 13 additions & 0 deletions
@@ -26,6 +26,9 @@ enum HTTPRetry {
             do {
                 (data, response) = try await session.data(for: urlRequest)
             } catch {
+                if isCancellation(error) {
+                    throw CancellationError()
+                }
                 lastError = TransportError.networkError(error)
                 continue
             }
@@ -79,6 +82,9 @@ enum HTTPRetry {
             do {
                 (bytes, response) = try await session.bytes(for: urlRequest)
             } catch {
+                if isCancellation(error) {
+                    throw CancellationError()
+                }
                 lastError = TransportError.networkError(error)
                 continue
             }
@@ -165,6 +171,13 @@ enum HTTPRetry {
         return nil
     }
 
+    static func isCancellation(_ error: any Error) -> Bool {
+        if error is CancellationError {
+            return true
+        }
+        return (error as? URLError)?.code == .cancelled
+    }
+
     static func collectErrorBody(from bytes: URLSession.AsyncBytes) async -> String {
         await withTaskGroup(of: String?.self) { group in
             group.addTask {
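
The effect of the new cancellation check can be sketched in isolation. The `isCancellation` helper below copies the logic added in this diff; the `retrying` loop around it is a hypothetical stand-in for `HTTPRetry`, with no backoff, jitter, or `Retry-After` handling, and exists only to show that a cancelled attempt ends the loop instead of being recorded as a retryable network error.

```swift
import Foundation

/// Same classification as the helper in the diff: structured-concurrency
/// cancellation, or a transfer that URLSession reports as URLError(.cancelled).
func isCancellation(_ error: any Error) -> Bool {
    if error is CancellationError {
        return true
    }
    return (error as? URLError)?.code == .cancelled
}

/// Hypothetical retry loop (an assumption, not the library's implementation).
func retrying<T>(
    maxAttempts: Int,
    _ operation: () async throws -> T
) async throws -> T {
    var lastError: (any Error)?
    for _ in 0 ..< maxAttempts {
        do {
            return try await operation()
        } catch {
            // Cancellation is terminal: surface it right away, never retry.
            if isCancellation(error) {
                throw CancellationError()
            }
            lastError = error
        }
    }
    if let error = lastError {
        throw error
    }
    // Only reached when maxAttempts is zero or negative.
    throw URLError(.unknown)
}
```

Normalizing both forms to `CancellationError` lets callers match on a single error type, whichever layer noticed the cancellation first.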

Sources/AgentRunKit/TTS/OpenAITTSProvider.swift

Lines changed: 6 additions & 12 deletions
@@ -52,18 +52,12 @@ public struct OpenAITTSProvider: TTSProvider, Sendable {
             encoding: context.encoding
         )
 
-        do {
-            let (data, _) = try await HTTPRetry.performData(
-                urlRequest: urlRequest, session: session, retryPolicy: retryPolicy
-            )
-            return data
-        } catch is CancellationError {
-            throw CancellationError()
-        } catch let AgentError.llmError(transportError) {
-            throw transportError
-        } catch {
-            throw TransportError.other(String(describing: error))
-        }
+        let (data, _) = try await HTTPDataRetry.perform(
+            urlRequest: urlRequest,
+            session: session,
+            retryPolicy: retryPolicy
+        )
+        return data
     }
 
     func buildURLRequest(

Sources/AgentRunKit/TTS/TTSClient.swift

Lines changed: 39 additions & 4 deletions
@@ -94,19 +94,54 @@ public struct TTSClient<P: TTSProvider>: Sendable {
         voice: String? = nil,
         options: TTSOptions = TTSOptions()
     ) async throws -> Data {
+        try await generateWithManifest(text: text, voice: voice, options: options).audio
+    }
+
+    /// Synthesizes the input and returns concatenated audio plus a per-segment manifest; chunk failure throws.
+    public func generateWithManifest(
+        text: String,
+        voice: String? = nil,
+        options: TTSOptions = TTSOptions()
+    ) async throws -> TTSConcatenationResult {
         var segments: [TTSSegment] = []
         for try await segment in stream(text: text, voice: voice, options: options) {
             segments.append(segment)
         }
 
         let effectiveFormat = options.responseFormat ?? provider.config.defaultFormat
-        if effectiveFormat == .mp3 {
-            return MP3Concatenator.concatenate(segments.map(\.audio))
+        let audio: Data = if effectiveFormat == .mp3 {
+            MP3Concatenator.concatenate(segments.map(\.audio))
+        } else {
+            Self.appendingConcatenation(segments.map(\.audio))
         }
 
-        var result = Data()
+        var manifest: [TTSManifestEntry] = []
+        manifest.reserveCapacity(segments.count)
+        var pcmCursor = 0
         for segment in segments {
-            result.append(segment.audio)
+            let timing: TTSSegmentTiming
+            if effectiveFormat == .pcm {
+                let lower = pcmCursor
+                pcmCursor += segment.audio.count
+                timing = TTSSegmentTiming(byteRangeInConcatenatedAudio: lower ..< pcmCursor)
+            } else {
+                timing = .uncomputed
+            }
+            manifest.append(TTSManifestEntry(
+                chunk: segment.chunk,
+                encoding: segment.encoding,
+                timing: timing
+            ))
+        }
+
+        return TTSConcatenationResult(audio: audio, manifest: manifest)
+    }
+
+    private static func appendingConcatenation(_ audioSegments: [Data]) -> Data {
+        var result = Data()
+        result.reserveCapacity(audioSegments.reduce(0) { $0 + $1.count })
+        for audioSegment in audioSegments {
+            result.append(audioSegment)
         }
         return result
     }
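
The PCM byte-range bookkeeping in `generateWithManifest` comes down to a running cursor over segment sizes. A self-contained sketch follows, using plain `Data` values in place of `TTSSegment`. The name `byteRanges(forAppended:)` is hypothetical, and the sketch assumes the segments are concatenated by plain appending, as `appendingConcatenation` does.

```swift
import Foundation

/// Hypothetical helper mirroring the manifest loop: maps each segment to its
/// half-open byte range within the appended output.
func byteRanges(forAppended segments: [Data]) -> [Range<Int>] {
    var cursor = 0
    return segments.map { segment in
        let lower = cursor
        cursor += segment.count
        return lower ..< cursor
    }
}

// Three stand-in segments, the last one empty.
let segments = [
    Data(repeating: 0x00, count: 4),
    Data(repeating: 0x01, count: 6),
    Data()
]
let ranges = byteRanges(forAppended: segments)

// Concatenate by appending, then recover the second segment from its range.
var audio = Data()
for segment in segments {
    audio.append(segment)
}
let recovered = Data(audio[ranges[1]])
```

These ranges are valid only because appending leaves every byte in place. The MP3 path strips headers and frames from interior segments, so the same arithmetic would give wrong offsets there, which fits the diff leaving non-PCM timing as `.uncomputed`.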
