fix(audio): stamp SW-decode buffers from a gapless sample clock (#89)

superuser404notfound · claude · superuser404notfound · commit 55bb619721ee · 2026-06-30T10:16:04.000+02:00
The software-path AudioDecoder stamped each CMSampleBuffer's presentationTimeStamp with the frame's container-quantized PTS. Container timebases are coarse (1 ms in MKV), so when a frame's duration is not an integer number of ticks, consecutive buffers no longer abut: a 1536-sample AC-3 frame is 34.83 ms at 44.1 kHz (vs exactly 32 ms at 48 kHz), so the integer-ms PTS step (35, 35, 34, 35 ...) leaves a sub-millisecond gap/overlap at every boundary and AVSampleBufferAudioRenderer reconciles a discontinuity at each frame (~29 Hz, a continuous crackle). 48 kHz AC-3 played cleanly on the same path; any non-integer-ms frame duration (48 kHz AAC/FLAC too) was affected.

Derive each buffer's PTS from a running sample count anchored to the first frame so consecutive buffers abut to the sample. A real source discontinuity (> 100 ms off the predicted clock: seek/edit) re-anchors so genuine gaps are not papered over; flush() drops the anchor; the clock advances only on a successfully emitted buffer so a dropped buffer injects no phantom samples.

Extracted as AudioClockAnchor, a pure mutating struct mirroring OutputTimestampSanitizer, with 7 unit tests reproducing the 44.1 kHz AC-3 crackle and covering jitter absorption, seek re-anchor, flush, the clean 48 kHz case, and dropped-buffer clock integrity. Full suite green (XCTest 208, Swift Testing 265).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZTEfmztPE8hAdjHdBr9BH
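
The rounding arithmetic behind the crackle can be checked without CoreMedia. The sketch below is illustrative only: the helper names (`quantizedPTSMillis`, `ptsStepsMillis`, `boundaryErrorSamples`) are hypothetical and not part of AetherEngine, and it assumes the container rounds each frame's true start time to the nearest 1 ms tick, as the test helper in this commit does.

```swift
// Illustrative only: these helpers are hypothetical and not part of AetherEngine.

/// Start time of frame `n`, rounded to the nearest 1 ms container tick.
/// Integer math (round half up); valid for non-negative inputs.
func quantizedPTSMillis(frame n: Int, samplesPerFrame: Int, sampleRate: Int) -> Int {
    let samples = n * samplesPerFrame
    return (2 * 1000 * samples + sampleRate) / (2 * sampleRate)
}

/// Differences between consecutive quantized PTS values for frames 0..<frameCount.
func ptsStepsMillis(frameCount: Int, samplesPerFrame: Int, sampleRate: Int) -> [Int] {
    let pts = (0..<frameCount).map {
        quantizedPTSMillis(frame: $0, samplesPerFrame: samplesPerFrame, sampleRate: sampleRate)
    }
    return zip(pts.dropFirst(), pts).map { $0 - $1 }
}

/// Gap (positive) or overlap (negative), in samples, between where the previous
/// buffer ends and where the next buffer starts when stamped `stepMillis` later.
func boundaryErrorSamples(stepMillis: Int, samplesPerFrame: Int, sampleRate: Int) -> Double {
    return Double(stepMillis * sampleRate) / 1000.0 - Double(samplesPerFrame)
}

let steps441 = ptsStepsMillis(frameCount: 6, samplesPerFrame: 1536, sampleRate: 44_100)
let steps48 = ptsStepsMillis(frameCount: 6, samplesPerFrame: 1536, sampleRate: 48_000)
print(steps441) // [35, 35, 34, 35, 35]
print(steps48)  // [32, 32, 32, 32, 32]

// A 35 ms step at 44.1 kHz leaves a 7.5-sample gap; a 34 ms step overlaps by about 36.6 samples.
print(boundaryErrorSamples(stepMillis: 35, samplesPerFrame: 1536, sampleRate: 44_100))
print(boundaryErrorSamples(stepMillis: 34, samplesPerFrame: 1536, sampleRate: 44_100))
```

At 48 kHz every step is exactly one frame (32 ms), so the boundary error is zero. At 44.1 kHz no integer-ms step matches the 1536-sample frame, so every boundary carries an error, which is why stamping from the sample count rather than from the container PTS removes the per-frame discontinuity.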
diff --git a/Sources/AetherEngine/Audio/AudioClockAnchor.swift b/Sources/AetherEngine/Audio/AudioClockAnchor.swift
@@ -0,0 +1,56 @@
+import Foundation
+import CoreMedia
+
+/// Gapless presentation-clock for the software AudioDecoder output. Container per-packet PTS are
+/// quantized to the container timebase (1 ms for MKV). When the decoded frame duration is not an
+/// integer number of those ticks -- e.g. a 1536-sample AC-3 frame at 44.1 kHz is 34.83 ms, not an
+/// integer ms -- stamping every CMSampleBuffer with its own quantized PTS leaves a +/-0.5 ms
+/// gap/overlap between consecutive buffers, and AVSampleBufferAudioRenderer reconciles a
+/// discontinuity at each boundary (~29 audible clicks/sec, a continuous crackle). 48 kHz AC-3
+/// (1536 samples = exactly 32 ms) is integer-ms so it never showed.
+///
+/// Fix: anchor to the first frame's PTS, then advance by emitted sample count so consecutive buffers
+/// abut to the sample. `reset()` on flush; a real source PTS discontinuity (seek/edit, > 100 ms off
+/// the predicted clock) re-anchors so genuine gaps aren't papered over. Mirrors
+/// `OutputTimestampSanitizer`'s focused, unit-testable mutating-struct shape (issue #89).
+struct AudioClockAnchor {
+    /// A real seek/edit moves the source clock by far more than container rounding jitter.
+    static let discontinuityThresholdSeconds = 0.10
+
+    private var anchorPTS: CMTime = .invalid
+    private var emittedSamplesSinceAnchor: Int64 = 0
+
+    /// Clear the anchor (flush/seek). The next `resolve` re-anchors to its `startPTS`.
+    mutating func reset() {
+        anchorPTS = .invalid
+        emittedSamplesSinceAnchor = 0
+    }
+
+    /// Decide the PTS to stamp on a buffer whose container PTS is `startPTS`, without mutating state.
+    /// `reanchor` distinguishes a fresh anchor (first buffer or post-discontinuity) from a continued
+    /// gapless run; pass it back to `commit` once the buffer is actually emitted.
+    func resolve(startPTS: CMTime, sampleRate: Int32) -> (pts: CMTime, reanchor: Bool) {
+        guard anchorPTS.isValid, startPTS.isValid else {
+            return (startPTS, true)  // first buffer after open/flush, or no source PTS to anchor to
+        }
+        let predicted = CMTimeAdd(
+            anchorPTS,
+            CMTime(value: emittedSamplesSinceAnchor, timescale: sampleRate)
+        )
+        if abs(CMTimeGetSeconds(CMTimeSubtract(startPTS, predicted))) > Self.discontinuityThresholdSeconds {
+            return (startPTS, true)   // real discontinuity (seek/edit): honour the source clock
+        }
+        return (predicted, false)     // gapless continuation: abut to the sample, ignore container rounding
+    }
+
+    /// Advance the clock after a buffer was successfully emitted at `pts`. Only called on success so a
+    /// dropped buffer does not inject phantom samples into the running count.
+    mutating func commit(pts: CMTime, reanchor: Bool, sampleCount: Int) {
+        if reanchor {
+            anchorPTS = pts
+            emittedSamplesSinceAnchor = Int64(sampleCount)
+        } else {
+            emittedSamplesSinceAnchor += Int64(sampleCount)
+        }
+    }
+}
diff --git a/Sources/AetherEngine/Audio/AudioDecoder.swift b/Sources/AetherEngine/Audio/AudioDecoder.swift
@@ -33,6 +33,11 @@ final class AudioDecoder: @unchecked Sendable {
     private var pendingStartPTS: CMTime = .invalid
     private var pendingSampleCount: Int = 0
 
+    /// Gapless presentation clock (issue #89): stamps each buffer from a running sample count so
+    /// consecutive buffers abut to the sample, instead of from each buffer's container-quantized PTS
+    /// (which clicks at every frame for non-integer-ms frame durations like 1536-sample AC-3 @ 44.1 kHz).
+    private var clock = AudioClockAnchor()
+
     #if DEBUG
     private var _loggedZeroConvert = false
     #endif
@@ -170,6 +175,8 @@ final class AudioDecoder: @unchecked Sendable {
         avcodec_flush_buffers(ctx)
         // Drop the coalesced samples; after a seek they'd be at the wrong PTS anyway.
         resetPending()
+        // Re-anchor the gapless clock to the post-seek PTS on the next emitted buffer.
+        clock.reset()
         #if DEBUG
         _loggedZeroConvert = false
         #endif
@@ -356,12 +363,18 @@ final class AudioDecoder: @unchecked Sendable {
             return nil
         }
 
+        // Gapless PTS (issue #89): derive this buffer's start from the running sample count so
+        // consecutive buffers abut exactly, instead of from the container-quantized PTS (which leaves
+        // +/-0.5 ms gaps and per-frame clicks for non-integer-ms frame durations). Committed only on
+        // success below, so a dropped buffer never advances the clock.
+        let (outPTS, reanchor) = clock.resolve(startPTS: startPTS, sampleRate: sampleRate)
+
         // Single timing entry: CoreMedia treats `duration` as per-SAMPLE, so LPCM must be 1/sampleRate. Stamping
         // the buffer total made GetDuration report totalSamples^2/sampleRate (~22s for 1024 samples), wedging
         // AudioPlaybackHost's buffer-ahead gate after one packet.
         var timing = CMSampleTimingInfo(
             duration: CMTime(value: 1, timescale: sampleRate),
-            presentationTimeStamp: startPTS,
+            presentationTimeStamp: outPTS,
             decodeTimeStamp: .invalid
         )
 
@@ -382,6 +395,7 @@ final class AudioDecoder: @unchecked Sendable {
         )
         resetPending()
         guard status == noErr, let sample = sampleBuffer else { return nil }
+        clock.commit(pts: outPTS, reanchor: reanchor, sampleCount: totalSamples)
         return sample
     }
 
diff --git a/Tests/AetherEngineTests/AudioClockAnchorTests.swift b/Tests/AetherEngineTests/AudioClockAnchorTests.swift
@@ -0,0 +1,147 @@
+import XCTest
+import CoreMedia
+@testable import AetherEngine
+
+/// Issue #89: the software AudioDecoder stamped each CMSampleBuffer with its container-quantized PTS.
+/// For frame durations that are not an integer number of container ticks (1536-sample AC-3 @ 44.1 kHz
+/// = 34.83 ms in a 1 ms MKV timebase) consecutive buffers no longer abut, and
+/// AVSampleBufferAudioRenderer clicks at every frame (~29 Hz, a continuous crackle). AudioClockAnchor
+/// stamps from a running sample count so buffers abut to the sample, re-anchoring only on a real
+/// (> 100 ms) source discontinuity.
+final class AudioClockAnchorTests: XCTestCase {
+
+    /// Container-quantized PTS the demuxer hands us: MKV carries a 1 ms timebase, so the per-packet
+    /// PTS is the frame's true time rounded to the nearest millisecond.
+    private func containerPTS(frame n: Int, samplesPerFrame: Int, sampleRate: Int32) -> CMTime {
+        let seconds = Double(n * samplesPerFrame) / Double(sampleRate)
+        let ms = Int64((seconds * 1000).rounded())
+        return CMTimeMake(value: ms, timescale: 1000)
+    }
+
+    /// Drive the anchor exactly as AudioDecoder.emitPending does: resolve, then commit on success.
+    @discardableResult
+    private func runStream(_ anchor: inout AudioClockAnchor,
+                           ptsList: [CMTime],
+                           samplesPerFrame: Int,
+                           sampleRate: Int32) -> [CMTime] {
+        var out: [CMTime] = []
+        for pts in ptsList {
+            let r = anchor.resolve(startPTS: pts, sampleRate: sampleRate)
+            anchor.commit(pts: r.pts, reanchor: r.reanchor, sampleCount: samplesPerFrame)
+            out.append(r.pts)
+        }
+        return out
+    }
+
+    // MARK: - The crackle bug
+
+    func testConsecutiveBuffersAbutToTheSample_441kHzAC3() {
+        let rate: Int32 = 44100
+        let spf = 1536  // AC-3 frame
+        var anchor = AudioClockAnchor()
+        let ptsList = (0..<10).map { containerPTS(frame: $0, samplesPerFrame: spf, sampleRate: rate) }
+
+        let out = runStream(&anchor, ptsList: ptsList, samplesPerFrame: spf, sampleRate: rate)
+
+        let expectedStep = Double(spf) / Double(rate)  // 34.8299... ms, the true frame length
+        for n in 1..<out.count {
+            let delta = CMTimeGetSeconds(CMTimeSubtract(out[n], out[n - 1]))
+            XCTAssertEqual(delta, expectedStep, accuracy: 1e-6,
+                "buffer \(n) must abut to the sample; got \(delta * 1000) ms, expected \(expectedStep * 1000) ms")
+        }
+    }
+
+    func testFirstBufferAnchorsToItsStartPTS() {
+        var anchor = AudioClockAnchor()
+        let start = CMTimeMake(value: 5000, timescale: 1000)
+        let r = anchor.resolve(startPTS: start, sampleRate: 44100)
+        XCTAssertTrue(r.reanchor)
+        XCTAssertEqual(r.pts, start)
+    }
+
+    // MARK: - Jitter is absorbed, real discontinuities re-anchor
+
+    func testSmallContainerJitterIsAbsorbed() {
+        let rate: Int32 = 44100
+        let spf = 1536
+        var anchor = AudioClockAnchor()
+
+        let r0 = anchor.resolve(startPTS: .zero, sampleRate: rate)
+        anchor.commit(pts: r0.pts, reanchor: r0.reanchor, sampleCount: spf)
+
+        // True next boundary is 34.83 ms; the container rounds it to 35 ms. The 0.17 ms jitter is what
+        // produced the per-frame click, so it must be absorbed (predicted used, not the rounded PTS).
+        let jittery = CMTimeMake(value: 35, timescale: 1000)
+        let r1 = anchor.resolve(startPTS: jittery, sampleRate: rate)
+        XCTAssertFalse(r1.reanchor, "sub-threshold container rounding must not re-anchor")
+        XCTAssertEqual(CMTimeGetSeconds(r1.pts), Double(spf) / Double(rate), accuracy: 1e-6,
+            "buffer must be stamped at the sample-accurate predicted time, not the rounded container PTS")
+    }
+
+    func testRealDiscontinuityReanchors() {
+        let rate: Int32 = 44100
+        let spf = 1536
+        var anchor = AudioClockAnchor()
+        let ptsList = (0..<5).map { containerPTS(frame: $0, samplesPerFrame: spf, sampleRate: rate) }
+        runStream(&anchor, ptsList: ptsList, samplesPerFrame: spf, sampleRate: rate)
+
+        let seek = CMTimeMake(value: 60_000, timescale: 1000)  // a 60 s jump dwarfs container jitter
+        let r = anchor.resolve(startPTS: seek, sampleRate: rate)
+        XCTAssertTrue(r.reanchor)
+        XCTAssertEqual(r.pts, seek, "a real source discontinuity must be honoured, not papered over")
+    }
+
+    func testResetReanchorsNextBuffer() {
+        let rate: Int32 = 44100
+        let spf = 1536
+        var anchor = AudioClockAnchor()
+        let ptsList = (0..<5).map { containerPTS(frame: $0, samplesPerFrame: spf, sampleRate: rate) }
+        runStream(&anchor, ptsList: ptsList, samplesPerFrame: spf, sampleRate: rate)
+
+        anchor.reset()
+
+        // 174 ms sits right on the predicted clock (5 * 1536 / 44100 = 174.1 ms); without reset it would
+        // be absorbed as a continuation. After a flush it must re-anchor to its own PTS instead.
+        let afterFlush = CMTimeMake(value: 174, timescale: 1000)
+        let r = anchor.resolve(startPTS: afterFlush, sampleRate: rate)
+        XCTAssertTrue(r.reanchor, "flush must drop the anchor so the post-seek buffer re-anchors")
+        XCTAssertEqual(r.pts, afterFlush)
+    }
+
+    // MARK: - The clean case stays clean, and dropped buffers don't drift the clock
+
+    func test48kHzAC3StaysGapless() {
+        let rate: Int32 = 48000
+        let spf = 1536  // exactly 32 ms, always was gapless
+        var anchor = AudioClockAnchor()
+        let ptsList = (0..<10).map { containerPTS(frame: $0, samplesPerFrame: spf, sampleRate: rate) }
+
+        let out = runStream(&anchor, ptsList: ptsList, samplesPerFrame: spf, sampleRate: rate)
+
+        let expectedStep = Double(spf) / Double(rate)
+        for n in 1..<out.count {
+            let delta = CMTimeGetSeconds(CMTimeSubtract(out[n], out[n - 1]))
+            XCTAssertEqual(delta, expectedStep, accuracy: 1e-6)
+        }
+    }
+
+    func testDroppedBufferDoesNotAdvanceClock() {
+        let rate: Int32 = 44100
+        let spf = 1536
+        var anchor = AudioClockAnchor()
+
+        // Frame 0 emits and commits: clock now holds 1536 samples.
+        let r0 = anchor.resolve(startPTS: .zero, sampleRate: rate)
+        anchor.commit(pts: r0.pts, reanchor: r0.reanchor, sampleCount: spf)
+
+        // Frame 1 resolves but its CMSampleBuffer creation fails, so it is never committed.
+        _ = anchor.resolve(startPTS: containerPTS(frame: 1, samplesPerFrame: spf, sampleRate: rate), sampleRate: rate)
+
+        // Frame 2 must predict off the still-1536 sample count (one frame in), proving the dropped
+        // buffer injected no phantom samples.
+        let r2 = anchor.resolve(startPTS: containerPTS(frame: 2, samplesPerFrame: spf, sampleRate: rate), sampleRate: rate)
+        XCTAssertFalse(r2.reanchor)
+        XCTAssertEqual(CMTimeGetSeconds(r2.pts), Double(spf) / Double(rate), accuracy: 1e-6,
+            "an uncommitted (dropped) buffer must not advance the sample clock")
+    }
+}
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -38,6 +38,8 @@ Source URL ──► Demuxer ──┬─► SoftwareVideoDecoder (dav1d) ──
                                               AVR / speakers
 ```
 
+`AudioDecoder` stamps each `CMSampleBuffer` from a running sample count anchored to the first frame (`AudioClockAnchor`), not from the container-quantized per-packet PTS. Container timebases are coarse (1 ms in MKV), so when a frame's duration is not an integer number of ticks (a 1536-sample AC-3 frame is 34.83 ms at 44.1 kHz but exactly 32 ms at 48 kHz), the quantized PTS values leave a sub-millisecond gap or overlap at every buffer boundary, and `AVSampleBufferAudioRenderer` reconciles a discontinuity at each one (~29 clicks/sec, a continuous crackle). Anchoring to the sample clock makes consecutive buffers abut exactly. A real source discontinuity (> 100 ms off the predicted clock, i.e. a seek or edit) re-anchors so genuine gaps are not papered over, and `flush()` drops the anchor. The clock advances only on a successfully emitted buffer, so a dropped buffer injects no phantom samples.
+
 AV1+DV (Profile 10.0 / 10.1 / 10.4) routes through the native path on hardware-AV1 hosts via the `dav1` / `av01` track type plus the source's `dvvC` box. AV1+Atmos is genuinely rare in the wild (mastering still runs in HEVC overwhelmingly), so the SW pipeline's lack of Atmos passthrough is a theoretical limitation rather than a real one. The dispatch happens once at load time; hosts see a unified `@Published` state surface either way.
 
 **Background audio (iOS).** When the app backgrounds while playing, the engine keeps audio going rather than tearing the pipeline down. The decision is a pure, unit-tested policy, `backgroundAction(isAudioBackend:hasSoftwareHost:keepVideoAlive:state:)`, driven from the `UIApplication` lifecycle observers; `keepVideoAlive` comes from `shouldKeepVideoAlive(enabled:pipActive:state:)` and is gated to iOS (tvOS always tears down, wedge-safe: a frozen decode session crossing a multi-hour suspension wedged `mediaserverd`). On the native path "keep audio alive" is just declining to tear down: `AVPlayer` under the `.playback` session keeps decoding. The software path has no `AVPlayer`, and its combined demux loop normally paces the whole loop (audio and video) on the video renderer's `isReadyForMoreMediaData`; once `AVSampleBufferDisplayLayer` stops draining in the background that gate never reopens and audio would starve. So the host enters `backgroundAudioOnly`: the loop drops video packets and paces on the audio renderer (`AudioOutput.isReadyForMoreMediaData`) instead, keeping `AVSampleBufferAudioRenderer` fed and the synchronizer advancing. On foreground return the flag clears, the video decoder and renderer flush, and video resyncs at the next keyframe with audio uninterrupted. Scope is the combined VOD loop (and live-without-DVR, which shares it); the DVR feeder loop is unchanged. Exercise it headless with `aetherctl bgaudio` (see [cli.md](cli.md)).