
Export pipeline speed #1601

Merged
richiemcilroy merged 27 commits into main from cursor/export-pipeline-speed-cbf4
Feb 15, 2026

Conversation


@richiemcilroy richiemcilroy commented Feb 15, 2026

Implement GPU NV12 export path and optimize H264 encoder settings for faster video export.

The primary goal is to leverage the GPU for RGBA to NV12 conversion, drastically reducing CPU readback data and eliminating CPU-based swscale. Additionally, encoder settings are tuned for export throughput (e.g., realtime=false for VideoToolbox, tune=hq for NVENC) instead of recording-optimized low latency. This PR also includes necessary pipeline adjustments like flush_pipeline_nv12 and robust RGBA fallback handling to ensure correctness.



Greptile Summary

This PR introduces a GPU-accelerated NV12 export pipeline to replace the previous RGBA-based export path, reducing GPU readback data by ~62.5% and eliminating CPU-based swscale conversion. It also tunes H264 encoder settings for export quality/throughput (e.g., realtime=false for VideoToolbox, tune=hq for NVENC) separately from recording-optimized low-latency settings.
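As a rough illustration of the recording-vs-export split (a hedged sketch only: the option names below follow ffmpeg's per-encoder private options, and the real builder is `with_export_settings()` in crates/enc-ffmpeg, which may differ in structure):

```rust
// Illustrative mapping from H264 encoder to the export-mode options this PR
// describes. Not the project's actual API; names mirror ffmpeg private options.
fn export_opts(encoder: &str) -> Vec<(&'static str, &'static str)> {
    match encoder {
        "h264_videotoolbox" => vec![("realtime", "false"), ("profile", "main")],
        "h264_nvenc" => vec![("preset", "p5"), ("tune", "hq")],
        "h264_qsv" => vec![("preset", "medium")],
        "h264_amf" => vec![("quality", "quality")],
        // libx264 and other software encoders fall back to a throughput preset
        _ => vec![("preset", "veryfast")],
    }
}

fn main() {
    assert_eq!(
        export_opts("h264_nvenc"),
        vec![("preset", "p5"), ("tune", "hq")]
    );
}
```

The key design point is that these are selected only when `is_export` is set, leaving the recording path on its low-latency settings.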

  • GPU NV12 pipeline: New render_video_to_channel_nv12 function and export_nv12 method that perform RGBA-to-NV12 conversion on the GPU via compute shader, with robust RGBA fallback when GPU conversion fails
  • Pipelined rendering: render() and render_nv12() now return Option<Frame> to support double-buffered readback pipelining; new render_immediate() wraps the old blocking behavior for callers that need synchronous frame output
  • Frame prefetching: Overlaps next-frame decoding with current-frame rendering via tokio::join! for better pipeline utilization
  • Arc-wrapped frame data: RenderedFrame.data and Nv12RenderedFrame.data changed from Vec<u8> to Arc<Vec<u8>> to reduce cloning costs
  • Encoder optimizations: Per-encoder export settings (VideoToolbox, NVENC, QSV, AMF, MF, libx264) with with_export_settings() builder, reusable frame encoding via queue_frame_reusable, and with_external_conversion() to skip internal swscale
  • Decoder threading: Software decode now uses both FF_THREAD_FRAME and FF_THREAD_SLICE with CPU-count-aware thread limits
  • Clippy fixes: Collapsed nested if-let chains across test crates, derived Default for AudioConfig
  • Potential dimension mismatch: The aligned output_size (4-aligned width) used for the encoder may not match the renderer's actual output dimensions when resolutions aren't multiples of 4, since self.resolution_base (unaligned) is passed to the renderer
  • Test compilation issue: ensure_nv12_data_passthrough_for_nv12_format test constructs Nv12RenderedFrame with Vec<u8> but the field now expects Arc<Vec<u8>>
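The dimension mismatch can be reproduced from the two alignment rules involved (a minimal sketch; `align_nv12` mirrors the `(w + 3) & !3` expression from mp4.rs, while `renderer_output_size` is an illustrative stand-in for the renderer's 2-alignment described in the review):

```rust
// Sketch of the two alignment rules at play (function names are illustrative).
fn align_nv12(size: (u32, u32)) -> (u32, u32) {
    // Encoder side: width rounded up to a multiple of 4, height to 2.
    ((size.0 + 3) & !3, (size.1 + 1) & !1)
}

fn renderer_output_size(size: (u32, u32)) -> (u32, u32) {
    // Renderer side: per the review, only aligned to multiples of 2.
    (size.0 & !1, size.1 & !1)
}

fn main() {
    // 1278 is even but not a multiple of 4, so the two sides disagree:
    // the encoder expects 1280-wide frames, the renderer produces 1278.
    assert_eq!(align_nv12((1278, 720)), (1280, 720));
    assert_eq!(renderer_output_size((1278, 720)), (1278, 720));
}
```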

Confidence Score: 3/5

  • This PR has a test compilation error and a potential dimension mismatch bug for non-standard resolutions, but the core NV12 pipeline logic is sound for common resolutions.
  • Score of 3 reflects: (1) a definite compilation error in the test suite where Vec<u8> is used where Arc<Vec<u8>> is needed, and (2) a potential dimension mismatch between the NV12-aligned encoder dimensions and the renderer's actual output dimensions when width is even but not a multiple of 4. The core pipeline changes (pipelined readback, prefetching, Arc-wrapped data, encoder settings) are well-structured and the RGBA fallback path provides safety for edge cases.
  • crates/export/src/mp4.rs needs attention for both the test compilation fix and the dimension mismatch between aligned output_size and unaligned resolution_base passed to the renderer.

Important Files Changed

Filename Overview
crates/export/src/mp4.rs Major changes to switch export pipeline to NV12 GPU path. Contains a dimension mismatch bug between renderer output and encoder expectations for non-4-aligned widths, and a test compilation error from Arc<Vec<u8>> type change.
crates/rendering/src/lib.rs Adds render_video_to_channel_nv12, decode_segment_frames_with_retry helper, prefetch-while-rendering optimization, and render_immediate wrapper. Pipeline signature changes from direct frame return to Option<RenderedFrame> for pipelined readback. Well-structured refactor.
crates/rendering/src/frame_pipeline.rs Changes RenderedFrame.data and Nv12RenderedFrame.data from Vec<u8> to Arc<Vec<u8>> for cheaper cloning. Adds bind group caching using pointer-based texture identity. Pipelined readback now returns Option instead of blocking. Adds Nv12RenderedFrame helper methods.
crates/enc-ffmpeg/src/video/h264.rs Adds is_export flag to distinguish recording vs export encoder settings. Export mode enables quality-optimized settings per-encoder: VideoToolbox (realtime=false, profile=main), NVENC (preset=p5, tune=hq), QSV (preset=medium), AMF (quality=quality), libx264 (veryfast default). Adds queue_frame_reusable method for zero-alloc frame encoding.
apps/desktop/src-tauri/src/export.rs Updates callers to use render_immediate (which handles the new Option return from pipelined render). Removes a 200ms sleep before export start.
apps/desktop/src-tauri/src/frame_ws.rs Changes WSFrame.data from Vec<u8> to Arc<Vec<u8>> to match the rendering crate's change. Uses Arc::try_unwrap with clone fallback when ownership is needed. Collapses nested if-let per clippy lint.
crates/video-decode/src/ffmpeg.rs Improves software decoder threading: uses both FF_THREAD_FRAME and FF_THREAD_SLICE, adjusts thread count to use clamp instead of hard-coded values, better scaling with CPU count.

Flowchart

flowchart TD
    A[render_video_to_channel_nv12] --> B{Decode Frame}
    B -->|Prefetch next frame| C[tokio::join! decode + render]
    B -->|No prefetch| D[render_nv12]
    C --> D
    D --> E{GPU NV12 Converter}
    E -->|Success| F[Pipelined NV12 Readback]
    E -->|Width not 4-aligned or GPU fail| G[RGBA Fallback via finish_encoder]
    F --> H[Return previous NV12 frame]
    G --> I[Nv12RenderedFrame with format=Rgba]
    H --> J[mpsc channel to encoder thread]
    I --> J
    J --> K[ensure_nv12_data]
    K -->|format=Nv12| L[Pass through Arc data]
    K -->|format=Rgba| M[CPU swscale RGBA to NV12]
    L --> N[fill_nv12_frame]
    M --> N
    N --> O[queue_frame_reusable]
    O --> P[H264 Encoder with export settings]
    P --> Q[MP4 Output]

Last reviewed commit: 5db2514

cursoragent and others added 4 commits February 15, 2026 01:03
Major export pipeline speed optimization:

1. GPU NV12 Export Path (Phase 1):
   - Added render_video_to_channel_nv12() that renders frames and converts
     RGBA→NV12 on the GPU using the existing compute shader
   - Export now reads back NV12 (1.5 bytes/pixel) instead of RGBA (4 bytes/pixel)
     = 62.5% less GPU→CPU data transfer per frame
   - Eliminates CPU swscale RGBA→NV12 conversion entirely
   - Uses with_external_conversion() to skip encoder's internal converter
   - NV12 frames fed directly to H264 encoder via queue_frame()
   - Falls back to RGBA path if output dimensions aren't NV12-compatible
   - Screenshots generated from NV12 using cpu_yuv::nv12_to_rgba_simd()

2. Export-Optimized Encoder Settings (Phase 2):
   - VideoToolbox: realtime=false, profile=main (was realtime=true, baseline)
   - NVENC: preset=p5, tune=hq (was p4, tune=ll)
   - QSV: preset=medium, look_ahead_depth=20 (was faster)
   - AMF: quality=quality, rc=vbr_peak (was balanced, vbr_latency)
   - libx264: preset=veryfast (was ultrafast with zerolatency)
   - These settings optimize for throughput/quality instead of latency

3. Pipeline Tuning (Phase 4):
   - Increased channel buffer sizes from 8 to 16 for both stages
   - Allows better pipeline overlap between render/process/encode stages

Added Nv12RenderedFrame helper methods (y_plane, uv_plane, clone_metadata_with_data)
and NV12→ffmpeg frame conversion utility.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
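The 62.5% figure follows directly from the per-pixel sizes; a quick standalone check (not project code):

```rust
// NV12 stores a full-resolution Y plane plus a half-resolution interleaved
// UV plane: 1 + 0.5 = 1.5 bytes per pixel, versus 4 for RGBA.
fn rgba_bytes(w: usize, h: usize) -> usize {
    w * h * 4
}

fn nv12_bytes(w: usize, h: usize) -> usize {
    w * h + (w * h) / 2
}

fn main() {
    let (w, h) = (1920, 1080);
    assert_eq!(rgba_bytes(w, h), 8_294_400); // ~8.3 MB readback per frame
    assert_eq!(nv12_bytes(w, h), 3_110_400); // ~3.1 MB readback per frame
    // 1.0 - 1.5 / 4.0 = 0.625, i.e. 62.5% less data per frame
}
```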
When the GPU NV12 converter fails (e.g. incompatible dimensions at
runtime), finish_encoder_nv12 returns an Nv12RenderedFrame with
format=GpuOutputFormat::Rgba. The nv12_to_ffmpeg_frame function now
checks this format field and falls back to wrapping as RGBA when needed,
preventing corrupted output in edge cases.

Also added rgba_video_info for the fallback path to correctly describe
RGBA pixel format to the ffmpeg frame wrapper.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When the GPU NV12 converter falls back to returning RGBA data (rare edge
case), we now convert RGBA→NV12 using ffmpeg's swscale on CPU before
sending to the encoder. This ensures the encoder always receives NV12
frames since it was configured with external_conversion (no internal
converter).

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
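For intuition, the fallback conversion this commit describes amounts to something like the following (a self-contained integer BT.601 limited-range sketch; the real path delegates to ffmpeg's SIMD-optimized swscale rather than hand-rolled loops):

```rust
// Hedged sketch of RGBA -> NV12: full-resolution Y plane followed by one
// interleaved U,V pair per 2x2 block (chroma sampled at the top-left pixel).
fn rgba_to_nv12(rgba: &[u8], w: usize, h: usize) -> Vec<u8> {
    assert!(w % 2 == 0 && h % 2 == 0, "NV12 needs even dimensions");
    let mut out = vec![0u8; w * h + w * h / 2];
    let (y_plane, uv_plane) = out.split_at_mut(w * h);
    for row in 0..h {
        for col in 0..w {
            let i = (row * w + col) * 4;
            let (r, g, b) = (rgba[i] as i32, rgba[i + 1] as i32, rgba[i + 2] as i32);
            y_plane[row * w + col] = (((66 * r + 129 * g + 25 * b + 128) >> 8) + 16) as u8;
        }
    }
    for row in (0..h).step_by(2) {
        for col in (0..w).step_by(2) {
            let i = (row * w + col) * 4;
            let (r, g, b) = (rgba[i] as i32, rgba[i + 1] as i32, rgba[i + 2] as i32);
            let j = (row / 2) * w + col;
            uv_plane[j] = (((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128) as u8;
            uv_plane[j + 1] = (((112 * r - 94 * g - 18 * b + 128) >> 8) + 128) as u8;
        }
    }
    out
}

fn main() {
    let rgba = vec![128u8; 2 * 2 * 4]; // 2x2 mid-gray image
    let nv12 = rgba_to_nv12(&rgba, 2, 2);
    assert_eq!(&nv12[..4], &[126u8, 126, 126, 126][..]); // luma
    assert_eq!(&nv12[4..], &[128u8, 128][..]); // neutral chroma
}
```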
The NV12 pipelined readback always has one frame in flight. Without
flushing, the last exported frame would be lost. Added
flush_pipeline_nv12() to FrameRenderer and call it at the end of
render_video_to_channel_nv12() to ensure all frames are delivered.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>

cursor bot commented Feb 15, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

cursoragent and others added 19 commits February 15, 2026 01:17
- Remove panic-prone expect() calls in RGBA→NV12 CPU fallback path;
  gracefully handle scaler creation and conversion failures
- Add encoder thread timing: logs encoded frame count, elapsed time,
  and effective encode FPS for both NV12 and RGBA paths
- These metrics help identify whether the encoder is the bottleneck
  when profiling export performance

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Restructured the NV12 export path to eliminate per-frame ffmpeg frame
allocation (~3.1MB per frame at 1080p):

- Encoder thread now owns a single reusable frame::Video (NV12) that is
  filled from raw NV12 bytes each frame via fill_nv12_frame()
- Render task sends raw Nv12ExportFrame (Vec<u8> + metadata) through
  the sync_channel instead of pre-built frame::Video objects
- Added MP4File::queue_video_frame_reusable() that delegates to
  H264Encoder::queue_frame_reusable() for zero-allocation encoding
- Added ensure_nv12_data() for CPU RGBA→NV12 fallback when GPU
  converter returns RGBA data (extremely rare edge case)

This eliminates one heap allocation + zeroing per frame in the hot
encoding path, reducing memory pressure and improving cache locality.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
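The stride handling that makes the reusable-frame fill correct can be sketched like this (illustrative only; the real `fill_nv12_frame()` copies into ffmpeg's `frame::Video` planes, which are often padded so each destination row is wider than the visible width):

```rust
// Copies `rows` rows of `width` bytes from a tightly- or loosely-packed
// source plane into a destination plane with its own (possibly larger) stride.
fn copy_plane(
    src: &[u8],
    src_stride: usize,
    dst: &mut [u8],
    dst_stride: usize,
    width: usize,
    rows: usize,
) {
    for r in 0..rows {
        let s = &src[r * src_stride..r * src_stride + width];
        dst[r * dst_stride..r * dst_stride + width].copy_from_slice(s);
    }
}

fn main() {
    let src = [1u8, 2, 3, 4]; // 2x2 plane, tightly packed (stride 2)
    let mut dst = [0u8; 8]; // destination padded to stride 4
    copy_plane(&src, 2, &mut dst, 4, 2, 2);
    assert_eq!(dst, [1, 2, 0, 0, 3, 4, 0, 0]);
}
```

Declaring the wrong width or stride here is exactly how the dimension-mismatch bug flagged by the review would garble output.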
Replace manual pixel-by-pixel RGBA→NV12 conversion with ffmpeg's
swscale (SIMD-optimized) in the rare GPU fallback path. This makes the
fallback path significantly faster if it ever triggers.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Move ensure_nv12_data() before first_frame_data capture so the
screenshot data is always in NV12 format. Previously, if the GPU
NV12 converter fell back to RGBA, the screenshot would receive
RGBA data but interpret it as NV12, producing a corrupted image.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- fill_nv12_frame_preserves_data_layout: verifies Y and UV plane data
  is correctly copied into ffmpeg frame with proper stride handling
- ensure_nv12_data_passthrough_for_nv12_format: verifies NV12 data
  passes through without conversion when format is already NV12
- nv12_export_frame_dimensions_match: validates NV12 size calculations
  and confirms 62.5% data reduction vs RGBA at 1080p

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Deduplicate the frame decode-with-retry logic that was duplicated
between render_video_to_channel (RGBA) and render_video_to_channel_nv12.
Both functions now call the shared decode_segment_frames_with_retry()
helper, reducing ~100 lines of duplicated decode/retry/backoff code
to a single source of truth.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
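As a shape reference for the deduplicated helper (the real `decode_segment_frames_with_retry()` is async and decoder-specific; this is a hedged synchronous sketch of the retry-with-backoff pattern it centralizes):

```rust
use std::time::Duration;

// Retries a fallible operation up to `attempts` times, doubling the
// backoff between failures; returns the last error if all attempts fail.
fn with_retry<T, E>(
    mut attempts: u32,
    mut backoff: Duration,
    mut f: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    loop {
        match f() {
            Ok(v) => return Ok(v),
            Err(e) if attempts <= 1 => return Err(e),
            Err(_) => {
                std::thread::sleep(backoff);
                backoff *= 2; // exponential backoff between attempts
                attempts -= 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    let r: Result<u32, &str> = with_retry(3, Duration::from_millis(1), || {
        calls += 1;
        if calls < 3 { Err("decode failed") } else { Ok(42) }
    });
    assert_eq!(r, Ok(42));
    assert_eq!(calls, 3);
}
```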
@richiemcilroy richiemcilroy marked this pull request as ready for review February 15, 2026 21:03
Comment on lines +67 to +83
let output_size = ((raw_output_size.0 + 3) & !3, (raw_output_size.1 + 1) & !1);

if output_size != raw_output_size {
    info!(
        raw_width = raw_output_size.0,
        raw_height = raw_output_size.1,
        aligned_width = output_size.0,
        aligned_height = output_size.1,
        "Aligned output dimensions for NV12 GPU path"
    );
}

info!(
    width = output_size.0,
    height = output_size.1,
    "Using GPU NV12 export path (reduced readback + no CPU swscale)"
);

This output_size alignment is only applied on the encoder side. The renderer still derives its own output size from resolution_base/project, and RgbaToNv12Converter::submit_conversion returns false when width isn’t divisible by 4. In that case you’ll fall back to RGBA frames whose dimensions can differ from the aligned VideoInfo/ffmpeg frame size, which looks like it could silently corrupt output (truncated copy) when raw_output_size.0 % 4 != 0.

Suggested change

-   let output_size = ((raw_output_size.0 + 3) & !3, (raw_output_size.1 + 1) & !1);
-   if output_size != raw_output_size {
-       info!(
-           raw_width = raw_output_size.0,
-           raw_height = raw_output_size.1,
-           aligned_width = output_size.0,
-           aligned_height = output_size.1,
-           "Aligned output dimensions for NV12 GPU path"
-       );
-   }
-   info!(
-       width = output_size.0,
-       height = output_size.1,
-       "Using GPU NV12 export path (reduced readback + no CPU swscale)"
-   );
+   let output_size = raw_output_size;
+   info!(
+       width = output_size.0,
+       height = output_size.1,
+       "Exporting with NV12 pipeline (GPU when possible, CPU fallback otherwise)"
+   );

use cap_rendering::GpuOutputFormat;

if frame.format != GpuOutputFormat::Rgba {
    return frame.into_data();

Minor perf note: frame.into_data() will almost always clone here because render_video_to_channel_nv12 keeps last_successful_frame (another Arc) for fallback. If the goal is to avoid per-frame copies, it’d be worth carrying Arc<Vec<u8>> through Nv12ExportFrame for the NV12 case and only allocating/cloning when doing the RGBA->NV12 CPU fallback.
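The behavior this comment describes is easy to demonstrate in isolation (a standalone sketch, not project code): `Arc::try_unwrap`/`Arc::unwrap_or_clone` only moves the buffer out when the refcount is 1, so holding `last_successful_frame` as a second `Arc` forces a copy every frame.

```rust
use std::sync::Arc;

// Moves the Vec out when this Arc is the sole owner, otherwise clones it;
// equivalent to the try_unwrap + unwrap_or_else pattern in the diff below.
fn take_or_clone(data: Arc<Vec<u8>>) -> Vec<u8> {
    Arc::unwrap_or_clone(data)
}

fn main() {
    // Sole owner: the buffer is moved out, no copy is made.
    let sole = Arc::new(vec![1u8, 2, 3]);
    assert_eq!(take_or_clone(sole), vec![1, 2, 3]);

    // A second handle (like last_successful_frame) keeps the refcount > 1,
    // so the buffer is cloned instead of moved.
    let shared = Arc::new(vec![4u8, 5, 6]);
    let keepalive = Arc::clone(&shared);
    assert_eq!(take_or_clone(shared), vec![4, 5, 6]);
    assert_eq!(*keepalive, vec![4, 5, 6]); // original still intact
}
```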

Ok(frame) => {
    let packed = pack_frame_data(
-       frame.data,
+       std::sync::Arc::try_unwrap(frame.data).unwrap_or_else(|arc| (*arc).clone()),

Small readability tweak (and avoids the closure):

Suggested change

-   std::sync::Arc::try_unwrap(frame.data).unwrap_or_else(|arc| (*arc).clone()),
+   std::sync::Arc::unwrap_or_clone(frame.data),

}

pub fn into_data(self) -> Vec<u8> {
    Arc::try_unwrap(self.data).unwrap_or_else(|arc| (*arc).clone())

Same here—Arc::unwrap_or_clone reads a bit cleaner.

Suggested change

-   Arc::try_unwrap(self.data).unwrap_or_else(|arc| (*arc).clone())
+   Arc::unwrap_or_clone(self.data)


@greptile-apps greptile-apps bot left a comment


17 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +614 to +616
let data = vec![1u8, 2, 3, 4, 5, 6];
let frame = Nv12RenderedFrame {
    data: data.clone(),


Test won't compile: type mismatch

Nv12RenderedFrame.data is Arc<Vec<u8>> (changed in this PR at frame_pipeline.rs:357), but data.clone() returns Vec<u8>. This will fail to compile.

Suggested change

-   let data = vec![1u8, 2, 3, 4, 5, 6];
-   let frame = Nv12RenderedFrame {
-       data: data.clone(),
+   let data = vec![1u8, 2, 3, 4, 5, 6];
+   let frame = Nv12RenderedFrame {
+       data: std::sync::Arc::new(data.clone()),
+       width: 4,

Comment on lines 279 to 287
  if frame_tx
-     .send(MP4Input {
+     .send(Nv12ExportFrame {
          audio: audio_frame,
-         video: video_info.wrap_frame(
-             &frame.data,
-             frame_number as i64,
-             frame.padded_bytes_per_row as usize,
-         ),
+         nv12_data,
+         width: output_size.0,
+         height: output_size.1,
+         y_stride: output_size.0,
+         pts: frame_number as i64,
      })

Dimension mismatch between renderer and encoder

The Nv12ExportFrame uses the NV12-aligned output_size (4-aligned width, 2-aligned height from line 67), but the renderer receives self.resolution_base (line 340) and internally computes output dimensions via get_output_size() which only aligns to multiples of 2 (see lib.rs:1434).

When the rendered frame width is even but not a multiple of 4 (e.g., 1278, 1918), the GPU NV12 path falls back to RGBA-to-NV12 CPU conversion using the frame's actual dimensions. But here the Nv12ExportFrame declares the aligned output_size dimensions. fill_nv12_frame then reads the NV12 data with the wrong stride, producing garbled video.

Consider either:

  1. Passing the aligned output_size as resolution_base to the renderer so it produces frames matching the encoder's expected dimensions, or
  2. Using the frame's actual dimensions (frame.width, frame.height) instead of output_size for the Nv12ExportFrame.

},
],
});
let bg0 = make_bind_group(&source_view);

bg0/bg1 are built from the same source_view + nv12_buffer + params, and then selected via readback_idx (which is the readback buffer index, not a bind-group variant). This looks redundant/overly complex.

Consider caching a single BindGroup (and pass.set_bind_group(0, bind_group, &[])), or if the intent is to pre-cache variants, key them off the source texture instead of the readback index.

@richiemcilroy richiemcilroy merged commit 084f973 into main Feb 15, 2026
15 of 16 checks passed