
Export pipeline speed #1601

Merged
richiemcilroy merged 27 commits into main from cursor/export-pipeline-speed-cbf4
Feb 15, 2026

Conversation


@richiemcilroy richiemcilroy commented Feb 15, 2026

Implement GPU NV12 export path and optimize H264 encoder settings for faster video export.

The primary goal is to leverage the GPU for RGBA to NV12 conversion, drastically reducing CPU readback data and eliminating CPU-based swscale. Additionally, encoder settings are tuned for export throughput (e.g., realtime=false for VideoToolbox, tune=hq for NVENC) instead of recording-optimized low latency. This PR also includes necessary pipeline adjustments like flush_pipeline_nv12 and robust RGBA fallback handling to ensure correctness.



Greptile Summary

This PR introduces a GPU-accelerated NV12 export pipeline to replace the previous RGBA-based export path, reducing GPU readback data by ~62.5% and eliminating CPU-based swscale conversion. It also tunes H264 encoder settings for export quality/throughput (e.g., realtime=false for VideoToolbox, tune=hq for NVENC) separately from recording-optimized low-latency settings.
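As a rough illustration of the recording-vs-export split (a hedged sketch only: the option names below follow ffmpeg's per-encoder private options, and the real builder is `with_export_settings()` in crates/enc-ffmpeg, which may differ in structure):

```rust
// Illustrative mapping from H264 encoder to the export-mode options this PR
// describes. Not the project's actual API; names mirror ffmpeg private options.
fn export_opts(encoder: &str) -> Vec<(&'static str, &'static str)> {
    match encoder {
        "h264_videotoolbox" => vec![("realtime", "false"), ("profile", "main")],
        "h264_nvenc" => vec![("preset", "p5"), ("tune", "hq")],
        "h264_qsv" => vec![("preset", "medium")],
        "h264_amf" => vec![("quality", "quality")],
        // libx264 and other software encoders fall back to a throughput preset
        _ => vec![("preset", "veryfast")],
    }
}

fn main() {
    assert_eq!(
        export_opts("h264_nvenc"),
        vec![("preset", "p5"), ("tune", "hq")]
    );
}
```

The key design point is that these are selected only when `is_export` is set, leaving the recording path on its low-latency settings.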

  • GPU NV12 pipeline: New render_video_to_channel_nv12 function and export_nv12 method that perform RGBA-to-NV12 conversion on the GPU via compute shader, with robust RGBA fallback when GPU conversion fails
  • Pipelined rendering: render() and render_nv12() now return Option<Frame> to support double-buffered readback pipelining; new render_immediate() wraps the old blocking behavior for callers that need synchronous frame output
  • Frame prefetching: Overlaps next-frame decoding with current-frame rendering via tokio::join! for better pipeline utilization
  • Arc-wrapped frame data: RenderedFrame.data and Nv12RenderedFrame.data changed from Vec<u8> to Arc<Vec<u8>> to reduce cloning costs
  • Encoder optimizations: Per-encoder export settings (VideoToolbox, NVENC, QSV, AMF, MF, libx264) with with_export_settings() builder, reusable frame encoding via queue_frame_reusable, and with_external_conversion() to skip internal swscale
  • Decoder threading: Software decode now uses both FF_THREAD_FRAME and FF_THREAD_SLICE with CPU-count-aware thread limits
  • Clippy fixes: Collapsed nested if-let chains across test crates, derived Default for AudioConfig
  • Potential dimension mismatch: The aligned output_size (4-aligned width) used for the encoder may not match the renderer's actual output dimensions when resolutions aren't multiples of 4, since self.resolution_base (unaligned) is passed to the renderer
  • Test compilation issue: ensure_nv12_data_passthrough_for_nv12_format test constructs Nv12RenderedFrame with Vec<u8> but the field now expects Arc<Vec<u8>>
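The dimension mismatch can be reproduced from the two alignment rules involved (a minimal sketch; `align_nv12` mirrors the `(w + 3) & !3` expression from mp4.rs, while `renderer_output_size` is an illustrative stand-in for the renderer's 2-alignment described in the review):

```rust
// Sketch of the two alignment rules at play (function names are illustrative).
fn align_nv12(size: (u32, u32)) -> (u32, u32) {
    // Encoder side: width rounded up to a multiple of 4, height to 2.
    ((size.0 + 3) & !3, (size.1 + 1) & !1)
}

fn renderer_output_size(size: (u32, u32)) -> (u32, u32) {
    // Renderer side: per the review, only aligned to multiples of 2.
    (size.0 & !1, size.1 & !1)
}

fn main() {
    // 1278 is even but not a multiple of 4, so the two sides disagree:
    // the encoder expects 1280-wide frames, the renderer produces 1278.
    assert_eq!(align_nv12((1278, 720)), (1280, 720));
    assert_eq!(renderer_output_size((1278, 720)), (1278, 720));
}
```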

Confidence Score: 3/5

  • This PR has a test compilation error and a potential dimension mismatch bug for non-standard resolutions, but the core NV12 pipeline logic is sound for common resolutions.
  • Score of 3 reflects: (1) a definite compilation error in the test suite where Vec<u8> is used where Arc<Vec<u8>> is needed, and (2) a potential dimension mismatch between the NV12-aligned encoder dimensions and the renderer's actual output dimensions when width is even but not a multiple of 4. The core pipeline changes (pipelined readback, prefetching, Arc-wrapped data, encoder settings) are well-structured and the RGBA fallback path provides safety for edge cases.
  • crates/export/src/mp4.rs needs attention for both the test compilation fix and the dimension mismatch between aligned output_size and unaligned resolution_base passed to the renderer.

Important Files Changed

Filename Overview
crates/export/src/mp4.rs Major changes to switch export pipeline to NV12 GPU path. Contains a dimension mismatch bug between renderer output and encoder expectations for non-4-aligned widths, and a test compilation error from Arc<Vec<u8>> type change.
crates/rendering/src/lib.rs Adds render_video_to_channel_nv12, decode_segment_frames_with_retry helper, prefetch-while-rendering optimization, and render_immediate wrapper. Pipeline signature changes from direct frame return to Option<RenderedFrame> for pipelined readback. Well-structured refactor.
crates/rendering/src/frame_pipeline.rs Changes RenderedFrame.data and Nv12RenderedFrame.data from Vec<u8> to Arc<Vec<u8>> for cheaper cloning. Adds bind group caching using pointer-based texture identity. Pipelined readback now returns Option instead of blocking. Adds Nv12RenderedFrame helper methods.
crates/enc-ffmpeg/src/video/h264.rs Adds is_export flag to distinguish recording vs export encoder settings. Export mode enables quality-optimized settings per-encoder: VideoToolbox (realtime=false, profile=main), NVENC (preset=p5, tune=hq), QSV (preset=medium), AMF (quality=quality), libx264 (veryfast default). Adds queue_frame_reusable method for zero-alloc frame encoding.
apps/desktop/src-tauri/src/export.rs Updates callers to use render_immediate (which handles the new Option return from pipelined render). Removes a 200ms sleep before export start.
apps/desktop/src-tauri/src/frame_ws.rs Changes WSFrame.data from Vec<u8> to Arc<Vec<u8>> to match the rendering crate's change. Uses Arc::try_unwrap with clone fallback when ownership is needed. Collapses nested if-let per clippy lint.
crates/video-decode/src/ffmpeg.rs Improves software decoder threading: uses both FF_THREAD_FRAME and FF_THREAD_SLICE, adjusts thread count to use clamp instead of hard-coded values, better scaling with CPU count.

Flowchart

flowchart TD
    A[render_video_to_channel_nv12] --> B{Decode Frame}
    B -->|Prefetch next frame| C[tokio::join! decode + render]
    B -->|No prefetch| D[render_nv12]
    C --> D
    D --> E{GPU NV12 Converter}
    E -->|Success| F[Pipelined NV12 Readback]
    E -->|Width not 4-aligned or GPU fail| G[RGBA Fallback via finish_encoder]
    F --> H[Return previous NV12 frame]
    G --> I[Nv12RenderedFrame with format=Rgba]
    H --> J[mpsc channel to encoder thread]
    I --> J
    J --> K[ensure_nv12_data]
    K -->|format=Nv12| L[Pass through Arc data]
    K -->|format=Rgba| M[CPU swscale RGBA to NV12]
    L --> N[fill_nv12_frame]
    M --> N
    N --> O[queue_frame_reusable]
    O --> P[H264 Encoder with export settings]
    P --> Q[MP4 Output]

Last reviewed commit: 5db2514

cursoragent and others added 4 commits February 15, 2026 01:03
Major export pipeline speed optimization:

1. GPU NV12 Export Path (Phase 1):
   - Added render_video_to_channel_nv12() that renders frames and converts
     RGBA→NV12 on the GPU using the existing compute shader
   - Export now reads back NV12 (1.5 bytes/pixel) instead of RGBA (4 bytes/pixel)
     = 62.5% less GPU→CPU data transfer per frame
   - Eliminates CPU swscale RGBA→NV12 conversion entirely
   - Uses with_external_conversion() to skip encoder's internal converter
   - NV12 frames fed directly to H264 encoder via queue_frame()
   - Falls back to RGBA path if output dimensions aren't NV12-compatible
   - Screenshots generated from NV12 using cpu_yuv::nv12_to_rgba_simd()

2. Export-Optimized Encoder Settings (Phase 2):
   - VideoToolbox: realtime=false, profile=main (was realtime=true, baseline)
   - NVENC: preset=p5, tune=hq (was p4, tune=ll)
   - QSV: preset=medium, look_ahead_depth=20 (was faster)
   - AMF: quality=quality, rc=vbr_peak (was balanced, vbr_latency)
   - libx264: preset=veryfast (was ultrafast with zerolatency)
   - These settings optimize for throughput/quality instead of latency

3. Pipeline Tuning (Phase 4):
   - Increased channel buffer sizes from 8 to 16 for both stages
   - Allows better pipeline overlap between render/process/encode stages

Added Nv12RenderedFrame helper methods (y_plane, uv_plane, clone_metadata_with_data)
and NV12→ffmpeg frame conversion utility.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
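The 62.5% figure follows directly from the per-pixel sizes; a quick standalone check (not project code):

```rust
// NV12 stores a full-resolution Y plane plus a half-resolution interleaved
// UV plane: 1 + 0.5 = 1.5 bytes per pixel, versus 4 for RGBA.
fn rgba_bytes(w: usize, h: usize) -> usize {
    w * h * 4
}

fn nv12_bytes(w: usize, h: usize) -> usize {
    w * h + (w * h) / 2
}

fn main() {
    let (w, h) = (1920, 1080);
    assert_eq!(rgba_bytes(w, h), 8_294_400); // ~8.3 MB readback per frame
    assert_eq!(nv12_bytes(w, h), 3_110_400); // ~3.1 MB readback per frame
    // 1.0 - 1.5 / 4.0 = 0.625, i.e. 62.5% less data per frame
}
```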
When the GPU NV12 converter fails (e.g. incompatible dimensions at
runtime), finish_encoder_nv12 returns an Nv12RenderedFrame with
format=GpuOutputFormat::Rgba. The nv12_to_ffmpeg_frame function now
checks this format field and falls back to wrapping as RGBA when needed,
preventing corrupted output in edge cases.

Also added rgba_video_info for the fallback path to correctly describe
RGBA pixel format to the ffmpeg frame wrapper.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When the GPU NV12 converter falls back to returning RGBA data (rare edge
case), we now convert RGBA→NV12 using ffmpeg's swscale on CPU before
sending to the encoder. This ensures the encoder always receives NV12
frames since it was configured with external_conversion (no internal
converter).

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
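For intuition, the fallback conversion this commit describes amounts to something like the following (a self-contained integer BT.601 limited-range sketch; the real path delegates to ffmpeg's SIMD-optimized swscale rather than hand-rolled loops):

```rust
// Hedged sketch of RGBA -> NV12: full-resolution Y plane followed by one
// interleaved U,V pair per 2x2 block (chroma sampled at the top-left pixel).
fn rgba_to_nv12(rgba: &[u8], w: usize, h: usize) -> Vec<u8> {
    assert!(w % 2 == 0 && h % 2 == 0, "NV12 needs even dimensions");
    let mut out = vec![0u8; w * h + w * h / 2];
    let (y_plane, uv_plane) = out.split_at_mut(w * h);
    for row in 0..h {
        for col in 0..w {
            let i = (row * w + col) * 4;
            let (r, g, b) = (rgba[i] as i32, rgba[i + 1] as i32, rgba[i + 2] as i32);
            y_plane[row * w + col] = (((66 * r + 129 * g + 25 * b + 128) >> 8) + 16) as u8;
        }
    }
    for row in (0..h).step_by(2) {
        for col in (0..w).step_by(2) {
            let i = (row * w + col) * 4;
            let (r, g, b) = (rgba[i] as i32, rgba[i + 1] as i32, rgba[i + 2] as i32);
            let j = (row / 2) * w + col;
            uv_plane[j] = (((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128) as u8;
            uv_plane[j + 1] = (((112 * r - 94 * g - 18 * b + 128) >> 8) + 128) as u8;
        }
    }
    out
}

fn main() {
    let rgba = vec![128u8; 2 * 2 * 4]; // 2x2 mid-gray image
    let nv12 = rgba_to_nv12(&rgba, 2, 2);
    assert_eq!(&nv12[..4], &[126u8, 126, 126, 126][..]); // luma
    assert_eq!(&nv12[4..], &[128u8, 128][..]); // neutral chroma
}
```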
The NV12 pipelined readback always has one frame in flight. Without
flushing, the last exported frame would be lost. Added
flush_pipeline_nv12() to FrameRenderer and call it at the end of
render_video_to_channel_nv12() to ensure all frames are delivered.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>

cursor bot commented Feb 15, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

cursoragent and others added 19 commits February 15, 2026 01:17
- Remove panic-prone expect() calls in RGBA→NV12 CPU fallback path;
  gracefully handle scaler creation and conversion failures
- Add encoder thread timing: logs encoded frame count, elapsed time,
  and effective encode FPS for both NV12 and RGBA paths
- These metrics help identify whether the encoder is the bottleneck
  when profiling export performance

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Restructured the NV12 export path to eliminate per-frame ffmpeg frame
allocation (~3.1MB per frame at 1080p):

- Encoder thread now owns a single reusable frame::Video (NV12) that is
  filled from raw NV12 bytes each frame via fill_nv12_frame()
- Render task sends raw Nv12ExportFrame (Vec<u8> + metadata) through
  the sync_channel instead of pre-built frame::Video objects
- Added MP4File::queue_video_frame_reusable() that delegates to
  H264Encoder::queue_frame_reusable() for zero-allocation encoding
- Added ensure_nv12_data() for CPU RGBA→NV12 fallback when GPU
  converter returns RGBA data (extremely rare edge case)

This eliminates one heap allocation + zeroing per frame in the hot
encoding path, reducing memory pressure and improving cache locality.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
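The stride handling that makes the reusable-frame fill correct can be sketched like this (illustrative only; the real `fill_nv12_frame()` copies into ffmpeg's `frame::Video` planes, which are often padded so each destination row is wider than the visible width):

```rust
// Copies `rows` rows of `width` bytes from a tightly- or loosely-packed
// source plane into a destination plane with its own (possibly larger) stride.
fn copy_plane(
    src: &[u8],
    src_stride: usize,
    dst: &mut [u8],
    dst_stride: usize,
    width: usize,
    rows: usize,
) {
    for r in 0..rows {
        let s = &src[r * src_stride..r * src_stride + width];
        dst[r * dst_stride..r * dst_stride + width].copy_from_slice(s);
    }
}

fn main() {
    let src = [1u8, 2, 3, 4]; // 2x2 plane, tightly packed (stride 2)
    let mut dst = [0u8; 8]; // destination padded to stride 4
    copy_plane(&src, 2, &mut dst, 4, 2, 2);
    assert_eq!(dst, [1, 2, 0, 0, 3, 4, 0, 0]);
}
```

Declaring the wrong width or stride here is exactly how the dimension-mismatch bug flagged by the review would garble output.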
Replace manual pixel-by-pixel RGBA→NV12 conversion with ffmpeg's
swscale (SIMD-optimized) in the rare GPU fallback path. This makes the
fallback path significantly faster if it ever triggers.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Move ensure_nv12_data() before first_frame_data capture so the
screenshot data is always in NV12 format. Previously, if the GPU
NV12 converter fell back to RGBA, the screenshot would receive
RGBA data but interpret it as NV12, producing a corrupted image.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- fill_nv12_frame_preserves_data_layout: verifies Y and UV plane data
  is correctly copied into ffmpeg frame with proper stride handling
- ensure_nv12_data_passthrough_for_nv12_format: verifies NV12 data
  passes through without conversion when format is already NV12
- nv12_export_frame_dimensions_match: validates NV12 size calculations
  and confirms 62.5% data reduction vs RGBA at 1080p

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Deduplicate the frame decode-with-retry logic that was duplicated
between render_video_to_channel (RGBA) and render_video_to_channel_nv12.
Both functions now call the shared decode_segment_frames_with_retry()
helper, reducing ~100 lines of duplicated decode/retry/backoff code
to a single source of truth.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
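As a shape reference for the deduplicated helper (the real `decode_segment_frames_with_retry()` is async and decoder-specific; this is a hedged synchronous sketch of the retry-with-backoff pattern it centralizes):

```rust
use std::time::Duration;

// Retries a fallible operation up to `attempts` times, doubling the
// backoff between failures; returns the last error if all attempts fail.
fn with_retry<T, E>(
    mut attempts: u32,
    mut backoff: Duration,
    mut f: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    loop {
        match f() {
            Ok(v) => return Ok(v),
            Err(e) if attempts <= 1 => return Err(e),
            Err(_) => {
                std::thread::sleep(backoff);
                backoff *= 2; // exponential backoff between attempts
                attempts -= 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    let r: Result<u32, &str> = with_retry(3, Duration::from_millis(1), || {
        calls += 1;
        if calls < 3 { Err("decode failed") } else { Ok(42) }
    });
    assert_eq!(r, Ok(42));
    assert_eq!(calls, 3);
}
```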
@richiemcilroy richiemcilroy marked this pull request as ready for review February 15, 2026 21:03
Comment on lines +67 to +83
let output_size = ((raw_output_size.0 + 3) & !3, (raw_output_size.1 + 1) & !1);

if output_size != raw_output_size {
    info!(
        raw_width = raw_output_size.0,
        raw_height = raw_output_size.1,
        aligned_width = output_size.0,
        aligned_height = output_size.1,
        "Aligned output dimensions for NV12 GPU path"
    );
}

info!(
    width = output_size.0,
    height = output_size.1,
    "Using GPU NV12 export path (reduced readback + no CPU swscale)"
);

This output_size alignment is only applied on the encoder side. The renderer still derives its own output size from resolution_base/project, and RgbaToNv12Converter::submit_conversion returns false when width isn’t divisible by 4. In that case you’ll fall back to RGBA frames whose dimensions can differ from the aligned VideoInfo/ffmpeg frame size, which looks like it could silently corrupt output (truncated copy) when raw_output_size.0 % 4 != 0.

Suggested change

-   let output_size = ((raw_output_size.0 + 3) & !3, (raw_output_size.1 + 1) & !1);
-   if output_size != raw_output_size {
-       info!(
-           raw_width = raw_output_size.0,
-           raw_height = raw_output_size.1,
-           aligned_width = output_size.0,
-           aligned_height = output_size.1,
-           "Aligned output dimensions for NV12 GPU path"
-       );
-   }
-   info!(
-       width = output_size.0,
-       height = output_size.1,
-       "Using GPU NV12 export path (reduced readback + no CPU swscale)"
-   );
+   let output_size = raw_output_size;
+   info!(
+       width = output_size.0,
+       height = output_size.1,
+       "Exporting with NV12 pipeline (GPU when possible, CPU fallback otherwise)"
+   );

use cap_rendering::GpuOutputFormat;

if frame.format != GpuOutputFormat::Rgba {
    return frame.into_data();

Minor perf note: frame.into_data() will almost always clone here because render_video_to_channel_nv12 keeps last_successful_frame (another Arc) for fallback. If the goal is to avoid per-frame copies, it’d be worth carrying Arc<Vec<u8>> through Nv12ExportFrame for the NV12 case and only allocating/cloning when doing the RGBA->NV12 CPU fallback.
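The behavior this comment describes is easy to demonstrate in isolation (a standalone sketch, not project code): `Arc::try_unwrap`/`Arc::unwrap_or_clone` only moves the buffer out when the refcount is 1, so holding `last_successful_frame` as a second `Arc` forces a copy every frame.

```rust
use std::sync::Arc;

// Moves the Vec out when this Arc is the sole owner, otherwise clones it;
// equivalent to the try_unwrap + unwrap_or_else pattern in the diff below.
fn take_or_clone(data: Arc<Vec<u8>>) -> Vec<u8> {
    Arc::unwrap_or_clone(data)
}

fn main() {
    // Sole owner: the buffer is moved out, no copy is made.
    let sole = Arc::new(vec![1u8, 2, 3]);
    assert_eq!(take_or_clone(sole), vec![1, 2, 3]);

    // A second handle (like last_successful_frame) keeps the refcount > 1,
    // so the buffer is cloned instead of moved.
    let shared = Arc::new(vec![4u8, 5, 6]);
    let keepalive = Arc::clone(&shared);
    assert_eq!(take_or_clone(shared), vec![4, 5, 6]);
    assert_eq!(*keepalive, vec![4, 5, 6]); // original still intact
}
```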

Ok(frame) => {
    let packed = pack_frame_data(
-       frame.data,
+       std::sync::Arc::try_unwrap(frame.data).unwrap_or_else(|arc| (*arc).clone()),

Small readability tweak (and avoids the closure):

Suggested change

-   std::sync::Arc::try_unwrap(frame.data).unwrap_or_else(|arc| (*arc).clone()),
+   std::sync::Arc::unwrap_or_clone(frame.data),

}

pub fn into_data(self) -> Vec<u8> {
    Arc::try_unwrap(self.data).unwrap_or_else(|arc| (*arc).clone())

Same here—Arc::unwrap_or_clone reads a bit cleaner.

Suggested change

-   Arc::try_unwrap(self.data).unwrap_or_else(|arc| (*arc).clone())
+   Arc::unwrap_or_clone(self.data)


@greptile-apps greptile-apps bot left a comment


17 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +614 to +616
let data = vec![1u8, 2, 3, 4, 5, 6];
let frame = Nv12RenderedFrame {
    data: data.clone(),


Test won't compile: type mismatch

Nv12RenderedFrame.data is Arc<Vec<u8>> (changed in this PR at frame_pipeline.rs:357), but data.clone() returns Vec<u8>. This will fail to compile.

Suggested change

-   let data = vec![1u8, 2, 3, 4, 5, 6];
-   let frame = Nv12RenderedFrame {
-       data: data.clone(),
+   let data = vec![1u8, 2, 3, 4, 5, 6];
+   let frame = Nv12RenderedFrame {
+       data: std::sync::Arc::new(data.clone()),
+       width: 4,

Comment on lines 279 to 287
  if frame_tx
-     .send(MP4Input {
+     .send(Nv12ExportFrame {
          audio: audio_frame,
-         video: video_info.wrap_frame(
-             &frame.data,
-             frame_number as i64,
-             frame.padded_bytes_per_row as usize,
-         ),
+         nv12_data,
+         width: output_size.0,
+         height: output_size.1,
+         y_stride: output_size.0,
+         pts: frame_number as i64,
      })

Dimension mismatch between renderer and encoder

The Nv12ExportFrame uses the NV12-aligned output_size (4-aligned width, 2-aligned height from line 67), but the renderer receives self.resolution_base (line 340) and internally computes output dimensions via get_output_size() which only aligns to multiples of 2 (see lib.rs:1434).

When the rendered frame width is even but not a multiple of 4 (e.g., 1278, 1918), the GPU NV12 path falls back to RGBA-to-NV12 CPU conversion using the frame's actual dimensions. But here the Nv12ExportFrame declares the aligned output_size dimensions. fill_nv12_frame then reads the NV12 data with the wrong stride, producing garbled video.

Consider either:

  1. Passing the aligned output_size as resolution_base to the renderer so it produces frames matching the encoder's expected dimensions, or
  2. Using the frame's actual dimensions (frame.width, frame.height) instead of output_size for the Nv12ExportFrame.

},
],
});
let bg0 = make_bind_group(&source_view);

bg0/bg1 are built from the same source_view + nv12_buffer + params, and then selected via readback_idx (which is the readback buffer index, not a bind-group variant). This looks redundant/overly complex.

Consider caching a single BindGroup (and pass.set_bind_group(0, bind_group, &[])), or if the intent is to pre-cache variants, key them off the source texture instead of the readback index.

@richiemcilroy richiemcilroy merged commit 084f973 into main Feb 15, 2026
15 of 16 checks passed