Conversation
Major export pipeline speed optimization:
1. GPU NV12 Export Path (Phase 1):
- Added render_video_to_channel_nv12() that renders frames and converts
RGBA→NV12 on the GPU using the existing compute shader
- Export now reads back NV12 (1.5 bytes/pixel) instead of RGBA (4 bytes/pixel)
= 62.5% less GPU→CPU data transfer per frame
- Eliminates CPU swscale RGBA→NV12 conversion entirely
- Uses with_external_conversion() to skip encoder's internal converter
- NV12 frames fed directly to H264 encoder via queue_frame()
- Falls back to RGBA path if output dimensions aren't NV12-compatible
- Screenshots generated from NV12 using cpu_yuv::nv12_to_rgba_simd()
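The readback saving claimed above follows directly from the per-pixel sizes; a quick sketch of the arithmetic (function names are illustrative, not the crate's API):

```rust
// RGBA is 4 bytes per pixel; NV12 is a full-resolution Y plane plus a
// half-resolution interleaved UV plane, i.e. 1.5 bytes per pixel.
fn rgba_frame_bytes(width: usize, height: usize) -> usize {
    width * height * 4
}

fn nv12_frame_bytes(width: usize, height: usize) -> usize {
    let y_plane = width * height;      // 1 byte per pixel
    let uv_plane = width * height / 2; // 2 bytes per 2x2 pixel block
    y_plane + uv_plane
}

fn main() {
    let (w, h) = (1920, 1080);
    let rgba = rgba_frame_bytes(w, h);
    let nv12 = nv12_frame_bytes(w, h);
    // 1 - 1.5/4 = 0.625, i.e. a 62.5% reduction in readback bytes.
    println!("reduction: {:.1}%", 100.0 * (1.0 - nv12 as f64 / rgba as f64));
}
```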
2. Export-Optimized Encoder Settings (Phase 2):
- VideoToolbox: realtime=false, profile=main (was realtime=true, baseline)
- NVENC: preset=p5, tune=hq (was p4, tune=ll)
- QSV: preset=medium, look_ahead_depth=20 (was faster)
- AMF: quality=quality, rc=vbr_peak (was balanced, vbr_latency)
- libx264: preset=veryfast (was ultrafast with zerolatency)
- These settings optimize for throughput/quality instead of latency
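The recording-vs-export split above can be pictured as a single lookup keyed on the encoder and an export flag. This is a hypothetical sketch — `EncoderKind` and `encoder_options` are invented names; only the option values come from the list above, and the real crate wires them through its own builder:

```rust
// Hypothetical mapping of encoder kind + mode to ffmpeg-style options.
enum EncoderKind { VideoToolbox, Nvenc, Qsv, Amf, Libx264 }

fn encoder_options(kind: EncoderKind, is_export: bool) -> Vec<(&'static str, &'static str)> {
    match (kind, is_export) {
        // Export: throughput/quality. Recording: low latency.
        (EncoderKind::VideoToolbox, true) => vec![("realtime", "false"), ("profile", "main")],
        (EncoderKind::VideoToolbox, false) => vec![("realtime", "true"), ("profile", "baseline")],
        (EncoderKind::Nvenc, true) => vec![("preset", "p5"), ("tune", "hq")],
        (EncoderKind::Nvenc, false) => vec![("preset", "p4"), ("tune", "ll")],
        (EncoderKind::Qsv, true) => vec![("preset", "medium"), ("look_ahead_depth", "20")],
        (EncoderKind::Qsv, false) => vec![("preset", "faster")],
        (EncoderKind::Amf, true) => vec![("quality", "quality"), ("rc", "vbr_peak")],
        (EncoderKind::Amf, false) => vec![("quality", "balanced"), ("rc", "vbr_latency")],
        (EncoderKind::Libx264, true) => vec![("preset", "veryfast")],
        (EncoderKind::Libx264, false) => vec![("preset", "ultrafast"), ("tune", "zerolatency")],
    }
}

fn main() {
    // Export mode flips NVENC from low-latency to high-quality tuning.
    assert!(encoder_options(EncoderKind::Nvenc, true).contains(&("tune", "hq")));
}
```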
3. Pipeline Tuning (Phase 4):
- Increased channel buffer sizes from 8 to 16 for both stages
- Allows better pipeline overlap between render/process/encode stages
Added Nv12RenderedFrame helper methods (y_plane, uv_plane, clone_metadata_with_data)
and NV12→ffmpeg frame conversion utility.
Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
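The y_plane/uv_plane helpers amount to splitting a tightly packed NV12 buffer at the Y/UV boundary. A minimal sketch, assuming tight packing (no row padding) and illustrative field names rather than the actual Nv12RenderedFrame definition:

```rust
// In packed NV12, the Y plane occupies the first width*height bytes and
// the interleaved UV plane occupies the remaining width*height/2 bytes.
struct Nv12Frame {
    data: Vec<u8>,
    width: usize,
    height: usize,
}

impl Nv12Frame {
    fn y_plane(&self) -> &[u8] {
        &self.data[..self.width * self.height]
    }

    fn uv_plane(&self) -> &[u8] {
        &self.data[self.width * self.height..]
    }
}

fn main() {
    // 4x2 frame: 8 Y bytes followed by 4 interleaved UV bytes.
    let frame = Nv12Frame { data: (0u8..12).collect(), width: 4, height: 2 };
    assert_eq!(frame.y_plane().len(), 8);
    assert_eq!(frame.uv_plane().len(), 4);
}
```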
When the GPU NV12 converter fails (e.g. incompatible dimensions at runtime), finish_encoder_nv12 returns an Nv12RenderedFrame with format=GpuOutputFormat::Rgba. The nv12_to_ffmpeg_frame function now checks this format field and falls back to wrapping as RGBA when needed, preventing corrupted output in edge cases. Also added rgba_video_info for the fallback path to correctly describe RGBA pixel format to the ffmpeg frame wrapper. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When the GPU NV12 converter falls back to returning RGBA data (rare edge case), we now convert RGBA→NV12 using ffmpeg's swscale on CPU before sending to the encoder. This ensures the encoder always receives NV12 frames since it was configured with external_conversion (no internal converter). Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
The NV12 pipelined readback always has one frame in flight. Without flushing, the last exported frame would be lost. Added flush_pipeline_nv12() to FrameRenderer and call it at the end of render_video_to_channel_nv12() to ensure all frames are delivered. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
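The hazard described here — one frame always in flight, so the last frame is lost without an explicit flush — can be shown with a tiny state machine. Names are illustrative, not the actual FrameRenderer API:

```rust
// Double-buffered readback: each submit returns the *previous* frame,
// so after the final submit one frame still sits in the pipeline.
struct PipelinedReadback {
    in_flight: Option<Vec<u8>>,
}

impl PipelinedReadback {
    /// Submits a frame and returns the previously submitted one, if any.
    fn submit(&mut self, frame: Vec<u8>) -> Option<Vec<u8>> {
        self.in_flight.replace(frame)
    }

    /// Drains the last in-flight frame; without this it would be lost.
    fn flush(&mut self) -> Option<Vec<u8>> {
        self.in_flight.take()
    }
}

fn main() {
    let mut readback = PipelinedReadback { in_flight: None };
    assert_eq!(readback.submit(vec![1]), None);          // frame 1 now in flight
    assert_eq!(readback.submit(vec![2]), Some(vec![1])); // frame 1 delivered
    assert_eq!(readback.flush(), Some(vec![2]));         // final frame recovered
}
```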
- Remove panic-prone expect() calls in the RGBA→NV12 CPU fallback path; gracefully handle scaler creation and conversion failures
- Add encoder thread timing: logs encoded frame count, elapsed time, and effective encode FPS for both NV12 and RGBA paths
- These metrics help identify whether the encoder is the bottleneck when profiling export performance

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Restructured the NV12 export path to eliminate per-frame ffmpeg frame allocation (~3.1MB per frame at 1080p):
- Encoder thread now owns a single reusable frame::Video (NV12) that is filled from raw NV12 bytes each frame via fill_nv12_frame()
- Render task sends raw Nv12ExportFrame (Vec<u8> + metadata) through the sync_channel instead of pre-built frame::Video objects
- Added MP4File::queue_video_frame_reusable() that delegates to H264Encoder::queue_frame_reusable() for zero-allocation encoding
- Added ensure_nv12_data() for CPU RGBA→NV12 fallback when GPU converter returns RGBA data (extremely rare edge case)

This eliminates one heap allocation + zeroing per frame in the hot encoding path, reducing memory pressure and improving cache locality.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Replace manual pixel-by-pixel RGBA→NV12 conversion with ffmpeg's swscale (SIMD-optimized) in the rare GPU fallback path. This makes the fallback path significantly faster if it ever triggers. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
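For reference, a pixel-by-pixel RGBA→NV12 conversion of the kind being replaced looks roughly like this. This is a sketch using common BT.601 limited-range integer coefficients and one chroma sample per 2x2 block; the project's actual coefficients and sampling may differ:

```rust
// Naive per-pixel RGBA -> NV12 (BT.601 limited range, integer approx).
// swscale does the same job with SIMD, hence the commit's replacement.
fn rgba_to_nv12(rgba: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut out = vec![0u8; width * height * 3 / 2];
    let (y_plane, uv_plane) = out.split_at_mut(width * height);
    for py in 0..height {
        for px in 0..width {
            let i = (py * width + px) * 4; // alpha at i+3 is ignored
            let (r, g, b) = (rgba[i] as i32, rgba[i + 1] as i32, rgba[i + 2] as i32);
            y_plane[py * width + px] =
                (((66 * r + 129 * g + 25 * b + 128) >> 8) + 16) as u8;
            // One U/V pair per 2x2 block, sampled from the top-left pixel.
            if py % 2 == 0 && px % 2 == 0 {
                let uv = (py / 2) * width + px; // interleaved U,V pairs
                uv_plane[uv] = (((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128) as u8;
                uv_plane[uv + 1] = (((112 * r - 94 * g - 18 * b + 128) >> 8) + 128) as u8;
            }
        }
    }
    out
}

fn main() {
    let white = vec![255u8; 2 * 2 * 4]; // 2x2 opaque white RGBA image
    let nv12 = rgba_to_nv12(&white, 2, 2);
    assert_eq!(&nv12[..4], &[235u8; 4][..]);   // limited-range white luma
    assert_eq!(&nv12[4..], &[128u8, 128][..]); // neutral chroma
}
```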
Move ensure_nv12_data() before first_frame_data capture so the screenshot data is always in NV12 format. Previously, if the GPU NV12 converter fell back to RGBA, the screenshot would receive RGBA data but interpret it as NV12, producing a corrupted image. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- fill_nv12_frame_preserves_data_layout: verifies Y and UV plane data is correctly copied into the ffmpeg frame with proper stride handling
- ensure_nv12_data_passthrough_for_nv12_format: verifies NV12 data passes through without conversion when the format is already NV12
- nv12_export_frame_dimensions_match: validates NV12 size calculations and confirms the 62.5% data reduction vs RGBA at 1080p

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Deduplicate the frame decode-with-retry logic that was duplicated between render_video_to_channel (RGBA) and render_video_to_channel_nv12. Both functions now call the shared decode_segment_frames_with_retry() helper, reducing ~100 lines of duplicated decode/retry/backoff code to a single source of truth. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
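The shared helper's shape — one fallible operation retried with backoff, duplicated code reduced to a single call site — can be sketched like this. The real decode_segment_frames_with_retry is async and decoder-specific; `with_retry` and its parameters are invented for illustration:

```rust
use std::time::Duration;

// Retry a fallible operation with linear backoff, returning the last
// error once the retry budget is exhausted.
fn with_retry<T, E>(
    mut attempt: impl FnMut() -> Result<T, E>,
    max_retries: u32,
    backoff: Duration,
) -> Result<T, E> {
    let mut tries = 0;
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            Err(e) if tries >= max_retries => return Err(e),
            Err(_) => {
                tries += 1;
                std::thread::sleep(backoff * tries); // linear backoff
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    let result: Result<u32, &str> = with_retry(
        || {
            calls += 1;
            if calls < 3 { Err("transient decode failure") } else { Ok(42) }
        },
        5,
        Duration::from_millis(1),
    );
    assert_eq!(result, Ok(42));
    assert_eq!(calls, 3); // two failures, then success
}
```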
crates/export/src/mp4.rs
Outdated
```rust
let output_size = ((raw_output_size.0 + 3) & !3, (raw_output_size.1 + 1) & !1);

if output_size != raw_output_size {
    info!(
        raw_width = raw_output_size.0,
        raw_height = raw_output_size.1,
        aligned_width = output_size.0,
        aligned_height = output_size.1,
        "Aligned output dimensions for NV12 GPU path"
    );
}

info!(
    width = output_size.0,
    height = output_size.1,
    "Using GPU NV12 export path (reduced readback + no CPU swscale)"
);
```
This output_size alignment is only applied on the encoder side. The renderer still derives its own output size from resolution_base/project, and RgbaToNv12Converter::submit_conversion returns false when width isn’t divisible by 4. In that case you’ll fall back to RGBA frames whose dimensions can differ from the aligned VideoInfo/ffmpeg frame size, which looks like it could silently corrupt output (truncated copy) when raw_output_size.0 % 4 != 0.
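The alignment in question rounds width up to a multiple of 4 and height up to a multiple of 2, so an even-but-not-4-aligned width like 1278 produces exactly the encoder/renderer disagreement this comment describes (the function name here is illustrative):

```rust
// Width to the next multiple of 4, height to the next multiple of 2:
// (w + 3) & !3 clears the low two bits after rounding up; likewise for h.
fn align_for_nv12(raw: (u32, u32)) -> (u32, u32) {
    ((raw.0 + 3) & !3, (raw.1 + 1) & !1)
}

fn main() {
    // 1278 is 2-aligned (the renderer accepts it) but becomes 1280 here.
    assert_eq!(align_for_nv12((1278, 720)), (1280, 720));
    // Common resolutions are already aligned and pass through unchanged.
    assert_eq!(align_for_nv12((1920, 1080)), (1920, 1080));
}
```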
Suggested change:

```diff
-let output_size = ((raw_output_size.0 + 3) & !3, (raw_output_size.1 + 1) & !1);
-if output_size != raw_output_size {
-    info!(
-        raw_width = raw_output_size.0,
-        raw_height = raw_output_size.1,
-        aligned_width = output_size.0,
-        aligned_height = output_size.1,
-        "Aligned output dimensions for NV12 GPU path"
-    );
-}
-info!(
-    width = output_size.0,
-    height = output_size.1,
-    "Using GPU NV12 export path (reduced readback + no CPU swscale)"
-);
+let output_size = raw_output_size;
+info!(
+    width = output_size.0,
+    height = output_size.1,
+    "Exporting with NV12 pipeline (GPU when possible, CPU fallback otherwise)"
+);
```
crates/export/src/mp4.rs
Outdated
```rust
use cap_rendering::GpuOutputFormat;

if frame.format != GpuOutputFormat::Rgba {
    return frame.into_data();
```
Minor perf note: frame.into_data() will almost always clone here because render_video_to_channel_nv12 keeps last_successful_frame (another Arc) for fallback. If the goal is to avoid per-frame copies, it’d be worth carrying Arc<Vec<u8>> through Nv12ExportFrame for the NV12 case and only allocating/cloning when doing the RGBA->NV12 CPU fallback.
```diff
 Ok(frame) => {
     let packed = pack_frame_data(
-        frame.data,
+        std::sync::Arc::try_unwrap(frame.data).unwrap_or_else(|arc| (*arc).clone()),
```
Small readability tweak (and avoids the closure):
```diff
-std::sync::Arc::try_unwrap(frame.data).unwrap_or_else(|arc| (*arc).clone()),
+std::sync::Arc::unwrap_or_clone(frame.data),
```
```rust
}

pub fn into_data(self) -> Vec<u8> {
    Arc::try_unwrap(self.data).unwrap_or_else(|arc| (*arc).clone())
```
Same here—Arc::unwrap_or_clone reads a bit cleaner.
```diff
-Arc::try_unwrap(self.data).unwrap_or_else(|arc| (*arc).clone())
+Arc::unwrap_or_clone(self.data)
```
crates/export/src/mp4.rs
Outdated
```rust
let data = vec![1u8, 2, 3, 4, 5, 6];
let frame = Nv12RenderedFrame {
    data: data.clone(),
```
Test won't compile: type mismatch
Nv12RenderedFrame.data is Arc<Vec<u8>> (changed in this PR at frame_pipeline.rs:357), but data.clone() returns Vec<u8>. This will fail to compile.
```diff
 let data = vec![1u8, 2, 3, 4, 5, 6];
 let frame = Nv12RenderedFrame {
-    data: data.clone(),
+    data: std::sync::Arc::new(data.clone()),
     width: 4,
```
```diff
 if frame_tx
-    .send(MP4Input {
-        audio: audio_frame,
-        video: video_info.wrap_frame(
-            &frame.data,
-            frame_number as i64,
-            frame.padded_bytes_per_row as usize,
-        ),
+    .send(Nv12ExportFrame {
+        nv12_data,
+        width: output_size.0,
+        height: output_size.1,
+        y_stride: output_size.0,
+        pts: frame_number as i64,
     })
```
Dimension mismatch between renderer and encoder
The Nv12ExportFrame uses the NV12-aligned output_size (4-aligned width, 2-aligned height from line 67), but the renderer receives self.resolution_base (line 340) and internally computes output dimensions via get_output_size() which only aligns to multiples of 2 (see lib.rs:1434).
When the rendered frame width is even but not a multiple of 4 (e.g., 1278, 1918), the GPU NV12 path falls back to RGBA-to-NV12 CPU conversion using the frame's actual dimensions. But here the Nv12ExportFrame declares the aligned output_size dimensions. fill_nv12_frame then reads the NV12 data with the wrong stride, producing garbled video.
Consider either:
1. Passing the aligned output_size as resolution_base to the renderer so it produces frames matching the encoder's expected dimensions, or
2. Using the frame's actual dimensions (frame.width, frame.height) instead of output_size for the Nv12ExportFrame.
```rust
        },
    ],
});
let bg0 = make_bind_group(&source_view);
```
bg0/bg1 are built from the same source_view + nv12_buffer + params, and then selected via readback_idx (which is the readback buffer index, not a bind-group variant). This looks redundant/overly complex.
Consider caching a single BindGroup (and pass.set_bind_group(0, bind_group, &[])), or if the intent is to pre-cache variants, key them off the source texture instead of the readback index.
Implement GPU NV12 export path and optimize H264 encoder settings for faster video export.

The primary goal is to leverage the GPU for RGBA to NV12 conversion, drastically reducing CPU readback data and eliminating CPU-based swscale. Additionally, encoder settings are tuned for export throughput (e.g., realtime=false for VideoToolbox, tune=hq for NVENC) instead of recording-optimized low latency. This PR also includes necessary pipeline adjustments like flush_pipeline_nv12 and robust RGBA fallback handling to ensure correctness.

Greptile Summary
This PR introduces a GPU-accelerated NV12 export pipeline to replace the previous RGBA-based export path, reducing GPU readback data by ~62.5% and eliminating CPU-based swscale conversion. It also tunes H264 encoder settings for export quality/throughput (e.g., realtime=false for VideoToolbox, tune=hq for NVENC) separately from recording-optimized low-latency settings.

Key changes:
- New render_video_to_channel_nv12 function and export_nv12 method that perform RGBA-to-NV12 conversion on the GPU via compute shader, with robust RGBA fallback when GPU conversion fails
- render() and render_nv12() now return Option<Frame> to support double-buffered readback pipelining; new render_immediate() wraps the old blocking behavior for callers that need synchronous frame output
- Frame decode is prefetched concurrently with rendering via tokio::join! for better pipeline utilization
- RenderedFrame.data and Nv12RenderedFrame.data changed from Vec<u8> to Arc<Vec<u8>> to reduce cloning costs
- Encoder gains a with_export_settings() builder, reusable frame encoding via queue_frame_reusable, and with_external_conversion() to skip internal swscale
- Decoder threading uses FF_THREAD_FRAME and FF_THREAD_SLICE with CPU-count-aware thread limits
- Clippy cleanups: collapsed if-let chains across test crates, derived Default for AudioConfig

Open issues:
- The aligned output_size (4-aligned width) used for the encoder may not match the renderer's actual output dimensions when resolutions aren't multiples of 4, since self.resolution_base (unaligned) is passed to the renderer
- The ensure_nv12_data_passthrough_for_nv12_format test constructs Nv12RenderedFrame with Vec<u8> but the field now expects Arc<Vec<u8>>

Confidence Score: 3/5

The score reflects two concrete issues: (1) a test that won't compile because Vec<u8> is used where Arc<Vec<u8>> is needed, and (2) a potential dimension mismatch between the NV12-aligned encoder dimensions and the renderer's actual output dimensions when width is even but not a multiple of 4. The core pipeline changes (pipelined readback, prefetching, Arc-wrapped data, encoder settings) are well-structured, and the RGBA fallback path provides safety for edge cases. crates/export/src/mp4.rs needs attention for both the test compilation fix and the dimension mismatch between aligned output_size and unaligned resolution_base passed to the renderer.

Important Files Changed
- Arc<Vec<u8>> type change.
- Adds render_video_to_channel_nv12, the decode_segment_frames_with_retry helper, a prefetch-while-rendering optimization, and the render_immediate wrapper. Pipeline signature changes from direct frame return to Option<RenderedFrame> for pipelined readback. Well-structured refactor.
- Changes RenderedFrame.data and Nv12RenderedFrame.data from Vec<u8> to Arc<Vec<u8>> for cheaper cloning. Adds bind group caching using pointer-based texture identity. Pipelined readback now returns Option instead of blocking. Adds Nv12RenderedFrame helper methods.
- Adds an is_export flag to distinguish recording vs export encoder settings. Export mode enables quality-optimized settings per encoder: VideoToolbox (realtime=false, profile=main), NVENC (preset=p5, tune=hq), QSV (preset=medium), AMF (quality=quality), libx264 (veryfast default). Adds the queue_frame_reusable method for zero-alloc frame encoding.
- Uses render_immediate (which handles the new Option return from pipelined render). Removes a 200ms sleep before export start.
- Changes WSFrame.data from Vec<u8> to Arc<Vec<u8>> to match the rendering crate's change. Uses Arc::try_unwrap with clone fallback when ownership is needed. Collapses nested if-let per clippy lint.
- Enables FF_THREAD_FRAME and FF_THREAD_SLICE, adjusts thread count to use clamp instead of hard-coded values, better scaling with CPU count.
Flowchart

```mermaid
flowchart TD
    A[render_video_to_channel_nv12] --> B{Decode Frame}
    B -->|Prefetch next frame| C[tokio::join! decode + render]
    B -->|No prefetch| D[render_nv12]
    C --> D
    D --> E{GPU NV12 Converter}
    E -->|Success| F[Pipelined NV12 Readback]
    E -->|Width not 4-aligned or GPU fail| G[RGBA Fallback via finish_encoder]
    F --> H[Return previous NV12 frame]
    G --> I[Nv12RenderedFrame with format=Rgba]
    H --> J[mpsc channel to encoder thread]
    I --> J
    J --> K[ensure_nv12_data]
    K -->|format=Nv12| L[Pass through Arc data]
    K -->|format=Rgba| M[CPU swscale RGBA to NV12]
    L --> N[fill_nv12_frame]
    M --> N
    N --> O[queue_frame_reusable]
    O --> P[H264 Encoder with export settings]
    P --> Q[MP4 Output]
```

Last reviewed commit: 5db2514