
GFPGAN Speed UP PR #473

Open
wangzijian1010 wants to merge 30 commits into xlite-dev:main from wangzijian1010:start_0601


@wangzijian1010

Member

This pull request introduces a comprehensive GPU benchmarking framework and significant optimizations for the face restoration pipeline (GFPGAN) in the lite.ai.toolkit project. It adds a unified, backend-agnostic benchmarking utility, a dedicated face restoration benchmark executable, and refactors the core restoration code to enable detailed, per-stage timing and substantial speedups via CUDA kernel fusion and buffer reuse.

Benchmarking and Profiling Infrastructure:

  • Added a header-only, backend-agnostic benchmarking utility (profiler.h) that supports CPU and GPU timing, aggregates statistics (mean, percentiles, min/max), and can export results as aligned tables or CSV files. It provides convenient macros for instrumenting code and is designed for minimal overhead when disabled.

  • Introduced a new benchmark executable (lite_face_restoration_bench) for end-to-end profiling of the GFPGAN face restoration stage. This tool checks CPU vs GPU correctness for the paste-back operation, runs warmup and timed iterations, and reports per-stage and total timings, including CSV export.
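
As a rough illustration of the aggregation described above (per-stage samples reduced to mean, percentiles, and min/max, fed by a scoped timer), here is a minimal CPU-only sketch. `SketchProfiler`, `StageStats`, and `ScopedCpuTimer` are hypothetical names, not the interface of `profiler.h`, and the nearest-rank percentile rule is an assumption; the real header may interpolate differently and also supports CUDA-event timing.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names; this is not the actual profiler.h interface.
struct StageStats {
    double mean_ms, p50_ms, p90_ms, p99_ms, min_ms, max_ms;
};

class SketchProfiler {
public:
    void add_sample(const std::string& stage, double ms) {
        samples_[stage].push_back(ms);
    }

    // Aggregates one stage. Percentiles use the nearest-rank rule on a sorted copy.
    StageStats stats(const std::string& stage) const {
        std::vector<double> v = samples_.at(stage);
        assert(!v.empty());
        std::sort(v.begin(), v.end());
        double sum = 0.0;
        for (double x : v) sum += x;
        auto pct = [&v](double p) {
            std::size_t rank = static_cast<std::size_t>(std::ceil(p * v.size()));
            return v[rank == 0 ? 0 : rank - 1];
        };
        return StageStats{sum / v.size(), pct(0.50), pct(0.90), pct(0.99),
                          v.front(), v.back()};
    }

private:
    std::map<std::string, std::vector<double>> samples_;
};

// RAII timer: records the elapsed wall time of its scope into the profiler.
// With a null profiler the destructor records nothing.
class ScopedCpuTimer {
public:
    ScopedCpuTimer(SketchProfiler* profiler, std::string stage)
        : profiler_(profiler), stage_(std::move(stage)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedCpuTimer() {
        if (!profiler_) return;
        const auto end = std::chrono::steady_clock::now();
        profiler_->add_sample(
            stage_, std::chrono::duration<double, std::milli>(end - start_).count());
    }

private:
    SketchProfiler* profiler_;
    std::string stage_;
    std::chrono::steady_clock::time_point start_;
};
```

A stage would be wrapped as `{ ScopedCpuTimer t(&profiler, "paste_back"); /* work */ }`, and passing `nullptr` keeps the call site unchanged when profiling is off.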

Face Restoration Pipeline Optimizations:

  • Refactored the core TRTFaceFusionFaceRestoration class to support detailed per-stage profiling and major performance improvements:
    • Fused CPU pre/post-processing steps (affine warp, color conversion, normalization, tensor layout, paste-back) into CUDA kernels with buffer reuse and asynchronous memory copies.
    • Cached the static box mask to avoid redundant computation.
    • Exposed a new restore method that returns the restored frame (without disk I/O) and accepts an optional profiler for instrumentation.

Documentation and Results:

  • Added a detailed "Benchmark" section to the README.md, summarizing the optimization strategies, measured speedups (up to 4.4× end-to-end), and per-stage breakdowns, with a reproducible methodology and environment.

Summary of Most Important Changes:

Benchmarking Infrastructure:

  • Added a unified, header-only benchmarking utility (lite/bench/profiler.h) for CPU and GPU timing, stats aggregation, and CSV export.
  • Introduced a dedicated face restoration benchmark executable (lite_face_restoration_bench) and its CMake target.

Face Restoration Pipeline Optimization:

  • Refactored TRTFaceFusionFaceRestoration to support detailed per-stage profiling, GPU kernel fusion, buffer reuse, and a new restore method for compute-only evaluation.

Documentation:

  • Added a comprehensive "Benchmark" section to the README.md, including performance results and optimization details.

wangzijian1010 and others added 5 commits May 30, 2026 22:37
…mark harness

Phase 1 first cut on the TensorRT face-restoration (GFPGAN) stage. Profiling showed
paste_back was ~50% of the pipeline: two full-frame cv::warpAffine on the CPU plus
per-call cudaMalloc/free and synchronous copies. Rewrote it as a single inverse-mapping
CUDA kernel with reused device buffers and pinned/async copies.

Measured on RTX 4090 / TRT 10.1 fp32, gfpgan 512:
  paste_back 39.07ms -> 2.37ms (16.5x); end-to-end 78.2ms -> 30.4ms; 12.8 -> 32.9 FPS.
CPU vs GPU output is numerically equivalent (max|diff| = 2/255).
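
For readers who want to reproduce this kind of equivalence check, a minimal sketch of the two metrics quoted in this PR (max absolute difference and PSNR) follows. `compare_u8` and `DiffReport` are hypothetical names, and the standard 8-bit PSNR definition with peak 255 is assumed; the benchmark binary may compute or cap these values differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical helper, not part of the PR.
struct DiffReport {
    int max_abs_diff;  // largest per-byte difference, in 0..255
    double psnr_db;    // 10 * log10(255^2 / MSE); infinity when bit-exact
};

DiffReport compare_u8(const std::vector<std::uint8_t>& a,
                      const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size() && !a.empty());
    int max_diff = 0;
    double sq_sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int d = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
        max_diff = std::max(max_diff, d);
        sq_sum += static_cast<double>(d) * d;
    }
    const double mse = sq_sum / static_cast<double>(a.size());
    const double psnr = (mse == 0.0)
                            ? std::numeric_limits<double>::infinity()
                            : 10.0 * std::log10(255.0 * 255.0 / mse);
    return DiffReport{max_diff, psnr};
}
```

Under this definition a `max_abs_diff` of 2 corresponds to the "2/255" figure above when the buffers are the flattened CPU and GPU output frames.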

Per-file changes:
- lite/bench/profiler.h: new header-only, backend-agnostic profiler. CPU chrono +
  CUDA-event timers, mean/p50/p90/p99 + FPS aggregation, CSV export, scoped-timer macros.
- lite/trt/kernel/paste_back.cu, paste_back.cuh: add paste_back_fused_kernel —
  per-output-pixel inverse mapping (uses the original->crop affine directly, no inversion),
  bilinear crop/mask sampling with border-0, blend, writes uint8. Old paste_back_kernel kept.
- lite/trt/kernel/paste_back_manager.cpp, paste_back_manager.h: add PasteBackGPU, which owns
  reusable device + pinned buffers (ensure_capacity) and drives the fused kernel. Old CPU
  launch_paste_back kept for A/B reference. Comments translated to English.
- lite/trt/cv/trt_face_restoration.cpp, trt_face_restoration.h: factor detect() into restore()
  (returns the frame, no disk write, optional per-stage Profiler); detect() now calls
  restore() + imwrite; restore() uses the GPU PasteBackGPU member instead of CPU paste_back.
- examples/lite/cv/test_lite_face_restoration_bench.cpp: new benchmark — CPU-vs-GPU
  paste_back equivalence check + compute-only per-stage latency/FPS over N iterations.
- examples/lite/CMakeLists.txt: register the lite_face_restoration_bench executable.
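
The per-output-pixel inverse mapping described for paste_back_fused_kernel can be sketched on the CPU as below. This is a reference-style illustration under assumptions, not the kernel itself: `paste_back_reference`, `Affine2x3`, and `sample_border0` are hypothetical names, the crop and mask are taken as float buffers, and rounding to uint8 uses round-to-nearest. Each loop iteration corresponds to what one GPU thread would do.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major 2x3 affine mapping original-frame coordinates to crop coordinates.
struct Affine2x3 { float m[6]; };

// Bilinear sample of an interleaved float image; taps outside the image read as 0.
static float sample_border0(const std::vector<float>& img, int w, int h,
                            int channels, int c, float u, float v) {
    const int x0 = static_cast<int>(std::floor(u));
    const int y0 = static_cast<int>(std::floor(v));
    const float fx = u - x0, fy = v - y0;
    auto at = [&](int x, int y) -> float {
        if (x < 0 || y < 0 || x >= w || y >= h) return 0.0f;
        return img[(static_cast<std::size_t>(y) * w + x) * channels + c];
    };
    return (1 - fx) * (1 - fy) * at(x0, y0) + fx * (1 - fy) * at(x0 + 1, y0) +
           (1 - fx) * fy * at(x0, y0 + 1) + fx * fy * at(x0 + 1, y0 + 1);
}

// frame: HWC BGR uint8 (fw x fh). crop: HWC float, 3 channels (cw x ch).
// mask: single-channel float in [0,1] (cw x ch).
std::vector<std::uint8_t> paste_back_reference(
    const std::vector<std::uint8_t>& frame, int fw, int fh,
    const std::vector<float>& crop, const std::vector<float>& mask,
    int cw, int ch, const Affine2x3& M) {
    assert(frame.size() == static_cast<std::size_t>(fw) * fh * 3);
    std::vector<std::uint8_t> out(frame.size());
    for (int y = 0; y < fh; ++y) {
        for (int x = 0; x < fw; ++x) {
            // Use the original->crop affine directly; no matrix inversion needed.
            const float u = M.m[0] * x + M.m[1] * y + M.m[2];
            const float v = M.m[3] * x + M.m[4] * y + M.m[5];
            const float a = sample_border0(mask, cw, ch, 1, 0, u, v);
            for (int c = 0; c < 3; ++c) {
                const std::size_t idx = (static_cast<std::size_t>(y) * fw + x) * 3 + c;
                const float s = sample_border0(crop, cw, ch, 3, c, u, v);
                const float blended = frame[idx] * (1.0f - a) + s * a;
                const long r = std::lround(blended);
                out[idx] = static_cast<std::uint8_t>(std::min(255L, std::max(0L, r)));
            }
        }
    }
    return out;
}
```

Because every output pixel pulls its own samples, pixels whose mapped coordinates fall outside the crop get a mask weight of 0 and keep the original frame value, which is what replaces the two full-frame warps with a single pass.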

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…20.9ms->17.7ms)

Phase 1 continued on the face-restoration stage. Per-stage profiling of preprocess showed
two wins: (1) create_static_box_mask was rebuilt every frame (a large-kernel GaussianBlur,
~10ms) although it only depends on the fixed 512 crop size; (2) bgr2rgb + normalize +
HWC->CHW ran on the CPU (~3.2ms) and then a separate H2D copied the tensor to the device.

Mask is now built once and cached. bgr2rgb/normalize/CHW are fused into a single CUDA kernel
that writes the normalized RGB CHW tensor straight into the inference input buffer, removing
the per-frame mask rebuild, the CPU tensor build, and the separate H2D.

Measured on RTX 4090 / TRT 10.1 fp32, gfpgan 512 (cumulative from the 78.2ms baseline):
  preprocess 14.4ms -> 1.3ms; end-to-end 30.8ms -> 17.7ms; 32.4 -> 56.6 FPS.
Output unchanged (PSNR 59.5 dB vs the pre-change result).

Per-file changes:
- lite/trt/kernel/face_restoration_preprocess.cu, face_restoration_preprocess.cuh: new
  face_restoration_preprocess_kernel — one thread per crop pixel, reads interleaved BGR uint8,
  writes planar RGB float (CHW) normalized to [-1,1] (v/127.5 - 1).
- lite/trt/kernel/face_restoration_preprocess_manager.cpp, face_restoration_preprocess_manager.h:
  new FaceRestorePreprocessGPU, owns reusable device + pinned staging buffers and launches the
  kernel writing directly into the inference input buffer.
- lite/trt/cv/trt_face_restoration.h: add FaceRestorePreprocessGPU member, a cached box mask
  member, and the new preprocess-manager include.
- lite/trt/cv/trt_face_restoration.cpp: cache the static box mask; replace CPU
  bgr2rgb/normalize/create_tensor with the fused GPU preprocess into buffers[0]; drop the now
  redundant H2D in the inference step; add per-substage profiler scopes under preprocess.
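
The layout and normalization performed by the fused preprocess step can be illustrated with a small CPU equivalent. `bgr_hwc_to_rgb_chw` is a hypothetical name; the sketch only mirrors the mapping stated above (interleaved BGR uint8 in, planar RGB float out, `v/127.5 - 1`), with one loop iteration standing in for one GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU equivalent of the fused preprocess mapping.
// Input:  interleaved BGR uint8, HWC, w*h*3 bytes.
// Output: planar RGB float, CHW, normalized to [-1, 1].
std::vector<float> bgr_hwc_to_rgb_chw(const std::vector<std::uint8_t>& bgr,
                                      int w, int h) {
    const std::size_t plane = static_cast<std::size_t>(w) * h;
    assert(bgr.size() == plane * 3);
    std::vector<float> out(plane * 3);
    for (std::size_t p = 0; p < plane; ++p) {  // one iteration per crop pixel
        for (int c = 0; c < 3; ++c) {
            // Output channel c (R, G, B) reads input channel 2 - c (B, G, R).
            out[c * plane + p] = bgr[p * 3 + (2 - c)] / 127.5f - 1.0f;
        }
    }
    return out;
}
```

On the GPU the same index arithmetic lets the kernel write straight into the inference input buffer, which is why the separate CPU tensor build and H2D copy disappear.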

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ation stage

Document the FaceFusion face-restoration (GFPGAN 1.4) TensorRT optimization with a
stage-by-stage baseline vs optimized table (RTX 4090, TRT 10.1, FP32, 512x512):
end-to-end 78.2ms -> 17.7ms (4.4x), 12.8 -> 56.6 FPS. Includes setup/methodology
(warmup, iterations, p50, compute-only) and how to reproduce via lite_face_restoration_bench.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ffort

Position the work as a repeatable playbook (built-in profiler + moving CPU pre/post
into fused CUDA kernels) applied across the whole FaceFusion pipeline, with a per-stage
optimization-status table (detect/landmarks/recognize/swap = WIP, restoration done,
FP16 next). Keep the GFPGAN face-restoration result as a worked deep-dive rather than
the whole story.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
One row per algorithm (before / after / speedup / what changed) so the log reads as a
growing list of optimized algorithms rather than a single case study; FaceFusion face
restoration is the first entry, further algorithms are placeholders. Move the GFPGAN
per-stage breakdown into a collapsible details block.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR adds a unified benchmarking/profiling utility and a dedicated GFPGAN face-restoration benchmark binary, and refactors the TensorRT GFPGAN restoration path to enable per-stage timing and accelerate pre/post-processing via new CUDA kernels (preprocess fusion + fused paste-back with buffer reuse).

Changes:

  • Introduces a header-only lite/bench/profiler.h to collect CPU/GPU timing samples, aggregate stats, and export CSV.
  • Adds lite_face_restoration_bench example executable for end-to-end GFPGAN benchmarking and CPU-vs-GPU paste-back equivalence checking.
  • Refactors TRTFaceFusionFaceRestoration to expose a compute-only restore() API and integrate fused CUDA preprocess + fused paste-back with buffer reuse and mask caching.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| README.md | Documents benchmarking methodology and reported speedups/breakdowns. |
| lite/trt/kernel/paste_back.cuh | Declares fused paste-back kernel API. |
| lite/trt/kernel/paste_back.cu | Implements fused inverse-mapping paste-back CUDA kernel. |
| lite/trt/kernel/paste_back_manager.h | Adds PasteBackGPU reusable GPU paste-back wrapper. |
| lite/trt/kernel/paste_back_manager.cpp | Implements PasteBackGPU buffer management and kernel launch. |
| lite/trt/kernel/face_restoration_preprocess.cuh | Declares fused preprocess CUDA kernel (BGR->RGB, normalize, HWC->CHW). |
| lite/trt/kernel/face_restoration_preprocess.cu | Implements fused preprocess CUDA kernel. |
| lite/trt/kernel/face_restoration_preprocess_manager.h | Adds reusable GPU preprocess wrapper writing directly to TRT input buffer. |
| lite/trt/kernel/face_restoration_preprocess_manager.cpp | Implements preprocess buffer reuse + kernel launch. |
| lite/trt/cv/trt_face_restoration.h | Adds restore() API and profiling hook; wires in new GPU helpers + mask cache. |
| lite/trt/cv/trt_face_restoration.cpp | Refactors restore pipeline into timed stages and uses new GPU preprocess/paste-back. |
| lite/bench/profiler.h | Adds benchmarking/profiling utility and instrumentation macros. |
| examples/lite/cv/test_lite_face_restoration_bench.cpp | Adds GFPGAN restoration benchmark executable and equivalence check. |
| examples/lite/CMakeLists.txt | Registers lite_face_restoration_bench build target. |


Comment on lines +133 to +135
cv::Mat crop = crop_vision_frame.isContinuous() ? crop_vision_frame : crop_vision_frame.clone();
cv::Mat mask = crop_mask.isContinuous() ? crop_mask : crop_mask.clone();

Comment thread lite/trt/cv/trt_face_restoration.cpp Outdated
Comment on lines +62 to +66
launch_face_restoration_postprocess(
static_cast<float *>(buffers[1]), transposed_data.data(), 3, 512, 512);
std::vector<float> transposed_data_float(transposed_data.begin(), transposed_data.end());
cudaStreamSynchronize(stream);

Comment on lines +13 to +15
// GPU fused version: inverse-mapping sampling + blend entirely in the kernel, reused device
// buffers, pinned + async copies. Numerically equivalent to launch_paste_back; returns full-frame BGR uint8.
class PasteBackGPU {
Comment thread lite/bench/profiler.h
Comment thread README.md Outdated
## ⚡ Benchmark 🔥
<div id="benchmark"></div>

GPU-inference optimization log. For each algorithm we profile it with a built-in, backend-agnostic harness ([`lite/bench/profiler.h`](https://github.com/xlite-dev/lite.ai.toolkit/blob/main/lite/bench/profiler.h)), then move the CPU pre/post-processing (affine warp, color convert, normalize, tensor layout, paste-back, NMS …) into **fused CUDA kernels** with reused device buffers and pinned + async copies, so the algorithm spends its time on real inference instead of host glue and `cudaMalloc`/sync round-trips. All numbers are **RTX 4090 · TensorRT 10.1 · CUDA 12.4**, median (p50), compute-only, reproducible via the `lite_*_bench` binaries.

Member


For the benchmark section, I suggest using the latest TensorRT and CUDA 13+.

wangzijian1010 and others added 18 commits June 5, 2026 22:48
…duct)

MNN/NCNN/TNN were thin per-model wrappers with high maintenance surface and no differentiated value. They are frozen on branch/tag `v0.2-all-backends` and removed from the active line.

ONNXRuntime is retained deliberately: it is the numerical-reference oracle for CPU/GPU equivalence checks and the only backend that can build the test suite (CMake forces ENABLE_TEST=OFF without it). TensorRT remains the maintained high-performance product line.

- delete lite/{mnn,ncnn,tnn}/ (480 files) + cmake/{MNN,ncnn,TNN}.cmake + docs/hub/*.{mnn,ncnn,tnn}.md
- CMakeLists.txt: drop ENABLE_{MNN,NCNN,TNN} options; require ORT
- cmake/utils.cmake + lite.ai.toolkit.cmake.in: drop backend wiring
- lite/config.h.in: drop cmakedefines
- lite/models.h: drop include/typedef blocks; alias resolves to ORT

Example backend stubs stay guarded by #ifdef and compile out. Not yet build-verified (no GPU host available); pending remote 4090 sync via scripts/remote.sh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rewrite the README from the upstream "300+ models, 5 backends" catalog into a focused flagship narrative: FaceFusion face-swap on TensorRT, with the GPU-optimization benchmark as the hero. 1152 -> 183 lines.

- lead with the benchmark (GFPGAN 4.4x) and the FaceFusion pipeline
- TensorRT-first build + verified flagship/YOLOv5 GPU code samples
- drop MNN/NCNN/TNN columns, the per-model code dump, Mixed-with-MNN + docker-hub sections
- 100+ ORT CV models demoted to a one-line pointer to docs/hub
- note the legacy multi-backend build is frozen at tag v0.2-all-backends
- keep all xlite-dev URLs / assets / citation unchanged

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… pipeline

Minimal-viable path to actually *use* the flagship without editing source:

- lite_facefusion_cli: argv-driven runner (engine_dir + source/target/output), guarded by ENABLE_TENSORRT, using the verified FaceFusionPipeLine API
- scripts/build_facefusion_engines.sh: one trtexec per model -> 5 engines (GFPGAN kept FP32 on purpose)
- docs/facefusion_quickstart.md: build -> get ONNX -> build engines -> run
- README: out-of-box run block + quickstart link
- remove dead test_lite_facefusion_pipeline_gpu.cpp (referenced a missing header, never registered in CMake)

Not yet build-verified (no GPU host); pending remote 4090 via scripts/remote.sh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pipeline had ~5 latent crashes on bad input. Replace them with explicit, message-carrying exceptions (a prerequisite for shipping to users / wrapping in a binding):

- constructor: check all 5 engine files exist before deserializing
- detect: check source/target images actually loaded (imread != empty)
- detect: check a face was found and the face index is in range, on both source and target, instead of out-of-bounds vector access
- drop a dead, unconditional out-of-bounds access (target_test_bounding_box)

No API or happy-path behavior change. Not build-verified (no GPU host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mark

Add an opt-in Profiler* to TRTFaceFusionPipeLine::detect() (LITE_CPU_SCOPE_OPT, zero-overhead when null) breaking the pipeline into imread / detect / landmark (x2) / recognizer / swap / restoration, plus lite_facefusion_pipeline_bench to drive it. Also tidies the dead commented-out multithread lines in detect().

First whole-pipeline measurement (4090, fp32, 1024x768) reframes the bottleneck: no single dominant stage -- restoration 27% / detect 24% / swap 19% / imread 15% (disk I/O) / landmark 12%. Surfaced a CUDA OOM (per-iteration memory growth, likely the swap paste-back) to fix next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eline

Two bugs surfaced by the whole-pipeline benchmark (CUDA OOM after ~13 frames on a 24GB 4090, GPU memory growing geometrically ~4x per frame):

1. nms_cuda_manager: perform_nms() had 'if (true) init(max(num_boxes, max_boxes_num*2))', re-allocating the NMS device buffers at DOUBLE the size on every call (x4 per frame since detect runs twice) -> geometric growth -> OOM. Now 'init(num_boxes)', a no-op once capacity fits. Side benefit: detect ~20% faster (no per-frame reallocs) and stable timing.

2. trt_face_swap: crop_list (a member) was emplace_back'd every call and never cleared -> unbounded host growth. Only crop_list[0] is used and the mask is constant, so clear() first.

Verified: 50 frames, GPU memory flat at 2636 MiB (was OOM at ~13).
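
To make the first bug concrete, the two sizing policies can be compared in isolation. The functions below are hypothetical stand-ins for the `init(...)` argument before and after the fix, with plain integers in place of device allocations; they assume, as the message above states, that `init` is a no-op once the existing capacity fits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Policy described as the bug: every call requests at least double the
// current capacity, so the buffer grows geometrically even for a fixed input.
std::size_t next_capacity_buggy(std::size_t num_boxes, std::size_t capacity) {
    return std::max(num_boxes, capacity * 2);
}

// Grow-only policy: the capacity changes only when the request exceeds it.
std::size_t next_capacity_grow_only(std::size_t num_boxes, std::size_t capacity) {
    return std::max(num_boxes, capacity);
}

// Replays `calls` invocations with a constant box count and returns the final capacity.
std::size_t simulate(std::size_t (*policy)(std::size_t, std::size_t),
                     std::size_t num_boxes, std::size_t start_capacity, int calls) {
    std::size_t capacity = start_capacity;
    for (int i = 0; i < calls; ++i) capacity = policy(num_boxes, capacity);
    return capacity;
}
```

With two detect calls per frame, the doubling policy multiplies the capacity by 4 each frame, which matches the geometric growth reported above; the grow-only policy stays flat once the first allocation fits.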

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…y benchmark

Add 'cv::Mat detect(src, tgt) -> Mat' that does NO disk I/O (uses restoration's restore() instead of writing to a path); the file-path detect() becomes a thin wrapper (imread -> core -> imwrite). This is both the production API (video/server feed frames, not paths) and what lets the benchmark measure compute only.

The pipeline bench now decodes the two images once, outside the loop, and times the Mat-based detect. Effect (4090, fp32, 50 frames): TOTAL 90.5 -> 63.8 ms (the ~27ms removed was pure imread x2 + imwrite); restoration 27.4 -> 15.1 ms (it had been inflated by a per-frame full-frame imwrite). Real compute bottleneck: detect 28% / swap 26% / restoration 24% / landmark 17% / recognizer 3%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The face-swap stage still used the old CPU launch_paste_back (two full-frame warpAffine + per-frame cudaMalloc/sync), while restoration had already moved to the GPU-fused PasteBackGPU. Give swap a PasteBackGPU member and reuse the same kernel (inverse-mapping sample + blend, reused device buffers, pinned/async), numerically equivalent to the CPU path (max|diff|=2/255).

4090, fp32, compute-only, 50 frames: swap 16.9 -> 9.4 ms; pipeline TOTAL 63.8 -> 56.2 ms (15.7 -> 17.8 FPS). Output verified visually unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… -> 4.2ms)

Profiling the detect stage showed it was glue-bound, not infer-bound: of ~9.6ms/call, preprocess was 4.9ms (CPU normalize: resize + copyMakeBorder + split + 3x convertTo + merge + create_tensor) while TRT inference was only 1.4ms -- so FP16 would barely help it.

Moved normalize + BGR HWC->CHW into a fused CUDA kernel (YoloFacePreprocessGPU, reused device/pinned buffers) writing straight into the inference input buffer; only resize+letterbox stays on CPU. Also dropped a wasted full-output D2H copy (generate_box reads buffers[1] directly) and the now-dead normalize() method.

4090, fp32, compute-only, 50 frames: detect 9.6 -> 4.2 ms/call (preprocess 4.9 -> 0.8); pipeline TOTAL 56.2 -> 46.2 ms (17.8 -> 21.6 FPS). Output verified visually unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Solves the previously blocked GFPGAN FP16-on-TRT10.1 problem. A naive --fp16 build produces grey blocks (the StyleGAN modulated-conv demodulation overflows in FP16), and the TRT11 'Cast strong-typing' route crashes on TRT10.1 (matchTypeSpec). This route works: weak FP16 + OBEY_PRECISION_CONSTRAINTS + a per-layer FP32 pin on the style_conv*/to_rgb* float layers via the builder API (TRT10.1 python wheel).

Verified on 4090/TRT10.1: restoration infer 10.8 -> 8.0 ms, output numerically clean (no grey blocks); facefusion pipeline 36.6 -> 33.1 ms (27.4 -> 30.3 FPS).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The detect-preprocess refactor (7603793) dropped the old srcimg.clone(); when detect() is given a ROI/submatrix and no resize happens, copyMakeBorder(BORDER_CONSTANT) pulls parent-image pixels (outside the ROI) into the pad region instead of zeros, corrupting the detector input. Add BORDER_ISOLATED to restore the clone()-based behavior. (Latent: the facefusion pipeline passes a clone, so it was not active, but detect() is a public API.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… path

- trt_face_swap: load the model_matrix .npy ONCE in the constructor (was load_npy() from disk every frame) and cache the static 128 box mask (was rebuilt every frame). Swap result byte-identical.

- remove hot-path std::cout spam from trt_face_swap (Face_Swap timing prints + 'done!') and trt_face_recognizer ('done!') -- they pollute stdout and add per-frame latency/jitter.

Correctness/hygiene win (no per-frame disk syscall, clean output) rather than a big speedup -- the swap stage's real cost is the CPU CHW->HWC transpose + paste-back (addressed later by the device-path work). Output verified unchanged on a real source/target swap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ource caching)

The face-swap source is fixed across a video / session, so re-running detect+landmark+recognize on it every frame is wasted work. Split the pipeline: prepare_source(src,idx) caches the source embedding once; process(tgt,idx)->Mat runs per frame (detect+landmark target, swap with cached source, restore). detect(src,tgt) kept as a one-shot convenience.

The bench now mirrors real video use (prepare source once, time per-frame process). 4090, fp16 + mixed-gfpgan, compute-only: per-frame 38 -> 27.9 ms (26 -> 36 FPS) -- the ~10ms source branch (detect_src+landmark_src+recognizer) is off the per-frame path. Output verified unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…the GPU kernel

The restoration transpose kernel emitted uint8 RGB, forcing a CPU uint8->float copy + cv::cvtColor(RGB2BGR) + a clone() before paste-back. Make the kernel write HWC BGR float[0,255] directly; the cv::Mat now aliases that buffer (no clone). Removes ~1.3ms of CPU glue.

4090, mixed-gfpgan, restoration bench: postprocess 3.33 -> 2.02 ms (transpose+dl 1.45->0.57, cvtColor folded away), restoration 12.6 -> 11.2 ms. Output verified color-correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ice-pipeline brick)

Replace the CPU cv::warpAffine in restoration preprocess with an NPP affine warp
whose output stays on the GPU, feeding the fused bgr2rgb+normalize+CHW kernel
directly — the 512 crop no longer makes a D2H/H2D round-trip.

- face_utils: split estimate_affine_by_landmark_5 out of warp_face_by_face_landmark_5
  so the affine estimate stays on CPU while the warp moves to the GPU.
- WarpAffineNpp::warp_to_device: nppiWarpAffine into an internal device buffer on the
  caller's stream, returns the device crop pointer (no D2H, no sync). nppSetStream for
  correct ordering with the consumer kernel.
- FaceRestorePreprocessGPU::run_device: launch the fused preprocess kernel straight on
  a device crop pointer (skips the H2D the cv::Mat path needed).
- cmake: link NPP (nppc/nppig/nppidei), ships with CUDA — no new dependency.

Validated on RTX 4090: output PSNR 55.63 dB vs the CPU-warp result (max|diff|=11,
numerically close but not bit-identical), memory flat (no leak), 28.6 FPS. This is the first
device-resident brick toward keeping frames GPU-resident across the whole pipeline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… -3ms restoration)

The grey-block problem that forced GFPGAN to FP32 is solved: build_gfpgan_fp16_engine.py
keeps only the StyleGAN style_conv/to_rgb layers in FP32 and runs the rest in FP16. That
output is numerically close to the FP32 engine's (PSNR 57.8 dB on the test pair) while
cutting the restoration stage 15.1->12.1 ms (pipeline 28.6->31.2 FPS on an RTX 4090).

A naive --fp16 GFPGAN (or an over-aggressive pin) instead produces a grey halo around the
pasted-back face (PSNR ~20 dB) — verified by A/B, so it is NOT used.

- build_facefusion_engines.sh: GFPGAN now built via build_gfpgan_fp16_engine.py as
  gfpgan_1.4_mixed.engine; GFPGAN_FP32=1 falls back to the plain FP32 engine.
- facefusion CLI: default restoration engine is gfpgan_1.4_mixed.engine, falling back to
  gfpgan_1.4_fp32.engine if the mixed one isn't present.
- quickstart doc: document the mixed build + the TensorRT python-wheel requirement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (shared warp+paste)

First brick of the device pipeline: a DeviceFrame holds the input frame resident in device
memory, so restoration's NPP warp and paste-back both read it straight from the device instead
of each doing its own full-frame H2D.

- DeviceFrame: upload-once / download-once full-frame (HWC BGR uint8) device buffer, reused
  device + pinned staging across frames.
- WarpAffineNpp::warp_device_to_device: NPP warp whose source is already on the device (no H2D).
- PasteBackGPU: overload taking a device temp pointer (skips the full-frame H2D); the existing
  cv::Mat path and the new device path now share a private run() tail.
- restoration: upload face_swap_image once, warp + paste_back both read input_frame_.data().

Removes one full-frame H2D per frame. Output is bit-exact (PSNR 99 dB, max|diff|=0 vs the
pre-brick result); restoration 12.1->11.7 ms, memory flat. Brick 2 (paste output stays on
device + fold blend) and brick 3 (weld the swap->restoration seam) build on this.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…the paste kernel

The face-enhancer blend_frame(target 0.2 / paste 0.8) is algebraically just the paste with the
mask scaled by 0.8: result = temp*(1 - 0.8*mask) + crop*(0.8*mask). Add a blend_alpha to the
paste-back kernel (default 1.0 = plain paste, swap unchanged) and have restoration pass 0.8,
dropping the separate full-frame CPU cv::addWeighted and the now-dead ori_image clone.

restoration 11.7->10.6 ms (the blend ran on the full ~1024x768 frame), TOTAL 31.8->31.0,
31.4->32.2 FPS. Output numerically equivalent (PSNR 56.5 dB, max|diff|=8, attributed to float
rounding order vs addWeighted) and visually identical.
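
The algebra above can be checked per pixel with a few lines of code. Expanding the two-step form gives `0.2*temp + 0.8*(temp*(1 - mask) + crop*mask) = temp*(1 - 0.8*mask) + crop*(0.8*mask)`. The function names below are hypothetical and operate on a single float value rather than a full frame.

```cpp
#include <cassert>
#include <cmath>

// Plain paste: blend the crop into the frame value by the mask weight.
float paste(float temp, float crop, float mask) {
    return temp * (1.0f - mask) + crop * mask;
}

// Two-step form: paste first, then blend the pasted result with the original
// frame value (weights 1 - alpha and alpha, as addWeighted would).
float paste_then_blend(float temp, float crop, float mask, float alpha) {
    const float pasted = paste(temp, crop, mask);
    return temp * (1.0f - alpha) + pasted * alpha;
}

// Folded form: a single paste with the mask scaled by alpha.
float paste_with_alpha(float temp, float crop, float mask, float alpha) {
    return paste(temp, crop, alpha * mask);
}
```

The two forms agree up to float rounding, and `alpha = 1.0` reduces the folded form to the plain paste, which is why the swap stage is unaffected by the new parameter.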

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wangzijian1010 and others added 6 commits June 7, 2026 20:13
…D2H/H2D round-trip)

The swapped full frame now stays GPU-resident across the swap->restoration boundary: swap
pastes straight into a DeviceFrame (no D2H) and restoration reads it from the device (no
upload), eliminating the fattest pair of full-frame copies in the pipeline.

- PasteBackGPU: paste_back_to_device() writes into a caller DeviceFrame (no D2H); refactor into
  upload_temp() + run_core() shared by the host-Mat and device paths.
  FIX: run() now passes nullptr to run_core and resolves d_out_ AFTER ensure_capacity may have
  reallocated it — passing d_out_ directly used a stale/null pointer on the first frame (would
  write through a bad pointer and crash the host paste path).
- face_swap: detect(...,DeviceFrame&) overload (swap_core shared with the cv::Mat path); syncs
  its stream so restoration can safely read the frame on its own stream.
- restoration: restore(const DeviceFrame&) overload (restore_core shared with the cv::Mat path).
- pipeline: owns swapped_frame_; process() hands it swap -> restoration with no host bounce.
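
The stale-pointer fix in the PasteBackGPU item can be illustrated with host memory standing in for the device buffer. `PasteSketch` is a hypothetical class, not the PR's code; the point is only the ordering: the destination is resolved after the capacity check, so a `nullptr` argument meaning "use the internal buffer" never captures a pointer from before the (re)allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical sketch; a heap array stands in for the device output buffer.
class PasteSketch {
public:
    // out == nullptr means "write into the internal output buffer".
    const std::uint8_t* run_core(std::size_t n, std::uint8_t* out) {
        ensure_capacity(n);                              // may (re)allocate d_out_
        std::uint8_t* dst = out ? out : d_out_.get();    // resolve AFTER allocation
        std::fill(dst, dst + n, std::uint8_t{7});        // stand-in for the kernel
        return dst;
    }

    // Host path: defers the choice of destination to run_core.
    const std::uint8_t* run(std::size_t n) { return run_core(n, nullptr); }

    const std::uint8_t* internal() const { return d_out_.get(); }

private:
    void ensure_capacity(std::size_t n) {
        if (n <= capacity_) return;
        d_out_.reset(new std::uint8_t[n]);
        capacity_ = n;
    }

    std::unique_ptr<std::uint8_t[]> d_out_;
    std::size_t capacity_ = 0;
};
```

Had `run` passed `d_out_.get()` directly, the first call would have handed `run_core` a null pointer, since the argument is evaluated before `ensure_capacity` runs.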

The cv::Mat APIs (standalone swap/restoration tests, CLI) are unchanged. Verified on RTX 4090:
no crash, PSNR 99 dB (max|diff|=0, bit-exact), memory flat; swap 9.8->9.2 ms, restoration
10.6->9.8 ms, 32.2->32.8 FPS. This is the device-pipeline v0 across the swap->restoration seam.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…stproc -> paste, no bounce)

The restored 512 crop no longer round-trips through host: the postprocess kernel writes into a
reusable device buffer and paste-back reads it straight from there. restoration is now fully
device-resident internally — warp(device) -> infer -> postproc(device crop) -> paste(device temp
+ device crop) -> a single final D2H.

- FaceRestorePostprocessGPU: runs the transpose/cvtColor/denorm kernel into a reused device
  buffer (no per-call cudaMalloc, no D2H), returns the device crop pointer.
- PasteBackGPU: device-crop overload + upload_crop(); run_core now takes a device crop pointer so
  the host-Mat and device paths share it (mask/affine are the only remaining H2D in paste).
- restoration: postproc_gpu_ -> paste_back(device temp, device crop).

Verified on RTX 4090: PSNR 99 dB (max|diff|=0, bit-exact — same kernel, just kept on device),
memory flat. restoration 9.8->8.8 ms, 32.8->33.6 FPS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hared target upload

swap now uploads the target frame ONCE into target_dev_ and both the NPP warp (128 crop) and the
paste-back read it from there — no host-side warp, no redundant upload. This replaces the CPU
warpAffine + cvtColor + convertTo + create_tensor + the 128 CHW-tensor H2D with: one full-frame
H2D (which paste already did) -> GPU warp -> fused bgr2rgb+/255+CHW straight into buffers[0].

(An earlier attempt warped from a SECOND upload of the frame — uploading the full frame twice per
swap — which was a net loss; sharing the single target_dev_ upload is the fix.)

- preprocess kernel generalized to out = v*scale + bias (the reusable normalize+CHW template):
  restoration 1/127.5,-1; swap 1/255,0.
- swap: target_dev_ + warp_npp_ + preprocess_gpu_; swap_core warps from device; both detect()
  overloads paste from target_dev_.
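
The `out = v*scale + bias` generalization amounts to a per-stage pair of constants. A minimal sketch, with hypothetical names (`NormalizeParams`, `kRestorationNorm`, `kSwapNorm`) and the two presets taken from the values listed above:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical parameter struct for the shared normalize step.
struct NormalizeParams {
    float scale;
    float bias;
};

// Presets from the commit message: restoration maps [0,255] to [-1,1],
// swap maps [0,255] to [0,1].
constexpr NormalizeParams kRestorationNorm{1.0f / 127.5f, -1.0f};
constexpr NormalizeParams kSwapNorm{1.0f / 255.0f, 0.0f};

inline float normalize_pixel(std::uint8_t v, NormalizeParams p) {
    return v * p.scale + p.bias;
}
```

Keeping the layout conversion fixed and passing only these two floats is what lets one kernel serve both stages.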

Verified on RTX 4090: PSNR 57.0 dB vs the CPU-warp result (max|diff|=9, equivalent), clean
eyeball, memory flat. swap input is now fully device-resident (a structural step toward sharing one
target upload across detect/landmark/swap). Per-stage ms delta is within this box's ~1ms run-to-run
clock jitter, so the win here is structural + less CPU work, not a measurable single-stage speedup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… real FP16 numbers

- README: lead with the GPU-resident FaceFusion pipeline (DeviceFrame carrying full frames
  between stages, CUDA/NPP hot kernels); drop the ONNXRuntime badge; fix the legacy tag link.
- quickstart: update Performance to the real video-shaped benchmark — 23.6 ms/frame (42.3 FPS)
  on an RTX 4090 FP16, one full-frame H2D + one D2H, swap->restoration stays GPU-resident.
- .gitignore: ignore *.jpg (the local demo result images).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

3 participants