Update workflow benchmarks docs with TRT GPU results (#2289)
Refresh self-hosted numbers, drop stddev columns in favor of a Workflow
Overhead column, and note the run was on a server-grade NVIDIA GPU using
TensorRT models called directly via the inference package.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All times are in **milliseconds**. Benchmarks were run on a server-grade NVIDIA GPU using TensorRT-optimized models, called directly via the `inference` Python package. Each model was warmed up before timing, then measured over 10 iterations.
+
+ | Model | Avg Direct (ms) | Avg Workflow (ms) | Workflow Overhead (ms) |
- - **Workflow overhead is minimal**: typically 1–5 ms on top of direct inference, depending on the model.
- - For larger models (e.g. `rfdetr-xlarge`), the workflow overhead becomes negligible relative to model inference time (~1.4%).
- - For smaller, faster models (e.g. `yolo26n-640`), the overhead is proportionally larger (~25%) but still under 2 ms in absolute terms.
+ - **Workflow overhead is minimal**: typically 1.5–2.6 ms on top of direct inference, regardless of model size.
+ - For larger models (e.g. `rfdetr-xlarge`), the workflow overhead is small relative to model inference time (~26%).
+ - For smaller, faster models (e.g. `yolo26n-640`), the overhead is proportionally larger but still under 2 ms in absolute terms.
  - Workflow overhead is CPU-bound (graph scheduling, input preparation, output routing), while model inference itself typically runs on the GPU (when available). As a result, the overhead stays relatively constant regardless of GPU speed.
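
The measurement procedure described in the docs (warm up, time 10 iterations, report milliseconds, then compare direct and workflow averages) could be sketched roughly as follows. This is a minimal illustration, not the script used for the benchmark: `time_ms`, `overhead_report`, `run_direct`, and `run_workflow` are hypothetical names, the warm-up count is an assumption, and the `time.sleep` calls only stand in for real model and workflow calls. If a model call returns before the GPU work has finished, a synchronization step would be needed before stopping the timer.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict


def time_ms(fn: Callable[[], Any], warmup: int = 1, iterations: int = 10) -> float:
    """Return the average wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the measurement
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def overhead_report(direct_ms: float, workflow_ms: float) -> Dict[str, float]:
    """Absolute overhead in ms, and overhead relative to direct inference time."""
    overhead = workflow_ms - direct_ms
    return {
        "overhead_ms": overhead,
        "overhead_pct": 100.0 * overhead / direct_ms,
    }


if __name__ == "__main__":
    # Hypothetical stand-ins for a direct model call and a workflow run.
    def run_direct() -> None:
        time.sleep(0.005)

    def run_workflow() -> None:
        time.sleep(0.007)

    direct = time_ms(run_direct)
    workflow = time_ms(run_workflow)
    print(overhead_report(direct, workflow))
```

Because the relative figure divides a roughly constant overhead by the direct inference time, the same absolute overhead shows up as a larger percentage for faster models and a smaller one for slower models.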