
Commit 8db9cca

Erol444 and claude authored
Update workflow benchmarks docs with TRT GPU results (#2289)
Refresh self-hosted numbers, drop stddev columns in favor of a Workflow Overhead column, and note the run was on a server-grade NVIDIA GPU using TensorRT models called directly via the inference package.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ceafea5 commit 8db9cca

1 file changed

Lines changed: 17 additions & 17 deletions


docs/workflows/benchmarks.md

@@ -4,26 +4,26 @@ This page compares **direct** [model inference](quickstart/inference_101.md) lat
 
 ## Self-Hosted Results
 
-All times are in **milliseconds**. Each model was warmed up before timing, then measured over 10 iterations.
-
-| Model | Avg Direct (ms) | Stddev Direct | Avg Workflow (ms) | Stddev Workflow |
-|-------|-----------------|---------------|-------------------|-----------------|
-| rfdetr-nano | 23.86 | 0.35 | 29.99 | 2.04 |
-| rfdetr-small | 29.50 | 0.20 | 31.36 | 0.36 |
-| rfdetr-medium | 33.97 | 0.54 | 34.78 | 0.23 |
-| rfdetr-large | 40.96 | 0.38 | 46.00 | 1.75 |
-| rfdetr-xlarge | 73.37 | 0.26 | 74.41 | 0.13 |
-| yolo26n-640 | 7.68 | 0.04 | 9.61 | 0.43 |
-| yolo26s-640 | 10.10 | 0.08 | 12.48 | 1.04 |
-| yolo26m-640 | 16.43 | 0.05 | 18.10 | 0.58 |
-| yolo26l-640 | 18.50 | 0.09 | 20.01 | 0.38 |
-| yolo26x-640 | 30.46 | 0.14 | 33.51 | 0.17 |
+All times are in **milliseconds**. Benchmarks were run on a server-grade NVIDIA GPU using TensorRT-optimized models, called directly via the `inference` Python package. Each model was warmed up before timing, then measured over 10 iterations.
+
+| Model | Avg Direct (ms) | Avg Workflow (ms) | Workflow Overhead (ms) |
+|-------|-----------------|-------------------|------------------------|
+| rfdetr-nano | 2.65 | 4.40 | 1.75 |
+| rfdetr-small | 3.17 | 4.62 | 1.45 |
+| rfdetr-medium | 3.79 | 5.38 | 1.59 |
+| rfdetr-large | 4.87 | 7.46 | 2.59 |
+| rfdetr-xlarge | 8.44 | 10.65 | 2.21 |
+| yolo26n-640 | 2.42 | 4.07 | 1.65 |
+| yolo26s-640 | 3.25 | 5.48 | 2.23 |
+| yolo26m-640 | 4.56 | 6.29 | 1.73 |
+| yolo26l-640 | 5.76 | 8.01 | 2.25 |
+| yolo26x-640 | 7.75 | 9.36 | 1.61 |
 
 ### Key Takeaways
 
-- **Workflow overhead is minimal** — typically 1–5 ms on top of direct inference, depending on the model.
-- For larger models (e.g. `rfdetr-xlarge`), the workflow overhead becomes negligible relative to model inference time (~1.4%).
-- For smaller, faster models (e.g. `yolo26n-640`), the overhead is proportionally larger (~25%) but still under 2 ms in absolute terms.
+- **Workflow overhead is minimal** — typically 1.5–2.6 ms on top of direct inference, regardless of model size.
+- For larger models (e.g. `rfdetr-xlarge`), the workflow overhead is small relative to model inference time (~26%).
+- For smaller, faster models (e.g. `yolo26n-640`), the overhead is proportionally larger but still under 2 ms in absolute terms.
 - Workflow overhead is CPU-bound (graph scheduling, input preparation, output routing), while model inference itself typically runs on the GPU (when available). As a result, the overhead stays relatively constant regardless of GPU speed.
 
 ### Methodology
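The new Workflow Overhead column and the percentages quoted in Key Takeaways can be rechecked from the added rows. The sketch below is not part of the commit: `RESULTS_MS` and `overhead_summary` are hypothetical names, and it assumes the overhead is simply Avg Workflow minus Avg Direct, expressed as a percentage of Avg Direct.

```python
# Figures copied from the added (+) rows of the Self-Hosted Results table:
# model -> (avg direct ms, avg workflow ms).
RESULTS_MS = {
    "rfdetr-nano": (2.65, 4.40),
    "rfdetr-small": (3.17, 4.62),
    "rfdetr-medium": (3.79, 5.38),
    "rfdetr-large": (4.87, 7.46),
    "rfdetr-xlarge": (8.44, 10.65),
    "yolo26n-640": (2.42, 4.07),
    "yolo26s-640": (3.25, 5.48),
    "yolo26m-640": (4.56, 6.29),
    "yolo26l-640": (5.76, 8.01),
    "yolo26x-640": (7.75, 9.36),
}


def overhead_summary(results):
    """Return {model: (overhead_ms, overhead_pct_of_direct)}.

    Assumption: overhead = workflow - direct, and the percentage is taken
    relative to the direct latency.
    """
    summary = {}
    for model, (direct_ms, workflow_ms) in results.items():
        overhead_ms = workflow_ms - direct_ms
        summary[model] = (
            round(overhead_ms, 2),
            round(100 * overhead_ms / direct_ms, 1),
        )
    return summary


if __name__ == "__main__":
    for model, (ms, pct) in overhead_summary(RESULTS_MS).items():
        print(f"{model:<14} {ms:5.2f} ms  {pct:5.1f}% of direct")
```

Under these assumptions the computed overheads match the new column, and `rfdetr-xlarge` comes out at about 26% of direct latency, as quoted. `yolo26n-640` comes out at about 68%, the figure the updated bullet no longer states.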
