cuda.bindings latency benchmarks part 4 by danielfrg · Pull Request #1959 · NVIDIA/cuda-python

danielfrg · 2026-04-21T19:50:08Z

Description

Follow-up to #1580.

Mostly generated by AI agents.

It also migrates most of the existing pytest-benchmarks; I will finish that in the next PR.

Results on my dev station:

----------------------------------------------------------------------------------
Benchmark                                   C++ (mean)   Python (mean)    Overhead
----------------------------------------------------------------------------------
ctx_device.ctx_get_current                        8 ns          113 ns     +105 ns
ctx_device.ctx_get_device                        10 ns          115 ns     +105 ns
ctx_device.ctx_set_current                       11 ns          101 ns      +90 ns
ctx_device.device_get                             6 ns          126 ns     +120 ns
ctx_device.device_get_attribute                   8 ns          194 ns     +186 ns
event.event_create_destroy                       99 ns          302 ns     +203 ns
event.event_query                                83 ns          204 ns     +121 ns
event.event_record                               92 ns          220 ns     +128 ns
event.event_synchronize                          97 ns          224 ns     +127 ns
launch.launch_16_args                          1.57 us         3.08 us    +1511 ns
launch.launch_16_args_pre_packed               1.57 us         1.97 us     +399 ns
launch.launch_2048B                            1.58 us         2.46 us     +877 ns
launch.launch_256_args                         2.35 us        16.57 us   +14221 ns
launch.launch_512_args                         3.25 us        31.50 us   +28245 ns
launch.launch_512_args_pre_packed              3.24 us         3.75 us     +503 ns
launch.launch_512_bools                        3.30 us        58.02 us   +54715 ns
launch.launch_512_bytes                        3.30 us        60.30 us   +56997 ns
launch.launch_512_doubles                      3.25 us        86.70 us   +83457 ns
launch.launch_512_ints                         3.12 us        62.23 us   +59110 ns
launch.launch_512_longlongs                    3.23 us        65.29 us   +62061 ns
launch.launch_empty_kernel                     1.51 us         1.85 us     +334 ns
launch.launch_small_kernel                     1.51 us         2.21 us     +700 ns
memory.mem_alloc_async_free_async               390 ns          731 ns     +342 ns
memory.mem_alloc_free                          1.61 us         1.98 us     +368 ns
memory.memcpy_dtod                             2.09 us         2.26 us     +168 ns
memory.memcpy_dtoh                             5.00 us         5.37 us     +369 ns
memory.memcpy_htod                             3.93 us         3.96 us      +26 ns
module.func_get_attribute                        14 ns          213 ns     +199 ns
module.module_get_function                       32 ns          180 ns     +148 ns
module.module_load_unload                      7.58 us         8.21 us     +636 ns
nvrtc.nvrtc_compile_program                 7239.55 us      7254.48 us   +14929 ns
nvrtc.nvrtc_create_program                       69 ns          681 ns     +612 ns
nvrtc.nvrtc_create_program_100_headers        11.42 us        13.13 us    +1710 ns
pointer_attributes.pointer_get_attribute         27 ns          487 ns     +460 ns
stream.stream_create_destroy                   3.16 us         3.46 us     +306 ns
stream.stream_query                              88 ns          224 ns     +136 ns
stream.stream_synchronize                       114 ns          243 ns     +129 ns
----------------------------------------------------------------------------------

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-04-21T19:50:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mdboom · 2026-04-21T21:00:55Z

All of these look good.

I continue to be worried by the high stddev on some of the C++ benchmarks:

nvrtc.nvrtc_compile_program: Mean +- std dev: 5999775 ns +- 207702 ns
stream.stream_create_destroy: Mean +- std dev: 3207 ns +- 495 ns

With a stddev this high, these differences don't really tell us anything:

nvrtc.nvrtc_compile_program                 5999.77 us      6002.39 us    +2611 ns
stream.stream_create_destroy                   3.21 us         3.51 us     +302 ns
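
One way to make this concern concrete is to compare each reported overhead with the C++ standard deviation quoted above. The sketch below is only an illustration: the numbers are copied from this comment, while the helper name `overhead_to_noise` and the "ratio below 1" rule of thumb are assumptions, not part of the benchmark suite.

```python
# Figures copied from the comment above (C++ std dev and reported overhead, in ns).
benchmarks = {
    "nvrtc.nvrtc_compile_program": {"cpp_std_ns": 207702, "overhead_ns": 2611},
    "stream.stream_create_destroy": {"cpp_std_ns": 495, "overhead_ns": 302},
}


def overhead_to_noise(overhead_ns, std_ns):
    """Hypothetical helper: how many C++ standard deviations the overhead spans."""
    return overhead_ns / std_ns


ratios = {
    name: overhead_to_noise(v["overhead_ns"], v["cpp_std_ns"])
    for name, v in benchmarks.items()
}

for name, ratio in ratios.items():
    # Assumed rule of thumb: a ratio below 1 means the overhead is smaller
    # than one standard deviation of the C++ measurement alone.
    verdict = "within noise" if ratio < 1 else "above noise"
    print(f"{name}: overhead is {ratio:.2f}x the C++ std dev ({verdict})")
```

Both ratios come out below 1, which is consistent with the point that these two overhead figures cannot be distinguished from measurement noise.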

rwgk · 2026-04-21T21:54:22Z

> I continue to be worried by the high stddev on some of the C++ benchmarks:

For a follow-up PR, I’d suggest a calibrated "slow/data-collection" mode: instead of fixed loop counts, first estimate the repetitions needed to reach a target sample time (similar to pyperf --min-time or Google Benchmark MinTime). That’s usually a good way to reduce relative noise while still keeping a fast developer pass.
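
A minimal sketch of what such a calibration step could look like, written in Python for brevity. The function names `calibrate_loops` and `measure`, the doubling strategy, and the default target time are all assumptions made for illustration; this is not how pyperf, Google Benchmark, or the benchmarks in this PR are implemented.

```python
import time


def calibrate_loops(func, min_time_s=0.1, max_loops=1 << 30):
    """Double the loop count until one timed sample lasts at least min_time_s."""
    loops = 1
    while True:
        start = time.perf_counter()
        for _ in range(loops):
            func()
        elapsed = time.perf_counter() - start
        if elapsed >= min_time_s or loops >= max_loops:
            return loops, elapsed
        loops *= 2


def measure(func, min_time_s=0.1, samples=5):
    """Calibrate once, then collect per-call timings with the calibrated loop count."""
    loops, _ = calibrate_loops(func, min_time_s)
    per_call = []
    for _ in range(samples):
        start = time.perf_counter()
        for _ in range(loops):
            func()
        per_call.append((time.perf_counter() - start) / loops)
    return loops, per_call


# Usage: a short target time keeps this example fast; a data-collection
# mode would use a larger min_time_s.
loops, per_call = measure(lambda: sum(range(100)), min_time_s=0.01, samples=3)
print(f"calibrated loops: {loops}")
print("per-call seconds:", [f"{t:.3e}" for t in per_call])
```

A fast developer pass could skip the calibration and keep the fixed loop counts, while the slower mode pays the calibration cost once per benchmark.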

danielfrg · 2026-04-21T22:05:58Z

Yes, agreed that we should take that on next. We are pretty much done with the P0s and P1s I wanted to target for bindings, so I'll work on improving that collection in the next PR.

danielfrg · 2026-04-21T22:06:18Z

/ok to test f5b1621

rwgk · 2026-04-21T22:39:29Z

(Linux runners appear to have a major outage.)

mdboom · 2026-04-22T13:18:15Z

> For a follow-up PR, I’d suggest a calibrated "slow/data-collection" mode: instead of fixed loop counts, first estimate the repetitions needed to reach a target sample time (similar to pyperf --min-time or Google Benchmark MinTime). That’s usually a good way to reduce relative noise while still keeping a fast developer pass.

Yes, that may be the issue. The way the data looks (random, rather than converging) also suggests it may be due to reusing the same process continuously. (The Python benchmarks start a number of fresh processes).
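
The fresh-process idea can be sketched as follows: each sample is taken in a newly started interpreter, and the parent only aggregates the results. This is a simplified illustration, not the runner used by the Python benchmarks here; the `WORKER` snippet, its empty loop body, and the helper name `run_in_fresh_processes` are placeholders.

```python
import statistics
import subprocess
import sys

# Placeholder worker: times an empty loop body and prints seconds per iteration.
# A real worker would call the function under test inside the loop.
WORKER = """
import time
loops = 100_000
start = time.perf_counter()
for _ in range(loops):
    pass
print((time.perf_counter() - start) / loops)
"""


def run_in_fresh_processes(n_processes=3):
    """Collect one timing sample per freshly started Python process."""
    samples = []
    for _ in range(n_processes):
        result = subprocess.run(
            [sys.executable, "-c", WORKER],
            capture_output=True,
            text=True,
            check=True,
        )
        samples.append(float(result.stdout.strip()))
    return samples


samples = run_in_fresh_processes()
mean_ns = statistics.mean(samples) * 1e9
stdev_ns = statistics.stdev(samples) * 1e9
print(f"mean {mean_ns:.1f} ns, stdev {stdev_ns:.1f} ns over {len(samples)} processes")
```

The same structure would apply to a C++ benchmark binary: launch it several times and aggregate across runs, rather than collecting every sample in a single long-lived process.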

github-actions · 2026-04-22T14:24:18Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

cuda.bindings latency benchmarks part 4

329f14b

danielfrg added this to the cuda.core v1.0.0 milestone Apr 21, 2026

danielfrg requested review from mdboom and rwgk April 21, 2026 19:50

danielfrg self-assigned this Apr 21, 2026

danielfrg added the cuda.bindings and performance labels Apr 21, 2026

rwgk reviewed Apr 21, 2026

Comment thread benchmarks/cuda_bindings/benchmarks/cpp/bench_launch.cpp Outdated

Fix C++ benchmark name launch_2048B -> launch_2048b to match Python

f5b1621

rwgk approved these changes Apr 21, 2026

danielfrg merged commit 9863ea8 into main Apr 22, 2026
115 of 140 checks passed

danielfrg deleted the cuda-bindings-bench-4 branch April 22, 2026 14:04

danielfrg merged 2 commits into main from cuda-bindings-bench-4