
cuda.bindings latency benchmarks part 4 #1959

Merged
danielfrg merged 2 commits into main from cuda-bindings-bench-4
Apr 22, 2026
@danielfrg
Contributor

Description

Follow-up to #1580.

Almost all of this was generated by AI agents.

It also migrates most of the existing pytest-benchmarks; I will finish that in the next PR.

Results on my dev station:

----------------------------------------------------------------------------------
Benchmark                                   C++ (mean)   Python (mean)    Overhead
----------------------------------------------------------------------------------
ctx_device.ctx_get_current                        8 ns          113 ns     +105 ns
ctx_device.ctx_get_device                        10 ns          115 ns     +105 ns
ctx_device.ctx_set_current                       11 ns          101 ns      +90 ns
ctx_device.device_get                             6 ns          126 ns     +120 ns
ctx_device.device_get_attribute                   8 ns          194 ns     +186 ns
event.event_create_destroy                       99 ns          302 ns     +203 ns
event.event_query                                83 ns          204 ns     +121 ns
event.event_record                               92 ns          220 ns     +128 ns
event.event_synchronize                          97 ns          224 ns     +127 ns
launch.launch_16_args                          1.57 us         3.08 us    +1511 ns
launch.launch_16_args_pre_packed               1.57 us         1.97 us     +399 ns
launch.launch_2048B                            1.58 us         2.46 us     +877 ns
launch.launch_256_args                         2.35 us        16.57 us   +14221 ns
launch.launch_512_args                         3.25 us        31.50 us   +28245 ns
launch.launch_512_args_pre_packed              3.24 us         3.75 us     +503 ns
launch.launch_512_bools                        3.30 us        58.02 us   +54715 ns
launch.launch_512_bytes                        3.30 us        60.30 us   +56997 ns
launch.launch_512_doubles                      3.25 us        86.70 us   +83457 ns
launch.launch_512_ints                         3.12 us        62.23 us   +59110 ns
launch.launch_512_longlongs                    3.23 us        65.29 us   +62061 ns
launch.launch_empty_kernel                     1.51 us         1.85 us     +334 ns
launch.launch_small_kernel                     1.51 us         2.21 us     +700 ns
memory.mem_alloc_async_free_async               390 ns          731 ns     +342 ns
memory.mem_alloc_free                          1.61 us         1.98 us     +368 ns
memory.memcpy_dtod                             2.09 us         2.26 us     +168 ns
memory.memcpy_dtoh                             5.00 us         5.37 us     +369 ns
memory.memcpy_htod                             3.93 us         3.96 us      +26 ns
module.func_get_attribute                        14 ns          213 ns     +199 ns
module.module_get_function                       32 ns          180 ns     +148 ns
module.module_load_unload                      7.58 us         8.21 us     +636 ns
nvrtc.nvrtc_compile_program                 7239.55 us      7254.48 us   +14929 ns
nvrtc.nvrtc_create_program                       69 ns          681 ns     +612 ns
nvrtc.nvrtc_create_program_100_headers        11.42 us        13.13 us    +1710 ns
pointer_attributes.pointer_get_attribute         27 ns          487 ns     +460 ns
stream.stream_create_destroy                   3.16 us         3.46 us     +306 ns
stream.stream_query                              88 ns          224 ns     +136 ns
stream.stream_synchronize                       114 ns          243 ns     +129 ns
----------------------------------------------------------------------------------

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@danielfrg danielfrg added this to the cuda.core v1.0.0 milestone Apr 21, 2026
@danielfrg danielfrg requested review from mdboom and rwgk April 21, 2026 19:50
@danielfrg danielfrg self-assigned this Apr 21, 2026
@danielfrg danielfrg added the cuda.bindings (Everything related to the cuda.bindings module) and performance labels Apr 21, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mdboom
Contributor

mdboom commented Apr 21, 2026

All of these look good.

I continue to be worried by the high stddev on some of the C++ benchmarks:

nvrtc.nvrtc_compile_program: Mean +- std dev: 5999775 ns +- 207702 ns
stream.stream_create_destroy: Mean +- std dev: 3207 ns +- 495 ns

With stddev this high, these differences don't really tell us anything:

nvrtc.nvrtc_compile_program                 5999.77 us      6002.39 us    +2611 ns
stream.stream_create_destroy                   3.21 us         3.51 us     +302 ns
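
One way to make this concern concrete is to express each reported overhead in units of the C++ standard deviation. The following is a minimal sketch that uses only the numbers quoted in this comment; the helper name `overhead_in_stddevs` is hypothetical, and because the Python-side standard deviation is not reported here, the ratio ignores that second source of noise.

```python
def overhead_in_stddevs(overhead_ns: float, cpp_stddev_ns: float) -> float:
    """Express the Python-minus-C++ overhead in units of the C++ std dev."""
    return overhead_ns / cpp_stddev_ns


# (overhead in ns, C++ std dev in ns), copied from the comment above.
samples = {
    "nvrtc.nvrtc_compile_program": (2611, 207702),
    "stream.stream_create_destroy": (302, 495),
}

ratios = {
    name: overhead_in_stddevs(overhead, stddev)
    for name, (overhead, stddev) in samples.items()
}

for name, ratio in ratios.items():
    print(f"{name}: overhead is {ratio:.2f} C++ std devs")
```

Both ratios come out below 1, meaning the measured overhead is smaller than the run-to-run spread of the C++ baseline alone.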

Comment thread benchmarks/cuda_bindings/benchmarks/cpp/bench_launch.cpp Outdated
@rwgk
Contributor

rwgk commented Apr 21, 2026

> I continue to be worried by the high stddev on some of the C++ benchmarks:

For a follow-up PR, I’d suggest a calibrated "slow/data-collection" mode: instead of fixed loop counts, first estimate the repetitions needed to reach a target sample time (similar to pyperf --min-time or Google Benchmark MinTime). That’s usually a good way to reduce relative noise while still keeping a fast developer pass.
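
The calibration idea can be sketched as follows. This is an illustration in Python, not the harness used in this PR: the function names `calibrate_loops` and `measure`, the doubling strategy, and the `work` stand-in are all assumptions, and the C++ benchmarks would need the equivalent logic in C++.

```python
import time


def calibrate_loops(fn, min_time_s=0.1, max_loops=1 << 30):
    """Double the loop count until one timed sample lasts at least min_time_s."""
    loops = 1
    while loops < max_loops:
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        elapsed = time.perf_counter() - start
        if elapsed >= min_time_s:
            break
        loops *= 2
    return loops


def measure(fn, loops, samples=5):
    """Return the mean per-call time in seconds for each of `samples` samples."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        results.append((time.perf_counter() - start) / loops)
    return results


def work():
    # Stand-in for the call under test (hypothetical; not a CUDA call).
    sum(range(100))


# A short target keeps this example fast; a data-collection mode
# would presumably use a longer one.
loops = calibrate_loops(work, min_time_s=0.01)
times = measure(work, loops)
print(f"loops per sample: {loops}, fastest sample: {min(times) * 1e9:.0f} ns/call")
```

With this shape, a fast developer pass and a slow data-collection pass differ only in the `min_time_s` target rather than in hard-coded loop counts.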

@danielfrg
Contributor Author

danielfrg commented Apr 21, 2026

Yes, agreed, we should take that on next. We are pretty much done with the P0s and P1s I wanted to target for bindings, so I'll work on improving that collection in the next PR.

@danielfrg
Contributor Author

/ok to test f5b1621

@rwgk
Contributor

rwgk commented Apr 21, 2026

(Linux runners appear to have a major outage.)

@mdboom
Contributor

mdboom commented Apr 22, 2026

> For a follow-up PR, I’d suggest a calibrated "slow/data-collection" mode: instead of fixed loop counts, first estimate the repetitions needed to reach a target sample time (similar to pyperf --min-time or Google Benchmark MinTime). That’s usually a good way to reduce relative noise while still keeping a fast developer pass.

Yes, that may be the issue. The way the data looks (random, rather than converging) also suggests it may be due to reusing the same process continuously. (The Python benchmarks start a number of fresh processes).
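
The fresh-process approach can be illustrated with a short sketch. This is an assumption about how such a runner could look, not the code in this PR: the `WORKER` snippet times a stand-in workload rather than a CUDA call, and `run_in_fresh_processes` is a hypothetical helper. For the C++ benchmarks, the analogous step would be launching the benchmark binary several times instead of looping inside one process.

```python
import statistics
import subprocess
import sys

# Hypothetical worker: times a stand-in workload and prints the mean ns per call.
WORKER = """
import time
loops = 10000
start = time.perf_counter_ns()
for _ in range(loops):
    sum(range(100))
print((time.perf_counter_ns() - start) / loops)
"""


def run_in_fresh_processes(n_processes=3):
    """Run the worker in separate interpreters and collect one mean per process."""
    means = []
    for _ in range(n_processes):
        completed = subprocess.run(
            [sys.executable, "-c", WORKER],
            capture_output=True,
            text=True,
            check=True,
        )
        means.append(float(completed.stdout.strip()))
    return means


means = run_in_fresh_processes()
print(
    f"mean of per-process means: {statistics.mean(means):.1f} ns, "
    f"stdev across processes: {statistics.stdev(means):.1f} ns"
)
```

Comparing the spread across processes with the spread inside a single process is one way to check whether process reuse is what makes the C++ numbers look random rather than converging.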

@danielfrg danielfrg merged commit 9863ea8 into main Apr 22, 2026
115 of 140 checks passed
@danielfrg danielfrg deleted the cuda-bindings-bench-4 branch April 22, 2026 14:04
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

Labels

cuda.bindings (Everything related to the cuda.bindings module), performance

3 participants