My query on the Nvidia forums brought me here.
Here are data points from a GH200 and an H100-SXM5. I'm not sure the measurement is reliable, though, since the p50 fault latency shows no significant difference between the PCIe and hardware-coherent configurations. The p99 values highlight the difference better, but p99 is not the common-case behavior.
Would it be possible to infer the number of page faults from the CUPTI counters to verify that the access patterns are consistent with expectation?
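To illustrate why I suspect p50 is not telling me much, here is a minimal sketch with made-up numbers (not measurements). It assumes only a small fraction of timed accesses actually take a fault, which is exactly what I would like to confirm with a fault count. `HIT_NS`, `FAULT_NS`, `N`, and `FAULTS` are hypothetical values chosen for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical numbers, not measurements: 2% of the timed accesses take a fault.
HIT_NS = 12.6        # latency of an access to an already-resident page
FAULT_NS = 50_000.0  # latency of an access that triggers a page fault
N = 1000             # total timed accesses
FAULTS = 20          # accesses that fault

samples = [FAULT_NS] * FAULTS + [HIT_NS] * (N - FAULTS)

p50 = percentile(samples, 50)
p99 = percentile(samples, 99)
print(f"p50 = {p50} ns, p99 = {p99} ns")
```

With this mix the median stays at the hit latency, and only the tail percentile sees the faults. If the real fault fraction is well below 50%, p50 would look the same on both platforms regardless of how expensive a fault is.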
| Metric | gh200.txt | h100 sxm.txt |
|---|---|---|
| GPU | NVIDIA GH200 120GB | NVIDIA H100 80GB HBM3 |
| Platform | HARDWARE_COHERENT_UMA | DISCRETE_PCIE |
| Coherent | yes, hardware | no |
| Atomic GPU scope p50 | 14.6 ns, 29 cycles | 11.1 ns, 22 cycles |
| Atomic GPU scope p99 | 20.7 ns | 13.6 ns |
| Atomic SYS scope p50 | 15.2 ns, 30 cycles | 11.1 ns, 22 cycles |
| Atomic SYS scope p99 | 21.2 ns | 13.6 ns |
| Atomic contention p50 | 14.6 ns, 29 cycles | 11.1 ns, 22 cycles |
| Atomic contention p99 | 21.2 ns | 13.6 ns |
| Atomic SYS to GPU p50 ratio | 1.03x | 1.00x |
| Atomic SYS to GPU p99 ratio | 1.02x | 1.00x |
| Coherence cost | 0.5 ns overhead | 0.0 ns overhead |
| Fault COLD p50 | 12.6 ns, 25 cycles | 13.1 ns, 26 cycles |
| Fault COLD p99 | 485.4 ns | 52335.4 ns |
| Fault WARM p50 | 12.6 ns, 25 cycles | 12.1 ns, 24 cycles |
| Fault WARM p99 | 28.3 ns | 27.8 ns |
| Fault PRESSURE p50 | 12.6 ns, 25 cycles | 12.6 ns, 25 cycles |
| Fault PRESSURE p99 | 181.8 ns | 33068.2 ns |
| Fault COLD to WARM p50 ratio | 1.00x | 1.08x |
| Fault COLD to WARM p99 ratio | 17.15x | 1882.57x |
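
As a sanity check on the table, the cold-to-warm fault ratios can be recomputed from the raw latencies. This is just a small sketch: the dictionary layout and the `cold_to_warm` name are mine, and the values are copied from the table above.

```python
# Fault latencies in ns as (p50, p99), copied from the table above.
fault_ns = {
    "gh200": {"cold": (12.6, 485.4), "warm": (12.6, 28.3)},
    "h100_sxm": {"cold": (13.1, 52335.4), "warm": (12.1, 27.8)},
}

def cold_to_warm(platform):
    """Return the (p50, p99) cold-to-warm latency ratios, rounded to 2 decimals."""
    cold_p50, cold_p99 = fault_ns[platform]["cold"]
    warm_p50, warm_p99 = fault_ns[platform]["warm"]
    return round(cold_p50 / warm_p50, 2), round(cold_p99 / warm_p99, 2)

for name in fault_ns:
    p50_ratio, p99_ratio = cold_to_warm(name)
    print(f"{name}: p50 {p50_ratio:.2f}x, p99 {p99_ratio:.2f}x")
```

The recomputed values match the reported ratio rows, so the ratios themselves are consistent with the raw numbers. The open question is only whether the underlying samples contain as many faults as expected.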