What's the cost of the UMA fault latency?

My query on the [NVIDIA forums](https://forums.developer.nvidia.com/t/interpreting-migration-cause-for-page-faults/362356) brought me here.

Here are the data points from the [GH200](https://github.com/user-attachments/files/27657829/gh200.txt) and [H100-SXM5](https://github.com/user-attachments/files/27657832/h100-sxm.txt) runs. I'm not sure how reliable this measurement is, because the p50 fault latency shows no significant difference between the PCIe and hardware-coherent configurations. The p99 values highlight the difference much better, but they do not represent common-case behavior.
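To illustrate why I suspect the p50 may simply not be seeing faults at all, here is a minimal sketch with synthetic numbers. The 2% fault fraction and the 50000 ns fault cost are assumptions made up for illustration, not values taken from the benchmark; only the 12.6 ns baseline is borrowed from the results. If only a small share of timed accesses actually fault, the median stays at the non-faulting latency and only the tail moves:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile (q in 0..100) of a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]

# Hypothetical workload: 1000 timed accesses, 2% of which take a page fault.
HIT_NS = 12.6        # baseline access latency (same magnitude as the p50 rows)
FAULT_NS = 50000.0   # assumed cost of one fault, for illustration only

no_faults = [HIT_NS] * 1000
with_faults = [HIT_NS] * 980 + [FAULT_NS] * 20

for name, samples in (("no faults", no_faults), ("2% faults", with_faults)):
    print(name, "p50 =", percentile(samples, 50), "p99 =", percentile(samples, 99))
```

In this toy case the p50 is identical for both sample sets and only the p99 differs, which is the same shape as the results I'm seeing.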

Would it be possible to infer the number of page faults from the CUPTI counters to verify that the access patterns are consistent with expectation?

| Metric | gh200.txt | h100-sxm.txt |
|---|---:|---:|
| GPU | NVIDIA GH200 120GB | NVIDIA H100 80GB HBM3 |
| Platform | HARDWARE_COHERENT_UMA | DISCRETE_PCIE |
| Coherent | yes, hardware | no |
| Atomic GPU scope p50 | 14.6 ns, 29 cycles | 11.1 ns, 22 cycles |
| Atomic GPU scope p99 | 20.7 ns | 13.6 ns |
| Atomic SYS scope p50 | 15.2 ns, 30 cycles | 11.1 ns, 22 cycles |
| Atomic SYS scope p99 | 21.2 ns | 13.6 ns |
| Atomic contention p50 | 14.6 ns, 29 cycles | 11.1 ns, 22 cycles |
| Atomic contention p99 | 21.2 ns | 13.6 ns |
| **Atomic SYS to GPU p50 ratio** | **1.03x** | **1.00x** |
| **Atomic SYS to GPU p99 ratio** | **1.02x** | **1.00x** |
| Coherence cost | 0.5 ns overhead | 0.0 ns overhead |
| Fault COLD p50 | 12.6 ns, 25 cycles | 13.1 ns, 26 cycles |
| Fault COLD p99 | 485.4 ns | 52335.4 ns |
| Fault WARM p50 | 12.6 ns, 25 cycles | 12.1 ns, 24 cycles |
| Fault WARM p99 | 28.3 ns | 27.8 ns |
| Fault PRESSURE p50 | 12.6 ns, 25 cycles | 12.6 ns, 25 cycles |
| Fault PRESSURE p99 | 181.8 ns | 33068.2 ns |
| **Fault COLD to WARM p50 ratio** | **1.00x** | **1.08x** |
| **Fault COLD to WARM p99 ratio** | **17.15x** | **1882.57x** |
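As a sanity check on the ratio rows, this short sketch recomputes the COLD to WARM p99 ratios from the p99 values in the table above. The dictionary layout and the helper name `cold_to_warm` are mine, not part of the benchmark output:

```python
# p99 fault latencies in ns, copied from the table above.
fault_p99_ns = {
    "gh200": {"cold": 485.4, "warm": 28.3, "pressure": 181.8},
    "h100": {"cold": 52335.4, "warm": 27.8, "pressure": 33068.2},
}

def cold_to_warm(platform):
    """COLD p99 divided by WARM p99, rounded to two decimals like the table."""
    row = fault_p99_ns[platform]
    return round(row["cold"] / row["warm"], 2)

for platform in fault_p99_ns:
    print(platform, "COLD/WARM p99 ratio:", cold_to_warm(platform))
```

Both values match the reported 17.15x and 1882.57x, so the ratio rows are consistent with the raw p99 numbers.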
