 # Overview
 # --------
 #
-# When you capture operations into a CUDA graph, the profiler shows all
-# kernels executing on a single stream. This makes it hard to distinguish
-# between different logical components of your model (e.g., attention vs MLP).
-#
-# Kernel annotations solve this by:
-#
-# 1. **Marking** kernels during graph capture with semantic labels
-# 2. **Profiling** the graph replay to collect execution traces
-# 3. **Post-processing** traces to merge annotations and create custom lanes
-#
-# The result is a trace where kernels are organized into meaningful groups,
-# making it much easier to identify performance bottlenecks.
+# CUDA graph kernel annotations allow you to add semantic labels to kernels
+# during graph capture. These labels help you understand what each kernel does
+# when profiling, making it easy to identify which parts of your model (e.g.,
+# attention, MLP, normalization) are executing at any given time.
+#
+# Without annotations, profiler traces show all kernels on a single stream with
+# auto-generated names, making it difficult to understand the logical structure
+# of your computation. With annotations, you can:
+#
+# 1. **Label kernel groups** with meaningful names during capture
+# 2. **Assign custom stream IDs** for visual organization
+# 3. **Merge labels into profiler traces** for semantic visualization
+#
+# The result is a profiler trace where kernels are labeled and organized by
+# their function, making it much easier to identify performance bottlenecks
+# and understand execution flow.
 #
 # .. image:: /_static/img/cuda_graph_annotations_before_after.png
 #
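
The merge step that the new overview lists third can be sketched without any GPU or profiler dependency. The snippet below is a minimal illustration only, not the tutorial's or PyTorch's actual API: `merge_annotations`, the `annotations` record layout (`label`, `stream_id`, `num_kernels`), and the sample events are all hypothetical. It assumes kernel events in the Chrome trace event style (`name`, `ts`, `dur`, `tid`) and assumes that annotations were recorded in capture order, so that kernels sorted by start time line up with them.

```python
from copy import deepcopy


def merge_annotations(kernel_events, annotations):
    """Attach labels and custom lane ids to kernel events (hypothetical helper).

    kernel_events: list of Chrome-trace-style dicts with "name", "ts", "dur", "tid".
    annotations:   list of dicts in capture order, each with a "label", a custom
                   "stream_id" (used as the trace lane), and "num_kernels".
    Returns a new list sorted by start time; the input is left untouched.
    """
    ordered = sorted(deepcopy(kernel_events), key=lambda e: e["ts"])
    cursor = 0
    for ann in annotations:
        # Assumption: the next `num_kernels` kernels in replay order belong
        # to this annotation, mirroring the order seen during capture.
        for event in ordered[cursor:cursor + ann["num_kernels"]]:
            event.setdefault("args", {})["annotation"] = ann["label"]
            event["tid"] = ann["stream_id"]  # move the kernel to its own lane
        cursor += ann["num_kernels"]
    return ordered


# Made-up sample data: three kernels that all ran on the same stream (tid 7).
kernel_events = [
    {"name": "kernel_b", "ph": "X", "ts": 10, "dur": 4, "pid": 0, "tid": 7},
    {"name": "kernel_a", "ph": "X", "ts": 0, "dur": 5, "pid": 0, "tid": 7},
    {"name": "kernel_c", "ph": "X", "ts": 20, "dur": 8, "pid": 0, "tid": 7},
]
annotations = [
    {"label": "attention", "stream_id": 100, "num_kernels": 2},
    {"label": "mlp", "stream_id": 101, "num_kernels": 1},
]

merged = merge_annotations(kernel_events, annotations)
for event in merged:
    print(event["name"], event["args"]["annotation"], event["tid"])
```

Matching by order is the simplest possible strategy and is chosen here only for brevity; a real implementation would need a more robust key for pairing kernels with their annotations.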