
Commit db1c55d

Refocus overview on annotation capabilities
Changed the overview to emphasize:

- Ability to add semantic labels to kernels
- Understanding what each kernel does during profiling
- Labeling and organizing kernels by function

Rather than focusing on splitting kernels across streams, the overview now centers on the annotation feature itself.
1 parent 3ef9d30

1 file changed

Lines changed: 16 additions & 12 deletions


advanced_source/cuda_graph_annotations_tutorial.py

@@ -40,18 +40,22 @@
 # Overview
 # --------
 #
-# When you capture operations into a CUDA graph, the profiler shows all
-# kernels executing on a single stream. This makes it hard to distinguish
-# between different logical components of your model (e.g., attention vs MLP).
-#
-# Kernel annotations solve this by:
-#
-# 1. **Marking** kernels during graph capture with semantic labels
-# 2. **Profiling** the graph replay to collect execution traces
-# 3. **Post-processing** traces to merge annotations and create custom lanes
-#
-# The result is a trace where kernels are organized into meaningful groups,
-# making it much easier to identify performance bottlenecks.
+# CUDA graph kernel annotations allow you to add semantic labels to kernels
+# during graph capture. These labels help you understand what each kernel does
+# when profiling, making it easy to identify which parts of your model (e.g.,
+# attention, MLP, normalization) are executing at any given time.
+#
+# Without annotations, profiler traces show all kernels on a single stream with
+# auto-generated names, making it difficult to understand the logical structure
+# of your computation. With annotations, you can:
+#
+# 1. **Label kernel groups** with meaningful names during capture
+# 2. **Assign custom stream IDs** for visual organization
+# 3. **Merge labels into profiler traces** for semantic visualization
+#
+# The result is a profiler trace where kernels are labeled and organized by
+# their function, making it much easier to identify performance bottlenecks
+# and understand execution flow.
 #
 # .. image:: /_static/img/cuda_graph_annotations_before_after.png
 #
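The third step in the new overview, merging labels into profiler traces, is essentially trace post-processing. Below is a minimal sketch under several assumptions. The trace is taken to be in the Chrome Trace Event format (for example, as written by `torch.profiler.profile.export_chrome_trace`), GPU kernel events are assumed to carry `"cat": "kernel"`, and the captured kernels are assumed to replay on one stream in capture order. The helper `relabel_trace`, its `spans` argument, and the `base_tid` lane numbering are hypothetical illustrations, not the tutorial's API. Each label gets its own lane (`tid`), which loosely mirrors the "custom stream IDs" step, and `thread_name` metadata events give those lanes readable names in the viewer.

```python
from typing import Dict, List, Tuple


def relabel_trace(
    trace: Dict, spans: List[Tuple[str, int]], base_tid: int = 1000
) -> Dict:
    """Attach semantic labels and custom lanes to GPU kernel events.

    Hypothetical helper: ``spans`` lists (label, kernel_count) pairs in the
    order the kernels were captured, e.g. [("attention", 2), ("mlp", 1)].
    Assumes the graph's kernels replay on one stream in capture order.
    """
    events = trace["traceEvents"]
    # Complete ("X") kernel events in chronological order; the "kernel"
    # category name is an assumption about the exporter.
    kernels = sorted(
        (e for e in events if e.get("cat") == "kernel" and e.get("ph") == "X"),
        key=lambda e: e["ts"],
    )
    expected = sum(count for _, count in spans)
    if len(kernels) != expected:
        raise ValueError(f"expected {expected} kernels, found {len(kernels)}")

    lane_for_label: Dict[str, int] = {}
    kernel_iter = iter(kernels)
    for label, count in spans:
        # Each distinct label gets its own lane (tid), numbered from base_tid.
        tid = lane_for_label.setdefault(label, base_tid + len(lane_for_label))
        for _ in range(count):
            event = next(kernel_iter)
            event.setdefault("args", {})["annotation"] = label
            event["tid"] = tid

    # "thread_name" metadata events name each custom lane in the viewer.
    pid = kernels[0]["pid"] if kernels else 0
    for label, tid in lane_for_label.items():
        events.append(
            {"ph": "M", "name": "thread_name", "pid": pid, "tid": tid,
             "args": {"name": label}}
        )
    return trace
```

To use it, load the exported file with `json.load`, call `relabel_trace(trace, [("attention", n_attn), ("mlp", n_mlp)])`, write the result back with `json.dump`, and open it in Perfetto or `chrome://tracing`. The sketch matches kernels by capture order rather than by kernel name on purpose: the same kernel (a GEMM, for instance) can appear in several model components, so its name alone cannot tell you which component launched it.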
