Modern GPUs are no longer "just" wide SIMD machines. They are increasingly asynchronous systems with {{< highlight-text >}}**heterogeneous resources**{{< /highlight-text >}} that each operate on their own timelines: tensor cores run independently, memory pipelines have their own queues, and async copy engines allow data movement to proceed concurrently with computation. Programming should adapt to this asynchronous style --- and the performance rewards for doing so are real.
@@ -37,7 +37,7 @@ This coupling amplifies complexity. Performance features like prefetching, pipel
> We adopt the key principles by which [software systems](https://en.wikipedia.org/wiki/Actor_model) control the complexity of asynchrony: **resource/state isolation** and **asynchrony through message passing**, and rebuild GPU SMs as **decoupled cores**.
In the VDCores model, virtual cores are the unit of execution and composition. Instead of a single monolithic kernel, execution is decomposed into independent instruction streams executed by loosely coupled cores.
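To make the idea of loosely coupled cores concrete, here is a minimal host-side sketch in Python. It is not VDCores code: `memory_core`, `compute_core`, and `run_pipeline` are hypothetical names, and threads with queues only stand in for independent instruction streams. Each "core" keeps its own state and interacts with the other purely through messages.

```python
import queue
import threading

def memory_core(tiles, outbox):
    """Hypothetical memory core: streams tiles to a consumer, then signals the end."""
    for tile in tiles:
        outbox.put(tile)   # messages are the only way to reach another core
    outbox.put(None)       # sentinel: no more tiles

def compute_core(inbox, outbox):
    """Hypothetical compute core: owns its accumulator and reacts to messages."""
    acc = 0
    while True:
        tile = inbox.get()
        if tile is None:
            break
        acc += sum(tile)
    outbox.put(acc)

def run_pipeline(tiles):
    channel = queue.Queue(maxsize=2)  # bounded queue models a small staging buffer
    done = queue.Queue()
    cores = [
        threading.Thread(target=memory_core, args=(tiles, channel)),
        threading.Thread(target=compute_core, args=(channel, done)),
    ]
    for core in cores:
        core.start()
    for core in cores:
        core.join()
    return done.get()
```

Because neither function touches the other's state, either one could be swapped for a different implementation without changing its peer, which is the composition property the text describes.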
@@ -57,13 +57,13 @@ We build VDCores by composing only 5 basic compute instructions and 23 memory/co
VDCores do not get this edge from hand-tuning better kernels, but from a decoupled runtime and a flexible programming interface. We illustrate this with two examples.
### Example 1: **Free** "Prefetch" Non-Dependent Memory Buffers
Consider an attention kernel followed by a linear projection with residual addition. In VDCores, we connect them through dependencies rather than by manual fusing or staging (note that VDCores has no notion of a kernel boundary; we mark the original kernel boundaries in the example only for readability):
-
{{< dae/example1 >}}
+
{{< vdcores/example1 >}}
This is the key shift: {{< highlight-text >}}overlap **emerges automatically**{{< /highlight-text >}} from runtime dependency resolution, without programmers splitting code into explicit "prefetch" stages or manually fusing kernels to force concurrency.
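The effect can be illustrated with a toy scheduler, sketched here in Python. Everything in it is an assumption made for illustration: the task names, the two resources (`mem` and `compute`), and the durations are invented, and the real runtime is certainly more involved. The sketch issues each task as soon as its dependencies have finished and its resource is free.

```python
from collections import namedtuple

Task = namedtuple("Task", "name resource duration deps")

# Hypothetical task graph for attention -> projection -> residual add.
# Durations are made-up time units, not measurements.
TASKS = [
    Task("load_qkv", "mem", 2, []),
    Task("attention", "compute", 4, ["load_qkv"]),
    Task("load_w_proj", "mem", 3, []),  # does not depend on attention
    Task("projection", "compute", 2, ["attention", "load_w_proj"]),
    Task("residual_add", "compute", 1, ["projection"]),
]

def schedule(tasks):
    """Start each task once its deps are done and its resource is free.

    `tasks` must be listed in a valid dependency order.
    """
    resource_free = {}
    start, finish = {}, {}
    for t in tasks:
        ready = max((finish[d] for d in t.deps), default=0)
        begin = max(ready, resource_free.get(t.resource, 0))
        start[t.name] = begin
        finish[t.name] = begin + t.duration
        resource_free[t.resource] = finish[t.name]
    return start, finish

start, finish = schedule(TASKS)
makespan = max(finish.values())
serialized = sum(t.duration for t in TASKS)
```

With these invented numbers, `load_w_proj` starts at time 2 while `attention` is still running, so the makespan is 9 units instead of the 12 needed to run every task back to back. No task was labeled as a prefetch; the overlap follows from the dependency list alone.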
@@ -73,7 +73,7 @@ Another secret sauce of VDCores is the **composability** of its components. Same
Consider an MLP block: GEMV (Up + Gate) followed by SiLU activation and GEMV (Down). Input is shape [1, 4096], Up and Gate outputs are [1, 12288], and Down output is [1, 4096]. We can tile Gate and Up along the N dimension and Down along the K dimension.
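The tiling can be checked with a small pure-Python sketch. It assumes the common gated form `silu(gate) * up`, which the text does not spell out, and it is meant to be run on tiny shapes rather than the real [1, 4096] and [1, 12288]. All function names are hypothetical. The point is that an N-tile of the Gate/Up output is exactly a K-tile of the Down reduction, so each tile can flow through the whole block and contribute a partial sum.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def gemv(x, w, cols):
    """Row vector x (length K) times the selected columns of w (K x N)."""
    return [sum(x[k] * w[k][n] for k in range(len(x))) for n in cols]

def mlp_reference(x, w_gate, w_up, w_down):
    """Untiled MLP block, assuming y = (silu(x @ Wg) * (x @ Wu)) @ Wd."""
    n_all = range(len(w_gate[0]))
    gate = gemv(x, w_gate, n_all)
    up = gemv(x, w_up, n_all)
    h = [silu(g) * u for g, u in zip(gate, up)]
    return gemv(h, w_down, range(len(w_down[0])))

def mlp_tiled(x, w_gate, w_up, w_down, tile):
    """Tile Gate/Up along N; feed each tile to Down as a K-tile."""
    inter = len(w_gate[0])
    y = [0.0] * len(w_down[0])
    for lo in range(0, inter, tile):
        cols = range(lo, min(lo + tile, inter))
        gate = gemv(x, w_gate, cols)   # N-tile of Gate
        up = gemv(x, w_up, cols)       # N-tile of Up
        h = [silu(g) * u for g, u in zip(gate, up)]
        # The same index range is a K-tile of Down: accumulate partial sums.
        for j in range(len(y)):
            y[j] += sum(h[i] * w_down[c][j] for i, c in enumerate(cols))
    return y
```

Both functions should agree up to floating-point rounding for any tile size, which is what leaves a scheduler free to order or interleave the tiles in different ways.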
-
<img src="../../images/dae/example2.jpg" alt="flexible core composition: two schedules" style="width: 100%;" />
+
<img src="../../images/vdcores/example2.jpg" alt="flexible core composition: two schedules" style="width: 100%;" />
**Schedule 1** executes the operations in order and fuses SiLU with Up—straightforward and amenable to kernel fusion for optimization.