Commit 6cbbd7c

committed
fix vdcores path
1 parent 61851fb commit 6cbbd7c

2 files changed

Lines changed: 10 additions & 10 deletions

content/posts/vdcores.md

Lines changed: 5 additions & 5 deletions
@@ -23,7 +23,7 @@ In this post, we will cover:

## 1. GPUs Are Becoming Asynchronous, Kernel Programming Is Becoming Messy

-<img src="../../images/dae/simd_vs_decouple.png" alt="comparison" />
+<img src="../../images/vdcores/simd_vs_decouple.png" alt="comparison" />

Modern GPUs are no longer "just" wide SIMD machines. They are increasingly asynchronous systems with {{< highlight-text >}}**heterogeneous resources**{{< /highlight-text >}} that each operate on their own timelines: tensor cores run independently, memory pipelines have their own queues, and async copy engines allow data movement to proceed concurrently with computation. Programming should adapt to this asynchronous style --- and the performance rewards for doing so are real.

@@ -37,7 +37,7 @@ This coupling amplifies complexity. Performance features like prefetching, pipel

> We adopt the key principle of how [software systems](https://en.wikipedia.org/wiki/Actor_model) controls the complexity of asynchonous: **Resource/state isolation** and **asynchronous through message passing**, and rebuild GPU SMs to **decoupled cores**.

-<img src="../../images/dae/rt-overview.jpg" alt="runtime" />
+<img src="../../images/vdcores/rt-overview.jpg" alt="runtime" />

In the VDCores model, virtual cores are the unit of execution and composition. Instead of a single monolithic kernel, execution is decomposed into independent instruction streams executed by loosely coupled cores.

@@ -57,13 +57,13 @@ We build VDCores by composing only 5 basic compute instructions and 23 memory/co

VDCores do not get this edge by hand-tunning better kernels, but instead through decouopled runtime and flexbile programming interface. We illustrate this with two exmples in this process.

-<img src="../../images/dae/performance.png" style="width: 55%;display: flex;justify-content: center;" alt="QWen-8B Performance" />
+<img src="../../images/vdcores/performance.png" style="width: 55%;display: flex;justify-content: center;" alt="QWen-8B Performance" />

### Example 1: **Free** "Prefetch" Non-Dependent Memory Buffers

Consider an attention kernel followed by a linear projection with residual addition. In VDCores, we connect them by dependencies rather than manually fusing/staging: (Also note that in VDCores we do not have the notion of kernel boundary; we mark the original kernel boundary in the example for easy to understand.)

-{{< dae/example1 >}}
+{{< vdcores/example1 >}}

This is the key shift: {{< highlight-text >}} Overlap **emerges automatically**{{< /highlight-text >}} from runtime dependency resolving, without humans splitting code into explicit "prefetch" stages or manually fusing kernels to force concurrency.

@@ -73,7 +73,7 @@ Another secret sause of VDCores is the **composbility** of it's components. Same

Consider a MLP block: GEMV (Up + Gate) followed by SiLU activation and GEMV (Down). Input is shape [1, 4096], Up and Gate outputs are [1, 12288], and Down output is [1, 4096]. We can tile Gate and Up along the N dimension and Down along the K dimension.

-<img src="../../images/dae/example2.jpg" alt="flexible core composition: two schedules" style="width: 100%;" />
+<img src="../../images/vdcores/example2.jpg" alt="flexible core composition: two schedules" style="width: 100%;" />

**Schedule 1** executes the operations in order and fuses SiLU with Up—straightforward and amenable to kernel fusion for optimization.

Lines changed: 5 additions & 5 deletions
@@ -18,11 +18,11 @@
</div>
</div>
<div style="flex: 6;">
-<img class="step-img" data-gallery="gallery-ex1" data-step="0" src="{{ "images/dae/example1-1.jpg" | relURL }}" alt="overview: two kernels with color-coded dependencies" style="width: 100%; display: none;" />
-<img class="step-img" data-gallery="gallery-ex1" data-step="1" src="{{ "images/dae/example1-2.jpg" | relURL }}" alt="step 1: compute core waits for buffer 0 and 1" style="width: 100%; display: none;" />
-<img class="step-img" data-gallery="gallery-ex1" data-step="2" src="{{ "images/dae/example1-3.jpg" | relURL }}" alt="step 2: prefetch weights during kernel 0" style="width: 100%; display: none;" />
-<img class="step-img" data-gallery="gallery-ex1" data-step="3" src="{{ "images/dae/example1-4.jpg" | relURL }}" alt="step 3: proceed with matrix when buffer 3 ready" style="width: 100%; display: none;" />
-<img class="step-img" data-gallery="gallery-ex1" data-step="4" src="{{ "images/dae/example1-5.jpg" | relURL }}" alt="step 4: prefetch residual before vec_add" style="width: 100%; display: none;" />
+<img class="step-img" data-gallery="gallery-ex1" data-step="0" src="{{ "images/vdcores/example1-1.jpg" | relURL }}" alt="overview: two kernels with color-coded dependencies" style="width: 100%; display: none;" />
+<img class="step-img" data-gallery="gallery-ex1" data-step="1" src="{{ "images/vdcores/example1-2.jpg" | relURL }}" alt="step 1: compute core waits for buffer 0 and 1" style="width: 100%; display: none;" />
+<img class="step-img" data-gallery="gallery-ex1" data-step="2" src="{{ "images/vdcores/example1-3.jpg" | relURL }}" alt="step 2: prefetch weights during kernel 0" style="width: 100%; display: none;" />
+<img class="step-img" data-gallery="gallery-ex1" data-step="3" src="{{ "images/vdcores/example1-4.jpg" | relURL }}" alt="step 3: proceed with matrix when buffer 3 ready" style="width: 100%; display: none;" />
+<img class="step-img" data-gallery="gallery-ex1" data-step="4" src="{{ "images/vdcores/example1-5.jpg" | relURL }}" alt="step 4: prefetch residual before vec_add" style="width: 100%; display: none;" />
<div style="display: flex; align-items: center; justify-content: center; gap: 1rem; margin-top: 0.4rem;">
<span class="step-prev" data-gallery="gallery-ex1" style="cursor: pointer; color: #268BD2; font-size: 1rem; user-select: none; padding: 0.2rem 0.5rem; border: 1px solid #268BD2; border-radius: 4px; transition: background 0.2s;" onmouseover="this.style.background='#268BD2';this.style.color='#fff'" onmouseout="this.style.background='transparent';this.style.color='#268BD2'">&#8592; prev</span>
<span class="step-counter" data-gallery="gallery-ex1" style="font-size: 0.8rem; color: #888;">1 / 5</span>
