
Commit 61851fb

committed
Merge branch 'main' of github.com:WukLab/wuklab_sysml
2 parents 41e7b69 + 837564c commit 61851fb

File tree: 16 files changed, +202 -0 lines changed

assets/css/custom.css

Lines changed: 6 additions & 0 deletions
@@ -1 +1,7 @@
/* Empty css file that users can override in their own /assets/css/custom.css file */

.highlight-text,
.highlight-text strong {
  color: #268BD2;
  font-weight: 600;
}

content/posts/vdcores.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
---
title: "VDCores: A Runtime for Modern Async GPUs"
date: 2026-02-18
draft: false
hideToc: false
tags: ["GPU Programming", "Warp Specialization", "Compiler"]
truncated: false
summary: "
Modern GPUs increasingly expose asynchronous execution engines, yet today's kernels must still linearize memory movement, computation, and control into a single SIMT program. **Virtual Decoupled Cores (VDCores)** decouples memory, compute, and control, reconnecting them only through explicit dependencies. VDCores virtualizes warps into software-defined memory/compute cores that communicate via queues/ports, enabling the runtime and compiler to safely schedule overlap as emergent behavior rather than hand-tuned tricks. VDCores reduces kernel code by **~67%**, enables **~90%** kernel reuse across variants, and delivers **~12%** performance gains over existing solutions.
<br/><br/>
[Read More...](https://mlsys.wuklab.io/posts/VDCores/)
"
---

Authors: Zhiyuan Guo, Zijian He, Adrian Sampson, and Yiying Zhang

Modern GPUs increasingly expose asynchronous execution engines, but today's kernels still linearize memory movement, computation, and control into a single SIMT program, making overlap and composability brittle and architecture-specific. **Virtual Decoupled Cores (VDCores)** decouples memory, compute, and control, reconnecting them only through explicit dependencies. VDCores reduces kernel code by {{< highlight-text >}}~67%{{< /highlight-text >}}, enables kernel reuse across variants, and delivers {{< highlight-text >}}~12%{{< /highlight-text >}} performance gains compared to existing solutions.

In this post, we will cover:

1. **Why GPU programming needs a new model**: GPU resources are increasingly heterogeneous and asynchronous; programming must adapt, but doing so under the current model adds significant complexity.
2. **The principle and practice of decoupling**: We introduce decoupled cores, a new programming model that untangles GPU kernels into state-isolated, asynchronous execution units, and show how this enables composability (without performance overhead).
3. **What flexibility enables**: Beyond faster and simpler programming, what new system-level resource patterns and optimizations VDCores makes possible.

## 1. GPUs Are Becoming Asynchronous, Kernel Programming Is Becoming Messy

<img src="../../images/dae/simd_vs_decouple.png" alt="comparison" />

Modern GPUs are no longer "just" wide SIMD machines. They are increasingly asynchronous systems with {{< highlight-text >}}**heterogeneous resources**{{< /highlight-text >}} that each operate on their own timelines: tensor cores run independently, memory pipelines have their own queues, and async copy engines allow data movement to proceed concurrently with computation. Programming should adapt to this asynchronous style --- and the performance rewards for doing so are real.

But the prevailing GPU programming model {{< highlight-text >}}**couples memory movement, computation, and control**{{< /highlight-text >}} into a single linearized thread program. A typical hand-tuned kernel ends up as a monolithic SIMT program that manually interleaves: (1) memory management, (2) async memory movement, (3) tensor core scheduling, and (4) CUDA core SIMT operations. Every kernel is forced to simultaneously express *what data must move*, *what compute must run*, and *how to schedule and synchronize the two*.

This coupling amplifies complexity. Performance features like prefetching, pipelining, and computation-communication overlap become manual, fragile to workload and environment change, and architecture-specific responsibilities. The result: kernels that are hard to read, hard to maintain across GPU generations, and hard to compose into larger pipelines.

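To make the coupling concrete, here is a deliberately toy sketch in plain Python (standing in for a SIMT kernel; `load_tile`, `compute_tile`, and the double-buffering layout are illustrative inventions, not a real GPU API) of the hand-linearized pattern described above:

```python
# A toy model of a hand-linearized GPU kernel: the programmer manually
# interleaves (1) buffer management, (2) "async" loads, (3) compute, and
# (4) synchronization in one instruction stream.
# All names (load_tile, compute_tile, ...) are illustrative placeholders.

def load_tile(i):
    return [i] * 4          # stand-in for an async memory copy

def compute_tile(tile):
    return sum(tile)        # stand-in for tensor-core work

def fused_kernel(num_tiles):
    results = []
    buffers = [None, None]            # manual double buffering
    buffers[0] = load_tile(0)         # prologue: prefetch tile 0
    for i in range(num_tiles):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < num_tiles:
            buffers[nxt] = load_tile(i + 1)   # prefetch the next tile...
        # ...and a barrier would go here: compute must not read
        # buffers[cur] before its load completes. The scheduling logic
        # is welded to the compute logic.
        results.append(compute_tile(buffers[cur]))
    return results

print(fused_kernel(3))  # [0, 4, 8]
```

Even in this toy form, the prefetch distance, buffer count, and synchronization points are baked into the loop body; changing any of them means rewriting the kernel.
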
## 2. How to Make Programmers' Lives Easier With a Decoupled Mind

> We adopt the key principles by which [software systems](https://en.wikipedia.org/wiki/Actor_model) control the complexity of asynchrony: **resource/state isolation** and **asynchronous message passing**. With them, we rebuild GPU SMs as **decoupled cores**.

<img src="../../images/dae/rt-overview.jpg" alt="runtime" />

In the VDCores model, virtual cores are the unit of execution and composition. Instead of a single monolithic kernel, execution is decomposed into independent instruction streams executed by loosely coupled cores.

Each decoupled core specializes in a single resource type (e.g., compute or memory) and executes instructions solely based on its local state. For example, within a decoupled memory core, a load instruction can be issued as soon as its dependencies are satisfied and sufficient local shared memory is available, regardless of the state of the compute core.
When dependency information must be exchanged between decoupled cores, they communicate via asynchronous message queues. Messages do not need to be processed immediately; the receiver can handle them at any time, enabling flexible and loosely synchronized coordination between cores.

This design fully decouples memory, compute, and control, reconnecting them only through {{< highlight-text >}}**explicit dependencies**{{< /highlight-text >}}. Once dependencies are first-class, the runtime (and compiler) can safely exploit concurrency: prefetching, latency hiding, double-buffering, and overlap are no longer hand-tuned tricks but emergent behaviors that come "for free" from the decoupled execution model.

This structure makes dependencies explicit and enables safe parallelism among instructions that are not truly dependent. For example, the memory core in VDCores naturally covers the common dependency patterns we see in real kernels without requiring programmers to manually linearize them (compute cores work in a similar dependency-driven style):

1. **Read-read pairs without a dependency** can be reordered by the memory core.
2. **Read-after-write pairs without a dependency** can be overlapped: the read can execute first, and the write commits later once its dependencies are satisfied.
3. **Reads/writes under control flow:** control decisions (issue) are decoupled from memory completion, so control does not unnecessarily block data movement.

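The patterns above can be sketched with a minimal dependency-driven issue loop. This is a toy model, not the VDCores runtime: the instruction and buffer names are invented, and a real core tracks dependencies in hardware-friendly queues rather than Python sets.

```python
# Toy sketch of a dependency-driven memory core: each instruction lists
# the buffers it depends on and the buffers it produces, and the core
# issues any instruction whose dependencies are satisfied, rather than
# following program order.
from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    deps: set = field(default_factory=set)   # buffers that must be ready
    outs: set = field(default_factory=set)   # buffers this instr produces

def run_memory_core(program, ready):
    """Issue instructions as their deps become ready (program order ignored)."""
    issued, pending = [], list(program)
    while pending:
        for ins in pending:
            if ins.deps <= ready:            # all dependencies satisfied
                issued.append(ins.name)
                ready |= ins.outs            # completion publishes outputs
                pending.remove(ins)
                break
        else:
            raise RuntimeError("deadlock: unsatisfiable dependency")
    return issued

program = [
    Instr("load_B", deps={"addr_B"}, outs={"B"}),    # blocked at first
    Instr("load_A", deps=set(), outs={"A"}),         # independent read
    Instr("store_A", deps={"A"}, outs={"addr_B"}),   # unblocks load_B
]
print(run_memory_core(program, ready=set()))  # ['load_A', 'store_A', 'load_B']
```

Note that `load_A` issues first even though `load_B` appears earlier in the program: reads without a true dependency are free to reorder, and the read-after-write pair resolves itself once the store commits.
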
Here's a quick example of how VDCores simplifies programming while covering common performance pitfalls for you.
We build VDCores by composing only 5 basic compute instructions and 23 memory/control instructions, and use them to compose all operators used in QWen-8B inference. Compared to a state-of-the-art megakernel implementation, [Mirage Persistent Kernel](https://github.com/mirage-project/mirage), VDCores uses **67% fewer** lines of code and achieves a performance gain of over **12%**.

VDCores does not get this edge by hand-tuning better kernels, but through its decoupled runtime and flexible programming interface. We illustrate this with two examples below.

<img src="../../images/dae/performance.png" style="width: 55%;display: flex;justify-content: center;" alt="QWen-8B Performance" />

### Example 1: **Free** "Prefetch" of Non-Dependent Memory Buffers

Consider an attention kernel followed by a linear projection with residual addition. In VDCores, we connect them by dependencies rather than manually fusing/staging them. (Also note that VDCores has no notion of a kernel boundary; we mark the original kernel boundaries in the example for ease of understanding.)

{{< dae/example1 >}}

This is the key shift: {{< highlight-text >}}overlap **emerges automatically**{{< /highlight-text >}} from runtime dependency resolution, without humans splitting code into explicit "prefetch" stages or manually fusing kernels to force concurrency.

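One way to see why the overlap needs no manual staging is a small dependency-driven timeline model. This is a hypothetical sketch: the task names, durations, and the one-memory-engine/one-compute-engine assumption are ours, not VDCores internals.

```python
# A toy dependency-driven scheduler. Tasks: (name, engine, duration, deps).

def schedule(tasks):
    """Greedily run ready tasks on their engine; returns {name: (start, end)}."""
    done, times = {}, {}
    engine_free = {"mem": 0, "compute": 0}
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if all(d in done for d in t[3])]
        # prefer the ready task whose inputs were available earliest
        task = min(ready, key=lambda t: max([done[d] for d in t[3]], default=0))
        name, engine, dur, deps = task
        start = max(engine_free[engine], max([done[d] for d in deps], default=0))
        times[name] = (start, start + dur)
        done[name] = engine_free[engine] = start + dur
        pending.remove(task)
    return times

tasks = [
    ("load_qkv",     "mem",     1, []),
    ("attention",    "compute", 4, ["load_qkv"]),
    ("load_weights", "mem",     2, []),                   # no dep on attention
    ("matmul",       "compute", 3, ["attention", "load_weights"]),
]
t = schedule(tasks)
# The weight load (1-3) is fully hidden under attention (1-5): nothing in
# its dependency set forces it to wait for the "kernel" before it.
print(t["load_weights"], t["attention"], t["matmul"])  # (1, 3) (1, 5) (5, 8)
```

The prefetch of the linear layer's weights is not expressed anywhere; it falls out of the fact that `load_weights` has an empty dependency set.
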
### Example 2: **Flexible and Zero-Overhead** Core Composition

Another secret sauce of VDCores is the **composability** of its components. The same set of compute instructions can be composed with different memory instructions and different memory dependencies, allowing programmers to quickly experiment with different plans without manual kernel rewriting and fusion.

Consider an MLP block: GEMV (Up + Gate) followed by SiLU activation and GEMV (Down). Input is shape [1, 4096], Up and Gate outputs are [1, 12288], and Down output is [1, 4096]. We can tile Gate and Up along the N dimension and Down along the K dimension.

<img src="../../images/dae/example2.jpg" alt="flexible core composition: two schedules" style="width: 100%;" />

**Schedule 1** executes the operations in order and fuses SiLU with Up: straightforward and amenable to kernel fusion for optimization.

**Schedule 2** exploits output-tile-level dependencies and lets the runtime automatically achieve more overlap: as soon as a Gate+Up tile pair completes, SiLU runs on 4 spare SMs without stalling the Down projection, which can begin consuming earlier tiles immediately.

Hard to tell which one is faster, huh?
Manually morphing between these two schedules requires significant changes to the kernel implementation. With the decoupled cores abstraction, switching between them requires only an **instruction-flow-level change**; all tasks remain composable, without sacrificing performance.
We tried both within 10 minutes with VDCores, and got a quick 7% performance gain on this operator.

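A rough way to see the difference between the two schedules is to compare, for the same compute tasks, how much work must finish before the first Down tile can start. Only the dependency edges differ between the two specifications; the tile names and two-tile granularity here are illustrative, not the real instruction set.

```python
# Two dependency specifications over the same compute tasks (Example 2).

# Schedule 1: operator by operator; SiLU and Down wait for ALL Up/Gate tiles.
SCHED1 = {
    "silu_t0": {"up_t0", "gate_t0", "up_t1", "gate_t1"},
    "silu_t1": {"up_t0", "gate_t0", "up_t1", "gate_t1"},
    "down_t0": {"silu_t0", "silu_t1"},
    "down_t1": {"silu_t0", "silu_t1"},
}

# Schedule 2: tile-level dependencies; each SiLU tile waits only on its own
# Gate+Up pair, and each Down tile only on the SiLU tile it consumes.
SCHED2 = {
    "silu_t0": {"up_t0", "gate_t0"},
    "silu_t1": {"up_t1", "gate_t1"},
    "down_t0": {"silu_t0"},
    "down_t1": {"silu_t1"},
}

def prereqs(task, deps):
    """Transitive set of tasks that must complete before `task` can start."""
    out = set()
    for d in deps.get(task, set()):
        out |= {d} | prereqs(d, deps)
    return out

# Under Schedule 2 the first Down tile waits on 3 tasks instead of 6, so it
# can start consuming early tiles while the rest of Up/Gate is still running.
print(len(prereqs("down_t0", SCHED1)), len(prereqs("down_t0", SCHED2)))  # 6 3
```

Switching schedules is literally a change to the dependency dictionaries; no compute task is rewritten.
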
## 3. Turning GPU SMs into Virtual Decoupled Cores

> We turn every SM on an H200 into a pair of memory/compute decoupled cores, connected by message queues, all running at the speed of the GPU!

We materialize the concept of decoupled cores on top of a single GPU SM's hardware, and call them **Virtual** Decoupled Cores.
Making these virtual components keep up with raw GPU speed remains a major performance-engineering challenge. To reach PFLOPs of compute and multi-terabyte-per-second memory bandwidth, every SM cycle counts, and there is only limited headroom for virtual-core overheads.

The main idea is to build {{< highlight-text >}}**virtual software memory cores and compute cores on top of warps**{{< /highlight-text >}}, and let them communicate through explicit queues and ports. VDCores assembles the warps within a single SM into two kinds of "cores" (memory cores and compute cores), implementing a small, software-defined superscalar processor. On the memory side, we expose (i) an **allocation & branch / control unit**, and (ii) **configurable load and store units**, all running asynchronously.

<!-- {{< placeholder "VDCores overview with divided responsibility [programmer, runtime]" >}} -->

VDCores also draws on ideas from microarchitectural design. Its core approach is to rebalance responsibilities between the runtime and the programmer/compiler:
- **Programmers/compilers** focus on specifying *what* must happen and being precise about dependencies (what consumes what), rather than hand-inlining a separate "prefetch" phase and then a "compute" phase.
- **The runtime** owns control flow and runtime management: scheduling memory/compute operations, managing in-flight instructions, and allocating local memory spaces as they become available.

Under this principle, several designs emerge that further optimize performance while keeping the flexibility:
- Instruction issue is ordered, but completion can be out of order. Control flow keeps program order when needed, while the load dispatch unit (LDU) can complete loads out of order (with compiler hints) to unlock overlap.
- Programmable dependencies with software-controlled virtual ports. Control logic routes instructions to load/store "engines" without baking scheduling policy into every kernel.
- And so much more!

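The first design point, ordered issue with out-of-order completion, can be illustrated with a small completion-queue model. This is a hedged sketch using Python's `heapq` as a stand-in completion queue; the buffer names, latencies, and consumer mapping are invented for illustration.

```python
# Toy model of "issue in order, complete out of order": loads are issued in
# program order, but each completion unblocks only its own consumer, so a
# slow early load does not stall the instructions issued behind it.
import heapq

def run(loads, consumer_of):
    """loads: [(buffer, latency)] in program order; returns consumers in
    the order their input buffers actually complete."""
    in_flight = []
    for t, (buf, latency) in enumerate(loads):
        heapq.heappush(in_flight, (t + latency, buf))  # ordered issue at time t
    completed = []
    while in_flight:
        _, buf = heapq.heappop(in_flight)              # out-of-order completion
        completed.append(consumer_of[buf])
    return completed

loads = [("weights", 10), ("act0", 1), ("act1", 1)]    # slow load issued first
order = run(loads, {"weights": "matmul", "act0": "add0", "act1": "add1"})
print(order)  # ['add0', 'add1', 'matmul']
```

With strictly in-order completion, both `add0` and `add1` would wait behind the slow weight load; letting completion float (under the dependency guarantees above) is exactly what unlocks the overlap.
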
## 4. Decoupled Cores: In Live Action and in the Wild

> We are working to bring VDCores to the open-source community and to a wider range of cores and hardware platforms. Stay tuned!

VDCores's decoupled model goes beyond a cleaner way to write one kernel; it is {{< highlight-text >}}**a substrate for systematic overlap and composition**{{< /highlight-text >}}. Once memory, compute, and control communicate only through explicit dependencies, we can schedule and explore pipelines at a higher level and let the runtime safely exploit concurrency and performance. Here are some exciting directions we are exploring that may reveal the full potential of the VDCores system:

- **Communication Virtual Cores:** VDCores is all about isolating and decomposing work into separate, decoupled components, and communication/networking should be one of them. Whether it is inter-rack communication over InfiniBand or communication between GPUs, it composes naturally with the existing memory and compute decoupled cores.
- **Adapting to Tiered Memory and New Architectures:** By decoupling issue/completion and separating memory/compute concerns, the same kernel structure can adapt to evolving GPU mechanisms (e.g., new [async memory](https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/#streaming_multiprocessors_compute_engines_for_the_ai_factory) types on Blackwell) and a wider range of memory locations and operations, without changing the VDCores application.
- **Composition at Runtime:** VDCores is designed to be a runtime substrate that makes it easier to compose kernels into larger pipelines, coordinate resource allocation, and reason about end-to-end overlap beyond traditional kernel boundaries. Given the power to compose any memory/compute operations dynamically at runtime, fine-grained, runtime-aware policies could be further explored.

We'll cover these topics in future posts in this series. Before that, we're excited to bring the runtime to the public and let you all give it a spin very soon. Stay tuned!

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
<div class="step-gallery" id="gallery-ex1" style="margin: 1.5rem 0;">
  <div style="display: flex; gap: 1rem; align-items: flex-start;">
    <div style="flex: 4; font-size: 0.85rem;">
      <div class="step-text" data-gallery="gallery-ex1" data-step="0" style="padding: 0.5rem 0.75rem; margin-bottom: 0.4rem; border-left: 3px solid #555; cursor: pointer; opacity: 0.4; transition: opacity 0.2s;">
        <strong>Overview:</strong> The colors indicate which memory buffers each compute operation depends on.
      </div>
      <div class="step-text" data-gallery="gallery-ex1" data-step="1" style="padding: 0.5rem 0.75rem; margin-bottom: 0.4rem; border-left: 3px solid #555; cursor: pointer; opacity: 0.4; transition: opacity 0.2s;">
        <strong>Step 1:</strong> Starting from the first compute task, the compute core waits for the readiness of <span style="color: #E67E22; font-weight: 600;">buffer 0</span> and <span style="color: #E67E22; font-weight: 600;">1</span>.
      </div>
      <div class="step-text" data-gallery="gallery-ex1" data-step="2" style="padding: 0.5rem 0.75rem; margin-bottom: 0.4rem; border-left: 3px solid #555; cursor: pointer; opacity: 0.4; transition: opacity 0.2s;">
        <strong>Step 2:</strong> During <span style="color: #E67E22; font-weight: 600;">compute task 0</span>'s execution, the runtime can start to prefetch <span style="color: #5DADE2; font-weight: 600;">weights of the linear layer</span> — no need to wait for the first kernel to finish.
      </div>
      <div class="step-text" data-gallery="gallery-ex1" data-step="3" style="padding: 0.5rem 0.75rem; margin-bottom: 0.4rem; border-left: 3px solid #555; cursor: pointer; opacity: 0.4; transition: opacity 0.2s;">
        <strong>Step 3:</strong> When <span style="color: #5DADE2; font-weight: 600;">buffer 3</span> is finally available, the compute core can proceed with the <span style="color: #5DADE2; font-weight: 600;">matmul</span> without stalling on the weight buffer.
      </div>
      <div class="step-text" data-gallery="gallery-ex1" data-step="4" style="padding: 0.5rem 0.75rem; margin-bottom: 0.4rem; border-left: 3px solid #555; cursor: pointer; opacity: 0.4; transition: opacity 0.2s;">
        <strong>Step 4:</strong> Likewise, the <span style="color: #2471A3; font-weight: 600;">residual input</span> can be prefetched before the <span style="color: #2471A3; font-weight: 600;">vec_add</span>.
      </div>
    </div>
    <div style="flex: 6;">
      <img class="step-img" data-gallery="gallery-ex1" data-step="0" src="{{ "images/dae/example1-1.jpg" | relURL }}" alt="overview: two kernels with color-coded dependencies" style="width: 100%; display: none;" />
      <img class="step-img" data-gallery="gallery-ex1" data-step="1" src="{{ "images/dae/example1-2.jpg" | relURL }}" alt="step 1: compute core waits for buffer 0 and 1" style="width: 100%; display: none;" />
      <img class="step-img" data-gallery="gallery-ex1" data-step="2" src="{{ "images/dae/example1-3.jpg" | relURL }}" alt="step 2: prefetch weights during kernel 0" style="width: 100%; display: none;" />
      <img class="step-img" data-gallery="gallery-ex1" data-step="3" src="{{ "images/dae/example1-4.jpg" | relURL }}" alt="step 3: proceed with matmul when buffer 3 ready" style="width: 100%; display: none;" />
      <img class="step-img" data-gallery="gallery-ex1" data-step="4" src="{{ "images/dae/example1-5.jpg" | relURL }}" alt="step 4: prefetch residual before vec_add" style="width: 100%; display: none;" />
      <div style="display: flex; align-items: center; justify-content: center; gap: 1rem; margin-top: 0.4rem;">
        <span class="step-prev" data-gallery="gallery-ex1" style="cursor: pointer; color: #268BD2; font-size: 1rem; user-select: none; padding: 0.2rem 0.5rem; border: 1px solid #268BD2; border-radius: 4px; transition: background 0.2s;" onmouseover="this.style.background='#268BD2';this.style.color='#fff'" onmouseout="this.style.background='transparent';this.style.color='#268BD2'">&#8592; prev</span>
        <span class="step-counter" data-gallery="gallery-ex1" style="font-size: 0.8rem; color: #888;">1 / 5</span>
        <span class="step-next" data-gallery="gallery-ex1" style="cursor: pointer; color: #268BD2; font-size: 1rem; user-select: none; padding: 0.2rem 0.5rem; border: 1px solid #268BD2; border-radius: 4px; transition: background 0.2s;" onmouseover="this.style.background='#268BD2';this.style.color='#fff'" onmouseout="this.style.background='transparent';this.style.color='#268BD2'">next &#8594;</span>
      </div>
    </div>
  </div>
</div>

<script>
(function() {
  var gallery = document.getElementById('gallery-ex1');
  var id = 'gallery-ex1';
  var texts = gallery.querySelectorAll('.step-text[data-gallery="'+id+'"]');
  var imgs = gallery.querySelectorAll('.step-img[data-gallery="'+id+'"]');
  var counter = gallery.querySelector('.step-counter[data-gallery="'+id+'"]');
  var current = 0;

  function show(idx) {
    texts.forEach(function(t, i) {
      t.style.opacity = i === idx ? '1' : '0.4';
      t.style.borderLeftColor = i === idx ? '#268BD2' : '#555';
    });
    imgs.forEach(function(img, i) {
      img.style.display = i === idx ? 'block' : 'none';
    });
    counter.textContent = (idx + 1) + ' / ' + texts.length;
    current = idx;
  }

  texts.forEach(function(t, i) {
    t.addEventListener('click', function() { show(i); });
  });

  gallery.querySelector('.step-prev').addEventListener('click', function() {
    show(current > 0 ? current - 1 : texts.length - 1);
  });
  gallery.querySelector('.step-next').addEventListener('click', function() {
    show(current < texts.length - 1 ? current + 1 : 0);
  });

  show(0);
})();
</script>

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
<span class="highlight-text" {{ with .Get "color" }}style="color: {{ . }};"{{ end }}>{{ .Inner | markdownify }}</span>

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
<figure style="margin: 2rem auto; text-align: center;">
  <div style="width: 100%; height: {{ with .Get "height" }}{{ . }}{{ else }}300px{{ end }}; background: #ffffff; border: 2px dashed #ccc; border-radius: 8px; display: flex; align-items: center; justify-content: center;">
    <span style="color: #999; font-size: 1.1rem; font-style: italic;">{{ .Get 0 }}</span>
  </div>
  {{ with .Get 0 }}<figcaption style="margin-top: 0.5rem; color: #666; font-size: 0.9rem;">{{ . }}</figcaption>{{ end }}
</figure>
