
Commit 1fdc3c0

[prf/dec] Update GPU execution plans to clarify prefill/decode structure and KV cache handling
1 parent d74a228 commit 1fdc3c0

4 files changed

Lines changed: 42 additions & 32 deletions


src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlan.java

Lines changed: 14 additions & 9 deletions
@@ -8,18 +8,23 @@
 /**
  * Common contract for all TornadoVM GPU execution plans.
  *
- * <p>Two concrete implementations exist:</p>
+ * <p>Three concrete implementations exist:</p>
  * <ul>
- * <li>{@link TornadoVMMasterPlanStandard} — single-token forward pass; used for the
- * baseline GPU path and Phase 2 sequential prefill/decode.</li>
- * <li>{@link TornadoVMMasterPlanWithBatchPrefillDecode} — unified plan for Phase 4 batched
- * prefill + single-token decode within one {@code TornadoExecutionPlan}.</li>
+ * <li>{@link TornadoVMMasterPlanStandard} — baseline single-token forward pass
+ * (preprocessing + N layers + logits).</li>
+ * <li>{@link TornadoVMMasterPlanWithPrefillDecode} — sequential prefill/decode separation;
+ * reuses the same N layer graphs for both phases, skipping logits during prefill.</li>
+ * <li>{@link TornadoVMMasterPlanWithBatchPrefillDecode} — batched prefill + single-token
+ * decode; holds 2N+3 graphs in one plan to keep the KV cache on device across phases.</li>
  * </ul>
  *
- * <p>The {@link #initializeTornadoVMPlan} factory selects the appropriate implementation
- * based on {@code llama.prefillBatchSize}: if {@code > 1}, returns a
- * {@link TornadoVMMasterPlanWithBatchPrefillDecode}; otherwise returns a
- * {@link TornadoVMMasterPlanStandard}.</p>
+ * <p>The {@link #initializeTornadoVMPlan} factory selects the implementation based on
+ * {@code llama.withPrefillDecode} and {@code llama.prefillBatchSize}:</p>
+ * <ul>
+ * <li>{@code withPrefillDecode=false} → {@link TornadoVMMasterPlanStandard}</li>
+ * <li>{@code withPrefillDecode=true}, {@code prefillBatchSize=1} → {@link TornadoVMMasterPlanWithPrefillDecode}</li>
+ * <li>{@code withPrefillDecode=true}, {@code prefillBatchSize>1} → {@link TornadoVMMasterPlanWithBatchPrefillDecode}</li>
+ * </ul>
  */
 public interface TornadoVMMasterPlan {

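The three-way dispatch documented in this hunk can be sketched in plain Java. This is a minimal sketch: `PlanKind` and `selectPlanKind` are illustrative names, not the repository's actual factory code; only the three selection rules mirror the Javadoc above.

```java
// Sketch of the documented initializeTornadoVMPlan dispatch rules.
// PlanKind and selectPlanKind are illustrative, not gpullama3 API.
public class PlanSelectionSketch {

    enum PlanKind { STANDARD, PREFILL_DECODE, BATCH_PREFILL_DECODE }

    static PlanKind selectPlanKind(boolean withPrefillDecode, int prefillBatchSize) {
        if (!withPrefillDecode) {
            return PlanKind.STANDARD;           // baseline single-token plan
        }
        return prefillBatchSize > 1
                ? PlanKind.BATCH_PREFILL_DECODE // batched prefill + decode
                : PlanKind.PREFILL_DECODE;      // sequential prefill/decode
    }

    public static void main(String[] args) {
        System.out.println(selectPlanKind(false, 8));  // STANDARD
        System.out.println(selectPlanKind(true, 1));   // PREFILL_DECODE
        System.out.println(selectPlanKind(true, 16));  // BATCH_PREFILL_DECODE
    }
}
```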
src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlanStandard.java

Lines changed: 2 additions & 2 deletions
@@ -14,8 +14,7 @@
  * Standard (single-token) GPU execution plan.
  *
  * <p>Processes one token at a time through preprocessing + N transformer layers +
- * logits projection. Used for both the baseline GPU path and the Phase 2
- * sequential prefill/decode path.</p>
+ * logits projection.</p>
  */
 public class TornadoVMMasterPlanStandard implements TornadoVMMasterPlan {

src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlanWithBatchPrefillDecode.java

Lines changed: 14 additions & 9 deletions
@@ -29,20 +29,25 @@
 /**
  * GPU execution plan for batched prefill + single-token decode.
  *
- * <p>A single {@link TornadoExecutionPlan} holds all graphs so that the KV cache
- * ({@code wrapKeyCache}, {@code wrapValueCache}) is shared on device via
- * {@code persistOnDevice}/{@code consumeFromDevice}. Two separate plans would
- * allocate independent device buffers and lose the prefill KV state.</p>
+ * <p>A single {@link TornadoExecutionPlan} holds all {@link TaskGraph}s for the
+ * batched prefill and single-token decode phases, with the following structure:</p>
  *
- * <p>Graph layout (2N+3 graphs total):</p>
+ * <p>TaskGraph layout (2N+3 TaskGraphs total):</p>
  * <pre>
- * [0]          batch activation      B×dim FP16 → FP32
- * [1..N]       batch layer graphs    B tokens, all transformer ops
- * [N+1]        decode activation     single-token FP16 → FP32 + KV-cache pass-through
- * [N+2..2N+1]  decode layer graphs   single-token, standard kernels
+ * [0]          prefill batch activation    B×dim FP16 → FP32
+ * [1..N]       prefill batch layer graphs  B tokens, all transformer ops
+ * [N+1]        decode activation           single-token FP16 → FP32 + KV-cache pass-through
+ * [N+2..2N+1]  decode layer graphs         single-token, standard kernels
 * [2N+2]       logits graph
 * </pre>
 *
+ * <p>
+ * Incorporating the cross-phase {@link TaskGraph}s within a single {@link TornadoExecutionPlan}
+ * is necessary to enable KV cache ({@code wrapKeyCache}, {@code wrapValueCache}) sharing
+ * across the prefill and decode phases. The KV cache pointers are chained across
+ * {@link TaskGraph}s via the {@code persistOnDevice}/{@code consumeFromDevice} API.
+ * </p>
+ *
 * <p>KV cache pointer chain across phases:</p>
 * <pre>
 * batchLayer[N-1] --persistOnDevice(wrapKeyCache)-→
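The 2N+3 layout documented above is plain index arithmetic. The sketch below makes the ranges explicit for a model with N transformer layers; the class and helper names are illustrative, not part of the repository.

```java
// Index arithmetic for the documented 2N+3 TaskGraph layout.
// Class and method names are illustrative, not repository code.
public class BatchPlanLayout {

    final int n; // number of transformer layers

    BatchPlanLayout(int n) { this.n = n; }

    int prefillActivation()      { return 0; }              // [0]
    int prefillLayer(int layer)  { return 1 + layer; }      // [1..N]
    int decodeActivation()       { return n + 1; }          // [N+1]
    int decodeLayer(int layer)   { return n + 2 + layer; }  // [N+2..2N+1]
    int logits()                 { return 2 * n + 2; }      // [2N+2]
    int totalGraphs()            { return 2 * n + 3; }

    public static void main(String[] args) {
        BatchPlanLayout p = new BatchPlanLayout(32); // e.g. a 32-layer model
        System.out.println(p.decodeActivation());    // 33
        System.out.println(p.logits());              // 66
        System.out.println(p.totalGraphs());         // 67
    }
}
```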

src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlanWithPrefillDecode.java

Lines changed: 12 additions & 12 deletions
@@ -25,26 +25,26 @@
 import java.util.List;
 
 /**
- * GPU execution plan for single-token prefill/decode separation.
+ * GPU execution plan for sequential (single-token) prefill/decode separation.
  *
- * <p>Uses dedicated layer classes that carry correct cross-graph
- * {@code consumeFromDevice} source names for both CUDA-graph and interpreter
- * (no-CUDA-graph) mode. All graphs are owned by this plan and built from scratch —
- * no reuse of the standard execution path.</p>
+ * <p>A single {@link TornadoExecutionPlan} holds all graphs so that the KV cache
+ * ({@code wrapKeyCache}, {@code wrapValueCache}) is allocated once and remains on
+ * device across both phases. Prefill and decode reuse the same N layer graphs;
+ * only the logits graph is skipped during prefill.</p>
  *
  * <p>Graph layout (N+2 graphs total):</p>
  * <pre>
- * [0]     decodeActivation    single-token FP16 → FP32; KV-cache allocated on first execution
- * [1..N]  layer_0..layer_N-1  transformer layers (attention + FFN)
- * [N+1]   logits              final RMSNorm + wcls matmul
+ * [0]          decodeActivation    single-token FP16 → FP32; KV-cache allocated on first execution
+ * [1..N]       layer_0..layer_N-1  transformer layers (attention + FFN)
+ * [N+1]        logits              final RMSNorm + wcls matmul
  * </pre>
  *
- * <p>Two distinct forward passes:</p>
+ * <p>Two forward passes:</p>
  * <ul>
- * <li>{@link #tornadoVMForwardPrefill} — runs graphs 0..N, skips logits.
- * KV cache is populated for each prompt token; logits are discarded.</li>
+ * <li>{@link #tornadoVMForwardPrefill} — graphs 0..N (activation + layers), logits skipped.
+ * Called once per prompt token; populates the KV cache.</li>
  * <li>{@link #tornadoVMForwardDecode} — full pass including logits.
- * Called for each generated token.</li>
+ * Called once per generated token; returns logits for sampling.</li>
  * </ul>
  */
 public class TornadoVMMasterPlanWithPrefillDecode implements TornadoVMMasterPlan {
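Under the N+2 layout documented in this hunk, the two forward passes differ only in whether the final logits graph runs. A plain-Java sketch of the executed graph ranges (method names are illustrative, not the plan's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of which graph indices each documented forward pass executes.
// Method names are illustrative, not repository API.
public class PrefillDecodeRanges {

    // Prefill: graphs 0..N (activation + N layers); logits graph skipped.
    static List<Integer> prefillGraphs(int n) {
        List<Integer> ids = new ArrayList<>();
        for (int g = 0; g <= n; g++) ids.add(g);
        return ids;
    }

    // Decode: graphs 0..N+1, i.e. the full pass including logits.
    static List<Integer> decodeGraphs(int n) {
        List<Integer> ids = new ArrayList<>();
        for (int g = 0; g <= n + 1; g++) ids.add(g);
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(prefillGraphs(4)); // [0, 1, 2, 3, 4]
        System.out.println(decodeGraphs(4));  // [0, 1, 2, 3, 4, 5]
    }
}
```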
