Commit df55daf

TimDettmers and claude committed

Add workload-weighted kernel analysis and vLLM deployment model

- token_analysis.md: workload analysis using 397 sessions of real token distributions. Single-user: M=1 decode is 80-84% of GEMM time. Multi-user vLLM simulation (1-64 users): bimodal M distribution (decode-only vs decode+prefill chunk), crossover at ~16 users.
- token_distributions.json: per-turn frequency distributions for prefill and decode token counts (power-of-two buckets, sum to 1.0).
- kbit-kernel-spec.md: updated dequant section (single kernel launch, ncu-measured times), added practical kernel importance table showing scalar GEMV dominates at 1-4 users, dq+cuBLAS at 16+, MMA has minimal impact in either regime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent b02ff66 commit df55daf

File tree: 3 files changed, +270 -11 lines


kbit-kernel-spec.md

Lines changed: 49 additions & 11 deletions
```diff
@@ -110,6 +110,14 @@ The batch size M seen by each kernel varies:
 - **M=1-32+**: dense layers (full batch)
 - **M=32-512+**: prefill / prompt processing

+See `token_analysis.md` for a detailed workload analysis using real
+token distributions from 397 Claude Code sessions. The analysis shows
+that in single-user inference, M=1 decode accounts for 80-84% of total
+GEMM time. In multi-user vLLM serving, the M distribution is bimodal
+(M=num_users for decode-only iterations, M=num_users+chunk for prefill
+iterations), and the crossover where quantized kernels become slower
+than fp16 is at ~16 concurrent users.
+
 ---

 ## Four-kernel strategy
@@ -137,6 +145,24 @@ Why four kernels instead of one:
 - MoE experts launched individually waste 88-97% of SMs. Grouping
   all active experts into one kernel launch solves this.

+**Practical importance (from workload analysis in `token_analysis.md`):**
+
+In real deployments, the M distribution is bimodal — not uniform. With
+vLLM continuous batching, iterations are either pure-decode (M=num_users)
+or decode+prefill (M=num_users+chunk_size). The MMA kernel's M=5-16
+range falls in the gap between these modes.
+
+| Scenario | Scalar share | MMA share | dq+cuBLAS share |
+|----------|-------------|-----------|-----------------|
+| 1 user | 87% | 0% | 13% |
+| 4 users | 59% | 0% | 41% |
+| 8 users | 0% | 45% | 55% |
+| 16 users | 0% | 24% | 76% |
+| 32+ users | 0% | 6% | 94% |
+
+Optimization priority: scalar GEMV (1-4 users) > dequant overhead
+reduction (16+ users) > MMA kernel (8-16 users only, narrow range).
+
 ---

 ## 1. Scalar GEMV (`kbit_scalar_gemv`)
@@ -306,20 +332,32 @@
 the MMA dequant kernel takes ~68 us (instruction-limited, only 1.3%
 of execution is MMA). A fused dequant kernel would take ~5 us for
 this shape, so dequant + cuBLAS ~27 us would beat 68 us.

-**Current dequant implementation is not fused.** `dequantize_kbit`
-dispatches ~15 PyTorch elementwise kernels per call, giving a constant
-~800 us overhead regardless of shape. This makes dequant + cuBLAS
-non-competitive at M<64. A fused dequant CUDA kernel is needed for
-strategy 3 to be viable.
+**Dequant kernel** (`kDequantizeBlockwise_kbit_vec`): a single CUDA
+kernel that reads k-bit packed data + absmax and writes fp16 output.
+Templated on absmax type: float32 (from `quantize_kbit` directly),
+uint8 E4M4, or fp16. The float32 absmax path was added to eliminate
+a previous Python-side E4M4 conversion that launched ~15 PyTorch
+elementwise kernels (~800 us). Now it is a single kernel launch.
+
+Dequant GPU kernel times (ncu-measured, k=4):
+
+| Shape | Elements | Kernel time |
+|-------|----------|-------------|
+| gateup/down | 10.5M | ~30 us |
+| Q/O | 8.4M | ~25 us |
+| KV | 1.0M | ~5 us |
+
+Times scale linearly with element count and k.

-The crossover point depends on shape. For DRAM-bound shapes (Llama3-8B
-gate/up at 4096x14336), the MMA dequant kernel wins at 1.5x over
-cuBLAS because the 3.2x bandwidth savings dominate. For L2-resident
-shapes (MoE experts, small dense layers), cuBLAS wins because the
-kernel is instruction-limited, not bandwidth-limited.
+**Crossover vs MMA:** At M<=16, MMA beats dequant+cuBLAS on most
+shapes because the fixed dequant cost (~25-30 us) is large relative
+to the matmul. At M>=64, dequant+cuBLAS wins because cuBLAS scales
+efficiently while MMA is instruction-limited. The crossover is
+M=32-64 depending on shape.

 **Data format:** Uses flat layout (same as scalar GEMV). The
-`dequantize_kbit` launcher handles both uint8 E4M4 and float32 absmax.
+`dequantize_kbit` launcher handles float32, uint8 E4M4, and fp16
+absmax via the `_KBIT_ABSMAX_SUFFIX` dispatch map.

 ---
```

token_analysis.md

Lines changed: 171 additions & 0 deletions
# Claude Code Token Analysis

## Session data location

Session JSONL files are stored at:

```
~/.claude/projects/<project-path>/<session-id>.jsonl
```

Each file contains one JSON object per line with types: `user`, `assistant`, `system`, `progress`, `file-history-snapshot`.

## Methodology

### Input tokens (prefill)

Input = user prompts + tool results. These are measured from `user`-type messages in the JSONL:

- `content[].type == "text"` entries give user prompt text
- `content[].type == "tool_result"` entries give tool outputs (file reads, grep, bash)

Token counts are estimated at chars/4. The system prompt, system injections, and the model's own prior output re-read as context are excluded — we only count new content the user/tools provide.

### Generated tokens (decode)

Generated = `output_tokens` from the `usage` field on `assistant`-type messages. This includes all model generation: text responses, tool call arguments, and thinking tokens (thinking content is encrypted, so it cannot be separated out).

### Per-turn grouping

A "turn" = one user message + all assistant API calls until the next user message. A single user turn may trigger multiple API calls (the model calls a tool, gets the result, calls another tool, etc.). Input for a turn = content in that user message. Output for a turn = sum of `output_tokens` across all API calls in that turn.

### Histogram bucketing

Values are bucketed to the nearest power of two: `2^round(log2(n))`.
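As a worked check of the rule (mirroring the `bucket` helper in the script at the end of this document): values round to the *nearest* power of two, so 24 buckets up to 32, while 20 buckets down to 16.

```python
import math

def bucket(n: int) -> int:
    # Nearest power of two: 2^round(log2(n)); zero for non-positive n.
    if n <= 0:
        return 0
    return 2 ** round(math.log2(n))

print(bucket(20), bucket(24), bucket(1000))  # 16 32 1024
```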
## Aggregate results: 397 sessions, 25,162 user turns

Data collected from 472 session files across all projects (75 empty/skipped). 41,537 total API calls.

| | Est. tokens |
|---|---:|
| Input (prefill) | ~31.8M |
| Generated (decode) | ~2.3M |
| **Ratio** | **13.7:1 input to output** |

### Frequency distributions

Per-turn frequency distributions (summing to 1.0) are stored in `token_distributions.json`. The file contains two distributions:

- `input_tokens_per_turn.freq` — estimated prefill tokens per user turn (user text + tool results). 24,155 non-empty turns.
- `generated_tokens_per_turn.freq` — decode tokens per user turn (from API `output_tokens`). 20,911 non-empty turns.

Keys are power-of-two bucket sizes (as strings), values are frequencies.
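A sketch of how the `freq` maps are meant to be consumed: the bucket-weighted mean approximates the average tokens per turn. The values below are copied from `generated_tokens_per_turn.freq`; the mean comes out at ~114 tokens, the figure used in the kernel analysis below.

```python
# generated_tokens_per_turn.freq, copied from token_distributions.json
freq = {
    1: 0.065502, 2: 0.201518, 4: 0.109902, 8: 0.100449,
    16: 0.112766, 32: 0.19302, 64: 0.04144, 128: 0.055619,
    256: 0.048219, 512: 0.034947, 1024: 0.022057, 2048: 0.010312,
    4096: 0.003533, 8192: 0.000668, 16384: 4.8e-05,
}
assert abs(sum(freq.values()) - 1.0) < 1e-3   # frequencies sum to 1.0
mean = sum(b * f for b, f in freq.items())    # bucket-weighted mean
print(f"mean generated tokens/turn: {mean:.1f}")  # ~114
```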
### Interpretation

- Input peaks at 16-32 tokens (short prompts, small tool results) with a flat tail through 2048. This reflects a mix of user typing (small) and tool results (variable).
- Output is bimodal: peaks at 2 tokens (20%, a single short tool call) and 32 tokens (19%, a tool call with a moderate argument). Text responses and code blocks (128-2048) account for ~17% of turns.
- Heavy generation (>4096 tokens) is rare (<0.5% of turns).

## Kernel performance weighted by workload

The token distributions in `token_distributions.json` serve as a workload model for estimating which GEMM kernels matter most in practice. The key mapping: **input tokens per turn = prefill M** (new tokens processed in a single forward pass with KV cache), **generated tokens per turn = number of decode steps at M=1** (or M=batch_size in multi-user serving).

### Single-user inference (M=1 decode)

In single-user autoregressive generation, each turn involves:

- **1 prefill pass** at M = input_tokens (prompt/tool results, distributed by `input_tokens_per_turn`)
- **N decode passes** at M = 1, where N is the number of generated tokens (distributed by `generated_tokens_per_turn`)

The average number of generated tokens per turn is ~114, so a typical turn has 1 prefill pass + 114 decode passes. Even though large prefills are individually expensive (a single M=32768 pass costs ~23,000 us/layer), they are rare enough (~1.4% frequency) that decode at M=1 dominates total wall-clock time at **80-84%** across k=2..5.

Per-layer time breakdown (k=4, Qwen3-Coder-Next shapes):

| Component | Time/turn/layer | % of total |
|-----------|----------------:|------------|
| Decode (114 steps x 55.6 us) | 6,347 us | 83.4% |
| Prefill (distributed) | 1,260 us | 16.6% |

The scalar GEMV kernel (M=1) is faster than fp16 cuBLAS because it reads 3-4x less data (k-bit compressed weights vs fp16). Overall weighted slowdown vs fp16: **0.57x** (i.e., 43% faster) at k=4.
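The decode share in the table follows directly from the numbers already quoted: ~114 decode steps per turn at 55.6 us per step, versus the distribution-weighted prefill cost.

```python
# Per-turn, per-layer split at k=4 (numbers from the table above).
decode_steps = 114.1             # bucket-weighted mean decode steps/turn
decode_us = decode_steps * 55.6  # ~6,344 us of M=1 scalar GEMV
prefill_us = 1260.0              # distribution-weighted prefill time
share = decode_us / (decode_us + prefill_us)
print(f"decode share: {share:.1%}")  # ~83.4%
```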
### Multi-user serving with vLLM

Production deployments use continuous batching (vLLM), which fundamentally changes the M distribution. The vLLM V1 scheduler (`vllm/v1/core/sched/scheduler.py`) works as follows:

1. **Decode-first**: all running (decoding) requests are scheduled first, each contributing 1 token. M starts at num_decoding_users.
2. **Chunked prefill**: the remaining token budget is used for at most one prefill chunk from a waiting request. The default chunk size is `max_model_len * 0.04` (e.g., 1280 for 32K context, 5120 for 128K).
3. **Token budget cap**: total tokens per step is bounded by `max_num_batched_tokens` (default 8192).
4. **One partial prefill at a time**: `max_num_partial_prefills` defaults to 1.

This creates a **bimodal M distribution**: iterations are either pure-decode (M = num_users) or decode + prefill chunk (M = num_users + chunk_size). The MMA kernel's effective range (M=8-32) falls in the gap between these modes and is rarely used.
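The scheduler steps above can be sketched as a toy model (illustrative only, not the actual vLLM code; `iteration_m` and its defaults are hypothetical):

```python
def iteration_m(num_users: int, pending_prefill: int,
                chunk_size: int = 512,
                max_num_batched_tokens: int = 8192) -> int:
    """Batch size M for one continuous-batching step (toy model)."""
    m = num_users  # decode-first: every running request adds 1 token
    if pending_prefill > 0:
        # At most one prefill chunk, capped by the remaining token budget.
        m += min(chunk_size, pending_prefill, max_num_batched_tokens - m)
    return m

print(iteration_m(8, 0))     # pure decode: M = 8
print(iteration_m(8, 2048))  # decode + prefill chunk: M = 8 + 512 = 520
```

With only these two modes, M values between num_users and num_users + chunk_size never occur, which is exactly the gap the MMA kernel falls into.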
Simulation results (k=4, chunk_size=512, token distributions from `token_distributions.json`):

| Users | Avg M | Decode-only iters | Dominant kernel | vs fp16 |
|------:|------:|------------------:|-----------------|--------:|
| 1 | 8 | 98.6% | scalar (87%) | 0.57x |
| 4 | 41 | 92.6% | scalar (59%) + dq+cuBLAS (41%) | 0.76x |
| 8 | 77 | 86.1% | MMA (45%) + dq+cuBLAS (55%) | 0.85x |
| 16 | 163 | 70.2% | dq+cuBLAS (76%) | 1.00x |
| 32 | 364 | 30.9% | dq+cuBLAS (93%) | 1.17x |
| 64 | 495 | 5.1% | dq+cuBLAS (98%) | 1.23x |

The crossover where quantized kernels become slower than fp16 is at **~16 concurrent users**. Below that, the bandwidth savings from k-bit compression outweigh the dequant overhead. Above that, the dequant cost (~30 us/shape at k=4) dominates because most iterations include a large prefill chunk where cuBLAS is highly efficient.

### Optimization priorities

The analysis identifies two regimes with different optimization targets:

**1-4 users (agents, local inference, code assistants):**
The scalar GEMV at M=1..4 accounts for 59-87% of total GEMM time. This kernel is already bandwidth-bound and faster than fp16. Further optimization (better ILP in the M-loop, wider vector loads) has the highest leverage. The dq+cuBLAS path handles the occasional prefill chunk (~41% of time at 4 users) with moderate overhead (1.25x vs fp16). The MMA kernel is effectively unused.

**16+ users (serving, API endpoints):**
dq+cuBLAS dominates (75-98% of time). The ~30 us dequant overhead per shape at k=4 is the primary cost. Reducing this — through a faster dequant kernel, fusing dequant into the matmul, or accepting float32 absmax to skip format conversion — would directly reduce the 1.17-1.23x slowdown vs fp16.

**The MMA kernel has minimal impact in either regime.** Its effective range (M=8-32) corresponds to pure-decode batches at 8-32 users, a shrinking slice of iterations as user count grows. At 4 users, M never reaches the MMA range. At 32 users, only 31% of iterations are pure-decode at M=32, and MMA accounts for just 5.8% of total weighted time.
## Script

```python
import json, math, os

# Placeholder path; substitute a real <project>/<session-id>.
SESSION = "~/.claude/projects/<project>/<session-id>.jsonl"

with open(os.path.expanduser(SESSION)) as f:
    lines = [json.loads(l) for l in f if l.strip()]

timeline = [l for l in lines if l.get('type') in ('user', 'assistant')]

turns = []
for i, msg in enumerate(timeline):
    if msg['type'] != 'user':
        continue
    # Estimate input characters: user text plus tool_result payloads.
    content = msg.get('message', {}).get('content', '')
    input_chars = 0
    if isinstance(content, list):
        for c in content:
            if c.get('type') == 'text':
                input_chars += len(c.get('text', ''))
            elif c.get('type') == 'tool_result':
                rc = c.get('content', '')
                if isinstance(rc, str):
                    input_chars += len(rc)
                elif isinstance(rc, list):
                    input_chars += sum(len(json.dumps(x)) for x in rc)
    elif isinstance(content, str):
        input_chars += len(content)

    # Sum output tokens over all assistant calls until the next user message.
    total_output = 0
    for j in range(i + 1, len(timeline)):
        if timeline[j]['type'] == 'user':
            break
        if timeline[j]['type'] == 'assistant':
            usage = timeline[j].get('message', {}).get('usage', {})
            total_output += usage.get('output_tokens', 0)

    turns.append({'input_est': input_chars // 4, 'output': total_output})

def bucket(n):
    # Nearest power of two: 2^round(log2(n)); zero for non-positive n.
    if n <= 0:
        return 0
    return 2 ** round(math.log2(n))

for label, key in [("Input", "input_est"), ("Generated", "output")]:
    vals = [t[key] for t in turns if t[key] > 0]
    buckets = {}
    for v in vals:
        b = bucket(v)
        buckets[b] = buckets.get(b, 0) + 1
    mx = max(buckets.values())
    print(f"\n=== {label} tokens per turn ({len(vals)} turns) ===")
    for b in sorted(buckets):
        bar = "#" * max(1, round(buckets[b] / mx * 40))
        print(f"{b:>8} {buckets[b]:>5} {bar}")
```

token_distributions.json

Lines changed: 50 additions & 0 deletions
```json
{
  "description": "Token count frequency distributions across 397 Claude Code sessions (472 files, 75 empty). Buckets are nearest power of two. Frequencies sum to 1.0.",
  "sessions": 397,
  "input_tokens_per_turn": {
    "description": "Estimated input tokens per user turn (user text + tool results, chars/4). Only non-empty turns included.",
    "num_turns": 24229,
    "freq": {
      "1": 0.00388,
      "2": 0.006108,
      "4": 0.049858,
      "8": 0.062198,
      "16": 0.152256,
      "32": 0.167733,
      "64": 0.100045,
      "128": 0.093566,
      "256": 0.085022,
      "512": 0.086219,
      "1024": 0.064221,
      "2048": 0.050147,
      "4096": 0.035742,
      "8192": 0.014157,
      "16384": 0.013785,
      "32768": 0.013703,
      "65536": 0.001279,
      "131072": 4.1e-05,
      "262144": 4.1e-05
    }
  },
  "generated_tokens_per_turn": {
    "description": "Output tokens per user turn (from API usage.output_tokens, includes text + tool calls + thinking). Only non-empty turns included.",
    "num_turns": 20946,
    "freq": {
      "1": 0.065502,
      "2": 0.201518,
      "4": 0.109902,
      "8": 0.100449,
      "16": 0.112766,
      "32": 0.19302,
      "64": 0.04144,
      "128": 0.055619,
      "256": 0.048219,
      "512": 0.034947,
      "1024": 0.022057,
      "2048": 0.010312,
      "4096": 0.003533,
      "8192": 0.000668,
      "16384": 4.8e-05
    }
  }
}
```
