Commit 1122e9d
feat(inference): add DeepSeek V4 Pro model architecture (#432)
* feat(inference): add DeepSeek V4 Pro model architecture
Add a MODEL_ARCHITECTURES entry for DeepSeek V4 Pro (1.6T/49B MoE, 61
layers, 1M context) following the existing per-model pattern. Attention
is modeled as a Hybrid stack of two interleaved compressed variants —
Heavily Compressed Attention (31 layers) and Compressed Sparse Attention
(30 layers) — each carrying a 128-token sliding-window branch and a
learnable attention sink.
Surface sliding-window attention on both alternating blocks: the diagram's
window note now derives from a per-spec slidingWindow field instead of the
block index, so hybrid models show window=128 on every attention variant
(gpt-oss behavior preserved). The specs-bar attention cell now derives from
attentionType so it reads "Hybrid" instead of the hardcoded "Sink/Full GQA".
Sourced from deepseek-ai/DeepSeek-V4-Pro (config.json, inference/model.py,
DeepSeek_V4.pdf).
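A minimal sketch of how a per-spec slidingWindow field and an attentionType-derived specs cell could fit together. The AttentionSpec and ModelArch shapes, and the windowNote and attentionCell helpers, are hypothetical names used for illustration; the real MODEL_ARCHITECTURES entry may be structured differently.

```typescript
// Hypothetical shapes; not the repository's actual types.
interface AttentionSpec {
  label: string;          // e.g. "HCA" or "CSA"
  layers: number;         // number of layers using this variant
  slidingWindow?: number; // window size in tokens, when the variant has one
}

interface ModelArch {
  name: string;
  attentionType: string;  // e.g. "Hybrid"
  attention: AttentionSpec[];
}

const deepseekV4Pro: ModelArch = {
  name: "DeepSeek V4 Pro",
  attentionType: "Hybrid",
  attention: [
    { label: "HCA", layers: 31, slidingWindow: 128 },
    { label: "CSA", layers: 30, slidingWindow: 128 },
  ],
};

// The window note is read from the spec itself, not inferred from block index.
function windowNote(spec: AttentionSpec): string | null {
  return spec.slidingWindow !== undefined ? `window=${spec.slidingWindow}` : null;
}

// The specs-bar cell reads attentionType instead of a hardcoded string.
function attentionCell(arch: ModelArch): string {
  return arch.attentionType;
}

const notes = deepseekV4Pro.attention.map(windowNote);
```

Because the note depends only on the spec, a model with a window on a single variant (the gpt-oss case) would simply leave slidingWindow unset on the other.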
* feat(inference): surface SWA as an explicit block; fix residual + glyph centering
DeepSeek V4 hybrid attention now drills down: expanding the CSA/HCA
attention block reveals the sliding-window branch as its own block
alongside the compressed branch (lightning indexer for CSA, heavy
compression for HCA), converging at shared-KV MQA + sink + output
projection. Gated to attentionType === 'Hybrid' so gpt-oss is unchanged.
Also fixes two diagram issues affecting all models:
- Residual bypass tapped at the RMSNorm's top edge, so its horizontal
connector ran across the norm block. Tap from the arrow gap above the
norm instead.
- Circle glyphs (+, ×, −) rendered off-center because
dominant-baseline: central is unreliable (Safari falls back to the
alphabetic baseline). Use dy=0.35em, which centers consistently across
browsers and matches central where it already worked.
* fix(inference): balance hybrid attention drill-down, center circle glyphs
Addresses three rendering issues in the model architecture diagram:
- Hybrid (CSA/HCA) attention now drills down into symmetric 2x2 columns:
Local (Sliding Window + Attention Sink) beside the two-stage Compressed
branch (compression + selector). This removes the lonely long connector
that made the expanded box look unbalanced and promotes the attention
sink to an explicit block; the merge block is now plain "Shared-KV MQA".
- The +, -, and x symbols inside merge/expand circles are drawn as
geometric strokes instead of <text>. Font baseline drift (even with dy
tuning) left the glyph sitting slightly low, which also made the residual
bypass line read as misaligned with the "+". The strokes are centered on
the circle's center, so the residual line is now co-linear with the arm.
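One way to draw the symbols as geometry, sketched under assumptions: glyphSegments and toSvgPath are hypothetical helpers, not the functions in this commit. Each glyph is a set of line segments built symmetrically around the circle's center, so no font baseline is involved.

```typescript
type Glyph = "+" | "-" | "x";

interface Segment { x1: number; y1: number; x2: number; y2: number; }

// Build the strokes for a glyph centered at (cx, cy) with half-length `arm`.
function glyphSegments(glyph: Glyph, cx: number, cy: number, arm: number): Segment[] {
  const horizontal: Segment = { x1: cx - arm, y1: cy, x2: cx + arm, y2: cy };
  const vertical: Segment = { x1: cx, y1: cy - arm, x2: cx, y2: cy + arm };
  // Diagonals are scaled by 1/sqrt(2) so "x" spans the same radius as "+".
  const d = arm * Math.SQRT1_2;
  const diagDown: Segment = { x1: cx - d, y1: cy - d, x2: cx + d, y2: cy + d };
  const diagUp: Segment = { x1: cx - d, y1: cy + d, x2: cx + d, y2: cy - d };
  switch (glyph) {
    case "+": return [horizontal, vertical];
    case "-": return [horizontal];
    case "x": return [diagDown, diagUp];
  }
}

// Serialize segments into an SVG path "d" attribute.
function toSvgPath(segments: Segment[]): string {
  return segments.map(s => `M${s.x1} ${s.y1}L${s.x2} ${s.y2}`).join("");
}
```

The horizontal arm of "+" lies exactly on y = cy, which is why a residual line entering the circle at cy is co-linear with it.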
* fix(inference): represent the hybrid attention sink accurately
The sliding-window and compressed branches are two KV *sources* whose
selected indices are unioned into a single shared-KV MQA softmax — not two
attentions merged after the fact. The attention sink is a per-head learnable
softmax-denominator bias on that MQA (model.py attn_sink / kernel.py
sum_exp += exp(attn_sink - max)), not literal "first tokens" in the local
branch.
- Local branch is just the sliding-window source (one block); the sink moves
back onto the merge block as "Shared-KV MQA + Sink".
- CSA compressed branch = Token Compression -> Lightning Indexer (2 stages);
HCA = a single Heavy Compression source. This makes CSA a 1-vs-2 split
again.
- Center each column within the shared column area in drawParallelFlow so an
unequal split reads as an intentional branch merge instead of leaving the
shorter column's connector dangling as a long unattached line. Also
improves the 2-vs-1 SwiGLU expert merge.
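The column centering can be sketched as an offset computation. This assumes columns stack equal-height blocks vertically with a fixed gap; columnOffset and its parameters are hypothetical and are not the actual drawParallelFlow signature.

```typescript
// Height of a column holding `n` blocks of height `blockH` separated by `gap`.
function columnHeight(n: number, blockH: number, gap: number): number {
  return n * blockH + Math.max(0, n - 1) * gap;
}

// Vertical offset that centers a column of `count` blocks within the shared
// column area, whose height is set by the tallest column (`maxCount` blocks).
function columnOffset(count: number, maxCount: number, blockH: number, gap: number): number {
  return (columnHeight(maxCount, blockH, gap) - columnHeight(count, blockH, gap)) / 2;
}
```

For a 1-vs-2 split with 40px blocks and a 10px gap, the single block is pushed down by 25px, so its connectors above and below are equal in length rather than one long dangling line.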
* feat(inference): caption hybrid attention drill-down as one fused softmax
When the DeepSeek V4 hybrid attention drill-down is expanded, show a short
note clarifying that the Local (sliding-window) and Compressed (CSA/HCA)
columns are two KV *sources* unioned into a single shared-KV MQA softmax —
not two separate attentions that get summed — with the attention sink being
a learnable per-head softmax-denominator bias. Prevents the parallel-column
schematic from being read as two independent attention paths.
Shown only while a hybrid attention block is expanded; covered by an e2e
assertion.
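The fused softmax can be illustrated with a small sketch. It assumes non-empty scores and takes the max over the KV scores only, mirroring the sum_exp += exp(attn_sink - max) line quoted earlier in this message; unionIndices and softmaxWithSink are hypothetical names, and the real kernel may differ in detail.

```typescript
// Union the token indices selected by the local and compressed KV sources.
function unionIndices(local: number[], compressed: number[]): number[] {
  return Array.from(new Set(local.concat(compressed))).sort((a, b) => a - b);
}

// One softmax over the unioned scores. The sink contributes to the
// denominator only, so the returned weights sum to less than 1.
function softmaxWithSink(scores: number[], sink: number): number[] {
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  let sumExp = exps.reduce((acc, e) => acc + e, 0);
  sumExp += Math.exp(sink - max); // per-head learnable sink term
  return exps.map(e => e / sumExp);
}

const weights = softmaxWithSink([0, 0], 0);
```

With two equal scores and a sink of 0, each key receives weight 1/3 and the remaining 1/3 is absorbed by the sink, which has no value vector of its own.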
* fix(inference): rename SWA feature label so it isn't read as a separate attention type
The 128-token sliding window is the shared local base of every hybrid layer
(both HCA and CSA extend it; the final layer runs it alone) — not a third
attention type alongside CSA/HCA. Rename the features badge from
"Sliding Window Attention (128 tokens)" to "Sliding window (128 tokens)" so it
reads as a windowing mechanism rather than a standalone attention. The
drill-down's "Sliding Window" KV-source block is unchanged.
* feat(inference): draw hash-routed prefix block and mHC hyper-connections for V4
Two DeepSeek V4 architectural facts were only feature badges; surface them in
the diagram structure.
Hash-routed layers (num_hash_layers=3): the first 3 MoE layers route by token
id, not a learned gate. They now render as a separate stacked prefix block
(between embedding and the alternating blocks) with a "Hash Router" instead of
"MoE Router". Alternating HCA/CSA counts drop 31/30 → 29/29 so they describe the
learned-router layers (3 + 29 + 29 = 61); drawExpertGrid gains optional
routerLabel / routerSub params.
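The layer accounting above can be checked with a short sketch. The LayerGroup shape and the routerLabelFor helper are hypothetical and only illustrate the 3 + 29 + 29 split and the router label choice.

```typescript
interface LayerGroup {
  label: string;
  layers: number;
  routerLabel: "Hash Router" | "MoE Router";
}

const groups: LayerGroup[] = [
  { label: "Hash-routed prefix", layers: 3, routerLabel: "Hash Router" },
  { label: "HCA", layers: 29, routerLabel: "MoE Router" },
  { label: "CSA", layers: 29, routerLabel: "MoE Router" },
];

// Total depth described by the diagram's blocks.
const totalLayers = groups.reduce((sum, g) => sum + g.layers, 0);

// Layers below the hashRoutedLayers boundary route by token id.
function routerLabelFor(layerIndex: number, hashRoutedLayers: number): string {
  return layerIndex < hashRoutedLayers ? "Hash Router" : "MoE Router";
}
```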
mHC (hc_mult=4): residuals are replaced by 4 parallel hyper-connection streams
with learned, Sinkhorn-normalized A/B/C mixing. Residual merges now render as an
"mHC ×N" mixer node instead of a plain "+" when arch.hyperConnections > 1, plus
a caption shown while a block exposing the nodes is expanded. Models without
hyper-connections keep the "+" residual.
Adds arch fields hashRoutedLayers and hyperConnections; unit + e2e coverage.
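The residual node choice reduces to a small rule, sketched here with a hypothetical residualNodeLabel helper; only the hyperConnections field name comes from this commit.

```typescript
// Label for a residual merge node: an mHC mixer when the architecture has
// more than one hyper-connection stream, otherwise a plain "+".
function residualNodeLabel(arch: { hyperConnections?: number }): string {
  const streams = arch.hyperConnections ?? 1;
  return streams > 1 ? `mHC ×${streams}` : "+";
}
```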
* fix(inference): connect arrow into RMSNorm and gap attention drill-down
Two diagram glitches in expanded transformer blocks:
- The incoming arrow stopped at the dashed container border, leaving no line
above the first RMSNorm. Route each block's incoming arrow to its first
RMSNorm when the block is expanded (through the border), so there is a
continuous connector; collapsed blocks still target the block top.
- The attention drill-down rect sat flush against the attention block's bottom
border, reading as an overlap. Add a small gap (drillGap) between an
attention block and its expansion flow.
* fix(inference): count the shared expert in the specs-bar active count
The specs bar showed "6/385", but the always-on shared expert is active too,
so 7 experts run per token (6 routed + 1 shared). Show "6+1/385" for
shared-expert MoE models (e.g. R1 → "8+1/257") so the active count isn't
undersold; the title's "N active" params and the router subtitle already
account for the shared expert.
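A sketch of the specs-bar formatting, assuming the total already includes the shared expert (as in "385" and "257" above). MoeSpec and formatActiveExperts are hypothetical names, not the component's actual API.

```typescript
interface MoeSpec {
  activeRouted: number;  // routed experts selected per token
  sharedExperts: number; // always-on experts
  totalExperts: number;  // routed + shared
}

// Show shared experts as an explicit "+N" so the active count is not undersold.
function formatActiveExperts(m: MoeSpec): string {
  const active = m.sharedExperts > 0
    ? `${m.activeRouted}+${m.sharedExperts}`
    : `${m.activeRouted}`;
  return `${active}/${m.totalExperts}`;
}
```

Models without a shared expert keep the plain "N/total" form.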
* final1
parent 33fed6d commit 1122e9d
4 files changed
Lines changed: 768 additions & 101 deletions
File tree
- packages/app
- cypress/e2e
- src
- components/inference/ui
- lib