Traps that cost real debugging time on the Ascend 910B (a2a3) cube + vec mix-mode path. Each entry is a silent misbehavior — the kernel compiles, runs, and produces deterministic-looking numbers that are wrong. Add new entries here when you find one, before paging the context out.
Organized roughly by hardware layer (top) → PTO ISA layer (bottom).
On dav_c220 (910B) in mix mode KERNEL_TYPE_MIX_AIC_1_2, the high-level
AscendC::GetBlockIdx() wrapper returns:
- AIC:
get_block_idx()→ cluster idx (0..N-1, where N = blockDim) - AIV:
get_block_idx() * g_taskRation + get_subblockid()→ 0..2N-1
So vec's "block_idx" is 2× what cube sees for the same cluster.
If you compute GM slice offsets in both branches as block_idx * tile_size,
cube writes cluster K's slice but vec K's two subblocks see "block_idx"
2K and 2K+1, both slicing the wrong GM regions. Symptoms: numerical
garbage, sometimes out-of-bounds reads, output magnitude is right
(~real-scale random) but elementwise wrong.
Source: /usr/local/Ascend/.../dav_c220/kernel_operator_common_impl.h:47-65.
Fix: use the CCE intrinsic get_block_idx() directly in both AIC
and AIV branches — it returns the cluster index on both. For sub-cluster
indexing within AIV, use get_subblockid() (which returns 0 or 1 in
mix 1:2). This is what the existing row_off = kVecM * get_subblockid()
pattern already does.
Diagnostic: if multi-block kernel passes single-block (blockDim=1)
but fails multi-block (blockDim≥2) with reasonable-magnitude wrong
output (not zero, not NaN), and your slicing offsets use
AscendC::GetBlockIdx() in vec, this is it.
A2/A3 cube and vec have physically isolated on-chip storage (L0C vs UB) — there is no GPU-SMEM-style addressable shared scratch. So cross-core data goes through GM addresses. But L2 caches those addresses for both cube's FIX-pipe TSTORE and vec's MTE2 TLOAD; cube just wrote, vec immediately reads, the line is hot in L2 and never hits HBM.
Implication for design:
- Size cube↔vec FIFO buffers by L2 working set, not GM offset distance. Comparing slot bytes against HBM bandwidth is the wrong axis.
- The slot-count upper bound is L2 capacity minus what act/wgt tiles need re-resident; not GM capacity.
- PTO FA reference:
pto-isa/tests/npu/a2a3/src/st/testcase/tfa/tfa_kernel.cppusesqkGlobalTensorNBuffers = 1 + qkPreloadNum = 6exactly to keep the cube→vec ring inside L2. - svdquant
gemm_w4a4per-K-block int32 acc handoff at BM=128, BN=256 → 128 KB/slot. 4–8 slots fits L2 comfortably; the optimization axis is keeping act+wgt+ring total ≤ L2, not reducing slot count.
If you can't model L2 hit-rate by inspection, run cce profiler / hccl-perf for the L2 hit ratio before tuning the ring layout.
The L0C accumulator buffer on Atlas A2/A3 family parts (910B series, including 910B3) is 128 KB, not 256 KB. The "256 KB" figure floats around in scattered AscendC training material and even one internal table — it refers to A5, a newer/different architecture family. Authoritative source: PTO ISA buffer_limits.
pto-isa/include/pto/common/buffer_limits.hpp:#elif defined(PTO_NPU_ARCH_KIRINX90) || defined(PTO_NPU_ARCH_A2A3) #define PTO_L0C_SIZE_BYTES (128u * 1024u)
pto-isa/docs/isa/tile/ops/sync-and-config/tassign.mdtable:Acc | L0C | 128 KB (A2A3) | 256 KB (A5) | 64 KB (Kirin9030) | 128 KB (KirinX90)
A [128, 256] int32 acc tile is exactly 128 KB → fills L0C; no
room for L0C ping-pong at the current SVDQuant W4A4 cube tile
shape. Any "ping-pong BUF0 @ 0, BUF1 @ 128 KB" pattern lifted from
A5-targeted code will OOB on 910B.
Fix: size L0C tiles against 128 KB on 910B. For drain overlap,
prefer drain-batching (accumulate N K-blocks of mad_s4 into the
single L0C tile, drain once — halves PIPE_FIX TSTORE count) over
ping-pong. mad_s4 is init=true only on the first K-block of a
batch; subsequent calls accumulate into the same L0C buffer.
Diagnostic — what the symptom looks like:
aicore exception type: L0CADDRESSOVERFLOW
The operation address of L0C exceeds the maximum range of L0C.
CCU instruction address check error.
When the FIXP reads l0c, the read and write operations occur at the
same address.
If you see this and your code uses any L0C offset ≥ 128 KB on a 910B, this is it. The "FIXP reads l0c at same address" wording is misleading — it's actually the offset-128KB-or-greater TileAccC that overflows.
PTO TLoad / TExtract whitelists accept int8_t / half / bf16 / float. They do not accept pto::int4b_t, int4b_x2_t, or any
4-bit-typed tile. This is not a PTO oversight — Ascend cube
L1/L0 data movement minimum addressable unit is 1 byte. INT4
values have no offset / pointer representation; nibble decoding
happens only inside mad_s4, whose ABI takes __ca__/__cb__ void*
and unpacks 2 signed INT4s per byte at issue time.
Implication:
- The W4A4 cube path uses
Tile<Mat, int8_t, M, K_packed>+TileLeft/Right<int8_t, ...>+TileAcc<int32_t, M, N>+ rawmad_s4(c, void*, void*, m, k_logical, n, ...).K_packedis bytes (=K_logical / 2). This is canonical, not a workaround. - Do not use
uint8_tfor L1/L0 tiles. TLoad accepts it but TExtract's whitelist omits it. Bit-pattern is identical toint8_t— pickint8_tto pass both gates. - Do not propose an
int4b_x2_ttwin type / TMATMUL_S4 wrapper to PTO upstream. Even if accepted, it would be cosmetic typing over int8 storage — runtime behavior unchanged. - ascale / wscale INT4 dequant is per-K-block (default 64 nibbles) on the vec side. K_packed=32 → one mad_s4 issue ↔ one K-block.
Ascend cube mad (and mad_s4) is a macro MAC: one issue
takes m/k/n up to MMAD_MAX_SUPPORT_LENGTH = 4095
(pto-isa/include/pto/npu/a2a3/TMatmul.hpp:17); the cube array
finishes the entire tile in hardware. NVIDIA mma.sync.aligned.*
is a warp-level fragment instruction (typical m16n8k64 for INT4);
a full GEMM tile needs tens-to-thousands of mma issues.
Concrete differences:
- Work per issue: one
mad≈ hundreds-to-thousands ofmmas. - K direction: NVIDIA expands K outside the instruction
(scheduler streams multiple
mmas); Ascend expands K inside (mad's k parameter directly hits thousands). - Pipeline granularity: NVIDIA worries about warp-scheduler ILP;
Ascend worries about 4-stage TMATMUL/TLOAD/TEXTRACT/TSTORE
overlap (see
pto-isa/kernels/manual/a2a3/gemm_performance/README_zh.md). - Packing / reorder cost amortizes over the entire macro issue, not per fragment — can be cheaper than NVIDIA fragment-level amortization.
Implications for design:
- A
TMatmulS4-style wrapper should NOT expose "warp-fragment shape" parameters. The right signature mirrors PTO'sTMatmul:(cTile, aTile, bTile)+ a thinmad_s4(...)adapter. Fractal layout is hardware-determined inside the cube array; the wrapper only handles ABI + static checks. - Tile selection anchor: PTO GEMM example uses
[baseM, baseN, baseK] = [128, 256, 64]for fp16 (saturates L0B 32 KiB). INT4 halves element bytes, so[128, 256, 128]is theoretically headroom — confirm against CCE doc INT4 fractal constraints before picking. - MFU comparison against nunchaku: single-point single-point is
meaningless. Ascend side wants pipeline-stage occupancy
(TMATMUL/TLOAD/TEXTRACT/TSTORE %) and msprof, a different
axis from NVIDIA
ncu sm__pipe_tensor_subpipe_*.
nunchaku's INT4 GEMM is hand-written PTX (see gpu.md § perf
context). It targets NVIDIA mma.sync.aligned.*.s4.s4.s32,
fragment layout per NVIDIA tcgen / ldmatrix conventions. That is a
completely different hardware ABI from the Ascend cube unit's L0A
/ L0B fractal layout (described by pto-isa CheckMadValid:
left-RowMajor/ColMajor SFractal, right-ColMajor/RowMajor SFractal,
Acc-RowMajor SFractal, K aligned to cube preferred granularity).
Rule:
- Ascend INT4 packing / fractal questions: CCE intrinsic docs +
PTO INT8 reference path + AscendC docs. Skip
tmp/nunchaku/. - The reverse also holds: NVIDIA NVFP4 / SM_100 tcgen path questions: don't reach into PTO / AscendC.
- Boundary: math (GEMM + LoRA epilogue + quant math) is cross-backend referencable — that's why CLAUDE.md says "keep math and tensor shapes in sync". Hardware ABI (instructions, fragment layout, packing order, scale arrangement) must follow the target hardware's own docs.
Settled 2026-05-04. The earlier "add a TMATMUL_S4 wrapper to PTO" plan (RFC #332) is closed. Path inside svdquant's Ascend pod:
- A/B matrix tiles:
Tile<Mat, int8_t, M, K_packed>,TileLeft/Right<int8_t, ...>— PTO byte-typed path, native. - Cube issue: inline
mad_s4(c, (__ca__ void*)a.data(), (__cb__ void*)b.data(), m, k_logical, n, unitFlag, false /*kDirAlign*/, src, init), bypassing the PTO type wrapper. - Everything else (activation TLoad, scale broadcast, bias, LoRA epilogue, TStore) continues to go through PTO abstractions.
Why the wrapper path was abandoned:
- A2/A3's
TLoad/TMov/TExtractleaf CCE intrinsics are byte-primitive-typed (signed char *,__bf16 *, …). PTO's existingint4b_tis a vec-only struct (only works viaTCvt'sis_same_v<DType, int4b_t>specialization on conversions). Cube path has no equivalent specialization; struct types are rejected at the intrinsic boundary. - Making
TMATMUL_S4real requires a dtype-aware TLoad/TMov/TExtract pass on a2a3 (mirroring how a5 added FP4) — 30+ stride patches plus byte-primitive intrinsic wiring. Not a single-commit change. - PTO SIG response timeline (#115 silent 3 months; #332 self-closed in < 4 days) made waiting infeasible.
- svdquant is an operator, not a public bottom-layer library —
input validation lands in PyTorch / vLLM, not inside the kernel.
PTO's
CheckStaticMadS4static guard is not a value-add here. - The internal-raw-cce pattern matches AscendC's own
MmadCals4 branch indav_c220/kernel_operator_mm_impl.h— standard practice, not a hack.
Apply:
- When writing mmad calls in the Ascend pod, inline
mad_s4(...)directly. Do not search for / create a PTO wrapper. - Do not patch PTO's a2a3 to add dtype-aware INT4 plumbing. That is SIG's surface and is outside the svdquant scope.
- If SIG follow-up asks: backup branch
feat-mad-s4onqq_42927189/pto-isa(gitcode) andultranationalism/pto-isa(github), commit724b973a, is a single-commit unblock (int4b_x2_t=uint8_talias +mad_s4kDirectionAlignparameter fix + ST testcase). Clean PR ready but not pushed by default. - ABI sanity check anchor: if a raw
mad_s4call gets argument count wrong, the CCE ABI is 10 parameters. Cross-reference AscendCdav_c220/kernel_operator_mm_impl.hMmadCals4 branch for the exact parameter order.
Discovered Phase 3a 2026-05-11 cycle 15.
TLoad of Tile<Vec, T, N, 1, BLayout::ColMajor, N, 1> from any
GlobalTensor (both ND-layout AND DN-layout) only writes element
[0, 0] of the UB tile. Elements [1..N-1, 0] retain whatever was
in UB before the load (zero if just zeroed, otherwise stale).
The kernel compiles cleanly, raises no TASSIGN / TLOAD /
PTO_ASSERT, and produces a deterministic-looking number at
element 0. The bug is discoverable only by dumping the post-TLoad
/ post-TCvt tile to GM and comparing against expected per-row
values.
Why: PTO's ColMajor [N, 1] tile is intended as the computed
result of a row-wise reduction (TRowMax, TRowSum, TSub
across rows) or as a constant-fill destination of TExpandS.
It is not a supported TLoad destination for variable per-row
scalars from GM. FlashAttention's
ReduceTileF_T = Tile<Vec, float, Vec_S0, 1, ColMajor, Vec_S0, 1>
(pto-isa/tests/npu/a2a3/src/st/testcase/tfa/) is the canonical
user — and it never TLoads from GM; exp_max is computed in
UB via TReshape + TSub + TExp from a separate RowMajor input.
Downstream symptom in Phase 3a: TRowExpandMul with this "loaded"
ColMajor [32, 1] src1 silently used garbage scalars for rows
1..31, producing per-row scale multipliers like 5.92e-06 instead
of 0.07135.
Fix template for per-row variable scales loaded from GM:
- Load as a RowMajor flat row:
Tile<Vec, T, 1, N, RowMajor, 1, N>withGlobalTensor<T, Shape<1,1,1,1,N>, Stride<1,1,1,N,1>>. Same pattern as wscale loads, well-tested. - After
TCvtto fp32, broadcast manually to[N, 8]RowMajor:Each rowvbrcb(broadcast_ub_ptr, ascale_f32_ptr, /*dstBlockStride=*/1, /*dstRepeatStride=*/8, /*repeats=*/CeilDivision(N, 8)); pipe_barrier(PIPE_V);
rends up as[s_r] × 8(one 32-byte block). - Feed the RowMajor
[N, 8]toTRowExpandMulas src1. PTO takes the RowMajor src1 path (assertionsrc1ValidCol == 32/sizeof(T) = 8), skips its internal vbrcb scratch dance, goes directly to vmul.
Alternative if TRowExpandMul still misbehaves on the RowMajor
path: broadcast to full RowMajor [N, kBN] and use plain TMul
elementwise. Costs N*kBN*4 UB bytes but removes PTO's expand
machinery from the equation.
Diagnostic: when per-row scaled output looks "scaled by something
random", dump the reduce tile post-TCvt / pre-TRowExpandMul
to a side GM region. If only row 0 matches expected → it's a
ColMajor [N, 1] TLoad bug, not a TRowExpandMul or vbrcb
bug.
Generalization: don't TLoad into ColMajor reduce tiles. Load
flat as RowMajor [1, N]; broadcast / reshape in UB to whatever
shape downstream ops need.
Discovered Phase 3a 2026-05-11 cycle 17.
After pto::TRowExpand (or any PTO op whose internal vbrcb
sets a count-mode mask without restoring norm mode), the vector
mask register is left in a state that causes downstream vector ops
to silently process only a fraction of their declared repeats
argument.
Mechanism in
pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp's
TRowExpandBinaryNormModeTail:
if (DstRowStride < elementsPerRepeat || ...) {
SetContMaskByDType<T>(validCol); // explicit mask setup
Op::RowExpandBinInstr(...);
SetFullVecMaskByDType<T>();
} else {
for (i = 0; i < numLoop; i++) {
Op::RowExpandBinInstr(...); // NO mask setup
...
}
}When DstRowStride ≥ elementsPerRepeat, the else-branch executes
without SetContMask. It inherits whatever mask the caller left
in place. If the caller just ran TRowExpand, the internal
vbrcb left a count-mode mask of ceil(target_size / 8) — so
vmul silently processes only the first 4 repeats (for
target_size=32), leaving rows 0..3 of each AIV's row band
untouched.
Fix template:
pto::TRowExpand(bcast_tile, flat_tile);
pipe_barrier(PIPE_V); // PIPE_V serializes anyway,
// but documents intent
set_mask_norm(); // restore norm mode
set_vector_mask(-1, -1); // full vec mask for dtype
pto::TRowExpandMul(dst, src0, bcast_tile);Apply:
- Wrap any
pto::TRowExpand,pto::TBrcb, or other broadcast-style PTO op in this pattern before chaining into another vector op (vmul,TRowExpandMulon RowMajor path, anything that entersNormModeTail's else-branch). TColExpandMulis suspected to have the same risk after a contaminated mask; apply the same reset.- Note: this is independent of the mask-reset documented for AIV mix-mode kernel entry (PTO issue #218). The kernel-entry reset handles the initial mask state; this rule handles during-execution contamination.
Diagnostic: when a chained PTO vector op covers most rows/cols of
its tile correctly but a fixed prefix is wrong/skipped/zero,
suspect mask contamination. Skipped count typically =
ceil(target_size / 8) of the broadcast op that left mask in
count mode.
Discovered Phase 3a 2026-05-11.
On AscendC mix-mode AIV, when a vec K-loop reuses the same UB
region across iterations for a partial (TLoad partial_int32 →
TCvt i32→f32 → TRowExpandMul / TColExpandMul → TMov / TAdd
→ reuse on next iter), you must add an explicit
PIPE_V → PIPE_MTE2 cross-iter flag.
The race: PIPE_MTE2 (TLoad) and PIPE_V (TRowExpandMul etc.)
are independent pipes. Without explicit V→MTE2 sync, iter N+1's
TLoad may fire before iter N's PIPE_V writes drain. The PIPE_V
writes then land after the new TLoad, overwriting the
freshly-loaded int32 bytes with stale fp32 dequant bit patterns.
Iter N+1's TCvt i32→f32 reinterprets the fp32 bit patterns as
int32 — e.g. 0.1 = 0x3DCCCCCD = 1036831949 reads as int32
1.036e9, then casts back to fp32 1.036e9. Output magnitude near
INT32_MAX is the dead giveaway.
The MTE2→V flag at the top of each iter only enforces "V waits for MTE2's TLoad". The reverse direction ("next iter's MTE2 waits for prior iter's V drain") needs its own flag. Other UB regions (running accumulator, scales) don't trigger this because they're only written by V and never re-loaded by MTE2 — the partial region is the unique cross-pipe-write spot.
Fix template:
// Seed before the K-loop so iter 0's wait is satisfied trivially
set_flag(PIPE_V, PIPE_MTE2, EVENT_ID_X);
for (kb = 0; kb < kNumKBlocks; ++kb) {
wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID_X); // gate next TLoad
// on prior V drain
TLoad(partI32, ...);
// ... TCvt, TRowExpandMul, TColExpandMul, TMov/TAdd on PIPE_V ...
set_flag(PIPE_V, PIPE_MTE2, EVENT_ID_X); // signal V drained
}
wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID_X); // drain seed on exitPick an EVENT_ID not already used by other flags on the AIV
section (Phase 3a used EVENT_ID2; 0/1 were taken).
Diagnostic: TStore the post-TCvt fp32 tile to a side GM,
compare against partial_int32.float() from the caller. Iter 0
looks correct; iter 1+ is off by ~INT32_MAX when this race fires.
Generalization: any UB region that is BOTH a PIPE_MTE2 TLoad
target AND a PIPE_V output target across loop iterations needs
V→MTE2 sync. UB regions written by V only (running accumulator,
scratch) don't.
Host / PyTorch: tensor.cpu() of a kernel-written tensor can return the host's prior fill, not the cube's fixpipe output
Discovered Phase 3b 2026-05-25.
L2 residency of cube↔vec hand-off (first gotcha in this file) has a
host-side companion symptom. If you pre-fill a PyTorch NPU tensor on
the host (e.g. at::full(99.0f) or tensor.fill_(0x42424242)) and
then a cube kernel writes to the same GM via TSTORE, the kernel's
output reaches L2 but may not flush to HBM before .cpu() reads
back. The host fill_ did land in HBM (the host-side allocator
path writes through). Result: vec (reading L2) sees cube's real
values and produces a correct downstream output; Python
tensor.cpu() reads HBM and returns the fill value, making
intermediate buffers look "untouched" even though they were
overwritten.
This produces a particularly nasty class of false-debug signals:
outtensor passes vs.refto 0.001 — kernel is correct.- "intermediate" debug tensor read back via
.cpu()shows the pre-fill value (0, NaN, 99.0f, sentinel pattern), suggesting "cube didn't write". - You spend cycles convinced cube TSTORE is broken when it's the measurement that's broken.
Concrete chain that triggered Phase 3b's #111 ladder:
at::full({M,N}, 99.0f, fp32_options) → lora_buf.data_ptr() →
cube TSTORE(loraBufGm, loraAccTile) writes correct values to L2 →
vec TLOAD reads correct values from L2 → final out is correct →
Python lora_buf.cpu() reads HBM → all 99.0f.
Apply:
- Trust
out(the final tensor that vec writes via its own fixpipe-equivalent path). That path goes through fixpipe + GM- the consumer's read; correctness there means the whole chain worked.
- Don't trust intermediate
cube→GMbuffers read via.cpu(). If you need to inspect them, either:- Have vec read the buffer (via UB MTE2) and re-TSTORE it to a
designated "observation" GM region that you then read from
host. Vec's read pulls L2 into UB; the subsequent TSTORE
forces a write-through that lands in HBM and is
.cpu()- visible. - Call
aclrtSynchronizeStream+ a deliberate device-side cache-flush op (if available on your CANN version) before the.cpu()D2H.
- Have vec read the buffer (via UB MTE2) and re-TSTORE it to a
designated "observation" GM region that you then read from
host. Vec's read pulls L2 into UB; the subsequent TSTORE
forces a write-through that lands in HBM and is
- Don't pre-fill a kernel-write target as a "sentinel" to detect whether the kernel wrote. The sentinel will appear preserved whether or not the kernel actually wrote, because the host write reached HBM and the kernel's write didn't.
Diagnostic: if Python sees the pre-fill / pre-zero value AND the downstream computation that consumes that buffer matches the reference, the kernel's write is real; you're observing a stale HBM view. The "downstream consumer matches ref" signal is the authoritative one.
Doesn't apply when: the GM region is read only by the host
(e.g. final out produced by vec's TSTORE that the test reads).
out going through .cpu() works because vec's TSTORE flushes
the line (or aclrtSynchronizeStream's path picks it up — empirical,
either way out is reliable). The issue is specifically when
cube writes a hand-off buffer and the test reads that buffer
directly instead of through vec.
Companion finding: use at::zeros (not at::empty) for any GM
that cube writes via fixpipe and vec reads back via MTE2. With
at::empty, the GM line for the LoRA hand-off slot reproducibly
fails to land — out comes out without the LoRA contribution (off
by max_abs(lora_term) from ref). Switching to at::zeros
fixes it. PyTorch's NPU allocator appears to defer physical commit
or skip L2 line establishment for at::empty, in a way that
matters for cube fixpipe → vec MTE2 hand-offs but not for cube
fixpipe → host .cpu() (which goes through a different read path
and tolerates the deferred state). Cost is one extra D2H init —
negligible vs. the LoRA bring-up cycles burnt diagnosing this.
Recidivism warning: the 3c-7 VDEQF16 fold attempt
(docs/kernels/gemm_w4a4-vdeqf16-attempt.md) walked right into
this trap a second time, even though this gotcha was written
during 3b-6. The 3c-5 production baseline ALSO shows
lora_buf.cpu() == 0 — that print was normal, not a bug. Five
NPU probes (≈ 30 min Space credit) chased a phantom while the
real signal — assert_close(out, ref) FAIL with out magnitude
~1/100 of ref — was sitting in the same log file ignored.
Tests must assert on out (the final tensor vec writes), not
on lora_buf or other cube-write/host-read intermediates. Use
assertLess(out_vs_ref, out_vs_no_lora) to prove the LoRA path
contributes, instead of trusting lora_buf.cpu() directly. See
tests/test_gemm_w4a4.py::test_phase3b_int4_lora_path for the
template.