|
| 1 | +constraints: |
| 2 | + - id: outputs_must_match |
| 3 | + name: "Outputs must match original" |
| 4 | + severity: info |
| 5 | + description: | |
| 6 | + The verification tool checks that the optimized outputs match the original implementation's outputs. |
| 7 | + If verification fails, try a different optimization approach. |
| 8 | +
|
| 9 | + - id: streamk_output_must_be_prezeroed |
| 10 | + name: "Pre-zero output buffer when using atomic accumulation (Stream K)" |
| 11 | + severity: critical |
| 12 | + description: | |
| 13 | + When partial tiles use tl.atomic_add to accumulate results, the output |
| 14 | + tensor MUST be initialized to zero (torch.zeros, NOT torch.empty). |
| 15 | + Otherwise partial sums will include garbage values. |
| 16 | +
|
| 17 | + WRONG: |
| 18 | + ```python |
| 19 | + c = torch.empty((M, N), device=a.device, dtype=torch.float32) |
| 20 | + first_wave[grid](a, b, c, ...) # atomic_add onto garbage |
| 21 | + ``` |
| 22 | +
|
| 23 | + CORRECT: |
| 24 | + ```python |
| 25 | + c = torch.zeros((M, N), device=a.device, dtype=torch.float32) |
| 26 | + first_wave[grid](a, b, c, ...) # atomic_add safely onto zeros |
| 27 | + ``` |
| 28 | +
|
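The pre-zeroing rule can be illustrated with a CPU-only sketch in plain Python. It is not Triton code: `accumulate_partials` is a hypothetical stand-in for what repeated `tl.atomic_add` calls do to one output tile, and the "stale" buffer values are made up to play the role of whatever `torch.empty` happens to return.

```python
def accumulate_partials(out, partials):
    """Add each partial result onto whatever `out` already holds,
    which is how atomic accumulation treats the output buffer."""
    for partial in partials:
        for i, value in enumerate(partial):
            out[i] += value
    return out


# Two partial sums destined for the same output tile (hypothetical values).
partials = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
expected = [11.0, 22.0, 33.0]

# Buffer that starts at zero, like torch.zeros.
zeroed = accumulate_partials([0.0, 0.0, 0.0], partials)

# Buffer with leftover contents, standing in for torch.empty.
stale = accumulate_partials([7.0, -4.0, 0.5], partials)

print(zeroed == expected)  # the zeroed buffer ends at the true sum
print(stale == expected)   # the stale buffer keeps its old contents in the result
```

Only the zeroed buffer ends at the expected sum, because accumulation never overwrites the starting contents.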
| 29 | + - id: streamk_atomic_add_needs_mask |
| 30 | + name: "Atomic adds on partial tiles must be masked for boundary safety" |
| 31 | + severity: critical |
| 32 | + description: | |
| 33 | + When falling back to tl.atomic_add for partial tiles, you MUST apply |
| 34 | + boundary masks (rm < M, rn < N) to avoid writing out-of-bounds. |
| 35 | +
|
| 36 | + ```python |
| 37 | + rm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) |
| 38 | + rn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) |
| 39 | + mask = (rm < M)[:, None] & (rn < N)[None, :] |
| 40 | + tl.atomic_add(c_ptr_, acc, mask=mask, sem='relaxed') |
| 41 | + ``` |
| 42 | +
|
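To see what the boundary mask excludes, the same index arithmetic can be reproduced in plain Python. This is a sketch under assumed sizes (a 5x6 output with 4x4 blocks); `tile_mask` is a hypothetical helper that mirrors the `rm`/`rn` expressions above, not part of Triton.

```python
def tile_mask(pid_m, pid_n, M, N, BLOCK_SIZE_M, BLOCK_SIZE_N):
    """Return a 2D list of booleans: True where the tile element lies
    inside the M x N output, False where it would be out of bounds."""
    rm = [pid_m * BLOCK_SIZE_M + i for i in range(BLOCK_SIZE_M)]
    rn = [pid_n * BLOCK_SIZE_N + j for j in range(BLOCK_SIZE_N)]
    return [[(r < M) and (c < N) for c in rn] for r in rm]


# Assumed problem size: M=5, N=6 with 4x4 blocks.
# The last tile (pid_m=1, pid_n=1) covers rows 4..7 and columns 4..7.
edge = tile_mask(1, 1, M=5, N=6, BLOCK_SIZE_M=4, BLOCK_SIZE_N=4)
interior = tile_mask(0, 0, M=5, N=6, BLOCK_SIZE_M=4, BLOCK_SIZE_N=4)

edge_in_bounds = sum(sum(row) for row in edge)
interior_in_bounds = sum(sum(row) for row in interior)
print(edge_in_bounds, interior_in_bounds)
```

In the edge tile only row 4 and columns 4 and 5 are valid, so an unmasked atomic add would touch 14 locations outside the output.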
| 43 | + - id: int64_cast_for_large_batch_offsets |
| 44 | + name: "Cast batch/stride products to int64 to prevent pointer overflow" |
| 45 | + severity: critical |
| 46 | + description: | |
| 47 | + When computing pointer offsets for batched operations, the product of |
| 48 | + a batch index and a stride can exceed int32 range for large tensors. |
| 49 | + Triton program_id returns int32 by default. You MUST cast to int64 |
| 50 | + before multiplying by strides. |
| 51 | +
|
| 52 | + WRONG (silent int32 overflow → wrong memory addresses): |
| 53 | + ```python |
| 54 | + bid = tl.program_id(axis=1) |
| 55 | + offset_a = bid * stride_az # int32 * int32 → overflow for large tensors |
| 56 | + a_ptrs = a_ptr + offset_a + ... |
| 57 | + ``` |
| 58 | +
|
| 59 | + CORRECT: |
| 60 | + ```python |
| 61 | + bid = tl.program_id(axis=1) |
| 62 | + offset_a = bid.to(tl.int64) * stride_az # safe for large tensors |
| 63 | + a_ptrs = a_ptr + offset_a + ... |
| 64 | + ``` |
| 65 | +
|
| 66 | + This applies whenever a program_id or loop index is multiplied by a |
| 67 | + stride and the product can reach 2^31 (about 2.1 billion elements) or more. Common |
| 68 | + in batched GEMM, multi-head attention, and any kernel with a batch |
| 69 | + dimension over large tensors. |
| 70 | +
|
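The overflow can be reproduced on the host without a GPU. The sketch below uses `ctypes.c_int32` to emulate 32-bit signed wraparound. The batch index and stride are assumed values chosen so that the product exceeds the int32 range, and `mul_int32` is a hypothetical helper.

```python
import ctypes


def mul_int32(a, b):
    """Multiply two integers and wrap the result to signed 32 bits,
    emulating an int32 * int32 product."""
    return ctypes.c_int32(a * b).value


bid = 3                # assumed batch index
stride_az = 2 ** 30    # assumed batch stride in elements (fits in int32)

wrapped = mul_int32(bid, stride_az)  # 32-bit product wraps to a negative offset
exact = bid * stride_az              # the offset a 64-bit product would give

print(wrapped, exact)
```

The wrapped value is negative, so a pointer built from it would address memory before the tensor base instead of the intended batch.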
| 71 | + - id: autotune_no_defaults |
| 72 | + name: "Do not put default values on @triton.autotune meta-parameters" |
| 73 | + severity: critical |
| 74 | + description: | |
| 75 | + When using @triton.autotune, the meta-parameters (BLOCK_M, BLOCK_N, etc.) |
| 76 | + must NOT have default values in the kernel signature. Default values cause |
| 77 | + a "Conflicting meta-parameters" error at runtime. |
| 78 | +
|
| 79 | + WRONG: |
| 80 | + ```python |
| 81 | + @triton.autotune(configs=[...], key=['M', 'N', 'K']) |
| 82 | + @triton.jit |
| 83 | + def kernel(..., BLOCK_M: tl.constexpr = 128, ...): |
| 84 | + ... |
| 85 | + ``` |
| 86 | +
|
| 87 | + CORRECT: |
| 88 | + ```python |
| 89 | + @triton.autotune(configs=[...], key=['M', 'N', 'K']) |
| 90 | + @triton.jit |
| 91 | + def kernel(..., BLOCK_M: tl.constexpr, ...): |
| 92 | + ... |
| 93 | + ``` |
| 94 | +
|
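One way to catch this before launch is a small signature check. The sketch below is a hypothetical lint helper, not part of Triton. It inspects plain Python functions that stand in for the undecorated kernel body and reports which of the tuned names carry a default value.

```python
import inspect


def params_with_defaults(fn, tuned_names):
    """Return the tuned parameter names that have a default value
    in the signature of `fn`."""
    params = inspect.signature(fn).parameters
    return [
        name
        for name in tuned_names
        if name in params and params[name].default is not inspect.Parameter.empty
    ]


# Stand-ins for kernel bodies (hypothetical signatures, no Triton decorators).
def bad_kernel(a_ptr, M, BLOCK_M=128, BLOCK_N=64):
    pass


def good_kernel(a_ptr, M, BLOCK_M, BLOCK_N):
    pass


tuned = ["BLOCK_M", "BLOCK_N"]
bad_report = params_with_defaults(bad_kernel, tuned)
good_report = params_with_defaults(good_kernel, tuned)
print(bad_report, good_report)
```

An empty report means none of the tuned names has a default in the signature.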
| 95 | + - id: model_class_pattern |
| 96 | + name: "Model class must be compatible with ai-bench loading" |
| 97 | + severity: critical |
| 98 | + description: | |
| 99 | + ai-bench creates Model via direct `__init__()` and uses standard |
| 100 | + `load_state_dict()` for weight synchronization between reference |
| 101 | + and optimized models. |
| 102 | +
|
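| 103 | + The Model class should keep its parameters in standard nn.Module layers and derive any packed kernel weights lazily in forward(), so that they reflect the weights loaded by load_state_dict(): |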
| 104 | +
|
| 105 | + ```python |
| 106 | + class Model(nn.Module): |
| 107 | + def __init__(self, input_size, hidden_size, ...): |
| 108 | + super().__init__() |
| 109 | + self.gemm = nn.Linear(input_size, hidden_size) |
| 110 | + self._packed = False |
| 111 | +
|
| 112 | + def _pack_weights(self): |
| 113 | + device = torch.device("xpu") |
| 114 | + w = self.gemm.weight.data.detach() |
| 115 | + b = self.gemm.bias.data.detach() |
| 116 | + self.weight_t = w.to(device, torch.float16).t().contiguous() |
| 117 | + self.bias_xpu = b.to(device, torch.float16).contiguous() |
| 118 | + self._packed = True |
| 119 | +
|
| 120 | + def forward(self, x): |
| 121 | + if not self._packed: |
| 122 | + self._pack_weights() |
| 123 | + # ... launch triton kernel ... |
| 124 | + ``` |
| 125 | +
|
| 126 | + - id: descriptor_no_boundary_check_arg |
| 127 | + name: "Tensor descriptor .load() does NOT accept boundary_check" |
| 128 | + severity: critical |
| 129 | + description: | |
| 130 | + Tensor descriptors are the preferred memory access API on XPU. |
| 131 | + Unlike block pointers which use tl.load(ptr, boundary_check=(0, 1)), |
| 132 | + tensor descriptors handle boundaries internally. The .load() method |
| 133 | + takes only a coordinate list. |
| 134 | +
|
| 135 | + WRONG: |
| 136 | + ```python |
| 137 | + desc = tl.make_tensor_descriptor(base=ptr, shape=(M, K), ...) |
| 138 | + data = desc.load([row, col], boundary_check=(0, 1)) |
| 139 | + ``` |
| 140 | +
|
| 141 | + CORRECT: |
| 142 | + ```python |
| 143 | + desc = tl.make_tensor_descriptor(base=ptr, shape=(M, K), ...) |
| 144 | + data = desc.load([row, col]) |
| 145 | + ``` |