 Workload: Single matmul T1[m,n1] = T0[m,n0] * W0[n0,n1] (M=4, KN=4)
 bits_per_value = 8

-Mapping:
+Mapping (uneven — two Storage nodes for GlobalBuffer):
  Storage [W0, T0, T1] @ MainMemory
+ Storage [T1] @ GlobalBuffer ← T1 pegged above m (output accumulation)
  Temporal m=1 ← m is IRRELEVANT to W0[n0,n1]
- Storage [W0] @ GlobalBuffer ← W0 lives here, below the m loop
+ Storage [W0] @ GlobalBuffer ← W0 pegged below m (weight reuse)
  Temporal n0=1
  Temporal n1=1
  Compute Matmul0 @ MAC

-The m loop sits above GlobalBuffer[W0], but m does not appear in W0's
-dimensions [n0, n1]. The model should recognize this and fill W0 only
-ONCE rather than once per m iteration.
+T1 (the output) depends on m and must accumulate across the inner
+loops, so it is stored at GlobalBuffer above the m loop. W0 (the
+weight matrix) does not depend on m but is forced below the m loop
+because T1 already claims the above-m slot at GlobalBuffer. This is
+the same split-storage pattern used in fused_matmuls_to_simple.yaml
+and eyeriss-style architectures.

-Note: in this simple example, reordering W0 above the m loop would
-avoid the issue entirely. In real architectures (e.g. eyeriss), the
-mapper may place a tensor below an irrelevant loop because the overall
-mapping is globally optimal across all tensors and buffer capacities.
-This test validates the model's temporal reuse computation for such
-mappings.
+The model should recognize that m is irrelevant to W0 and fill W0 only
+ONCE rather than once per m iteration.

 Action counts are in bits (elements * bits_per_value).
 W0 shape = [n0, n1] = [4, 4] = 16 elements = 128 bits.
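
The expected fill count described in this docstring can be illustrated with a small standalone sketch. It is not the model's actual implementation: `fill_count` and the loop representation are hypothetical names chosen for illustration. It also assumes that `Temporal m=1` denotes a tile of one row, so the m loop runs M = 4 iterations, and that the whole of W0 (16 elements) is resident in GlobalBuffer. Under those assumptions, loops directly above a storage node that do not index the tensor leave the tile unchanged and should not trigger a refill.

```python
from math import prod

def fill_count(loops_above, tensor_ranks):
    """Hypothetical helper: how many times a tile is filled into a storage node.

    loops_above:  list of (rank, iterations), ordered outermost to innermost,
                  for the temporal loops sitting above the storage node.
    tensor_ranks: set of ranks that index the tensor.
    """
    loops = list(loops_above)
    # Drop the innermost loops that are irrelevant to the tensor: the tile
    # stays resident across their iterations, so they add no fills.
    while loops and loops[-1][0] not in tensor_ranks:
        loops.pop()
    # Every remaining loop, relevant or not, multiplies the fill count.
    return prod(n for _, n in loops)

BITS_PER_VALUE = 8          # from the docstring
W0_ELEMENTS = 4 * 4         # W0 shape = [n0, n1] = [4, 4]

# Assumption: the m loop above GlobalBuffer[W0] runs M = 4 iterations.
loops_above_w0 = [("m", 4)]

fills = fill_count(loops_above_w0, {"n0", "n1"})
w0_fill_bits = fills * W0_ELEMENTS * BITS_PER_VALUE

# What a model would report if it refilled W0 on every m iteration.
naive_bits = prod(n for _, n in loops_above_w0) * W0_ELEMENTS * BITS_PER_VALUE

print(fills, w0_fill_bits, naive_bits)
```

With these inputs the reuse-aware count is one fill of 128 bits, while the per-iteration count would be four times that. The sketch only strips irrelevant loops that sit innermost above the storage node; an irrelevant loop outside a relevant one still multiplies the fills, because the tile changes in between.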