MultiRamp: a better representation of affine vectors (nested ramps)#9113

Open

abadams wants to merge 61 commits into main from abadams/multiramp
Conversation

@abadams (Member) commented Apr 29, 2026

Nested vectors are increasingly important as we look at multidimensional SIMD architectures like Nvidia's Tile IR. A matrix multiply can be expressed by vectorizing three nested loops - one of them a reduction. This is currently very brittle in the compiler: if you don't get things just right, the vector reduction, or the load, or the store scalarizes.

This PR introduces a MultiRamp class, which represents a vector Expr where the lanes are an affine function of a multi-dimensional index. It's capable of representing sums of nested ramps of compatible shapes. E.g. it matches things like ramp(ramp(x, 1, 4), 8, 4) + ramp(broadcast(y, 2), s, 8) (i.e. a 4x4 index plus a 2x8 index) and treats it as ramp(ramp(ramp(ramp(x + y, 1, 2), s+2, 2), 2*s + 8, 2), 4*s + 16, 2) (a 2x2x2x2 box - hopefully I got the strides right).
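For intuition, here's a minimal sketch of the shape of such a representation and its lane formula. The field names are hypothetical and everything is a plain int here; the real class stores Exprs so bases and strides can be symbolic:

    #include <cstddef>
    #include <vector>

    // Scalar stand-in for MultiRamp: the lane at multi-index (i0, ..., in-1)
    // has the value base + sum_d(i_d * strides[d]).
    struct MultiRampSketch {
        int base;
        std::vector<int> strides;  // one stride per dimension
        std::vector<int> extents;  // lanes per dimension

        int lane(const std::vector<int> &idx) const {
            int v = base;
            for (size_t d = 0; d < strides.size(); d++) {
                v += idx[d] * strides[d];
            }
            return v;
        }
    };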

MultiRamps have a pretty interesting algebra to them. They're almost a vector space: MultiRamps are closed under scalar multiplication, but not under addition - the sum of two MultiRamps may or may not be a MultiRamp. You can also sometimes integer-divide or mod them by a scalar and get a MultiRamp with one more dimension: ramp(x, 1, 8) / 2 -> ramp(broadcast(x/2, 2), 1, 4) (given an even x). These situations do happen in codegen, e.g. if you vectorize f(r/2) over r by a factor of 8 you get the previous example.
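As a sanity check of that division example, here's a tiny self-contained program. It assumes an even base, which is what makes the rewrite valid lane-for-lane:

    #include <cassert>

    int main() {
        int x = 10;  // assumed even; an odd base doesn't divide into this shape
        for (int l = 0; l < 8; l++) {
            int lhs = (x + l) / 2;        // lane l of ramp(x, 1, 8) / 2
            int rhs = (x / 2) + (l / 2);  // lane l of ramp(broadcast(x/2, 2), 1, 4)
            assert(lhs == rhs);
        }
        return 0;
    }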

Another useful operation on MultiRamps is getting the shuffles that correspond to manipulating the dimensions of a MultiRamp. E.g. if you take some prefix of the dimensions of a MultiRamp and move them to the back, that's equivalent to transposing the vector. This is used to do things like shuffle loads and stores of multiramps around so that any stride-1 dimension is innermost.
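For example, the shuffle indices for the simplest such manipulation - moving a single innermost dimension outermost, i.e. a 2D transpose - look like this (illustrative only; the real entry points are MultiRamp's shuffle_from_* methods):

    #include <vector>

    // Shuffle indices that transpose a vector whose lanes form an
    // outer x inner grid, with lane (i, j) at flat position i * inner + j.
    // Applying the resulting shuffle makes the inner dimension outermost.
    std::vector<int> transpose_indices(int outer, int inner) {
        std::vector<int> idx;
        idx.reserve(outer * inner);
        for (int j = 0; j < inner; j++) {
            for (int i = 0; i < outer; i++) {
                idx.push_back(i * inner + j);
            }
        }
        return idx;
    }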

The final interesting operation on MultiRamps is asking whether or not they're alias-free - i.e. whether any two lanes compute the same value. This is useful because if a store index is an alias-free MultiRamp, each vector lane gets stored to a distinct address.
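For constant strides the property is easy to state by brute force, as in the sketch below. (The real alias_free instead returns a symbolic condition, which per the commit notes is sufficient but not necessary.)

    #include <cstddef>
    #include <set>
    #include <vector>

    // Brute-force alias-freedom for constant strides and extents (all >= 1):
    // true iff every lane of the multiramp computes a distinct value.
    bool alias_free_const(const std::vector<int> &strides,
                          const std::vector<int> &extents) {
        std::set<int> seen;
        std::vector<int> c(extents.size(), 0);  // odometer over the lane box
        while (true) {
            int lane = 0;
            for (size_t d = 0; d < c.size(); d++) {
                lane += c[d] * strides[d];
            }
            if (!seen.insert(lane).second) {
                return false;  // two lanes collide
            }
            size_t d = 0;
            while (d < c.size() && ++c[d] == extents[d]) {
                c[d++] = 0;
            }
            if (d == c.size()) {
                return true;  // visited every lane without a collision
            }
        }
    }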

With these tools in hand, vector reduce pattern matching gets simpler and more general. You lift the store address to a MultiRamp (replacing the old is_interleaved_ramp), then handle the reduction along each dimension separately. Stride-zero dimensions correspond to vectorized vars that don't appear in the LHS index, e.g. 'r' in f(x) += g(x + r). These can be handled by within-vector reductions: if it's the innermost dimension you use VectorReduce, otherwise you rotate those dimensions outermost and then slice and add the vector using a small reduction tree. Any alias-free subset of the dimensions can just be carried through to the store index - we know those lanes don't collide with each other, so we can store them as one big vector. Remaining dimensions, e.g. those with symbolic strides (say, due to vectorizing across y), are unrolled into a sequence of vector loads and stores of the alias-free subset. You don't know whether the store addresses collide, so you have to serialize that dimension. Previously, if anything got confusing we'd serialize all the dimensions, so this is much better: we still handle the known stride-zero dimensions with vector ops. A sketch of this triage follows.
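Here is a deliberately simplified sketch of that triage. Names are hypothetical, and the alias-freedom test (each kept stride must at least span everything kept so far, so it only works when kept dims arrive in increasing-stride order) is much cruder than the real one in VectorizeLoops.cpp:

    #include <cstdlib>
    #include <optional>
    #include <vector>

    enum class DimKind {
        ReduceInVector,  // stride zero: VectorReduce or a slice-and-add tree
        KeepVectorized,  // provably alias-free: stays in the big vector store
        Unroll,          // might collide: serialized into per-slice stores
    };

    // Classify each dimension of a store-index multiramp. nullopt models a
    // symbolic stride that we can't reason about.
    std::vector<DimKind> classify(const std::vector<std::optional<int>> &strides,
                                  const std::vector<int> &extents) {
        std::vector<DimKind> kinds;
        long long span = 1;  // addresses covered by the kept dims so far
        for (size_t d = 0; d < strides.size(); d++) {
            if (strides[d] && *strides[d] == 0) {
                kinds.push_back(DimKind::ReduceInVector);
            } else if (strides[d] && std::abs((long long)*strides[d]) >= span) {
                kinds.push_back(DimKind::KeepVectorized);
                span = std::abs((long long)*strides[d]) * extents[d];
            } else {
                kinds.push_back(DimKind::Unroll);
            }
        }
        return kinds;
    }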

The new test correctness/transposed_vector_reduce.cpp asserts that we do something sane for all possible loop orderings and storage orderings of a triply-vectorized loop. Here is the IR it generates if I substitute in constant strides (the test itself uses unknown strides, for which the IR is still vectorized but considerably more complex):

 produce g {
  let halide_copy_to_host_result$2 = halide_copy_to_host((struct halide_buffer_t *)g.buffer)
  assert(halide_copy_to_host_result$2 == 0, halide_copy_to_host_result$2)
  g[ramp(0, 1, 8)] = x8(0)
  g[ramp(10, 1, 8)] = x8(0)
  g[ramp(20, 1, 8)] = x8(0)
  g[ramp(30, 1, 8)] = x8(0)
  g[ramp(40, 1, 8)] = x8(0)
  g[ramp(50, 1, 8)] = x8(0)
  g[ramp(60, 1, 8)] = x8(0)
  g[ramp(70, 1, 8)] = x8(0)
  _halide_buffer_set_host_dirty((struct halide_buffer_t *)g.buffer, true)
  let halide_copy_to_host_result$3 = halide_copy_to_host((struct halide_buffer_t *)p0.buffer)
  assert(halide_copy_to_host_result$3 == 0, halide_copy_to_host_result$3)
  let t847 = let t9690 = concat_vectors(shuffle(p0[ramp(((0 - (p0.min.2*100)) - (p0.min.1*10)) - p0.min.0, 1, 778)], ...)) in (let t9691 = (slice_vectors(t9690, 0, 1, 256) + slice_vectors(t9690, 256, 1, 256)) in (let t9692 = (slice_vectors(t9691, 0, 1, 128) + slice_vectors(t9691, 128, 1, 128)) in (slice_vectors(t9692, 64, 1, 64) + (shuffle(g[ramp(0, 1, 78)], ...) + slice_vectors(t9692, 0, 1, 64)))))
  g[ramp(0, 1, 8)] = slice_vectors(t847, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t847, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t847, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t847, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t847, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t847, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t847, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t847, 56, 1, 8)
  let t1365 = let t9693 = concat_vectors(shuffle(p0[ramp(((0 - (p0.min.2*100)) - (p0.min.1*10)) - p0.min.0, 1, 778)], ...)) in (let t9694 = (slice_vectors(t9693, 0, 1, 256) + slice_vectors(t9693, 256, 1, 256)) in (let t9695 = (slice_vectors(t9694, 0, 1, 128) + slice_vectors(t9694, 128, 1, 128)) in (slice_vectors(t9695, 64, 1, 64) + (shuffle(g[ramp(0, 1, 78)], ...) + slice_vectors(t9695, 0, 1, 64)))))
  g[ramp(0, 1, 8)] = slice_vectors(t1365, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t1365, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t1365, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t1365, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t1365, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t1365, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t1365, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t1365, 56, 1, 8)
  let t1883 = let t9696 = concat_vectors(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 8)) in (let t9697 = (slice_vectors(t9696, 0, 1, 256) + slice_vectors(t9696, 256, 1, 256)) in (let t9698 = (slice_vectors(t9697, 0, 1, 128) + slice_vectors(t9697, 128, 1, 128)) in (slice_vectors(t9698, 64, 1, 64) + (shuffle(g[ramp(0, 1, 78)], ...) + slice_vectors(t9698, 0, 1, 64)))))
  g[ramp(0, 1, 8)] = slice_vectors(t1883, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t1883, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t1883, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t1883, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t1883, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t1883, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t1883, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t1883, 56, 1, 8)
  let t2401 = let t9699 = shuffle(shuffle(p0[ramp(((0 - (p0.min.2*100)) - (p0.min.1*10)) - p0.min.0, 1, 778)], ...), ...) in (let t9700 = (slice_vectors(t9699, 0, 1, 256) + slice_vectors(t9699, 256, 1, 256)) in (let t9701 = (slice_vectors(t9700, 0, 1, 128) + slice_vectors(t9700, 128, 1, 128)) in (slice_vectors(t9701, 64, 1, 64) + (shuffle(g[ramp(0, 1, 78)], ...) + slice_vectors(t9701, 0, 1, 64)))))
  g[ramp(0, 1, 8)] = slice_vectors(t2401, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t2401, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t2401, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t2401, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t2401, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t2401, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t2401, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t2401, 56, 1, 8)
  let t2919 = let t9702 = shuffle(shuffle(p0[ramp(((0 - (p0.min.2*100)) - (p0.min.1*10)) - p0.min.0, 1, 778)], ...), ...) in (let t9703 = (slice_vectors(t9702, 0, 1, 256) + slice_vectors(t9702, 256, 1, 256)) in (let t9704 = (slice_vectors(t9703, 0, 1, 128) + slice_vectors(t9703, 128, 1, 128)) in (slice_vectors(t9704, 64, 1, 64) + (shuffle(g[ramp(0, 1, 78)], ...) + slice_vectors(t9704, 0, 1, 64)))))
  g[ramp(0, 1, 8)] = slice_vectors(t2919, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t2919, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t2919, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t2919, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t2919, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t2919, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t2919, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t2919, 56, 1, 8)
  let t3437 = let t9705 = shuffle(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 64), ...) in (let t9706 = (slice_vectors(t9705, 0, 1, 256) + slice_vectors(t9705, 256, 1, 256)) in (let t9707 = (slice_vectors(t9706, 0, 1, 128) + slice_vectors(t9706, 128, 1, 128)) in (slice_vectors(t9707, 64, 1, 64) + (shuffle(g[ramp(0, 1, 78)], ...) + slice_vectors(t9707, 0, 1, 64)))))
  g[ramp(0, 1, 8)] = slice_vectors(t3437, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t3437, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t3437, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t3437, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t3437, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t3437, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t3437, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t3437, 56, 1, 8)
  let t3955 = (int32x64)vector_reduce_add(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 64)) + shuffle(g[ramp(0, 1, 78)], ...)
  g[ramp(0, 1, 8)] = slice_vectors(t3955, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t3955, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t3955, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t3955, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t3955, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t3955, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t3955, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t3955, 56, 1, 8)
  let t4473 = (int32x64)vector_reduce_add(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 64)) + shuffle(g[ramp(0, 1, 78)], ...)
  g[ramp(0, 1, 8)] = slice_vectors(t4473, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t4473, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t4473, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t4473, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t4473, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t4473, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t4473, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t4473, 56, 1, 8)
  let t4991 = (int32x64)vector_reduce_add(shuffle(p0[ramp(((0 - (p0.min.2*100)) - (p0.min.1*10)) - p0.min.0, 1, 778)], ...)) + shuffle(g[ramp(0, 1, 78)], ...)
  g[ramp(0, 1, 8)] = slice_vectors(t4991, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t4991, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t4991, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t4991, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t4991, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t4991, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t4991, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t4991, 56, 1, 8)
  let t5509 = let t9708 = concat_vectors(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 64)) in (let t9709 = (slice_vectors(t9708, 0, 1, 256) + slice_vectors(t9708, 256, 1, 256)) in (let t9710 = (slice_vectors(t9709, 0, 1, 128) + slice_vectors(t9709, 128, 1, 128)) in transpose_vector(slice_vectors(t9710, 64, 1, 64) + (transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8) + slice_vectors(t9710, 0, 1, 64)), 8)))
  g[ramp(0, 1, 8)] = slice_vectors(t5509, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t5509, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t5509, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t5509, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t5509, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t5509, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t5509, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t5509, 56, 1, 8)
  let t6027 = let t9711 = concat_vectors(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 64)) in (let t9712 = (slice_vectors(t9711, 0, 1, 256) + slice_vectors(t9711, 256, 1, 256)) in (let t9713 = (slice_vectors(t9712, 0, 1, 128) + slice_vectors(t9712, 128, 1, 128)) in transpose_vector(slice_vectors(t9713, 64, 1, 64) + (transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8) + slice_vectors(t9713, 0, 1, 64)), 8)))
  g[ramp(0, 1, 8)] = slice_vectors(t6027, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t6027, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t6027, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t6027, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t6027, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t6027, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t6027, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t6027, 56, 1, 8)
  let t6545 = let t9714 = concat_vectors(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 8)) in (let t9715 = (slice_vectors(t9714, 0, 1, 256) + slice_vectors(t9714, 256, 1, 256)) in (let t9716 = (slice_vectors(t9715, 0, 1, 128) + slice_vectors(t9715, 128, 1, 128)) in transpose_vector(slice_vectors(t9716, 64, 1, 64) + (transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8) + slice_vectors(t9716, 0, 1, 64)), 8)))
  g[ramp(0, 1, 8)] = slice_vectors(t6545, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t6545, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t6545, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t6545, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t6545, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t6545, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t6545, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t6545, 56, 1, 8)
  let t7063 = let t9717 = shuffle(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 8), ...) in (let t9718 = (slice_vectors(t9717, 0, 1, 256) + slice_vectors(t9717, 256, 1, 256)) in (let t9719 = (slice_vectors(t9718, 0, 1, 128) + slice_vectors(t9718, 128, 1, 128)) in transpose_vector(slice_vectors(t9719, 64, 1, 64) + (transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8) + slice_vectors(t9719, 0, 1, 64)), 8)))
  g[ramp(0, 1, 8)] = slice_vectors(t7063, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t7063, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t7063, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t7063, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t7063, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t7063, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t7063, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t7063, 56, 1, 8)
  let t7581 = let t9720 = shuffle(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 8), ...) in (let t9721 = (slice_vectors(t9720, 0, 1, 256) + slice_vectors(t9720, 256, 1, 256)) in (let t9722 = (slice_vectors(t9721, 0, 1, 128) + slice_vectors(t9721, 128, 1, 128)) in transpose_vector(slice_vectors(t9722, 64, 1, 64) + (transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8) + slice_vectors(t9722, 0, 1, 64)), 8)))
  g[ramp(0, 1, 8)] = slice_vectors(t7581, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t7581, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t7581, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t7581, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t7581, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t7581, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t7581, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t7581, 56, 1, 8)
  let t8099 = let t9723 = shuffle(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 64), ...) in (let t9724 = (slice_vectors(t9723, 0, 1, 256) + slice_vectors(t9723, 256, 1, 256)) in (let t9725 = (slice_vectors(t9724, 0, 1, 128) + slice_vectors(t9724, 128, 1, 128)) in transpose_vector(slice_vectors(t9725, 64, 1, 64) + (transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8) + slice_vectors(t9725, 0, 1, 64)), 8)))
  g[ramp(0, 1, 8)] = slice_vectors(t8099, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t8099, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t8099, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t8099, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t8099, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t8099, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t8099, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t8099, 56, 1, 8)
  let t8617 = transpose_vector((int32x64)vector_reduce_add(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 8)) + transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8), 8)
  g[ramp(0, 1, 8)] = slice_vectors(t8617, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t8617, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t8617, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t8617, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t8617, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t8617, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t8617, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t8617, 56, 1, 8)
  let t9135 = transpose_vector((int32x64)vector_reduce_add(transpose_vector(shuffle(p0[ramp(((((p0.min.2*10) + p0.min.1)*10) + p0.min.0)*-1, 1, 778)], ...), 8)) + transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8), 8)
  g[ramp(0, 1, 8)] = slice_vectors(t9135, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t9135, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t9135, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t9135, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t9135, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t9135, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t9135, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t9135, 56, 1, 8)
  let t9653 = transpose_vector((int32x64)vector_reduce_add(shuffle(p0[ramp(((0 - (p0.min.2*100)) - (p0.min.1*10)) - p0.min.0, 1, 778)], ...)) + transpose_vector(shuffle(g[ramp(0, 1, 78)], ...), 8), 8)
  g[ramp(0, 1, 8)] = slice_vectors(t9653, 0, 1, 8)
  g[ramp(10, 1, 8)] = slice_vectors(t9653, 8, 1, 8)
  g[ramp(20, 1, 8)] = slice_vectors(t9653, 16, 1, 8)
  g[ramp(30, 1, 8)] = slice_vectors(t9653, 24, 1, 8)
  g[ramp(40, 1, 8)] = slice_vectors(t9653, 32, 1, 8)
  g[ramp(50, 1, 8)] = slice_vectors(t9653, 40, 1, 8)
  g[ramp(60, 1, 8)] = slice_vectors(t9653, 48, 1, 8)
  g[ramp(70, 1, 8)] = slice_vectors(t9653, 56, 1, 8)
 }

It's always some number of dense loads, some shuffles, maybe a VectorReduce node if the innermost dimension happened to be stride zero, some additions of slices to handle other stride-zero dimensions, and then the store of a bunch of slices. We obey the schedule: the RHS is always materialized as one giant vector in the nesting order implied by the loop nesting order in the schedule - we just slice it up to do a 2D vector store.

If a stride is symbolic it ends up looking like this:

 let b58 = let t1479 = (p0.min.2*p0.stride.2) in (let t1480 = (p0.min.1*p0.stride.1) in (let t1481 = ((((1 - p0.min.2)*p0.stride.2) - t1480) - p0.min.0) in (let t1482 = ((2 - p0.min.2)*p0.stride.2) in (let t1483 = ((t1482 - t1480) - p0.min.0) in (let t1484 = ((3 - p0.min.2)*p0.stride.2) in (let t1485 = ((t1484 - t1480) - p0.min.0) in (let t1486 = ((4 - p0.min.2)*p0.stride.2) in (let t1487 = ((t1486 - t1480) - p0.min.0) in (let t1488 = ((5 - p0.min.2)*p0.stride.2) in (let t1489 = ((t1488 - t1480) - p0.min.0) in (let t1490 = ((6 - p0.min.2)*p0.stride.2) in (let t1491 = ((t1490 - t1480) - p0.min.0) in (let t1492 = ((7 - p0.min.2)*p0.stride.2) in (let t1493 = ((t1492 - t1480) - p0.min.0) in (let t1494 = ((2 - p0.min.1)*p0.stride.1) in (let t1495 = ((3 - p0.min.1)*p0.stride.1) in (let t1496 = ((4 - p0.min.1)*p0.stride.1) in (let t1497 = ((5 - p0.min.1)*p0.stride.1) in (let t1498 = ((6 - p0.min.1)*p0.stride.1) in (let t1499 = ((7 - p0.min.1)*p0.stride.1) in (int32x64)vector_reduce_add(concat_vectors( (lots of dense loads) ))))))))))))))))))))))
  g[ramp(0, 1, 8)] = slice_vectors(b58, 0, 8, 8) + g[ramp(0, 1, 8)]
  g[ramp(g.stride.1, 1, 8)] = slice_vectors(b58, 1, 8, 8) + g[ramp(g.stride.1, 1, 8)]
  g[ramp(g.stride.1*2, 1, 8) aligned(2, 0)] = slice_vectors(b58, 2, 8, 8) + g[ramp(g.stride.1*2, 1, 8) aligned(2, 0)]
  g[ramp(g.stride.1*3, 1, 8) aligned(3, 0)] = slice_vectors(b58, 3, 8, 8) + g[ramp(g.stride.1*3, 1, 8) aligned(3, 0)]
  g[ramp(g.stride.1*4, 1, 8) aligned(4, 0)] = slice_vectors(b58, 4, 8, 8) + g[ramp(g.stride.1*4, 1, 8) aligned(4, 0)]
  g[ramp(g.stride.1*5, 1, 8) aligned(5, 0)] = slice_vectors(b58, 5, 8, 8) + g[ramp(g.stride.1*5, 1, 8) aligned(5, 0)]
  g[ramp(g.stride.1*6, 1, 8) aligned(6, 0)] = slice_vectors(b58, 6, 8, 8) + g[ramp(g.stride.1*6, 1, 8) aligned(6, 0)]
  g[ramp(g.stride.1*7, 1, 8) aligned(7, 0)] = slice_vectors(b58, 7, 8, 8) + g[ramp(g.stride.1*7, 1, 8) aligned(7, 0)]


So it still obeys the schedule and computes the value being reduced as one giant vector, but it needs to slice it up and unroll across one of the dimensions because it doesn't statically know whether g.stride.1 could be zero.
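In scalar terms, each unrolled store above handles one iteration of the possibly-aliasing dimension. A scalar-equivalent sketch (hypothetical names; b58 stands for the 64-lane reduced vector):

    // Iteration i implements g[ramp(g.stride.1*i, 1, 8)] +=
    // slice_vectors(b58, i, 8, 8). Rows may alias if g_stride_1 is 0,
    // which is why they must run in sequence, not as one wide store.
    void unrolled_dim(int *g, const int *b58, int g_stride_1) {
        for (int i = 0; i < 8; i++) {
            for (int lane = 0; lane < 8; lane++) {
                g[g_stride_1 * i + lane] += b58[i + 8 * lane];
            }
        }
    }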

It's possible for there to be aliasing that requires this sort of unrolling even if all strides are known. Consider:

g(r.x + r.y) += f(r.x, r.y);

If we vectorize r.x and r.y by 8, the multiramp looks like ramp(ramp(0, 1, 8), 1, 8). A within-vector reduction would be complex, because a different number of values is being added for each output lane: in that vector, 0 and 14 each occur once, 1 occurs twice, and 7 occurs 8 times. In principle you could shuffle a bunch of zeros into the RHS vector, shearing it so that every index occurred 8 times, and then do a vector reduce. Currently this gets lowered to:

   let b2 = f[ramp(0, 1, 64)]
   g[ramp(0, 1, 8)] = slice_vectors(b2, 0, 1, 8) + g[ramp(0, 1, 8)]
   g[ramp(1, 1, 8)] = slice_vectors(b2, 8, 1, 8) + g[ramp(1, 1, 8)]
   g[ramp(2, 1, 8)] = slice_vectors(b2, 16, 1, 8) + g[ramp(2, 1, 8)]
   g[ramp(3, 1, 8)] = slice_vectors(b2, 24, 1, 8) + g[ramp(3, 1, 8)]
   g[ramp(4, 1, 8)] = slice_vectors(b2, 32, 1, 8) + g[ramp(4, 1, 8)]
   g[ramp(5, 1, 8)] = slice_vectors(b2, 40, 1, 8) + g[ramp(5, 1, 8)]
   g[ramp(6, 1, 8)] = slice_vectors(b2, 48, 1, 8) + g[ramp(6, 1, 8)]
   g[ramp(7, 1, 8)] = slice_vectors(b2, 56, 1, 8) + g[ramp(7, 1, 8)]
  }

So again, the RHS is computed as one big vector (just a load from f), but the reduction is unrolled into individual vectorized steps that are each alias-free. When this is necessary, we prefer to keep innermost dimensions in the alias-free subset.
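For reference, a pipeline along these lines can be written as below. This is an illustrative sketch, not the exact code from the test suite; the payload and names are made up:

    #include "Halide.h"
    using namespace Halide;

    int main() {
        Func f("f"), g("g");
        Var x("x"), y("y");
        f(x, y) = x + y;  // arbitrary payload
        g(x) = 0;

        RDom r(0, 8, 0, 8);
        g(r.x + r.y) += f(r.x, r.y);

        // Lanes of the store index collide, so the update must be declared
        // atomic for it to be legal to vectorize across both RVars.
        g.update().atomic().vectorize(r.x).vectorize(r.y);

        g.realize({16});
        return 0;
    }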

There's a mini fuzzer in one of the tests, but it didn't seem to belong in test/fuzz. This is not a combinatorially large space that we need to slowly explore over time - I just wanted to test reductions over a fixed set of arbitrary quasi-affine (i.e. affine plus div and mod) index expressions, so it uses a fixed seed.

This branch was a collaboration between me and Claude Opus 4.7, with Claude doing a lot of the heavy lifting (e.g. figuring out a correct MultiRamp division algorithm and then repeatedly dumbing it down until a mere human could understand it - it was originally more general and still correct but even harder to understand).

abadams and others added 30 commits January 26, 2026 15:52
The previous comment reported a time that seemed to have regressed. It
was not 8.2ms on main - more like 11ms.
Before:

Computing best tile sizes for each type
.................................................
bytes, tile width, tile height, bandwidth (GB/s):
1 8 8 20.9997
1 16 8 20.8329
1 8 16 18.5702
1 8 32 17.2463
1 8 64 14.312

2 8 16 19.2047
2 8 8 18.8368
2 16 8 17.0593
2 8 32 17.0591
2 4 8 15.7681

4 8 8 24.9364
4 4 16 22.9699
4 8 16 22.5743
4 4 32 22.255
4 4 8 20.4468

8 8 8 38.4094
8 16 4 28.4167
8 16 8 27.6184
8 8 4 27.6062
8 8 16 26.8693

After:

Computing best tile sizes for each type
.................................................
bytes, tile width, tile height, bandwidth (GB/s):
1 16 32 34.1921
1 16 16 31.8399
1 8 16 25.575
1 16 64 25.1665
1 32 16 25.0061

2 8 32 28.2635
2 8 16 27.7648
2 16 16 27.2126
2 16 32 23.9034
2 8 8 23.6345

4 8 16 34.5303
4 8 8 28.3653
4 16 8 26.8521
4 8 32 26.084
4 16 16 24.4519

8 8 8 33.7163
8 8 4 29.1339
8 4 16 26.418
8 16 4 25.4663
8 2 8 24.3949
Also better algorithm for innermost containing stmt
abadams and others added 29 commits March 6, 2026 14:25
Co-authored-by: Claude Code <noreply@anthropic.com>
Adds a MultiRamp IR helper that generalizes the old InterleavedRamp: a
nested ramp with a scalar base, a vector of strides (innermost first),
and a vector of per-dim lane counts. Supports in-place mul/add/div/mod
with symbolic strides where possible, reorder, slice, flatten into 1D
ramps, shuffle index construction for permutations and slices, and an
alias-free predicate.

Replaces InterleavedRamp recognition and handling in VectorizeLoops with
MultiRamp. The reduction-store path peels stride-zero and non-alias-free
dims (turning the latter into unrolled containing loops), computes the
per-iteration shuffle mask from the pre-peel shape via
shuffle_from_slice, and gracefully falls back when alias-freedom can't
be proven.

Wires MultiRamp into the simplifier: the Load and Store rules that
recognize a ramp-of-ramp index now use MultiRamp to rotate the stride-1
dim outermost via a single Shuffle::make_transpose, fixing a latent
correctness bug in the old rule for triply-nested ramps.

In FlattenNestedRamps, teaches the Load and Store visitors to recognize
multiramp indices and emit a concat of per-outer-multi-index 1D ramp
loads/stores, rather than a single flat-indexed load that downstream
passes struggle to combine. The existing bounded-span-to-dense-load
path runs first so strided-gather patterns (e.g. HVX vdelta) are
preserved.

Adds correctness tests for the MultiRamp API (test/correctness/multiramp.cpp)
and a nested-vectorize reduction test (transposed_vector_reduce.cpp).

Co-authored-by: Claude <noreply@anthropic.com>
API and doc cleanup on MultiRamp:
 - Add a real invariants block to the class header; clarify that 0-dim
   (scalar) multiramps are legal and all methods handle them.
 - Switch all class member comments to doxygen style.
 - Reword alias_free, alias_free_slice, rotate_stride_one_innermost, and
   the shuffle_from_* overloads so each leads with what it does.
 - Fix is_multiramp's Mul branch to use fresh local MultiRamps rather
   than letting a failed first attempt leak partial state into the
   second.
 - Drop the single-dim shuffle_from_slice overload in favour of the
   multi-dim version.
 - Relax add() so it handles 0-dim inputs trivially (base+base); this
   also makes operator== work for 0-dim via its existing add path.
 - Add accept/mutate methods (Function-idiom) so callers don't reach
   into base/strides to walk scalar subexpressions.
 - Add alias_free_slice (replaces ad-hoc in-caller peeling) and
   rotate_stride_one_innermost (replaces near-duplicate dance in the
   simplifier rules).

Use those APIs from VectorizeLoops, FlattenNestedRamps, Simplify_Exprs,
and Simplify_Stmts. In particular the atomic-store reduction block in
VectorizeLoops is restructured: one alias_free_slice call discovers
both stride-zero peels (handled via VectorReduce or a tree reduction
over a reordered b) and symbolic/overlapping aliasing peels (handled
via an unrolled cartesian-product loop block). The b's current lane
layout is tracked as a MultiRamp (b_shape_mr) so the per-iteration
slice of the reduced vector can be computed via shuffle_from_slice.

Extend the downsampling/atomic-vectorize test (downsampling_reduce.cpp)
to exercise MultiRamp::div through the vectorize path, and expand the
multiramp API tests.

Co-authored-by: Claude <noreply@anthropic.com>
Four places in Halide iterate the cartesian product of a box of
integer sizes: MultiRamp::shuffle_from_permuted, MultiRamp::flatten,
MultiRamp::shuffle_from_slice, and the unroll block in VectorizeLoops's
atomic-store reduction path. Replace each manual decompose-via-%-and-/
loop with a call to a new for_each_coordinate helper in Util.h that
invokes a callback on each coordinate in lex order (sketched below).

Co-authored-by: Claude <noreply@anthropic.com>
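A minimal sketch of what such a helper could look like (the real for_each_coordinate in Util.h may differ in signature and ordering conventions):

    #include <functional>
    #include <vector>

    // Visit every coordinate in the box given by extents (all assumed >= 1),
    // in lexicographic order with the first dimension most significant.
    void for_each_coordinate(const std::vector<int> &extents,
                             const std::function<void(const std::vector<int> &)> &fn) {
        std::vector<int> c(extents.size(), 0);
        while (true) {
            fn(c);
            int d = (int)extents.size() - 1;  // bump least significant digit
            while (d >= 0 && ++c[d] == extents[d]) {
                c[d--] = 0;
            }
            if (d < 0) {
                return;  // carried off the top: all coordinates visited
            }
        }
    }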
is_multiramp's recursive impl may leave its output in a partial state
on failure (e.g. after a successful Ramp::base recursion followed by a
failed stride check). The contract previously told callers "don't read
*result on failure," but honoring that required every recursive call
site to use a fresh local MultiRamp. Split is_multiramp into an
internal impl that keeps the old partial-state behavior plus a thin
public wrapper that commits to *result only on success. The public
contract is now the cleaner "untouched on failure." The Mul branch's
fresh-local workaround falls out.

VectorizeLoops's atomic-store reduction path needs to shuffle `b` from
its original lane order into a permuted order (so a subsequent
reduction tree can slice contiguous sub-vectors per stride-zero peel).
It was calling shuffle_from_permuted directly, but that method returns
indices for the opposite direction — "permuted → original," which is
what the Simplify_Exprs caller wants. Invert the result as a
permutation. Without this, any case with multiple stride-zero peels
that needed a real reorder summed lanes that should have gone to
different output addresses.

Simplify_Shuffle's "slice of concat" rule drops concat vectors that
don't overlap with the slice's range. The overlap check tested whether
the concat vector's start OR its last lane was inside the slice, but
missed the case where the slice is entirely contained within one
concat vector (neither endpoint inside). When every vector contained
the slice, new_concat_vectors came out empty and Shuffle::make_concat
tripped its empty-vector assert. Replaced with a standard interval-
overlap check. This is a pre-existing bug that the randomized test
exercised through multiramp-derived slice patterns.

Co-authored-by: Claude <noreply@anthropic.com>
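The corrected check is the standard closed-interval overlap test, roughly (illustrative; the actual code operates on lane ranges inside Simplify_Shuffle):

    // Closed ranges [a0, a1] and [b0, b1] overlap iff neither ends before
    // the other begins. Unlike the old endpoint-membership test, this also
    // covers the case where one range fully contains the other.
    bool ranges_overlap(int a0, int a1, int b0, int b1) {
        return a0 <= b1 && b0 <= a1;
    }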
Add hand-picked tests for MultiRamp API properties that weren't
previously covered: mul, operator==, alias_free_slice (unique lanes /
zero-stride peeling / degenerate scalar), rotate_stride_one_innermost
(rotation + transpose round-trip), and is_multiramp round-trips for a
handful of shapes.

Add test_random to transposed_vector_reduce.cpp: 1000 random
quasi-affine store/load index pairs over a 3-dim RDom, each compiled
scalarly and with .atomic().vectorize() across all three RVars and
compared. This test found all three bugs fixed in the preceding
commit.

Co-authored-by: Claude <noreply@anthropic.com>
Ran a weak subagent (Haiku) over the MultiRamp PR as an adversarial
comprehension test — asking it to explain the code in detail, then
fixing whatever it got wrong. The theory: if a weaker model
misreads something, the comment is probably unclear, not the model.

Fixes prompted by the review:

- Simplify_Exprs.cpp / Simplify_Stmts.cpp: stale "outermost" wording
  from before rotate_stride_one_outermost was renamed to
  rotate_stride_one_innermost. The comments contradicted the function
  name and Haiku echoed the contradiction.
- MultiRamp.h alias_free: state explicitly that the returned Expr is
  a sufficient (not necessary) condition for lane uniqueness.
- MultiRamp.h alias_free_slice: clarify that kept dims are a subset
  preserving order, not necessarily a prefix.
- VectorizeLoops.cpp: rename ContainingLoop -> UnrolledLoop and note
  that the peeled dims are fully unrolled into a flat Block, not a
  runtime loop nest (despite the old name).
- MultiRamp.h alias_free_slice: note that stride-zero and purely
  symbolic strides always peel (added by Andrew directly).

A second Haiku pass after these edits answered every question
correctly, including the ones it got wrong the first time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexreinking self-requested a review on April 29, 2026 22:13