Auto-generating fused broadcast kernels #121
|
We'll definitely need ways to expose cuTile.jl to users rather than only kernel developers, so this is a good first step. It's a bit unfortunate that we're not reusing Julia's broadcast fusion here. I wonder if we could have a macro that only redirects the broadcast style to something cuTile.jl-specific and otherwise reuses more of the existing machinery? A separate array type would be easiest, but I'm not sure we want that. Or maybe it would be fine if it's only a wrapper to influence dispatch (here, BroadcastStyle), e.g., |
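To make the wrapper idea concrete, here is a minimal CPU-only sketch of a wrapper that exists solely to select a custom broadcast style. It is not the code from this thread: `Tiled`, `TiledStyle`, and `TILED_CALLS` are hypothetical names, and the `copyto!` method only stands in for a cuTile kernel launch by falling back to Base's generic loop.

```julia
using Base.Broadcast: Broadcasted, AbstractArrayStyle

# Hypothetical read-only wrapper: it owns no data, it only changes dispatch.
struct Tiled{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
end
Base.size(t::Tiled) = size(t.parent)
Base.getindex(t::Tiled{T,N}, i::Vararg{Int,N}) where {T,N} = t.parent[i...]

# Hypothetical style that a cuTile-specific backend could hook into.
struct TiledStyle{N} <: AbstractArrayStyle{N} end
TiledStyle{M}(::Val{N}) where {M,N} = TiledStyle{N}()
Base.BroadcastStyle(::Type{<:Tiled{T,N}}) where {T,N} = TiledStyle{N}()

# Counts how often the custom path runs (illustration only).
const TILED_CALLS = Ref(0)

# By the time this method is called, Julia has already fused the whole dotted
# expression into one lazy `Broadcasted` tree.
function Base.copyto!(dest::AbstractArray, bc::Broadcasted{<:TiledStyle})
    TILED_CALLS[] += 1
    # Placeholder for a kernel launch: drop the style and reuse Base's loop.
    return copyto!(dest, convert(Broadcasted{Nothing}, bc))
end

x = Float32[1, 2, 3]
z = Float32[10, 20, 30]
dest = similar(x)
dest .= 2 .* Tiled(x) .+ z   # a single fused pass through the method above
@assert dest == Float32[12, 24, 36]
@assert TILED_CALLS[] == 1
```

Only the in-place form is exercised here; an out-of-place `2 .* Tiled(x)` would additionally need a decision about which array type `similar` should allocate for this style.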
8fbee9d to 65519ec
|
Okay, I vibe coded something that looks like: Pretty nice, no? And reuses much more of the broadcast machinery, without the need to generate kernels. |
|
Interestingly, cuTile is already faster than our regular broadcast: Maybe not too surprising though, since we do some forced specialization on broadcast arguments. I wonder if we should consider backporting the |
c4b4757 to 3275bc3
|
I moved my code into #129, since it was entirely disconnected from the work on this branch. |
|
All good! Looks super sleek, and is going in the right direction. |
|
Superseded by #129 |
This is more of an issue-with-supplementary-code than a PR, and should not necessarily be in cuTile.jl, but I had this idea to procedurally generate cuTile kernels for fused broadcasting, like Julia already does. My main motivation, though, is leveraging tileiras for, e.g., converting FP8 activations to arithmetic types and then writing back to FP8 again, but it also seems to perform reasonably well for more mundane things:
Currently it creates 8 methods to accommodate different destination dimensionalities, so it dispatches based on ndims(dest) at launch time (could probably be a @generated function instead). The default tile size is simply (64, 64, 1, 1, ...). Singleton broadcasting, type conversions (e.g. BFloat16.(x)), scalars, etc. all work.
I haven't tested it extensively, but it might be fairly robust. On its own it isn't too useful outside of leveraging tileiras. I wonder if anything like this could be used for epilogues.
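As a rough illustration of the dimensionality handling, the sketch below derives the `(64, 64, 1, 1, ...)` default from `N` with a single parametric method rather than one method per dimensionality. `default_tile` and `launch_shape` are hypothetical helpers, not cuTile.jl API, and the one-dimensional case is assumed to be `(64,)`.

```julia
# Hypothetical helper: 64 along the first two dimensions, 1 along the rest.
default_tile(::Val{N}) where {N} = ntuple(d -> d <= 2 ? 64 : 1, Val(N))

# One parametric entry point; N is known at compile time from the array type,
# so no per-dimensionality methods are needed.
function launch_shape(dest::AbstractArray{T,N}) where {T,N}
    tile = default_tile(Val(N))
    grid = cld.(size(dest), tile)   # number of tiles needed along each dimension
    return (tile = tile, grid = grid)
end

shape = launch_shape(zeros(Float32, 100, 130, 3))
@assert shape.tile == (64, 64, 1)
@assert shape.grid == (2, 3, 3)
```

Because `ntuple(f, Val(N))` is resolved from the type parameter, this may give the same specialization a `@generated` function would, with less machinery.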
I suppose a macro-less version could be done by intercepting Base.Broadcast, maybe with method tables, by passing a function to a function. The macro is straightforward enough. Curious to hear your thoughts though, @maleadt.
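For context on what a macro-less version would have to intercept: dotted expressions build a lazy `Base.Broadcast.Broadcasted` tree before anything is computed, and that tree can be inspected with plain Julia. The sketch below is CPU-only, and `leaves` is a hypothetical helper that collects the tree's inputs, which is roughly the information a generated kernel would need as arguments.

```julia
using Base.Broadcast: Broadcasted, broadcasted, materialize

x = [1.0, 2.0, 3.0]

# The lazy tree that `abs2.(x) .+ 1.0` builds before it is materialized.
bc = broadcasted(+, broadcasted(abs2, x), 1.0)
@assert bc isa Broadcasted

# Hypothetical helper: flatten the nested tree into a tuple of its inputs.
leaves(arg) = (arg,)
leaves(b::Broadcasted) = _leaves(b.args...)
_leaves() = ()
_leaves(a, rest...) = (leaves(a)..., _leaves(rest...)...)

@assert leaves(bc) == (x, 1.0)
@assert materialize(bc) == [2.0, 5.0, 10.0]
```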