Implement opt-in compute-based skin caching.#23255
Implement opt-in compute-based skin caching.#23255pcwalton wants to merge 20 commits intobevyengine:mainfrom
Conversation
At the moment, Bevy performs skinning and morph target deformation in the vertex shader. This is not what most production engines do, for multiple reasons: 1. Skinning each mesh in the vertex shader results in redundant work being done during multipass rendering, for example for prepasses, multiview, and shadow map rasterization. 2. Building raytracing acceleration structures for skinned meshes requires the skinning to be done ahead of time, because the only transform that's allowed to be done when building an acceleration structure is a single fixed-function matrix multiply. However, there is also an upside to skinning in the vertex shader as Bevy does: it results in decreased memory usage, especially when drawing multiple instances of a skinned mesh. In my tests, I've found that the performance of skinning in the vertex shader is extremely close to that of skinning in a compute shader, even when a prepass and four shadow cascades are being rendered. Because of this, the primary benefit of skin caching in practice, at least on high-end hardware, seems to be in regards to point (2): Solari can't render skinned meshes with Bevy as it currently stands. This PR fixes the issue by implementing skin caching: ahead-of-time evaluation of skins and morph targets in a compute shader. To opt in to skin caching for a mesh with skins and/or morph targets, add the `CacheSkin` component to it. This causes Bevy to perform one compute shader dispatch per vertex slab in order to skin all the skinned meshes in that vertex slab at once. Perhaps surprisingly, the skinning shader doesn't write the skinned vertices into standard vertex slabs and instead stores them in a different buffer. This is by design, because not all vertex attributes can be skinned: for instance, UVs are unaffected by skinning. It would be wasteful for memory consumption if UVs were duplicated into every instance of a skinned mesh. Therefore, Bevy stores only the data that's affected by skins and/or morph targets in a separate buffer. The vertex shader consults the original vertex slab for the data unaffected by skinning and the skin cache for the data affected by skinning. This is compatible with raytracing acceleration structure building. The `many_foxes` example, alongside several others, has been updated with a new `--cache-skin` switch. Note that *this is a performance regression* on the `many_foxes` example in particular. That's because that demo consists entirely of many instances of the same mesh, and so memory usage is greatly increased with skin caching. Skin caching is a double-edged sword, and while it's needed for some use cases, like raytracing, it's undesirable for others.
There was a problem hiding this comment.
I've done a shallow review - the overall design and a lot of the details are above my pay grade.
There were a bunch of issues with WebGL even without skin caching enabled. I was able to get it mostly working by hacking out a few things - see various comments. There clearly needs to be some extra logic to handle this, but I wasn't sure where that logic should go. Should individual systems handle it or should SkinCachePlugin never register those systems in the first place? Does there need to be a global flag similar to skins_use_uniform_buffers?
Separately, previous skinned vertices seem broken except for the first instance. Tested on Win10/Nvidia, DX12 and Vulkan: cargo run --example many_foxes -- --count 6 --cache-skins --motion-blur.
Maybe worth noting that #21926 will collide with this PR - skin_cache.wgsl will need to handle compressed vertices. Or maybe the practical short-term step is to only allow skin caching if the mesh has an uncompressed layout, similar to Solari's is_mesh_raytracing_compatible?
| /// plays an animation on a skinned glTF model of a fox | ||
| #[derive(FromArgs, Resource)] | ||
| struct Args { | ||
| /// enable skin caching | ||
| #[argh(switch)] | ||
| cache_skins: bool, | ||
| } | ||
|
|
There was a problem hiding this comment.
I don't think skin caching options should be added to the animated_mesh and morph_targets examples. Their role is to be minimal examples of animation APIs.
| &model, | ||
| skin, | ||
| prev_skin, | ||
| Some(skin_cache_buffers), |
There was a problem hiding this comment.
The Some(skin_cache_buffers) should depend on whether skin caching is supported? I could only get WebGL working by hacking this to be None.
| // If there are no morph targets, this will be a dummy buffer. | ||
| @group(0) @binding(7) var<storage> morph_descriptors: array<MorphDescriptor>; | ||
|
|
||
| @compute @workgroup_size(64, 1, 1) |
There was a problem hiding this comment.
| @compute @workgroup_size(64, 1, 1) | |
| // The workgroup size should match `SKIN_CACHE_WORKGROUP_SIZE`. | |
| @compute @workgroup_size(64, 1, 1) |
| // caching shader. | ||
| match *self { | ||
| ElementClass::Vertex => BufferUsages::VERTEX, | ||
| ElementClass::Vertex => BufferUsages::VERTEX | BufferUsages::STORAGE, |
There was a problem hiding this comment.
This can hurt some GPUs iirc. Hence why I added extra_buffer_usages to MeshAllocator for Solari. Maybe we can set this automatically when SkinCache is used?
There was a problem hiding this comment.
That would mean we would either have to:
- Copy all the mesh data for every single mesh in the system whenever
SkinCacheis added to any mesh, and also when the last one is removed. Or: - Put skin-cached meshes in different slabs, increasing drawcall count, increasing fragmentation, and regressing memory usage. And then have to duplicate the mesh data if the same mesh is used in a skin-cached instance and simultaneously in a non-skin-cached instance, which would break the assumption used throughout rendering that meshes and their mesh buffers are in 1:1. Or:
- Put skinned meshes in different slabs, increasing drawcall count, increasing fragmentation, and regressing memory usage, even if the skin cache isn't used.
I didn't think any of these 3 options were worth the cost unless we actually observe a major regression (and if we ever find such a GPU maybe we should just disable skin caching entirely on that GPU instead of trying to do fine-grained gymnastics on the buffers).
There was a problem hiding this comment.
I think it should not be added unconditionally, as it may be slower on mobile GPUs.
If users need compute skinning, this usage can be added to extra_buffer_usages manually. We just need to emit an error or panic if it will fail to use cached skin.
There was a problem hiding this comment.
The only realistic way I can see to do this would be to make the skin caching plugin not part of DefaultPlugins, so you have to add it manually. Does that work for people?
| } | ||
|
|
||
| #[derive(ShaderType, Clone)] | ||
| pub struct MeshUniform { |
There was a problem hiding this comment.
Shame to make this even more expensive :/.
We might want to consider splitting this up in the future.
There was a problem hiding this comment.
I was wondering yesterday if we could pack the new offsets into current_skin_index and morph_descriptor_index, capping everything at 65k. It's probably a bit too limiting though. Maybe we could push this new data into something indexed by current_skin_index?
There was a problem hiding this comment.
MeshUniform (as opposed to MeshInputUniform) is solely read and written on GPU in the GPU driven path. Mesh preprocessing is never a bottleneck in the profiles I've seen. The GPU's enormous memory bandwidth eats it up.
atlv24
left a comment
There was a problem hiding this comment.
I'm gonna give this a round of testing and then approve I think.
| } | ||
|
|
||
| #[derive(ShaderType, Clone)] | ||
| pub struct MeshUniform { |
There was a problem hiding this comment.
I was wondering yesterday if we could pack the new offsets into current_skin_index and morph_descriptor_index, capping everything at 65k. It's probably a bit too limiting though. Maybe we could push this new data into something indexed by current_skin_index?
| } | ||
| } | ||
|
|
||
| /// Adds `CacheSkin` components to skinned meshes if skin caching was requested |
There was a problem hiding this comment.
| /// Adds `CacheSkin` components to skinned meshes if skin caching was requested | |
| /// Adds [`CacheSkin`] components to skinned meshes if skin caching was requested |
There was a problem hiding this comment.
Do you have an opinion as to whether we should have the --cache-skins option at all? @greeble-dev objected to it. I don't have an opinion, I just want to make sure my reviewers are in agreement before I decide what to do.
There was a problem hiding this comment.
Just to clarify, I think adding it to the animated_mesh and morph_targets examples is bad. Adding it to many_foxes and other stress tests is good.
| /// The vertex slab ID. | ||
| pub vertex_slab_id: SlabId, |
There was a problem hiding this comment.
can you avoid writing tautological comments like this please? they're just noise
There was a problem hiding this comment.
I made the comment longer so that it adds information that's maybe not obvious from the name.
| /// The layout for the bind group. | ||
| bind_group_layout: &'a BindGroupLayout, |
There was a problem hiding this comment.
I made the comment longer so it doesn't just restate things that are in the names.
There was a problem hiding this comment.
I tested this and it looks like a regression in many_foxes, both with and without --release. Frame time is 3.4-4.1ms without skin caching, and 4.3-5.3ms with, on my rtx 5090. But at least without --skin-caching, performance matches main, so you can just not turn it on if its slower. Edit: this is noted in the PR description
|
Yep, skin caching is basically the worst case for |
|
In case there's any doubt about the utility of skin caching, I'm just gonna note that it sets the groundwork for other features and optimisations in addition to ray-tracing - so anything that wants to run compute on a whole vertex buffer rather than per-vertex. High-end face animation wants it for tangent recalcs and sparse morphs. Physics wants it for GPU cloth sims and deformations. High-end rigging wants it for muscle sims. Lots of fun stuff. |
|
Fixed motion blur. The problem was that we didn't copy over |
|
@atlv24 Regarding your comment about |
|
OK, I've addressed or responded to every review comment. WebGL 2 is currently broken on |
greeble-dev
left a comment
There was a problem hiding this comment.
Added a few minor comments.
| /// References to the buffers that contain cached skinned vertices for the mesh | ||
| /// instances corresponding to a vertex/morph target slab pair. | ||
| #[derive(Clone, Copy)] | ||
| pub struct SkinCacheBuffers<'a> { |
There was a problem hiding this comment.
I found it confusing to have SkinCacheBuffers when there's also CachedSkinBuffers. Not sure what a good rename would be - maybe SkinCacheBuffers -> CachedSkinBuffersForBindGroup?
There was a problem hiding this comment.
I'll rename CachedSkinBuffers to GlobalSkinCacheBuffers.
|
I left a comment in #23255 (comment)
|
greeble-dev
left a comment
There was a problem hiding this comment.
Did a fuller review - got some nitpicky/speculative/non-blocking comments, and one case where I think a binary search isn't quite right.
| >, | ||
| skinned_mesh_inverse_bindposes: &Assets<SkinnedMeshInverseBindposes>, | ||
| joints: &Query<&GlobalTransform>, | ||
| removed_cache_skin_query: &mut RemovedComponents<CacheSkin>, |
There was a problem hiding this comment.
I found that skins would break if I removed the CacheSkin component after the mesh had been rendered at least once. Smelt like stale specialization as I could see that no specializations occurred after the remove.
I tried changing the add_or_delete_skins within extract_skins to get the DirtySpecializations resource and do dirty_specialization.changed_rendered.insert(skinned_mesh_entity), but this alone did not fix it. The system order is extract_skins -> clear_dirty_specializations -> queue_material_meshes, so the dirty entry gets cleared before it's processed by queue_material_meshes.
I tried slapping in an extract_skins.after(DirtySpecializationSystems::CheckForRemovals). That did fix the issue, but doesn't feel like the right solution. Seems more like clear_dirty_specializations should be at the very beginning of extraction, but I'm not familiar enough with ECS scheduling to suggest a change.
In case it's useful, here's a commit with the probably wrong fix + debugging hacks: greeble-dev@690408a
I don't think this bug should be blocker. Reasoning:
- Removing the
CacheSkincomponent is a niche case. - Skin caching is a new opt-in feature and arguably experimental.
- I suspect there's similar specialization bugs with skinned meshes that will be fixed by adding dirties to
extract_skins, so the problem might not be specific to skin caching.
| ); | ||
| #endif // VERTEX_NORMALS | ||
|
|
||
| // Skin the mikktspace tangent of the vertex, if applicable. |
There was a problem hiding this comment.
| // Skin the mikktspace tangent of the vertex, if applicable. | |
| // Skin the tangent of the vertex, if applicable. |
Nitpicky, but there's no guarantee that the mesh is mikktspace.
| } | ||
|
|
||
| // Do this check because we don't want to read past the end of the buffer. | ||
| if (skin_task_mid == arrayLength(&skin_tasks)) { |
There was a problem hiding this comment.
| if (skin_task_mid == arrayLength(&skin_tasks)) { | |
| if (skin_task_mid == (arrayLength(&skin_tasks) - 1)) { |
I think this is needed? If skin_task_mid == arrayLength then the next statement's skin_tasks[skin_task_mid + 1u] will read out of bounds?
| sorted_cached_skin_entities.sort_unstable_by_key(|main_entity| match render_mesh_instances_gpu | ||
| .get(main_entity) | ||
| { | ||
| Some(render_mesh_instance) => render_mesh_instance.gpu_specific.current_uniform_index(), | ||
| None => u32::MAX, | ||
| }); |
There was a problem hiding this comment.
I was a bit worried about this per-entity sort, particularly with a hash lookup in the sort function. Although in many_foxes with 1000 meshes, the cost is dwarfed by the joint animation and transform propagation costs:
- Animation+propagation = ~23ms (total core time, not end to end)
prepare_skin_cache_buffer= 0.13ms, of which sorting is 0.054ms.
I do wonder if it would be better to make skin caching a property of the Mesh rather than per-instance. From the content side, I'm struggling to think of a case where I'd want different instances of the same mesh to make different skin cache choices - so it's arguably simpler and more robust to choose per-asset and not have to worry about getting the component right per-instance. And from the render side, making the caching choice per-asset seems like it would simplify a few things. So overall it could be both a performance and UX win.
That said, I don't feel strongly either way and don't think it should block this PR, but maybe worth exploring later.
(EDIT: Should clarify - I don't think making it per-Mesh would have a direct impact on that particular sort since the goal is to find the mesh instance, not the mesh. But making things coarser grained would probably be a win somewhere.)
(EDIT: Separately, would suggest adding a comment to the sort explaining why it's there - if I hadn't seen the compute shader first then I would have been confused.)
|
Oops, forgot one thing - hit a limit which I think means the compute dispatch has to be broken up into chunks? Tested on Win10/Nvidia, Vulkan and DX12. cargo run --example many_foxes --profile bench --features "debug" -- --count 3000 --cache-skins |
At the moment, Bevy performs skinning and morph target deformation in the vertex shader. This is not what most production engines do, for multiple reasons:
Skinning each mesh in the vertex shader results in redundant work being done during multipass rendering, for example for prepasses, multiview, and shadow map rasterization.
Building raytracing acceleration structures for skinned meshes requires the skinning to be done ahead of time, because the only transform that's allowed to be done when building an acceleration structure is a single fixed-function matrix multiply.
However, there is also an upside to skinning in the vertex shader as Bevy does: it results in decreased memory usage, especially when drawing multiple instances of a skinned mesh. In my tests, I've found that the performance of skinning in the vertex shader is extremely close to that of skinning in a compute shader, even when a prepass and four shadow cascades are being rendered. Because of this, the primary benefit of skin caching in practice, at least on high-end hardware, seems to be in regards to point (2): Solari can't render skinned meshes with Bevy as it currently stands.
This PR fixes the issue by implementing skin caching: ahead-of-time evaluation of skins and morph targets in a compute shader. To opt in to skin caching for a mesh with skins and/or morph targets, add the
CacheSkincomponent to it. This causes Bevy to perform one compute shader dispatch per vertex slab in order to skin all the skinned meshes in that vertex slab at once.Perhaps surprisingly, the skinning shader doesn't write the skinned vertices into standard vertex slabs and instead stores them in a different buffer. This is by design, because not all vertex attributes can be skinned: for instance, UVs are unaffected by skinning. It would be wasteful for memory consumption if UVs were duplicated into every instance of a skinned mesh. Therefore, Bevy stores only the data that's affected by skins and/or morph targets in a separate buffer. The vertex shader consults the original vertex slab for the data unaffected by skinning and the skin cache for the data affected by skinning. This is compatible with raytracing acceleration structure building.
The
many_foxesexample, alongside several others, has been updated with a new--cache-skinswitch. Note that this is a performance regression on themany_foxesexample in particular. That's because that demo consists entirely of many instances of the same mesh, and so memory usage is greatly increased with skin caching. Skin caching is a double-edged sword, and while it's needed for some use cases, like raytracing, it's undesirable for others.