Skip to content

Implement opt-in compute-based skin caching.#23255

Open
pcwalton wants to merge 20 commits intobevyengine:mainfrom
pcwalton:skin-caching
Open

Implement opt-in compute-based skin caching.#23255
pcwalton wants to merge 20 commits intobevyengine:mainfrom
pcwalton:skin-caching

Conversation

@pcwalton
Copy link
Copy Markdown
Contributor

@pcwalton pcwalton commented Mar 7, 2026

At the moment, Bevy performs skinning and morph target deformation in the vertex shader. This is not what most production engines do, for multiple reasons:

  1. Skinning each mesh in the vertex shader results in redundant work being done during multipass rendering, for example for prepasses, multiview, and shadow map rasterization.

  2. Building raytracing acceleration structures for skinned meshes requires the skinning to be done ahead of time, because the only transform that's allowed to be done when building an acceleration structure is a single fixed-function matrix multiply.

However, there is also an upside to skinning in the vertex shader as Bevy does: it results in decreased memory usage, especially when drawing multiple instances of a skinned mesh. In my tests, I've found that the performance of skinning in the vertex shader is extremely close to that of skinning in a compute shader, even when a prepass and four shadow cascades are being rendered. Because of this, the primary benefit of skin caching in practice, at least on high-end hardware, seems to be in regards to point (2): Solari can't render skinned meshes with Bevy as it currently stands.

This PR fixes the issue by implementing skin caching: ahead-of-time evaluation of skins and morph targets in a compute shader. To opt in to skin caching for a mesh with skins and/or morph targets, add the CacheSkin component to it. This causes Bevy to perform one compute shader dispatch per vertex slab in order to skin all the skinned meshes in that vertex slab at once.

Perhaps surprisingly, the skinning shader doesn't write the skinned vertices into standard vertex slabs and instead stores them in a different buffer. This is by design, because not all vertex attributes can be skinned: for instance, UVs are unaffected by skinning. It would be wasteful for memory consumption if UVs were duplicated into every instance of a skinned mesh. Therefore, Bevy stores only the data that's affected by skins and/or morph targets in a separate buffer. The vertex shader consults the original vertex slab for the data unaffected by skinning and the skin cache for the data affected by skinning. This is compatible with raytracing acceleration structure building.

The many_foxes example, alongside several others, has been updated with a new --cache-skin switch. Note that this is a performance regression on the many_foxes example in particular. That's because that demo consists entirely of many instances of the same mesh, and so memory usage is greatly increased with skin caching. Skin caching is a double-edged sword, and while it's needed for some use cases, like raytracing, it's undesirable for others.

@pcwalton pcwalton requested review from JMS55, atlv24 and tychedelia March 7, 2026 18:50
@pcwalton pcwalton added the A-Rendering Drawing game state to the screen label Mar 7, 2026
@github-project-automation github-project-automation bot moved this to Needs SME Triage in Rendering Mar 7, 2026
@pcwalton pcwalton added the A-Animation Make things move and change over time label Mar 7, 2026
@github-project-automation github-project-automation bot moved this to Needs SME Triage in Animation Mar 7, 2026
@pcwalton pcwalton added S-Needs-Review Needs reviewer attention (from anyone!) to move forward C-Feature A new feature, making something new possible C-Performance A change motivated by improving speed, memory usage or compile times labels Mar 7, 2026
At the moment, Bevy performs skinning and morph target deformation in
the vertex shader. This is not what most production engines do, for
multiple reasons:

1. Skinning each mesh in the vertex shader results in redundant work
   being done during multipass rendering, for example for prepasses,
   multiview, and shadow map rasterization.

2. Building raytracing acceleration structures for skinned meshes
   requires the skinning to be done ahead of time, because the only
   transform that's allowed to be done when building an acceleration
   structure is a single fixed-function matrix multiply.

However, there is also an upside to skinning in the vertex shader as
Bevy does: it results in decreased memory usage, especially when drawing
multiple instances of a skinned mesh. In my tests, I've found that the
performance of skinning in the vertex shader is extremely close to that
of skinning in a compute shader, even when a prepass and four shadow
cascades are being rendered. Because of this, the primary benefit of
skin caching in practice, at least on high-end hardware, seems to be in
regards to point (2): Solari can't render skinned meshes with Bevy as it
currently stands.

This PR fixes the issue by implementing skin caching: ahead-of-time
evaluation of skins and morph targets in a compute shader. To opt in to
skin caching for a mesh with skins and/or morph targets, add the
`CacheSkin` component to it. This causes Bevy to perform one compute
shader dispatch per vertex slab in order to skin all the skinned meshes
in that vertex slab at once.

Perhaps surprisingly, the skinning shader doesn't write the skinned
vertices into standard vertex slabs and instead stores them in a
different buffer. This is by design, because not all vertex attributes
can be skinned: for instance, UVs are unaffected by skinning. It would
be wasteful for memory consumption if UVs were duplicated into every
instance of a skinned mesh. Therefore, Bevy stores only the data that's
affected by skins and/or morph targets in a separate buffer. The vertex
shader consults the original vertex slab for the data unaffected by
skinning and the skin cache for the data affected by skinning. This is
compatible with raytracing acceleration structure building.

The `many_foxes` example, alongside several others, has been updated
with a new `--cache-skin` switch. Note that *this is a performance
regression* on the `many_foxes` example in particular. That's because
that demo consists entirely of many instances of the same mesh, and so
memory usage is greatly increased with skin caching. Skin caching is a
double-edged sword, and while it's needed for some use cases, like
raytracing, it's undesirable for others.
Copy link
Copy Markdown
Contributor

@greeble-dev greeble-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a shallow review - the overall design and a lot of the details are above my pay grade.

There were a bunch of issues with WebGL even without skin caching enabled. I was able to get it mostly working by hacking out a few things - see various comments. There clearly needs to be some extra logic to handle this, but I wasn't sure where that logic should go. Should individual systems handle it or should SkinCachePlugin never register those systems in the first place? Does there need to be a global flag similar to skins_use_uniform_buffers?

Separately, previous skinned vertices seem broken except for the first instance. Tested on Win10/Nvidia, DX12 and Vulkan: cargo run --example many_foxes -- --count 6 --cache-skins --motion-blur.

image

Maybe worth noting that #21926 will collide with this PR - skin_cache.wgsl will need to handle compressed vertices. Or maybe the practical short-term step is to only allow skin caching if the mesh has an uncompressed layout, similar to Solari's is_mesh_raytracing_compatible?

Comment on lines +14 to +21
/// plays an animation on a skinned glTF model of a fox
#[derive(FromArgs, Resource)]
struct Args {
/// enable skin caching
#[argh(switch)]
cache_skins: bool,
}

Copy link
Copy Markdown
Contributor

@greeble-dev greeble-dev Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think skin caching options should be added to the animated_mesh and morph_targets examples. Their role is to be minimal examples of animation APIs.

Comment thread crates/bevy_pbr/src/render/skin/cache.rs
Comment thread crates/bevy_pbr/src/render/mesh.rs Outdated
&model,
skin,
prev_skin,
Some(skin_cache_buffers),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Some(skin_cache_buffers) should depend on whether skin caching is supported? I could only get WebGL working by hacking this to be None.

Comment thread crates/bevy_pbr/src/render/mesh_bindings.rs Outdated
// If there are no morph targets, this will be a dummy buffer.
@group(0) @binding(7) var<storage> morph_descriptors: array<MorphDescriptor>;

@compute @workgroup_size(64, 1, 1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@compute @workgroup_size(64, 1, 1)
// The workgroup size should match `SKIN_CACHE_WORKGROUP_SIZE`.
@compute @workgroup_size(64, 1, 1)

Comment thread crates/bevy_pbr/src/render/forward_io.wgsl
// caching shader.
match *self {
ElementClass::Vertex => BufferUsages::VERTEX,
ElementClass::Vertex => BufferUsages::VERTEX | BufferUsages::STORAGE,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can hurt some GPUs iirc. Hence why I added extra_buffer_usages to MeshAllocator for Solari. Maybe we can set this automatically when SkinCache is used?

Copy link
Copy Markdown
Contributor Author

@pcwalton pcwalton Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would mean we would either have to:

  1. Copy all the mesh data for every single mesh in the system whenever SkinCache is added to any mesh, and also when the last one is removed. Or:
  2. Put skin-cached meshes in different slabs, increasing drawcall count, increasing fragmentation, and regressing memory usage. And then have to duplicate the mesh data if the same mesh is used in a skin-cached instance and simultaneously in a non-skin-cached instance, which would break the assumption used throughout rendering that meshes and their mesh buffers are in 1:1. Or:
  3. Put skinned meshes in different slabs, increasing drawcall count, increasing fragmentation, and regressing memory usage, even if the skin cache isn't used.

I didn't think any of these 3 options were worth the cost unless we actually observe a major regression (and if we ever find such a GPU maybe we should just disable skin caching entirely on that GPU instead of trying to do fine-grained gymnastics on the buffers).

Copy link
Copy Markdown
Member

@beicause beicause Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should not be added unconditionally, as it may be slower on mobile GPUs.

If users need compute skinning, this usage can be added to extra_buffer_usages manually. We just need to emit an error or panic if it will fail to use cached skin.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only realistic way I can see to do this would be to make the skin caching plugin not part of DefaultPlugins, so you have to add it manually. Does that work for people?

}

#[derive(ShaderType, Clone)]
pub struct MeshUniform {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shame to make this even more expensive :/.

We might want to consider splitting this up in the future.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering yesterday if we could pack the new offsets into current_skin_index and morph_descriptor_index, capping everything at 65k. It's probably a bit too limiting though. Maybe we could push this new data into something indexed by current_skin_index?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MeshUniform (as opposed to MeshInputUniform) is solely read and written on GPU in the GPU driven path. Mesh preprocessing is never a bottleneck in the profiles I've seen. The GPU's enormous memory bandwidth eats it up.

Copy link
Copy Markdown
Contributor

@atlv24 atlv24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna give this a round of testing and then approve I think.

}

#[derive(ShaderType, Clone)]
pub struct MeshUniform {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering yesterday if we could pack the new offsets into current_skin_index and morph_descriptor_index, capping everything at 65k. It's probably a bit too limiting though. Maybe we could push this new data into something indexed by current_skin_index?

Comment thread examples/stress_tests/many_foxes.rs Outdated
}
}

/// Adds `CacheSkin` components to skinned meshes if skin caching was requested
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Adds `CacheSkin` components to skinned meshes if skin caching was requested
/// Adds [`CacheSkin`] components to skinned meshes if skin caching was requested

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an opinion as to whether we should have the --cache-skins option at all? @greeble-dev objected to it. I don't have an opinion, I just want to make sure my reviewers are in agreement before I decide what to do.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, I think adding it to the animated_mesh and morph_targets examples is bad. Adding it to many_foxes and other stress tests is good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed comment

Comment thread examples/stress_tests/many_morph_targets.rs Outdated
Comment thread examples/animation/morph_targets.rs Outdated
Comment thread examples/animation/animated_mesh.rs Outdated
Comment thread crates/bevy_pbr/src/render/skin/cache.rs
Comment thread crates/bevy_pbr/src/render/skin/cache.rs Outdated
Comment thread crates/bevy_pbr/src/render/mesh.rs Outdated
Comment thread crates/bevy_pbr/src/render/mesh.rs Outdated
Comment on lines +255 to +256
/// The vertex slab ID.
pub vertex_slab_id: SlabId,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you avoid writing tautological comments like this please? they're just noise

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the comment longer so that it adds information that's maybe not obvious from the name.

Comment thread crates/bevy_pbr/src/render/skin/cache.rs Outdated
Comment on lines +846 to +847
/// The layout for the bind group.
bind_group_layout: &'a BindGroupLayout,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the comment longer so it doesn't just restate things that are in the names.

Copy link
Copy Markdown
Contributor

@atlv24 atlv24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this and it looks like a regression in many_foxes, both with and without --release. Frame time is 3.4-4.1ms without skin caching, and 4.3-5.3ms with, on my rtx 5090. But at least without --skin-caching, performance matches main, so you can just not turn it on if its slower. Edit: this is noted in the PR description

@pcwalton
Copy link
Copy Markdown
Contributor Author

pcwalton commented Mar 25, 2026

Yep, skin caching is basically the worst case for many_foxes. Honestly, I feel like the biggest win for skin caching in general is going to be the compatibility with Solari more than anything performance related. I think it should probably be off by default.

@greeble-dev
Copy link
Copy Markdown
Contributor

In case there's any doubt about the utility of skin caching, I'm just gonna note that it sets the groundwork for other features and optimisations in addition to ray-tracing - so anything that wants to run compute on a whole vertex buffer rather than per-vertex. High-end face animation wants it for tangent recalcs and sparse morphs. Physics wants it for GPU cloth sims and deformations. High-end rigging wants it for muscle sims. Lots of fun stuff.

@pcwalton pcwalton added S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Apr 9, 2026
@pcwalton
Copy link
Copy Markdown
Contributor Author

Fixed motion blur. The problem was that we didn't copy over prev_cached_skin_offset in mesh_preprocess.wgsl.

@pcwalton
Copy link
Copy Markdown
Contributor Author

@atlv24 Regarding your comment about MeshUniform, as I mentioned before, I think its size doesn't matter much as it's entirely read and written on GPU which has enormous memory bandwidth. I don't really want to make a whole separate buffer and add extra indexing to it just to save a word or two on GPU, especially since I think we're already coming close to WebGPU storage limits.

@pcwalton
Copy link
Copy Markdown
Contributor Author

OK, I've addressed or responded to every review comment. WebGL 2 is currently broken on main, so I can't test it, but I verified that it isn't broken any worse than it already is. This patch is now properly gated off when we're using WebGL 2, so I don't anticipate any problems.

@pcwalton pcwalton added S-Needs-Review Needs reviewer attention (from anyone!) to move forward and removed S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged labels Apr 19, 2026
Copy link
Copy Markdown
Contributor

@greeble-dev greeble-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few minor comments.

Comment thread crates/bevy_pbr/src/render/skin/cache.rs
Comment on lines +366 to +369
/// References to the buffers that contain cached skinned vertices for the mesh
/// instances corresponding to a vertex/morph target slab pair.
#[derive(Clone, Copy)]
pub struct SkinCacheBuffers<'a> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it confusing to have SkinCacheBuffers when there's also CachedSkinBuffers. Not sure what a good rename would be - maybe SkinCacheBuffers -> CachedSkinBuffersForBindGroup?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rename CachedSkinBuffers to GlobalSkinCacheBuffers.

@pcwalton pcwalton requested a review from greeble-dev April 20, 2026 00:49
@beicause
Copy link
Copy Markdown
Member

I left a comment in #23255 (comment)

I think it should not be added unconditionally, as it may be slower on mobile GPUs.

If users need compute skinning, this usage can be added to extra_buffer_usages manually. We just need to emit an error or panic if it will fail to use cached skin.

Copy link
Copy Markdown
Contributor

@greeble-dev greeble-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a fuller review - got some nitpicky/speculative/non-blocking comments, and one case where I think a binary search isn't quite right.

>,
skinned_mesh_inverse_bindposes: &Assets<SkinnedMeshInverseBindposes>,
joints: &Query<&GlobalTransform>,
removed_cache_skin_query: &mut RemovedComponents<CacheSkin>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that skins would break if I removed the CacheSkin component after the mesh had been rendered at least once. Smelt like stale specialization as I could see that no specializations occurred after the remove.

I tried changing the add_or_delete_skins within extract_skins to get the DirtySpecializations resource and do dirty_specialization.changed_rendered.insert(skinned_mesh_entity), but this alone did not fix it. The system order is extract_skins -> clear_dirty_specializations -> queue_material_meshes, so the dirty entry gets cleared before it's processed by queue_material_meshes.

I tried slapping in an extract_skins.after(DirtySpecializationSystems::CheckForRemovals). That did fix the issue, but doesn't feel like the right solution. Seems more like clear_dirty_specializations should be at the very beginning of extraction, but I'm not familiar enough with ECS scheduling to suggest a change.

In case it's useful, here's a commit with the probably wrong fix + debugging hacks: greeble-dev@690408a

I don't think this bug should be blocker. Reasoning:

  1. Removing the CacheSkin component is a niche case.
  2. Skin caching is a new opt-in feature and arguably experimental.
  3. I suspect there's similar specialization bugs with skinned meshes that will be fixed by adding dirties to extract_skins, so the problem might not be specific to skin caching.

);
#endif // VERTEX_NORMALS

// Skin the mikktspace tangent of the vertex, if applicable.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Skin the mikktspace tangent of the vertex, if applicable.
// Skin the tangent of the vertex, if applicable.

Nitpicky, but there's no guarantee that the mesh is mikktspace.

}

// Do this check because we don't want to read past the end of the buffer.
if (skin_task_mid == arrayLength(&skin_tasks)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (skin_task_mid == arrayLength(&skin_tasks)) {
if (skin_task_mid == (arrayLength(&skin_tasks) - 1)) {

I think this is needed? If skin_task_mid == arrayLength then the next statement's skin_tasks[skin_task_mid + 1u] will read out of bounds?

Comment on lines +640 to +645
sorted_cached_skin_entities.sort_unstable_by_key(|main_entity| match render_mesh_instances_gpu
.get(main_entity)
{
Some(render_mesh_instance) => render_mesh_instance.gpu_specific.current_uniform_index(),
None => u32::MAX,
});
Copy link
Copy Markdown
Contributor

@greeble-dev greeble-dev Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit worried about this per-entity sort, particularly with a hash lookup in the sort function. Although in many_foxes with 1000 meshes, the cost is dwarfed by the joint animation and transform propagation costs:

  • Animation+propagation = ~23ms (total core time, not end to end)
  • prepare_skin_cache_buffer = 0.13ms, of which sorting is 0.054ms.

I do wonder if it would be better to make skin caching a property of the Mesh rather than per-instance. From the content side, I'm struggling to think of a case where I'd want different instances of the same mesh to make different skin cache choices - so it's arguably simpler and more robust to choose per-asset and not have to worry about getting the component right per-instance. And from the render side, making the caching choice per-asset seems like it would simplify a few things. So overall it could be both a performance and UX win.

That said, I don't feel strongly either way and don't think it should block this PR, but maybe worth exploring later.

(EDIT: Should clarify - I don't think making it per-Mesh would have a direct impact on that particular sort since the goal is to find the mesh instance, not the mesh. But making things coarser grained would probably be a win somewhere.)

(EDIT: Separately, would suggest adding a comment to the sort explaining why it's there - if I hadn't seen the compute shader first then I would have been confused.)

@greeble-dev
Copy link
Copy Markdown
Contributor

Oops, forgot one thing - hit a limit which I think means the compute dispatch has to be broken up into chunks? Tested on Win10/Nvidia, Vulkan and DX12.

cargo run --example many_foxes --profile bench --features "debug" -- --count 3000 --cache-skins
ERROR RenderContextState::apply{system=bevy_pbr::render::skin::cache::skin_cache}: bevy_render::error_handler: Caught rendering error: Validation Error

Caused by:
  In a CommandEncoder
    In a dispatch command, indirect:false
      Each current dispatch group size dimension ([81000, 1, 1]) must be less or equal to 65535

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Animation Make things move and change over time A-Rendering Drawing game state to the screen C-Feature A new feature, making something new possible C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Review Needs reviewer attention (from anyone!) to move forward

Projects

Status: Needs SME Triage
Status: Needs SME Triage

Development

Successfully merging this pull request may close these issues.

5 participants