feat: Backport VRAM management patches for dmem cgroup #1889

Open: deepin-wm wants to merge 4 commits into deepin-community:linux-6.18.y from deepin-wm:vram-mgmt-backport

@deepin-wm

Summary

Backport VRAM management patches from pixelcluster's dmemcg-aggressive-protect branch to improve VRAM allocation for low-end GPUs.

These patches fix AMDGPU's VRAM management so that allocations made on behalf of applications protected by dmem cgroup limits (dmem.low/dmem.min) evict unprotected buffers more aggressively. This prevents a protected application's buffers from being forced into GTT (system RAM) while the application is still within its protection limits.

Changes

Patch 1: cgroup/dmem: Add queries for protection values

Add dmem_cgroup_below_min() and dmem_cgroup_below_low() helpers, counterparts to memcg's mem_cgroup_below_{min,low}. Callers can use these to be more aggressive in making space for allocations of a protected cgroup.
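
To illustrate what such a query boils down to, here is a minimal userspace sketch, assuming the effective protection values have already been computed for the pool. The `pool_model` struct and the `pool_below_*` names are hypothetical stand-ins, not the kernel's `struct dmem_cgroup_pool_state` or the helpers added by this patch; the comparison mirrors the memcg counterparts (usage at or below the effective limit counts as protected).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a dmem cgroup pool (not the kernel struct). */
struct pool_model {
	struct pool_model *parent;	/* NULL for the root pool */
	uint64_t usage;			/* bytes currently charged */
	uint64_t emin;			/* effective dmem.min, assumed precomputed */
	uint64_t elow;			/* effective dmem.low, assumed precomputed */
};

/* The subtree root and the hierarchy root are never treated as protected. */
static bool pool_unprotected(const struct pool_model *root,
			     const struct pool_model *test)
{
	return root == test || !test->parent;
}

/* True while the pool's usage is still covered by its effective min. */
static bool pool_below_min(const struct pool_model *root,
			   const struct pool_model *test)
{
	if (pool_unprotected(root, test))
		return false;
	return test->usage <= test->emin;
}

/* True while the pool's usage is still covered by its effective low. */
static bool pool_below_low(const struct pool_model *root,
			   const struct pool_model *test)
{
	if (pool_unprotected(root, test))
		return false;
	return test->usage <= test->elow;
}
```

A caller would consult both results before deciding how hard to push other buffers out of VRAM on behalf of the allocating cgroup.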

Patch 2: cgroup,cgroup/dmem: Add (dmem_)cgroup_common_ancestor helper

Add a helper to find the common ancestor of two cgroup pool states. This is needed to determine the correct subtree when making eviction decisions about protected buffers.
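
As a rough illustration of the ancestor-array approach, the sketch below models each cgroup with a per-level ancestor table and walks down from the deepest shared level until both tables agree. `cg_model`, `MODEL_MAX_DEPTH`, and both functions are hypothetical names for this example only; the real helper operates on `struct cgroup`.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEPTH 8	/* arbitrary limit for this sketch */

/* Hypothetical model of a cgroup with a per-level ancestor table. */
struct cg_model {
	int level;					/* 0 for the root */
	struct cg_model *ancestors[MODEL_MAX_DEPTH];	/* ancestors[level] is the node itself */
};

/* Build the ancestor table of a node from its parent's table. */
static void cg_model_init(struct cg_model *cg, struct cg_model *parent)
{
	cg->level = parent ? parent->level + 1 : 0;
	assert(cg->level < MODEL_MAX_DEPTH);
	for (int i = 0; i < cg->level; i++)
		cg->ancestors[i] = parent->ancestors[i];
	cg->ancestors[cg->level] = cg;
}

/* Deepest node that is an ancestor of (or equal to) both a and b. */
static struct cg_model *cg_model_common_ancestor(struct cg_model *a,
						 struct cg_model *b)
{
	int level = a->level < b->level ? a->level : b->level;

	/* Start at the deepest level both nodes have and move toward the root. */
	for (; level >= 0; level--)
		if (a->ancestors[level] == b->ancestors[level])
			return a->ancestors[level];

	return NULL;	/* the nodes belong to different hierarchies */
}
```

For two sibling cgroups this returns their parent, which is the subtree in which their protection values can be compared meaningfully.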

Patches 3-6 (adapted for 6.18.y):

  • drm/ttm: Extract code for attempting allocation in a place - Introduce struct ttm_bo_alloc_state and ttm_bo_alloc_at_place() for better allocation logic organization.
  • drm/ttm: Split cgroup charge and resource allocation - Separate cgroup charging from resource allocation via ttm_resource_try_charge() to fix race conditions when charge succeeds but allocation fails.
  • drm/ttm: Be more aggressive when allocating below protection limit - When the cgroup's memory usage is below low/min limit and allocation fails, try evicting unprotected buffers to make space.
  • drm/ttm: Use common ancestor of evictor and evictee as limit pool - Use the common ancestor cgroup for correct protection calculation when sibling cgroups compete for memory.
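
The interplay of the charge/allocation split and the more aggressive eviction can be sketched in userspace roughly as follows. This is a simplified model under stated assumptions, not the TTM code: all names (`cg_budget`, `vram_model`, `model_*`) are hypothetical, eviction is reduced to a byte counter, and the protection check is assumed to compare the post-charge usage against `low` only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical cgroup accounting: current usage, hard limit, low protection. */
struct cg_budget { uint64_t usage, max, low; };

/* Hypothetical VRAM state; unprotected is the part of used that may be evicted. */
struct vram_model { uint64_t capacity, used, unprotected; };

/* Step 1: charge the cgroup without touching VRAM. */
static int model_try_charge(struct cg_budget *cg, uint64_t size)
{
	assert(cg->usage <= cg->max);
	if (size > cg->max - cg->usage)
		return -EAGAIN;
	cg->usage += size;
	return 0;
}

static void model_uncharge(struct cg_budget *cg, uint64_t size)
{
	assert(cg->usage >= size);
	cg->usage -= size;
}

/* Step 2: allocate from VRAM, independent of the charge. */
static int model_alloc(struct vram_model *vram, uint64_t size)
{
	assert(vram->used <= vram->capacity);
	if (size > vram->capacity - vram->used)
		return -ENOSPC;
	vram->used += size;
	return 0;
}

/* Evict unprotected bytes until size fits or nothing evictable is left. */
static void model_evict_unprotected(struct vram_model *vram, uint64_t size)
{
	uint64_t free_bytes = vram->capacity - vram->used;
	uint64_t need = size > free_bytes ? size - free_bytes : 0;
	uint64_t evict = need < vram->unprotected ? need : vram->unprotected;

	vram->used -= evict;
	vram->unprotected -= evict;
}

/* Charge once, retry the allocation after eviction only if still protected. */
static int model_alloc_protected(struct cg_budget *cg, struct vram_model *vram,
				 uint64_t size)
{
	int ret = model_try_charge(cg, size);

	if (ret)
		return ret;

	ret = model_alloc(vram, size);
	if (ret == -ENOSPC && cg->usage <= cg->low) {
		model_evict_unprotected(vram, size);
		ret = model_alloc(vram, size);
	}
	if (ret)
		model_uncharge(cg, size);	/* never leave a charge without a resource */
	return ret;
}
```

Keeping the charge across the retry means the protection check sees a stable usage value, and the single uncharge on the failure path keeps the accounting balanced.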

Source

Patches from: https://pixelcluster.github.io/VRAM-Mgmt-fixed/
Original commits by Natalie Vock <natalie.vock@gmx.de>

Notes

  • Patches 1-2 applied cleanly from upstream
  • Patches 3-6 were adapted for the 6.18.y code structure (minor differences in TTM allocation loop)
  • Targeting linux-6.18.y branch as it already has the dmem cgroup controller infrastructure (the linux-6.6.y branch does not)
  • Userspace utilities (dmemcg-booster, plasma-foreground-booster) are also needed for full functionality but are separate packages

pixelcluster and others added 3 commits June 18, 2026 19:01
cgroup/dmem: Add queries for protection values

Callers can use this feedback to be more aggressive in making space for
allocations of a cgroup if they know it is protected.

These are counterparts to memcg's mem_cgroup_below_{min,low}.

Signed-off-by: Natalie Vock <natalie.vock@gmx.de>

cgroup,cgroup/dmem: Add (dmem_)cgroup_common_ancestor helper

This helps to find a common subtree of two resources, which is important
when determining whether it's helpful to evict one resource in favor of
another.

To facilitate this, add a common helper to find the ancestor of two
cgroups using each cgroup's ancestor array.

Signed-off-by: Natalie Vock <natalie.vock@gmx.de>

Backport the following patches from pixelcluster's dmemcg-aggressive-protect
branch, adapted for kernel 6.18.y:

- drm/ttm: Extract code for attempting allocation in a place
  Introduce ttm_bo_alloc_state and ttm_bo_alloc_at_place() to better
  organize allocation logic. Move limit_pool from ttm_bo_evict_walk
  to the new alloc_state structure.

- drm/ttm: Split cgroup charge and resource allocation
  Separate cgroup charging from resource allocation to fix race
  conditions when charge succeeds but allocation fails. Add
  ttm_resource_try_charge() for pre-charging cgroups before
  resource allocation attempts.

- drm/ttm: Be more aggressive when allocating below protection limit
  When the cgroup's memory usage is below the low/min limit and
  allocation fails, try evicting unprotected buffers to make space.
  This prevents application buffers from being forced into GTT even
  though usage is below the protection limit.

- drm/ttm: Use common ancestor of evictor and evictee as limit pool
  When checking whether to skip protected buffers, use the common
  ancestor of evictor and evictee cgroups as the limit pool. This
  ensures correct protection calculation for sibling cgroups.

Original patches by Natalie Vock <natalie.vock@gmx.de>
Adapted for deepin-community/kernel linux-6.18.y branch.

@sourcery-ai sourcery-ai Bot left a comment

Sorry @deepin-wm, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@deepin-ci-robot

Hi @deepin-wm. Thanks for your PR. 😃

I'm waiting for a deepin-community member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign avenger-285714 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Pull request overview

Backports VRAM (TTM) eviction/allocation changes and adds dmem-cgroup helpers to make allocations for cgroups under dmem.low/dmem.min more likely to stay in VRAM by preferentially evicting unprotected buffers.

Changes:

  • Add dmem-cgroup helpers to query whether a pool is below effective min/low, and to compute the common ancestor pool for correct subtree protection calculations.
  • Introduce ttm_resource_try_charge() and change ttm_resource_alloc() to separate cgroup charging from resource allocation.
  • Refactor TTM allocation/eviction flow (ttm_bo_alloc_at_place() and updated eviction walk) to be more aggressive when the allocating cgroup is protected.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| kernel/cgroup/dmem.c | Adds below_min/low queries and a common-ancestor helper used by TTM’s protection-aware eviction decisions. |
| include/linux/cgroup.h | Adds cgroup_common_ancestor() helper to support common-ancestor queries. |
| include/linux/cgroup_dmem.h | Exposes new dmem-cgroup helper APIs to other subsystems (TTM). |
| include/drm/ttm/ttm_resource.h | Updates TTM resource API to separate charging (try_charge) from allocation. |
| drivers/gpu/drm/ttm/ttm_resource.c | Implements ttm_resource_try_charge() and updates ttm_resource_alloc() to accept a pre-charged pool. |
| drivers/gpu/drm/ttm/ttm_bo.c | Refactors allocation/eviction and integrates dmem-cgroup protection logic into the eviction walk. |

Comment thread kernel/cgroup/dmem.c
Comment on lines +706 to +710
bool dmem_cgroup_below_min(struct dmem_cgroup_pool_state *root,
			   struct dmem_cgroup_pool_state *test)
{
	if (root == test || !pool_parent(test))
		return false;

@deepin-wm (Author)

Fixed in bd7e681. Added if (!test) return false; at the start of dmem_cgroup_below_min() to treat NULL test as 'not protected'.

Comment thread kernel/cgroup/dmem.c
Comment on lines +737 to +741
bool dmem_cgroup_below_low(struct dmem_cgroup_pool_state *root,
			   struct dmem_cgroup_pool_state *test)
{
	if (root == test || !pool_parent(test))
		return false;

@deepin-wm (Author)

Fixed in bd7e681. Added if (!test) return false; at the start of dmem_cgroup_below_low() to treat NULL test as 'not protected'.

Comment thread kernel/cgroup/dmem.c
Comment on lines +769 to +786
struct dmem_cgroup_pool_state *dmem_cgroup_get_common_ancestor(struct dmem_cgroup_pool_state *a,
								struct dmem_cgroup_pool_state *b)
{
	struct cgroup *ancestor_cgroup;
	struct cgroup_subsys_state *ancestor_css;

	if (!a || !b)
		return NULL;

	ancestor_cgroup = cgroup_common_ancestor(a->cs->css.cgroup, b->cs->css.cgroup);
	if (!ancestor_cgroup)
		return NULL;

	ancestor_css = cgroup_e_css(ancestor_cgroup, &dmem_cgrp_subsys);
	css_get(ancestor_css);

	return get_cg_pool_unlocked(css_to_dmemcs(ancestor_css), a->region);
}

@deepin-wm (Author)

Fixed in bd7e681. Added IS_ERR_OR_NULL check on get_cg_pool_unlocked return value, with css_put(ancestor_css) on error path. Returns NULL on failure.
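
The shape of that error path can be sketched in userspace as below. Everything here is a hypothetical model: `model_err_ptr()` and `model_is_err_or_null()` imitate the kernel's `ERR_PTR()` and `IS_ERR_OR_NULL()` conventions (errno values encoded in the top 4095 addresses), `ref_model` stands in for the css reference count, and `lookup_pool()` stands in for a lookup that may fail. On success the caller is assumed to own the reference taken here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Userspace imitations of the kernel's error-pointer helpers. */
static void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static bool model_is_err_or_null(const void *ptr)
{
	return !ptr || (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Hypothetical reference-counted object and pool. */
struct ref_model { int refcount; };
struct pool_model { int id; };

static void ref_get(struct ref_model *ref)
{
	ref->refcount++;
}

static void ref_put(struct ref_model *ref)
{
	assert(ref->refcount > 0);
	ref->refcount--;
}

/* Stand-in for a lookup that can return NULL or an encoded errno. */
static struct pool_model *lookup_pool(struct pool_model *pool, int fail_errno)
{
	if (fail_errno)
		return model_err_ptr(-fail_errno);
	return pool;
}

/* Take a reference, look up the pool, and drop the reference on any failure. */
static struct pool_model *get_ancestor_pool(struct ref_model *css,
					    struct pool_model *pool,
					    int fail_errno)
{
	struct pool_model *found;

	ref_get(css);
	found = lookup_pool(pool, fail_errno);
	if (model_is_err_or_null(found)) {
		ref_put(css);	/* balance the ref_get() above */
		return NULL;	/* callers only need to handle NULL */
	}
	return found;
}
```

Collapsing both failure modes into NULL keeps the caller's check to a single pointer test, at the cost of losing the specific errno.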

Comment thread kernel/cgroup/dmem.c
Comment on lines +761 to +764
 * dmem_cgroup_get_common_ancestor(): Find the first common ancestor of two pools.
 * @a: First pool to find the common ancestor of.
 * @b: First pool to find the common ancestor of.
 *

@deepin-wm (Author)

Fixed in bd7e681. Changed @b: First pool to @b: Second pool in the kernel-doc.

Comment on lines +582 to +587
if (!alloc_state->in_evict) {
	may_evict |= dmem_cgroup_below_min(NULL, alloc_state->charge_pool);
	alloc_state->may_try_low = may_evict;

	may_evict |= dmem_cgroup_below_low(NULL, alloc_state->charge_pool);
}

@deepin-wm (Author)

Fixed in bd7e681. Guarded the protection checks with if (!alloc_state->in_evict && alloc_state->charge_pool) so they only run when a pool was actually charged.
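
A small userspace model of the guarded flag computation, as I read the snippet and the fix: the struct and function names are hypothetical, the protection queries are reduced to direct comparisons against assumed precomputed effective values, and `may_try_low` is assumed to default to false when the guard is not taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool with precomputed effective protection values. */
struct pool_model { uint64_t usage, emin, elow; };

/* Hypothetical result pair mirroring may_evict and may_try_low. */
struct alloc_flags { bool may_evict, may_try_low; };

static struct alloc_flags model_protection_flags(const struct pool_model *charge_pool,
						 bool in_evict, bool may_evict)
{
	struct alloc_flags flags = { .may_evict = may_evict, .may_try_low = false };

	/* Only consult protection when not already evicting and a pool was charged. */
	if (!in_evict && charge_pool) {
		flags.may_evict |= charge_pool->usage <= charge_pool->emin;
		/* Captured before the low check, as in the snippet under review. */
		flags.may_try_low = flags.may_evict;
		flags.may_evict |= charge_pool->usage <= charge_pool->elow;
	}
	return flags;
}
```

With a NULL `charge_pool` the incoming `may_evict` value passes through unchanged, which matches the intent of treating uncharged allocations as having no protection.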

Comment thread drivers/gpu/drm/ttm/ttm_bo.c Outdated
Comment on lines +643 to +645
if (!evict_walk->alloc_state->may_try_low &&
    bo->resource->css == evict_walk->alloc_state->charge_pool)
	return 0;

@deepin-wm (Author)

Fixed in bd7e681. Added alloc_state->charge_pool check before the same-cgroup comparison, so uncharged BOs are not incorrectly skipped.

Comment thread drivers/gpu/drm/ttm/ttm_bo.c Outdated
Comment on lines +667 to +669
evict_valuable = dmem_cgroup_state_evict_valuable(limit_pool, bo->resource->css,
						  evict_walk->try_low,
						  &evict_walk->hit_low);

@deepin-wm (Author)

Fixed in bd7e681. Added early return with evict_valuable = true when bo->resource->css is NULL, treating uncharged resources as unprotected and evictable.
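
Taken together, the last two fixes give the eviction callback roughly the following decision order. This is a userspace sketch under assumptions: `model_evict_valuable()` and `pool_model` are hypothetical, the common-ancestor limit pool is left out, and the min/low handling is a simplified guess at what `dmem_cgroup_state_evict_valuable()` decides rather than its actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool with precomputed effective protection values. */
struct pool_model { uint64_t usage, emin, elow; };

/*
 * Decide whether a buffer charged to bo_pool is worth evicting for an
 * allocation charged to charge_pool. Either pool may be NULL (uncharged).
 */
static bool model_evict_valuable(const struct pool_model *charge_pool,
				 const struct pool_model *bo_pool,
				 bool may_try_low, bool try_low, bool *hit_low)
{
	/* Uncharged buffers carry no protection and are always evictable. */
	if (!bo_pool)
		return true;

	/* Skip same-cgroup buffers only when a pool was actually charged. */
	if (!may_try_low && charge_pool && bo_pool == charge_pool)
		return false;

	/* Buffers within their min protection are never evicted. */
	if (bo_pool->usage <= bo_pool->emin)
		return false;

	/* Buffers within low protection are skipped on the first pass. */
	if (!try_low && bo_pool->usage <= bo_pool->elow) {
		*hit_low = true;	/* tell the caller a low-protected pass may help */
		return false;
	}
	return true;
}
```

Checking for a NULL `bo_pool` first keeps every later comparison free of NULL dereferences.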

Fix issues identified by Copilot code review:

dmem.c:
- Add NULL test pointer check in dmem_cgroup_below_min/low to prevent
  NULL dereference when called with uncharged pools
- Fix dmem_cgroup_get_common_ancestor ERR_PTR handling: check for
  IS_ERR_OR_NULL return from get_cg_pool_unlocked and release
  css reference on failure
- Fix kernel-doc: @b parameter was incorrectly described as 'First pool'

ttm_bo.c:
- Guard dmem_cgroup_below_min/low calls with charge_pool check to
  prevent NULL dereference when resource manager has no dmem cgroup
- Only skip same-cgroup BOs during eviction when charge_pool is
  non-NULL to avoid blocking all uncharged resource eviction
- Handle NULL css in eviction callback: treat uncharged resources
  as unprotected/evictable instead of passing NULL to
  dmem_cgroup_state_evict_valuable