
[Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] [Intel] Intel-SIG: backport cluster scheduler wakeup optimization#827

Merged: opsiff merged 3 commits into deepin-community:linux-6.6.y from Avenger-285714:intel-cluster-scheduler on Jun 1, 2025
@Avenger-285714 Avenger-285714 commented Jun 1, 2025

bugzilla: [https://bugzilla.openanolis.cn/show_bug.cgi?id=8001]

Backport the follow-up work to support the cluster scheduler. Previously
we added a cluster level to the scheduler for both ARM64[1] and
X86[2] to support load balancing between clusters, which brings more memory
bandwidth and decreases cache contention. This patchset, on the other
hand, takes care of the wake-up path by giving CPUs within the same cluster
a try before scanning the whole LLC, to benefit tasks that communicate
with each other.

Barry Song (2):
sched: Add cpus_share_resources API
sched/fair: Scan cluster before scanning LLC in wake-up path

Yicong Yang (1):
sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup

Link: https://gitee.com/anolis/cloud-kernel/pulls/2678

Summary by Sourcery

Backport cluster scheduler wakeup optimizations to prioritize CPUs within the same cluster on task wakeup paths.

New Features:

  • Introduce SD_CLUSTER scheduling domain flag and static key sched_cluster_active to enable cluster-level scheduling
  • Add cpus_share_resources API to detect CPUs sharing cluster-level resources

Enhancements:

  • Modify select_idle_cpu to scan cluster members before scanning the entire LLC on wakeup
  • Update select_idle_sibling to prefer previous or recent-used CPU within the same cluster when no idle CPU is found
  • Extend scheduler topology to assign per-CPU sd_share_id and activate cluster scheduling paths

Barry Song and others added 3 commits June 1, 2025 21:42
ANBZ: #8001

commit b95303e upstream.

Add cpus_share_resources() API. This is the preparation for the
optimization of select_idle_cpu() on platforms with cluster scheduler
level.

On a machine with clusters, cpus_share_resources() tests whether
two CPUs are within the same cluster. On a non-cluster machine it
behaves the same as cpus_share_cache(). So "resources" here
means cache resources.
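
The semantics described above can be sketched in userspace C. This is a minimal model, assuming the approach named elsewhere in this PR (comparing a per-CPU sd_share_id); the `model_` prefix, the `NR_CPUS` value, the plain array, and the `model_set_share_span()` helper are hypothetical and exist only for illustration, not in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8	/* hypothetical machine size for this sketch */

/* Userspace stand-in for the per-CPU sd_share_id: the ID of the smallest
 * cache-sharing group (cluster if present, otherwise the LLC) of each CPU. */
static int sd_share_id[NR_CPUS];

/* Hypothetical setup helper: group CPUs in runs of `span` consecutive CPUs
 * and identify each group by its first CPU. */
void model_set_share_span(int span)
{
	assert(span > 0);
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sd_share_id[cpu] = cpu - (cpu % span);
}

/* Model of cpus_share_resources(): true if both CPUs are in the same group. */
bool model_cpus_share_resources(int this_cpu, int that_cpu)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS);
	assert(that_cpu >= 0 && that_cpu < NR_CPUS);

	if (this_cpu == that_cpu)
		return true;

	return sd_share_id[this_cpu] == sd_share_id[that_cpu];
}
```

With a span of 4 the model treats CPUs 0-3 and 4-7 as two clusters, so CPUs 3 and 4 do not share resources. With a span of 8 (no cluster level, one LLC) every pair shares resources, which matches the cpus_share_cache() behaviour described in the commit message.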

Intel-SIG: commit b95303e sched: Add cpus_share_resources API.
Cluster based task wakeup optimization backport.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
ANBZ: #8001

commit 8881e16 upstream.

For platforms having clusters like Kunpeng920, CPUs within the same cluster
have lower latency when synchronizing and accessing shared resources like
cache. Thus, this patch tries to find an idle CPU within the cluster of the
target CPU before scanning the whole LLC to gain lower latency. This
is implemented in 2 steps in select_idle_sibling():
1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them
   if they share a cluster with the target CPU. Otherwise, try to
   scan for an idle CPU in the target's cluster.
2. Scan the cluster prior to the LLC of the target CPU for an
   idle CPU to wake up.
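
Step 2 can be sketched as follows. This is a simplified userspace model, not the kernel code: CPUs are bits in a 32-bit mask, and `first_cpu_in()` and `scan_cluster_then_llc()` are hypothetical names. The real scan walks cpumasks starting near the target and bounds its depth, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 32	/* hypothetical: one bit per CPU in a uint32_t */

/* Return the lowest-numbered CPU present in both masks, or -1 if none. */
static int first_cpu_in(uint32_t candidates, uint32_t idle)
{
	uint32_t hits = candidates & idle;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (hits & (UINT32_C(1) << cpu))
			return cpu;
	return -1;
}

/* Look for an idle CPU in the target's cluster first, then in the rest of
 * the LLC. Returns -1 when the whole LLC is busy. */
int scan_cluster_then_llc(uint32_t llc, uint32_t cluster, uint32_t idle)
{
	assert((cluster & ~llc) == 0);	/* the cluster must lie inside the LLC */

	int cpu = first_cpu_in(cluster, idle);
	if (cpu >= 0)
		return cpu;

	/* Cluster CPUs were already checked, so drop them from the LLC scan. */
	return first_cpu_in(llc & ~cluster, idle);
}
```

For an LLC of CPUs 0-15 and a target cluster of CPUs 4-7, an idle set of {0, 6} yields CPU 6: the in-cluster CPU wins even though CPU 0 has the lower number.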

Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and to two
NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each cluster has 4 CPUs.

With this patch, we noticed improvements on tbench and netperf within one
NUMA node or across two NUMA nodes on top of tip-sched-core commit
9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()")

tbench results (node 0):
            baseline                     patched
  1:        327.2833        372.4623 (   13.80%)
  4:       1320.5933       1479.8833 (   12.06%)
  8:       2638.4867       2921.5267 (   10.73%)
 16:       5282.7133       5891.5633 (   11.53%)
 32:       9810.6733       9877.3400 (    0.68%)
 64:       7408.9367       7447.9900 (    0.53%)
128:       6203.2600       6191.6500 (   -0.19%)
tbench results (node 0-1):
            baseline                     patched
  1:        332.0433        372.7223 (   12.25%)
  4:       1325.4667       1477.6733 (   11.48%)
  8:       2622.9433       2897.9967 (   10.49%)
 16:       5218.6100       5878.2967 (   12.64%)
 32:      10211.7000      11494.4000 (   12.56%)
 64:      13313.7333      16740.0333 (   25.74%)
128:      13959.1000      14533.9000 (    4.12%)

netperf results TCP_RR (node 0):
            baseline                     patched
  1:      76546.5033      90649.9867 (   18.42%)
  4:      77292.4450      90932.7175 (   17.65%)
  8:      77367.7254      90882.3467 (   17.47%)
 16:      78519.9048      90938.8344 (   15.82%)
 32:      72169.5035      72851.6730 (    0.95%)
 64:      25911.2457      25882.2315 (   -0.11%)
128:      10752.6572      10768.6038 (    0.15%)

netperf results TCP_RR (node 0-1):
            baseline                     patched
  1:      76857.6667      90892.2767 (   18.26%)
  4:      78236.6475      90767.3017 (   16.02%)
  8:      77929.6096      90684.1633 (   16.37%)
 16:      77438.5873      90502.5787 (   16.87%)
 32:      74205.6635      88301.5612 (   19.00%)
 64:      69827.8535      71787.6706 (    2.81%)
128:      25281.4366      25771.3023 (    1.94%)

netperf results UDP_RR (node 0):
            baseline                     patched
  1:      96869.8400     110800.8467 (   14.38%)
  4:      97744.9750     109680.5425 (   12.21%)
  8:      98783.9863     110409.9637 (   11.77%)
 16:      99575.0235     110636.2435 (   11.11%)
 32:      95044.7250      97622.8887 (    2.71%)
 64:      32925.2146      32644.4991 (   -0.85%)
128:      12859.2343      12824.0051 (   -0.27%)

netperf results UDP_RR (node 0-1):
            baseline                     patched
  1:      97202.4733     110190.1200 (   13.36%)
  4:      95954.0558     106245.7258 (   10.73%)
  8:      96277.1958     105206.5304 (    9.27%)
 16:      97692.7810     107927.2125 (   10.48%)
 32:      79999.6702     103550.2999 (   29.44%)
 64:      80592.7413      87284.0856 (    8.30%)
128:      27701.5770      29914.5820 (    7.99%)
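
The percentage column in the tables above is consistent with the relative change of the patched result over the baseline. A minimal sketch of that arithmetic, with `pct_change()` as a hypothetical helper name:

```c
#include <assert.h>

/* Relative change of `patched` over `baseline`, in percent. */
double pct_change(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) / baseline * 100.0;
}
```

For the first tbench row on node 0, (372.4623 - 327.2833) / 327.2833 * 100 is about 13.80, which matches the value printed in the table.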

Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch
in the code has not been tested, but it is supposed to work.

Chen Yu also noticed that this improves the performance of tbench and
netperf on a 24-CPU Jacobsville machine, where the 4 CPUs in one
cluster share an L2 cache.

Intel-SIG: commit 8881e16 sched/fair: Scan cluster before scanning LLC in wake-up path.
Cluster based task wakeup optimization backport.

[https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
… cluster wakeup

ANBZ: #8001

commit 22165f6 upstream.

Chen Yu reports a hackbench regression with cluster wakeup when the
number of hackbench threads equals the number of CPUs [1]. Analysis shows
it's because we wake up more on the target CPU even if the
prev_cpu is a good wakeup candidate, which leads to a decrease
in CPU utilization.

Generally, if the task's prev_cpu is idle we'll wake up the task
on it without scanning. On cluster machines we'll try to wake up
the task in the same cluster as the target for better cache
affinity, so if the prev_cpu is idle but does not share a
cluster with the target we'll still try to find an idle CPU within
the cluster. This improves performance at low loads on
cluster machines. But in the issue above the system is busy, so
the scan is likely to fail and we use the target instead, even
though the prev_cpu is idle. That leads to the regression.

This patch solves this in 2 steps:
o record the prev_cpu/recent_used_cpu if they're good wakeup
  candidates but not sharing the cluster with the target.
o on scanning failure use the prev_cpu/recent_used_cpu if
  they're recorded as idle

[1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/

Intel-SIG: commit 22165f6 Use candidate prev/recent_used CPU if scanning failed for cluster wakeup.
Cluster based task wakeup optimization backport

Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231019033323.54147-4-yangyicong@huawei.com
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
@Avenger-285714 Avenger-285714 requested a review from Copilot June 1, 2025 13:48

sourcery-ai Bot commented Jun 1, 2025

Reviewer's Guide

This PR backports cluster scheduler wakeup optimizations by introducing a new SD_CLUSTER flag with a static key, a cpus_share_resources API, and enhancing the fair scheduler’s wakeup paths (select_idle_cpu and select_idle_sibling) to prioritize CPUs within the same cluster before falling back to a full LLC scan, thus improving task locality and reducing cross-cluster cache contention.

File-Level Changes

Change: Cluster-first wakeup path in fair scheduler
Details:
  • Added a static_branch_unlikely guard to scan CPUs in the same cluster before the LLC in select_idle_cpu
  • Enhanced select_idle_sibling to prefer the previous or recently used CPU when it shares resources with the target
  • Adjusted wakeup loop conditions (nr decrement checks) for correctness
Files: kernel/sched/fair.c

Change: New cpus_share_resources API for cluster-aware resource checks
Details:
  • Implemented cpus_share_resources in core.c to compare sd_share_id values
  • Declared inline cpus_share_resources stub in include/linux/sched/topology.h
  • Replaced cpumask/cache checks with cpus_share_resources in sibling selection paths
Files: kernel/sched/core.c, include/linux/sched/topology.h

Change: SD_CLUSTER scheduling domain flag and static key activation
Details:
  • Defined SD_CLUSTER flag in include/linux/sched/sd_flags.h
  • Added per-CPU sd_share_id and sched_cluster_active static key in sched headers
  • Updated topology.c to assign sd_share_id, track has_cluster, and inc/dec sched_cluster_active
  • Adjusted cpu_cluster_flags to include SD_CLUSTER
Files: include/linux/sched/sd_flags.h, include/linux/sched/topology.h, include/linux/sched/sched.h, kernel/sched/topology.c
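
The sd_share_id assignment listed above can be modelled in userspace. This is a sketch that assumes the ID is the first CPU of the cluster span when a cluster-level domain exists and the first CPU of the LLC span otherwise, as the review summaries in this PR describe. `mask_first()` and `compute_share_id()` are hypothetical names, and spans are plain 32-bit masks rather than cpumasks.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit of a non-empty mask; a stand-in for taking
 * the first CPU of a domain span. */
static int mask_first(uint32_t mask)
{
	int cpu = 0;

	assert(mask != 0);
	while (!(mask & 1u)) {
		mask >>= 1;
		cpu++;
	}
	return cpu;
}

/* cluster_span == 0 means this CPU has no cluster-level domain. */
int compute_share_id(uint32_t llc_span, uint32_t cluster_span)
{
	int id = mask_first(llc_span);		/* default: the LLC's first CPU */

	if (cluster_span)
		id = mask_first(cluster_span);	/* a cluster overrides the LLC */

	return id;
}
```

Two CPUs then share resources exactly when their IDs are equal: CPUs in the same cluster get the same first-CPU value, and on machines without clusters every CPU in the LLC gets the LLC's first CPU.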


@deepin-ci-robot

deepin pr auto review

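Code review comments:

  1. In cpus_share_resources, return true; is too broad a return value; the function should decide from the actual logic whether the two CPUs share resources. If the two CPUs are not in the same cluster, it should return false.

  2. In select_idle_cpu, the line cpumask_andnot(cpus, cpus, sched_group_span(sg)); may change the original value of cpus. Consider whether this is necessary or whether there is a better way to handle this case.

  3. In select_idle_sibling, the initial values of prev_aff and recent_used_cpu should be adjusted as needed so that they correctly represent "no suitable CPU found" in all cases.

  4. In select_idle_sibling, the order of the two checks if ((unsigned int)prev_aff < nr_cpumask_bits) and if ((unsigned int)recent_used_cpu < nr_cpumask_bits) may affect the final return value; make sure their logic is correct.

  5. In build_sched_domains, has_cluster should be initialized to false so that the sched_cluster_active static key is not incremented by mistake when no cluster exists.

  6. In build_sched_domains, static_branch_inc_cpuslocked(&sched_cluster_active); should run only when has_cluster is true, to avoid unnecessary static key increments.

  7. In detach_destroy_domains, static_branch_dec_cpuslocked(&sched_cluster_active); should run only when static_branch_unlikely(&sched_cluster_active) is true, to avoid unnecessary static key decrements.

  8. In update_top_cache_domain, the comment on per_cpu(sd_share_id, cpu) = id; should explain in more detail why sd_share_id is set to the cluster ID rather than the LLC ID.

  9. In update_top_cache_domain, if (lowest_flag_domain(i, SD_CLUSTER)) should run only when sd is not NULL, to avoid a potential null pointer dereference.

  10. In update_top_cache_domain, id = cpumask_first(sched_domain_span(sd)); should run only when sd is not NULL, to avoid a potential null pointer dereference.

Overall, the changes should consider the logic and edge cases more carefully to ensure the code is correct and robust.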
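
The bound checks quoted in item 4 rely on a common C idiom: casting a signed CPU number to unsigned int turns the -1 sentinel into a very large value, so a single comparison rejects both "unset" and "out of range". A minimal sketch, where `NR_CPUMASK_BITS`, `cpu_candidate_valid()` and `fallback_cpu()` are hypothetical stand-ins and the check order follows the order quoted in the review:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUMASK_BITS 64u	/* hypothetical stand-in for nr_cpumask_bits */

/* -1 converts to UINT_MAX, which can never be below the bound, so one
 * unsigned comparison covers both the sentinel and too-large values. */
bool cpu_candidate_valid(int cpu)
{
	return (unsigned int)cpu < NR_CPUMASK_BITS;
}

/* Pick the first valid recorded candidate, else fall back to the target. */
int fallback_cpu(int prev_aff, int recent_used_cpu, int target)
{
	assert(cpu_candidate_valid(target));

	if (cpu_candidate_valid(prev_aff))
		return prev_aff;
	if (cpu_candidate_valid(recent_used_cpu))
		return recent_used_cpu;
	return target;
}
```

Under this idiom, -1 is already a well-defined "no candidate" value, and the order of the two checks only decides which candidate wins when both were recorded.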


Copilot AI left a comment


Pull Request Overview

This PR backports wake‐up path optimizations for cluster schedulers on Intel platforms to improve inter-CPU communication by preferentially waking up CPUs within the same cluster.

  • Introduces a new per-CPU variable (sd_share_id) and static key (sched_cluster_active) to manage cluster-related scheduling data.
  • Adjusts idle CPU selection in the fair scheduler to prefer CPUs within the same cluster and updates related header declarations and flag definitions.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| kernel/sched/topology.c | Adds per-CPU sd_share_id and initializes it based on cluster or LLC id. |
| kernel/sched/sched.h | Declares the new per-CPU variable and extern for sched_cluster_active. |
| kernel/sched/fair.c | Updates idle CPU selection to account for cluster boundaries. |
| kernel/sched/core.c | Implements cpus_share_resources using the new sd_share_id. |
| include/linux/sched/topology.h | Provides an inline cpus_share_resources that currently returns true. |
| include/linux/sched/sd_flags.h | Introduces the SD_CLUSTER flag for indicating cluster sharing. |

Comments suppressed due to low confidence (2)

include/linux/sched/topology.h:236

  • The inline implementation of cpus_share_resources always returning true may conflict with the proper implementation in kernel/sched/core.c. Consider removing or updating this inline definition to ensure consistent behavior.
static inline bool cpus_share_resources(int this_cpu, int that_cpu) {

kernel/sched/fair.c:7573

  • [nitpick] The use of -1 as an initial value for prev_aff may be less clear; consider using a named constant (e.g., INVALID_CPU) to improve code readability.
int i, recent_used_cpu, prev_aff = -1;


@sourcery-ai sourcery-ai Bot left a comment


Hey @Avenger-285714 - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


Comment thread kernel/sched/fair.c
  struct sched_domain *sd;
  unsigned long task_util, util_min, util_max;
- int i, recent_used_cpu;
+ int i, recent_used_cpu, prev_aff = -1;

suggestion: Initialize recent_used_cpu to a safe default

Explicitly initialize recent_used_cpu to -1 at declaration to clarify its default state and prevent accidental use before assignment.

Suggested change
- int i, recent_used_cpu, prev_aff = -1;
+ int i, recent_used_cpu = -1, prev_aff = -1;

@opsiff opsiff merged commit 917d468 into deepin-community:linux-6.6.y Jun 1, 2025
6 of 7 checks passed
@deepin-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opsiff

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Avenger-285714 Avenger-285714 deleted the intel-cluster-scheduler branch June 1, 2025 18:27

4 participants